As a quick reminder, the flow in StatwolfML is driven by two files, input.json and test_data.json, which are used by the train and test functions, respectively. The structure is shown in Figure 1.
Figure 1: Flow structure of StatwolfML.
The input.json file consists of the following sections:
And test_data.json:
The input.json file we used for the preprocessing has the following structure:
{
  "acquire": {...},
  "preprocessing": [...],
  "models": [...],
  "models_block": [...]
}
We have already seen how to add the pre-process functions to the flow, and now we are focusing on the models section. Here, a user can list various models from the package. The general structure is given by:
"models": [{
  "name": "...",
  "target_name": "...",
  "feature_names": "..."
},{
  "name": "...",
  "target_name": "...",
  "feature_names": "...",
  "options": {...}
}]
The parameters for each model are the following:
models_block lets a user run a list of models on the same set of features and the same target. Its structure is as follows:
"models_block": [{
  "feature_names": [...],
  "target_name": ...,
  "accuracy_metrics": [...],
  "models": [...]
}]
Hence we can list various models in one block. A user can also define a list of evaluation metrics.
For the Titanic dataset, we will consider three different models: Logistic Regression, Support Vector Machine Classifier (SVC), and Decision Tree Classifier.
"models_block": [{
  "target_name": "Survived",
  "accuracy_metrics": ["f1_score"],
  "models": [{
    "name": "logistic_regression",
    "label": "Logistic Regression"
  },
  {
    "name": "svc",
    "label": "SVC"
  },
  {
    "name": "decision_tree_classifier",
    "label": "Decision Tree Classifier"
  }]
}]
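As a quick sanity check before running the flow, the block above can be built and inspected in plain Python. This is only a minimal sketch using the standard json module: the check_models_block helper is hypothetical and not part of StatwolfML, and the keys it requires are an assumption based on the examples in this post, not an official schema. It also checks that model labels are unique, since the labels are later used as column names in the saved output.

```python
import json

# The Titanic models_block from this post, written as a Python structure.
models_block = [{
    "target_name": "Survived",
    "accuracy_metrics": ["f1_score"],
    "models": [
        {"name": "logistic_regression", "label": "Logistic Regression"},
        {"name": "svc", "label": "SVC"},
        {"name": "decision_tree_classifier", "label": "Decision Tree Classifier"},
    ],
}]


def check_models_block(blocks):
    """Hypothetical helper: return a list of problems found in a models_block.

    The required keys are an assumption based on the examples in this post,
    not an official StatwolfML schema.
    """
    problems = []
    for i, block in enumerate(blocks):
        for key in ("target_name", "models"):
            if key not in block:
                problems.append(f"block {i}: missing '{key}'")
        models = block.get("models", [])
        labels = [m.get("label") for m in models]
        if len(labels) != len(set(labels)):
            problems.append(f"block {i}: duplicate model labels")
        for j, model in enumerate(models):
            if "name" not in model:
                problems.append(f"block {i}, model {j}: missing 'name'")
    return problems


problems = check_models_block(models_block)
# Serialise to the JSON text that would go into input.json.
models_block_json = json.dumps({"models_block": models_block}, indent=2)
```

An empty problems list means the block has the shape used throughout this post; it does not guarantee that StatwolfML itself will accept it.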
Since we dropped unnecessary columns earlier in the pre-process section, we do not need to specify the feature names - we will use the entire dataset.
The test_data.json has the following structure:
{
  "acquire": {
    "source": "filesystem",
    "type": "csv",
    "dataset_name": "titanic",
    "address": "example/titanic/test.csv",
    "index_col": 0
  },
  "save": []
}
There are three main parts:
Let us briefly explain the recall and precision metrics, and why they are important. Imagine we have a classification problem: we train a classifier on a dataset and then test it on unseen data. Each test sample then falls into one of four categories:
Category | Label | Prediction |
---|---|---|
True Positive | 1 | 1 |
False Positive | 0 | 1 |
False Negative | 1 | 0 |
True Negative | 0 | 0 |
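The four categories can be counted directly from a list of labels and a list of predictions. The sketch below assumes binary labels with 1 as the positive class and 0 as the negative class, so that a true negative is the pair (0, 0). The confusion_counts function and the sample data are made up for illustration and are not part of StatwolfML.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count TP, FP, FN and TN for binary labels (1 = positive, 0 = negative)."""
    # Map each (label, prediction) pair to its category name.
    names = {(1, 1): "TP", (0, 1): "FP", (1, 0): "FN", (0, 0): "TN"}
    counts = Counter({"TP": 0, "FP": 0, "FN": 0, "TN": 0})
    for label, prediction in zip(labels, predictions):
        counts[names[(label, prediction)]] += 1
    return dict(counts)


# Made-up labels and predictions, for illustration only.
labels = [1, 0, 1, 0, 1, 0]
predictions = [1, 1, 0, 0, 1, 0]
counts = confusion_counts(labels, predictions)
```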
There are three different metrics available to evaluate your classifier:
Accuracy measures the fraction of all samples that the algorithm classifies correctly.
In the Titanic use case, accuracy is the number of passengers correctly predicted as survived or not survived, divided by the total number of samples.
Precision measures how many of the samples predicted as positive (TP+FP) are actually positive.
In our Titanic use case, precision tells you how many of the predicted survivors actually survived.
Recall measures how many of the samples that are actually positive (TP+FN) are predicted as positive.
In the Titanic case, recall tells you how many of the actual survivors were identified correctly by the model.
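In terms of the TP, FP, FN and TN counts, the three definitions can be written as small Python functions. This is a minimal sketch, not StatwolfML code; returning 0.0 when a denominator is zero is a convention chosen here for convenience, and the example counts are made up.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all samples that were classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were predicted as positive."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Made-up counts, for illustration only.
tp, fp, fn, tn = 30, 10, 20, 40
acc = accuracy(tp, fp, fn, tn)
prec = precision(tp, fp)
rec = recall(tp, fn)
```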
Why do we need to define three different metrics when we could just rely on accuracy? Well, imagine we are dealing with skewed data, where the number of negative values is much higher than the number of positive ones:
Figure 2: Example of an unbalanced dataset with two different label values.
In Figure 2 we have:
If the classifier predicts most of the samples to be blue, it will have good accuracy:
Figure 3: Resulting classifier with high accuracy, where most of the pink points are predicted as blue.
In Figure 3, we see an example of such a classifier. Let's compute all of the components for the metrics:
And the metrics are:
Accuracy gets a high value, while precision is low. This means that accuracy alone can be misleading when we talk about the performance of an algorithm.
That is why we also look at precision and recall, which are not dominated by the majority class. A good algorithm has both high recall and high precision.
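The effect is easy to reproduce with a few lines of Python. The numbers below are hypothetical and are not the ones from the figures: we assume 10 positive (pink) and 90 negative (blue) samples, and a classifier that flags only 4 samples as positive, 1 of them correctly.

```python
# Hypothetical skewed dataset: 10 positive ("pink") and 90 negative ("blue") samples.
labels = [1] * 10 + [0] * 90
# Hypothetical classifier output: 1 true positive, 9 missed positives,
# 3 false alarms, and 87 correctly rejected negatives.
predictions = [1] + [0] * 9 + [1] * 3 + [0] * 87

pairs = list(zip(labels, predictions))
tp = sum(1 for y, p in pairs if y == 1 and p == 1)
fp = sum(1 for y, p in pairs if y == 0 and p == 1)
fn = sum(1 for y, p in pairs if y == 1 and p == 0)
tn = sum(1 for y, p in pairs if y == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With these made-up numbers the classifier is right for 88 of 100 samples, yet it finds only 1 of the 10 positives, so the accuracy looks good while precision and recall are both poor.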
The metric that combines the two is called the F1 score, and it is given by:
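For reference, the standard definition of the F1 score is the harmonic mean of precision and recall. Written with the TP, FP and FN counts introduced above (the symbols P and R for precision and recall are notation chosen here):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \, P \, R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
```

Because it is a harmonic mean, the F1 score is high only when both precision and recall are high; if either one is close to zero, so is F1.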
Going back to our example, a good classifier will give something like this:
Figure 4: Example of a good classifier with high precision and high recall.
Let us use this metric for the Titanic problem:
"accuracy_metrics": ["f1_score"]
In the save section, we add the following:
"save": [{
  "path": "temp/titanic_final.csv",
  "format": "csv",
  "outcome": "filesystem"
}, ... ]
This declaration saves the dataset table with additional columns, named after the model labels, that contain the predicted values. The last thing we want to put in the save section is the plots:
"save": [ ... ,
{
  "outcome": "filesystem",
  "format": "json",
  "path": "temp/log_reg.json",
  "transform": "plotly",
  "figure": [{
    "chartType": "histogram",
    "model": "Logistic Regression"
  }]
},
{
  "outcome": "filesystem",
  "format": "json",
  "path": "temp/dtc.json",
  "transform": "plotly",
  "figure": [{
    "chartType": "histogram",
    "model": "Decision Tree Classifier"
  }]
},
{
  "outcome": "filesystem",
  "format": "json",
  "path": "temp/svc.json",
  "transform": "plotly",
  "figure": [{
    "chartType": "histogram",
    "model": "SVC"
  }]
}]
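The three plot entries differ only in their path and model label, so the save section can also be generated rather than typed by hand. The sketch below uses only the standard json module; histogram_entry is a hypothetical helper written for this post, not a StatwolfML function, and the keys simply mirror the entries shown above.

```python
import json


def histogram_entry(model_label, path):
    """Hypothetical helper: build one plot entry for the save section."""
    return {
        "outcome": "filesystem",
        "format": "json",
        "path": path,
        "transform": "plotly",
        "figure": [{"chartType": "histogram", "model": model_label}],
    }


# Model labels and output paths as used in this post.
plots = {
    "Logistic Regression": "temp/log_reg.json",
    "Decision Tree Classifier": "temp/dtc.json",
    "SVC": "temp/svc.json",
}

# Start with the CSV export, then add one histogram entry per model.
save_section = [
    {"path": "temp/titanic_final.csv", "format": "csv", "outcome": "filesystem"}
]
save_section += [histogram_entry(label, path) for label, path in plots.items()]

# JSON text that could be pasted into test_data.json.
save_json = json.dumps({"save": save_section}, indent=2)
```

Each model value must match the label given to that model in models_block, since that is how the entries above refer to the models.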
In the output file, we can find the score for each model and compare the performance of the algorithms. The results show that the SVC model has the highest F1 score (≈ 0.76, see Figure 5).
Figure 5: The results of three binary classifiers: (a) Logistic Regression, f1 ≈ 0.71, (b) Decision Tree Classifier, f1 ≈ 0.74 and (c) Support Vector Machine Classifier, f1 ≈ 0.76.
We just concluded our first journey in StatwolfML, and we’ve seen:
We tried to give you an overview of just how important it is for a data scientist to have a tool with:
We hope you enjoyed this introductory series of posts about StatwolfML, and we also hope to see you in the next series on Time Series analysis and forecasting!