In this section, we will work on the pre-processing step. It is straightforward with StatwolfML, since it has a very intuitive syntax and most of the function names match those of the pandas and scikit-learn Python libraries.
Let’s start with the NaNs.
From the exploratory analysis, we know that the “Age,” “Cabin,” and “Embarked” columns contain null values. We won’t be using “Cabin” in our analysis due to its very large number of nulls, so we can leave it as it is for now.
Note that “Embarked” is a categorical variable, while “Age” is a continuous one.
There are different strategies for handling the NaN values in a column. The simplest is to remove the rows containing NaNs; another is to impute the missing values. In StatwolfML, there are several ways to do that:
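The StatwolfML snippet is not reproduced here, but since the text notes that its function names largely match pandas, the two strategies can be sketched in pandas as follows (the data frame below is a made-up stand-in for the Titanic data, not the real dataset):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Titanic data; values are illustrative only.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Embarked": ["S", "S", np.nan, "C"],
})

# Strategy 1: drop every row that contains at least one NaN.
dropped = df.dropna()

# Strategy 2: impute, e.g. fill a continuous column with its median.
imputed = df.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
```

Dropping rows is simple but can discard a lot of data; imputation keeps every row at the cost of introducing estimated values.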
As shown in the exploration part, the “Embarked” variable takes three values: S, Q, and C, with S being the most frequent. We can impute the NaN values for this column as follows:
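The original StatwolfML call is not shown in this chunk; a hedged pandas equivalent of most-frequent-value imputation (again on toy data, not the real passenger list) looks like this:

```python
import pandas as pd
import numpy as np

# Toy "Embarked" column with missing entries; S is the most frequent value.
df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() returns the most frequent value(s); take the first one.
most_frequent = df["Embarked"].mode()[0]

# Replace the NaNs with the most frequent category.
df["Embarked"] = df["Embarked"].fillna(most_frequent)
```

For a categorical variable, the mode is the usual imputation choice, since a mean or median is not defined.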
StatwolfML also lets us perform simple feature engineering. In particular, one can create new columns by combining existing ones. Let’s see how it works.
First, we create a “Family” column, which is the combination of “Parch” and “SibSp.” Recall that these two columns describe family relations: “Parch” is the parent–child column and “SibSp” is the sibling-and-spouse column. We sum them to obtain a single feature describing familial relations:
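The equivalent operation in pandas (a sketch on made-up rows, since the StatwolfML configuration itself is described in the next paragraph) is a plain column sum:

```python
import pandas as pd

# Toy rows; in the real data there is one row per passenger.
df = pd.DataFrame({"Parch": [0, 1, 2], "SibSp": [1, 0, 3]})

# New feature: total number of family members travelling with the passenger.
df["Family"] = df["Parch"] + df["SibSp"]
```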
Basically, the user specifies the name of the new column (“new col”) and the operations to apply (“value”).
The final two steps for the Titanic problem are to encode the new “Family” column, in a similar manner as we did for “Age” and “Fare,” and then to remove the columns we no longer need.
So, for “Family,” we want to set the value to 1 if a passenger has a family member on board and to 0 in all other cases:
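Both final steps can be sketched in pandas, hedged as before (the column names come from the text; the rows are illustrative, and which columns get dropped is an assumption based on the analysis so far):

```python
import pandas as pd

# Toy frame with the raw family columns and the unused "Cabin" column.
df = pd.DataFrame({
    "Parch": [0, 1, 0],
    "SibSp": [0, 0, 2],
    "Cabin": ["C85", None, None],
})
df["Family"] = df["Parch"] + df["SibSp"]

# Encode: 1 if the passenger has any family member on board, 0 otherwise.
df["Family"] = (df["Family"] > 0).astype(int)

# Remove the columns we no longer need for the next steps.
df = df.drop(columns=["Parch", "SibSp", "Cabin"])
```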