Saturday, September 14, 2019

Machine Learning Model Pipeline: Feature Engineering

Feature engineering includes the following tasks: imputing missing data, encoding categorical variables, transforming variable distributions, handling outliers, and scaling feature magnitudes.



Why do we need to engineer our features?

There are several kinds of problems in the data we have collected so far. The first is the absence of values for certain observations (usually rows) within a variable (usually a column). The second is the presence of labels in categorical variables, meaning the values of the variable are strings rather than numbers, and we cannot use them as such in an ML model. The third consideration is the distribution of the numerical variables: specifically, whether they follow a normal/Gaussian distribution or are rather skewed. For some algorithms the presence of outliers is also detrimental. Outliers are values that are extremely high or extremely low compared to the majority of the other values of the same variable.

Missing Data

Missing data means absent values for certain observations within a variable. This affects all machine learning models in scikit-learn, which cannot handle missing values directly.

There are a variety of reasons why a value can be missing: the value may have been lost or not stored properly, the value may simply not exist, or the data may come from a survey where a person declined to answer certain questions. So we need to be prepared to fill in those values with sensible numbers; one common approach is sketched below.
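A minimal sketch of filling in missing values with scikit-learn's SimpleImputer; the DataFrame, column names, and the mean strategy here are illustrative assumptions, not the only option.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: one missing value per column
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31],
    "income": [50000, 64000, np.nan, 42000],
})

# Replace each missing numerical value with the column mean
imputer = SimpleImputer(strategy="mean")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
print(df)  # no NaNs remain
```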


Labels in categorical variables

The problem comes in three flavours.

Cardinality: a high number of labels/categories that a variable can take. Variables with a large number of categories tend to dominate over variables with fewer categories when building tree-based ML models, and tree-based models tend to overfit on high-cardinality categorical variables.

Rare labels: infrequent categories. They present an operational problem precisely because they are rare: some of them will appear only in the training set and some only in the test set, so the model will not know what to do with labels it sees only at test time. This step is therefore important for tackling unseen values before feeding the data to the ML model.

Categories: strings. We need to encode the string values as numbers before we feed the data to the model; a sketch covering rare labels, unseen labels, and encoding follows below.
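To make these flavours concrete, here is a minimal sketch with a hypothetical "city" column and an arbitrary 30% frequency threshold: infrequent labels are grouped into a "Rare" bucket, and one-hot encoding with handle_unknown="ignore" keeps the model safe from labels that appear only in the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"city": ["London", "London", "Paris", "Oslo"]})
test  = pd.DataFrame({"city": ["Paris", "Madrid"]})  # "Madrid" never seen in training

# Group labels seen in fewer than 30% of training rows into "Rare"
freq = train["city"].value_counts(normalize=True)
rare = freq[freq < 0.30].index
train["city"] = train["city"].where(~train["city"].isin(rare), "Rare")
test["city"] = test["city"].where(~test["city"].isin(rare), "Rare")

# One-hot encode; unseen labels ("Madrid") become all-zero rows instead of errors
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["city"]])
print(encoder.transform(test[["city"]]).toarray())
```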


Distributions


For numerical variables, we consider the distribution of their values.

Linear model assumption: variables follow a Gaussian distribution. So if the variables in our linear model are not Gaussian, we may choose to apply a transformation, as sketched below. Models like SVMs and neural networks do not make any distributional assumptions about the variables; however, a better spread of values still tends to benefit the performance of these algorithms.
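As a minimal sketch with synthetic data: a log transform pulls a right-skewed variable much closer to a Gaussian shape (the log of a lognormal variable is exactly Gaussian).

```python
import numpy as np
import pandas as pd

np.random.seed(0)
skewed = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed

transformed = np.log(skewed)  # log of a lognormal is exactly Gaussian

print("skew before:", round(pd.Series(skewed).skew(), 2))      # strongly positive
print("skew after: ", round(pd.Series(transformed).skew(), 2))  # close to 0
```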


Outliers

[Figure: a linear regression fit where the line is pulled away from the majority of the points by a single outlier]
Outliers affect certain ML models, certainly linear regression. In the diagram above, the fitted line deviates from the majority of the points because of a single outlier in the dataset. Other algorithms such as AdaBoost are also sensitive to outliers, because the algorithm puts tremendous weight on mis-predicted observations, outliers included, when trying to correct the errors of previous iterations. This tends to cause overfitting and poor generalisation. One common way to flag and cap outliers is sketched below.
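A minimal sketch of the interquartile-range (IQR) rule for flagging and capping outliers; the data and the conventional 1.5 multiplier are illustrative choices.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])      # flag values outside the fences
print(s.clip(lower=lower, upper=upper))  # or cap them at the fences
```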

Feature Magnitude - Scale

The magnitude of the variables also affects model performance.

For example, if a length is measured in metres and we change it to kilometres, the coefficient that multiplies that variable in a linear model changes accordingly.

As another example, if we are trying to predict house price, one variable is the area, measured in tens of square metres, while another is the number of rooms, which varies from 1 to 10. In a linear model, the variable with the larger values will play the predominant role in the predicted price, which may not be correct most of the time. This is why we normalise, or scale, the features, as in the sketch below.
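A minimal sketch of standardisation with scikit-learn's StandardScaler; the "area" and "rooms" columns echo the hypothetical house-price example above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "area":  [85, 120, 60, 200],  # square metres
    "rooms": [3, 4, 2, 6],        # counts from 1 to 10
})

# After scaling, each column has mean 0 and unit variance,
# so neither variable dominates purely through its magnitude
scaled = StandardScaler().fit_transform(df)
print(pd.DataFrame(scaled, columns=df.columns))
```

MinMaxScaler is a common alternative when the values should instead be squeezed into a fixed range such as 0 to 1.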
