Feature Engineering Explained


What is feature engineering

(Figure: the data science pipeline workflow)

Feature engineering involves preparing a given dataset such that it is suitable to be fed into a machine learning model. While there are many variations of the data science pipeline depending on the requirements, feature engineering usually sits between the exploratory data analysis step and the modelling step. Ideally, the processed data has been sufficiently analyzed to give the analyst an idea of the transformations to be applied such that:

  • The machine learning model is able to ingest the resulting dataset
  • The model fits the dataset reasonably well
  • The resulting dataset remains representative of real world phenomena
The word "ideally" is used here because, in reality, there is a significant amount of trial and error involved in the process, and the analyst will find themselves iterating through the processing, feature engineering and modelling steps of the pipeline, as shown in the workflow above.

Why do feature engineering

The main reasons ultimately lead back to the "why" of feature engineering and can be summarized as follows:

  • To get data into the "right" format
  • To ensure that the data is representative of real world phenomena
  • To "help" the model perform better

The first two are fairly straightforward and usually require a more fixed set of steps to achieve. The last one, however, is a little more involved, and it is where most of the discussion will revolve. At this point, we've theoretically gotten the data into a usable state, where (for example):

  • Missing values have been imputed
  • Duplicated values have been removed
  • Column formats have been configured correctly
  • Naming conventions have been aligned and typos fixed
  • Outliers have been analyzed and removed
  • and many others... (a minimal code sketch of such steps follows this list)
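
As a rough, minimal sketch of what a few of those cleaning steps can look like (pandas assumed; the columns and values below are made up purely for illustration, and the right steps always depend on the dataset at hand):

```python
import pandas as pd

# Hypothetical raw dataset with the usual problems
df = pd.DataFrame({
    "weight_kg": [9.0, 10.0, None, 10.0, 250.0],   # missing value + outlier
    "price": ["9c", "10c", "11c", "10c", "12c"],    # wrong dtype (string, not number)
})

df = df.drop_duplicates()                                           # remove duplicated rows
df["price"] = df["price"].str.rstrip("c").astype(float)            # fix the column format
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].median()) # impute missing values
df = df[df["weight_kg"] < df["weight_kg"].quantile(0.99)]          # crude outlier removal
```
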
Our dataset may still not be in the optimal condition for modelling, but what does this mean? Explore the graphing bowl WIDGET here. Note that in reality, one would not be able to easily visualize the feature set's relationship with the response, as most datasets span more than 2 features, or in other words more than 3 dimensions, which would be rather troublesome to graph meaningfully. In essence, we're trying to stretch, twist, rotate and transform our dataset into a form that makes it easier to fit whatever kind of "line" we're trying to impose on it.
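
As a small, concrete illustration of that "stretching" idea (a minimal sketch on synthetic data, not the widget itself): if the response grows with the logarithm of a feature, a straight line fits the raw feature poorly, but the same linear model fits very well once the feature axis has been transformed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=500)
y = np.log(x) + rng.normal(scale=0.1, size=500)   # the response grows with log(x)

X_raw = x.reshape(-1, 1)
X_log = np.log(x).reshape(-1, 1)                  # "stretching" the feature axis

print(LinearRegression().fit(X_raw, y).score(X_raw, y))  # R^2 on the untransformed feature
print(LinearRegression().fit(X_log, y).score(X_log, y))  # noticeably higher R^2 after the transform
```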

Feature Engineering Techniques

Scaling

Scaling data is always a good habit in general. However, one must understand why we scale and how to choose between the different scaling methods to get the best performance out of a selected model. In very simple terms, we scale for two main reasons:

  • We do not want a change in the value of a feature to impact model results drastically just because of its large magnitude
  • We want our features to follow some kind of distribution

We essentially need to scale data for any algorithm that involves a distance calculation somewhere in its steps. A machine learning algorithm does not understand the concept of units of measure when trying to learn: a value of 10 kilograms means exactly the same thing numerically as 10 cents when both are put into the same dataframe before training. Now, imagine our model is in the midst of training, and it is calculating the error between a predicted 9c and the actual 10c in the response, a difference caused by a change in weight from 9kg to 10kg. We can infer that (keeping all else constant) a change in X of 1 causes a change in Y of 1. If we had recorded our weights in grams instead of kilograms, this would drastically change the statement we are making: we would now be saying that an increase in X of 1000 causes an increase in Y of 1. If we were to change X by 1 now, it would seem a lot less effective than in the first setting, wouldn't it? However, we as humans know that this is not the case, and we can communicate it through the scaling of features in our dataset. Also, which is larger, 10kg or 9c? The more important question is: who cares? Scale the data and that problem goes away. We ultimately want to get to a point where we can make statements like "feature \(X_1\) is more important than \(X_2\)" and have them make sense.
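To make the kilograms-versus-grams point concrete, here is a minimal sketch (synthetic data, scikit-learn assumed): the same underlying relationship produces a coefficient of about 1 when weight is in kilograms and about 0.001 when it is in grams, while standardizing both versions puts them back on a comparable scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
weight_kg = rng.uniform(5, 15, size=200)
price_cents = weight_kg + rng.normal(scale=0.1, size=200)  # +1 kg -> +1 cent, by construction

for X in (weight_kg.reshape(-1, 1),               # weights in kilograms
          (weight_kg * 1000).reshape(-1, 1)):     # the same weights in grams
    raw_coef = LinearRegression().fit(X, price_cents).coef_[0]
    scaled_coef = LinearRegression().fit(
        StandardScaler().fit_transform(X), price_cents).coef_[0]
    print(f"raw coefficient: {raw_coef:.4f}, after standardizing: {scaled_coef:.4f}")
```
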
Algorithms that rely on scaling
  • K-NN and K-Means, for normalizing Euclidean distance calculations
  • PCA, for normalizing variance calculations
  • Neural networks, for faster convergence
  • Regularization, where penalties depend on coefficient magnitudes

Algorithms that do not rely on scaling, but on rules
  • Tree-based models
  • Linear regression using the normal equation

Algorithms that have built-in functionality to handle scaling
  • Naive Bayes (the maths is sketched briefly after this list)
  • Linear Discriminant Analysis
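
A brief sketch of the maths for the Gaussian Naive Bayes case (the argument for Linear Discriminant Analysis is analogous, through its mean and covariance estimates). Each feature's class-conditional likelihood is estimated separately per feature \(j\) and per class \(c\):

\[ p(x_j \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{jc}^2}} \exp\!\left(-\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2}\right) \]

If feature \(j\) is rescaled by a constant \(a\) (grams instead of kilograms, say), the fitted parameters simply become \(a\mu_{jc}\) and \(a\sigma_{jc}\), and every class's likelihood for that feature is multiplied by the same factor \(1/a\). That common factor cancels when the class posteriors are compared, so the predictions are unchanged.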

Popular Scaling Techniques
  • Standard Scaler
  • Min Max Scaler
  • Log Scaler

Other Scaling Techniques
  • Max Absolute Scaler
  • Robust Scaler
  • Quantile Transformer
  • Unit Vector Scaler

INSERT WIDGET ON SCALERS HERE
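
In lieu of the widget for now, here is a minimal sketch comparing several of these scalers on the same skewed synthetic column (scikit-learn's implementations assumed; the log "scaler" is just a plain log transform):

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   QuantileTransformer, RobustScaler,
                                   StandardScaler)

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))   # skewed, strictly positive feature

scalers = {
    "standard": StandardScaler(),                     # zero mean, unit variance
    "min-max": MinMaxScaler(),                         # squeezed into [0, 1]
    "max-abs": MaxAbsScaler(),                         # divided by the largest absolute value
    "robust": RobustScaler(),                          # centred on the median, scaled by the IQR
    "quantile": QuantileTransformer(output_distribution="normal"),  # mapped onto a normal shape
}

print(f"     log: min={np.log(x).min():.2f}, max={np.log(x).max():.2f}")  # the "log scaler"
for name, scaler in scalers.items():
    scaled = scaler.fit_transform(x)
    print(f"{name:>8}: min={scaled.min():.2f}, max={scaled.max():.2f}")
```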

Variable Encoding

Machine learning models can only ingest numerical values, so categorical features have to be encoded into numbers before training.
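
As a minimal sketch of the two most common encodings (pandas assumed; the `fruit` column is made up purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "banana", "apple", "cherry"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["fruit"], prefix="fruit")

# Ordinal encoding: one integer per category (only meaningful if the categories have an order)
df["fruit_code"] = df["fruit"].astype("category").cat.codes

print(one_hot)
print(df)
```
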
Binning
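
Binning groups a continuous feature into a small number of discrete intervals. A minimal sketch with pandas (the boundaries and labels here are arbitrary):

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 41, 68, 90])

# Fixed bins with explicit edges and labels
age_group = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                   labels=["child", "young adult", "adult", "senior"])

# Quantile-based bins: each bin holds roughly the same number of rows
age_quartile = pd.qcut(ages, q=4, labels=False)

print(age_group)
print(age_quartile)
```
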
Higher order features + interactions

Creating higher-order features or multiplying feature values together may seem unintuitive or make little sense at first, but this is best explained with an interactive widget. If we wanted to build a model to separate the orange and blue points, sure, we could use a more complex non-linear model, but before jumping ahead, let's see what happens to the shape of our data when we add a 3rd feature column to it. Try the different options and see which techniques end up being useful, allowing us to separate the 2 categories even with a linear model.
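
In the spirit of that widget, here is a minimal sketch (synthetic "rings" from scikit-learn's make_circles): the two classes cannot be split by a line in the original two features, but adding the squared radius \(x_1^2 + x_2^2\) as a third feature makes them linearly separable.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

linear_only = LogisticRegression().fit(X, y)
print(linear_only.score(X, y))                    # roughly 0.5: a line cannot separate the rings

X3 = np.column_stack([X, (X ** 2).sum(axis=1)])   # add the squared radius as a 3rd feature
with_radius = LogisticRegression().fit(X3, y)
print(with_radius.score(X3, y))                   # close to 1.0 with the same linear model
```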