“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.” - Prof. Andrew Ng
Data scientists are often said to spend as much as 80% of their time on feature engineering, because it is a time-consuming and difficult process. Understanding what features are, and the techniques used to craft them, can make that process easier. So, let's get started.
Feature engineering is the ‘art’ of formulating useful features from existing data, in line with the target to be learned and the machine learning model used. It involves transforming data into forms that relate better to the underlying target. When done right, feature engineering can augment the value of your existing data and improve the performance of your machine learning models. On the other hand, poor features may force you to build much more complex models to achieve the same level of performance.
This is why feature engineering has earned its place as an indispensable step in the machine learning pipeline. Yet when it comes to applying it, there is no hard-and-fast method or theoretical framework, which is why it remains a concept that eludes many.
This article tries to demystify this subtle art, show why it matters despite its nuances, and get you started with a fun feature engineering example in Python that you can follow along with!
To understand what feature engineering is at an intuitive level, and why it is indispensable, it helps to consider how humans comprehend data. Humans have an ability, leaps ahead of a machine's, to find complex patterns or relations, so much so that we see them even when they don’t actually exist. Yet even for us, data presented well can mean a lot more than data presented haphazardly. If you haven’t experienced this already, let’s try to drive it home with a ‘sweet’ feature engineering example!
Say you have been provided the following data about candy orders:
You have also been told that the customers are uncompromising candy-lovers who consider their candy preference far more important than the price or even the dimensions (in other words, price and dimensions are essentially uncorrelated with candy sales). What would you do if you were asked to predict which kind of candy is likely to sell the most on a particular day?
You would probably agree that the variety of candy ordered depends more on the date than on the time of day it was ordered, and that the sales of a particular variety vary with the season.
Now that you instinctively know which features are most likely to contribute to your predictions, let's present the data better by creating a new feature, Date, from the existing feature Date and Time.
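A minimal sketch of this step, assuming the orders sit in a pandas DataFrame. The column names and the three sample rows are hypothetical stand-ins, not the article's actual data:

```python
import pandas as pd

# Hypothetical candy orders; names and values are illustrative only.
orders = pd.DataFrame({
    "Date and Time": [
        "2021-10-29 10:15:00",
        "2021-10-30 18:40:00",
        "2021-10-31 09:05:00",
    ],
    "Candy Variety": ["Chocolate Hearts", "Sour Jelly", "Sour Jelly"],
})

# Parse the text into real timestamps, then keep only the calendar date.
orders["Date and Time"] = pd.to_datetime(orders["Date and Time"])
orders["Date"] = orders["Date and Time"].dt.date

print(orders[["Date", "Candy Variety"]])
```

Dropping the time of day leaves one value per calendar day, which makes it easier to group orders by date.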
The resulting table should make it at least a tad simpler to predict that Sour Jellies are most likely to sell, especially around the end of October (Halloween!), given the very same input data.
In addition, if you wanted to know more about weekend and weekday sales trends, you could categorize the days of the week in a feature called Weekend, with 1 = True and 0 = False.
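One possible way to derive such a flag with pandas is sketched below. The sample dates are made up for illustration, and treating Saturday and Sunday as the weekend is an assumption:

```python
import pandas as pd

# Hypothetical order dates (a Friday through a Monday).
orders = pd.DataFrame({
    "Date": pd.to_datetime(["2021-10-29", "2021-10-30", "2021-10-31", "2021-11-01"]),
})

# dayofweek runs from 0 (Monday) to 6 (Sunday), so 5 and 6 mark the weekend.
orders["Weekend"] = (orders["Date"].dt.dayofweek >= 5).astype(int)

print(orders)
```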
With this, you could predict that it would be best to have your shelves stocked on the weekends!
This short example should have shown how a little feature engineering can transform the way you understand your data. For a machine, such simple, direct relationships can do wonders.
Now that you have wrapped your head around why feature engineering is so important, how it can work, and why it can’t simply be done mechanically, let’s explore a few feature engineering techniques that can help!
While understanding the data and the targeted problem is an indispensable part of feature engineering in machine learning, and there are indeed no hard-and-fast rules for how to achieve it, the following feature engineering techniques are must-knows:
Imputation deals with handling missing values in data. While deleting records that are missing certain values is one way of dealing with this issue, it could also mean losing out on a chunk of valuable data. This is where imputation can help. It can be broadly classified into two types. Namely:
Notice how the imputation technique given above is consistent with the idea behind the normal distribution (values are more likely to fall close to the mean than at the edges), which makes it a fairly good estimate of the missing data. Other options include drawing a replacement value from a normal distribution with the mean and standard deviation of the existing values, or even replacing the missing value with an arbitrary value.
However, be cautious with this technique, because preserving the size of the data can come at the cost of its quality. For example, say that in the candy problem above you were given five records, instead of one, with the ‘Candy Variety’ missing. Using the above technique you would fill in all the missing values as ‘Sour Jelly’, possibly leading you to predict high sales of Sour Jellies all through the year! It is therefore wise to filter out records that have more than a certain number of missing values, or that are missing certain critical values, and to use your discretion depending on the size and quality of the data you are working with.
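The ideas above can be sketched in pandas as follows. This is only an illustration under stated assumptions: the data is made up, the median is used for the numerical column (the mean would work the same way), the most frequent label is used for the categorical column, and the `thresh=1` cut-off for dropping records is an arbitrary choice:

```python
import pandas as pd

# Hypothetical records; None marks a missing value.
candy = pd.DataFrame({
    "Price": [2.0, None, 3.0, 10.0, None],
    "Candy Variety": ["Sour Jelly", "Sour Jelly", None, "Fruit Drops", None],
})

# First drop records that are missing too much: keep rows with at least 1 known value.
candy = candy.dropna(thresh=1).reset_index(drop=True)

# Numerical column: fill gaps with the median of the observed values.
candy["Price"] = candy["Price"].fillna(candy["Price"].median())

# Categorical column: fill gaps with the most frequent label (the mode).
most_common = candy["Candy Variety"].mode().iloc[0]
candy["Candy Variety"] = candy["Candy Variety"].fillna(most_common)

print(candy)
```

Note how the filled-in variety is always the dominant label, which is exactly the bias the paragraph above warns about when many values are missing.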
Discretization involves taking a set of data values and grouping them in some logical fashion into bins (or buckets). Binning can be applied to numerical as well as categorical values. It can help prevent a model from overfitting the data, but comes at the cost of a loss of granularity. The grouping of data can be done as follows:
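As a rough sketch, numerical values can be binned with fixed edges and categorical values can be mapped onto broader groups. The bin edges, labels, and category groupings below are hypothetical choices made for illustration:

```python
import pandas as pd

# Numerical binning: hand-picked edges (0, 2], (2, 5], (5, 10].
prices = pd.Series([0.5, 1.2, 2.5, 4.0, 7.5, 9.9], name="Price")
price_band = pd.cut(prices, bins=[0, 2, 5, 10], labels=["low", "medium", "high"])

# Categorical binning: map detailed labels onto coarser groups.
variety = pd.Series(["Sour Jelly", "Fruit Drops", "Chocolate Hearts", "Sour Worms"])
group_of = {
    "Sour Jelly": "Sour",
    "Sour Worms": "Sour",
    "Fruit Drops": "Fruity",
    "Chocolate Hearts": "Chocolate",
}
variety_group = variety.map(group_of)

print(price_band.tolist())
print(variety_group.tolist())
```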
Categorical encoding is the technique of encoding categorical features as numerical values, which are usually simpler for an algorithm to work with. One-hot encoding (OHE) is a popular technique of categorical encoding. Here, each categorical value is converted into a set of simple numerical 1s and 0s without loss of information. As with other techniques, OHE has its disadvantages and should be used with care: it can dramatically increase the number of features and create highly correlated features.
Besides OHE, there are other methods of categorical encoding, such as: (1) count and frequency encoding, which captures how often each label occurs; (2) mean encoding, which captures each label's relationship with the target; and (3) ordinal encoding, which assigns a number to each unique label.
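The four encodings mentioned above could look roughly like this in pandas. The data frame is made up, and Price is assumed to be the target for the mean encoding:

```python
import pandas as pd

# Hypothetical data; "Price" plays the role of the target.
df = pd.DataFrame({
    "Candy Variety": ["Sour Jelly", "Fruit Drops", "Sour Jelly", "Chocolate Hearts"],
    "Price": [2.0, 3.0, 4.0, 5.0],
})

# One-hot encoding: one 0/1 column per label.
one_hot = pd.get_dummies(df["Candy Variety"], prefix="Variety", dtype=int)

# Frequency encoding: the share of rows carrying each label.
freq = df["Candy Variety"].map(df["Candy Variety"].value_counts(normalize=True))

# Mean encoding: the average target value observed for each label.
mean_enc = df.groupby("Candy Variety")["Price"].transform("mean")

# Ordinal encoding: an integer per unique label, in order of first appearance.
ordinal, labels = pd.factorize(df["Candy Variety"])

print(one_hot)
print(freq.tolist(), mean_enc.tolist(), ordinal.tolist())
```

On real data, mean encoding is normally computed on the training split only, since it uses the target and can otherwise leak information into the features.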
Splitting features into parts can sometimes improve their value with respect to the target to be learned. For instance, in the candy example above, Date contributes more to the target than Date and Time does.
Outliers are unusually high or low values in the dataset which are unlikely to occur in normal scenarios. Since these outliers could adversely affect your prediction they must be handled appropriately. The various methods of handling outliers include:
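Two common options are to drop outlying values or to cap them at a boundary. The sketch below assumes the interquartile-range (IQR) rule with the conventional 1.5 multiplier as the outlier test. Both the rule and the sample values are illustrative assumptions:

```python
import numpy as np

# Hypothetical prices with one suspiciously large value.
prices = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 50.0])

# Fences at 1.5 * IQR beyond the first and third quartiles.
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: remove values outside the fences.
kept = prices[(prices >= lower) & (prices <= upper)]

# Option 2: cap values at the fences, keeping the record count unchanged.
capped = np.clip(prices, lower, upper)

print(kept)
print(capped)
```

Removal shrinks the data set, while capping keeps every record but distorts the extreme values, so the choice depends on how much data you can afford to lose.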
Variable transformation techniques can help normalize skewed data. One popular transformation is the logarithmic transformation, which compresses larger numbers and relatively expands smaller ones. This results in less skewed values, especially for heavy-tailed distributions. Other variable transformations include the square root transformation and the Box-Cox transformation, a family of power transformations that includes the logarithm and (up to shifting and scaling) the square root as special cases.
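A small NumPy sketch of these transformations follows. The sample values are made up, and `box_cox` is a hand-written helper for a fixed, user-chosen parameter `lam`, included only to show how the logarithm and square root fall out as special cases:

```python
import numpy as np

# Hypothetical heavy-tailed values spanning several orders of magnitude.
sales = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

log_sales = np.log1p(sales)   # log(1 + x), which is also defined at x = 0
sqrt_sales = np.sqrt(sales)

def box_cox(x, lam):
    """Box-Cox power transform of strictly positive x for a given lambda."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)              # limiting case as lambda approaches 0
    return (x ** lam - 1.0) / lam

print(log_sales)
print(box_cox(sales, 0.5))            # equals 2 * (sqrt(x) - 1)
```

In practice the Box-Cox parameter is usually estimated from the data rather than fixed by hand; `scipy.stats.boxcox` does this by maximum likelihood.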
Feature scaling is done owing to the sensitivity of some machine learning algorithms to the scale of the input values. This technique of feature scaling is sometimes referred to as feature normalization. The commonly used processes of scaling include:
Be cautious when scaling sparse data with the above two techniques: shifting the values can turn zero entries into non-zero ones, which destroys sparsity and adds computational load.
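Two widely used scaling schemes are min-max scaling and standardization. The sketch below assumes these are the techniques in question and applies them to a made-up Length column:

```python
import numpy as np

# Hypothetical candy lengths.
length = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling: rescale to the range [0, 1].
min_max = (length - length.min()) / (length.max() - length.min())

# Standardization: shift to zero mean and scale to unit standard deviation.
z_score = (length - length.mean()) / length.std()

print(min_max)
print(z_score)
```

Both formulas subtract a constant from every entry, which is why zero entries in sparse data generally stop being zero after scaling.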
Feature creation involves deriving new features from existing ones. This can be done with simple mathematical operations, such as aggregations to obtain the mean, median, mode, or sum, or the difference or even the product of two values. Although derived directly from the given data, such features, when carefully chosen to relate to the target, can have a real impact on performance (as demonstrated later!).
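As an illustration, the sketch below derives a difference feature and a group-level aggregate in pandas. The data and the `Cost` column are hypothetical and introduced only for this example:

```python
import pandas as pd

# Hypothetical orders; "Cost" is an invented column for illustration.
orders = pd.DataFrame({
    "Candy Variety": ["Sour Jelly", "Sour Jelly", "Fruit Drops", "Fruit Drops"],
    "Price": [2.0, 4.0, 3.0, 5.0],
    "Cost": [1.0, 2.5, 2.0, 2.0],
})

# Difference of two existing columns.
orders["Margin"] = orders["Price"] - orders["Cost"]

# Aggregate per group, broadcast back onto every row of that group.
orders["Variety Mean Price"] = orders.groupby("Candy Variety")["Price"].transform("mean")

# Combine both ideas: how far each order sits from its variety's average.
orders["Price vs Variety Mean"] = orders["Price"] - orders["Variety Mean Price"]

print(orders)
```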
While the techniques listed above are by no means a comprehensive list of techniques, they are popularly used and should definitely help you get started with feature engineering in machine learning.
We have gone over what Feature Engineering is, some commonly used feature engineering techniques, and its impact on our machine learning model’s performance. But why just take someone’s word for it?
Let’s consider a simple price prediction problem for our candy sales data:
We will use a simple linear regression model to predict the price of the various types of candy and experience first-hand how to implement feature engineering in Python.
Let’s start by building a function to calculate the coefficients using the standard formula for calculating the slope and intercept for our simple linear regression model.
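A minimal sketch of such a function, using only the standard library. The function name and the sample points are assumptions made for illustration. The slope is the sum of products of deviations from the means divided by the sum of squared deviations of x, and the intercept follows from the means:

```python
from statistics import fmean

def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_mean, y_mean = fmean(x), fmean(y)
    # Sum of products of deviations, and sum of squared deviations of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Points lying exactly on y = 2x + 1.
slope, intercept = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```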
Now we build our initial model without any feature engineering, by trying to relate one of the given features to our target. From the given data, it seems most likely that the Length or the Breadth of the candy is related to the price.
Let us start by trying to relate the length of the candy with the price.
We observe from the figure that Length does not have a particularly linear relation with the price.
We attempt a similar prediction with the Breadth and get a somewhat similar outcome. (You can execute this by simply replacing ‘Length’ with ‘Breadth’ in the above code block.)
Finally, it’s time to apply our newly gained knowledge of feature engineering! Instead of using just the given features, we use the Length and Breadth features to derive a new feature called Size, which (you might have already guessed) should have a much more linear relation with the Price of candy than either of the two features it was derived from.
We now use this new feature Size to build a new simple linear regression model.
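The effect can be sketched with NumPy as below. The measurements are made up, and the prices are deliberately constructed to depend on the area of the pack, so this only illustrates the mechanics and is not a result from the article's data. `r_squared` is a hypothetical helper that fits a line with `np.polyfit` and reports how much of the variance it explains:

```python
import numpy as np

# Made-up measurements; Price is constructed from the area on purpose.
length = np.array([2.0, 4.0, 6.0, 3.0, 5.0, 1.0])
breadth = np.array([6.0, 2.0, 1.0, 4.0, 3.0, 5.0])
price = 0.5 * length * breadth + 1.0

def r_squared(x, y):
    """Fit y = slope * x + intercept and return the coefficient of determination."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

# The engineered feature: product of the two given features.
size = length * breadth

print("R^2 using Length:", r_squared(length, price))
print("R^2 using Size:  ", r_squared(size, price))
```

When the target really is driven by the product of two inputs, a single engineered feature lets even a straight-line model capture the relationship that neither input captures alone.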
If you thought the previous predictions with the Length (or Breadth) feature were not too disappointing, you will agree that the results with the Size feature are quite spectacular!
This example shows that simply multiplying the Length and Breadth features of a pack of candy yields Price predictions well beyond what the much weaker relationship between Price and Length (or Breadth) allows. When working with real-life data, the way you use feature engineering could likewise be the difference between a simple model that works perfectly well and a complex model that doesn’t.
Candies aside, the takeaway should be that simple but well-thought-out feature engineering could be what tips the balance between a good machine learning model and a bad one. It is important to remember that the activities involved in feature engineering, owing to their nature, are not always straightforward. They may involve an iterative process of brainstorming, creating features, building models, and doing it all over again from the top. It cannot be emphasized enough that there is no ultimate approach, just one that is right for your purpose.
But rest assured, with practice it gets easier. The aspects covered in this article should help you get started on your journey towards simpler models and better predictions.
And when in doubt, still choose to trust the process of Feature Engineering, for as Ronald Coase rightly said ‘If you torture the data long enough, they will confess anything.’