July 16, 2024


A simple yet highly practical feature

Photo by Nick Fewings on Unsplash

Scikit-learn is one of the most frequently used Python libraries in the data science ecosystem. It serves as a complete tool for machine learning tasks from data preprocessing to model evaluation.

The features in a tabular dataset are rarely used as they appear in the raw data. They usually require an extra step of data preprocessing before being used as input to a machine learning model.

OneHotEncoder is an example of these transformations. It encodes categorical features as one-hot numeric arrays. The drawing below illustrates how one-hot encoding works on a categorical feature.

(image by author)
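To make the idea concrete, here is a minimal sketch using pandas' get_dummies (the city values below are made up for illustration and are not from the drawing):

```python
import pandas as pd

# a small categorical feature (hypothetical values for illustration)
cities = pd.Series(["Rome", "Houston", "Rome", "Madrid"], name="City")

# one-hot encoding creates one binary column per distinct category;
# each row has a 1 in the column of its category and 0 elsewhere
one_hot = pd.get_dummies(cities)
print(one_hot)
```

Three distinct values produce three columns, one per category.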

This is a required operation for some machine learning algorithms because they expect all features to be numeric. Unless we work with tree-based algorithms such as random forests, categorical features need to be converted to numeric ones.

As we see in the above illustration, one-hot encoding creates a column for each category. When a categorical variable has many distinct values, this results in a very high-dimensional feature space.

If there are some categories that have very few occurrences compared to the other categories, it is better to group them. Creating a separate column for such categories is likely to add computation and memory burden to the model without providing significant value.

The new OneHotEncoder that comes with Scikit-learn 1.1 allows for grouping the infrequent categories. Let’s do an example to demonstrate how it is used.

Before we begin, let’s make sure you have the correct version.

import sklearn
print(sklearn.__version__)

If you have a version prior to 1.1, you can update it using pip.

pip install --upgrade scikit-learn

Let’s create a sample DataFrame with two categorical features.

import pandas as pd

df = pd.DataFrame({
    "City": ["Houston"] * 25 + ["Rome"] * 30 + ["Madrid"] * 3 + ["London"] * 2,
    "Division": ["A"] * 30 + ["B"] * 25 + ["C"] * 1 + ["D"] * 4
})

df["City"].value_counts()
# output
Rome       30
Houston    25
Madrid      3
London      2
Name: City, dtype: int64

df["Division"].value_counts()
# output
A    30
B    25
C     1
D     4
Name: Division, dtype: int64

In the City column, there are only 3 Madrid and 2 London values, far fewer than the occurrences of the other two cities. The distribution of values in the Division column is similarly skewed.

The next step is to encode these two features.

from sklearn.preprocessing import OneHotEncoder

# create an encoder and fit the dataframe
enc = OneHotEncoder(sparse=False).fit(df)
encoded = enc.transform(df)

# convert it to a dataframe
encoded_df = pd.DataFrame(encoded, columns=enc.get_feature_names_out())
(image by author)

Since we did not group the infrequent values, the encoded data has 8 columns. Let’s try it with grouping, which can be done using the min_frequency parameter. Categories that occur fewer times than the specified minimum frequency are grouped into a single infrequent category.

# create an encoder and fit the dataframe
enc = OneHotEncoder(min_frequency=5, sparse=False).fit(df)
encoded = enc.transform(df)

# convert it to a dataframe
encoded_df = pd.DataFrame(encoded, columns=enc.get_feature_names_out())
(image by author)

We now have a DataFrame with 6 features instead of 8. The difference becomes more significant when working with categorical variables that have a high number of distinct values.

Consider a feature that has 20 distinct values, where 95% of the rows belong to 4 of them and the remaining 5% are spread across the other 16. In this case, an efficient strategy is to group those 16 values into a single category.


Thank you for reading. Please let me know if you have any feedback.

Scikit-learn 1.1 Comes with an Improved OneHotEncoder. Republished from Towards Data Science: https://towardsdatascience.com/scikit-learn-1-1-comes-with-an-improved-onehotencoder-5a1f939da190



