July 16, 2024


Identify the important features of a model in Python using SHAP and scikit-learn

Image taken from Unsplash by Joshua Golde

This article provides the intuition and the Python implementation behind several ways of identifying the important features of a machine learning model. The article is structured as follows.

Table of Contents

  • What is Feature Importance in Machine Learning?
  • Coefficient Based Feature Importance
  • Permutation Feature Importance
  • Tree Feature Importance
  • SHAP Feature Importance
  • Implementation
    – Requirements
    – Synthesize Data & Generate Model
    – Coefficient
    – Permutation
    – Tree
    – SHAP
  • Concluding Remarks
  • Resources

Feature importance is an integral part of model development. It highlights which of the features passed into a model have a greater impact on the predictions than others. The results can feed directly into model testing and model explainability. There are various ways to calculate feature importance, such as:

  1. Coefficient based feature importance
  2. Permutation based feature importance
  3. Tree feature importance
  4. SHAP

Be aware that not every method of calculating feature importance applies to every type of model. The methods covered here apply mainly to classical supervised learning problems such as classification and regression.

Coefficient based feature importance is probably the simplest to understand out of them all. Intuitively, coefficient based feature importance refers to models which generate predictions as a weighted sum of the input values.

This approach is not available for all models. It applies to linear models such as linear regression, logistic regression, ridge regression, and support vector machines (only when the kernel is linear). What these models have in common is that they learn one weight (coefficient) per input feature, and these weights can be interpreted as feature importance. You can rank the features by the absolute value of their coefficients, from largest to smallest: the largest belong to the most important features and the smallest to the least important. The comparison is only meaningful when the features are on a comparable scale, for example after standardization.
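
As a minimal sketch of this idea, the snippet below fits a linear regression on standardized features and ranks the features by the absolute value of their coefficients. The dataset, the feature names (f0, f1, f2), and the choice of StandardScaler are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three features on very different scales. The target
# depends strongly on f0, weakly on f1, and not at all on f2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 10.0, 100.0])
y = 5.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=500)
feature_names = ["f0", "f1", "f2"]

# Standardize first so the coefficients are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

# Rank by magnitude: a large negative weight matters as much as a large positive one.
importance = np.abs(model.coef_)
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]

for name in ranking:
    print(f"{name}: {importance[feature_names.index(name)]:.3f}")
```

Without the scaling step, a feature measured in large units would receive a small raw coefficient and could look unimportant even when it is not.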

Permutation based feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled. This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature. [2]

– http://scikit-learn.org/stable/modules/permutation_importance.html

Permutation importance scores can be negative as well as positive. A negative score for a feature means that the predictions on the shuffled data happened to be more accurate than those on the real data. This occurs when the feature has little or no real effect on the model (its true importance is close to zero) and random chance makes the shuffled predictions slightly better. Such features are effectively noise.
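
To make the definition concrete, here is a hand-rolled sketch of the shuffling procedure. It is not scikit-learn's implementation: the helper name permutation_scores, the random forest model, the R^2 metric, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: column 0 matters most, column 1 a little, column 2 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 3))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=600)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

def permutation_scores(model, X, y, rng, n_repeats=10):
    """Mean drop in R^2 when each column is shuffled in turn (illustrative helper)."""
    baseline = r2_score(y, model.predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling one column breaks its relationship with the target.
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops[j] += baseline - r2_score(y, model.predict(X_shuffled))
    return drops / n_repeats

scores = permutation_scores(model, X_test, y_test, rng)
for j, score in enumerate(scores):
    print(f"feature {j}: {score:.4f}")
```

The score for the noise column should land near zero and may come out slightly negative, which is the situation described above.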

Tree-based models from scikit-learn such as decision trees, random forests, gradient boosting, AdaBoost, etc. have feature importance built in. They calculate importance scores from the reduction in the criterion used to select split points, such as Gini or entropy [1]. You can access these scores through the feature_importances_ attribute, which is available after training the model.
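
A short sketch of reading these scores from a fitted model follows. The synthetic data and the choice of GradientBoostingRegressor are assumptions for illustration; any fitted scikit-learn tree ensemble exposes the same attribute.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical data: columns 0 and 1 drive the target, columns 2 and 3 are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based scores, normalized to sum to 1. For regressors the split
# criterion is a squared-error variant rather than Gini or entropy.
importances = model.feature_importances_

for j in np.argsort(importances)[::-1]:
    print(f"feature {j}: {importances[j]:.3f}")
```

Unlike permutation importance, these scores are computed from the training data alone and are never negative.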

SHAP (SHapley Additive exPlanations) is a popular method for model explainability that was introduced in a research paper. SHAP feature importance is an alternative to permutation feature importance [3]. The difference between the two is that SHAP is based on the magnitude of feature attributions, whereas permutation importance is based on the decrease in model performance [3]. The SHAP library ships with a set of explainer classes that support linear models, kernel-based explanations, trees, deep learning models, and more.
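
To illustrate what "magnitude of feature attributions" means, the sketch below computes attributions by hand for a linear model instead of calling the SHAP library. It relies on an assumption: when features are treated as independent, the SHAP value of feature j for one sample reduces to coef_j * (x_j - mean(x_j)). The data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: column 1 has the largest effect, column 2 has none.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

model = LinearRegression().fit(X, y)

# Per-sample attribution of each feature, relative to the average input.
shap_values = model.coef_ * (X - X.mean(axis=0))
base_value = model.predict(X).mean()

# Additivity: the base value plus a sample's attributions gives its prediction.
reconstructed = base_value + shap_values.sum(axis=1)

# Global importance: mean absolute attribution per feature.
mean_abs = np.abs(shap_values).mean(axis=0)
print(mean_abs)
```

Note that no target values or model scores are needed once the model is fitted, which is the key contrast with permutation importance.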



Synthesize Data & Generate Model

Be advised that because we are using randomly generated data, the resulting feature importances will be largely meaningless. The aim is only to show how you can implement these feature importance metrics in code for the problem you are working on. The script below synthesizes a random dataset with 5 features and trains 2 regression models on it: a Gradient Boosting Regressor and a Support Vector Regressor.
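
The original script is not reproduced in this copy of the article. A minimal sketch of such a setup might look like the following, assuming make_regression for the synthetic data; the sample count, noise level, split ratio, seeds, and the feature_0 to feature_4 names are all illustrative choices rather than the author's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthesize a random regression dataset with 5 features (assumed settings).
X, y = make_regression(
    n_samples=1000, n_features=5, n_informative=3, noise=10.0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the two regression models mentioned in the text.
gbr = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
svr = SVR(kernel="linear").fit(X_train, y_train)

print(f"GBR R^2: {gbr.score(X_test, y_test):.3f}")
print(f"SVR R^2: {svr.score(X_test, y_test):.3f}")

# coef_ is only defined for the linear kernel; it has shape (1, n_features).
svr_importance = dict(zip(feature_names, np.abs(svr.coef_.ravel())))
print(svr_importance)
```

The linear kernel matters here: with any other kernel the SVR exposes no coef_, so the coefficient-based approach would not apply to it.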


Coefficient feature importance. Image provided by the author.


Permutation feature importance. Image provided by the author.
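
Scores like those in the figure above can be obtained with scikit-learn's permutation_importance function. The sketch below is one way to call it, assuming a make_regression dataset and a Gradient Boosting Regressor; the n_repeats value and the seeds are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Assumed synthetic setup with 5 features.
X, y = make_regression(
    n_samples=1000, n_features=5, n_informative=3, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

for i in result.importances_mean.argsort()[::-1]:
    print(
        f"feature_{i}: {result.importances_mean[i]:.3f} "
        f"+/- {result.importances_std[i]:.3f}"
    )
```

Evaluating on held-out data rather than the training set shows how much the model relies on each feature when generalizing.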


Tree based feature importance. Image provided by the author.


SHAP feature importance on SVM regression model with linear kernel. Image provided by the author.
SHAP feature importance on Gradient Boosting Regressor. Image provided by the author.

The summary plot in the SHAP library lets you see at a glance which features matter most to the model, based on their SHAP values. The first plot shows the distribution of the impact each feature has, while the second is a bar plot of the mean absolute value of the SHAP values for each feature.

With the intuition behind these approaches to feature importance and the associated code, you should be able to apply them to your own models to see which features have the highest impact. This can be very useful for model testing and explainability.

Feel free to check out the repository associated with this tutorial on my GitHub page here.


Feature Importance in Machine Learning, Explained Republished from Source https://towardsdatascience.com/feature-importance-in-machine-learning-explained-443e35b1b284?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed



