This article aims to provide the intuition and implementation associated with data drift in Python. It covers the implementation of, and differences between, two approaches to calculating drift, namely cross entropy and KL divergence. The following is the outline of this article.
Table of Contents
- What is Data Drift?
- Drift Metrics
– Cross Entropy
– KL Divergence
- Solution Architecture
– Requirements
- Implementation
– Generate Data
– Train Model
– Generate Observations
– Calculate Drift
– Visualize Drift Over Time
- Obstacles of Calculating Drift
- Concluding Remarks
- Resources
What is Data Drift?
MLOps is an integral component of building successful machine learning models and deploying them into production. Data drift falls under the category of model monitoring in MLOps. It refers to quantifying the changes in observed data with respect to the training data. Over time, these changes can have a drastic impact on the quality of the predictions generated by the model, often for the worse. Tracking the drift associated with the training features and the prediction should be an integral part of model monitoring and of identifying when the model should be retrained.
You can reference my other article for more details on the concept and architecture behind monitoring machine learning models in production environments here.
The only instance where you might not want to monitor the drift associated with your model’s predictions / features is when you’re retraining the model as frequently as you’re generating predictions. This is a common occurrence in many applications of time series models. However, there are various other things you can track to gauge the quality of the models you’re generating. This article will mainly focus on models associated with classical machine learning (classification, regression and clustering).
Drift Metrics
Both metrics outlined below are statistical measures quantifying how similar a pair of probability distributions are.
Cross Entropy
Cross entropy can be defined by the following formula:

H(p, q) = -\sum_{x} p(x) \log q(x)

- p : True probability distribution
- q : Estimated probability distribution
From an information theory point of view, entropy reflects the amount of information needed for removing the uncertainty [3]. Be aware that the cross entropy of distributions A and B will be different from the cross entropy of distributions B and A.
KL Divergence
Kullback-Leibler divergence, also known as KL divergence, can be defined through the following formula:

D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

- P : True probability distribution
- Q : Estimated probability distribution
The Kullback–Leibler divergence is then interpreted as the average difference of the number of bits required for encoding samples of P using a code optimized for Q rather than one optimized for P [1]. Be aware that the KL divergence of distributions A and B will be different from the KL divergence of distributions B and A.
Note that neither metric is a true distance metric. This is because they lack symmetry.
entropy / KL divergence of A,B != entropy / KL divergence of B,A
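A quick sketch of this asymmetry, using the `entropy` function from `scipy` (which computes the KL divergence when given two distributions) on a pair of made-up example distributions:

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])

# KL divergence is not symmetric: D_KL(P || Q) != D_KL(Q || P)
print(entropy(p, q))  # ~1.34
print(entropy(q, p))  # ~1.40
```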
Solution Architecture
The following diagram outlines how the machine learning lifecycle operates while incorporating model monitoring. To monitor the performance of a model, a variety of data should be saved from the training phase, namely the features and target data used to train the model. This provides a ground truth data source to compare new observations against.
Requirements
The following modules and versions were used to run the code shown in the implementation locally. All of them are popular libraries for data science / analytics / machine learning, so installing these specific versions should not be a problem for most users.
Python=3.9.12
pandas>=1.4.3
numpy>=1.23.2
scipy>=1.9.1
matplotlib>=3.5.1
sklearn>=1.1.2
Implementation
I’m going to showcase how to calculate data drift over time through an example using synthetic data. Be advised that the values I generate will not be consistent with the values you generate, since the data is random. For the same reason, there aren’t any meaningful results to be interpreted from the visualizations / data provided. The aim is to provide reusable and reconfigurable code for you to use in your applications.
Generate Data
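A minimal sketch of such a script; the distribution parameters and the noisy linear target are assumptions made purely for illustration:

```python
import uuid

import numpy as np
import pandas as pd

N_ROWS = 1_000
FEATURES = ['feature1', 'feature2', 'feature3']

# Ground truth training data: one uuid per row plus three random features
df = pd.DataFrame({
    'uuid': [str(uuid.uuid4()) for _ in range(N_ROWS)],
    'feature1': np.random.normal(loc=0, scale=1, size=N_ROWS),
    'feature2': np.random.normal(loc=5, scale=2, size=N_ROWS),
    'feature3': np.random.normal(loc=-2, scale=4, size=N_ROWS),
})

# Illustrative target: a noisy linear combination of the features
df['target'] = (
    2 * df['feature1']
    - 0.5 * df['feature2']
    + df['feature3']
    + np.random.normal(loc=0, scale=0.5, size=N_ROWS)
)
```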
The script above will generate a synthetic dataset which consists of 1,000 rows and the columns `uuid`, `feature1`, `feature2`, `feature3` and `target`. This will be our ground truth data, which the model will be trained on.
Train Model
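A minimal sketch of this step, reusing the `df` and `FEATURES` objects from the previous snippet; the hyperparameter values here are arbitrary choices:

```python
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest regressor on the synthetic ground truth data
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(df[FEATURES], df['target'])
```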
For the purposes of this tutorial, the script above will allow you to create a random forest regression model based on the features and target which we’ve generated above. Assume that this model will get pushed into a production environment and will be called on a daily basis.
Generate Observations
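A sketch of what generating those observations might look like, reusing the imports above; the shifted means, widened scales and larger row count are assumptions made to mimic day-one production traffic:

```python
N_OBS = 1_500  # more traffic on day 1 than the 1,000 training rows

# Day 1 observations, drawn from slightly shifted distributions
observations = pd.DataFrame({
    'uuid': [str(uuid.uuid4()) for _ in range(N_OBS)],
    'feature1': np.random.normal(loc=0.25, scale=1.1, size=N_OBS),
    'feature2': np.random.normal(loc=5.5, scale=2.2, size=N_OBS),
    'feature3': np.random.normal(loc=-1.5, scale=4.5, size=N_OBS),
})
```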
The script above will generate observations associated with the features for the first day that the model is in production and gets called upon. We can now visualize the difference in the ground truth training data with respect to the observation data.
As you can see from the graph above, on the first day the model is in production there are more observations for the feature than there are rows of ground truth. This is a problem because we can’t compare two lists of values which are not the same length; if we tried, it would yield misleading results. To calculate the drift, we first need to make the length of the observation data equivalent to the length of the ground truth data. We can do this by creating N buckets and counting the frequency of observations in each bucket, essentially creating a histogram. The snippet below will visualize these distributions against each other.
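A sketch of that snippet, assuming we’re inspecting `feature1` and using 25 buckets (an arbitrary choice):

```python
import matplotlib.pyplot as plt

N_BINS = 25

fig, ax = plt.subplots()

# Bin the ground truth first, then reuse its bin edges for the observations
# so both samples are counted over identical buckets
train_counts, bin_edges, _ = ax.hist(
    df['feature1'], bins=N_BINS, alpha=0.5, label='ground truth'
)
obs_counts, _, _ = ax.hist(
    observations['feature1'], bins=bin_edges, alpha=0.5, label='observations'
)

ax.set_xlabel('feature1')
ax.set_ylabel('frequency')
ax.legend()
plt.show()
```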
Now that the two datasets are of the same size, we can compare the drift in the two distributions.
Calculate Drift
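A sketch of the drift calculation, following the binning, `softmax` and `entropy` steps described below; the 25-bin default is an arbitrary choice:

```python
from scipy.special import softmax
from scipy.stats import entropy

def calculate_drift(train_values, obs_values, n_bins=25):
    """Drift score of the observed values with respect to the training values."""
    # Normalize both inputs to vectors of the same length by binning them
    # over identical bucket edges
    train_counts, bin_edges, _ = plt.hist(train_values, bins=n_bins)
    obs_counts, _, _ = plt.hist(obs_values, bins=bin_edges)
    plt.close()  # only the counts are needed here, not the plot

    # Convert the raw frequencies into probability distributions
    p = softmax(train_counts)
    q = softmax(obs_counts)

    # entropy(p, q) computes the relative entropy (KL divergence) D(p || q);
    # entropy(p) alone would return the plain entropy of p
    return entropy(p, q)

feature1_drift = calculate_drift(df['feature1'], observations['feature1'])
```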
The script above outlines how you can calculate the drift (using the `entropy` implementation in `scipy`) associated with the observation data with respect to the training data. It begins by normalizing the size of the input vectors to the same length through the `hist` method in `matplotlib`, converts those values into probabilities through the `softmax` function, and finally calculates the drift through the `entropy` function.
Visualize Drift Over Time
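A sketch of one way to produce such a plot, reusing the `calculate_drift` function above; it simulates 30 days of traffic whose mean shifts a little more each day, with the shift itself being a made-up assumption to make the trend visible:

```python
# Simulate 30 days of production traffic for feature1, drifting slightly
# more each day, and track the daily drift score
days = np.arange(1, 31)
drift_scores = []

for day in days:
    n_daily = np.random.randint(1_000, 10_000)  # traffic volume varies per day
    daily_values = np.random.normal(loc=0.05 * day, scale=1.0, size=n_daily)
    drift_scores.append(calculate_drift(df['feature1'], daily_values))

plt.plot(days, drift_scores)
plt.xlabel('day in production')
plt.ylabel('drift score')
plt.title('feature1 drift over time')
plt.show()
```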
Based on the results you generate for your dataset, it is advisable to identify a threshold such that, if the drift score exceeds it for the majority of the important features the model relies on, that is a strong indicator the model should be retrained. Feature importance can be identified easily for tree-based models through `sklearn` or through tools like SHAP.
Obstacles of Calculating Drift
There are various obstacles you might encounter when calculating the data drift of the machine learning models you’re using.
- Dealing with features / predictions with the value 0. This will yield a division-by-zero error in both drift implementations. A quick and easy workaround is to replace zero with a very small value close to zero, as shown in the sketch after this list. Since monitoring drift is often handled case by case, be proactive in understanding what impact this can have on the problem you’re working on.
- Comparing a pair of distributions which are not the same length. Suppose you trained the model on 1,000 observations for each feature and target, but when you’re generating predictions on a daily basis, the number of observations for the features and target varies from 1,000 to 10,000 based on the amount of traffic the platform gets. This is problematic because you cannot compare two distributions of different sizes. To resolve it, you can use the binning method from the implementation above to bin the training data and observations into groups of the same size, and then calculate the drift on top of that data. This can be done easily through the `hist` method in the `matplotlib` library.
- Getting `NaN` values when using the `softmax` function to convert frequencies into probabilities. This happens because the `softmax` function relies on exponentials, and you will receive `NaN` results in the output of the softmax when the computer cannot calculate the exponential of a large number. If this is the case, you might want to look into an implementation other than `softmax`, or normalize the values you pass in so that `softmax` works.
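For the first obstacle, a minimal sketch of the zero-replacement workaround; the epsilon value here is an arbitrary choice that should be tuned to your problem:

```python
import numpy as np

EPSILON = 1e-10  # arbitrary small constant, tune for your use case

p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.4, 0.0, 0.3, 0.3])

# Replace exact zeros before computing cross entropy / KL divergence,
# since a zero in q makes p * log(p / q) blow up
p = np.where(p == 0, EPSILON, p)
q = np.where(q == 0, EPSILON, q)
```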
Concluding Remarks
This article went into detail on how to calculate data drift in the application of classical machine learning. I covered the intuition and implementation behind common drift calculation metrics like KL divergence and cross entropy.
Furthermore, this article outlined common obstacles you might encounter when trying to calculate drift, namely the division-by-zero error when there are zero values and the issue of comparing a pair of distributions which aren’t the same size. Be advised that this article is primarily helpful to those who aren’t frequently retraining their models. Model monitoring acts as a means to measure the success of a model and to identify when it is regressing in its performance due to drift. Both are indicators for retraining, or for reiterating on the model development phase altogether.
Feel free to check out the repository associated with this tutorial on my GitHub page here.
If you’re looking to transition into the data industry and want mentorship and guidance from seasoned mentors, then you might want to check out Sharpest Minds. Sharpest Minds is a mentorship platform where mentors (seasoned practicing data scientists, machine learning engineers, research scientists, CTOs, etc.) aid in your development and learning to help you land a job in data. Check them out here.