Data leakage represents, together with over/underfitting, the main cause of failure of machine learning projects that go into production
Data leakage is a threat that preys on data scientists regardless of seniority: it can catch even professionals with years of experience in the field.
Data leakage occurs when information from the evaluation set (whether validation or test set) leaks into the training process, so the model is evaluated on information it has, in effect, already seen.
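A minimal sketch of one very common source of leakage: fitting a preprocessing step (here a scaler) on the full dataset before splitting, so statistics computed from the test rows contaminate the training data. All variable names and the synthetic data below are illustrative, not from the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# LEAKY: the scaler is fit on ALL rows, so test-set statistics
# (mean, std) leak into how the training data is transformed.
scaler_leaky = StandardScaler().fit(X)
X_train_leaky = scaler_leaky.transform(X_train)

# CORRECT: fit preprocessing on the training split only,
# then apply the *same* fitted transform to the test split.
scaler_ok = StandardScaler().fit(X_train)
X_train_ok = scaler_ok.transform(X_train)
X_test_ok = scaler_ok.transform(X_test)
```

In practice, scikit-learn's `Pipeline` helps enforce this, since the whole pipeline is fit only on the training folds during cross-validation.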
But why does data leakage claim so many victims? Because a model that looks excellent across many experiments and evaluations during development can still fail spectacularly in production.
Avoiding data leakage is not easy. I hope this article helps you understand why, and how to avoid it in your own projects!
Here’s an example to help you understand what data leakage is.
Imagine we are applied-AI developers employed by a company that mass-produces children’s toys. Our task is to build a machine learning model that predicts whether a toy will be subject to a refund request within 3 days of its sale.
We receive the data from the factory in the form of images capturing each toy before packaging. We use these images to train our model, which performs very well both in cross-validation and on the test set.
We deliver the model, and for the first month the client reports refund requests for only 5% of the toys sold.
In the second month we prepare for the retraining of the model. The factory sends us more photographs, which we use to…