At this stage, the model is automatically retrained based on the trigger from the monitoring system. This process of retraining is also known as continuous learning. The objectives of continuous learning are:
- Combat sudden data drifts that may occur, ensuring the model remains effective even when faced with unexpected changes in the data.
- Adapt to rare events such as Black Friday, where patterns and trends in the data may significantly deviate from the norm.
- Overcome the cold start problem, which arises when the model needs to make predictions for new users lacking historical data.
Microsoft and Google are major players in the cloud computing market, with Azure holding a 22% market share and Google at 10%. They offer a wide range of services, including computing, storage, and development tools, which are essential components for building advanced ML infrastructure.
Like any business, their main goal is to generate revenue by selling these services. This is partially why their blogs emphasize advancement and automation. However, a higher level of maturity doesn’t guarantee better results for your business. The optimal solution is the one that aligns with your company’s specific needs and tech stack.
While maturity levels can help you gauge how advanced your current setup is, they shouldn’t be followed blindly, since Microsoft and Google’s main incentive is to sell their services. A clear example is their push for automated retraining. This process requires a lot of computation, but it’s often unnecessary and sometimes harmful. Retraining should be done only when needed. What’s more important for your infrastructure is having a reliable monitoring system and an effective root cause analysis process.
Monitoring should start from the manual level
A limited monitoring system appears at level 2 in the described maturity levels. In reality, you should monitor your model as soon as business decisions are made based on its output, regardless of maturity level. Doing so reduces the risk of failure and shows how the model performs against your business goals.
The initial step in monitoring can be as simple as comparing the model’s predictions to the actual values. This basic comparison is a baseline assessment of the model’s performance and a good starting point for further analysis when the model is failing. Additionally, it’s important to consider the evaluation of data science efforts, which includes measuring the return on investment (ROI). This means assessing the value that data science techniques and algorithms bring to the table. It’s crucial to understand how effective these efforts are in generating business value.
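The basic comparison of predictions against actual values can be sketched in a few lines. This is only an illustration under assumptions: the choice of mean absolute error as the metric, the function names, the `baseline_mae` reference value, and the 20% tolerance are all hypothetical and should be replaced with whatever matches your model and business goals.

```python
from statistics import mean


def mean_absolute_error(predictions, actuals):
    """Average absolute gap between predicted and observed values."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    if not predictions:
        raise ValueError("need at least one prediction")
    return mean(abs(p - a) for p, a in zip(predictions, actuals))


def check_performance(predictions, actuals, baseline_mae, tolerance=0.2):
    """Flag the model when the current MAE exceeds the baseline
    by more than `tolerance` (a relative margin, assumed here to be 20%)."""
    current = mean_absolute_error(predictions, actuals)
    degraded = current > baseline_mae * (1 + tolerance)
    return {"mae": current, "baseline_mae": baseline_mae, "degraded": degraded}


# Hypothetical numbers, for illustration only.
report = check_performance([100, 120, 90], [110, 100, 95], baseline_mae=8.0)
print(report)
```

A flagged result is a reason to start investigating, not an automatic signal to retrain.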
Evaluating ROI gives you insights that can help you make better decisions about allocating resources and planning future investments. As infrastructure evolves, the monitoring system can become more complex, with additional features and capabilities. Even so, a basic monitoring setup should already be in place at the first level of maturity.
Risks of retraining
In the description of level 5, we listed the benefits of automatic retraining in production. However, before adding it to your infrastructure, you should consider the risks related to it:
1. Retraining on delayed data
In some real-world scenarios, like loan-default prediction, labels may be delayed for months or even years. While the ground truth is still arriving, you end up retraining your model on old data, which may not represent the current reality well.
2. Failure to determine the root cause of the problem
If the model’s performance drops, it doesn’t always mean that it needs more data. There could be various reasons for the model’s failure, such as changes in downstream business processes, training-serving skew, or data leakage. You should first investigate to find the underlying issue and then retrain the model if necessary.
3. Higher risk of failure
Retraining amplifies the risk of model failure. Besides adding complexity to the infrastructure, every update is another opportunity for the model to fail, so the more frequently you retrain, the higher the risk. Any undetected problem in data collection or preprocessing will be propagated to the model, resulting in a model retrained on flawed data.
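One way to limit this risk is to validate a batch of data before it is allowed into retraining. The sketch below is a minimal example under assumptions: the field names, the allowed ranges, and the 1% threshold for bad rows are hypothetical, and real pipelines usually need richer checks.

```python
def validate_batch(rows, required_fields, ranges, max_bad_fraction=0.01):
    """Return (ok, bad_count) for a list of dict rows.

    A row counts as bad if a required field is missing or None,
    or if a value falls outside its allowed (low, high) range.
    An empty batch is rejected.
    """
    if not rows:
        return False, 0
    bad = 0
    for row in rows:
        missing = any(row.get(field) is None for field in required_fields)
        out_of_range = any(
            row.get(field) is not None and not (low <= row[field] <= high)
            for field, (low, high) in ranges.items()
        )
        if missing or out_of_range:
            bad += 1
    return bad / len(rows) <= max_bad_fraction, bad


# Hypothetical batch: one negative age and one missing income.
batch = [
    {"age": 30, "income": 50000},
    {"age": -5, "income": 40000},
    {"age": 41, "income": None},
]
ok, bad_count = validate_batch(batch, ["age", "income"], {"age": (0, 120)})
print(ok, bad_count)
```

If the batch fails the check, retraining is skipped and the data issue is investigated first.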
4. Higher costs
Retraining is not a cost-free process. It involves expenses related to:
- Storing and validating the retraining data
- Compute resources to retrain the model
- Testing a new model to determine if it performs better than the current one
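The last cost item, testing whether a new model beats the current one, can be sketched as a comparison on the same holdout set. This is an illustration under assumptions: accuracy as the metric, the function names, and the `min_gain` promotion margin are hypothetical choices, not a prescribed procedure.

```python
def accuracy(predictions, labels):
    """Share of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def should_promote(current_preds, candidate_preds, labels, min_gain=0.01):
    """Promote the retrained model only if it beats the current model
    by at least `min_gain` (an assumed margin) on the same holdout labels."""
    current_acc = accuracy(current_preds, labels)
    candidate_acc = accuracy(candidate_preds, labels)
    return candidate_acc - current_acc >= min_gain, current_acc, candidate_acc


# Hypothetical holdout labels and predictions, for illustration only.
labels = [1, 0, 1, 1]
promote, current_acc, candidate_acc = should_promote(
    current_preds=[1, 0, 0, 0],
    candidate_preds=[1, 0, 1, 0],
    labels=labels,
)
print(promote, current_acc, candidate_acc)
```

Requiring a margin rather than any improvement helps avoid swapping models over noise in the holdout set.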
ML systems are complex. Building and deploying models in a repeatable and sustainable manner is tough. In this blog post, we have explored five MLOps maturity levels based on Google’s and Microsoft’s best practices. We have discussed the evolution from manual deployment to automated infrastructures, highlighting the benefits that each level brings. However, it is crucial to understand that these practices should not be followed blindly. Instead, their adoption should be based on your company’s specific needs and requirements.