Machine Learning is great at solving certain complex problems, usually those involving intricate relationships between features and outcomes that cannot easily be hard-coded as heuristics or if-else statements. However, there are limitations and considerations to keep in mind when deciding whether ML is a good solution for a given problem. In this post we’ll dive deep into the question “to use or not to use ML,” first for “traditional” ML models, and then discussing how the picture is changing with the progress of Generative AI.
To clarify some of the points, I’ll use the following initiative as an example: “As a company, I want to know if my clients are satisfied and the main reasons for dissatisfaction.” A “traditional” ML-based approach to solve this could be:
- Obtain comments clients write about you (App Store or Play Store, Twitter or other social networks, your website…)
- Use a sentiment analysis model to classify the comments into positive / neutral / negative.
- Use topic modeling on the predicted “negative sentiment” comments to understand what they are about.
In supervised ML models, training data is necessary for the model to learn whatever it needs to predict (in this example, the sentiment of a comment). If the data is of low quality (a lot of typos, missing data, errors…), it will be really hard for the model to perform well.
This is typically known as the “garbage in, garbage out” problem: if your data is garbage, your model and predictions will be garbage too.
Similarly, you need enough data for the model to learn the different cases that affect whatever needs to be predicted. In this example, if you have only one negative-labeled comment containing words like “useless” or “disappointed”, the model won’t be able to learn that these words usually appear when the label is “negative.”
Enough volume of training data should also help ensure you have a good representation of the data you will need to perform predictions on. For example, if your training data has no representation of a particular geographical area or a particular segment of the population, it is more likely the model will fail to perform well for those comments at prediction time.
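One way to make the volume and representation checks above concrete is to count how many training examples exist per label and per segment before training anything. The snippet below is a minimal sketch using only the Python standard library; the `coverage_report` helper, the toy rows, the `region` field and the `min_count` threshold are all hypothetical and would need to be adapted to your own data.

```python
from collections import Counter


def coverage_report(rows, key, min_count):
    """Count examples per value of `key` and flag values with too few examples.

    Returns a dict mapping each value to (count, has_enough_examples).
    """
    counts = Counter(row[key] for row in rows)
    return {value: (n, n >= min_count) for value, n in counts.items()}


# Hypothetical toy training set: real data would have far more rows.
training_rows = [
    {"text": "Useless app", "label": "negative", "region": "EU"},
    {"text": "Love it", "label": "positive", "region": "EU"},
    {"text": "It works", "label": "neutral", "region": "EU"},
    {"text": "Great update", "label": "positive", "region": "US"},
]

# The threshold of 2 is arbitrary and only serves the illustration.
label_coverage = coverage_report(training_rows, "label", min_count=2)
region_coverage = coverage_report(training_rows, "region", min_count=2)

print(label_coverage)
print(region_coverage)
```

Any label or segment flagged as `False` is a hint that the model may not see enough examples of it to perform well at prediction time.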
For some use cases, having enough historical data also matters, to ensure we are able to compute relevant lagging features or labels (e.g. “customer pays the credit during the next year or not”).
Again, for traditional supervised ML models, you’ll need a labeled dataset: examples for which you know the final outcome of what you want to predict, to be able to train your model.
The definition of the label is key. In this example, our label would be the sentiment associated with the comment. We might think comments can only be “positive” or “negative”, and then argue that there might be “neutral” comments as well. In this case, it will usually be clear from a given comment whether the label should be “positive”, “neutral” or “negative”. But imagine we had the labels “very positive”, “positive”, “neutral”, “negative” and “very negative”… For a given comment, would it be that easy to decide if it is “positive” or “very positive”? This lack of a clear label definition needs to be avoided, as training with a noisy label will make it harder for the model to learn.
Once the definition of the label is clear, we need to be able to get this label for a sufficiently large, good-quality set of examples, which will form our training data. In our example, we could consider manually tagging a set of comments, either within the company or team, or by outsourcing the tagging to professional annotators (yes, there are people working full time labeling datasets for ML!). The costs and feasibility of obtaining these labels need to be considered.
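A common way to check whether a label definition is clear enough is to have two people tag the same comments and measure how often they agree beyond chance, for instance with Cohen’s kappa. The sketch below is not part of the original approach: the `cohen_kappa` and `collapse` helpers are written here for illustration, and both annotators’ tags are made up. It compares agreement on the five-level scheme with agreement once “very positive” and “very negative” are collapsed into the three-level scheme.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both annotators pick the same class at random.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical class.
    return (observed - expected) / (1 - expected)


def collapse(label):
    """Map the five-level scheme onto positive / neutral / negative."""
    return label.replace("very ", "")


# Hypothetical tags from two annotators for the same six comments.
annotator_a = ["very positive", "positive", "neutral",
               "negative", "very negative", "positive"]
annotator_b = ["positive", "positive", "neutral",
               "very negative", "very negative", "very positive"]

kappa_fine = cohen_kappa(annotator_a, annotator_b)
kappa_coarse = cohen_kappa([collapse(x) for x in annotator_a],
                           [collapse(x) for x in annotator_b])

print(f"five labels:  kappa = {kappa_fine:.2f}")
print(f"three labels: kappa = {kappa_coarse:.2f}")
```

In this toy data the annotators only disagree on the “very” distinction, so agreement is much lower on the five-level scheme than on the three-level one, which is the kind of label noise described above.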
To reach final impact, the predictions of the ML model need to be usable. Depending on the use case, using the predictions might require specific infrastructure (e.g. an ML platform) and experts (e.g. ML engineers).
In our example, as we want to use our model for analytical purposes, we could run it offline, and exploiting the predictions would be quite simple. However, if we wanted to automatically respond to a negative comment within 5 minutes of its publication, this would be another story: the model would need to be deployed and integrated with other systems to make this possible. Overall, it is important to have a clear idea of what the requirements to use the predictions will be, to ensure they are feasible with the team and tools available.
ML models will always have some level of error in their predictions. In fact, a classic saying in ML is:
“If the model has no errors, then there is definitely something wrong with the data or the model.”
This is important to understand: if the use case cannot tolerate these errors, then it might not be a good idea to use ML. In our example, imagine that instead of comments and sentiment, we were using the model to classify customer emails into “pressing charges” or “not pressing charges”. It wouldn’t be a good idea to rely on a model that can misclassify an email announcing legal action against the company, given the terrible consequences a missed email might have.
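To see why overall accuracy can hide this kind of risk, it helps to look specifically at how many critical cases the model misses. The following is a minimal sketch with made-up labels and predictions; the `critical_class_report` helper and the `"charges"` / `"other"` class names are hypothetical.

```python
def critical_class_report(y_true, y_pred, critical):
    """Return overall accuracy, recall on the critical class, and missed critical cases."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    critical_total = sum(t == critical for t in y_true)
    critical_caught = sum(t == critical and p == critical
                          for t, p in zip(y_true, y_pred))
    recall = critical_caught / critical_total if critical_total else float("nan")
    return {
        "accuracy": correct / len(y_true),
        "critical_recall": recall,
        "critical_missed": critical_total - critical_caught,
    }


# Hypothetical ground truth and model predictions for eight emails.
y_true = ["charges", "other", "other", "charges", "other", "charges", "other", "other"]
y_pred = ["charges", "other", "other", "other", "other", "charges", "charges", "other"]

report = critical_class_report(y_true, y_pred, critical="charges")
print(report)
```

Here the model is right on 75% of the emails, yet it still misses one of the three emails that matter most, which is the number to discuss with stakeholders when the stakes are high.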
There have been many documented cases of predictive models that discriminated based on gender, race and other sensitive personal attributes. Because of this, ML teams need to be careful about the data and features they use in their projects, and should also question whether automating certain types of decisions actually makes sense from an ethical perspective. You can check my previous blog post on the topic for further details.
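One simple first check, among many possible ones, is to compare the model’s error rate across groups on a labeled evaluation set. The sketch below assumes you have a group attribute available for evaluation purposes; the `error_rate_by_group` helper, the segment names and the data are invented for illustration, and a gap between groups is a signal to investigate rather than a complete fairness audit.

```python
from collections import defaultdict


def error_rate_by_group(y_true, y_pred, groups):
    """Compute the share of wrong predictions separately for each group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        if truth != pred:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical evaluation data: true sentiment, predicted sentiment, customer segment.
y_true = ["negative", "positive", "negative", "positive",
          "negative", "positive", "negative", "positive"]
y_pred = ["negative", "positive", "positive", "positive",
          "positive", "positive", "positive", "positive"]
groups = ["segment_a"] * 4 + ["segment_b"] * 4

rates = error_rate_by_group(y_true, y_pred, groups)
print(rates)
```

In this toy example the model is wrong twice as often for `segment_b` as for `segment_a`, which would be worth understanding before using its predictions to drive decisions.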
ML models act somewhat like a black box: you input some information, and they magically output predictions. This opacity comes from the complexity of the models, especially when compared to simpler algorithms from statistics. In our example, we might be okay with not being able to understand exactly why a comment was predicted as “positive” or as “negative”.
In other use cases, explainability might be a must, for example in strongly regulated sectors like insurance or banking. A bank needs to be able to explain why it is granting (or not granting) a credit to a person, even if that decision is based on a predictive scoring model.
This topic is closely related to ethics: if we are not able to fully understand the model’s decisions, it is really hard to know whether the model has learned to be discriminatory or not.
With the progress in Generative AI, a variety of companies are offering web pages and APIs to consume powerful models. How is this changing the limitations and considerations about ML that I mentioned before?
- Data related topics (quality, quantity and labels): for use cases that can leverage existing GenAI models, this is definitely changing. Huge volumes of data have already been used to train GenAI models. The quality of that data hasn’t been controlled in most of these models, but this seems to be compensated by the sheer volume they use. Thanks to these models, it might be the case (again, for very specific use cases) that we need little or no training data of our own. This is known as zero-shot learning (e.g. “ask ChatGPT what the sentiment of a given comment is”) and few-shot learning (e.g. “provide some examples of positive, neutral and negative comments to ChatGPT, then ask it to provide the sentiment for a new comment”). A good explanation of this can be found in the deeplearning.ai newsletter.
- Deployment feasibility: for the use cases that can leverage existing GenAI models, deployment becomes much easier, as many companies and tools offer easy-to-use APIs to those powerful models. If those models need to be fine-tuned or brought in-house for privacy reasons, then deployment will of course get much harder.
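To make the difference between zero-shot and few-shot prompting concrete, here is a minimal sketch that only builds the prompt text for the sentiment example. The function names, the wording of the prompts and the example comments are assumptions made for illustration; sending the prompt to a model is intentionally left out, since it depends on the provider and API you choose.

```python
LABELS = ("positive", "neutral", "negative")
LABEL_LIST = ", ".join(LABELS)


def zero_shot_prompt(comment):
    """Ask for a sentiment label without providing any labeled examples."""
    return (
        f"Classify the sentiment of the following comment as one of: {LABEL_LIST}.\n"
        f"Comment: {comment}\n"
        "Sentiment:"
    )


def few_shot_prompt(comment, examples):
    """Ask for a sentiment label after showing a handful of labeled examples."""
    blocks = [f"Classify the sentiment of each comment as one of: {LABEL_LIST}."]
    for text, label in examples:
        blocks.append(f"Comment: {text}\nSentiment: {label}")
    # The new comment goes last, with the label left for the model to complete.
    blocks.append(f"Comment: {comment}\nSentiment:")
    return "\n\n".join(blocks)


# Hypothetical labeled examples, one per class.
examples = [
    ("I love the new design", "positive"),
    ("It does what it says", "neutral"),
    ("Support never answered my ticket", "negative"),
]

zero_prompt = zero_shot_prompt("The app keeps crashing")
few_prompt = few_shot_prompt("The app keeps crashing", examples)

print(zero_prompt)
print("-----")
print(few_prompt)
```

The few-shot version still needs a few labeled comments, but far fewer than the training set a traditional supervised model would require.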
Other limitations or considerations are not changing, regardless of leveraging GenAI or not:
- High stakes: this will keep being a problem, as GenAI models have a level of error in their predictions too. Who hasn’t seen ChatGPT hallucinating or providing answers that don’t make sense? What is worse, these models are harder to evaluate, as their responses always sound confident regardless of how accurate they are, and evaluation becomes subjective (e.g. “does this response make sense to me?”).
- Ethics: still as important as before. There is evidence that GenAI models can be biased because of the data they were trained on (link). As more companies and functionalities start using these types of models, it is important to be clear about the risks this might bring.
- Explainability: as GenAI models are bigger and more complex than “traditional” ML models, explaining their predictions gets even harder. There is ongoing research into how this explainability could be achieved, but it is still very immature (link).
In this blog post we saw the main things to consider when deciding whether or not to use ML, and how that is changing with the progress of Generative AI models. The main topics discussed were quality and volume of data, obtaining labels, deployment, stakes, ethics and explainability. I hope this summary is useful when considering your next ML (or not) initiative!
Beyond Test Sets: how prompting is changing machine learning development, by deeplearning.ai.
Large Language models are biased, can logic help save them?, by MIT News.
OpenAI’s attempts to explain language models’ behaviors, by TechCrunch.