Missing Data is an interesting data imperfection since it may arise naturally due to the nature of the domain, or be inadvertently created during data collection, transmission, or processing.
In essence, missing data is characterized by absent values in some records or observations of the dataset, and can be either univariate (one feature has missing values) or multivariate (several features have missing values):
Let’s consider an example: say we are conducting a study on a patient cohort regarding diabetes.
Medical data is a great example for this, because it is often highly subject to missing values: patient values are taken from both surveys and laboratory results, can be measured several times throughout the course of diagnosis or treatment, are stored in different formats (sometimes distributed across institutions), and are often handled by different people. It can (and most certainly will) get messy!
In our diabetes study, the presence of missing values might be related to the study being conducted or the data being collected.
For instance, missing data may arise due to a faulty sensor that shuts down for high values of blood pressure. Another possibility is that missing values in the feature “weight” are more likely for older women, who are less inclined to reveal this information. Or obese patients may be less likely to share their weight.
On the other hand, data can also be missing for reasons that are in no way related to the study.
A patient may have some of their information missing because a flat tire caused them to miss a doctor’s appointment. Data may also be missing due to human error: for instance, if the person conducting the analysis misplaces or misreads some documents.
Regardless of the reason why data is missing, it is important to investigate whether the datasets contain missing data prior to model building, as this problem may have severe consequences for classifiers:
- Some classifiers cannot handle missing values internally: This makes them inapplicable when handling datasets with missing data. In some scenarios, these values are encoded with a pre-defined value, e.g., “0” so that machine learning algorithms are able to cope with them, although this is not the best practice, especially for higher percentages of missing data (or more complex missing mechanisms);
- Predictions based on missing data can be biased and unreliable: Although some classifiers can handle missing data internally, their predictions might be compromised, since an important piece of information might be missing from the training data.
Moreover, although missing values may “all look the same”, the truth is that their underlying mechanisms (the reason why they are missing) can follow 3 main patterns: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).
Keeping these different types of missing mechanisms in mind is important because they determine the choice for appropriate methods to handle missing data efficiently and the validity of the inferences derived from them.
Let’s go over each mechanism real quick!
Missing Data Mechanisms
If you’re a mathy person, I’d suggest a pass through this paper (cough cough), namely Sections II and III, which contain all the notation and mathematical formulation you might be looking for (I was actually inspired by this book, which is also a very interesting primer; check Sections 2.2.3 and 2.2.4).
If you’re also a visual learner like me, you’d like to “see” it, right?
To that end, we’ll take a look at the adolescent tobacco study example used in the paper. We’ll consider dummy data to showcase each missing mechanism:
One thing to keep in mind is this: the missing mechanisms describe whether and how the missingness pattern can be explained by the observed data and/or the missing data. It’s tricky, I know. But it will get clearer with the example!
In our tobacco study, we are focusing on adolescent tobacco use. There are 20 observations, relative to 20 participants. The feature Age is completely observed, whereas Number of Cigarettes (smoked per day) will be missing according to different mechanisms.
Missing Completely At Random (MCAR): No harm, no foul!
In the Missing Completely At Random (MCAR) mechanism, the missingness process is completely unrelated to both the observed and missing data. This means that the probability of a feature having missing values is completely random.
In our example, I simply removed some values randomly. Note how the missing values are not located in a particular range of Number of Cigarettes values. This mechanism can therefore occur due to unexpected events happening during the study: say, the person responsible for registering the participants’ responses accidentally skipped a question of the survey.
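To make this concrete, here’s a minimal sketch of how MCAR missingness could be simulated (the participant values below are made up, not the actual study data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Dummy tobacco-study data: Age is fully observed and
# Number of Cigarettes starts out complete
df = pd.DataFrame({
    "Age": rng.integers(15, 20, size=20),
    "Cigarettes": rng.integers(0, 30, size=20).astype(float),
})

# MCAR: every value has the same fixed probability of going missing,
# regardless of Age or of the cigarette count itself
mcar_mask = rng.random(len(df)) < 0.25
df.loc[mcar_mask, "Cigarettes"] = np.nan

print(df.head())
```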
Missing At Random (MAR): Look for the tell-tale signs!
The name is actually misleading, since Missing At Random (MAR) occurs when the missingness process can be linked to the observed information in the data (though not to the missing information itself).
Consider the next example, where I removed the values of Number of Cigarettes for younger participants only (between 15 and 16 years). Note that, despite the missingness process being clearly related to the observed values in Age, it is completely unrelated to the number of cigarettes smoked by these teens, had it been reported (note the “Complete” column, where both low and high numbers of cigarettes would be found among the missing values, had they been observed).
This would be the case if younger kids were less inclined to reveal the number of cigarettes they smoke per day, avoiding admitting that they are regular smokers (regardless of the amount they smoke).
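Again as a sketch with made-up values, MAR can be simulated by making the missingness mask depend only on the observed Age column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Dummy tobacco-study data (made-up values)
df = pd.DataFrame({
    "Age": rng.integers(15, 20, size=20),
    "Cigarettes": rng.integers(0, 30, size=20).astype(float),
})

# MAR: missingness depends only on the *observed* Age column --
# participants aged 15-16 do not report their cigarette count
mar_mask = df["Age"].between(15, 16)
df.loc[mar_mask, "Cigarettes"] = np.nan

print(df.sort_values("Age"))
```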
Missing Not At Random (MNAR): That ah-ha moment!
As expected, the Missing Not At Random (MNAR) mechanism is the trickiest of them all, since the missingness process may depend on both the observed and missing information in the data. This means that the probability of missing values occurring in a feature may be related to the observed values of other features in the data, as well as to the missing values of that feature itself!
Take a look at the next example: values are missing for higher amounts of Number of Cigarettes, which means that the probability of missing values in Number of Cigarettes is related to the missing values themselves, had they been observed (note the “Complete” column).
This would be the case for teens who refused to report the number of cigarettes they smoke per day because they smoke a very large quantity.
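And a final sketch (made-up values once more), where the missingness depends on the unobserved value itself; the ground truth is kept in a “Complete” column, as in the illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Dummy tobacco-study data; "Complete" holds the ground-truth counts
df = pd.DataFrame({
    "Age": rng.integers(15, 20, size=20),
    "Complete": rng.integers(0, 30, size=20).astype(float),
})

# MNAR: the probability of a value being missing depends on the
# (unobserved) value itself -- heavy smokers don't report their count
df["Cigarettes"] = df["Complete"].where(df["Complete"] <= 20, np.nan)

print(df)
```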
Throughout our simple example, we’ve seen how MCAR is the simplest of the missing mechanisms. In such a scenario, we may ignore many of the complexities that arise due to the appearance of missing values, and simple fixes such as listwise or casewise deletion, as well as simpler statistical imputation techniques, may do the trick.
However, although convenient, the truth is that in real-world domains, MCAR is often unrealistic, and most researchers usually assume at least MAR in their studies, which is more general and realistic than MCAR. In this scenario, we may consider more robust strategies that can infer the missing information from the observed data. In this regard, data imputation strategies based on machine learning are generally the most popular.
Finally, MNAR is by far the most complex case, since it is very difficult to infer the causes for the missingness. Current approaches focus on mapping the causes for the missing values using correction factors defined by domain experts, inferring missing data from distributed systems, extending state-of-the-art models (e.g., generative models) to incorporate multiple imputation, or performing sensitivity analysis to determine how results change under different circumstances.
Also, on the topic of identifiability, the problem doesn’t get any easier.
Although there are some tests to distinguish MCAR from MAR, they are not widely popular and have restrictive assumptions that do not hold for complex, real-world datasets. It is also not possible to distinguish MNAR from MAR since the information that would be needed is missing.
To diagnose and distinguish missing mechanisms in practice, we may focus on hypothesis testing, sensitivity analysis, getting insights from domain experts, and investigating visualization techniques that can provide some understanding of the domain.
Naturally, there are other complexities to account for that condition the application of missing-data treatment strategies, namely the percentage of data that is missing, the number of features it affects, and the end goal of the technique (e.g., feeding a model for classification or regression, or reconstructing the original values as authentically as possible).
All in all, not an easy job.
Let’s take this bit by bit. We’ve just learned an overload of information on missing data and its complex entanglements.
In this example, we’ll cover the basics of how to mark and visualize missing data in a real-world dataset, and confirm the problems that missing data introduces to data science projects.
For that purpose, we’ll use the Pima Indians Diabetes dataset, available on Kaggle (License — CC0: Public Domain). If you’d like to follow along the tutorial, feel free to download the notebook from the Data-Centric AI Community GitHub repository.
To make a quick profiling of your data, we’ll also use ydata-profiling, which gets us a full overview of our dataset in just a few lines of code. Let’s start by installing it:
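Assuming a standard pip environment, the install is a one-liner:

```shell
pip install ydata-profiling
```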
Now, we can load the data and make a quick profile:
Looking at the data, we can determine that this dataset is composed of 768 records/rows/observations (768 patients) and 9 attributes or features. In fact, Outcome is the target class (1/0), so we have 8 predictors, all numerical (Outcome itself is the one categorical feature).
At a first glance, the dataset does not seem to have missing data. However, this dataset is known to be affected by missing data! How can we confirm that?
Looking at the “Alerts” section, we can see several “Zeros” alerts indicating that there are several features for which zero values make no sense or are biologically impossible: e.g., a zero value for body mass index or blood pressure is invalid!
Skimming through all features, we can determine that Pregnancies seems fine (having zero pregnancies is reasonable), but for the remaining features, zero values are suspicious:
In most real-world datasets, missing data is encoded by sentinel values:
- Out-of-range entries;
- Negative numbers where the feature can only take positive values;
- Zero values in a feature that could never be 0.
In our case, Glucose, BloodPressure, SkinThickness, Insulin, and BMI all have missing data. Let’s count the number of zeros that these features have:
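The counting itself is a one-liner; here is a sketch using a few toy rows with the Kaggle column names (in the tutorial, `df` is the full Pima dataframe):

```python
import pandas as pd

# Toy rows with the Kaggle column names, just to show the operation
df = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "SkinThickness": [35, 0, 0],
    "Insulin": [0, 0, 230],
    "BMI": [33.6, 26.6, 0.0],
})

# Columns where a zero is physiologically impossible
suspect = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zero entries per suspect column
zero_counts = (df[suspect] == 0).sum()
print(zero_counts)
```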
We can see that Glucose, BloodPressure, and BMI have just a few zero values, whereas SkinThickness and Insulin have a lot more, with Insulin’s zeros covering nearly half of the existing observations. This means we might consider different strategies to handle these features: some might require more complex imputation techniques than others, for instance.
To make our dataset consistent with data-specific conventions, we should mark these missing values as NaN. This is the standard way to treat missing data in Python and the convention followed by popular packages like scikit-learn. These values are ignored in certain computations like count, and are recognized by functions that perform other operations (e.g., dropping the missing values, imputing them, replacing them with a fixed value, etc.).
We’ll mark our missing values using the replace() function, and then call isna() to verify that they were correctly encoded:
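Sketched on the same toy rows as before (in the tutorial, `df` is the full Pima dataframe):

```python
import numpy as np
import pandas as pd

# Toy rows with the Kaggle column names
df = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "SkinThickness": [35, 0, 0],
    "Insulin": [0, 0, 230],
    "BMI": [33.6, 26.6, 0.0],
})

suspect = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Remember how many zeros each suspect column had...
zero_counts = (df[suspect] == 0).sum()

# ...replace those sentinel zeros with NaN (columns where zeros are
# legitimate, like Pregnancies, are left untouched)...
df[suspect] = df[suspect].replace(0, np.nan)

# ...and check that the NaN counts match the old zero counts
print(df[suspect].isna().sum())
```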
The count of NaN values is the same as the count of 0 values, which means that we have marked our missing values correctly! We could then use the profile report again to check that the missing data is now recognized. Here’s how our “new” data looks:
We can further check for some characteristics of the missingness process, skimming through the “Missing Values” section of the report:
Besides the “Count” plot, which gives us an overview of all missing values per feature, we can explore the “Matrix” and “Heatmap” plots in more detail to hypothesize on the underlying missing mechanisms the data may suffer from. In particular, the correlation between missing features might be informative. In this case, there seems to be a significant correlation between Insulin and SkinThickness: both values seem to be simultaneously missing for some patients. Whether this is a coincidence (unlikely), or whether the missingness process can be explained by known factors (portraying MAR or MNAR mechanisms), would be something for us to dig into!
Regardless, now we have our data ready for analysis! Unfortunately, the process of handling missing data is far from over. Many classic machine learning algorithms cannot handle missing data, and we need to find expert ways to mitigate the issue. Let’s try to evaluate the Linear Discriminant Analysis (LDA) algorithm on this dataset:
If you try to run this code, it will immediately throw an error:
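Here is a minimal sketch of what happens, using a tiny matrix with a single NaN in place of the full Pima predictors (the error is caught here just to display the message):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# A tiny feature matrix with one NaN, standing in for the Pima
# predictors after the zeros were marked as missing
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 1.0]])
y = np.array([0, 1, 0, 1])

# scikit-learn estimators validate their input: fitting on data
# that contains NaN raises a ValueError
try:
    LinearDiscriminantAnalysis().fit(X, y)
except ValueError as err:
    print(f"LDA refused the data: {err}")
```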
The simplest way to fix this (and the most naive!) would be to remove all records that contain missing values. We can do this by creating a new data frame with the rows containing missing values removed, using the dropna() function…
… and trying again:
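Sketched on a toy stand-in for the marked dataset (two predictors plus the target, with the real data the same calls apply to the full dataframe):

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy stand-in for the marked data: NaNs where zeros used to be
df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0, 137.0, 116.0],
    "BMI": [33.6, 26.6, np.nan, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Drop every row that contains at least one missing value
clean = df.dropna()

X = clean.drop(columns="Outcome").to_numpy()
y = clean["Outcome"].to_numpy()

# With the NaNs gone, LDA fits without complaint
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict(X))
```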
And there you have it! By dropping the missing values, the LDA algorithm can now operate normally.
However, the dataset size was substantially reduced to 392 observations only, which means we are losing nearly half of the available information.
For that reason, instead of simply dropping observations, we should look for imputation strategies, either statistical or machine-learning based. We could also use synthetic data to replace the missing values, depending on our final application.
And for that, we might try to get some insight on the underlying missing mechanisms in the data. Something to look forward to in future articles?