Although there are many resources to learn technical skills, there are few that go in-depth on how to clean data — an essential skill needed for data analysts. You may think you can apply the same set of rules to clean data but that’s not always the case. Today I’d like to share what I’ve learned over the years as a data analyst on how to approach cleaning different types of data for data analysis and reporting.
By numeric values, I mean those useful for data analysis and reporting. A good rule of thumb is if the average would be useful. For example, the average of a numeric order number field is meaningless. However, the average revenue amount is useful.
Numbers Stored as Numeric Fields
For numbers saved in numeric fields apply the following cleaning rules:
- Calculate the minimum, maximum, median, 99th percentile, and average. If the minimum is negative but values should be zero or higher, replace them with zero if applicable. In the sample sales data below, notice the large difference between the median of $800 in row 13 and the average of $20,560 in row 12. If the median and average or the maximum and the 99th percentile have big differences, I typically check for outliers especially if I’m unfamiliar with the data. If you were to report average sales including the $100,000 in row 4, it would be $20,560 instead of $560 if you excluded the $100,000. This is why it’s good to check for outliers and exclude them especially if you plan to report averages or use the data to build machine learning models as outliers can affect model results. Prioritize your time accordingly on the most important fields because you may have dozens to check and it would be too time-consuming to review all of them.
2. Count missing and non-missing values. If the number of missing outnumber the non-missing, then the numeric field may not be usable in your analysis. For example, if you have 1,000 records but 900 of them are missing then the 100 non-missing values probably won’t be useful in your analysis. If the count of missing is less than the non-missing, then replace missing with zero when applicable. If missing means a value…