4 Important Statistical Ideas You Should Understand in a Data-Driven World | by Murtaza Ali | Jul, 2023


You don’t have to be an expert in statistics to navigate the modern world, but here are some basic ideas you should understand.

There is no use avoiding reality. Data science, and more broadly, data-driven structures, are at the center of the society we are currently building.

When the computer science craze first hit in the early 2000s, many noted that computer science would become an integral part of every field. This proved to be true. Companies across industries — healthcare, engineering, finance, etc. — began to hire software engineers for various forms of work. Students of these fields began to learn how to code.

I would argue the new data science surge takes this a step further. With computer science, one could get away with just hiring software engineers. A business manager or a sales expert did not necessarily need to understand what these folks did.

But data science is broader and more encompassing. Since it is a mix of fields [1], its ideas are relevant even for those who may not be day-to-day data scientists.

In this article, I’ll give a high-level overview of four important statistical ideas that everyone should understand, regardless of official job title. Whether you’re a project manager, recruiter, or even a CEO, some level of familiarity with these concepts is sure to help you in your work. Furthermore, outside of a work context, familiarity with these concepts will give you a sense of data literacy that is indispensable for navigating modern society.

Let’s get into it.

Just a big, bad sample

Back as an undergraduate, the first data science course I took consisted of an immense number of students — nearly 2000. The course, Foundations of Data Science, was one of the most popular on campus, as it was designed to be accessible to students across departments. Rather than immediately getting into advanced mathematics and programming, it focused on high-level ideas which could impact students across fields.

During one of our early lectures, the professor made a statement that has stuck with me through the years, coming back whenever I work on anything even remotely data related. She was discussing random sampling, a broad term which has to do with choosing a subset of a study population in a way that represents the entire population. The idea is that studying the subset should enable one to draw conclusions about the entire population.

She pointed out that having a good sample was of the utmost importance, since no amount of mathematical finagling or fancy technique can make up for a subset that isn’t actually representative of the population one wishes to study. In making this point, she mentioned that many people assume that if a starting sample is bad, a reasonable solution is to stick with the same approach but collect a larger sample.

“Then, you’ll just have a really big, really bad sample,” she said to the giant lecture hall full of college students.

Understanding this foundational point — and its broader implications — will enable you to make sense of many sociopolitical phenomena that folks take for granted. Why are presidential polls often inaccurate? What makes a seemingly powerful machine learning model fail in the real world? Why do some companies make products that never see the light of day?

Often, the answer lies in the sample.
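
To make the professor's point concrete, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the two groups, their sizes, their support rates, and the sample sizes are my own assumptions, not data from any real poll. It compares a small random sample of the whole population with a much larger sample that only ever reaches one group.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population (all numbers are illustrative assumptions):
# group A is about 40% of people, and about 70% of them support a proposal;
# group B is about 60% of people, and about 30% of them support it.
population = []
for _ in range(100_000):
    group = "A" if random.random() < 0.4 else "B"
    p_support = 0.7 if group == "A" else 0.3
    population.append((group, random.random() < p_support))


def support_rate(people):
    """Fraction of the given people who support the proposal."""
    return sum(supports for _, supports in people) / len(people)


true_rate = support_rate(population)

# A small sample drawn at random from the entire population.
small_random = random.sample(population, 500)
small_estimate = support_rate(small_random)

# A big, bad sample: 40 times larger, but it only reaches group A.
group_a = [person for person in population if person[0] == "A"]
big_biased = random.sample(group_a, 20_000)
big_estimate = support_rate(big_biased)

print(f"True support rate:          {true_rate:.1%}")
print(f"Small random sample (500):  {small_estimate:.1%}")
print(f"Big biased sample (20,000): {big_estimate:.1%}")
```

By construction, the biased estimate settles near group A's support rate rather than the population's, and collecting more responses the same way only makes it more confidently wrong. The small random sample is noisier, but it is aimed at the right target.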

“Error” does not mean “mistake”

This topic is implicit in most courses involving data or statistics, but my discussion here is inspired by Alberto Cairo’s emphasis on this point in his excellent book, How Charts Lie.

The premise of Cairo’s book is to outline the various ways in which data visualizations can be used to deceive people, both unintentionally and maliciously. In one chapter, Cairo expounds upon the challenges of visualizing uncertainty in data, and how this in itself can lead to misleading data visualizations.

He opens with some discussion on the idea of error in statistics. He makes note of a crucial point: While in standard English, the term “error” is synonymous with “mistake,” this is not the case at all within the realm of statistics.

The concept of statistical error has to do with uncertainty. There will almost always be some form of error in measurements and models. This is related to the earlier point about samples. Because you don’t have every data point for the population you wish to describe, you will by definition face uncertainty. This is further accentuated if you are making predictions about future data points, since they do not exist yet.

Minimizing and addressing uncertainty is an essential part of statistics and data science, but it is far beyond the scope of this article. Here, the primary point you should internalize is that just because a statistical finding is given to you with a measure of uncertainty does not mean it is mistaken. In fact, this is likely an indicator that whoever produced the findings knew what they were doing (you should be skeptical of statistical claims made without any reference to the level of uncertainty).

Learn the right way to interpret uncertainty in statistical claims [2], rather than writing them off as incorrect. It’s an essential distinction.
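
As one small illustration of what a stated uncertainty looks like, here is a sketch of the margin of error you often see attached to poll results. The poll itself is hypothetical (520 of 1000 respondents in favor), and I am assuming a simple random sample and the common normal approximation for a proportion, which is only one of several ways to build such an interval.

```python
import math


def proportion_interval(successes, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion.

    z=1.96 corresponds to a roughly 95% confidence level. This assumes a
    simple random sample; it says nothing about bias in how people were chosen.
    """
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin


# Hypothetical poll: 520 of 1000 respondents say they support a measure.
p_hat, margin = proportion_interval(520, 1000)

print(f"Estimate: {p_hat:.1%} +/- {margin:.1%}")
print(f"Plausible range: {p_hat - margin:.1%} to {p_hat + margin:.1%}")
```

The margin comes out to roughly three percentage points, so the plausible range includes 50%. That does not make the poll a mistake. It means the honest reading is "about 52%, give or take," rather than "a clear majority."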

You can’t always just “make a model for it”

Among the general population, there seems to be this idea that artificial intelligence is some kind of magical tool that can accomplish anything. With the advent of self-driving cars and realistic virtual assistants but no similar acceleration in general data literacy, it is unsurprising that this mindset has developed.

Unfortunately, it couldn’t be further from the truth. AI is not magic. It is heavily dependent on good data, and its results can actually be quite misleading if the underlying data is of poor quality.

I once had a colleague who was assigned to a project in which her task was to build a machine learning model for a specific goal. It was meant to classify future events into certain categories based on historical data.

There was just one problem: She didn’t have any data. Others on the project (who, notably, were not familiar with data science) kept insisting that she should just make the model even though she didn’t have the data, because machine learning is super powerful and this should be doable. They didn’t grasp that their request simply wasn’t feasible.

Yes, machine learning is powerful, and yes, we keep finding more impressive things to do with it. However, as things stand, it’s not a magic solution for everything, and it cannot learn from data that does not exist. You would do well to remember that.

The Numbers Do Lie

People throw around the phrase “numbers don’t lie” like it’s confetti.

Oh, if only they knew. Numbers do in fact lie. A lot. In some settings, even more often than they tell the truth. But they do not lie because they are actually wrong in raw form; they lie because the average person does not know how to interpret them.

There are countless examples of how numbers can be twisted, manipulated, changed, and transformed in order to support the argument one wants to make. To drive the point home, here I’ll cover one example of how this can be done: failing to take into account underlying population distributions when making blanket statements.

That’s a bit vague on its own, so let’s take a look at an example. Consider the following scenario, often posed to medical students:

Suppose a certain disease affects 1 out of every 1000 people in a population. There is a test to check if a person has this disease. The test does not produce false negatives (that is, anyone who has the disease will test positive), but the false positive rate is 5% (there is a 5% chance that a person will test positive even if they do not have the disease). Suppose a randomly selected person from the population takes the test and tests positive. What is the likelihood that they actually have the disease?

At a glance, a reasonable answer, given by many folks, is 95%. Some might even go so far as to suspect that it isn’t quite mathematically accurate to just use the false positive rate to make this determination, but they’d probably still guess that the answer is somewhere close.

Unfortunately, the correct answer is not 95%, or anywhere near it. The actual probability that this randomly selected person has the disease is approximately 2%.

Most people are so far off from the correct answer because, while they pay attention to the low false positive rate, they fail to take into account the underlying prevalence of the disease within the population: Only 1/1000 (or 0.1%) of people in the population actually have this disease. As a result, that false positive rate of 5% applies to almost everyone who gets tested. Out of every 1000 people, about 50 healthy people will test positive, compared with the 1 person who actually has the disease. In other words, there are many, many opportunities to be a false positive.
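
If you would like to check the figure yourself without the formal math, here is a short sketch that just does the counting with the numbers from the scenario above. The variable names are my own, and I use exact fractions so that no rounding sneaks in.

```python
from fractions import Fraction

# Numbers taken from the scenario above.
prevalence = Fraction(1, 1000)          # 1 in 1000 people has the disease
sensitivity = Fraction(1)               # no false negatives
false_positive_rate = Fraction(5, 100)  # 5% of healthy people test positive

# Share of the whole population that tests positive and is sick.
true_positives = prevalence * sensitivity

# Share of the whole population that tests positive and is healthy.
false_positives = (1 - prevalence) * false_positive_rate

# Among everyone who tests positive, the fraction who are actually sick.
p_disease_given_positive = true_positives / (true_positives + false_positives)

print(p_disease_given_positive)                  # 20/1019
print(f"{float(p_disease_given_positive):.2%}")  # 1.96%
```

Put in terms of people: in a group of 100,000, about 100 have the disease and all of them test positive, while 4,995 of the 99,900 healthy people also test positive. Only 100 of the 5,095 positive results belong to someone who is actually sick.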

The formal math behind this is beyond the scope of this particular article, but you can see a detailed explanation here if you’re interested [3]. That said, you don’t really need to dive into the math to grasp the main point: One could imagine using the scenario above to scare a person into believing that they are much more at risk for a disease than they really are. Numbers alone can often be misrepresented and/or misinterpreted to promote false beliefs.

Be vigilant.

Final Thoughts and Recap

Here’s a little cheat sheet of important takeaways from this article:

  1. A big sample ≠ A good sample. It takes more than quantity to ensure accurate representation of a population.
  2. In statistics, “error” does not mean “mistake.” It has to do with uncertainty, which is an unavoidable element of statistical work.
  3. Machine learning and artificial intelligence aren’t magic. They rely heavily on the quality of the underlying data.
  4. Numbers can be misleading. When someone makes a statistical claim, especially in a non-academic (read: in the news) context, review it carefully before accepting the conclusions.

You don’t have to be an expert in statistics to navigate this data-driven world, but it would do you well to understand some foundational ideas and know what pitfalls to avoid. It is my hope that this article helped you take that first step.

Until next time.

References

[1] https://towardsdatascience.com/the-three-building-blocks-of-data-science-2923dc8c2d78
[2] https://bookdown.org/jgscott/DSGI/statistical-uncertainty.html
[3] https://courses.lumenlearning.com/waymakermath4libarts/chapter/bayes-theorem/


