What is synthetic data? | by Cassie Kozyrkov | Jun, 2023


A field guide to the various species of fake data: Part 1

Synthetic data is, to put it bluntly, fake data. As in, data that’s not actually from the population you’re interested in. (Population is a technical term in data science, which I explain here.) It’s data that you’re planning to treat as if it came from the place/group you wish it came from. (It didn’t.)


Artificial data, synthetic data, fake data, and simulated data are all synonyms with slightly different heydays as the term du jour, so they carry poetic connotations from different eras. These days, the cool kids prefer the synthetic data buzzword, perhaps because investors need to be convinced that something new has been invented, rather than rediscovered. And there is something slightly new in play here, but (in my opinion) not new enough for all the old ideas to be irrelevant.

Let’s dive in!

All image rights belong to the author.

(Note: the links in this post take you to explainers by the same author.)

If you’ve suffered through a graduate course on advanced probability and measure theory like I have (my therapist and I are still working through it over a decade later), you’ll be superfluously aware that there are infinitely many real numbers. Among other things, infinite means that if you try to enumerate them all, I can swoop in like a jerk and find you a new one, for example by adding 1 to your largest number, taking the average of your two closest numbers, or popping a digit on the back of the number with the longest series of digits after the decimal point.

This also means that if you give me the list of all the numbers ever recorded by humans over the history of humankind, I can still make a brand new one. Boom! The power.
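For the code-inclined, here is a minimal sketch of two of those tricks (adding 1 to the largest number and averaging the two closest ones). The function name `new_number` and the three recorded values are made up for illustration. It uses Python’s exact `Fraction` type rather than floats, because floats have finite precision and would eventually run out of room.

```python
from fractions import Fraction


def new_number(recorded):
    """Return a number that is guaranteed not to be in `recorded`."""
    values = sorted(set(recorded))
    if not values:
        return Fraction(0)      # nothing recorded yet, so anything is new
    if len(values) == 1:
        return values[0] + 1    # trick 1: go one past the largest number
    # Trick 2: average the two closest neighbours. The midpoint lies strictly
    # between two adjacent recorded values, so it cannot already be in the list.
    low, high = min(zip(values, values[1:]), key=lambda pair: pair[1] - pair[0])
    return (low + high) / 2


# Hypothetical "everything ever recorded" list, kept tiny on purpose.
recorded = [Fraction(173), Fraction("173.5"), Fraction(174)]
fresh = new_number(recorded)
print(fresh, fresh in recorded)
```

Append `fresh` to the list and call the function again, and you get yet another new number. You can keep doing that for as long as your patience lasts.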

Where am I going with this, besides providing fodder for your next beery debate on whether there’s such a thing as true originality (ugh)?

Let’s say you have a dataset full of human heights. Between any two measurements (say 173cm and 174cm, the interval wherein you’ll find my height) there are infinite possibilities for a number you could write down. Just keep lengthening the decimal place beyond the reasonable ability of our measuring tools. Beyond subatomic particles. Beyond common sense. There are still plenty of numbers I could make up, like: 173.4335524095820398502639008342984598739874944444443842397593645873649572850263894458092843956389479592489586232342349832842849687394208287645545352525353353826482384724628732648732799999992323…

The rules governing the creation of this stupid number are thoroughly out there beyond the realm of what’s useful and practical, so when you ask me to give you a number that could represent a human height that you could add to your dataset, how might I approach your request?

Real world data

One option is to give you real data from a real human. I look around the room, spot my bff Heather (true story, she says hi), and measure her for your dataset. If your population of interest was all humans, her height would be a legit datapoint for your dataset if (and that’s a big if) I measured it according to the rules you laid out for how your population should be measured.

Noisy data

If I measure Heather’s height in laptops (I didn’t bring a tape measure to our weekend retreat, sorry) to the nearest 13 inches while you measured heights in millimeters using one of those meter rulers, we’ll have problems.

When we say noisy data, we mean there’s nondeterministic error in there that hides the true answer. And that’s exactly what’ll happen if I get it into my head to measure Heather in laptops. (Or Smoots.)
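Here is a rough sketch of just the resolution part of that problem. It assumes a laptop is exactly 13 inches long and that each of us simply rounds to our own unit. The function names and the example heights are hypothetical, and the rounding shown here is deterministic, so it leaves out the random wobble a real laptop-wielding measurer would add on top.

```python
CM_PER_INCH = 2.54
LAPTOP_CM = 13 * CM_PER_INCH  # one "laptop" of length, about 33 cm (assumed)


def measure_in_laptops(true_height_cm):
    """Round a height to the nearest whole laptop, then report it in cm."""
    laptops = round(true_height_cm / LAPTOP_CM)
    return laptops * LAPTOP_CM


def measure_in_millimeters(true_height_cm):
    """Round a height to the nearest millimeter (0.1 cm)."""
    return round(true_height_cm, 1)


# Hypothetical true heights, in cm.
for height in (170.0, 173.46, 180.0):
    print(height, measure_in_millimeters(height), measure_in_laptops(height))
```

Under these assumptions, people whose heights differ by 10 cm all come out as the same number of laptops, while the millimeter ruler still tells them apart.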

Any measurement you’ll get from me will have random error built in that’s of a different profile from what’s in the rest of your data. To deal with the can of worms we’re potentially opening up here, be sure to include a record of the source of the data. (Who collected it — you or me?) You can always nuke my entries later… as long as they’re not hiding among your legit contributions.
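A minimal sketch of what that record-keeping could look like, assuming each measurement is stored as a plain dictionary. The field names `height_cm` and `collector`, the helper `drop_collector`, and the values are all placeholders for illustration.

```python
def drop_collector(records, collector):
    """Return only the records that were NOT collected by `collector`."""
    return [r for r in records if r["collector"] != collector]


# Placeholder records: every row says who took the measurement.
records = [
    {"height_cm": 173.5, "collector": "you"},
    {"height_cm": 165.1, "collector": "me"},
    {"height_cm": 180.1, "collector": "you"},
]

# Nuking my entries later is a one-liner, because they were labeled up front.
trusted = drop_collector(records, "me")
print(trusted)
```

Without the `collector` field there would be no way to tell my entries from yours after the fact.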

When collecting data from the real world, it’s surprisingly easy to mess up. To learn more, check out my series on data design and data collection.

Handcrafted data

What if there was no one to measure but you wanted another datapoint anyway? (Why might you want to do this and what are the pros and cons? See my next blog post!)

Then you’re saying you’re okay with synthetic data. (If you allow synthetic data into your project, always keep a record of which datapoints are synthetic and how they were made!)

I could also give you a height datapoint by making up a number following no rules at all. If I’m especially perverse, I might even throw out a complex number like -5 + 60*sqrt(-1) just to mess with you. Did you say I couldn’t? You should. If you’re letting me make stuff up, you need to constrain my creativity.

No imaginary numbers? Okay, how about -100?

Oh, it has to be within the range of actual human heights? How about that 173.43355240… number from earlier?

Too many decimal places because human measuring instruments aren’t that sensitive? Fine, how about 173.5cm?

We might call this handcrafted data, since I, a human, came up with it by handcrafting an example that appeals to me.
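The back-and-forth above boils down to three rules: the value must be a real number, it must fall in a plausible range, and it can’t be more precise than the instrument. Here is one way those rules might be written down. The range limits are placeholder assumptions (not real anthropometric bounds), the 0.1 cm resolution is one possible choice, and `check_height` is a made-up name.

```python
from decimal import Decimal

MIN_HEIGHT_CM = Decimal("50")   # assumed lower bound, adjust for your population
MAX_HEIGHT_CM = Decimal("280")  # assumed upper bound, adjust for your population
RESOLUTION_CM = Decimal("0.1")  # assumed instrument resolution (1 mm)


def check_height(value):
    """Return a list of the rules that `value` breaks (empty means it passes)."""
    if isinstance(value, complex):
        return ["must be a real number"]
    height = Decimal(str(value))  # Decimal keeps the digits exactly as written
    problems = []
    if not MIN_HEIGHT_CM <= height <= MAX_HEIGHT_CM:
        problems.append("outside the allowed range")
    if height != height.quantize(RESOLUTION_CM):
        problems.append("more precision than the instrument supports")
    return problems


for candidate in (complex(-5, 60), "-100", "173.43355240", "173.5"):
    print(candidate, check_height(candidate))
```

Passing heights as strings keeps every digit the handcrafter typed, which is the point when one of the rules is about decimal places.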

But what if you wanted more than one new height for your dataset? And you tell me to be reasonable and round my choices to the nearest millimeter?

Well, I might come up with: 173.5cm, 182.4cm, 175.1cm, 190.2cm, 180.1cm

These are all plausible human measurements, but they’re on the tallish side. They likely don’t represent your population of interest very well. They’re biased by my ideas of what good entries into your dataset look like. And what do I know about human heights anyways? You can do better.
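A quick way to see the tilt is to summarize the handcrafted values and compare their average with whatever reference average you trust for your population. The sketch below uses the five numbers above. The helper names are hypothetical, and `reference_mean_cm` is deliberately left for you to supply, since this post doesn’t give one.

```python
from statistics import mean

# The five handcrafted heights from above, in cm.
handcrafted_cm = [173.5, 182.4, 175.1, 190.2, 180.1]


def summarize(values):
    """Basic summary of a list of heights in cm."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 2),
    }


def offset_from_reference(values, reference_mean_cm):
    """How far the sample mean sits above (+) or below (-) a reference mean."""
    return round(mean(values) - reference_mean_cm, 2)


print(summarize(handcrafted_cm))
```

Every one of the five values sits at or above 173.5cm, so the sample has no short people in it at all. A large positive result from `offset_from_reference` with a trustworthy reference would confirm the tallish bias in numbers.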

So let’s do better in Part 2, where we’ll go on a journey that covers:

  • duplicated data
  • resampled data
  • bootstrapped data
  • augmented data
  • oversampled data
  • edge case data
  • simulated data
  • univariate data
  • bivariate data
  • multivariate data
  • multimodal data

Or help yourself to one of my other data taxonomy guides.

If you had fun here and you’re looking for an entire applied AI course designed to be fun for beginners and experts alike, here’s the one I made for your amusement:

Enjoy the course on YouTube here.

P.S. Have you ever tried hitting the clap button here on Medium more than once to see what happens? ❤️
