Why Data Is *Not* the New Oil and Data Marketplaces Have Failed Us
by Clemens Mewald, July 2023


Distinguishing between 1st and 3rd Party Data

No one I know argues against the importance of data. But even though the narrative of “data is an asset” has become quite common, data is probably one of the most underutilized and, as a result, undervalued goods.

When most businesses think about data, they think about data they own. This 1st party data (1PD) is usually collected from websites, CRM/ERP systems, correspondence with customers, etc. Some 1st party datasets are more valuable than others: Google’s trove of search and click history, for example, is part of its 1PD corpus.


What should be obvious is that the amount of 3rd party data (3PD) in existence, which is data you don’t directly own, is several orders of magnitude larger than your 1PD. The argument I will make is that most people don’t realize the value of 3PD to their business. Let’s use an example to illustrate this point.

Detecting email spam (and why your 1PD alone may not be as valuable as you think)

What do you think is the most predictive signal in detecting email spam? The most common answers include typos, grammar, or the mention of specific keywords like v1agra. A slightly better answer is whether the sender is in your contacts. Not because it’s a particularly accurate signal (most legitimate email comes from senders who are not in your contacts), but because it considers a data source outside of the email itself: your contacts.

If only for the purpose of this anecdote, let’s say that the most important signal in detecting email spam is actually the age of the sender’s domain. Once stated, this seems intuitive: spammers frequently register new domains that quickly get blocked by email providers.

Why don’t most people think of this answer? Because the age of the sender’s domain is not part of your “1st party dataset”, which only contains things like the sender’s and recipient’s email addresses, the subject, and the email body. But anyone who knows something about domain names will tell you that this information is not only readily available but also free. Take the domain, look it up with a domain registrar, and you can find out when it was registered (e.g. gmail.com was registered on August 13, 1995).
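As a minimal sketch of what that lookup could look like in practice, here is a plain WHOIS query over TCP port 43 using only Python’s standard library. The registry server name and the “Creation Date” field are assumptions based on how the .com registry typically responds; a real spam filter would cache these lookups rather than query per email.

```python
import socket

def whois_lookup(domain: str, server: str = "whois.verisign-grs.com") -> str:
    """Send a raw WHOIS query (port 43) and return the server's text response."""
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall(f"{domain}\r\n".encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

# The registration date is the "age of the domain" signal described above.
response = whois_lookup("gmail.com")
print([line.strip() for line in response.splitlines() if "Creation Date" in line])
# e.g. ['Creation Date: 1995-08-13T04:00:00Z']
```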

As it turns out, the data you own (1PD) is probably much more valuable to you if it is augmented with data someone else owns (3PD).


From email spam to quant trading (and beyond?)

Extrapolating from the idea that you can detect email spam better simply by augmenting your dataset with the age of the sender’s domain, you can imagine countless ways to apply the same principle. Take something as simple as a street address (at least in the US) and consider how much external data could be joined onto it, as in the sketch below.
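Here is a minimal, hypothetical sketch of what that augmentation looks like mechanically: a 1PD customer table joined with an externally purchased table keyed on the same ZIP code. The column names and values are made up for illustration.

```python
import pandas as pd

# 1PD: your own customer records, keyed by ZIP code.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "zip_code": ["94105", "10001", "60601"],
    "last_order_usd": [54.20, 17.99, 88.00],
})

# 3PD (hypothetical): purchased data keyed on the same ZIP codes, e.g.
# median household income and average daily foot traffic.
external = pd.DataFrame({
    "zip_code": ["94105", "10001", "60601"],
    "median_household_income": [141_000, 98_000, 112_000],
    "avg_daily_foot_traffic": [5_200, 18_400, 9_700],
})

# Augment 1PD with 3PD via a simple join on the shared key.
augmented = customers.merge(external, on="zip_code", how="left")
print(augmented)
```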


Of course, this is not a new idea. Hedge funds have been using “alternative data” for decades. RenTech was one of the first firms to use alternative data such as satellite imagery, web scraping, and other creatively sourced datasets to gain an edge in trading. UBS used satellite imagery to monitor the parking lots of big retailers and correlate car traffic with quarterly revenue, allowing it to predict earnings more accurately before they were released.

You can probably guess where this is going. There are over 300k data providers in the US alone, and likely billions of datasets. Many of them could give you a competitive advantage in whatever you are trying to predict or analyze. The only limit is your creativity.

The (subjective) value of using external data

While the value of external data to quant trading firms is immediate and significant, executives in other industries have been slow to come to the same realization. A thought experiment helps: Consider some of the most important predictive tasks for your business. For Amazon, that could be which product a given customer is most likely to purchase next. For an oil exploration company, it could be where to discover the next oil reservoir. For a grocery chain, it might be the demand for specific products at any given point in time.

Next, imagine you had a magic dial that you could turn to improve the performance of that predictive task and the resulting value to your business. Grocery chains lose approximately 10% of their food to spoilage. If only they could predict demand better, they could improve their supply chain and reduce that spoilage. At about 20% gross margin (i.e. cost of goods at roughly 80% of revenue), every percentage point reduction in spoilage would improve their gross margin by about 0.8pp. So, for a company like Albertsons, with annual revenue on the order of $80B, every percentage point improvement in predicting demand could be worth an estimated $640M per year. Alternative data could help with that.
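For clarity, here is that back-of-the-envelope arithmetic spelled out; the revenue figure is an assumption, roughly in line with Albertsons’ reported annual revenue.

```python
# Back-of-the-envelope version of the spoilage math above.
annual_revenue = 80e9        # assumed ~$80B in annual revenue
cogs_share = 0.80            # 20% gross margin -> cost of goods is 80% of revenue
spoilage_reduction = 0.01    # one percentage point less spoilage

gross_margin_gain_pp = spoilage_reduction * cogs_share * 100
annual_value = annual_revenue * spoilage_reduction * cogs_share

print(f"{gross_margin_gain_pp:.1f}pp gross margin improvement")  # 0.8pp
print(f"${annual_value / 1e9:.2f}B per year")                    # ~$0.64B
```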

The same data that saves a grocery chain hundreds of millions of dollars may be worth even more to a commercial real estate developer. However, data marketplaces haven’t been able to extract that value (through price discrimination) because they are far away from the actual business application. They have to put a generic price on their inventory, independent of its eventual use.

Yet, external data has managed to become an estimated $5B market growing at 50% year-over-year, and the marketplaces that trade that data represent another $1B market. This represents only a small fraction of the potential market size for at least two reasons: (1) Although every single company should be able to benefit from 3PD, only the most analytically mature companies know how to leverage 3PD to their advantage. (2) Those who dare to try are slowed down by the antiquated process for discovering and purchasing 3PD. Let’s take a quick detour into the ad buying process to illustrate that point.

The evolution of the ad buying process

Not too long ago, in 2014, programmatic ad buying represented less than half of digital ad spend. How did people buy ads? They told an agency what kind of audience they wanted to reach. The agency then looked at the publishers they worked with and their “inventory” (magazine pages, billboards, TV ad slots, …), and put together a plan of where to run a campaign to meet those requirements. After some negotiation, the company and the agency eventually signed a contract. Ad creative would be developed, reviewed, and approved. Insertion orders would be submitted and eventually the ad campaign would run. A few months later the company would get a report on how the agency thought it went (based on a small sample of data).

Along came Google, which (among others) popularized what is known as programmatic ad buying. Google created its own ad exchange (AdX) that connected inventory from multiple publishers with multiple ad networks. As users performed searches or visited websites, it ran a real-time auction (yes, within the time it takes to load a webpage) that pitted all advertisers against each other and picked the highest bidder to display their ad, with the winner paying the second-highest bid.
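As an illustrative toy sketch (not Google’s actual implementation), here is the core of that second-price rule; the advertiser names and bid values are made up.

```python
from typing import Dict, Tuple

def second_price_auction(bids: Dict[str, float]) -> Tuple[str, float]:
    """Pick a winner under the sealed-bid second-price (Vickrey) rule.

    The highest bidder wins the impression but pays the second-highest bid.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    clearing_price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, clearing_price

# Hypothetical bids (in $ CPM) for a single ad impression.
winner, price = second_price_auction(
    {"advertiser_a": 4.10, "advertiser_b": 3.75, "advertiser_c": 2.20}
)
print(winner, price)  # advertiser_a wins and pays 3.75
```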

And just like that, ad buying went from a months-long ordeal with lots of humans involved and very little transparency, to a real-time transaction that both set prices (through the auction) AND gave instant measurement of impressions (and sometimes even conversions). This level of velocity, liquidity, and transparency led to an explosion in the online advertising market and programmatic ad buying now represents close to 90% of digital advertising budgets.

The antiquated data buying process

As it turns out, buying data today is even more painful than buying ads 20 years ago.


Discovery: First, you need to become aware that 3PD could be extremely valuable to you. Remember the email spam example? Next, you need the creativity to think of all of the possible 3PD that you could use to augment your 1PD. Would you have considered satellite images of parking lots to predict retailers’ revenues? Then you have to go to all of the data providers and search for what you think you need. You will find that most “data marketplaces” are basically just free-text search over descriptions. Finally, you have to look at the schema of the data to see if it contains what you are looking for, at the granularity that you need (e.g. sometimes you need foot traffic minute-by-minute as opposed to just hourly), and with the right coverage (e.g. for the right date range or geo region).
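Those granularity and coverage checks are mechanical once you have a sample extract in hand. Below is a hypothetical sketch (column names and values are invented) of the kind of sanity check you would run on a candidate provider’s sample.

```python
import pandas as pd

# Hypothetical sample extract from a candidate 3PD provider: foot-traffic
# observations with a timestamp and a geo column.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-07-01 00:00", "2023-07-01 01:00", "2023-07-01 02:00",
        "2023-07-02 00:00", "2023-07-02 01:00",
    ]),
    "geo_region": ["US-CA", "US-CA", "US-NY", "US-NY", "US-CA"],
    "foot_traffic": [120, 95, 430, 410, 88],
})

# Granularity: how far apart are consecutive observations?
granularity = sample["timestamp"].sort_values().diff().median()

# Coverage: which date range and how many regions does the sample span?
start, end = sample["timestamp"].min(), sample["timestamp"].max()
regions = sample["geo_region"].nunique()

print(f"median interval between observations: {granularity}")
print(f"covers {start:%Y-%m-%d} to {end:%Y-%m-%d}, {regions} regions")
```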

Procurement: Once you find what you think you need, you have to figure out how to procure that data. You’ll be surprised that it’s not always a simple “click-to-buy” affair. You have to go talk to a data provider, learn about data licenses (can you even use this data for the intended purpose?), negotiate terms, and sign a contract. You repeat that process several times for different 3PD from different providers who all have different contracts, terms, and licenses. You wait to receive the data on floppy disks in your mailbox (just kidding).

Integration: Finally, you have the data you wanted. You wait a couple of weeks while your data engineering team joins it with your 1PD, only to learn that it’s not actually as useful as you had hoped. The time and money you spent are wasted and you never try again. Or, even more agonizingly, you find out that the 3PD does give you a meaningful improvement and you go on to productionize your predictive models, only to discover that you need fresh data on an hourly basis and that one of the data sources you used is only updated weekly. If you ever try again, you now know that, in addition to checking granularity based on the schema, you have to consider refresh rates.
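Continuing the hypothetical sketches above, a refresh-rate check is just as simple to express, and cheap to run before you sign anything; the source names and numbers are invented.

```python
import pandas as pd

# Hypothetical metadata about each external source feeding a production model.
sources = pd.DataFrame({
    "source": ["domain_age", "foot_traffic", "satellite_parking"],
    "refresh_rate_hours": [24, 1, 168],   # how often the provider updates
})

required_freshness_hours = 1  # the model needs hourly-fresh inputs

# Sources that cannot keep up with the production requirement.
stale = sources[sources["refresh_rate_hours"] > required_freshness_hours]
print(stale)
```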

This process can take anywhere from several months to more than a year. In an attempt to build a faster horse, some consulting firms are suggesting that the solution is to hire entire “data sourcing teams” and create relationships with data aggregators.


