Location data can provide unique insights but comes with costs and privacy issues. ML can overcome these drawbacks and improve location data products.
The location data industry is fast-growing but still in its technical infancy. Most products based on location data are technically relatively simple and can be seen as a form of implemented descriptive statistics (e.g., the average number of devices seen inside a store) or, in the worst case, the raw location data themselves. Machine learning can bring a lot of value to this industry by saving costs, increasing product quality, and enhancing privacy.
This story aims to provide a high-level and intuitive overview of how machine learning can provide more robust location data products while reducing costs and enhancing privacy.
The location data industry and privacy
The location data industry is a fast-growing business area offering products that can provide unique insights for their customers. Products based on location data allow companies to analyze, for instance, how many people go to a competitor's store, where their customers come from, how many people moved from one area to another, and much more. However, working with location data is far from trivial and comes with one massive problem: privacy!
Besides other technical and data-related issues that need to be addressed when working with location data, individual privacy is the most important and, in the long run, probably the most challenging one for the industry. It does not matter whether the location data in question is GPS data from mobile phones, telco data, or satellite imagery. Since the whole point of location data is to reveal a location, simple products (raw data or aggregates) do not rule out the possibility of reverse engineering and, thus, of violating someone's privacy.
Even “privacy-friendly” data transformations, like hashing the unique identifier, obfuscating the latitude and longitude, or aggregating data, hardly make reverse engineering impossible. In addition, even if a third-party company aggregates that location data in a perfectly privacy-friendly way, the individually identifiable data has already been sent digitally to that company, and with that, this sensitive data is no longer controlled by the first-party data owner or the individual.
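To see why hashing an identifier alone is weak protection, consider this small sketch (the device ID and area names are made up for illustration): the hash is deterministic, so every record still carries the same stable pseudonym, and the device's full movement trajectory stays linkable — recurring home and work patterns can then re-identify the person behind it.

```python
import hashlib

# Hypothetical advertising identifier and a day's worth of area-level pings
device_id = "ad-id-1234"
pings = [("08:00", "home_area"), ("09:00", "office_area"), ("18:30", "home_area")]

# Hashing replaces the ID with a pseudonym, but the pseudonym is stable
pseudonym = hashlib.sha256(device_id.encode()).hexdigest()
trajectory = [(pseudonym[:8], time, area) for time, area in pings]

# Every record still shares the same pseudonym — the trajectory is intact
assert len({row[0] for row in trajectory}) == 1
```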
Therefore, the future of the location data industry lies in a combination of two things: the early aggregation of data on the 1st party data side in a non-identifiable format, and machine learning on top of these aggregates to create high-quality human mobility insights.
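What "early aggregation in a non-identifiable format" could look like in practice is sketched below; the field names, grid cells, and anonymity threshold are illustrative assumptions, not an actual 1st-party pipeline. Raw pings are reduced to unique-device counts per area and hour, and cells with too few devices are suppressed before anything leaves the data owner's systems.

```python
from collections import Counter

# Hypothetical raw pings as collected on the 1st party side
raw_pings = [
    {"device_id": "a1", "area": "grid_42", "hour": 9},
    {"device_id": "a1", "area": "grid_42", "hour": 9},  # duplicate ping, same device
    {"device_id": "b2", "area": "grid_42", "hour": 9},
    {"device_id": "c3", "area": "grid_17", "hour": 10},
]

def aggregate(pings, k_min=2):
    """Count unique devices per (area, hour); suppress small counts."""
    unique = {(p["area"], p["hour"], p["device_id"]) for p in pings}
    counts = Counter((area, hour) for area, hour, _ in unique)
    # Drop cells below the anonymity threshold so no aggregate can be
    # traced back to a handful of individuals
    return {cell: n for cell, n in counts.items() if n >= k_min}

print(aggregate(raw_pings))  # only ("grid_42", 9) survives, with count 2
```

Only these thresholded counts would ever be shared downstream; the device identifiers never leave the first party.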
The current state of the art in the location data industry
Most products based on location data provide insights into human mobility and are based on fairly simple technical methods. For instance, a common workflow for a product that estimates foot traffic to a store can look like this:
More sophisticated products within the industry bring more context, like home and work locations or area demographics, into the metric. However, the flow is always the same: first pre-process the raw data, cluster individual data points into dwelling events, correct for technical problems in the data, and aggregate all dwelling events in an area.
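The clustering step of this flow can be sketched in a few lines. This is a deliberately naive version — the distance threshold, minimum dwell time, and flat lat/lon distance are simplifying assumptions; production systems use proper geodesic distances and more robust clustering.

```python
from math import hypot

def dwelling_events(points, max_dist=0.0005, min_minutes=5):
    """Group consecutive GPS points that stay within a small radius
    into dwelling events. points: (timestamp_minutes, lat, lon), sorted by time."""
    events, cluster = [], [points[0]]
    for p in points[1:]:
        _, lat0, lon0 = cluster[0]
        if hypot(p[1] - lat0, p[2] - lon0) <= max_dist:
            cluster.append(p)  # still near the cluster anchor: same dwell
        else:
            # Cluster ended; keep it only if the device dwelled long enough
            if cluster[-1][0] - cluster[0][0] >= min_minutes:
                events.append((cluster[0][0], cluster[-1][0]))
            cluster = [p]
    if cluster[-1][0] - cluster[0][0] >= min_minutes:
        events.append((cluster[0][0], cluster[-1][0]))
    return events

pings = [(0, 40.7128, -74.0060), (6, 40.71281, -74.00601),
         (12, 40.7128, -74.0060), (13, 40.80, -74.10)]
print(dwelling_events(pings))  # one dwelling event from minute 0 to minute 12
```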
This approach is simple but effective. It allows for very accurate estimates of foot traffic, especially when someone is interested in patterns over time. The technical sophistication, and mostly proprietary part, lies in the supply correction, since a simple aggregation would be highly affected by the underlying issues in supply. Even small changes in supply volume can have a massive negative impact on an aggregated data product without proper correction. Therefore, automated supply correction is key to a quality data product.
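A minimal sketch of supply correction, under the simplifying assumption that observed visits scale linearly with panel size: divide each day's raw count by that day's share of a baseline panel. The numbers are fabricated; real supply correction is far more involved.

```python
# Hypothetical reference panel size used as the normalization baseline
baseline_panel = 100_000

daily = [
    {"day": "Mon", "raw_visits": 50, "panel_size": 100_000},
    {"day": "Tue", "raw_visits": 30, "panel_size": 50_000},  # panel halved overnight
]

# Scale each day's raw count up to the baseline panel size
corrected = {
    d["day"]: d["raw_visits"] * baseline_panel / d["panel_size"] for d in daily
}
print(corrected)  # Tue's 30 raw visits correct to 60: the drop was supply, not demand
```

Without this step, Tuesday would look like a 40% drop in store traffic, when in fact only the data supply changed.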
However, even though supply correction works, it still comes with significant limitations. Some of those are:
- Supply is constantly changing and requires continuous improvements and new product versions.
- Acquiring and storing all the device-level data over time comes with high costs.
- More and more location data is manipulated, “replayed”, or even falsified, which impacts product quality.
- The public reputation of working with this data is low and, due to privacy concerns, the volume of available data is decreasing.
Therefore, the general setup of buying location data in its raw form and re-selling it as some sort of derivative is not a viable path forward and will decrease the robustness and quality of existing location data products.
Aggregating data on the 1st party side solves the limitations above and presents a win-win for everyone, but: how can we build a product based on already aggregated data? How do we deal with data de-duplication, assignment of data to locations, or estimating foot traffic to a store? The answer is machine learning!
What is machine learning?
There are various great introductions to the basics of AI and machine learning (like this one), and a simple internet search (or asking an LLM) will provide a better answer here than this story will. However, to make it super intuitive and easy:
Machine learning allows an artificial system to learn relationships between data without human interaction.
A simple real-life comparison is classical conditioning, where a dog learns to raise its paw when it receives a reward for doing so often enough. This relationship between “raise paw” and “reward” is, simplified, what machines learn in an artificial system (although a dog is way more intelligent than any AI system humans have built so far).
It is important to note that the number of input features is not limited to just one. In fact, machine learning usually uses many features to train robust relationships. The benefits are manifold. For instance, when we think about our aggregated data coming from 1st party data providers, machine learning would allow us to learn relationships between those aggregates and a given target we would like to estimate (e.g., foot traffic to a store).
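In its simplest form, "learning a relationship" can be a least-squares fit. The toy sketch below uses fabricated numbers: the feature is a 1st-party aggregate (unique devices counted in the surrounding area) and the target is observed store visits; the learned line then generalizes to unseen aggregates.

```python
import numpy as np

# Fabricated training data: area-level aggregates and matching store visits
area_devices = np.array([100.0, 200.0, 300.0, 400.0])
store_visits = np.array([12.0, 22.0, 32.0, 42.0])

# Fit visits ≈ w * devices + b via ordinary least squares
X = np.column_stack([area_devices, np.ones_like(area_devices)])
(w, b), *_ = np.linalg.lstsq(X, store_visits, rcond=None)

# The learned relationship now estimates visits for an unseen aggregate
print(f"{w * 500 + b:.1f}")  # → 52.0
```

Real models use many features and non-linear learners, but the principle — mapping aggregates to a target — is the same.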
Estimate foot traffic to a store
To make things more intuitive, a case study is presented here using GPS data from mobile devices. The aim is to develop a reliable, high-quality product that informs customers how many people visited a specific store on a daily basis. This is a very useful insight for companies interested in a competitor's store performance or in site selection.
The current state-of-the-art methodology
As of today, companies that estimate store traffic based on GPS data either do so directly on raw GPS data or by aggregating that raw data and correcting for supply fluctuations. However, as seen below, these two approaches only work when there is data observed within the store of interest.
When data volumes are high enough, both product methodologies (device level and aggregation) work, and the major concerns are data privacy, supply fluctuations, cost, and trust in the data supply.
However, when the data volume is low or the store is located in an area with a generally low market share, simple aggregation does not allow for a product, since it would always end up with “0” counts. Given the general decrease in available location data, this is already a problem for the industry.
Estimate foot traffic using a machine-learning model
Keeping in mind the conditioning example from before, a machine-learning model simply learns relationships between conditions. Similar to the dog learning that raising a paw leads to a reward, a machine-learning model can learn that if more people are close to a venue, there are most likely also more people inside the venue.
In other words, the purpose of machine learning is to train a relationship (or model) that describes how foot traffic inside a store changes based on fluctuations in traffic outside the store. For example, imagine that on a given Saturday a grand opening leads to twice as many people being close to the store as on a regular Saturday. In that case, it is very likely that more people also make their way into the store.
Of course, the relationship between foot traffic outside the store and foot traffic inside need not be linear. But that is also not the only relationship for a model to learn. Just think about it: what else affects foot traffic to a store that can be measured? Essentially, every dataset that relates to store traffic improves the quality of the model. A few datasets that enhance those relationships are precipitation, area population, demographics, day of the week, holidays, and many more.
Machine learning is capable of using all these different datasets and combining them into a single model that describes the relationship of how foot traffic inside a store changes based on data describing the surroundings.
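Combining several such datasets into one model can again be sketched with a least-squares fit; all numbers below are made up, and a production model would use richer features (holidays, demographics, …) and a non-linear learner. Outside foot traffic, a weekend flag, and a rain flag jointly predict visits inside the store.

```python
import numpy as np

# Fabricated training days: outside traffic, weekend flag, rain flag → visits
outside = np.array([200.0, 400.0, 300.0, 500.0, 250.0, 450.0])
weekend = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
rain    = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
visits  = np.array([30.0, 70.0, 35.0, 75.0, 30.0, 75.0])

# One model over all context features at once
X = np.column_stack([outside, weekend, rain, np.ones_like(outside)])
coef, *_ = np.linalg.lstsq(X, visits, rcond=None)

# Predict a rainy weekday with 350 people outside the store
x_new = np.array([350.0, 0.0, 1.0, 1.0])
pred = x_new @ coef
print(f"{pred:.1f}")  # → 40.0
```

The model has learned, from data alone, that weekends lift traffic and rain dampens it — exactly the kind of combined relationship described above.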
Even though machine learning offers a lot of opportunities, it is not something that can solve everything and comes with limitations that need to be addressed.
Historical bias. The relationships trained are usually based on some historical ground truth. That means the end product is to a large extent influenced by historical relationships. If relationships change, models require re-training to ensure that predictions stay up-to-date and do not drift.
Some things are unpredictable. Even though current developments in AI make machine learning look like the solution to almost everything, it is important to keep in mind that a lot of things are unpredictable. No model can foresee a pandemic and predict its impact on stores. In addition, a model can only learn relationships that exist within the data. Events or behaviors that were either not in the training data or have no relationship within that data are unpredictable.
The shift in mindset. Even though the resulting products might look the same, they come from a fundamentally different methodology. That creates challenges for both the commercial side and the product user to ensure the benefits and disadvantages are properly addressed.
However, when we openly address machine learning's shortcomings and educate users properly, the benefits will outweigh the disadvantages.
Ethical and privacy-friendly. Combining machine learning with aggregated data on the 1st-party side will allow for building future-proof, privacy-friendly products following strict ethical standards.
Robust and quality product. Building a location data product that does not directly depend on GPS data sources makes the product far more robust and trustworthy. In addition, since the product can be based on various high-quality data sources, the end product can, on average, come with higher quality.
Less data volume and costs. Machine learning can work with far less data than is currently needed to build location data products. This allows independence from supply sources and removes unnecessary storage of vast amounts of data. In addition, the costs for data processing and maintenance are considerably lower with a machine learning infrastructure.
New product innovation. Besides the improved privacy, maybe one of the biggest advantages is the possibility for new product innovation. Machine learning by its nature combines different datasets and contexts and, thus, allows for building products that are currently unavailable in the location data industry.
The location data industry is rapidly growing but still in its early stages. Most products based on location data are simple, not robust, and lacking in privacy. Methods based on machine learning have the potential to bring additional value to this industry by reducing costs, increasing product quality, and enhancing privacy. We at Unacast believe that the future of the location data industry lies in combining early data aggregation in a non-identifiable format with machine learning on top of these aggregates to create high-quality human mobility insight products.
— — —
All images, unless otherwise noted, are by the author.
If you want to know more about me and what I am writing about, please take a look here and feel free to follow me.