Everyone has heard it before: “Data is the new oil”, or “gold”, or “currency”. The variations are manifold, but they all imply one thing: data is extremely valuable. While this is certainly true in many cases, the reality is a bit more complicated. Most of the data created has so little value that the effort to monetize it is too high relative to the expected return. On top of that, the value is often so hard to estimate that the owners are reluctant to try at all.
Nevertheless, the demand for relevant, high-quality datasets is growing faster than ever before. Besides traditional data buyers like marketers and quantitative investors, more and more companies are scaling their data science capabilities, and data-driven solutions are being developed for almost every area of our lives. This opens a range of opportunities for any company that is able to generate the data needed for these new applications. A one-year license for a good dataset can easily come with a five- to six-digit dollar price tag.
In order to identify the most promising datasets and avoid those with very little value, we can evaluate them along the seven dimensions outlined in this article. The stronger a dataset is in each of these dimensions, the higher its expected value. The framework does not consider potential legal or moral issues with certain types of data, as this would increase its complexity significantly. This doesn't mean, however, that these aspects are less important. We strongly believe in using data to create real value, and that the privacy of the individual has to be considered beyond explicit legal requirements whenever we work with data.
We developed this framework at Alpha Affinity for our internal use, and as we find it quite helpful, we decided to share it with the data community. It can be used not only for assessing the monetization potential of a dataset but also as a tool for optimizing the data strategy of almost any business. Nevertheless, it is just a heuristic, and in the end only market feedback will reveal the actual value of a dataset.
Although some applications don’t necessarily require a long history, having consistent data over three years or more creates many interesting use cases for a dataset. Especially with the Corona shock, which has affected almost every area of our lives, it is important to understand how things changed compared to the years before. Just having a long history of data, however, is not enough. When it comes to extracting insights from a time series, consistency is key. Unknown structural breaks lead to incorrect conclusions or, even worse, wrong decisions.
Structural breaks can be caused by changes in the way the data is collected, or by indirect external effects. An example of a non-obvious external influence is an ad campaign that drives “artificial” traffic or sales to certain categories for a limited period. In that case, you no longer capture the “pure” market behavior and thus can’t easily use the data to generate objective insights on that market.
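To make the idea of a structural break more tangible, here is a minimal sketch of how a sudden level shift in a time series could be spotted. The function name, the scoring rule, and the numbers are our own illustrative assumptions; real-world break detection usually calls for more robust statistical tests.

```python
from statistics import mean, pstdev

def largest_level_shift(series, min_segment=3):
    """Return (index, score) of the split with the largest mean shift.

    score = |mean(after) - mean(before)| divided by the average spread
    of the two segments. This is a simple heuristic, not a formal test.
    """
    best_index, best_score = None, 0.0
    for i in range(min_segment, len(series) - min_segment + 1):
        before, after = series[:i], series[i:]
        # Fall back to a tiny value if both segments are perfectly flat.
        spread = (pstdev(before) + pstdev(after)) / 2 or 1e-9
        score = abs(mean(after) - mean(before)) / spread
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score

# Made-up monthly visits; assume the collection method changed at index 6.
visits = [100, 102, 98, 101, 99, 103, 151, 149, 152, 150, 148, 153]
break_index, break_score = largest_level_shift(visits)
```

A high score at a single index suggests that the periods before and after it should not be compared directly without an adjustment.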
Theoretically, more granular (i.e., less aggregated) data is more valuable than highly aggregated data, as it generally contains more detailed information. You can always add aggregation steps if needed, but reverting them is not possible in most cases. To illustrate this, let’s assume you need to decide where to build new charging stations for electric vehicles and want to install them where many EVs are passing by. You could use official traffic statistics to select the ideal spots, but this data is usually highly aggregated. It doesn’t differentiate between EVs and combustion engine cars and thus might be misleading. In some areas, you might have a lot of cars passing by while the share of EVs is exceptionally small, or vice versa. Having specific information on EVs or different models would let you pick the locations more reliably. You could even use this information to decide on the size of the parking spots or the charging standard.
In practice, however, we need to consider two things. First of all, the data might contain so much noise that the insights you can get at a higher level of detail are limited or even misleading. In our EV example, some newly built measuring points might be able to differentiate between EVs and traditional cars while the old ones still can’t. Thus, the granular data might contain large gaps and could lead to incorrect conclusions. Aggregating and adjusting the data could make it more robust at the cost of some detail.
Secondly, data buyers are often not even interested in granular data and would prefer a certain level of aggregation, so that they don’t have to perform this step themselves. Reports are a classic example where additional value is created via aggregation. The city mayor will most likely not be interested in the hourly number of EVs for each street in the city. Instead, they will often rely on a few KPIs when deciding on new projects.
Nevertheless, given sufficiently high quality, higher granularity usually increases the value of a dataset significantly, because it enables detailed insights at scale.
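The EV example can be sketched in a few lines of code. All locations and counts below are invented for illustration. The point is only that the aggregated totals can be derived from the granular rows, while the EV-specific view cannot be recovered from the totals.

```python
from collections import defaultdict

# Made-up granular counts: (location, vehicle_type, vehicles_per_day).
granular = [
    ("Main St", "ev", 40),
    ("Main St", "combustion", 960),
    ("Park Ave", "ev", 180),
    ("Park Ave", "combustion", 320),
]

def total_traffic(rows):
    """Aggregate away the vehicle type; this step cannot be undone."""
    totals = defaultdict(int)
    for location, _vehicle_type, count in rows:
        totals[location] += count
    return dict(totals)

def ev_traffic(rows):
    """Keep only the EV counts, which requires the granular rows."""
    return {loc: count for loc, vtype, count in rows if vtype == "ev"}

aggregated = total_traffic(granular)
ev_only = ev_traffic(granular)

busiest_overall = max(aggregated, key=aggregated.get)
busiest_for_evs = max(ev_only, key=ev_only.get)
```

With these numbers, the busiest location overall and the busiest location for EVs differ, which is exactly the situation in which aggregated traffic statistics would point to the wrong spot.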
As long as you can’t just measure “everything”, i.e. every entity, every transaction, and the like in the domain of interest, it is crucial to know what your data represents and how well it represents it. Whenever you gather data, you make many conscious and unconscious decisions that cause selection bias. The regional focus is among the obvious ones: if you build a marketplace that only works in German, you will naturally have a strong bias in your user base towards German-speaking countries.
But other, less obvious business activities can lead to biases as well. Marketing via Facebook reaches different demographics than marketing via Snapchat or LinkedIn. This doesn’t mean that a focus on a certain niche necessarily reduces the value of your data. Quite the opposite can be the case if this niche is a valuable market with little available data. It is crucial, however, that you know the properties of your sample and maintain its representativeness over time.
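One simple way to quantify how far a sample deviates from the population it is supposed to represent is to compare the shares of each group. The sketch below uses the total variation distance (half the sum of absolute share differences) for this purpose. The country counts are invented, and the choice of metric is our assumption, not part of the framework itself.

```python
def shares(counts):
    """Convert absolute counts per group into shares that sum to 1."""
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

def total_variation_distance(sample_counts, population_counts):
    """0.0 means identical composition, 1.0 means no overlap at all."""
    sample = shares(sample_counts)
    population = shares(population_counts)
    groups = set(sample) | set(population)
    return 0.5 * sum(
        abs(sample.get(g, 0.0) - population.get(g, 0.0)) for g in groups
    )

# Made-up user base of a German-language marketplace vs. the target market.
users_by_country = {"DE": 800, "AT": 150, "FR": 50}
market_by_country = {"DE": 400, "AT": 100, "FR": 500}

bias = total_variation_distance(users_by_country, market_by_country)
```

Tracking such a number over time is one way to notice when the composition of the sample starts to drift.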
The structure of a dataset strongly determines how, and how easily, the latent value of the data can be realized. This can be illustrated with an example of firmographic information. Let’s assume we have 10,000 books full of in-depth analysis of the 100,000 largest companies in the world. The value hidden in these books is huge, and as an individual, you can search for the company you are interested in and extract the needed information. But obviously, this is far from scalable, as the costs of retrieving this information are huge as well.
Let’s assume we fully digitized these books using OCR (Optical Character Recognition). This already makes accessing the information much easier, as it is no longer restricted to the physical location of the book and we can use a simple full-text search for information retrieval. But we still face a significant limitation: processing the information requires an intelligent agent. Besides the obvious option of letting a human process the data, the data buyer could develop a sophisticated language model that makes sense of the unstructured text, plus another layer that draws conclusions or even makes decisions. However, developing such a model with the quality required for real-world applications comes at a very high cost.
Thus, if we go one step further and structure the information in our 10,000 books, e.g. model it as a large knowledge graph and link the entities to global identifiers, whoever wants to build a model using this data has to invest a lot less in making the information machine-readable. Instead, the data buyer can focus on developing the conclusion or decision layer right away and will therefore be willing to pay a higher price for our data.
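The difference between unstructured text and a structured representation can be sketched as follows. The company names, identifiers, and predicates are purely hypothetical, and a real knowledge graph would use a dedicated store and established identifier schemes rather than a Python list.

```python
# Unstructured: a sentence as it might come out of OCR (hypothetical text).
ocr_text = "Example Corp, headquartered in Berlin, reported revenue of 120 million."

# Structured: similar facts as (subject, predicate, object) triples.
triples = [
    ("company:0001", "name", "Example Corp"),
    ("company:0001", "headquartered_in", "city:berlin"),
    ("company:0001", "revenue_musd", 120),
    ("company:0002", "name", "Sample AG"),
    ("company:0002", "headquartered_in", "city:berlin"),
    ("company:0002", "revenue_musd", 80),
]

def objects(graph, subject, predicate):
    """All objects linked to a subject via a predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]

def subjects(graph, predicate, obj):
    """All subjects that point to an object via a predicate."""
    return [s for s, p, o in graph if p == predicate and o == obj]

# A question that needs a language model on raw text is a lookup here.
berlin_companies = subjects(triples, "headquartered_in", "city:berlin")
berlin_revenue = sum(
    objects(triples, company, "revenue_musd")[0] for company in berlin_companies
)
```

Answering the same question from `ocr_text` alone would require parsing names, places, and numbers out of free text first, which is the cost the structured form removes for the buyer.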
No matter how high the quality of a dataset is, if nobody cares about the domain it is relevant for, you will have a hard time monetizing it. Simply put, the closer the data is to a valuable market, the higher its value. With unique, high-quality data on the performance of publicly traded companies, for example, you will most likely achieve much higher prices than with the same information on the local artisans in your home town.
To identify such connections between data and markets, which are often not obvious, you can ask yourself whether there are any problems that could be solved with the data. As with traditional products, it is important to know who is affected by the problem, how much it costs them, and whether there is a budget for solving it. It is no secret that many problems don’t get solved because they are not relevant to the decision-makers with budgeting power. The good news is that there are many ways to validate your assumptions and get a feeling for the demand before you invest in building a data product nobody wants in the end (see lean product development/lean startup).
One of the most important attributes of a dataset is the correlation of its variables with important real-world variables or, alternatively, with the outcomes the data buyer wants to predict. Whenever we measure something, collect data, or build a model, we create an imperfect representation of the underlying reality. Therefore, we need to make sure that the variables in our dataset are actually correlated with the latent, real-world variables they represent. This correlation could also be interpreted as “correctness” or “precision” and is strongly connected to representativeness. However, there is another, even more important type of correlation that is related to relevance and could also be called “predictive power”.
Correct data in a relevant (i.e. valuable) domain is not guaranteed to be valuable. The dataset itself becomes valuable only if the information is correlated with a variable that has a tangible impact in the real world and is actionable. This can be nicely illustrated with leading and lagging indicators. Let’s assume you have a dataset with fundamental economic data. The information is rich, detailed, and correct. But when trying to predict KPIs of interest, such as quarterly GDP growth, you notice that it has no predictive power. Instead, you see a high correlation with the GDP growth of the previous quarter. What happened? The data is lagging, i.e. it describes the past but has no predictive power for the future. No matter how good the quality or how valuable the related market, if data has no predictive power for variables of interest, its value drops significantly.
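Whether an indicator is leading or lagging can be checked with a lagged correlation: we correlate today's indicator values once with future and once with past values of the target. The following is a minimal sketch with made-up GDP growth figures and an artificially constructed lagging indicator. The function names are our own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(indicator, target, lag):
    """Correlate indicator[t] with target[t + lag].

    lag > 0 asks whether the indicator leads the target,
    lag < 0 asks whether it merely echoes past values.
    """
    pairs = [
        (indicator[t], target[t + lag])
        for t in range(len(indicator))
        if 0 <= t + lag < len(target)
    ]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

# Made-up quarterly GDP growth figures (percent).
gdp_growth = [0.5, 1.2, -0.3, 0.8, 1.5, -0.6, 0.2, 1.1, -0.4, 0.9]
# A purely lagging indicator: each value repeats last quarter's growth.
indicator = [0.0] + gdp_growth[:-1]

corr_with_next_quarter = lagged_correlation(indicator, gdp_growth, lag=1)
corr_with_previous_quarter = lagged_correlation(indicator, gdp_growth, lag=-1)
```

In this constructed case, the correlation with the previous quarter is essentially perfect while the correlation with the next quarter is much weaker, which is the signature of a lagging indicator.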
The last dimension is the rarity of the dataset. As with all other goods, prices result from supply and demand. The closer you are to having a monopoly on a certain good, the higher your margins are.
But data has a property that is very specific to immaterial assets: a high initial investment and marginal distribution costs near zero. This means that once you make the initial (high) investment of generating a dataset, you will most likely try to sell as many copies of this dataset as possible (similar to the SaaS business model). As long as you are in a monopoly or at least an oligopoly situation, you can maintain a high margin while quickly expanding your volume. However, as soon as the number of similar datasets (i.e. substitutes) in the market grows, the pressure on your margins will rise as new market entrants will almost certainly try to compete via lower prices.
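The cost structure described above can be expressed with simple arithmetic: a fixed upfront investment, a marginal cost per copy close to zero, and a price per license. The figures below are invented for illustration only.

```python
import math

def profit(copies, fixed_cost, price, marginal_cost=0):
    """Total profit after selling a number of licenses."""
    return copies * (price - marginal_cost) - fixed_cost

def break_even_copies(fixed_cost, price, marginal_cost=0):
    """Smallest number of licenses that recovers the initial investment."""
    margin = price - marginal_cost
    if margin <= 0:
        raise ValueError("price must exceed marginal cost to break even")
    return math.ceil(fixed_cost / margin)

# Made-up numbers: 500k to build the dataset, near-zero distribution cost.
fixed_cost = 500_000
monopoly_price = 50_000      # hypothetical price with no substitutes
competitive_price = 20_000   # hypothetical price after new entrants

copies_needed_monopoly = break_even_copies(fixed_cost, monopoly_price)
copies_needed_competitive = break_even_copies(fixed_cost, competitive_price)
profit_at_40_copies = profit(40, fixed_cost, monopoly_price)
```

Because each additional copy costs almost nothing to deliver, every sale beyond the break-even point is nearly pure margin, and falling prices push that break-even point further out.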
There’s also another aspect to the rarity or “freshness” of a dataset that is specific to the quantitative investment industry (which is a big segment of the data market). In order to outperform the market, you need to have an information advantage. While traditional investors used to leverage their own experience and knowledge as well as in-depth fundamental analysis, quantitative investors are looking for unique datasets very few other investors have used before. If the dataset contains new information that helps to predict the performance of a stock or another asset, they can use it for making investment decisions before other investors see this opportunity. However, if the dataset is already used by the majority, everyone else will have the same information and the opportunity is no longer an opportunity.
These are the seven dimensions we use for the initial evaluation of a dataset. The stronger it is in each of them, the higher its value tends to be. What is really important to keep in mind is that if just one of the attributes is very weak, this can reduce the value of the whole dataset significantly, even if all other attributes are very strong. But if you own data that scores medium to high in all of these dimensions, you should think about monetizing it. This way, you not only have the chance to generate additional revenue streams and maybe even lay the foundation for new, data-driven business models, but you also give innovators around the world the chance to develop new solutions for real-world problems.
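One possible way to turn this into a number is to score each dimension between 0 and 1 and combine the scores with a geometric mean, so that a single very weak dimension pulls the overall result down much more than a plain average would. The dimension labels below are our shorthand for the seven sections of this article, and the scoring rule is an assumption for illustration, not a calibrated pricing model.

```python
import math

# Shorthand labels for the seven dimensions discussed in this article.
DIMENSIONS = (
    "history", "granularity", "representativeness", "structure",
    "relevance", "correlation", "rarity",
)

def dataset_score(scores):
    """Geometric mean of the seven dimension scores, each in [0, 1]."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    values = [scores[d] for d in DIMENSIONS]
    if any(not 0 <= v <= 1 for v in values):
        raise ValueError("scores must lie between 0 and 1")
    return math.prod(values) ** (1 / len(values))

# Made-up examples: a balanced dataset vs. one with a single weak spot.
balanced = {d: 0.7 for d in DIMENSIONS}
one_weak = {d: 0.9 for d in DIMENSIONS}
one_weak["rarity"] = 0.05

balanced_score = dataset_score(balanced)
one_weak_score = dataset_score(one_weak)
one_weak_average = sum(one_weak.values()) / len(one_weak)
```

With these numbers, the dataset with one weak dimension has the higher plain average but the lower geometric mean, which mirrors the observation that one weak attribute can outweigh six strong ones.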
Luckily, you don’t have to solve your data challenges on your own. The number of free data sources, data vendors, and data-driven solution providers is constantly growing, so your options are manifold. Whether you need a turn-key solution or want to build your own powerful data foundation, we support you as your end-to-end partner for external data. Just get in touch with us via firstname.lastname@example.org or use our contact form.