From fraud detection to agricultural crop monitoring, a new wave of tech startups has emerged, all armed with the conviction that their use of AI will address the challenges presented by the modern world.
However, as the AI landscape matures, a growing concern is coming to light: the models at the heart of many AI companies are rapidly becoming commodities. The lack of substantial differentiation among these models raises questions about how sustainable their competitive advantage really is.
While AI models remain pivotal components of these companies, a paradigm shift is underway. The true value proposition of AI companies now lies not only in the models but, predominantly, in the underlying datasets. It is the quality, breadth, and depth of these datasets that enable models to outperform their competitors.
Yet in the rush to market, many AI-driven companies, including those venturing into the promising field of biotechnology, are launching without a purpose-built technology stack that generates the data required for robust machine learning. This oversight has substantial implications for the longevity of their AI initiatives.
As seasoned venture capitalists (VCs) will be well aware, it’s not enough to scrutinize the surface-level appeal of an AI model. Instead, a comprehensive evaluation of the company’s tech stack is needed to gauge its fitness for purpose. The absence of a carefully designed infrastructure for data acquisition and processing could signal the downfall of an otherwise promising venture right from the outset.
In this article, I offer practical frameworks derived from my hands-on experience as both CEO and CTO of machine learning–enabled startups. While by no means exhaustive, these principles aim to provide an additional resource for those with the difficult task of assessing companies’ data processes and the resulting data’s quality and, ultimately, determining whether they are set up for success.
From inconsistent datasets to noisy inputs, what could go wrong?
Before jumping into the frameworks, let’s first look at the basic factors that come into play when assessing data quality and, crucially, at what could go wrong if the data’s not up to scratch.
First, let’s consider datasets’ relevance. Data must align closely with the problem that an AI model is trying to solve. For instance, an AI model developed to predict housing prices requires data encompassing economic indicators, interest rates, real income, and demographic shifts.
Similarly, in the context of drug discovery, it’s crucial that experimental data be as predictive as possible of the effects in patients, which requires expert thought about the most relevant assays, cell lines, model organisms, and more.
Second, the data must be accurate. Even a small amount of inaccurate data can have a significant impact on the performance of an AI model. This is especially pertinent in medical diagnoses, where a small error in the data could lead to a misdiagnosis and potentially affect lives.
Third, coverage of data is also essential. If the data is missing important information, the AI model will not be able to learn as effectively. For example, if an AI model is being used to translate a particular language, it is important that the data include a variety of dialects.
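To make the coverage point concrete, here is a minimal sketch of how a team might check whether each expected category is adequately represented in a dataset. Everything in it is an assumption for illustration: the function name `coverage_report`, the 5% threshold, and the dialect labels are hypothetical, and a real audit would use categories and thresholds chosen by domain experts.

```python
from collections import Counter


def coverage_report(labels, expected, min_share=0.05):
    """Return each expected category's share of the dataset and flag gaps.

    labels:    one category label per training example (hypothetical data)
    expected:  categories the model is supposed to handle
    min_share: illustrative threshold below which a category is flagged
    """
    counts = Counter(labels)
    total = len(labels)
    report = {}
    for category in expected:
        # Counter returns 0 for categories that never appear in the data.
        share = counts[category] / total if total else 0.0
        report[category] = {
            "share": share,
            "underrepresented": share < min_share,
        }
    return report


# Hypothetical dialect labels attached to a translation corpus.
samples = ["standard"] * 90 + ["northern"] * 9 + ["coastal"] * 1
report = coverage_report(
    samples,
    expected=["standard", "northern", "coastal", "highland"],
)

# Categories that are missing entirely or fall below the threshold.
gaps = [name for name, row in report.items() if row["underrepresented"]]
print(gaps)
```

In this made-up corpus, the check would flag both the dialect that barely appears and the one that is absent altogether, which is the kind of gap that is easy to miss when only headline model accuracy is reported.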