Investors have taken note of the promise of artificial intelligence models. VC activity has grown exponentially, reaching over $25 billion in deal value across GenAI in 2023, a 250% increase over ‘22. Companies are rapidly pivoting to inject AI into their businesses, developing new product lines and internal tools or adapting core offerings. Whole ecosystems of products have emerged around the downstream applications of these models: over $2.2B has been invested in developing applications in Biotech, $2.0B in 2D Media Generation, and $3.2B in Natural Language Interfaces.
Yet a critical bottleneck exists at a much more foundational level. The lack of access to high-quality training data undermines the forward-looking market now taking shape, and presents an Achilles’ heel for the entire ecosystem.
Data forms the backbone of the product at hand. Yet the data sources currently used to train these core products lag far behind the future ambitions investors have envisioned, and funded.
Take, for example, OpenAI’s ChatGPT, which relied “60% on internet common crawl, 22% WebText2 (text documents scraped from URLs on Reddit submissions) [and]… 3% Wikipedia”.
Harvard Business Review finds, “poor data quality is the primary enemy to the development of wide-spread, profitable use of Machine Learning algorithms”—in essence, “garbage in, garbage out”.
For data to generate any predictive value, it must satisfy two overarching criteria: it must be “right”, and it must be the “right” data. That is, the data must be accurate, and it must contain relevant observations across the entire range of inputs.
Currently, most companies fall short on both counts.
Researchers find that, on average, only 3% of companies’ data meets basic quality standards, and 47% of new data records have at least one critical, work-interrupting error.
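To make these two criteria concrete, the sketch below shows what such checks might look like in practice. It is a minimal, hypothetical illustration in Python; the field names, validity rules, segment list, and coverage threshold are assumptions made for the example, not a description of any particular company’s pipeline.

```python
import pandas as pd

# Hypothetical training set: customer-support transcripts with a 1-5
# satisfaction label. Columns and thresholds are illustrative assumptions.

def is_right(df: pd.DataFrame) -> pd.Series:
    """Criterion 1 -- the data is 'right': each record is accurate and valid."""
    return (
        df["transcript"].str.strip().str.len().gt(0)  # non-empty text
        & df["label"].between(1, 5)                   # label within allowed range
        & df["timestamp"].notna()                     # no missing timestamps
    )

def is_right_data(df: pd.DataFrame, segments: list[str], min_share: float = 0.05) -> dict:
    """Criterion 2 -- it is the 'right' data: relevant observations exist
    across the whole range of inputs (here, every customer segment)."""
    shares = df["segment"].value_counts(normalize=True)
    return {s: shares.get(s, 0.0) >= min_share for s in segments}

# Usage sketch
df = pd.DataFrame({
    "transcript": ["refund request", "", "billing question"],
    "label": [4, 9, 3],  # 9 is an out-of-range label
    "timestamp": ["2023-05-01", None, "2023-05-02"],
    "segment": ["enterprise", "consumer", "consumer"],
})
print(f"{is_right(df).mean():.0%} of records pass basic validity checks")
print(is_right_data(df, segments=["enterprise", "consumer", "smb"]))
```

In this toy example, the first check catches records with missing fields or out-of-range labels, while the second flags input segments with too few observations, the same kinds of gaps the statistics above describe.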
At the same time, companies in niche verticals struggle to obtain the relevant data in the first place. Regulatory constraints limit data aggregation across the medical, health, and insurance industries, and the use of copyrighted or trademarked content has spurred fierce, ongoing litigation from content authors.
In fact, studies from the MIT Computer Science & AI Lab find that, if current training methods continue, models will exhaust the supply of high-quality training data within the next three years.
I reached out to and profiled several leading industry experts and academic researchers. My interview with Dr. Greg Green validated the opportunity.
As a leader of The University of Chicago’s Data Science Institute, he is at the forefront of solving complex industry problems with the latest academic innovations. Green’s past industry roles include Chief Analytics Officer at Harland Clarke Holdings, Director at Google, EVP/Managing Director at Publicis Groupe, and Analytics Practice Lead at PwC. Green’s patented cloud-based media analytics platform was highlighted in Fast Company.
I asked Green for his take on the current state of AI model development.
Green passionately shared that he is “strongly disappointed” with the current state of training data and the status quo practices of data science in the field. He explains, “There’s a current prevailing idea to just use ‘more data’, more more more [....] but training data needs to be tailored, designed, specifically to the use-case. Otherwise you are just building deficiencies, biases.”
In his view, firms are currently “modeling to overcome the short-comings in the training data”, but he is skeptical of the validity and robustness of the systems being created.
Indeed, Green is “surprised at how they’re managing to tenuously build these models” and anticipates necessary changes in the near term.