How Moderna uses cloud and data wrangling to conquer COVID-19

Commentary: Most COVID-related machine learning failed, but Moderna’s did not. Here’s how data prep and cloud helped make Moderna a COVID-19 vaccination success story.

“Hundreds of AI tools have been built to catch covid. None of them helped.” That’s a bold statement by Will Douglas Heaven, senior editor for AI at MIT Technology Review, and it is quite likely correct. Despite dozens upon dozens of machine learning algorithms designed to diagnose patients or predict just how sick COVID-19 might make them, two independent reviews published in the British Medical Journal and Nature came to the same conclusion: none of them worked.

But let’s not write off artificial intelligence’s impact on COVID-19 too soon. Though most ML algorithms failed, there’s one area where they succeeded and succeeded big. Data scientists at Moderna managed to pull off a modern-day miracle using cloud infrastructure and machine learning, as recounted by Moderna chief data and AI officer Dave Johnson. Why did Moderna succeed while many other efforts failed? It’s all about the data.

Garbage in, garbage out

Given how quickly medical researchers rushed to respond to the COVID-19 threat, it’s understandable that so many data science projects failed. As outlined by Heaven, “Many of the problems that were uncovered are linked to the poor quality of the data that researchers used to develop their tools.” Poor in what ways? “[M]any tools were built using mislabeled data or data from unknown sources.” In less frenetic times with sufficient hindsight, perhaps these problems could be fixed. But in the case of the COVID ML algorithms, Heaven continued, “[M]any tools were developed either by AI researchers who lacked the medical expertise to spot flaws in the data or by medical researchers who lacked the mathematical skills to compensate for those flaws.”

The problem, in other words, may not have been the models themselves but, rather, the data feeding into those models.
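The two flaws Heaven names, mislabeled data and data from unknown sources, are the kind that a simple gate can partly screen for before any training happens. Here is a minimal Python sketch of such a gate. The `Record` fields, the allowed label set, and the `audit` function are all hypothetical, invented for illustration. Note the limits: it only catches labels that are invalid, not labels that are valid but wrong, which still takes domain expertise to spot.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label vocabulary; a real project would define its own.
ALLOWED_LABELS = {"positive", "negative"}


@dataclass
class Record:
    patient_id: str
    label: Optional[str]
    source: Optional[str]  # where the record came from, if known


def audit(records):
    """Split records into usable ones and (record, reasons) rejects."""
    clean, rejected = [], []
    for record in records:
        reasons = []
        if record.label not in ALLOWED_LABELS:
            reasons.append("invalid label")
        if not record.source:
            reasons.append("unknown source")
        if reasons:
            rejected.append((record, reasons))
        else:
            clean.append(record)
    return clean, rejected


records = [
    Record("a", "positive", "hospital_a"),
    Record("b", "postive", "hospital_a"),  # typo in the label
    Record("c", "negative", None),         # no provenance
]
clean, rejected = audit(records)
for record, reasons in rejected:
    print(record.patient_id, reasons)
```

Rejecting records with their reasons attached, rather than silently dropping them, keeps the wrangling step auditable.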

A recent Anaconda data science survey found that 39% of data science isn’t really “science” at all: it’s data wrangling, or cleaning and preparing data to be used by a model. This isn’t a bad thing, as Leigh Dodds of the Open Data Institute has suggested. In fact, it’s an unalloyed good: “[S]pending time working with data to transform, explore, and understand it better is absolutely what data scientists should be doing….Understand the material better and you’ll get better insights.”

Or, as analyst Benedict Evans put it in his newsletter, it turns out it’s “very hard to make sure that the training data is as clean as you think, and very hard to generalise from training data from one context to use in another context.”

Moderna approached things differently.

Building vaccines with AI

We sometimes mischaracterize AI as machines acting like humans, and the very name misleads us. A founder of artificial intelligence suggested a different term: “complex information processing.” The data scientist’s job is not to feed copious quantities of data into a black box algorithm and pray for magic to happen, but rather to find ways to complement human thought with that “complex information processing” that only a computer can do at scale and speed.

This is precisely what makes Moderna’s approach so powerful.

“[P]utting in digital systems and processes to…capture homogeneous, good data that can feed into that is obviously a really important first step, but it also lays the foundation of processes that are then amenable to these greater degrees of automation,” said Johnson. Catch that? No? Johnson can rephrase it: “We spent a lot of time on the data curation, data ingestion, to make sure the data is good to be used right away. And then we put a lot of tooling and infrastructure in place to get those models into production and integrated.”

Moderna focuses on getting the data structured correctly upfront to make it more usable down the road, and then ensures it has the right cloud infrastructure in place to be able to automate data processing at scale. Here’s an example:

“One of the big bottlenecks was having this mRNA for the scientist to run tests in. So, what we did is we put in place a ton of robotic automation, put in place a lot of digital systems and process automation and AI algorithms as well. And [we] went from maybe about 30 mRNAs manually produced in a given month to a capacity of about a thousand in a month period without significantly more resources and much better consistency in quality and so on.”

And here’s another for mRNA sequence design:

“We’re coding for some protein, which is an amino acid sequence, but there’s a huge degeneracy of potential nucleotide sequences that could code for that, and so starting from an amino acid sequence, you have to figure out what’s the ideal way to get there. And so what we have [are] algorithms that can do that translation in an optimal way. And then we have algorithms that can take one and then optimize it even further to make it better for production or to avoid things that we know are bad for this mRNA in production or for expression.”
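To make the “degeneracy” Johnson describes concrete, here is a minimal Python sketch. It is not Moderna’s algorithm. It builds the standard genetic code, counts how many mRNA sequences encode a given protein, and picks one codon per amino acid using a toy scoring rule. The function names, the example peptide, and the G/C-count score are assumptions made for illustration only; real sequence design weighs many more factors.

```python
from itertools import product
from math import prod

# Standard genetic code, codons ordered over bases U, C, A, G ("*" marks stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group synonymous codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def count_encodings(protein):
    """Number of distinct mRNA sequences that code for the protein."""
    return prod(len(SYNONYMS[aa]) for aa in protein)


def gc_count(codon):
    """Toy score (an assumption, not a real design criterion): count G and C."""
    return sum(base in "GC" for base in codon)


def back_translate(protein, score=gc_count):
    """Pick the highest-scoring synonymous codon for each amino acid."""
    return "".join(max(SYNONYMS[aa], key=score) for aa in protein)


def translate(mrna):
    """Translate an mRNA string back into its amino acid sequence."""
    return "".join(CODON_TO_AA[mrna[i:i + 3]] for i in range(0, len(mrna), 3))


peptide = "MKW"  # hypothetical three-residue example
mrna = back_translate(peptide)
print(count_encodings(peptide), mrna, translate(mrna))  # 2 AUGAAGUGG MKW
```

The search space grows quickly: a chain of just ten leucines, each with six synonymous codons, already has more than 60 million possible encodings, which is why choosing among them is a job for algorithms rather than for people.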

The algorithms aren’t intended to magically create cures for COVID; rather, they are intended to “automate activities. Anytime we see something where we know that scale and making it parallel is going to improve things, we put in place this process.” But to do this successfully, Moderna first needs to structure and prepare its data. Good data makes for good ML algorithms. It’s why Moderna has succeeded when so many other data science efforts failed to help with COVID. That’s the lesson: if you want great results, first ensure you’re prepping great data.

Disclosure: I work for AWS, but the views expressed herein are mine.