Large-scale machine learning models are at the heart of headline-grabbing technologies like OpenAI’s DALL-E 2 and Google’s LaMDA. They’re impressive, to be sure, capable of generating images and text convincing enough to pass for a human’s work. But developing the models took an enormous amount of time and compute power — not to mention cash. DALL-E 2 alone was trained on 256 GPUs for 2 weeks, which works out to a cost of around $130,000 if it were trained on Amazon Web Services instances, according to one estimate.
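As a rough check on that estimate, the quoted figures imply a per-GPU hourly rate. The sketch below uses only the numbers cited above and assumes, for illustration, that all 256 GPUs ran around the clock for the full 14 days; the variable names are illustrative and not taken from the estimate itself.

```python
# Back-of-the-envelope check using only the figures quoted in the article.
# Assumption: all GPUs run 24 hours a day for the entire two weeks.
GPUS = 256
DAYS = 14
ESTIMATED_COST_USD = 130_000

gpu_hours = GPUS * DAYS * 24                   # total GPU-hours consumed
implied_rate = ESTIMATED_COST_USD / gpu_hours  # implied USD per GPU-hour

print(f"{gpu_hours:,} GPU-hours, about ${implied_rate:.2f} per GPU-hour")
```

Under those assumptions the run comes to 86,016 GPU-hours, or roughly $1.51 per GPU-hour.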
Smaller companies struggle to keep up, which is why many turn to “AI-as-a-service” vendors that handle the challenging work of creating models and charge for access to them through an API. One such vendor is AssemblyAI, which focuses specifically on speech-to-text and text analysis services.
AssemblyAI today announced that it raised $30 million in a Series B round led by Insight Partners with participation from Y Combinator, Stripe co-founders John and Patrick Collison, Nat Friedman and Daniel Gross. To date, AssemblyAI has raised $64 million, which founder and CEO Dylan Fox tells TechCrunch is being invested in growing the company’s research and engineering teams and its data center capacity for AI model training.
Fox founded AssemblyAI after a 2-year stint at Cisco, where he worked on machine learning for collaboration products. Prior to that, he started YouGive1, an organization that worked with companies to reward customers with product offers in exchange for nonprofit donations.
“I was looking for speech recognition and natural language processing (NLP) APIs for past projects, and started AssemblyAI after seeing how limited, and low-accuracy, the available options were back in 2017,” Fox told TechCrunch in an email interview. “The company’s goal is to research and deploy cutting-edge AI models for NLP and speech recognition, and expose those models to developers in very simple software development kits and APIs that are free and easy to integrate.”
AssemblyAI offers AI-powered, API-based services in over 80 languages for automatic transcription, topic detection, and content moderation as well as “auto chapters,” which breaks down audio and video files into “chapters” with summaries for each. Using the platform, developers can call various APIs to perform tasks like “identify the speakers in this conversation” or “check this podcast for prohibited content” at a relatively low cost, starting at $0.00025 per audio-second.
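To put the per-second price in more familiar terms, here is a minimal sketch that converts it to per-minute and per-hour costs. It assumes a flat rate at the quoted starting price, with no volume tiers, minimums or feature surcharges; the function name `transcription_cost` is hypothetical and not part of AssemblyAI's API.

```python
from decimal import Decimal

# Starting price quoted in the article; actual pricing may vary by feature or plan.
PRICE_PER_AUDIO_SECOND = Decimal("0.00025")


def transcription_cost(duration_seconds: int) -> Decimal:
    """Estimated cost in USD for a file of the given length (flat-rate assumption)."""
    return PRICE_PER_AUDIO_SECOND * Decimal(duration_seconds)


per_minute = transcription_cost(60)    # 0.015 USD
per_hour = transcription_cost(3600)    # 0.9 USD

print(f"${per_minute} per minute, ${per_hour:.2f} per hour")
```

At that rate, an hour of audio costs about 90 cents to process.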
“We’re training massive AI models on hundreds of GPUs, with billions of parameters,” Fox said. “Parameters” are the values a model learns during training, and their count is a common measure of a model’s size; generally speaking, larger models are more sophisticated. “Leveraging advances in AI research, we continue to dramatically improve the accuracy of all of our AI models as well as launch new ones,” he continued. “Our ‘AutoTrain’ feature enables the API to learn from a random sample of a customer’s data in order to automatically improve over time.”
AssemblyAI isn’t the only player in the bustling AI-as-a-service sector. NLPCloud provides NLP models out of the box through APIs, while Sayso created an API to change accented English from one accent to another in near-real time. Meanwhile, Amazon, Google and Microsoft each offer a host of API-based AI products targeting applications like text analysis, image recognition, text-to-speech, speech-to-text and more.
But Fox says AssemblyAI continues to grow at a fast clip, fueled by the pandemic and, by extension, the rise of remote work. Audio and video are being incorporated into an expanding number of products, he notes, like videoconferencing and even dating apps. That’s led product teams to look for ways to build additive, high-value features on top of audio and video data.
“These features look like trust and safety teams at social media companies automating content moderation of audio posts, or advertising platforms automatically identifying topics spoken in podcasts and videos, collaboration tools providing readable transcripts, summaries, and keywords for video messages shared within their platforms, and telephony companies building smarter contact center platforms and revenue intelligence products that can analyze customer support and sales phone calls,” Fox said. “AssemblyAI is quickly becoming the go-to API platform for these product teams to be able to ship these AI-infused features on top of audio and video data within their products.”
Fox says that AssemblyAI now has “hundreds” of paying customers among its more than 10,000 users. Since the start of 2022, the user base has tripled, and so has revenue, though Fox declined to disclose the revenue figures.
“[We’re] processing millions of API calls every single day,” Fox said. “We plan to 3x our AI research team over the next six months and invest millions of dollars into GPU hardware to train larger and more complex AI models that will push the envelope.”
Fox believes the growth will position AssemblyAI well for the coming year, whatever headwinds it might bring. At a time when layoffs are becoming a regular occurrence and financing is tough to come by, he says that AssemblyAI will buck the trend by nearly doubling the size of its 52-person team by the end of the year.
“We had barely dipped into our Series A funding, which we closed just a few months ago in February from Accel, and weren’t actively fundraising. But we had been in touch with Rebecca [Liu-Doyle] from Insight for a while, and felt like she, Insight at large, plus the additional capital, would really help us [spur] our growth even further,” Fox said. “As the market unlocks, we need to be able to both establish ourselves as the dominant provider in this space, as well as support the growing expectations of customers — with more accurate AI models that can support the features and products they’re building.”