One reason it’s so difficult to produce effective vaccines against some viruses, including influenza and HIV, is that these viruses mutate very rapidly. This allows them to evade the antibodies generated by a particular vaccine, through a process known as “viral escape.”
MIT researchers have now devised a new way to computationally model viral escape, based on models that were originally developed to analyze language. The model can predict which sections of viral surface proteins are more likely to mutate in a way that enables viral escape, and it can also identify sections that are less likely to mutate, making them good targets for new vaccines.
“Viral escape is a big problem,” says Bonnie Berger, the Simons Professor of Mathematics and head of the Computation and Biology group in MIT’s Computer Science and Artificial Intelligence Laboratory. “Viral escape of the surface protein of influenza and the envelope surface protein of HIV are both highly responsible for the fact that we don’t have a universal flu vaccine, nor do we have a vaccine for HIV, both of which cause hundreds of thousands of deaths a year.”
In a study appearing today in Science, Berger and her colleagues identified possible targets for vaccines against influenza, HIV, and SARS-CoV-2. Since that paper was accepted for publication, the researchers have also applied their model to the new variants of SARS-CoV-2 that recently emerged in the United Kingdom and South Africa. That analysis, which has not yet been peer-reviewed, flagged viral genetic sequences that should be further investigated for their potential to escape the existing vaccines, the researchers say.
Berger and Bryan Bryson, an assistant professor of biological engineering at MIT and a member of the Ragon Institute of MGH, MIT, and Harvard, are the senior authors of the paper, and the lead author is MIT graduate student Brian Hie.
The language of proteins
Different types of viruses acquire genetic mutations at different rates, and HIV and influenza are among those that mutate the fastest. For these mutations to promote viral escape, they must help the virus change the shape of its surface proteins so that antibodies can no longer bind to them. However, the protein can’t change in a way that makes it nonfunctional.
The MIT team decided to model these criteria using a type of computational model known as a language model, from the field of natural language processing (NLP). These models were originally designed to analyze patterns in language, specifically, the frequency which with certain words occur together. The models can then make predictions of which words could be used to complete a sentence such as “Sally ate eggs for …” The chosen word must be both grammatically correct and have the right meaning. In this example, an NLP model might predict “breakfast,” or “lunch.”
The researchers’ key insight was that this kind of model could also be applied to biological information such as genetic sequences. In that case, grammar is analogous to the rules that determine whether the protein encoded by a particular sequence is functional or not, and semantic meaning is analogous to whether the protein can take on a new shape that helps it evade antibodies. Therefore, a mutation that enables viral escape must maintain the grammaticality of the sequence but change the protein’s structure in a useful way.
“If a virus wants to escape the human immune system, it doesn’t want to mutate itself so that it dies or can’t replicate,” Hie says. “It wants to preserve fitness but disguise itself enough so that it’s undetectable by the human immune system.”
To model this process, the researchers trained an NLP model to analyze patterns found in genetic sequences, which allows it to predict new sequences that have new functions but still follow the biological rules of protein structure. One significant advantage of this kind of modeling is that it requires only sequence information, which is much easier to obtain than protein structures. The model can be trained on a relatively small amount of information — in this study, the researchers used 60,000 HIV sequences, 45,000 influenza sequences, and 4,000 coronavirus sequences.
“Language models are very powerful because they can learn this complex distributional structure and gain some insight into function just from sequence variation,” Hie says. “We have this big corpus of viral sequence data for each amino acid position, and the model learns these properties of amino acid co-occurrence and co-variation across the training data.”
Once the model was trained, the researchers used it to predict sequences of the coronavirus spike protein, HIV envelope protein, and influenza hemagglutinin (HA) protein that would be more or less likely to generate escape mutations.
For influenza, the model revealed that the sequences least likely to mutate and produce viral escape were in the stalk of the HA protein. This is consistent with recent studies showing that antibodies that target the HA stalk (which most people infected with the flu or vaccinated against it do not develop) can offer near-universal protection against any flu strain.
The model’s analysis of coronaviruses suggested that a part of the spike protein called the S2 subunit is least likely to generate escape mutations. The question still remains as to how rapidly the SARS-CoV-2 virus mutates, so it is unknown how long the vaccines now being deployed to combat the Covid-19 pandemic will remain effective. Initial evidence suggests that the virus does not mutate as rapidly as influenza or HIV. However, the researchers recently identified new mutations that have appeared in Singapore, South Africa, and Malaysia, that they believe should be investigated for potential viral escape (these new data are not yet peer-reviewed).
In their studies of HIV, the researchers found that the V1-V2 hypervariable region of the protein has many possible escape mutations, which is consistent with previous findings, and they also found sequences that would have a lower probability of escape.
The researchers are now working with others to use their model to identify possible targets for cancer vaccines that stimulate the body’s own immune system to destroy tumors. They say it could also be used to design small-molecule drugs that might be less likely to provoke resistance, for diseases such as tuberculosis.
“There are so many opportunities, and the beautiful thing is all we need is sequence data, which is easy to produce,” Bryson says.
The research was funded by a National Defense Science and Engineering Graduate Fellowship from the Department of Defense and a National Science Foundation Graduate Research Fellowship.