Accelerating Development of New Medicines: Artificial Intelligence System Rapidly Predicts How Proteins Will Attach

By Crystal Jones on February 5, 2022

This image shows one protein (in gray) docking with another protein (in purple) to form a protein complex. Equidock, the machine learning system the researchers developed, can directly predict a protein complex like this in a matter of seconds. Credit: Courtesy of the researchers

The machine-learning model could help scientists speed the development of new medicines.

Antibodies, small proteins produced by the immune system, can attach to specific parts of a virus to neutralize it. As scientists continue to battle SARS-CoV-2, the virus that causes Covid-19, one possible weapon is a synthetic antibody that binds with the virus’ spike proteins to prevent the virus from entering a human cell.

To develop a successful synthetic antibody, researchers must understand exactly how that attachment will happen. Proteins, with lumpy 3D structures containing many folds, can stick together in millions of combinations, so finding the right protein complex among almost countless candidates is extremely time-consuming.

To streamline the process, MIT researchers created a machine-learning model that can directly predict the complex that will form when two proteins bind together. Their technique is between 80 and 500 times faster than state-of-the-art software methods, and often predicts protein structures that are closer to actual structures that have been observed experimentally.

This technique could help scientists better understand some biological processes that involve protein interactions, like DNA replication and repair; it could also speed up the process of developing new medicines.

“Deep learning is very good at capturing interactions between different proteins that are otherwise difficult for chemists or biologists to write experimentally. Some of these interactions are very complicated, and people haven’t found good ways to express them. This deep-learning model can learn these types of interactions from data,” says Octavian-Eugen Ganea, a postdoc in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the paper.

Ganea’s co-lead author is Xinyuan Huang, a graduate student at ETH Zurich. MIT co-authors include Regina Barzilay, the School of Engineering Distinguished Professor for AI and Health in CSAIL, and Tommi Jaakkola, the Thomas Siebel Professor of Electrical Engineering in CSAIL and a member of the Institute for Data, Systems, and Society. The research will be presented at the International Conference on Learning Representations.

Protein attachment

The model the researchers developed, called Equidock, focuses on rigid body docking, which occurs when two proteins attach by rotating or translating in 3D space without their shapes squeezing or bending.

The model takes the 3D structures of two proteins and converts those structures into 3D graphs that can be processed by the neural network. Proteins are formed from chains of amino acids, and each of those amino acids is represented by a node in the graph.
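
As a rough illustration of turning a 3D structure into a graph, the sketch below treats each amino acid as a node and links it to its nearest neighbors in space. The article does not say how Equidock chooses edges, so the nearest-neighbor rule, the function name build_residue_graph, the parameter k, and the toy coordinates are all assumptions made for this example.

```python
import math

def build_residue_graph(coords, k=2):
    """Connect each residue to its k nearest residues by Euclidean distance.

    coords: list of (x, y, z) tuples, one per amino acid (hypothetical positions).
    Returns a dict mapping node index -> list of neighbor indices, nearest first.
    """
    graph = {}
    for i, ci in enumerate(coords):
        # Distance from residue i to every other residue.
        dists = [(math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Toy "protein": four residues on a line, with made-up coordinates.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.5, 0.0, 0.0), (6.0, 0.0, 0.0)]
toy_graph = build_residue_graph(toy_coords, k=2)
print(toy_graph)
```

A real pipeline would read coordinates from an experimental structure file and attach features to nodes and edges; this sketch only shows the node-per-amino-acid idea.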

The researchers incorporated geometric knowledge into the model, so it understands how objects can change if they are rotated or translated in 3D space. The model also has mathematical knowledge built in that ensures the proteins always attach in the same way, no matter where they exist in 3D space. This is how proteins dock in the human body.
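
One geometric fact behind this idea is that rotating and translating a rigid object leaves the distances between its points unchanged. The sketch below checks that on three made-up points. It is not Equidock's architecture, only a small demonstration of the kind of property such a model can rely on; the helper names and numbers are hypothetical.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def rigid_move(points, angle, shift):
    """Apply the same rotation and translation to every point."""
    moved = []
    for p in points:
        rx, ry, rz = rotate_z(p, angle)
        moved.append((rx + shift[0], ry + shift[1], rz + shift[2]))
    return moved

def pairwise_distances(points):
    """All distances between distinct pairs of points, in a fixed order."""
    return [math.dist(points[i], points[j])
            for i in range(len(points)) for j in range(i + 1, len(points))]

# Made-up coordinates standing in for a rigid protein.
toy_points = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.5), (-1.5, 0.3, 2.0)]
moved_points = rigid_move(toy_points, angle=0.7, shift=(10.0, -4.0, 3.0))

before = pairwise_distances(toy_points)
after = pairwise_distances(moved_points)
distances_preserved = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after))
print(distances_preserved)
```

Features built only from such distances give the same answer wherever the protein sits in space, which is one simple way to avoid depending on an arbitrary starting position.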

Using this information, the machine-learning system identifies atoms of the two proteins that are most likely to interact and form chemical reactions, known as binding-pocket points. Then it uses these points to place the two proteins together into a complex.

“If we can understand from the proteins which individual parts are likely to be these binding pocket points, then that will capture all the information we need to place the two proteins together. Assuming we can find these two sets of points, then we can just find out how to rotate and translate the proteins so one set matches the other set,” Ganea explains.
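
The last step Ganea describes, finding the rotation and translation that map one set of points onto the other, is a classic problem. One standard solution is the Kabsch algorithm, sketched below with NumPy. Whether Equidock uses exactly this procedure is not stated in the article, so treat this as an illustration of the idea; the function name kabsch_align and the sample points are assumptions, and the two point sets are assumed to be already paired row by row.

```python
import numpy as np

def kabsch_align(moving, target):
    """Return (R, t) minimizing the sum of ||R @ moving_i + t - target_i||^2.

    moving, target: arrays of shape (n, 3) whose rows are in correspondence.
    """
    moving = np.asarray(moving, dtype=float)
    target = np.asarray(target, dtype=float)
    mov_center = moving.mean(axis=0)
    tgt_center = target.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (moving - mov_center).T @ (target - tgt_center)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_center - R @ mov_center
    return R, t

# Made-up "binding pocket" points on one protein.
rng = np.random.default_rng(0)
pocket_a = rng.normal(size=(5, 3))

# The matching points on the other protein: the same shape, rotated and shifted.
angle = 0.9
true_R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
true_t = np.array([3.0, -1.0, 2.0])
pocket_b = pocket_a @ true_R.T + true_t

R, t = kabsch_align(pocket_a, pocket_b)
aligned = pocket_a @ R.T + t
print(np.allclose(aligned, pocket_b))
```

Applying the recovered R and t to the whole first protein, not just its pocket points, would move it into place against the second one.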

One of the biggest challenges of building this model was overcoming the lack of training data. Because so little experimental 3D data for proteins exists, it was especially important to incorporate geometric knowledge into Equidock, Ganea says. Without those geometric constraints, the model might pick up false correlations in the dataset.

Seconds vs. hours

Once the model was trained, the researchers compared it to four software methods. Equidock predicted the final protein complex in only one to five seconds. All the baselines took much longer, from 10 minutes to an hour or more.

In quality measures, which calculate how closely the predicted protein complex matches the actual protein complex, Equidock was often comparable with the baselines, but it sometimes underperformed them.
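
The article does not name the quality measures. A common way to score how closely a predicted structure matches an observed one is root-mean-square deviation (RMSD) between corresponding points, which is assumed here for illustration. The coordinates below are invented.

```python
import math

def rmsd(predicted, actual):
    """Root-mean-square deviation between two equally sized lists of 3D points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("point lists must be non-empty and the same length")
    squared = [math.dist(p, a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up coordinates: the prediction is the true complex shifted 2 units along z.
actual_complex = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
predicted_complex = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0)]
score = rmsd(predicted_complex, actual_complex)
print(score)  # 2.0
```

Lower values mean the prediction sits closer to the experimental structure; a perfect match scores zero.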

“We are still lagging behind one of the baselines. Our method can still be improved, and it can still be useful. It could be used in a very large virtual screening where we want to understand how thousands of proteins can interact and form complexes. Our method could be used to generate an initial set of candidates very fast, and then these could be fine-tuned with some of the more accurate, but slower, traditional methods,” Ganea says.

In addition to using this method with traditional models, the team wants to incorporate specific atomic interactions into Equidock so it can make more accurate predictions. For instance, sometimes atoms in proteins will attach through hydrophobic interactions, which involve water molecules.

Their technique could also be applied to the development of small, drug-like molecules, Ganea says. These molecules bind with protein surfaces in specific ways, so rapidly determining how that attachment occurs could shorten the drug development timeline.

In the future, they plan to enhance Equidock so it can make predictions for flexible protein docking. The biggest hurdle there is a lack of data for training, so Ganea and his colleagues are working to generate synthetic data they could use to improve the model.

Reference: “Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking” by Octavian-Eugen Ganea, Xinyuan Huang, Charlotte Bunne, Yatao Bian, Regina Barzilay, Tommi S. Jaakkola and Andreas Krause, 28 September 2021, ICLR 2022 Conference.
OpenReview

This work was funded, in part, by the Machine Learning for Pharmaceutical Discovery and Synthesis consortium, the Swiss National Science Foundation, the Abdul Latif Jameel Clinic for Machine Learning in Health, the DTRA Discovery of Medical Countermeasures Against New and Emerging (DOMANE) threats program, and the DARPA Accelerated Molecular Discovery program.

Source: SciTechDaily
