MIT Biologists Glean New Insight Into Repetitive Protein Sequences

MIT researchers used a technique called dot-plot matrix, which is a way to visually represent amino acid sequences, to compare protein sequences known as “low-complexity regions” across many different species. Credit: Courtesy of the researchers, and edited by MIT News.

Computational analysis reveals that many repetitive sequences are shared across proteins and are similar in species from bacteria to humans.

Approximately 70 percent of all human proteins include at least one sequence consisting of a single amino acidAny substance that when dissolved in water, gives a pH less than 7.0, or donates a hydrogen ion.” data-gt-translate-attributes=”[{“attribute”:”data-cmtooltip”, “format”:”html”}]”>acid repeated many times, with a few other amino acids<div class="cell text-container large-6 small-order-0 large-order-1">
<div class="text-wrapper"><br />Amino acids are a set of organic compounds used to build proteins. There are about 500 naturally occurring known amino acids, though only 20 appear in the genetic code. Proteins consist of one or more chains of amino acids called polypeptides. The sequence of the amino acid chain causes the polypeptide to fold into a shape that is biologically active. The amino acid sequences of proteins are encoded in the genes. Nine proteinogenic amino acids are called "essential" for humans because they cannot be produced from other compounds by the human body and so must be taken in as food.<br /></div>
</div>” data-gt-translate-attributes=”[{“attribute”:”data-cmtooltip”, “format”:”html”}]”>amino acids sprinkled in. These “low-complexity regions” (LCRs) are also found in the proteins of most other organisms.

Although the proteins that contain these sequences have many different functions, MITMIT is an acronym for the Massachusetts Institute of Technology. It is a prestigious private research university in Cambridge, Massachusetts that was founded in 1861. It is organized into five Schools: architecture and planning; engineering; humanities, arts, and social sciences; management; and science. MIT's impact includes many scientific breakthroughs and technological advances. Their stated goal is to make a better world through education, research, and innovation.” data-gt-translate-attributes=”[{“attribute”:”data-cmtooltip”, “format”:”html”}]”>MIT biologists have now come up with a way to identify and analyze them as a unified group. Their technique allows them to examine similarities and differences between LCRs from different species, and helps them to resolve the functions of these sequences and the proteins in which they are found.

Using their technique, the scientists analyzed all of the proteins found in eight different species, from bacteria to humans. They discovered that while LCRs can vary between proteins and species, they often share a similar role — helping the protein in which they’re found to join a larger-scale assembly such as the nucleolus, an organelle found in nearly all human cells.

“Instead of looking at specific LCRs and their functions, which might seem separate because they’re involved in different processes, our broader approach allows us to see similarities between their properties, suggesting that maybe the functions of LCRs aren’t so disparate after all,” says Byron Lee, an MIT graduate student.

Differences between LCRs of different species were also found by the research team. They showed that these species-specific LCR sequences correspond to species-specific functions, such as forming plant cell walls.

Lee and graduate student Nima Jaberi-Lashkari are the lead authors of the study, which was recently published in the journal eLife. Eliezer Calo, an assistant professor of biology at MIT, is the senior author of the paper.

Using computational analysis, researchers have found that many repetitive sequences are shared across proteins and are similar in species from bacteria to humans. Credit: Courtesy of the researchers

Large-scale study

Previous research revealed that LCRs are involved in a variety of cellular processes, including cell adhesion and DNADNA, or deoxyribonucleic acid, is a molecule composed of two long strands of nucleotides that coil around each other to form a double helix. It is the hereditary material in humans and almost all other organisms that carries genetic instructions for development, functioning, growth, and reproduction. Nearly every cell in a person’s body has the same DNA. Most DNA is located in the cell nucleus (where it is called nuclear DNA), but a small amount of DNA can also be found in the mitochondria (where it is called mitochondrial DNA or mtDNA).” data-gt-translate-attributes=”[{“attribute”:”data-cmtooltip”, “format”:”html”}]”>DNA binding. These LCRs are often rich in a single amino acid such as alanine, lysine, or glutamic acid.

Finding these sequences and then studying their functions individually is a time-consuming process, so the scientists decided to use bioinformatics — an approach that uses computational methods to analyze large sets of biological data — to evaluate them as a larger group.

Bioinformatics is a relatively new scientific subdiscipline that incorporates elements of biology and computer science together for the purpose of developing efficient and robust methods for the analyses and interpretation of large amounts of biological data, such as DNA, RNA

Ribonucleic acid (RNA) is a polymeric molecule similar to DNA that is essential in various biological roles in coding, decoding, regulation and expression of genes. Both are nucleic acids, but unlike DNA, RNA is single-stranded. An RNA strand has a backbone made of alternating sugar (ribose) and phosphate groups. Attached to each sugar is one of four bases—adenine (A), uracil (U), cytosine (C), or guanine (G). Different types of RNA exist in the cell: messenger RNA (mRNA), ribosomal RNA (rRNA), and transfer RNA (tRNA).

” data-gt-translate-attributes=”[{“attribute”:”data-cmtooltip”, “format”:”html”}]”>RNA, and amino acid sequences or annotations about those sequences.

“What we wanted to do is take a step back and instead of looking at individual LCRs, to try to take a look at all of them and to see if we could observe some patterns on a larger scale that might help us figure out what the ones that have assigned functions are doing, and also help us learn a bit about what the ones that don’t have assigned functions are doing,” Jaberi-Lashkari says.

To do that, the MIT team used a technique called dot-plot matrix (see image at the top of the page), which is a way to visually represent amino acid sequences, to generate images of each protein under study. Next, they used computational image processing methods to compare thousands of these matrices at the same time.

Using this technique, the researchers were able to categorize LCRs based on which amino acids were most frequently repeated in the LCR. They also grouped LCR-containing proteins by the number of copies of each LCR type found in the protein. Analyzing these traits helped the researchers to learn more about the functions of these LCRs.

As one demonstration, the team of researchers picked out a human protein, known as RPA43, which has three lysine-rich LCRs. This protein is one of many subunits that make up an enzyme called RNA polymerase 1, which synthesizes ribosomal RNA. The scientists discovered that the copy number of lysine-rich LCRs is important for helping the protein integrate into the nucleolus, the organelle responsible for synthesizing ribosomes.

Biological assemblies

In a comparison of the proteins found in eight different species, the researchers found that some LCR types are highly conserved between species, meaning that the sequences have changed very little over evolutionary timescales. These sequences tend to be found in proteins and cell structures that are also highly conserved, such as the nucleolus.

“These sequences seem to be important for the assembly of certain parts of the nucleolus,” Lee says. “Some of the principles that are known to be important for higher order assembly seem to be at play because the copy number, which might control how many interactions a protein can make, is important for the protein to integrate into that compartment.”

The MIT team also found differences between LCRs seen in two different types of proteins that are involved in nucleolus assembly. They discovered that a nucleolar protein known as TCOF contains many glutamine-rich LCRs that can help scaffold the formation of assemblies, while nucleolar proteins with only a few of these glutamic acid-rich LCRs could be recruited as clients (proteins that interact with the scaffold).

Another structure that appears to have many conserved LCRs is the nuclear speckle, which is found inside the cell nucleus. The researchers also found many similarities between LCRs that are involved in forming larger-scale assemblies such as the extracellular matrix, a network of molecules that provides structural support to cells in plants and animals.

The research team also found examples of structures with LCRs that seem to have diverged between species. For example, plants have distinctive LCR sequences in the proteins that they use to scaffold their cell walls, and these LCRs are not seen in other types of organisms.

Now the researchers plan to expand their LCR analysis to additional species.

“There’s so much to explore, because we can expand this map to essentially any species,” Lee says. “That gives us the opportunity and the framework to identify new biological assemblies.”

Reference: “A unified view of low complexity regions (LCRs) across species” by Byron Lee, Nima Jaberi-Lashkari and Eliezer Calo, 13 September 2022, eLife.
DOI: 10.7554/eLife.77058

The research was funded by the National Institute of General Medical Sciences, National Cancer Institute, the Ludwig Center at MIT, a National Institutes of Health Pre-Doctoral Training Grant, and the Pew Charitable Trusts.

Source: SciTechDaily

MIT Biologists Glean New Insight Into Repetitive Protein Sequences

Large-scale study

Biological assemblies

Related