AI trained on bacterial genomes produces never-before-seen proteins

The researchers say this setup allows Evo to “link nucleotide-level patterns to genomic context at the kilobase scale.” In other words, if you feed it a large chunk of genomic DNA, Evo can interpret it the way an LLM interprets a query and produce output that, in a genomic sense, fits that interpretation.

The researchers reasoned that, because Evo was trained on bacterial genomes, they could use a known gene as a prompt, and Evo should produce output that includes protein-coding regions with related functions. The key question is whether it would simply output sequences for proteins we already know about or whether the results would be less predictable.

New proteins

To begin testing the system, the researchers prompted it with fragments of the genes for known proteins and checked whether Evo could complete them. In one example, given 30 percent of the gene sequence for a known protein, Evo could output 85 percent of the rest. When prompted with 80 percent of the sequence, it could return all of the missing sequence. When a single gene was removed from a functional cluster, Evo could also correctly identify and restore the missing gene.
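For readers who want a concrete picture of a completion test like this, here is a minimal sketch in Python. It assumes that "recovered" means a position-by-position match against the held-out part of the gene, which the article does not specify; the function name `recovered_fraction` and the toy sequence are hypothetical, and nothing here calls Evo itself.

```python
def recovered_fraction(true_gene: str, prompt_percent: int, completion: str) -> float:
    """Share of the held-out part of true_gene that completion reproduces.

    Assumption: recovery is scored as simple per-position identity.
    The researchers' actual metric may differ.
    """
    # The first prompt_percent of the gene is the prompt; the rest is held out.
    cut = len(true_gene) * prompt_percent // 100
    held_out = true_gene[cut:]
    if not held_out:
        return 1.0  # nothing was hidden, so nothing is missing
    # Count positions where the model's completion matches the hidden sequence.
    matches = sum(a == b for a, b in zip(held_out, completion))
    return matches / len(held_out)


# Toy example: a 20-base "gene" with 30 percent (6 bases) given as the prompt.
gene = "ATGGCTAAAGGTCTGACCGA"
model_output = "AAAGGTCACTGGCT"  # stand-in for a model completion
score = recovered_fraction(gene, 30, model_output)
```

In this toy case, the first seven of the 14 held-out bases match and the rest do not, so the score is 0.5. A real evaluation would more likely align the sequences first, since a single inserted or deleted base would otherwise throw off every position after it.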

The large amount of training data also allowed Evo to correctly identify the most important regions of a protein. If it made changes to the sequence, they were usually in regions of the protein where variation is tolerated. In other words, training had allowed the system to incorporate the rules of evolutionary constraint on changes in known genes.

So the researchers decided to test what happened when Evo was asked to come up with something new. To do this, they used bacterial toxins, which are usually encoded along with an antitoxin that keeps the cell from killing itself whenever it activates the genes. There are many such examples, and they tend to evolve rapidly as part of an arms race between bacteria and their competitors. So the team developed a toxin that was only loosely related to known ones and had no known antitoxin, and gave its sequence to Evo as a prompt. And this time, they filtered out any responses that looked similar to known antitoxin genes.
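The filtering step can be pictured with a short sketch. This is only an illustration under stated assumptions: it uses Python's standard-library `difflib.SequenceMatcher` as a crude stand-in for whatever sequence-similarity search the researchers actually ran, and the function name `filter_novel`, the 0.5 cutoff, and the toy sequences are all hypothetical.

```python
from difflib import SequenceMatcher


def filter_novel(candidates: list[str], known: list[str],
                 max_similarity: float = 0.5) -> list[str]:
    """Keep only candidates that are not too similar to any known sequence.

    Assumption: similarity is difflib's ratio (0.0 to 1.0), a rough
    string-level measure, not a biological alignment score.
    """
    novel = []
    for candidate in candidates:
        # Highest similarity between this candidate and any known sequence.
        best = max(
            (SequenceMatcher(None, candidate, k).ratio() for k in known),
            default=0.0,
        )
        if best <= max_similarity:
            novel.append(candidate)
    return novel


# Toy data: one output copies a known sequence, the other shares nothing with it.
known_antitoxins = ["AAACCCAAA"]
model_outputs = ["AAACCCAAA", "GGGTTTGGG"]
kept = filter_novel(model_outputs, known_antitoxins)
```

Here the exact copy is discarded and only the unrelated sequence is kept. The point of a filter like this is that whatever survives cannot be a near-copy of something already in the databases.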
