MIT Engineers Use AI to Optimize Protein Manufacturing in Industrial Yeast
MIT chemical engineers used a large language model to optimize protein production in industrial yeast. The model produced the best-performing sequences for five of six tested proteins, including a monoclonal antibody used to treat cancer, an advance that could reduce drug development costs.
MIT chemical engineers have harnessed artificial intelligence to optimize the development of new protein manufacturing processes in industrial yeast, which could reduce the overall costs of developing and manufacturing biopharmaceuticals.
Using a large language model (LLM), the MIT team analyzed the genetic code of the industrial yeast Komagataella phaffii — specifically, the codons that it uses. There are multiple possible codons, or three-letter DNA sequences, that can be used to encode a particular amino acid, and the patterns of codon usage are different for every organism.
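To make the scale of that choice concrete, the sketch below counts how many distinct DNA sequences could encode the same short protein. It is an illustration only, not part of the MIT work: it assumes the standard genetic code, and the function name `count_encodings` and the example peptide are hypothetical.

```python
from itertools import product
from math import prod

# Standard genetic code, written in the conventional T, C, A, G order.
# "*" marks a stop codon. (Assumption: the standard code applies.)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group synonymous codons: amino acid -> every codon that encodes it.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def count_encodings(protein):
    """Number of distinct DNA sequences that encode `protein`.

    `protein` is a string of one-letter amino acid codes (hypothetical helper).
    """
    return prod(len(SYNONYMS[aa]) for aa in protein)


if __name__ == "__main__":
    print(SYNONYMS["L"])            # the six codons for leucine
    print(count_encodings("MLK"))   # methionine, leucine, lysine
```

Because the count multiplies at every position, even a protein of a few hundred amino acids has far more possible encodings than could ever be tested in the lab, which is why a predictive model is useful.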
The new MIT model learned those patterns for K. phaffii and then used them to predict which codons would work best for manufacturing a given protein. This allowed the researchers to boost the efficiency of the yeast's production of six different proteins, including human growth hormone and a monoclonal antibody used to treat cancer.
Yeasts such as K. phaffii and Saccharomyces cerevisiae (baker's yeast) are the workhorses of the biopharmaceutical industry, producing billions of dollars of protein drugs and vaccines every year. To engineer yeast for industrial protein production, researchers take a gene from another organism, such as the insulin gene, and modify it so that the microbe will produce the encoded protein in large quantities. This requires coming up with an optimal DNA sequence for the yeast cells, integrating it into the yeast's genome, devising favorable growth conditions, and finally purifying the end product.
For new biologic drugs — large, complex drugs produced by living organisms — this development process might account for 15 to 20 percent of the overall cost of commercializing the drug.
The MIT team deployed a type of large language model known as an encoder-decoder. Instead of analyzing text, the researchers used it to analyze DNA sequences and learn the relationships between codons that are used in specific genes. Their training data, which came from a publicly available dataset from the National Center for Biotechnology Information, consisted of the amino acid sequences and corresponding DNA sequences for all of the approximately 5,000 proteins naturally produced by K. phaffii.
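For a rough sense of what "codon usage patterns" means, the sketch below computes the two simplest statistics one could extract from such a dataset: how often each codon appears, and how often each pair of adjacent codons appears. This is not the researchers' encoder-decoder model, which learns far richer relationships; it is a minimal counting example, and the function names and the toy sequences in the usage note are hypothetical.

```python
from collections import Counter


def split_codons(cds):
    """Split a coding DNA sequence into consecutive three-letter codons."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def codon_usage(sequences):
    """Fraction of all codons in `sequences` accounted for by each codon."""
    counts = Counter()
    for cds in sequences:
        counts.update(split_codons(cds.upper()))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_pair_counts(sequences):
    """Count how often each ordered pair of adjacent codons occurs."""
    pairs = Counter()
    for cds in sequences:
        codons = split_codons(cds.upper())
        pairs.update(zip(codons, codons[1:]))
    return pairs
```

Called on a list of coding sequences, for example `codon_usage(["ATGGCTGCTAAATAA", "ATGGCCGCTTAA"])`, the first function returns a dictionary of codon frequencies. Simple counts like these only see a codon and its immediate neighbor; the model described above is also meant to pick up relationships between codons that sit far apart in a gene.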
"The model learns the syntax or the language of how these codons are used," said the senior author of the study, a professor of chemical engineering at MIT and a member of the Koch Institute for Integrative Cancer Research. "It takes into account how codons are placed next to each other, and also the long-distance relationships between them."
Once the model was trained, the researchers asked it to optimize the codon sequences of six different proteins, including human growth hormone, human serum albumin, and trastuzumab, a monoclonal antibody used to treat cancer. They also generated optimized sequences of these proteins using four commercially available codon optimization tools. The researchers inserted each of these sequences into K. phaffii cells and measured how much of the target protein each sequence generated. For five of the six proteins, the sequences from the new MIT model worked the best.
The study appears this week in the Proceedings of the National Academy of Sciences.