Democratizing protein language model training, sharing and collaboration
