The adaptation of large language models (LLMs) to understanding protein sequences faces several challenges. One significant hurdle is the scarcity of direct pairings between protein sequences and text descriptions in existing datasets, which hinders the effective training and evaluation of LLMs for protein comprehension. In addition, the absence of comprehensive datasets that integrate protein sequences with textual content limits the use of multimodal large language models (MLLMs) in protein science.
Moreover, major protein databases such as UniProtKB and RefSeq struggle to fully represent protein diversity and to annotate data accurately; because they rely on community contributions and automated pipelines, they can contain biases and errors. Likewise, pathway and interaction resources such as KEGG and STRING, despite being comprehensive, are limited by biases, resource-intensive curation, and difficulties in integrating diverse data sources.
To overcome these challenges, researchers have developed the ProteinLMDataset and ProteinLMBench. The ProteinLMDataset provides 17.46 billion tokens for further self-supervised pretraining and 893,000 instructions for supervised fine-tuning of LLMs, while ProteinLMBench is the first benchmark of its kind, comprising 944 manually verified multiple-choice questions for assessing protein comprehension in LLMs. Together, the dataset and benchmark bridge the gap in protein-text data integration, enabling LLMs to understand protein sequences without extra encoders and to generate accurate protein knowledge using the novel Enzyme Chain of Thought (ECoT) approach.
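To make the pairing of sequences and text concrete, the following is a minimal Python sketch of how a supervised fine-tuning record might be structured and turned into a training prompt. The field names (`instruction`, `input`, `output`) and the record itself are illustrative assumptions, not the actual schema of the ProteinLMDataset.

```python
# A made-up record (not taken from the actual dataset) illustrating how a
# protein sequence can be paired with an instruction and a textual answer.
example = {
    "instruction": "Describe the catalytic function of the following enzyme.",
    "input": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ",
    "output": "A textual description of the protein's function would go here.",
}

# Standard instruction-tuning practice: fold the instruction and the raw
# amino-acid sequence into a single prompt, and train the LLM to produce
# the textual description as the target.
prompt = f"{example['instruction']}\n\nSequence: {example['input']}\n\nAnswer:"
target = example["output"]
print(prompt)
```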
The key parallel between protein sequences and natural language, which has driven advances in deep learning models, is that both are sequential in structure [6]. A protein sequence, like a sentence, consists of a specific order of units, here amino acids, that determines its structure and function, much as the order of words determines the meaning of a sentence.
This parallel has inspired the adaptation of deep learning models originally developed for natural language processing (NLP) to protein science [5]. For instance, large language models (LLMs) have been adapted to understand and generate protein sequences.
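To illustrate this parallel, the sketch below tokenizes an amino-acid sequence at the residue level, the same way a character-level language model tokenizes text. The vocabulary and mapping are illustrative only and do not reproduce the tokenizer of any specific protein language model.

```python
# Amino acids play the role of characters/words: a protein sequence can be
# tokenized and fed to a language model much like a sentence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
token_to_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def tokenize_protein(sequence: str) -> list[int]:
    """Map each residue to an integer token id, skipping unknown symbols."""
    return [token_to_id[aa] for aa in sequence if aa in token_to_id]

seq = "MKTAYIAKQR"
print(tokenize_protein(seq))  # -> [10, 8, 16, 0, 19, 7, 0, 8, 13, 14]
```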
However, a significant challenge in this adaptation remains the lack of direct correlations between protein sequences and text descriptions in existing datasets, which hinders the effective training and evaluation of LLMs for protein comprehension. Datasets such as the ProteinLMDataset address this by pairing protein sequences with textual content to strengthen LLMs' understanding of protein sequences.
Moreover, benchmarks such as ProteinLMBench make it possible to measure how well LLMs understand protein sequences. Together, these advances narrow the gap in protein-text data integration, allowing LLMs to interpret protein sequences without extra encoders and to generate accurate protein knowledge with approaches such as the Enzyme Chain of Thought (ECoT).
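As a rough illustration of how such a multiple-choice benchmark can be scored, the sketch below builds an ECoT-style prompt that asks the model to reason before answering, then extracts and scores the chosen option. The prompt wording, option format, and answer-extraction heuristic are assumptions for illustration and do not reproduce the exact ProteinLMBench evaluation protocol.

```python
import re

def build_ecot_prompt(question: str, options: list[str]) -> str:
    """Build an ECoT-style multiple-choice prompt (illustrative wording only)."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"{question}\n{lettered}\n"
        "Reason step by step about the enzyme's sequence, function and context, "
        "then give the final answer as a single letter.\nAnswer:"
    )

def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone option letter out of the model's response."""
    match = re.search(r"\b([A-F])\b", model_output)
    return match.group(1) if match else None

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions where the predicted letter matches the key."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Example usage with a made-up question (not from the real benchmark):
prompt = build_ecot_prompt(
    "Which enzyme class catalyzes the hydrolysis of peptide bonds?",
    ["Kinase", "Protease", "Ligase", "Isomerase"],
)
print(prompt)
print(accuracy(["B", "A"], ["B", "C"]))  # -> 0.5
```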
In conclusion, the parallels between protein sequences and natural language have opened up avenues for the application of NLP techniques in protein science, leading to advancements in our understanding and manipulation of proteins.