

One important clue in determining whether a given variant is benign, or at least not too deleterious, comes from comparing human genetics to the genetics of close relatives such as chimpanzees and other primates (Figure 12). Our genome closely resembles the genomes of other primates: it is 98.8% similar to the genome of chimpanzees, 98.4% similar to the genome of gorillas, and 97% similar to the genome of orangutans, for instance. Proteins, which are conserved by evolution, are even more similar on average. Our biology is also very similar, and when a mutation in a human protein is lethal or causes a serious genetic disease, the same mutation in the corresponding primate protein is likely to also be harmful. Conversely, protein variants that are observed in healthy primates are likely to be benign in humans as well. Therefore, the more primate genomes we can access, the more information we can gather about the human genome: we can compile a list of protein variants that are frequently observed in primates and deduce that these variants are likely benign in humans. Hence, the search for mutations that confer serious genetic disease should start from mutations not on this list.

Such a list of variants in primate proteins can never, by itself, be enough to classify human mutations as benign or pathogenic. Simply put, too many benign human mutations will not have had the opportunity to appear on the list of variants observed in primates. However, the list can be used in a more productive way: by learning the patterns within protein sequences and structures that tend to tolerate variants, and the patterns that tend not to. By learning to differentiate between these two classes of protein positions, we gain the ability to annotate protein variants as likely benign or likely pathogenic.

The Illumina AI lab headed by Kyle Farh, which developed the SpliceAI method, adopted this approach to annotate variants in human proteins (Gao et al. 2023). Initially, in collaboration with others, they collected primate blood samples and sequenced the genomes of as many primates as they could access, including 809 individuals from 233 distinct primate species. This sequencing effort is an important conservation initiative: some primate species are endangered, and preserving the wealth of genetic information in these species is crucial for basic science as well as for informing human genetics.

The team compiled a catalog of 4.3 million common protein variants in primates whose corresponding proteins are also present in humans. They then constructed a transformer that learns to distinguish between benign and pathogenic variants in human proteins, by learning the patterns of protein positions where primate variants tend to be present, in contrast to positions where they tend to be absent. The transformer, named PrimateAI-3D, is a new version of a previous deep learning tool, PrimateAI (Sundaram et al. 2018), developed by the same laboratory. PrimateAI-3D utilizes both protein sequence data and protein 3D models, the latter either experimentally determined or computationally predicted by tools like AlphaFold and HHpred, voxelized at 2 Angstrom resolution (Figure 13).
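
For intuition, here is a minimal sketch of such voxelization (my own illustration, not the PrimateAI-3D code; the function name, grid size, and channel scheme are assumptions): atom coordinates around a target residue are binned into a 3D grid at 2 Å resolution.

```python
import numpy as np

def voxelize_structure(atom_coords, atom_channels, n_channels, box_size=32, resolution=2.0):
    """Bin atom coordinates into a cubic voxel grid centered on the target residue.

    atom_coords: (N, 3) array of coordinates in Angstroms, already translated so
        that the target residue sits at the origin.
    atom_channels: (N,) integer array assigning each atom to a feature channel
        (for example, one channel per amino-acid type).
    """
    grid = np.zeros((n_channels, box_size, box_size, box_size), dtype=np.float32)
    # Convert coordinates to voxel indices at the chosen resolution (2 A here).
    idx = np.floor(atom_coords / resolution).astype(int) + box_size // 2
    # Keep only atoms that fall inside the box.
    inside = np.all((idx >= 0) & (idx < box_size), axis=1)
    for (x, y, z), channel in zip(idx[inside], atom_channels[inside]):
        grid[channel, x, y, z] += 1.0
    return grid

# Toy usage: three atoms, two channels, a 32^3 grid covering a 64 A cube.
coords = np.array([[0.0, 0.0, 0.0], [3.1, -2.0, 4.5], [40.0, 0.0, 0.0]])  # last atom falls outside
grid = voxelize_structure(coords, np.array([0, 1, 0]), n_channels=2)
print(grid.sum())  # 2.0: only the two in-box atoms are counted
```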

Figure 13. Architecture of PrimateAI-3D. Human protein structures are voxelized and, together with multiple sequence alignments, are passed as input to a 3D convolutional neural network that predicts the pathogenicity of all possible point mutations of a target residue. The network is trained using a loss function with three components: (1) a language model predicting a missing human or primate amino acid using the surrounding multiple alignment as input; (2) a 3D convolutional “fill-in-the-blank” model predicting a missing amino acid in the 3D structure; and (3) a language model score trained to classify between observed variants and random variants with matching statistical properties. Figure created by Tobias Hemp and included with permission.

In the ClinVar data set of human-annotated variants and their effects, PrimateAI-3D achieved 87.3% recall and 80.2% precision, with an AUC of 0.843, the best among state-of-the-art methods, even though, unlike other methods, it was not trained on ClinVar. Moreover, examining corrections made to ClinVar across its versions suggests that, for some proportion of the variants where PrimateAI-3D and ClinVar disagree, PrimateAI-3D may be making the correct call.

PrimateAI-3D can be applied to the diagnosis of rare disease, where it can prioritize variants that are likely deleterious and filter out likely benign ones. Another application is the discovery of genes associated with complex diseases: in a cohort of patients with a given disease, one can look for variants that are likely deleterious according to PrimateAI-3D, and then look for an abundance of such variants within a specific gene across the cohort. Genes that exhibit this pattern of being hit by many likely deleterious variants in patients with a given disease are said to carry a genetic “burden”, a signal that they play a role in the disease. Gao and colleagues from the PrimateAI-3D team studied several genetic diseases with this methodology and discovered many genes not previously known to be associated with these diseases. Using PrimateAI-3D, Fiziev et al. (2023) developed improved rare variant polygenic risk score (PRS) models to identify individuals at high disease risk. They also integrated PrimateAI-3D into rare variant burden tests within the UK Biobank and identified promising novel drug target candidates.
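
To illustrate the burden idea, here is a minimal sketch (not the statistical framework used in these papers): for a given gene, compare how many cases versus controls carry a variant scored as likely deleterious by a classifier such as PrimateAI-3D. The counts and threshold below are hypothetical.

```python
from scipy.stats import fisher_exact

def gene_burden_test(n_cases, n_controls, case_carriers, control_carriers):
    """2x2 Fisher's exact test: are carriers of likely-deleterious variants in
    this gene enriched among cases relative to controls?"""
    table = [
        [case_carriers, n_cases - case_carriers],
        [control_carriers, n_controls - control_carriers],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value

# Hypothetical numbers: 12 of 5,000 cases vs 3 of 50,000 controls carry a variant
# in this gene scored above a chosen pathogenicity threshold.
odds_ratio, p_value = gene_burden_test(5000, 50000, 12, 3)
print(odds_ratio, p_value)
```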

Modeling gene regulation

As outlined earlier, the intricate process of gene regulation encompasses many interacting molecular components: the DNA chromatin structure, the chemical alterations within histones that DNA wraps around, the attachment of transcription factors to promoters and enhancers, the establishment of 3D DNA structure involving promoters, enhancers, bound transcription factors, and the recruitment of RNA polymerase. Theoretically, the precise DNA sequence in the vicinity of a gene carries all the information needed for this machinery to be triggered at the correct time, in the right amount, and in the appropriate cell type. In practice, predicting gene expression from the DNA sequence alone is a formidable task. Yet, language models have recently achieved significant progress in this area.

Data generation informative of gene regulation. Over the past two decades, genomic researchers have undertaken monumental efforts to produce the appropriate types of large-scale molecular data for understanding gene regulation. Hundreds of different assays have been developed that inform various aspects of the central dogma, too numerous to detail here. Here are some examples of the information obtained, always tied to a human cell line or tissue type (the former often immortalized, and the latter often sourced from deceased donors): (1) Identifying the precise locations across the entire genome that have open chromatin and those that have tightly packed chromatin; two relevant assays for this are DNase-seq and ATAC-seq. (2) Pinpointing all locations in the genome where a specific transcription factor is bound. (3) Identifying all locations in the genome where a specific histone chemical modification has occurred. (4) Determining the level of mRNA available for a given gene, i.e., the expression level of a particular gene. This type of data has been obtained for hundreds of human and mouse cell lines and from numerous individuals. In total, several thousand such experiments have already been collected under multi-year international projects like ENCODE, modENCODE, Roadmap Epigenomics, Human Cell Atlas, and others. Each experiment, in turn, has tens to hundreds of thousands of data points across the entire human or model organism genome.

A lineage of language models, culminating in the transformer-based Enformer tool (Avsec et al. 2021), has been developed to accept the DNA sequence near a gene as input and output the cell type-specific expression level of that gene, for any gene in the genome. Enformer is trained on the following task: given a genome region of 100,000 nucleotides and a specific cell type, predict each of the available types of experimental data for this region, including the status of open or packed chromatin, the histone modifications present, the specific transcription factors bound, and the level of gene expression. A language model is ideal for this task: instead of masked language modeling, Enformer is trained in a supervised way, predicting all the tracks simultaneously from DNA sequence. By incorporating attention mechanisms, it can efficiently collate information from distant regions (up to 100,000 nucleotides away) to predict the status of a given location. In effect, Enformer learns the intricate correlations between these diverse molecular entities.
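
For a flavor of this setup, here is a minimal PyTorch sketch of a sequence-to-tracks model, a toy stand-in rather than the published Enformer architecture: a convolutional tower bins one-hot DNA, a transformer encoder shares information across bins through attention, and a linear head predicts one value per bin for each assay track. The bin size, model dimensions, and track count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SeqToTracks(nn.Module):
    """Toy sequence-to-tracks model: conv tower -> transformer -> per-bin heads."""

    def __init__(self, n_tracks, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.conv_tower = nn.Sequential(
            nn.Conv1d(4, d_model, kernel_size=15, padding=7),
            nn.GELU(),
            nn.MaxPool1d(kernel_size=128),  # 128 bp per bin (assumed)
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, n_tracks), nn.Softplus())

    def forward(self, one_hot_dna):              # (batch, 4, seq_len)
        x = self.conv_tower(one_hot_dna)         # (batch, d_model, n_bins)
        x = self.transformer(x.transpose(1, 2))  # attention across bins
        return self.head(x)                      # (batch, n_bins, n_tracks)

# Training compares predictions to observed read counts per bin with a Poisson loss.
model = SeqToTracks(n_tracks=5313)               # thousands of assay tracks (number illustrative)
dna = torch.zeros(1, 4, 98_304)                  # ~100 kb one-hot DNA input
pred = model(dna)                                # (1, 768, 5313)
loss = nn.PoissonNLLLoss(log_input=False)(pred, torch.ones_like(pred))
```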

Figure 14. Predictions of Enformer and an earlier system, Basenji2, compared to experimental results. Image included with permission from corresponding author, Ziga Avsec.

Enformer performs reasonably well in predicting gene expression from sequence alone. If we measure gene expression across all genes in the same cell line using a specific experimental assay (for instance, the CAGE assay), two replicates of the same experiment typically correlate at an average of 0.94. A computational method performing at this level could arguably reduce the need for collecting experimental data. Enformer doesn’t quite achieve this yet, correlating at a level of 0.85 with experimental data, which is about three times the error compared to two experimental replicates. However, this performance is expected to improve as more data are incorporated and enhancements are made to the model. Notably, Enformer can predict the changes in gene expression caused by mutations present in different individuals, as well as by mutations artificially introduced through CRISPR experiments. However, it still has limitations, such as performing poorly in predicting the effects of distal enhancers (enhancers far from the gene start) (Karollus et al. 2023) and in determining the direction of the effect of personal variants on gene expression (Sasse et al. 2023). Such shortcomings are likely due to insufficient training data. With data generation proceeding at an accelerated pace, it is not unreasonable to anticipate that in the foreseeable future we will have LLMs capable of predicting gene expression from sequence alone with experimental-level accuracy, and consequently models that accurately and comprehensively depict the complex molecular mechanisms involved in the central dogma of molecular biology.
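
For intuition, the comparison behind these numbers is essentially a Pearson correlation of (log-scaled) per-gene expression values, sketched below; the array names are hypothetical and the actual benchmarking protocol differs in detail.

```python
import numpy as np

def expression_correlation(expr_a, expr_b):
    """Pearson correlation of log-scaled per-gene expression values, the kind of
    metric behind the replicate (~0.94) vs. model (~0.85) comparison above."""
    return np.corrcoef(np.log1p(expr_a), np.log1p(expr_b))[0, 1]

# Hypothetical usage with arrays of per-gene CAGE counts:
#   r_replicates = expression_correlation(replicate_1, replicate_2)
#   r_model = expression_correlation(model_predictions, replicate_1)
```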

As discussed above, DNA within cells is arranged in a complex, hierarchical 3D chromatin structure, which plays a role in gene regulation because only genes within open chromatin are expressed. Orca (Zhou 2022) is a recent language model, based on a convolutional encoder-decoder architecture, that predicts 3D genome structure from DNA sequence, trained on proximity data provided by Hi-C experiments. These are genome-wide datasets from a cell line or tissue sample, in which pairs of genomic positions that lie close to each other in 3D are revealed as hybrid DNA fragments joining a piece of DNA from each region. The Orca model couples a hierarchical, multi-level convolutional encoder with a multi-level decoder, predicting DNA structure at nine levels of resolution, from 4 kb (kilobase pairs) to 1,024 kb, for input DNA sequences as long as the longest human chromosome.
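
As a toy illustration of the sequence-to-contact-map idea, here is a drastically simplified, single-resolution sketch under assumed dimensions, not Orca's hierarchical architecture: a 1D encoder summarizes the sequence into per-bin embeddings, and a small 2D decoder turns all pairs of bin embeddings into predicted contact scores.

```python
import torch
import torch.nn as nn

class SeqToContactMap(nn.Module):
    """Toy single-resolution sketch of sequence-to-Hi-C-contact-map prediction."""

    def __init__(self, d_model=64, bin_size=4000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(4, d_model, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=bin_size),   # one embedding per 4 kb bin
        )
        self.decoder = nn.Sequential(             # operates on pairwise bin features
            nn.Conv2d(2 * d_model, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),
        )

    def forward(self, one_hot_dna):                 # (batch, 4, seq_len)
        e = self.encoder(one_hot_dna)               # (batch, d, n_bins)
        n_bins = e.shape[-1]
        # Pair every bin with every other bin by broadcasting.
        rows = e.unsqueeze(-1).expand(-1, -1, n_bins, n_bins)
        cols = e.unsqueeze(-2).expand(-1, -1, n_bins, n_bins)
        pairs = torch.cat([rows, cols], dim=1)      # (batch, 2d, n_bins, n_bins)
        return self.decoder(pairs).squeeze(1)       # predicted contact matrix

model = SeqToContactMap()
contacts = model(torch.zeros(1, 4, 256_000))        # 256 kb window -> 64 x 64 contact map
```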

Foundation Models

Foundation models are large deep learning architectures, such as the transformer-based GPT models by OpenAI, that encode a vast amount of knowledge from diverse sources. Researchers and practitioners can fine-tune these pre-trained models for specific tasks, resulting in high-performance systems for a wide range of downstream applications. Several foundation models have begun to emerge in molecular biology. Here, we will briefly introduce two such models that have just appeared as preprints on bioRxiv. (Because the papers have not yet been peer reviewed, we refrain from reporting on their performance compared to other state-of-the-art methods.)

scGPT is a foundation model designed for single-cell transcriptomics, chromatin accessibility, and protein abundance. The model is trained on single-cell data from 10 million human cells, where each cell contains expression values for a fraction of the approximately 20,000 human genes. The model learns embeddings of this large cell × gene matrix, which provide insights into the underlying cellular states and active biological pathways. The authors innovatively adapted the GPT methodology to this vastly different setting (Figure 15). Unlike the ordering of words in a sentence, the ordering of genes in the genome is not particularly meaningful; so while GPT models are trained to predict the next word, the concept of a “next gene” is unclear in single-cell data. The authors solve this problem by training the model to generate data based on a gene prompt (a collection of known gene values) and a cell prompt. Starting from the known genes, the model predicts the remaining genes along with confidence values. Over K iterations, the predictions are divided into K confidence bins, and the top 1/K most confident genes are fixed as known genes for the next iteration. Once trained, scGPT is fine-tuned for numerous downstream tasks: batch correction, cell annotation (where the ground truth is annotated collections of different cell types), perturbation prediction (predicting the cell state after a given set of genes is experimentally perturbed), multiomic integration (where each layer, i.e., transcriptome, chromatin, and proteome, is treated as a different language), prediction of biological pathways, and more.
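
The iterative, confidence-binned generation scheme can be sketched as follows (illustrative pseudocode under assumed interfaces, not the authors' implementation; `model` here is a hypothetical callable that returns predictions and confidences for the still-unknown genes):

```python
import numpy as np

def iterative_generation(model, known_genes, known_values, all_genes, K=3):
    """Confidence-binned generation loop in the spirit of the scGPT description.
    `model` maps (dict of known gene -> value, list of unknown genes) to arrays
    of predicted values and confidences aligned with the unknown-gene list."""
    known = dict(zip(known_genes, known_values))
    for _ in range(K):
        unknown = [g for g in all_genes if g not in known]
        if not unknown:
            break
        preds, confs = model(known, unknown)
        # Fix the top 1/K most confident predictions as "known" for the next pass.
        n_fix = max(1, len(unknown) // K)
        most_confident = np.argsort(confs)[::-1][:n_fix]
        for i in most_confident:
            known[unknown[i]] = preds[i]
    return known
```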

Figure 15. Overview of scGPT. A. Workflow of scGPT. The model is trained on a large number of cells from cell atlases, and is then fine-tuned for downstream applications such as clustering, batch correction, cell annotation, perturbation prediction and gene network inference. B. Input embeddings. There are gene tokens, gene expression values, and condition tokens. C. The transformer layer. Image provided by Bo Wang.

Nucleotide Transformer is a foundation model that focuses on raw DNA sequences. These sequences are tokenized into words of six characters each (k-mers of length 6), and the model is trained using the BERT methodology. The training data consists of the reference human genome, 3,200 additional diverse human genomes to capture variation across human populations, and the genomes of 850 other species. The Nucleotide Transformer is then applied to 18 downstream tasks that encompass many of the previously discussed ones: promoter prediction, splice site donor and acceptor prediction, histone modifications, and more. Predictions are made either through probing, wherein embeddings at different layers are used as features for simple classifiers (such as logistic regression or perceptrons), or through light, computationally inexpensive fine-tuning.
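
To make the tokenization concrete, here is a small sketch of non-overlapping 6-mer tokenization and of the probing idea (illustrative only; the actual Nucleotide Transformer tokenizer also includes special tokens and handles ambiguous bases and sequence remainders, and the embedding step is left hypothetical):

```python
from itertools import product

# 6-mer vocabulary: 4^6 = 4,096 DNA "words".
KMER = 6
vocab = {"".join(k): i for i, k in enumerate(product("ACGT", repeat=KMER))}

def tokenize(seq):
    """Split a DNA sequence into non-overlapping 6-mer token ids."""
    return [vocab[seq[i:i + KMER]]
            for i in range(0, len(seq) - KMER + 1, KMER)
            if seq[i:i + KMER] in vocab]

print(tokenize("ACGTACGTTTGCAACGTA"))   # three 6-mer tokens

# Probing: embeddings from an intermediate transformer layer become features
# for a simple classifier, e.g. promoter vs. non-promoter (hypothetical step):
#   X = embed(sequences)   # (n_samples, d_model), assumed helper
#   clf = sklearn.linear_model.LogisticRegression(max_iter=1000).fit(X, y)
```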

Deciphering the biomolecular code that connects our genomes to the intricate biomolecular pathways in our body’s various cells, and subsequently to our physiology in combination with environmental interactions, doesn’t require AGI. While there are numerous AI tasks that may or may not be on the horizon, I argue that understanding molecular biology and linking it to human health isn’t one of them. LLMs are already proving adequate for this general aspiration.

Here are some tasks that we are not asking the AI to do. We aren’t asking it to generate new content; rather, we’re asking it to learn the complex statistical properties of existing biological systems. We aren’t requesting it to navigate intricate environments in a goal-oriented manner, maintain an internal state, form goals and subgoals, or learn through interaction with the environment. We aren’t asking it to solve mathematical problems or to develop deep counterfactual reasoning. We do, however, expect it to learn one-step causality relationships: if a certain mutation occurs, a specific gene malfunctions; if this gene is under-expressed, other genes in the cascade increase or decrease. Such one-step causal relationships can be learned by triangulating between correlations across modalities such as DNA variation, protein abundance and phenotype (a technique known as Mendelian randomization), and from large-scale perturbation experiments that are becoming increasingly common. Through them, LLMs will effectively model cellular states, connecting the genome at one end to the phenotype at the other.
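
For readers unfamiliar with Mendelian randomization, its simplest form is a one-line calculation, sketched below (a toy illustration with hypothetical numbers; real analyses combine many variants and test instrument validity):

```python
def wald_ratio(beta_variant_on_exposure, beta_variant_on_outcome):
    """Simplest Mendelian-randomization estimate (the Wald ratio): the implied
    causal effect of an exposure (e.g. protein abundance) on an outcome (e.g. a
    phenotype) is the variant's effect on the outcome divided by its effect on
    the exposure."""
    return beta_variant_on_outcome / beta_variant_on_exposure

# Hypothetical numbers: a variant raises a protein's abundance by 0.5 SD and the
# phenotype by 0.1 SD, implying a causal effect of ~0.2 SD per SD of protein.
print(wald_ratio(0.5, 0.1))
```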

In summary, today’s LLMs are sufficiently advanced to model molecular biology. Further methodological improvements are always welcome. However, the barrier is no longer deep learning methodology; the more significant gatekeeper is data.

Fortunately, data is becoming both cheaper and richer. Advances in DNA sequencing technology have reduced the cost of sequencing a human genome from $3 billion for the first genome, to roughly $1,000 a few years ago, and now to as low as $200 today. The same cost reductions apply to all molecular assays that use DNA sequencing as their primary readout. This includes assays for quantifying gene expression, chromatin structure, histone modifications, transcription factor binding, and hundreds of other ingenious assays developed over the past 10–20 years. Further innovations in single-cell technologies, as well as in proteomics, metabolomics, lipidomics, and other -omic assays, allow for increasingly detailed and efficient measurements of the various molecular layers between DNA and human physiology.

Figure 16. UK Biobank. The UK Biobank is a large-scale biomedical database and research resource, containing in-depth genetic and health information from around 500,000 UK volunteers. The participants were all between 40 and 69 years of age when they were recruited, between 2006 and 2010. The data collected include blood, urine and saliva samples, detailed information about the participants’ backgrounds, lifestyles and health, and subsequent medical histories accessed through health records. For a subset of participants, imaging data (brain, heart, abdomen, bones and joints) have also been collected. The exomes of 470,000 individuals were released in June 2022, and the whole genomes of all individuals are expected by the end of 2023. Images provided by UK Biobank and included with permission.

So, how can all this be put together? A key type of data initiative is one that brings together a large group of volunteer participants for deep exploration of their -omic data, phenotypes, and health records. A leading example is the UK Biobank (UKB), a large-scale biobank, biomedical database and research resource containing comprehensive genetic and health information from half a million UK participants (Figure 16). Participant biosamples have been collected with broad consent, and a wealth of data is continuously being generated. The exomes (protein-coding parts of the genome) of almost all participants have been released, with whole genomes to follow. In addition, many other data types are available, including COVID-19 antibody data, metabolomics, telomere measurements, imaging, genotypes, clinical measurements, primary care records, pain questionnaires, and more, with additional data types continuously added. UKB data are available to anyone for research purposes. All of Us is a similar initiative in the US, which to date has sequenced the genomes of 250,000 participants. FinnGen, a Finnish genomics initiative, aims to create a similar biobank of 500,000 participants, which is incredibly valuable because genetic studies turn out to be much easier in a cohort that is genetically more homogeneous. deCODE Genetics leads a similar effort in Iceland, with more than two-thirds of the country’s adult population participating. Additional cohorts of sequenced participants exist, including millions of exomes sequenced by Regeneron Pharmaceuticals (a private initiative), and many national initiatives worldwide.

Cancer, in particular, is a disease of the genome, and many companies are building a wealth of genomic and clinical information on cancer patients and cancer samples. Covering this field is beyond the scope of this article, but a few companies are worth mentioning. Tempus is an AI-based precision medicine company with a large and growing library of clinical and molecular data on cancer. Foundation Medicine is a molecular information company that offers comprehensive genomic profiling assays to identify the molecular alterations in a patient’s cancer and match them with relevant targeted therapies, immunotherapies, and clinical trials. GRAIL and Guardant Health are two pioneering diagnostic companies that focus on early tumor detection from “liquid biopsies,” the analysis of the genomic content of patient blood samples, which often contain molecular shedding of cancer cells. Each of these companies has data on large and growing cohorts of patients.

In addition to these cohort initiatives, there are numerous other large-scale data initiatives. Notably, the Human Cell Atlas project has already produced gene expression data for 42 million human cells from 6,300 donor individuals. The ENCODE Project, a vast functional genomics effort spanning hundreds of human cell lines and various molecular quantities, has generated data on gene expression, chromatin accessibility, transcription factor binding, histone marks, DNA methylation, and more.

LLMs are perfectly suited to integrate these data. Looking to the future, we could envision a mammoth LLM integrating across all such datasets. So, what might the architecture and training of such a model look like? Let’s engage in a thought experiment and try to piece it together (a toy tokenization sketch follows the list):

  • Genes in the genome, including important variants like different isoforms of the resulting proteins, are tokenized.
  • Different types of cells and tissues are tokenized.
  • Human phenotypes, such as disease states, clinical indications, and adherence to drug regimens, are also tokenized.
  • DNA sequences are tokenized at a fixed-length nucleotide level.
  • Positional information in the genome connects genes with nucleotide content.
  • Protein sequences are tokenized using the amino acid alphabet.
  • Data from the Human Cell Atlas and other single-cell datasets train the LLM in an autoregressive manner akin to GPT, or with masked language modeling akin to BERT, highlighting cell-type specific and cell-state specific gene pathways.
  • ENCODE and similar data teach the LLM to associate different molecular information layers like raw DNA sequence and its variants, gene expression, methylation, histone modifications, chromatin accessibility, etc., in a cell-type specific manner. Each layer is a distinct “language,” with varying richness and vocabulary, providing unique information. The LLM learns to translate between these languages.
  • Projects like PrimateAI-3D’s primate genomics initiative and other species sequencing efforts instruct the LLM about the potentially benign or harmful effects of mutations in the human genome.
  • Entire proteomes, including protein variants, are enriched with protein 3D structural information that is either experimentally obtained or predicted by AlphaFold, RoseTTAFold and other structure prediction methods.
  • Datasets from the UK Biobank (UKB) and other cohorts allow the LLM to associate genomic variant information and other molecular data with human health information.
  • The LLM leverages the complete clinical records of participants to understand common practice and its effects, and connect this with other “languages” across all datasets.
  • The LLM harnesses the vast existing literature on basic biology, genetics, molecular science, and clinical practice, including all known associations of genes and phenotypes.
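
To make the tokenization part of this thought experiment concrete, here is a toy sketch of how such a joint, multi-modal vocabulary might be assembled (all token names, modalities, and sizes are hypothetical placeholders, not a proposal from any published model):

```python
from itertools import product

# All token names, modalities, and sizes below are hypothetical placeholders.
special_tokens = ["<pad>", "<mask>", "<cell>", "<gene>", "<phenotype>", "<dna>"]
gene_tokens = [f"GENE_{i}" for i in range(20_000)]            # one token per gene
cell_type_tokens = [f"CELL_{n}" for n in ["hepatocyte", "neuron", "T_cell"]]
phenotype_tokens = [f"PHENO_{c}" for c in ["T2D", "CAD", "asthma"]]
dna_tokens = ["".join(k) for k in product("ACGT", repeat=6)]  # fixed-length DNA words
amino_acid_tokens = list("ACDEFGHIKLMNPQRSTVWY")              # protein alphabet

vocab = {tok: i for i, tok in enumerate(
    special_tokens + gene_tokens + cell_type_tokens
    + phenotype_tokens + dna_tokens + amino_acid_tokens)}

# A single training example could interleave modalities: a cell context, a gene,
# part of its local DNA sequence, and an associated phenotype.
example = ["<cell>", "CELL_hepatocyte", "<gene>", "GENE_42",
           "<dna>", "ACGTAC", "GTTTGC", "<phenotype>", "PHENO_T2D"]
token_ids = [vocab[t] for t in example]
print(token_ids)
```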

Developing such an LLM presents a significant challenge, of a different kind than that posed by the GPT line of LLMs. It requires technical innovation to represent and integrate the various information layers, as well as scaling up the number of tokens processed by the model. The potential applications of such an LLM are vast. To list a few:

  • Clinical diagnosis. It could leverage all available patient information, including their genome, other measurements, entire clinical history, and family health information, aiding doctors in making precise diagnoses, even for rare conditions. It could be particularly useful in diagnosing rare diseases and subtyping cancers.
  • Drug development. The LLM could help identify promising gene and pathway targets for different clinical indications, individuals likely to respond to certain drugs, and those unlikely to benefit, thereby increasing the success of clinical trials. It could also assist in drug molecule development and drug repurposing.
  • Basic molecular biology. Each layer of molecular information will be connected to the other layers in a manner similar to language translation, and the LLM can be probed for features that provide substantial predictive power. While interpretation of deep learning models remains a challenge, impressive advances are continuously being made by a research community eager to make AI interpretable. In the latest such advance by OpenAI, GPT-4 was deployed to explain the behavior of each of the neurons of GPT-2. (https://openai.com/research/language-models-can-explain-neurons-in-language-models)
  • Suggestions of additional experiments. The model can be leveraged to identify the “gaps” in the training data, in the form of cell types, molecular layers, or even individuals of specific genetic backgrounds or disease indications, that are predicted with poor confidence from the other data.

While developing these technologies, it’s essential to consider potential risks, including those related to patient privacy and clinical practice. Patient privacy remains a significant concern. This is especially true for LLMs because, depending on the capacity of the model, the data of participants used to train the model is in principle retrievable through a prompt that includes part of that data, or other information that homes in on a specific patient. Therefore, when training LLMs with participant data, it is especially important to have proper informed consent for the intended use of and access to these models.

However, many individuals, exemplified by the participants in the UK Biobank cohort, are motivated to share their data and biosamples generously, providing immense benefits for research and society. As for clinical practice, it’s unclear if LLMs can independently be used for diagnosis and treatment recommendations. The primary purpose of these models is not to replace, but to assist healthcare professionals, offering powerful tools that doctors can use to verify and audit medical information. To quote Isaac Kohane, “trust, but verify” (Lee, Goldberg, Kohane 2023).

So, what are the hurdles to fully implementing an LLM that bridges genetics, molecular biology, and human health? The main obstacle is data availability. The production of functional genomic data, such as those from ENCODE and the Human Cell Atlas, needs to be accelerated. Fortunately, the cost of generating such data is rapidly decreasing. Simultaneously, multiomic cohort and clinical data must be produced and made publicly accessible. This process requires participants’ consent, taking into account legitimate privacy concerns. However, alongside the inalienable right to privacy, there’s an equally important right to participant data transparency: many people want to contribute by sharing their data. This is especially true for patients with rare genetic diseases and cancer, who want to help other patients by contributing to the study of the disease and the development of treatments. The success of the UK Biobank is a testament to participants’ generosity in data sharing, aiming to make a positive impact on human health.

Molecular biology is not a set of neat concepts and clear principles, but a collection of trillions of little facts assembled over eons of trial and error. Human biologists excel in storytelling, putting these facts into descriptions and narratives that help with intuition and experimental planning. However, making biology into a computational science requires a combination of massive data acquisition and computational models of the right capacity to distill the trillions of biological facts from data. With LLMs and the accelerating pace of data acquisition, we are indeed a few years away from having accurate in silico predictive models of the primary biomolecular information highway that connects our DNA, cellular biology, and health. We can reasonably expect that over the next 5-10 years a wealth of biomedical diagnostic, drug discovery, and healthspan companies and initiatives will bring these models to application in human health and medicine, with enormous impact. We will also likely witness the development of open foundation models that integrate data spanning from genomes all the way to medical information. Such models will vastly accelerate research and innovation, and foster precision medicine.

I thank Eric Schadt and Bo Wang for numerous suggestions and edits to the document. I thank Anshul Kundaje, Bo Wang and Kyle Farh for providing thoughts, comments and figures. I thank Lukas Kuderna for creating the Primate Phylogeny figure for this manuscript. I am an employee of Seer, Inc.; however, all opinions expressed here are my own.

Avsec Ž et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods 2021.

Baek M et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021.

Baek M et al. Efficient and accurate prediction of protein structure using RoseTTAFold2. bioRxiv doi: https://doi.org/10.1101/2023.05.24.542179, 2023.

Bubeck S et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712, 2023.

Cui et al. scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI. bioRxiv https://doi.org/10.1101/2023.04.30.538439, 2023.

Dalla-Torre H et al. The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. bioRxiv https://doi.org/10.1101/2023.01.11.523679, 2023.

Devlin J et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018.

Fiziev P et al. Rare penetrant mutations confer severe risk of common diseases. Science 2023.

Gao et al. The landscape of tolerated genetic variation in humans and primates. Science 2023.

Jaganathan et al. Predicting splicing from primary sequence with deep learning. Cell 2019.

Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589, 2021.

Karollus et al. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biology 2023.

Kong et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature 2012.

Lee P, Goldberg C, Kohane I. The AI Revolution in Medicine: GPT-4 and Beyond. Pearson, 2023.

Lin Z et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023.

Lyayuga Lisanza S et al. Joint generation of protein sequence and structure with RoseTTAFold sequence space diffusion. bioRxiv https://doi.org/10.1101/2023.05.08.539766, 2023.

Sasse et al. How far are we from personalized gene expression prediction using sequence-to-expression deep neural networks? bioRxiv https://doi.org/10.1101/2023.03.16.532969, 2023.

Sundaram et al. Predicting the clinical impact of human mutation with deep neural networks. Nature Genetics 2018.

Varadi M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 2021.

Wang S et al. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLoS Computational Biology 2017.

Wolfram S. What is ChatGPT doing… and why does it work? Wolfram Media, Inc. 2023.

Zhou J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nature Genetics 2022.


