Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking
ICML 2024 workshop Jul 2024
Generative Model for Small Molecules with Latent Space RL Fine-Tuning to Protein Targets
ICML 2024 workshop Jul 2024
Figure 1. Schematic representation of the model's architecture. A sequence of N tokens is passed to a transformer encoder. The encoded embeddings of shape N × E are either passed directly to the mean and logvar layers (path 1) or first passed through a perceiver resampler that maps them to a reduced shape of LS × LE (path 2). The mean and logvar layers are linear layers applied independently at each sequence position. The reparametrised embeddings are then passed to the decoder transformer, which attends to them in its cross-attention layers.
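A minimal sketch of the per-position reparameterisation path described in the figure, assuming PyTorch-style modules; this is not the authors' code, and the module and argument names (`LatentReparam`, `embed_dim`) are illustrative.

```python
import torch
import torch.nn as nn

class LatentReparam(nn.Module):
    """Maps encoder embeddings (batch, N, E) to sampled latent embeddings of the same shape."""
    def __init__(self, embed_dim: int):
        super().__init__()
        # Linear mean/logvar layers, applied independently to each sequence position.
        self.mean = nn.Linear(embed_dim, embed_dim)
        self.logvar = nn.Linear(embed_dim, embed_dim)

    def forward(self, encoded: torch.Tensor):
        mu = self.mean(encoded)
        logvar = self.logvar(encoded)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterisation trick
        return z, mu, logvar

# Usage sketch: embeddings from a transformer encoder, shape (batch, N, E).
z, mu, logvar = LatentReparam(embed_dim=256)(torch.randn(2, 10, 256))
```

The sampled embeddings `z` would then serve as the cross-attention keys/values for the decoder transformer, as in the architecture above.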
Likelihood-based fine-tuning of protein language models for few-shot fitness prediction and design
ICML 2024 workshop Jul 2024
Figure 1. Fraction of the top 30% of sequences in the initial candidate pool retrieved by the optimisation process, as a function of the number of optimisation rounds, for both single- and multi-mutant landscapes. Across both sets of landscapes, the PoET ranking ensemble outperforms all other methods, and the design curves show trends similar to the supervised results.
A large language foundational model for edible plant genomes
Communications Biology Jul 2024
AgroNT: a large language model that integrates genomes across plant species. The Agronomic Nucleotide Transformer (AgroNT) is a transformer-based DNA language model that learns general nucleotide sequence representations from genomic DNA of 48 plant species (Fig. 1a; Methods). Building on previous work [23], pre-training performs masked language modelling (MLM) on DNA sequences of roughly 6,000 base pairs: the tokenizer splits each sequence into 6-mers, treating each 6-mer as a token, and masks 15% of the tokens for prediction (Fig. 1b; Methods). Fine-tuning is parameter-efficient, using the IA3 technique [30]: the language-model head is replaced with a classification or regression head depending on the task, while the transformer and embedding layers are kept frozen, or a small number of final layers are unfrozen to reduce training time for specific downstream tasks (Fig. 1c; Methods).
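A minimal sketch of the 6-mer tokenisation and 15% masking scheme described above, not the released AgroNT code; the function names (`kmer_tokenize`, `mask_tokens`) and the `[MASK]` string are illustrative assumptions.

```python
import random

def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mer tokens."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]

def mask_tokens(tokens: list[str], mask_prob: float = 0.15, mask_token: str = "[MASK]"):
    """Randomly mask ~mask_prob of the tokens; return masked tokens and MLM labels."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)    # the model must predict the original 6-mer here
        else:
            masked.append(tok)
            labels.append(None)   # position does not contribute to the MLM loss
    return masked, labels

tokens = kmer_tokenize("ATGCGTACGTTAGCATGCGTACGT")
masked_tokens, labels = mask_tokens(tokens)
```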
Machine Learning of Force Fields for Molecular Dynamics Simulations of Proteins at DFT Accuracy
ICLR 2024 GEM Workshop May 2024