Research Papers Archive | InstaDeep - Decision-Making AI For The Enterprise

Protein Sequence Modelling with Bayesian Flow Networks

Sep 2024

Application of a Bayesian Flow Network (BFN) to protein sequence modelling. BFNs update the parameters of the data distribution, 𝜃, using Bayesian inference given a noised observation, y, of a data sample. When applied to protein sequence modelling, the distribution over the data is given by separate categorical distributions over the possible tokens (all amino acids and special tokens) at each sequence index. During training, ‘Alice’ knows a ground truth data point x, and so 𝜃 can be directly updated using noised observations of x. ‘Bob’ trains a neural network to predict the ‘sender’ distribution from which Alice is sampling these observations at each step (i.e. to predict the noised ground truth). During inference, when Alice is not present, Bob replaces noised observations of the ground truth with samples from the ‘receiver’ distribution predicted by the network.
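
The Bayesian update described above can be illustrated with a minimal NumPy sketch. It assumes the standard discrete-data BFN formulation, in which the sender adds Gaussian noise to a scaled one-hot encoding and the update multiplies 𝜃 by exp(y) before renormalising. The vocabulary size, accuracy value `alpha`, and function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sender_sample(x, num_classes, alpha, rng):
    """Noised observation of token indices x: y ~ N(alpha*(K*e_x - 1), alpha*K*I)."""
    one_hot = np.eye(num_classes)[x]                    # shape (L, K)
    mean = alpha * (num_classes * one_hot - 1.0)
    std = np.sqrt(alpha * num_classes)
    return mean + std * rng.standard_normal(mean.shape)

def bayesian_update(theta, y):
    """Per-position update: theta'_k is proportional to exp(y_k) * theta_k."""
    logits = np.log(theta) + y
    logits -= logits.max(axis=-1, keepdims=True)        # numerical stability
    new_theta = np.exp(logits)
    return new_theta / new_theta.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
K = 4                                  # toy vocabulary size (assumption)
x = np.array([2, 0, 3])                # toy "ground truth" token indices
theta = np.full((len(x), K), 1.0 / K)  # uniform prior at every position
for _ in range(50):
    y = sender_sample(x, K, alpha=0.5, rng=rng)
    theta = bayesian_update(theta, y)
```

After enough noised observations, each row of `theta` concentrates on the true token. At inference time the observations would instead come from the network's predicted distribution, which this sketch does not model.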

SMX: Sequential Monte Carlo Planning for Expert Iteration

ICML 2024 Jul 2024

Figure 1: Diagram depicting a representation of SMX search from left to right. N rollouts are executed in parallel according to πθ (the sampling policy β). At each step in the environment, the particle weights are adjusted, as indicated by the particle sizes. We depict two resampling zones where particles are resampled (favouring higher weights) and weights are reset. Finally, an improved policy π′ = Î_β π is constructed from the initial actions of the remaining particles, furthest to the right. This improved policy is then used to update πθ.
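
The resampling step and the construction of a policy from the particles' initial actions can be sketched as follows. This is a toy illustration under simplifying assumptions, not the SMX implementation: each particle is reduced to the first action it took, and the function names are hypothetical.

```python
import numpy as np

def normalise(log_weights):
    """Turn log-weights into probabilities in a numerically stable way."""
    w = np.exp(log_weights - np.max(log_weights))
    return w / w.sum()

def resample(first_actions, log_weights, rng):
    """Resample particles in proportion to their weights, then reset the weights."""
    n = len(first_actions)
    idx = rng.choice(n, size=n, p=normalise(log_weights))
    return first_actions[idx], np.zeros(n)

def improved_policy(first_actions, log_weights, num_actions):
    """Weighted distribution over the first actions of the surviving particles."""
    return np.bincount(first_actions, weights=normalise(log_weights),
                       minlength=num_actions)

rng = np.random.default_rng(0)
first_actions = np.array([0, 1, 2, 1])                     # first action of each particle
log_weights = np.array([-1000.0, 0.0, -1000.0, -1000.0])   # particle 1 dominates
first_actions, log_weights = resample(first_actions, log_weights, rng)
policy = improved_policy(first_actions, log_weights, num_actions=3)
```

Because one particle carries essentially all of the weight, every resampled particle descends from it and the resulting policy puts all of its mass on that particle's first action.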

Multi-Objective Quality-Diversity for Crystal Structure Prediction

GECCO 2024 Jul 2024

Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking

Shikha Surana | Nathan Grinsztajn | Timothy Atkinson | Paul Duckworth | Thomas D. Barrett

ICML 2024 workshop Jul 2024

Generative Model for Small Molecules with Latent Space RL Fine-Tuning to Protein Targets

ICML 2024 workshop Jul 2024

Figure 1. Schematic representation of our model’s architecture. A sequence of N tokens is passed as input to our encoder, which is a transformer model. The output encoded embeddings of shape N × E are either passed directly to the mean and logvar layers (path 1), or they are first passed to the perceiver resampler layer, which maps the encoded embeddings to a reduced dimension of shape L_S × L_E (path 2). The mean and logvar layers are linear layers that are applied independently to each sequence dimension. The final reparametrised embeddings are then passed to the decoder transformer model to be used as encoder embeddings in the decoder’s cross-attention layers.
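
The mean and logvar layers followed by reparametrisation can be sketched in NumPy as below. This assumes the usual variational-autoencoder reparametrisation trick (z = mean + exp(0.5 · logvar) · ε). The weight names, latent width `Z`, and random initialisation are illustrative assumptions, and the transformer encoder, perceiver resampler, and decoder are not modelled.

```python
import numpy as np

def reparametrise(embeddings, w_mean, b_mean, w_logvar, b_logvar, rng):
    """Apply position-wise linear mean/logvar layers, then sample z."""
    mean = embeddings @ w_mean + b_mean          # (N, Z): same weights at every position
    logvar = embeddings @ w_logvar + b_logvar    # (N, Z)
    eps = rng.standard_normal(mean.shape)        # noise drawn independently of the weights
    z = mean + np.exp(0.5 * logvar) * eps        # reparametrised sample
    return z, mean, logvar

rng = np.random.default_rng(0)
N, E, Z = 5, 8, 3                                # toy sizes (assumptions)
embeddings = rng.standard_normal((N, E))         # stand-in for encoder output
w_mean, b_mean = rng.standard_normal((E, Z)), np.zeros(Z)
w_logvar, b_logvar = rng.standard_normal((E, Z)) * 0.1, np.zeros(Z)
z, mean, logvar = reparametrise(embeddings, w_mean, b_mean, w_logvar, b_logvar, rng)
```

Writing the sample as a deterministic function of `mean`, `logvar`, and external noise is what allows gradients to flow through both linear layers during training.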

Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs

Andries Petrus Smit | Nathan Grinsztajn | Paul Duckworth | Thomas D. Barrett | Arnu Pretorius

ICML 2024 Jul 2024

Recent advancements in large language models (LLMs) underscore their potential for responding to inquiries in various domains. However, ensuring that generative agents provide accurate and reliable answers remains an ongoing challenge. In this context, multi-agent debate (MAD) has emerged as a promising strategy for enhancing the truthfulness of LLMs. We benchmark a range of debating and prompting strategies to explore the trade-offs between cost, time, and accuracy. Importantly, we find that multi-agent debating systems, in their current form, do not reliably outperform other proposed prompting strategies, such as self-consistency and ensembling using multiple reasoning paths. However, when performing hyperparameter tuning, several MAD systems, such as Multi-Persona, perform better. This suggests that MAD protocols might not be inherently worse than other approaches, but that they are more sensitive to different hyperparameter settings and difficult to optimize. We build on these results to offer insights into improving debating strategies, such as adjusting agent agreement levels, which can significantly enhance performance and even surpass all other non-debate protocols we evaluated. We provide an open-source repository to the community with several state-of-the-art protocols together with evaluation scripts to benchmark across popular research datasets.
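
For readers unfamiliar with the self-consistency baseline mentioned above, the idea is to sample several reasoning paths and return the most frequent final answer. The sketch below shows only the voting step. The hard-coded answers are stand-ins for sampled LLM outputs, and the function name is hypothetical rather than taken from the paper's repository.

```python
from collections import Counter

def self_consistency(final_answers):
    """Return the most common final answer and the fraction of paths agreeing with it."""
    counts = Counter(answer.strip().lower() for answer in final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)

# Stand-ins for the final answers extracted from five sampled reasoning paths.
answer, agreement = self_consistency(["42", "42 ", "41", "42", "40"])
```

The agreement fraction is a convenient by-product: low agreement signals that the sampled reasoning paths disagree and the answer may be unreliable.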

Quality-Diversity for One-Shot Biological Sequence Design

Jérémie Dona | Arthur Flajolet | Andrei Marginean | Antoine Cully | Thomas Pierrot

ICML 2024 Jul 2024

Figure 1. Left: Schematic overview of our experimental protocol. An oracle, e.g. an expressive neural network, is learned from real data. It enables us to relabel the dataset and emulates wet-lab results. An ensemble of scoring functions is learned from this relabelled dataset. Right: We optimize a MAP-Elites grid with respect to this ensemble of scoring functions, following eq. (2), and the descriptors of eq. (4).
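
A generic MAP-Elites loop driven by an ensemble of scoring functions might look like the sketch below. The paper's eq. (2) and eq. (4) are not reproduced on this page, so the mean-over-ensemble fitness, the toy alphabet, the scoring functions, and the descriptor are all placeholder assumptions chosen only to make the example runnable.

```python
import random

ALPHABET = "ACDE"  # toy alphabet (assumption), not the real amino-acid vocabulary

def ensemble_fitness(seq, scorers):
    """Mean score over the ensemble (an assumed aggregation rule)."""
    return sum(score(seq) for score in scorers) / len(scorers)

def mutate(seq, rng):
    """Replace one randomly chosen position with a random letter."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(ALPHABET) + seq[i + 1:]

def map_elites(initial, scorers, descriptor, iterations, rng):
    grid = {}  # descriptor cell -> (fitness, sequence)

    def try_insert(seq):
        fitness = ensemble_fitness(seq, scorers)
        cell = descriptor(seq)
        # Keep only the best sequence found so far in each cell.
        if cell not in grid or fitness > grid[cell][0]:
            grid[cell] = (fitness, seq)

    for seq in initial:
        try_insert(seq)
    for _ in range(iterations):
        _, parent = rng.choice(list(grid.values()))  # pick a random elite
        try_insert(mutate(parent, rng))
    return grid

# Placeholder ensemble and descriptor (assumptions for illustration only).
scorers = [lambda s: s.count("C"), lambda s: s.count("C") + 0.5 * s.count("D")]
descriptor = lambda s: s.count("A")
grid = map_elites(["AAAAAA", "EEEEEE"], scorers, descriptor, 500, random.Random(0))
```

Each grid cell ends up holding the highest-scoring sequence seen for that descriptor value, so the output is a diverse set of candidates rather than a single optimum.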

Likelihood-based fine-tuning of protein language models for few-shot fitness prediction and design

Alex Hawkins-Hooker | Jakub Kmec | Oliver Bent | Paul Duckworth

ICML 2024 workshop Jul 2024

We plot the fraction of the top 30% of sequences in the initial candidate pool that are retrieved by the optimisation process as a function of the number of optimisation rounds, for both single- and multi-mutant landscapes, in Figure 1. Across both sets of landscapes, the PoET ranking ensemble outperforms all other methods. In general, the design curves show similar trends to the supervised results.

A large language foundational model for edible plant genomes

Nature Communications Biology 2024 Jul 2024

AgroNT: a novel large language model that integrates genomes across plant species. We developed a transformer-based DNA language model named the Agronomic Nucleotide Transformer (AgroNT), which learned general nucleotide sequence representations from genomic DNA sequences of 48 different plant species (Fig. 1a, Supplementary Fig. 1 and Supplementary Table 1; Methods). Building upon our previous work [23], our pre-training strategy involves performing masked language modeling (MLM) on a DNA sequence consisting of ~6,000 base pairs (bp). Our tokenization algorithm splits the DNA sequence into 6-mers, treating each 6-mer as a token, and masks 15% of the tokens for prediction (Fig. 1b; Methods). For our fine-tuning strategy, we implemented parameter-efficient fine-tuning using the IA3 technique [30]. In this approach, we replaced the language model head with a prediction head, using either a classification or regression head based on the task. We kept the weights of the transformer layers and embedding layers frozen, or alternatively, unfroze a small number of the final layers to reduce training time for specific downstream tasks (Fig. 1c; Methods).
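
The tokenization and masking step can be sketched in plain Python. This is a minimal illustration assuming non-overlapping 6-mers; dropping any trailing bases that do not fill a full 6-mer, the `<mask>` token string, and the function names are simplifying assumptions rather than details of the AgroNT tokenizer.

```python
import random

def kmer_tokenize(sequence, k=6):
    """Split a DNA string into non-overlapping k-mers, dropping any incomplete tail."""
    usable = len(sequence) - len(sequence) % k
    return [sequence[i:i + k] for i in range(0, usable, k)]

def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", rng=None):
    """Replace a random mask_rate fraction of tokens; return masked tokens and positions."""
    rng = rng or random.Random()
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(positions)

rng = random.Random(0)
dna = "".join(rng.choice("ACGT") for _ in range(6000))   # random stand-in sequence
tokens = kmer_tokenize(dna)                               # 6000 bp -> 1000 tokens
masked, positions = mask_tokens(tokens, rng=rng)          # 150 tokens masked
```

During MLM pre-training, the model would be asked to predict the original 6-mers at `positions` given the `masked` sequence.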

Coordination Failure in Cooperative Offline MARL

Callum Rhys Tilbury | Claude Formanek | Louise Beyers | Jonathan Shock | Arnu Pretorius

ICML 2024 ARLET Workshop Jul 2024

Machine Learning of Force Fields for Molecular Dynamics Simulations of Proteins at DFT Accuracy

ICLR 2024 GEM Workshop May 2024

Model-Based Reinforcement Learning for Protein Backbone Design

Frédéric Renard | Cyprien Courtot | Oliver Bent

ICLR 2024 GEM Workshop May 2024
