Proteins are essential to life, driving nearly every biological process and performing critical functions in the human body—from building muscles to fighting diseases. Understanding these intricate molecules has long challenged researchers, but advancements in AI could change that.
What is ProtBFN?
Proteins are built from the 20 proteinogenic amino acids. Arranged in specific sequences, these amino acids fold into a protein’s unique three-dimensional structure, which in turn determines its function.

Modelling protein sequences is like unfolding the map of our existence: these sequences hold the key to understanding biological processes at a profound level. However, their inherent complexity makes them difficult to interpret, modify, or design for therapeutic use. ProtBFN was developed to address this challenge.
ProtBFN is a 650-million-parameter Bayesian Flow Network (BFN) trained on a curated dataset of 72 million biologically validated examples, optimised for generating new protein sequences.
By expanding the repository of biologically relevant data, ProtBFN equips researchers with powerful tools to explore uncharted regions of the proteome, potentially driving meaningful advancements in healthcare applications.
Why GenAI is important for protein sequence generation
Meaningful protein sequences represent only a tiny fraction of the sequences that could be constructed from the amino-acid alphabet. Since functional proteins range from around 10 to several thousand amino acids in length, a sequence of length L admits 20^L possible arrangements: a modest 100-residue protein already has roughly 10^130 candidate sequences, far more than the number of atoms in the observable universe.
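To make that scale concrete, here is a minimal back-of-the-envelope calculation (plain Python, no external dependencies; the example lengths are ours, chosen for illustration):

```python
# Back-of-the-envelope: size of protein sequence space.
# With 20 proteinogenic amino acids, a sequence of length L has
# 20**L possible arrangements; we work in log10 to keep numbers readable.
import math

ALPHABET_SIZE = 20  # proteinogenic amino acids

for length in (10, 100, 1000):
    log10_count = length * math.log10(ALPHABET_SIZE)
    print(f"length {length:>4}: ~10^{log10_count:.0f} possible sequences")

# Output:
# length   10: ~10^13 possible sequences
# length  100: ~10^130 possible sequences
# length 1000: ~10^1301 possible sequences
```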
This is where generative AI can help. By focusing on creating sequences that are biologically plausible, the model learns to compose amino acids into meaningful proteins.
Rather than laboriously sifting through endless possibilities, generative AI can identify and create new candidate sequences that hold the most promise for further research.
Protein generation typically falls into two key areas:
- Unconditional generation (de novo), where AI creates entirely new proteins without specific guidance.
- Conditional generation, where AI is guided towards a specific task, such as completing a partial sequence or designing proteins with particular properties (a sketch contrasting the two modes follows this list).
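To make the distinction concrete, here is a minimal sketch of the two modes behind a hypothetical interface; `ProteinGenerator`, `sample`, and `inpaint` are invented names for illustration, not ProtBFN’s actual API:

```python
# Illustrative only: the two generation modes behind a hypothetical
# interface. Nothing here is ProtBFN's real API.
from typing import Protocol


class ProteinGenerator(Protocol):
    def sample(self, length: int) -> str:
        """Unconditional (de novo): draw a brand-new sequence from scratch."""
        ...

    def inpaint(self, partial: str, mask_char: str = "X") -> str:
        """Conditional: fill in the masked positions of a partial sequence."""
        ...


def demo(model: ProteinGenerator) -> None:
    # De novo: no guidance beyond a desired length.
    new_protein = model.sample(length=120)
    # Conditional: complete a fragment ("X" marks unknown residues).
    completed = model.inpaint("MKTAYIAKQRQISFVKSHFSRQXXXXXXXXLE")
    print(new_protein, completed, sep="\n")
```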
While AI continues to advance rapidly in this field, traditional generative models face significant obstacles in capturing the intricate nature of protein design. For instance, many models excel in one type of generation (e.g., conditional) but fall short in the other.
Additionally, some models are restricted by how they process data: GPT-style models, for instance, generate in a fixed left-to-right order. Proteins don’t follow such a rigid ordering, and this limitation can miss interactions between distant parts of a sequence, leading to less accurate outcomes.
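Concretely, a left-to-right model factorises the probability of a sequence position by position,

$$p(x_1, \dots, x_L) = \prod_{i=1}^{L} p(x_i \mid x_1, \dots, x_{i-1}),$$

so the prediction at position $i$ can only condition on earlier residues, even though a residue’s role often depends on partners that appear later in the chain. An order-agnostic model avoids baking this assumption into its factorisation.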
These challenges underscore the need for a more versatile approach. As shown in Figure 2, ProtBFN demonstrates the necessary flexibility by generating meaningful and diverse protein sequences that are both novel and natural.
1. 10,000 generated sequences from each model are matched to clusters from UniRef50. A hit is a match with >50% sequence identity. The coverage score is the ratio of the number of unique clusters hit to the expected number of unique clusters hit by the same number of sequences drawn i.i.d. from the model’s training distribution.
2. Identity of ProtBFN-generated sequences to the best-matching protein sequence in the model’s training data. Any identity below 100% indicates a novel sequence that the model has not seen before. Source: internal.
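For readers who want to reproduce the idea, here is a minimal sketch of the coverage computation as we read the definition above; the function name and input format are ours, and the cluster matching itself is assumed to be done elsewhere (e.g. with a sequence-search tool):

```python
# A minimal sketch of the coverage score described above.
# `generated_hits[i]` is the UniRef50 cluster matched by generated
# sequence i at >50% identity, or None if there was no hit.
# `expected_unique_clusters` is the baseline: the expected number of
# unique clusters hit by the same number of i.i.d. training samples.
from typing import Optional


def coverage_score(
    generated_hits: list[Optional[str]],
    expected_unique_clusters: float,
) -> float:
    unique_clusters_hit = {c for c in generated_hits if c is not None}
    return len(unique_clusters_hit) / expected_unique_clusters
```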
How ProtBFN works
Imagine a digital image made up of thousands of pixels. Examining a single pixel wouldn’t reveal much, but viewing all the pixels together completes the picture. BFNs go beyond analysing the “big picture”. They model the underlying patterns and relationships within the data.
Rather than just assembling pixels into an image, BFNs operate on the parameters of a probability distribution over the data, refining a belief about every position simultaneously. This makes them particularly suited to tasks where order isn’t fixed or essential, such as generating protein sequences.
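For the curious, here is a toy version of the per-position belief update at the heart of a BFN for discrete data, following the formulation in the original BFN paper by Graves et al.; it is a simplified illustration, not InstaDeep’s implementation:

```python
# Toy illustration of one BFN Bayesian update for discrete data
# (after Graves et al., 2023). Each position holds a categorical belief
# over K symbols; a noisy observation of the true symbol sharpens it.
import numpy as np

rng = np.random.default_rng(0)
K = 20          # amino-acid alphabet size
alpha = 0.5     # accuracy of this observation (higher = less noisy)

theta = np.full(K, 1.0 / K)  # prior belief: uniform over amino acids
x = 7                        # index of the (hidden) true amino acid

# Sender: a noisy vote for the true symbol,
# y ~ N(alpha * (K * e_x - 1), alpha * K * I).
e_x = np.eye(K)[x]
y = alpha * (K * e_x - 1.0) + rng.normal(scale=np.sqrt(alpha * K), size=K)

# Bayesian update in closed form: theta' is proportional to exp(y) * theta.
theta = np.exp(y) * theta
theta /= theta.sum()

print(f"belief in the true residue after one update: {theta[x]:.2f}")
```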
Like all foundation models, BFNs are pre-trained on vast datasets to identify general patterns and representations. However, their unique approach to modelling data enables them to perform a wide range of tasks with remarkable flexibility, including zero-shot conditional generation: handling tasks without task-specific training, even in unfamiliar scenarios.

This flexibility has enabled ProtBFN to extend its capabilities beyond general protein sequence modelling to areas like antibody design. When trained on data from the Observed Antibody Space (OAS), ProtBFN demonstrated exceptional performance in generating antibody variable heavy (VH) chain sequences.

Figure 3 illustrates ProtBFN’s ability to accurately capture the natural distribution of these sequences while outperforming specialised inpainting models, like AntiBERTy and AbLang2, at antibody ‘inpainting’. This task involves predicting or generating missing parts of an antibody sequence from partially observed data, such as an incomplete sequence.
For antibody VH chains, this means completing segments like the Complementarity-Determining Regions (CDRs), which are critical for binding specificity, or the Framework Regions (FRs), which provide structural stability. By leveraging the broad understanding gained through pre-training on complete sequences, ProtBFN can effectively infer these missing regions from a partial sequence.
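As a small illustration of what the model sees in such a task, the sketch below masks a CDR span between two framework regions; the fragment and the (commented-out) `inpaint` call reuse the hypothetical interface sketched earlier and are not ProtBFN’s actual API:

```python
# Illustrative only: preparing a VH fragment for inpainting.
# Framework residues are kept as context; the CDR span is replaced
# by a mask token for the model to infer.
FR1 = "EVQLVESGGGLVQPGGSLRLSCAAS"  # framework region 1 (kept)
CDR_LEN = 8                        # length of the masked CDR span
FR2 = "WVRQAPGKGLEWVS"             # framework region 2 (kept)

masked_vh = FR1 + "X" * CDR_LEN + FR2
print(masked_vh)
# completed = model.inpaint(masked_vh, mask_char="X")  # hypothetical call
```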
Remarkably, ProtBFN demonstrates this flexibility even though it was never trained specifically for inpainting. Its understanding of antibody composition and inter-region relationships enables it to handle such tasks with precision, underscoring its potential as a powerful tool for antibody modelling.
What’s next?
While BFNs are still an emerging technology, their ability to handle diverse data modalities and inference-time tasks in a unified, flexible manner holds notable promise.
Our researchers have highlighted the work of Xue et al., which formalises the connection between BFNs and diffusion models: systems that iteratively refine random noise into meaningful outputs. This insight opens the door to advanced sampling methods that could further enhance the generative power of BFNs by enabling the production of more diverse and meaningful outputs.

This diffusion-like view of data processing, combined with their versatility, highlights BFNs’ potential for protein sequence modelling. By generating discrete data, capturing complex dependencies, and adapting to both unconditional and conditional tasks, ProtBFN overcomes several limitations of traditional models.
Looking to the future, our vision extends beyond modelling protein sequences alone. We aim to model these sequences alongside the full spectrum of their associated metadata, building foundation models that capture the joint distribution of diverse scientific data and support a more comprehensive view of the biological landscape.


Feeling curious? Dive into ProtBFN, available on our DeepChain Platform. Download our paper and explore the open-source training dataset and model weights on Hugging Face or GitHub.
Disclaimer: all claims made are supported by our research paper, Protein Sequence Modelling with Bayesian Flow Networks, unless explicitly cited otherwise.