InstaDeep open-sources the Nucleotide Transformers, its collection of genomics language models, on HuggingFace



InstaDeep Research is pleased to announce the open-sourcing of its Nucleotide Transformer on the HuggingFace platform. 

The Nucleotide Transformer project introduces a collection of four DNA large language models, with parameter counts ranging from 500M to 2.5B, developed in collaboration with NVIDIA and the Technical University of Munich. As part of this work, we introduce a curated and standardised set of downstream task benchmarks and evaluation protocols to ensure consistent and thorough evaluation of these models and to facilitate future contributions in this field.

Beyond providing valuable insights into the efficient construction and pre-training of these models, we demonstrate that our models match or improve on strong baselines across various tasks of interest, including regulatory element detection and chromatin accessibility prediction. Remarkably, fine-tuning our Nucleotide Transformer models on affordable hardware takes only a matter of minutes.

The open-source release includes not only the model weights but also the pre-training datasets and downstream task datasets. We also provide example notebooks that demonstrate how to fine-tune the models efficiently, so that they can be adapted to any new nucleotide sequence task quickly and cost-effectively!

Senior InstaDeep Research Scientist, Dr Thomas Pierrot, commented, “We are very happy to contribute to the HuggingFace ecosystem. This platform provides a fantastic opportunity to facilitate the access of the latest advancement in the field of AI with a much wider audience, and help push the boundaries of what’s possible in life sciences.”

Alex Laterre, AI Research Lead, continued, “Part of our company DNA is centred around giving back to the AI community. Our contribution to HuggingFace will open up new avenues for AI researchers to access these exciting new models and datasets.”

To learn more about our work, see our paper, “The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics”, and our HuggingFace page.

We believe that these contributions will help the community accelerate the development of such models and eventually crack the mysteries of the human genome to develop personalised and targeted medicines. In the meantime, the InstaDeep research team is already working on improving the models’ architectures and extending their capabilities and fields of application. Stay tuned!

Want to be a part of a dynamic and innovative company and work on initiatives like this? InstaDeep is hiring! Visit our careers page to see our open positions and apply now.