Advancing DNA Language Models: The Genomics Long-Range Benchmark

Chia-Hsiang Kao 1 | Evan Trop | McKinley Polen 2 | Yair Schiff 1 | Bernardo P. de Almeida | Aaron Gokaslan 1 | Thomas Pierrot | Volodymyr Kuleshov

1 Cornell University | 2 Massachusetts Institute of Technology (MIT)

Published

ABSTRACT

Building on successes in other domains, language models (LMs) for genomics have developed rapidly. Key to this development is the establishment of proper benchmarks and systematic evaluation approaches. The benchmarks proposed so far focus on tasks that depend on short-range sequence context, while evaluation of models on the long-range tasks integral to genomics, such as gene expression and genetic variant prediction, is lacking. In this work, we fill this need by introducing the Genomics Long-Range Benchmark, an evaluation tool designed to encompass tasks requiring long-range sequence dependencies, an aspect we deem crucial to genomic applications of DNA language models. In addition to clearly defining and organizing relevant tasks into a cohesive benchmark, we provide preliminary results for several prominent and recent DNA LMs evaluated on the proposed benchmark. Finally, we probe the tasks in our benchmark by exploring the effect of context-length extension methods on one of the evaluated DNA LMs, the Nucleotide Transformer. By proposing this benchmark, we hope to stimulate the ongoing development of DNA LMs and provide a fruitful testing ground for future work that aims to capture long-range sequence modeling in genomics.