publications
- EquiJump: Protein Dynamics Simulation via SO(3)-Equivariant Stochastic Interpolants. Allan Santos Costa, Ilan Mitnikov, Franco Pellegrini, and 7 more authors. arXiv, 2024.
Mapping the conformational dynamics of proteins is crucial for elucidating their functional mechanisms. While Molecular Dynamics (MD) simulation enables detailed time evolution of protein motion, its computational cost hinders its use in practice. To address this challenge, multiple deep learning models for reproducing and accelerating MD have been proposed, drawing on transport-based generative methods. However, existing work focuses on generation through transport of samples from prior distributions, which can often be distant from the data manifold. The recently proposed framework of stochastic interpolants, instead, enables transport between arbitrary distribution endpoints. Building upon this work, we introduce EquiJump, a transferable SO(3)-equivariant model that bridges all-atom protein dynamics simulation time steps directly. Our approach unifies diverse sampling methods and is benchmarked against existing models on trajectory data of fast-folding proteins. EquiJump achieves state-of-the-art results on dynamics simulation with a transferable model across all of the fast-folding proteins.
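The bridging idea in the abstract can be made concrete with a minimal sketch: a stochastic interpolant connects two successive simulation frames directly, rather than transporting samples from a Gaussian prior. The linear bridge and the `gamma` noise coefficient below are illustrative assumptions, not the paper's exact construction (which additionally enforces SO(3) equivariance in the learned drift):

```python
import numpy as np

def linear_interpolant(x0, x1, s, z, gamma=0.1):
    """Stochastic interpolant between frame x0 (time t) and x1 (time t + tau).

    I(s) = (1 - s) * x0 + s * x1 + gamma * sqrt(s * (1 - s)) * z,
    so I(0) = x0 and I(1) = x1 exactly: the bridge connects the two
    data distributions directly, with no prior-distribution endpoint.
    """
    return (1.0 - s) * x0 + s * x1 + gamma * np.sqrt(s * (1.0 - s)) * z

def velocity_target(x0, x1, s, z, gamma=0.1):
    """Regression target dI/ds for a drift network b_theta(I(s), s), s in (0, 1)."""
    return (x1 - x0) + gamma * (1.0 - 2.0 * s) / (2.0 * np.sqrt(s * (1.0 - s))) * z

# toy usage: two consecutive "frames" of a 3-atom system
rng = np.random.default_rng(0)
x0 = rng.normal(size=(3, 3))   # coordinates at time t
x1 = rng.normal(size=(3, 3))   # coordinates at time t + tau
z = rng.normal(size=(3, 3))    # latent noise
assert np.allclose(linear_interpolant(x0, x1, 0.0, z), x0)  # endpoint at s = 0
```

Training would regress `b_theta(I(s), s)` onto `velocity_target` over random `s`, frame pairs, and noise; integrating the learned drift from `s = 0` to `s = 1` then advances the simulation by one time step.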
- Ophiuchus: Scalable Modeling of Protein Structures through Hierarchical Coarse-graining SO(3)-Equivariant Autoencoders. Allan Santos Costa, Ilan Mitnikov, Mario Geiger, and 3 more authors. In ICLR Workshop GEM, 2024.
Three-dimensional native states of natural proteins display recurring and hierarchical patterns. Yet, traditional graph-based modeling of protein structures is often limited to operating at a single fine-grained resolution, and lacks hourglass neural architectures to learn those high-level building blocks. We narrow this gap by introducing Ophiuchus, an SO(3)-equivariant coarse-graining model that efficiently operates on all heavy atoms of standard protein residues, while respecting their relevant symmetries. Our model departs from current approaches that employ graph modeling, instead focusing on local convolutional coarsening to model sequence-motif interactions in log-linear length complexity. We train Ophiuchus on contiguous fragments of PDB monomers, investigating its reconstruction capabilities across different compression rates. We examine the learned latent space and demonstrate its prompt usage in conformational interpolation, comparing interpolated trajectories to structure snapshots from the PDBFlex dataset. Finally, we leverage denoising diffusion probabilistic models (DDPM) to efficiently sample readily-decodable latent embeddings of diverse miniproteins. Our experiments demonstrate Ophiuchus to be a scalable basis for efficient protein modeling and generation.
- Evolutionary-scale prediction of atomic-level protein structure with a language model. Zeming Lin, Halil Akin, Roshan Rao, and 8 more authors. Science, 2023.
Large language models have recently been shown to develop emergent capabilities with scale, going beyond simple pattern matching to perform higher level reasoning and generate lifelike images and text. While language models trained on protein sequences have been studied at a smaller scale, little is known about what they learn about biology as they are scaled up. In this work we train models up to 15 billion parameters, the largest language models of proteins to be evaluated to date. We find that as models are scaled they learn information enabling the prediction of the three-dimensional structure of a protein at the resolution of individual atoms. We present ESMFold for high accuracy end-to-end atomic level structure prediction directly from the individual sequence of a protein. ESMFold has similar accuracy to AlphaFold2 and RoseTTAFold for sequences with low perplexity that are well understood by the language model. ESMFold inference is an order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic proteins in practical timescales.
- Euclidean Transformers for Macromolecular Structures: Lessons Learned. David Liu, Lígia Melo, Allan Costa, and 3 more authors. In ICML Workshop on Computational Biology, 2022.
Recent years have seen significant efforts towards creating machine learning approaches for modeling molecular structure. In this work, we investigate a class of architectures of particular interest, translation- and rotation-equivariant transformers, across a number of important problems involving macromolecules with complex three-dimensional geometry. In particular, we build a representative model of this class that achieves state-of-the-art performance on a number of tasks in the ATOM3D collection. Surprisingly, we find that while equivariance is critical for achieving high performance, attention does not provide major improvements. We hope that these insights, combined with the overall robustness of the method, will help further machine learning architectural research on problems involving molecular structures. The model code is available out-of-the-box at https://github.com/drorlab/gert.
- InterDocker: End-to-end Cross-Attentive and Geometric Transformers for Efficient Iterative Protein Docking. Allan Costa, Manvitha Ponnapati, Eric Alcaide, and 5 more authors. In NeurIPS LMRL (Learning Meaningful Representations of Life), Poster, 2021.
Modeling protein-protein interactions is a necessary step for elucidating the mechanisms behind fundamental biological processes. Recent advances in protein structure prediction have laid the groundwork for resolving those interactions through protein co-folding. In this work, we repurpose the building blocks of protein folding architectures to directly operate on structures and perform end-to-end protein docking, without the need for costly sequence alignments. To do this, we introduce a two-track pipeline to reason about pairwise three-dimensional matching of interfaces and guide Euclidean-equivariant models for iterative construction and refinement of complexes. Our approach performs on par with state-of-the-art methods while reducing computation time by orders of magnitude.
- ChaperoNet: distillation of language model semantics to folded three-dimensional protein structures. Allan Costa. Master’s Thesis, 2021.
Determining the structure of proteins has been a long-standing goal in biology. Language models have been recently deployed to capture the evolutionary semantics of protein sequences, and as an emergent property, were found to be structural learners. Enriched with multiple sequence alignments (MSA), these transformer models were able to capture significant information about a protein’s tertiary structure. In this work, we show how such structural information can be recovered by processing language model embeddings, and introduce a two-stage folding pipeline to directly estimate three-dimensional folded structures from protein sequences. We envision that this pipeline will provide a basis for efficient, end-to-end protein structure prediction through protein language modeling.
- Distillation of MSA Embeddings to Folded Protein Structures with Graph Transformers. Allan Costa, Manvitha Ponnapati, Joseph M. Jacobson, and 1 more author. bioRxiv, 2021.
Determining the structure of proteins has been a long-standing goal in biology. Language models have been recently deployed to capture the evolutionary semantics of protein sequences. Enriched with multiple sequence alignments (MSA), these models can encode protein tertiary structure. In this work, we introduce an attention-based graph architecture that exploits MSA Transformer embeddings to directly produce three-dimensional folded structures from protein sequences. We envision that this pipeline will provide a basis for efficient, end-to-end protein structure prediction.
- Fast Neural Models for Symbolic Regression at Scale. Allan Costa, Rumen Dangovski, Owen Dugan, and 4 more authors. 2020.
Deep learning owes much of its success to the astonishing expressiveness of neural networks. However, this comes at the cost of complex, black-boxed models that extrapolate poorly beyond the domain of the training dataset, conflicting with goals of finding analytic expressions to describe science, engineering and real world data. Under the hypothesis that the hierarchical modularity of such laws can be captured by training a neural network, we introduce OccamNet, a neural network model that finds interpretable, compact, and sparse solutions for fitting data, à la Occam’s razor. Our model defines a probability distribution over a non-differentiable function space. We introduce a two-step optimization method that samples functions and updates the weights with backpropagation based on cross-entropy matching in an evolutionary strategy: we train by biasing the probability mass toward better fitting solutions. OccamNet fits a variety of symbolic laws, including simple analytic functions, recursive programs, implicit functions, and simple image classification, and noticeably outperforms state-of-the-art symbolic regression methods on real-world regression datasets. Our method requires minimal memory footprint, does not require AI accelerators for efficient training, fits complicated functions in minutes of training on a single CPU, and demonstrates significant performance gains when scaled on a GPU. Our implementation, demonstrations and instructions for reproducing the experiments are available at this https URL.
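The two-step optimization described in the abstract can be illustrated with a toy version: maintain a probability distribution over a function space, sample candidate functions, and shift probability mass toward the best-fitting samples. The finite candidate set and single softmax below are simplifying assumptions for exposition; OccamNet itself composes primitives into deep symbolic expressions:

```python
import numpy as np

# Hypothetical finite "function space"; OccamNet composes primitives instead.
CANDIDATES = {
    "x": lambda x: x,
    "x^2": lambda x: x ** 2,
    "sin(x)": np.sin,
    "exp(x)": np.exp,
}

def train(xs, ys, steps=200, lr=0.5, seed=0):
    """Sample functions from a softmax distribution; bias mass toward the
    best-fitting sample in each batch via a cross-entropy-style update."""
    rng = np.random.default_rng(seed)
    names = list(CANDIDATES)
    logits = np.zeros(len(names))
    for _ in range(steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        idx = rng.choice(len(names), size=8, p=probs)   # step 1: sample functions
        losses = [np.mean((CANDIDATES[names[i]](xs) - ys) ** 2) for i in idx]
        best = idx[int(np.argmin(losses))]              # best-fitting sample
        onehot = np.eye(len(names))[best]
        logits += lr * (onehot - probs)                 # step 2: cross-entropy step
    return names[int(np.argmax(logits))]

xs = np.linspace(-1, 1, 50)
print(train(xs, xs ** 2))  # recovers "x^2"
```

The update `onehot - probs` is the gradient of the log-probability of the winning sample, so repeated steps concentrate the distribution on functions that fit the data, mirroring the evolutionary-strategy training the abstract describes.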
- Hiring Fairly in the Age of Algorithms. Max Langenkamp, Allan Costa, and Chris Cheung. Dec 2020.
Widespread developments in automation have reduced the need for human input. However, despite the increased power of machine learning, in many contexts these programs make decisions that are problematic. Biases within data and opaque models have amplified human prejudices, giving rise to such tools as Amazon’s (now defunct) experimental hiring algorithm, which was found to consistently downgrade resumes when the word "women’s" was added before an activity. This article critically surveys the existing legal and technological landscape surrounding algorithmic hiring. We argue that the negative impact of hiring algorithms can be mitigated by greater transparency from the employers to the public, which would enable civil advocate groups to hold employers accountable, as well as allow the U.S. Department of Justice to litigate. Our main contribution is a framework for automated hiring transparency, algorithmic transparency reports, which employers using automated hiring software would be required to publish by law. We also explain how existing regulations in employment and trade secret law can be extended by the Equal Employment Opportunity Commission and Congress to accommodate these reports.
- Algorithmic Approaches to Reconfigurable Assembly Systems. Allan Costa, Amira Abdel-Rahman, Benjamin Jenett, and 3 more authors. In 2019 IEEE Aerospace Conference, Mar 2019.
Assembly of large scale structural systems in space is understood as critical to serving applications that cannot be deployed from a single launch. Recent literature proposes the use of discrete modular structures for in-space assembly and relatively small scale robotics that are able to modify and traverse the structure. This paper addresses the algorithmic problems in scaling reconfigurable space structures built through robotic construction, where reconfiguration is defined as the problem of transforming an initial structure into a different goal configuration. We analyze different algorithmic paradigms and present corresponding abstractions and graph formulations, examining specialized algorithms that consider discretized space and time steps. We then discuss fundamental design trades for different computational architectures, such as centralized versus distributed, and present two representative algorithms as concrete examples for comparison. We analyze how those algorithms achieve different objective functions and goals, such as minimization of total distance traveled, maximization of fault-tolerance, or minimization of total time spent in assembly. This is meant to offer an impression of algorithmic constraints on scalability of corresponding structural and robotic design. From this study, a set of recommendations is developed on where and when to use each paradigm, as well as implications for physical robotic and structural system design.
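The discretized-space graph formulation mentioned in the abstract can be sketched in miniature: modules occupy cells of a lattice, moves are unit steps, and path cost (total distance traveled) is measured by breadth-first search around occupied cells. The single-module, obstacle-avoidance setup below is a simplifying assumption; the paper's formulations additionally handle structure connectivity, multiple robots, and discretized time:

```python
from collections import deque

def bfs_distance(start, goal, occupied, size=8):
    """Shortest number of unit moves from start to goal on a size x size
    lattice, treating cells in `occupied` as obstacles."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        (x, y), d = frontier.popleft()
        if (x, y) == goal:
            return d
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < size and 0 <= ny < size \
                    and (nx, ny) not in seen and (nx, ny) not in occupied:
                seen.add((nx, ny))
                frontier.append(((nx, ny), d + 1))
    return None  # goal unreachable

# a module routes around a wall of other modules at x = 3, y = 0..6
wall = {(3, y) for y in range(7)}
print(bfs_distance((0, 0), (6, 0), wall))  # prints 20: 6 across plus a 14-step detour
```

Minimizing total distance traveled then corresponds to summing such path costs over all module relocations, which is the kind of objective the centralized and distributed paradigms in the paper trade off differently.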