publications
- EquiJump: Protein Dynamics Simulation via SO(3)-Equivariant Stochastic Interpolants. Allan Santos Costa, Ilan Mitnikov, Franco Pellegrini, and 7 more authors. arXiv, 2024.
Mapping the conformational dynamics of proteins is crucial for elucidating their functional mechanisms. While Molecular Dynamics (MD) simulation enables detailed time evolution of protein motion, its computational cost hinders its use in practice. To address this challenge, multiple deep learning models for reproducing and accelerating MD have been proposed, drawing on transport-based generative methods. However, existing work focuses on generation through transport of samples from prior distributions, which can often be distant from the data manifold. The recently proposed framework of stochastic interpolants, instead, enables transport between arbitrary distribution endpoints. Building upon this work, we introduce EquiJump, a transferable SO(3)-equivariant model that bridges all-atom protein dynamics simulation time steps directly. Our approach unifies diverse sampling methods and is benchmarked against existing models on trajectory data of fast-folding proteins. EquiJump achieves state-of-the-art results on dynamics simulation with a transferable model across all of the fast-folding proteins.
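The bridging idea in the abstract can be made concrete with a minimal sketch: a stochastic interpolant connects two successive simulation frames directly, rather than transporting samples from a Gaussian prior. The linear bridge and the `gamma` noise coefficient below are illustrative assumptions, not the paper's exact construction (which additionally enforces SO(3) equivariance in the learned drift):

```python
import numpy as np

def linear_interpolant(x0, x1, s, z, gamma=0.1):
    """Stochastic interpolant between frame x0 (time t) and x1 (time t + tau).

    I(s) = (1 - s) * x0 + s * x1 + gamma * sqrt(s * (1 - s)) * z,
    so I(0) = x0 and I(1) = x1 exactly: the bridge connects the two
    data distributions directly, with no prior-distribution endpoint.
    """
    return (1.0 - s) * x0 + s * x1 + gamma * np.sqrt(s * (1.0 - s)) * z

def velocity_target(x0, x1, s, z, gamma=0.1):
    """Regression target dI/ds for a drift network b_theta(I(s), s), s in (0, 1)."""
    return (x1 - x0) + gamma * (1.0 - 2.0 * s) / (2.0 * np.sqrt(s * (1.0 - s))) * z

# toy usage: two consecutive "frames" of a 3-atom system
rng = np.random.default_rng(0)
x0 = rng.normal(size=(3, 3))   # coordinates at time t
x1 = rng.normal(size=(3, 3))   # coordinates at time t + tau
z = rng.normal(size=(3, 3))    # latent noise
assert np.allclose(linear_interpolant(x0, x1, 0.0, z), x0)  # endpoint at s = 0
```

Training would regress `b_theta(I(s), s)` onto `velocity_target` over random `s`, frame pairs, and noise; integrating the learned drift from `s = 0` to `s = 1` then advances the simulation by one time step.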
- Ophiuchus: Scalable Modeling of Protein Structures through Hierarchical Coarse-graining SO(3)-Equivariant Autoencoders. Allan Santos Costa, Ilan Mitnikov, Mario Geiger, and 3 more authors. In ICLR Workshop GEM, 2024.
Three-dimensional native states of natural proteins display recurring and hierarchical patterns. Yet, traditional graph-based modeling of protein structures is often limited to operating at a single fine-grained resolution, and lacks hourglass neural architectures to learn those high-level building blocks. We narrow this gap by introducing Ophiuchus, an SO(3)-equivariant coarse-graining model that efficiently operates on all heavy atoms of standard protein residues, while respecting their relevant symmetries. Our model departs from current approaches that employ graph modeling, instead focusing on local convolutional coarsening to model sequence-motif interactions in log-linear length complexity. We train Ophiuchus on contiguous fragments of PDB monomers, investigating its reconstruction capabilities across different compression rates. We examine the learned latent space and demonstrate its prompt usage in conformational interpolation, comparing interpolated trajectories to structure snapshots from the PDBFlex dataset. Finally, we leverage denoising diffusion probabilistic models (DDPM) to efficiently sample readily-decodable latent embeddings of diverse miniproteins. Our experiments demonstrate Ophiuchus to be a scalable basis for efficient protein modeling and generation.
- Evolutionary-scale prediction of atomic-level protein structure with a language model. Zeming Lin, Halil Akin, Roshan Rao, and 8 more authors. Science, 2023.
Large language models have recently been shown to develop emergent capabilities with scale, going beyond simple pattern matching to perform higher level reasoning and generate lifelike images and text. While language models trained on protein sequences have been studied at a smaller scale, little is known about what they learn about biology as they are scaled up. In this work we train models up to 15 billion parameters, the largest language models of proteins to be evaluated to date. We find that as models are scaled they learn information enabling the prediction of the three-dimensional structure of a protein at the resolution of individual atoms. We present ESMFold for high accuracy end-to-end atomic level structure prediction directly from the individual sequence of a protein. ESMFold has similar accuracy to AlphaFold2 and RoseTTAFold for sequences with low perplexity that are well understood by the language model. ESMFold inference is an order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic proteins in practical timescales.
- Euclidean Transformers for Macromolecular Structures: Lessons Learned. David Liu, Lígia Melo, Allan Costa, and 3 more authors. In ICML Workshop on Computational Biology, 2022.
Recent years have seen significant efforts towards creating machine learning approaches for modeling molecular structure. In this work, we investigate a class of architectures of particular interest, translation- and rotation-equivariant transformers, across a number of important problems involving macromolecules with complex three-dimensional geometry. In particular, we build a representative model of this class that achieves state-of-the-art performance on a number of tasks in the ATOM3D collection. Surprisingly, we find that while equivariance is critical for achieving high performance, attention does not provide major improvements. We hope that these insights, combined with the overall robustness of the method, will help further machine learning architectural research on problems involving molecular structures. The model code is available out-of-the-box at https://github.com/drorlab/gert.
- InterDocker: End-to-end Cross-Attentive and Geometric Transformers for Efficient Iterative Protein Docking. Allan Costa, Manvitha Ponnapati, Eric Alcaide, and 5 more authors. In NeurIPS LMRL (Learning Meaningful Representations of Life), Poster, 2021.
Modeling protein-protein interactions is a necessary step for elucidating the mechanisms behind fundamental biological processes. Recent advances in protein structure prediction have laid the groundwork for resolving those interactions through protein co-folding. In this work, we repurpose the building blocks of protein folding architectures to directly operate on structures and perform end-to-end protein docking, without the need for costly sequence alignments. To do this, we introduce a two-track pipeline to reason about pairwise three-dimensional matching of interfaces and guide Euclidean-equivariant models for iterative construction and refinement of complexes. Our approach performs on par with state-of-the-art methods while reducing computation time by orders of magnitude.
- ChaperoNet: distillation of language model semantics to folded three-dimensional protein structures. Allan Costa. Master’s Thesis, 2021.
Determining the structure of proteins has been a long-standing goal in biology. Language models have been recently deployed to capture the evolutionary semantics of protein sequences, and as an emergent property, were found to be structural learners. Enriched with multiple sequence alignments (MSA), these transformer models were able to capture significant information about a protein’s tertiary structure. In this work, we show how such structural information can be recovered by processing language model embeddings, and introduce a two-stage folding pipeline to directly estimate three-dimensional folded structures from protein sequences. We envision that this pipeline will provide a basis for efficient, end-to-end protein structure prediction through protein language modeling.
- Distillation of MSA Embeddings to Folded Protein Structures with Graph Transformers. Allan Costa, Manvitha Ponnapati, Joseph M. Jacobson, and 1 more author. bioRxiv, 2021.
Determining the structure of proteins has been a long-standing goal in biology. Language models have been recently deployed to capture the evolutionary semantics of protein sequences. Enriched with multiple sequence alignments (MSA), these models can encode protein tertiary structure. In this work, we introduce an attention-based graph architecture that exploits MSA Transformer embeddings to directly produce three-dimensional folded structures from protein sequences. We envision that this pipeline will provide a basis for efficient, end-to-end protein structure prediction.
- Fast Neural Models for Symbolic Regression at Scale. Allan Costa, Rumen Dangovski, Owen Dugan, and 4 more authors. 2020.
Deep learning owes much of its success to the astonishing expressiveness of neural networks. However, this comes at the cost of complex, black-boxed models that extrapolate poorly beyond the domain of the training dataset, conflicting with goals of finding analytic expressions to describe science, engineering and real world data. Under the hypothesis that the hierarchical modularity of such laws can be captured by training a neural network, we introduce OccamNet, a neural network model that finds interpretable, compact, and sparse solutions for fitting data, à la Occam’s razor. Our model defines a probability distribution over a non-differentiable function space. We introduce a two-step optimization method that samples functions and updates the weights with backpropagation based on cross-entropy matching in an evolutionary strategy: we train by biasing the probability mass toward better fitting solutions. OccamNet fits a variety of symbolic laws, including simple analytic functions, recursive programs, implicit functions, and simple image classification, and noticeably outperforms state-of-the-art symbolic regression methods on real-world regression datasets. Our method requires minimal memory footprint, does not require AI accelerators for efficient training, fits complicated functions in minutes of training on a single CPU, and demonstrates significant performance gains when scaled on a GPU. Our implementation, demonstrations and instructions for reproducing the experiments are available at this https URL.
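The two-step optimization described in the abstract can be illustrated with a toy version: maintain a probability distribution over a function space, sample candidate functions, and shift probability mass toward the best-fitting samples. The finite candidate set and single softmax below are simplifying assumptions for exposition; OccamNet itself composes primitives into deep symbolic expressions:

```python
import numpy as np

# Hypothetical finite "function space"; OccamNet composes primitives instead.
CANDIDATES = {
    "x": lambda x: x,
    "x^2": lambda x: x ** 2,
    "sin(x)": np.sin,
    "exp(x)": np.exp,
}

def train(xs, ys, steps=200, lr=0.5, seed=0):
    """Sample functions from a softmax distribution; bias mass toward the
    best-fitting sample in each batch via a cross-entropy-style update."""
    rng = np.random.default_rng(seed)
    names = list(CANDIDATES)
    logits = np.zeros(len(names))
    for _ in range(steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        idx = rng.choice(len(names), size=8, p=probs)   # step 1: sample functions
        losses = [np.mean((CANDIDATES[names[i]](xs) - ys) ** 2) for i in idx]
        best = idx[int(np.argmin(losses))]              # best-fitting sample
        onehot = np.eye(len(names))[best]
        logits += lr * (onehot - probs)                 # step 2: cross-entropy step
    return names[int(np.argmax(logits))]

xs = np.linspace(-1, 1, 50)
print(train(xs, xs ** 2))  # recovers "x^2"
```

The update `onehot - probs` is the gradient of the log-probability of the winning sample, so repeated steps concentrate the distribution on functions that fit the data, mirroring the evolutionary-strategy training the abstract describes.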
- Hiring Fairly in the Age of Algorithms. Max Langenkamp, Allan Costa, and Chris Cheung. Dec 2020.
Widespread developments in automation have reduced the need for human input. However, despite the increased power of machine learning, in many contexts these programs make decisions that are problematic. Biases within data and opaque models have amplified human prejudices, giving rise to such tools as Amazon’s (now defunct) experimental hiring algorithm, which was found to consistently downgrade resumes when the word "women’s" was added before an activity. This article critically surveys the existing legal and technological landscape surrounding algorithmic hiring. We argue that the negative impact of hiring algorithms can be mitigated by greater transparency from the employers to the public, which would enable civil advocate groups to hold employers accountable, as well as allow the U.S. Department of Justice to litigate. Our main contribution is a framework for automated hiring transparency, algorithmic transparency reports, which employers using automated hiring software would be required to publish by law. We also explain how existing regulations in employment and trade secret law can be extended by the Equal Employment Opportunity Commission and Congress to accommodate these reports.
- Algorithmic Approaches to Reconfigurable Assembly Systems. Allan Costa, Amira Abdel-Rahman, Benjamin Jenett, and 3 more authors. In 2019 IEEE Aerospace Conference, Mar 2019.
Assembly of large scale structural systems in space is understood as critical to serving applications that cannot be deployed from a single launch. Recent literature proposes the use of discrete modular structures for in-space assembly and relatively small scale robotics that are able to modify and traverse the structure. This paper addresses the algorithmic problems in scaling reconfigurable space structures built through robotic construction, where reconfiguration is defined as the problem of transforming an initial structure into a different goal configuration. We analyze different algorithmic paradigms and present corresponding abstractions and graph formulations, examining specialized algorithms that consider discretized space and time steps. We then discuss fundamental design trades for different computational architectures, such as centralized versus distributed, and present two representative algorithms as concrete examples for comparison. We analyze how those algorithms achieve different objective functions and goals, such as minimization of total distance traveled, maximization of fault-tolerance, or minimization of total time spent in assembly. This is meant to offer an impression of algorithmic constraints on scalability of corresponding structural and robotic design. From this study, a set of recommendations is developed on where and when to use each paradigm, as well as implications for physical robotic and structural system design.
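The discretized-space graph formulation mentioned in the abstract can be sketched in miniature: modules occupy cells of a lattice, moves are unit steps, and path cost (total distance traveled) is measured by breadth-first search around occupied cells. The single-module, obstacle-avoidance setup below is a simplifying assumption; the paper's formulations additionally handle structure connectivity, multiple robots, and discretized time:

```python
from collections import deque

def bfs_distance(start, goal, occupied, size=8):
    """Shortest number of unit moves from start to goal on a size x size
    lattice, treating cells in `occupied` as obstacles."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        (x, y), d = frontier.popleft()
        if (x, y) == goal:
            return d
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < size and 0 <= ny < size \
                    and (nx, ny) not in seen and (nx, ny) not in occupied:
                seen.add((nx, ny))
                frontier.append(((nx, ny), d + 1))
    return None  # goal unreachable

# a module routes around a wall of other modules at x = 3, y = 0..6
wall = {(3, y) for y in range(7)}
print(bfs_distance((0, 0), (6, 0), wall))  # prints 20: 6 across plus a 14-step detour
```

Minimizing total distance traveled then corresponds to summing such path costs over all module relocations, which is the kind of objective the centralized and distributed paradigms in the paper trade off differently.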