FURI | Spring 2024

Understanding the Root Causes for catELMo’s Superior Performance Embedding T-Cell Receptors With Respect to the Downstream TCR-Epitope Binding Affinity Prediction Task


Embedding variable-length amino acid sequences into fixed-length vectors is the first step in applying machine learning techniques to biological data, and better embedding methods yield better results in downstream tasks. Currently, the best embedding model for T-cell receptors is catELMo. The research team seeks to uncover the underlying reasons for catELMo’s superior performance compared with other embedding models, specifically GPT and BERT. The approach is a large-scale ablation study in which several of catELMo’s hyperparameters, as well as the scale of the models, are varied to determine which parameters have the greatest impact on downstream performance.
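To make the idea of a fixed-length embedding concrete, here is a minimal sketch in Python. It is not catELMo or any model from this project: it uses one-hot residue vectors averaged over the sequence (mean pooling) as a stand-in, and the names `AMINO_ACIDS`, `INDEX`, and `embed` are hypothetical. It only illustrates how sequences of different lengths can map to vectors of the same size.

```python
# Illustrative stand-in only: one-hot residues averaged over the sequence.
# A learned model such as catELMo would produce richer per-residue vectors.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def embed(sequence: str) -> list[float]:
    """Mean-pool one-hot residue vectors into a 20-dimensional vector."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    totals = [0.0] * len(AMINO_ACIDS)
    for residue in sequence:
        totals[INDEX[residue]] += 1.0  # raises KeyError on non-standard residues
    return [t / len(sequence) for t in totals]


if __name__ == "__main__":
    short_vec = embed("CASSL")          # 5 residues
    long_vec = embed("CASSLGQAYEQYF")   # 13 residues
    print(len(short_vec), len(long_vec))  # prints: 20 20
```

Both example sequences yield a vector of length 20 regardless of their own length, which is the property a downstream classifier relies on.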

Student researcher

Ryan Connolly-Kelley

Computer science

Hometown: Tempe, Arizona, United States

Graduation date: Spring 2025