Job Summary
We are seeking a scientist to join our team at Iambic Therapeutics, working on data acquisition and curation for Enchant, our multimodal transformer model trained at scale on a wide variety of biomedical data. In this role, you will design and build agentic systems that acquire, clean, format, and quality-control the large-scale datasets that power Enchant training. You will work at the intersection of LLM-based automation and biomedical data engineering, developing AI agents that can navigate heterogeneous data sources, enforce quality standards, and operate reliably at scale.
This role is ideal for candidates who combine strong software engineering instincts with scientific understanding of biomedical data, and who are excited about using LLMs as tools to solve practical data problems.
Key Responsibilities
- Design, build, and maintain agentic systems for automated data acquisition from public and proprietary biomedical data sources
- Develop LLM-based pipelines for data cleaning, normalization, and formatting across diverse data modalities (e.g., molecular, genomic, clinical, literature)
- Implement automated quality-control workflows that detect anomalies, flag inconsistencies, and enforce data standards
- Evaluate and iterate on agent architectures, prompting strategies, and tool-use patterns to improve reliability and throughput
- Collaborate with ML scientists on the Enchant team to understand data requirements and translate them into scalable acquisition and processing systems
- Monitor and maintain data pipelines in production, diagnosing failures and improving robustness over time
- Document data provenance, processing decisions, and quality metrics to support reproducibility and auditing
Qualifications
Required:
- Master's or PhD in a computational STEM field, or equivalent industry experience
- Strong Python engineering skills, including experience building and maintaining production-quality software
- Hands-on experience with LLM APIs (e.g., Claude, GPT) and agentic patterns such as tool use, orchestration, and multi-step reasoning
- Familiarity with biomedical or chemical data sources and formats (e.g., PDB, UniProt, ChEMBL, SDF/MOL, FASTA, or similar)
- Comfort with data engineering fundamentals: ETL design, data validation, and working with structured and unstructured data at scale
Desired:
- Experience with agent orchestration frameworks
- Familiarity with cloud infrastructure and workflow orchestration (e.g., AWS, Docker, Kubernetes)
- Knowledge of multimodal biomedical data—spanning small molecules, proteins, assays, images, omics, and/or clinical records
- Experience with large-scale dataset construction or curation for ML model training
Location
Remote (US or UK). On-site work is available in Bristol, UK, and Boston, US.