Job Summary
We are seeking a scientist to join our team at Iambic Therapeutics, working on data acquisition and curation for Enchant, our multimodal transformer model trained at scale on a wide variety of biomedical data. In this role, you will design and build agentic systems that acquire, clean, format, and quality-control the large-scale datasets that power Enchant training. You will work at the intersection of LLM-based automation and biomedical data engineering, developing AI agents that can navigate heterogeneous data sources, enforce quality standards, and operate reliably at scale.
This role is ideal for candidates who combine strong software engineering instincts with scientific understanding of biomedical data, and who are excited about using LLMs as tools to solve practical data problems.
Key Responsibilities
- Design, build, and maintain agentic systems for automated data acquisition from public and proprietary biomedical data sources
- Develop LLM-based pipelines for data cleaning, normalization, and formatting across diverse data modalities (e.g., molecular, genomic, clinical, literature)
- Implement automated quality-control workflows that detect anomalies, flag inconsistencies, and enforce data standards
- Evaluate and iterate on agent architectures, prompting strategies, and tool-use patterns to improve reliability and throughput
- Collaborate with ML scientists on the Enchant team to understand data requirements and translate them into scalable acquisition and processing systems
- Monitor and maintain data pipelines in production, diagnosing failures and improving robustness over time
- Document data provenance, processing decisions, and quality metrics to support reproducibility and auditing
Qualifications
Required:
- Master's or PhD in a computational STEM field, or equivalent industry experience
- Strong Python engineering skills, including experience building and maintaining production-quality software
- Hands-on experience with LLM APIs (e.g., Claude, GPT) and agentic patterns such as tool use, orchestration, and multi-step reasoning
- Familiarity with biomedical or chemical data sources and formats (e.g., PDB, UniProt, ChEMBL, SDF/MOL, FASTA, or similar)
- Comfort with data engineering fundamentals: ETL design, data validation, and working with structured and unstructured data at scale
Desired:
- Experience with agent orchestration frameworks
- Familiarity with cloud infrastructure and workflow orchestration (e.g., AWS, Docker, Kubernetes)
- Knowledge of multimodal biomedical data—spanning small molecules, proteins, assays, images, omics, and/or clinical records
- Experience with large-scale dataset construction or curation for ML model training
Location
Remote (US or UK). On-site work is available in Bristol, UK, and Boston, US.