Mitochondrial Genome Assembly & Annotation Platform
MitoPilot is an open-source mitochondrial genome assembly and annotation platform developed for the Smithsonian Institution while I was a software engineer at Jonah Ventures. I served as the lead developer and architect for the system, integrating a containerized Nextflow pipeline with an interactive R Shiny interface to streamline genome skimming workflows from raw Illumina reads through GenBank-ready submissions.
Initially designed for fish mitogenomes, the platform now supports a growing range of taxonomic groups, including starfish, octocorals, hexacorals, dipterans, and turtles.
The Problem
Mitochondrial genome assembly and annotation requires chaining together specialized tools for read preprocessing, assembly, gene annotation, mapping, and validation. Each component introduces its own parameters, output formats, and failure modes.
Researchers often manage these workflows manually, combining command-line tools with spreadsheets while tracking dozens or hundreds of samples. This approach introduces reproducibility challenges, parameter drift, and inefficiencies during annotation curation and submission.
The goal was to create a reproducible, scalable system that handles computational complexity while preserving interactive control for researchers during quality review and annotation refinement.
System Architecture
MitoPilot is structured around three coordinated modules:
Assembly
- Quality control and adapter trimming via fastp
- Assembly using GetOrganelle or MitoFinder
- Read mapping with bowtie2 for coverage and error rate estimation
Annotation
- rRNA, protein-coding gene, and tRNA annotation via MITOS2
- Additional tRNA detection using tRNAscan-SE
- Custom gene boundary refinement scripts
- Pre-submission validation against NCBI GenBank requirements
Interactive Curation
- Browser-based Shiny interface for monitoring pipeline status
- Sample state management with batch parameter updates
- Manual annotation refinement with visual feedback
- Export to GenBank-ready formats
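To illustrate the kind of check a pre-submission validation step in the Annotation module might run, here is a minimal Python sketch that flags problems in a protein-coding sequence. It is not MitoPilot's actual validation code: the function name `check_cds` and the message wording are hypothetical, and it assumes NCBI translation table 2 (vertebrate mitochondrial), which fits fish but not every supported taxon.

```python
# Hypothetical pre-submission check; not MitoPilot's actual validation code.
# Codon sets follow NCBI translation table 2 (vertebrate mitochondrial).
STOP_CODONS = {"TAA", "TAG", "AGA", "AGG"}
START_CODONS = {"ATT", "ATC", "ATA", "ATG", "GTG"}


def check_cds(seq: str) -> list[str]:
    """Return human-readable problems found in a coding sequence."""
    seq = seq.upper()
    if len(seq) < 3:
        return ["sequence shorter than one codon"]

    problems = []
    if seq[:3] not in START_CODONS:
        problems.append(f"unexpected start codon {seq[:3]}")

    # Split into complete codons and keep any leftover bases separately.
    n_full = len(seq) // 3
    codons = [seq[i * 3:(i + 1) * 3] for i in range(n_full)]
    remainder = seq[n_full * 3:]

    # A trailing T or TA is tolerated: incomplete stop codons are common in
    # mitogenomes and are completed by polyadenylation of the transcript.
    if remainder and remainder not in ("T", "TA"):
        problems.append(
            f"trailing bases {remainder} are not an incomplete stop codon (T or TA)"
        )

    # With leftover bases, every complete codon is internal; otherwise the
    # last codon is the terminal position and is excluded from the scan.
    internal = codons if remainder else codons[:-1]
    for pos, codon in enumerate(internal, start=1):
        if codon in STOP_CODONS:
            problems.append(f"internal stop codon {codon} at codon {pos}")
    return problems
```

An empty list means the sequence passed these particular checks. A real validator would also need to select the translation table per taxon and cover GenBank's other submission requirements.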
Technical Design
The architecture deliberately separates computational execution from user interaction.
Nextflow orchestrates the workflow using containerized tools (Docker/Singularity), ensuring reproducibility across local machines, AWS Batch, and institutional HPC clusters including Smithsonian Hydra and NOAA SEDNA.
An SQLite database (via the nf-sqldb plugin) stores sample metadata and processing parameters. The Shiny interface interacts with this database to:
- Track job status
- Modify parameters
- Coordinate batch operations
- Maintain state across sessions
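The database-backed pattern above can be sketched in a few lines of Python using the standard `sqlite3` module. This is an illustration only: the `samples` table, its columns, and the helper functions are assumptions made for the example, not the actual MitoPilot schema, and the real interface is written in R Shiny rather than Python.

```python
import sqlite3

# Hypothetical schema; the real MitoPilot tables and columns may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    id        TEXT PRIMARY KEY,
    status    TEXT NOT NULL DEFAULT 'pending',
    assembler TEXT NOT NULL DEFAULT 'GetOrganelle'
);
"""


def open_db(path=":memory:"):
    """Open (or create) the state database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_samples(conn, sample_ids):
    """Register new samples with default status and parameters."""
    conn.executemany(
        "INSERT INTO samples (id) VALUES (?)", [(s,) for s in sample_ids]
    )
    conn.commit()


def set_assembler(conn, sample_ids, assembler):
    """Batch update: change the assembler for several samples and re-queue them."""
    if not sample_ids:
        return 0
    placeholders = ",".join("?" for _ in sample_ids)
    cur = conn.execute(
        "UPDATE samples SET assembler = ?, status = 'pending' "
        f"WHERE id IN ({placeholders})",
        [assembler, *sample_ids],
    )
    conn.commit()
    return cur.rowcount  # number of rows actually changed


def status_counts(conn):
    """Summarize job status, e.g. for a dashboard table."""
    rows = conn.execute("SELECT status, COUNT(*) FROM samples GROUP BY status")
    return dict(rows)
```

Because all state lives in the database file rather than in the interface process, a file-backed path (instead of `:memory:`) lets a later session pick up where the previous one left off, and the pipeline and the interface can share one source of truth.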
Users can optionally supply custom assemblies or reference databases, allowing the system to adapt to specialized taxonomic workflows without modifying the underlying pipeline.
This separation of orchestration, state management, and interaction enables both reproducibility and flexibility.
Outcome
MitoPilot is actively maintained and used by researchers at NOAA and the Smithsonian. By combining automated, containerized processing with interactive annotation curation, the platform reduces the time from raw sequencing reads to validated GenBank submission while preserving researcher oversight.
The project demonstrates a broader principle: bioinformatics workflows can be engineered as durable systems rather than ad hoc scripts, balancing computational rigor with domain-specific flexibility.