MitoPilot | Full Stack Ecoinformatics

Mitochondrial Genome Assembly & Annotation Platform

MitoPilot is an open-source mitochondrial genome assembly and annotation platform developed for the Smithsonian Institution while I was a software engineer at Jonah Ventures. I served as the lead developer and architect for the system, integrating a containerized Nextflow pipeline with an interactive R Shiny interface to streamline genome skimming workflows from raw Illumina reads through GenBank-ready submissions.

Initially designed for fish mitogenomes, the platform now supports a range of taxonomic groups, including starfish, octocorals, hexacorals, dipterans, and turtles, and coverage continues to expand.

The Problem

Mitochondrial genome assembly and annotation require chaining together multiple specialized tools for read preprocessing, assembly, gene annotation, mapping, and validation. Each component introduces its own parameters, output formats, and failure modes.

Researchers often manage these workflows manually, combining command-line tools with spreadsheets while tracking dozens or hundreds of samples. This approach introduces reproducibility challenges, parameter drift, and inefficiencies during annotation curation and submission.

The goal was to create a reproducible, scalable system that handles computational complexity while preserving interactive control for researchers during quality review and annotation refinement.

System Architecture

MitoPilot is structured around three coordinated modules:

Assembly

Quality control and adapter trimming via fastp
Assembly using GetOrganelle or MitoFinder
Read mapping with bowtie2 for coverage and error rate estimation

Annotation

rRNA, protein-coding gene, and tRNA annotation via MITOS2
Additional tRNA detection using tRNAscan-SE
Custom gene boundary refinement scripts
Pre-submission validation against NCBI GenBank requirements

Interactive Curation

Browser-based Shiny interface for monitoring pipeline status
Sample state management with batch parameter updates
Manual annotation refinement with visual feedback
Export to GenBank-ready formats
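
To make the coverage and error rate estimation step in the Assembly module more concrete, here is a minimal Python sketch of the arithmetic involved. It is an illustration only, not MitoPilot's code: the `MappingSummary` class and its field names are hypothetical, and the counts are assumed to have already been extracted from the read-mapping output.

```python
from dataclasses import dataclass


@dataclass
class MappingSummary:
    """Hypothetical summary of reads mapped back to one assembly."""
    assembly_length: int   # assembled mitogenome length in bases
    aligned_bases: int     # total read bases aligned to the assembly
    mismatched_bases: int  # aligned bases that disagree with the assembly


def mean_coverage(summary: MappingSummary) -> float:
    """Average read depth: aligned bases divided by assembly length."""
    if summary.assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    return summary.aligned_bases / summary.assembly_length


def error_rate(summary: MappingSummary) -> float:
    """Fraction of aligned bases that mismatch the assembly."""
    if summary.aligned_bases == 0:
        return 0.0
    return summary.mismatched_bases / summary.aligned_bases


# Made-up numbers for illustration: a 16.5 kb assembly at 100x depth.
example = MappingSummary(
    assembly_length=16_500,
    aligned_bases=1_650_000,
    mismatched_bases=3_300,
)
print(mean_coverage(example))  # 100.0
print(error_rate(example))     # 0.002
```

Low mean coverage or an unusually high mismatch fraction is the kind of signal a reviewer might use to flag an assembly for a closer look before annotation.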

Technical Design

The architecture deliberately separates computational execution from user interaction.

Nextflow orchestrates the workflow using containerized tools (Docker/Singularity), ensuring reproducibility across local machines, AWS Batch, and institutional HPC clusters including Smithsonian Hydra and NOAA SEDNA.

An SQLite database (via the nf-sqldb plugin) stores sample metadata and processing parameters. The Shiny interface interacts with this database to:

Track job status
Modify parameters
Coordinate batch operations
Maintain state across sessions
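
The state-tracking pattern described above can be sketched with Python's built-in `sqlite3` module. This is a simplified illustration under stated assumptions, not MitoPilot's actual schema: the `samples` table, its columns, the status values, and the `assembler` parameter key are all hypothetical.

```python
import json
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE samples (
        sample_id TEXT PRIMARY KEY,
        status    TEXT NOT NULL DEFAULT 'pending',
        params    TEXT NOT NULL DEFAULT '{}'
    )
    """
)
conn.executemany(
    "INSERT INTO samples (sample_id) VALUES (?)",
    [("S1",), ("S2",), ("S3",)],
)


def update_params(conn, sample_ids, new_params):
    """Merge new_params into each sample's stored JSON and reset it to 'pending'."""
    with conn:  # commit the whole batch together, or roll back on error
        for sample_id in sample_ids:
            row = conn.execute(
                "SELECT params FROM samples WHERE sample_id = ?", (sample_id,)
            ).fetchone()
            if row is None:
                raise KeyError(sample_id)
            merged = {**json.loads(row[0]), **new_params}
            conn.execute(
                "UPDATE samples SET params = ?, status = 'pending' "
                "WHERE sample_id = ?",
                (json.dumps(merged), sample_id),
            )


def status_counts(conn):
    """Return a {status: number_of_samples} mapping for a status overview."""
    return dict(
        conn.execute("SELECT status, COUNT(*) FROM samples GROUP BY status")
    )


# Simulate one finished sample, then batch-update the other two.
conn.execute("UPDATE samples SET status = 'complete' WHERE sample_id = 'S3'")
update_params(conn, ["S1", "S2"], {"assembler": "GetOrganelle"})
print(status_counts(conn))
```

Because the state lives in a database file rather than in the interface's memory, a new session can reopen it and pick up where the previous one left off.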

Users can optionally supply custom assemblies or reference databases, allowing the system to adapt to specialized taxonomic workflows without modifying the underlying pipeline.

This separation of orchestration, state management, and interaction enables both reproducibility and flexibility.

Outcome

MitoPilot is actively maintained and used by researchers at NOAA and the Smithsonian. By combining automated, containerized processing with interactive annotation curation, the platform shortens the path from raw sequencing reads to validated GenBank submission while preserving researcher oversight.

The project demonstrates a broader principle: bioinformatics workflows can be engineered as durable systems rather than ad hoc scripts, balancing computational rigor with domain-specific flexibility.

