Full Stack Ecoinformatics

MitoPilot

R, Nextflow, Shiny, Docker, SQLite

A reproducible mitogenome assembly and annotation system integrating Nextflow orchestration, containerized bioinformatics tools, and a database-backed Shiny interface. Designed for scalable genome skimming workflows across HPC and cloud environments.

Role
Lead Developer & System Architect (Jonah Ventures)
Client
Smithsonian Institution
Context
Developed while employed at Jonah Ventures
Status
Actively maintained · Open Source

Mitochondrial Genome Assembly & Annotation Platform

MitoPilot is an open-source mitochondrial genome assembly and annotation platform developed for the Smithsonian Institution while I was a software engineer at Jonah Ventures. I served as the lead developer and architect for the system, integrating a containerized Nextflow pipeline with an interactive R Shiny interface to streamline genome skimming workflows from raw Illumina reads through GenBank-ready submissions.

Initially designed for fish mitogenomes, the platform now supports diverse taxonomic groups including starfish, octocorals, hexacorals, dipterans, turtles, and others, with ongoing expansion.


The Problem

Mitochondrial genome assembly and annotation require chaining together multiple specialized tools for read preprocessing, assembly, gene annotation, mapping, and validation. Each component introduces its own parameters, output formats, and failure modes.

Researchers often manage these workflows manually, combining command-line tools with spreadsheets while tracking dozens or hundreds of samples. This approach introduces reproducibility challenges, parameter drift, and inefficiencies during annotation curation and submission.

The goal was to create a reproducible, scalable system that handles computational complexity while preserving interactive control for researchers during quality review and annotation refinement.


System Architecture

MitoPilot is structured around three coordinated modules:

Assembly

  • Quality control and adapter trimming via fastp
  • Assembly using GetOrganelle or MitoFinder
  • Read mapping with bowtie2 for coverage and error rate estimation

Annotation

  • rRNA, protein-coding gene, and tRNA annotation via MITOS2
  • Additional tRNA detection using tRNAscan-SE
  • Custom gene boundary refinement scripts
  • Pre-submission validation against NCBI GenBank requirements

Interactive Curation

  • Browser-based Shiny interface for monitoring pipeline status
  • Sample state management with batch parameter updates
  • Manual annotation refinement with visual feedback
  • Export to GenBank-ready formats
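
As a rough illustration of what the coverage and error rate estimates in the Assembly module measure, the sketch below summarizes per-position read depth and mismatch counts for one assembly. It is not MitoPilot's implementation: the function name, the input layout (two equal-length lists, one entry per assembly position), and the definition of error rate as mismatched bases over aligned bases are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MappingSummary:
    mean_depth: float   # average number of reads covering each position
    breadth: float      # fraction of positions with depth >= min_depth
    error_rate: float   # mismatched bases divided by aligned bases


def summarize_mapping(
    depth: Sequence[int],
    mismatches: Sequence[int],
    min_depth: int = 1,
) -> MappingSummary:
    """Summarize hypothetical per-position mapping statistics.

    depth[i] is the number of reads covering position i, and
    mismatches[i] is how many of those reads disagree with the assembly.
    """
    if len(depth) != len(mismatches):
        raise ValueError("depth and mismatches must have the same length")
    if not depth:
        raise ValueError("at least one position is required")

    aligned_bases = sum(depth)
    mean_depth = aligned_bases / len(depth)
    breadth = sum(1 for d in depth if d >= min_depth) / len(depth)
    # Avoid dividing by zero when nothing mapped at all.
    error_rate = sum(mismatches) / aligned_bases if aligned_bases else 0.0
    return MappingSummary(mean_depth, breadth, error_rate)
```

Low mean depth or a high mismatch fraction would be the kind of signal a reviewer might use to flag an assembly for a closer look during curation.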

Technical Design

The architecture deliberately separates computational execution from user interaction.

Nextflow orchestrates the workflow using containerized tools (Docker/Singularity), ensuring reproducibility across local machines, AWS Batch, and institutional HPC clusters including Smithsonian Hydra and NOAA SEDNA.
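
One way to picture this portability is that the pipeline definition stays fixed and only the execution profile changes between environments. The sketch below builds a `nextflow run` command line in Python. `-profile` and `-resume` are standard Nextflow options, but the script name `main.nf` and the profile names are hypothetical placeholders, not MitoPilot's actual entry point or configuration.

```python
import shlex
from typing import List


def build_nextflow_command(script: str, profile: str, resume: bool = True) -> List[str]:
    """Build an argument list for launching a Nextflow pipeline.

    `script` and `profile` are placeholders supplied by the caller; the
    profile is expected to be defined in the pipeline's own configuration.
    """
    if not script or not profile:
        raise ValueError("script and profile must be non-empty")
    cmd = ["nextflow", "run", script, "-profile", profile]
    if resume:
        # Reuse cached results from earlier runs where possible.
        cmd.append("-resume")
    return cmd


def as_shell(cmd: List[str]) -> str:
    """Render the argument list as a single shell-safe string."""
    return shlex.join(cmd)
```

Calling `build_nextflow_command("main.nf", "docker")` on a laptop and `build_nextflow_command("main.nf", "singularity")` on a cluster would launch the same workflow, with only the container runtime differing.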

An SQLite database (via the nf-sqldb plugin) stores sample metadata and processing parameters. The Shiny interface interacts with this database to:

  • Track job status
  • Modify parameters
  • Coordinate batch operations
  • Maintain state across sessions
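
A minimal sketch of this kind of database-backed state, using Python's built-in `sqlite3` module, might look like the following. The table layout, column names, status values, and function names are hypothetical and chosen for illustration; they are not MitoPilot's actual schema, and the real interface is written in R Shiny rather than Python.

```python
import json
import sqlite3
from typing import Dict, Iterable

# Hypothetical schema: one row per sample, parameters stored as a JSON object.
SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    sample_id TEXT PRIMARY KEY,
    status    TEXT NOT NULL DEFAULT 'pending',
    params    TEXT NOT NULL DEFAULT '{}'
)
"""


def open_state_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the state database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def add_sample(conn: sqlite3.Connection, sample_id: str, **params) -> None:
    """Register a sample with its initial processing parameters."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO samples (sample_id, params) VALUES (?, ?)",
            (sample_id, json.dumps(params)),
        )


def set_status(conn: sqlite3.Connection, sample_id: str, status: str) -> None:
    """Record the current processing status of one sample."""
    with conn:
        cur = conn.execute(
            "UPDATE samples SET status = ? WHERE sample_id = ?",
            (status, sample_id),
        )
    if cur.rowcount == 0:
        raise KeyError(sample_id)


def get_params(conn: sqlite3.Connection, sample_id: str) -> dict:
    """Return the stored parameters for one sample."""
    row = conn.execute(
        "SELECT params FROM samples WHERE sample_id = ?", (sample_id,)
    ).fetchone()
    if row is None:
        raise KeyError(sample_id)
    return json.loads(row[0])


def batch_update_params(
    conn: sqlite3.Connection, sample_ids: Iterable[str], **changes
) -> None:
    """Apply the same parameter changes to several samples in one transaction
    and mark them as pending so they would be picked up for reprocessing."""
    with conn:
        for sample_id in sample_ids:
            params = get_params(conn, sample_id)
            params.update(changes)
            conn.execute(
                "UPDATE samples SET params = ?, status = 'pending' "
                "WHERE sample_id = ?",
                (json.dumps(params), sample_id),
            )


def status_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Count samples per status, e.g. for a monitoring dashboard."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM samples GROUP BY status"
    ).fetchall()
    return dict(rows)
```

Because the state lives in the database rather than in the interface's session, either side can be restarted and pick up from the last recorded status and parameters.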

Users can optionally supply custom assemblies or reference databases, allowing the system to adapt to specialized taxonomic workflows without modifying the underlying pipeline.

This separation of orchestration, state management, and interaction enables both reproducibility and flexibility.


Outcome

MitoPilot is actively maintained and used by researchers at NOAA and the Smithsonian. By combining automated, containerized processing with interactive annotation curation, the platform reduces time from raw sequencing reads to validated GenBank submission while preserving researcher oversight.

The project demonstrates a broader principle: bioinformatics workflows can be engineered as durable systems rather than ad hoc scripts, balancing computational rigor with domain-specific flexibility.
