Amplicon Sequence Variant Registry | Full Stack Ecoinformatics

Cloud-Based eDNA Data Management & Exploration Platform

The Amplicon Sequence Variant (ASV) Registry is a web application developed for the Smithsonian Institution’s Ocean DNA initiative and Jonah Ventures’ marine eDNA monitoring programs. It serves as both a centralized data registry and an interactive exploration platform for amplicon sequence variant data across multiple ocean biodiversity projects.

The system integrates sequence data, sample metadata, taxonomic assignments, and spatial context into a unified interface designed for cross-project comparison and standardized biodiversity reporting.

The Problem

Environmental DNA monitoring produces large volumes of sequence data that must be:

Linked to detailed sample metadata
Assigned taxonomy across evolving reference libraries
Compared across projects with different workflows
Exported in standardized formats for biodiversity databases such as GBIF

Researchers often rely on fragmented scripts or static outputs, making it difficult to investigate poorly matched sequences, compare taxa across projects, or maintain versioned taxonomy over time.

A scalable, structured platform was needed to manage ASV data across ocean monitoring efforts while supporting both exploration and downstream reporting.

The Solution

I developed a modular Shiny application structured around two core components: data registry and interactive exploration.

Dark Taxa Investigation:

Interactive sunburst plots visualizing taxonomic composition with match-quality overlays
Scatter plots highlighting poorly matched sequences by abundance and similarity
ASV selection tools for exporting sequences for external analysis

Core Capabilities:

Integration with GEOME via API for automated sample metadata retrieval
AWS Lambda-based automated taxonomic assignment
Interactive Leaflet maps with drawing tools for spatial subsetting
Species accumulation curves, shared taxa analysis, and ordination plots
Darwin Core-compliant exports for GBIF integration
Taxonomy versioning support for temporal comparisons
Flexible cross-project filtering

The result is a system that supports both exploratory analysis and standardized biodiversity data exchange.

Technical Architecture

The application was built using the golem framework to ensure modularity and long-term maintainability.

Key architectural components include:

SQLite database backend for structured ASV and metadata storage
Amazon S3 for spatial and project data assets
AWS Lambda for asynchronous taxonomic assignment workflows
Docker containerization with a two-stage build process
ShinyProxy deployment on AWS EC2, providing session-level container isolation
AWS Cognito authentication for secure user access

ShinyProxy ensures each user session runs in an isolated container instance, maintaining both security and reproducibility. The schema was designed with migration to production-scale database infrastructure in mind.

Outcome

What began as a prototype for exploring eDNA visualization approaches evolved into a production system managing data from multiple marine monitoring projects.

The ASV Registry currently supports cross-project biodiversity analysis and is positioned for expanded deployment within Smithsonian computational infrastructure, including integration with GBIF via standardized Darwin Core exports.

The project demonstrates how biodiversity informatics systems can be engineered as durable, modular platforms rather than ad hoc analysis scripts — bridging ecological genomics and production infrastructure.

Cloud-Based eDNA Data Management & Exploration Platform

The Problem

The Solution

Technical Architecture

Outcome

Have a Similar Challenge?