Open data science
Reproducibility determines how reliable your results are. If it can’t be reproduced, it isn’t real. The reproducibility stack consists of multiple layers that work together so that others (or you, on a different machine) can reproduce your analysis exactly.
To create a fully reproducible analysis:
- Start with a structured project
- Use containers for system-level consistency (Docker/Apptainer)
- Manage packages with language-specific tools (renv/UV)
- Version control all code and configuration
- Set random seeds for deterministic results
- Document everything in READMEs and lab notebooks
- Test reproducibility on clean environments
- Publish code, data, and environment specifications
1. Containerization (System Level)
Containers package your application together with all of its system dependencies into a standardized unit. This ensures consistency across operating systems, system libraries, and computing environments. Docker is the most common containerization platform. Singularity (now called Apptainer) is a container platform designed for HPC environments; it is similar to Docker, can run Docker images, and does not require root permissions, which makes it suitable for shared clusters.
Further reading:
FHIL Compute Environment Launchers
FHIL provides streamlined launchers that automatically manage containers for interactive compute environments. These handle container setup and hosting for interactive analysis sessions on the Fred Hutch HPC clusters.
2. Project Structure & Documentation (Project Level)
Well-defined project structures, lab notebooks, ample READMEs, and modular, well-documented code make your work more accessible and reproducible. The FHIL analysis template provides a preferred project structure inspired by the gin-tonic project and the file naming conventions of The Turing Way.
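As a rough illustration of what a structured project means in practice, the sketch below creates a generic skeleton with separate places for raw data, code, results, and documentation. The directory names here are placeholders chosen for this example, not the FHIL analysis template's actual layout; use the template itself for real projects.
from pathlib import Path

# Generic project skeleton -- directory names are illustrative placeholders,
# not the FHIL analysis template's actual layout.
for d in ["data/raw", "data/processed", "scripts", "results", "docs"]:
    Path(d).mkdir(parents=True, exist_ok=True)
Path("README.md").touch()  # top-level README describing the project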
3. Package Management (Language Level)
Package managers track the exact versions of language-specific dependencies, ensuring the same packages are used across different environments (renv for R, UV for Python). The FHIL project template includes startup scripts that initialize these workflows for you.
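The lockfiles these tools produce are the authoritative record, but it can also help to write out the package versions a run actually used alongside its results. A minimal Python sketch (assuming Python 3.8 or later; the output filename is arbitrary):
import importlib.metadata

# Record the exact versions of every installed package alongside the results.
versions = sorted(
    f"{dist.metadata['Name']}=={dist.version}"
    for dist in importlib.metadata.distributions()
)
with open("package-versions.txt", "w") as f:  # arbitrary output filename
    f.write("\n".join(versions) + "\n")
On the R side, renv::snapshot() captures the equivalent information in renv.lock.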
4. Version Control (Code Level)
Version control tracks changes to your analysis code, allowing you to:
- Revert to previous versions
- Track what changed and when
- Collaborate with others
- Publish code with papers
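One way version control feeds directly into reproducibility is to record the exact commit your results were generated from. A minimal sketch, assuming the analysis runs inside a Git repository with git available on the PATH:
import subprocess

# Record the commit hash of the code that produced this run's results.
commit = subprocess.run(
    ["git", "rev-parse", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"Results generated at commit {commit}")
Saving this hash with your outputs (or in your lab notebook) lets anyone check out exactly the code that produced them.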
5. Random Seed Control (Algorithm Level)
Many analysis steps (sampling, permutation tests, stochastic algorithms) rely on random number generation. Setting a random seed makes these steps deterministic, so they produce the same results every time they are rerun.
# R
set.seed(42)

# Python
import random
import numpy as np
random.seed(42)     # seeds Python's built-in random module
np.random.seed(42)  # seeds NumPy's legacy global random state

# Set seeds once at the beginning of your analysis
# Document the seed value used (e.g., in your README or lab notebook)
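Note that different libraries maintain their own random state, so each library you use for random draws needs its own seed. If you use NumPy's newer Generator interface, the usual pattern is to create an explicitly seeded generator and pass it around rather than relying on global state; a minimal sketch:
import numpy as np

# Create an explicitly seeded generator and use it for all random draws.
rng = np.random.default_rng(42)
sample = rng.normal(size=10)  # reproducible: same values on every run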
6. Data Versioning (Data Level)
Track versions of input data and intermediate results:
- Use data versioning tools (DVC, Git LFS)
- Document data sources and processing steps
- Include data checksums for verification (see the sketch after this list)
- When possible, store data in human-readable/non-proprietary formats
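A minimal checksum sketch, assuming a local input file at data/raw_counts.csv (a hypothetical path used only for illustration). The resulting digest can be recorded in a README or lab notebook and recomputed later to confirm the data have not changed:
import hashlib

def sha256_checksum(path, chunk_size=8192):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_checksum("data/raw_counts.csv"))  # hypothetical input file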