Open data science
Reproducibility determines how reliable your results are. If it can’t be reproduced, it isn’t real. The reproducibility stack consists of multiple layers that work together so that others (or you, on a different machine) can reproduce your analysis exactly.
To create a fully reproducible analysis:
- Start with a structured project
- Use containers for system-level consistency (Docker/Apptainer)
- Manage packages with language-specific tools (renv/UV)
- Version control all code and configuration
- Set random seeds for deterministic results
- Document everything in READMEs and lab notebooks
- Test reproducibility on clean environments
- Publish code, data, and environment specifications
1. Containerization (System Level)
Containers package your application together with all of its system dependencies into a standardized unit. This ensures consistency across operating systems, system libraries, and computing environments. Docker is the most common containerization platform. Singularity (now called Apptainer) is a container platform designed for HPC environments; it is similar to Docker, can run Docker images, and does not require root permissions, which makes it suitable for shared clusters.
Further reading:
FHIL Compute Environment Launchers
FHIL provides streamlined launchers that automatically manage containers for interactive compute environments. These handle container setup and hosting for interactive analysis sessions on the Fred Hutch HPC clusters.
2. Project Structure & Documentation (Project Level)
Well-defined project structures, lab notebooks, ample READMEs, and modular, well-documented code make your work more accessible and reproducible. The FHIL analysis template provides a preferred project structure inspired by the gin-tonic project and the file naming conventions of The Turing Way.
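As a rough illustration of what a structured project means in practice, the sketch below creates a generic skeleton with separate places for raw data, code, results, and documentation. The directory names here are placeholders chosen for this example, not the FHIL analysis template's actual layout; use the template itself for real projects.
from pathlib import Path

# Generic project skeleton -- directory names are illustrative placeholders,
# not the FHIL analysis template's actual layout.
for d in ["data/raw", "data/processed", "scripts", "results", "docs"]:
    Path(d).mkdir(parents=True, exist_ok=True)
Path("README.md").touch()  # top-level README describing the project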
3. Package Management (Language Level)
Package managers track the exact versions of language-specific dependencies, ensuring the same packages are used across different environments (renv for R, UV for Python). The FHIL project template includes startup scripts that initialize these workflows for you.
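The lockfiles these tools produce are the authoritative record, but it can also help to write out the package versions a run actually used alongside its results. A minimal Python sketch (assuming Python 3.8 or later; the output filename is arbitrary):
import importlib.metadata

# Record the exact versions of every installed package alongside the results.
versions = sorted(
    f"{dist.metadata['Name']}=={dist.version}"
    for dist in importlib.metadata.distributions()
)
with open("package-versions.txt", "w") as f:  # arbitrary output filename
    f.write("\n".join(versions) + "\n")
On the R side, renv::snapshot() captures the equivalent information in renv.lock.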
4. Version Control (Code Level)
Version control tracks changes to your analysis code, allowing you to:
- Revert to previous versions
- Track what changed and when
- Collaborate with others
- Publish code with papers
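One way version control feeds directly into reproducibility is to record the exact commit your results were generated from. A minimal sketch, assuming the analysis runs inside a Git repository with git available on the PATH:
import subprocess

# Record the commit hash of the code that produced this run's results.
commit = subprocess.run(
    ["git", "rev-parse", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"Results generated at commit {commit}")
Saving this hash with your outputs (or in your lab notebook) lets anyone check out exactly the code that produced them.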
5. Random Seed Control (Algorithm Level)
Many analysis steps (sampling, permutation tests, stochastic algorithms) rely on random number generation. Setting a random seed makes these steps deterministic, so they produce the same results every time they are rerun.
# R
set.seed(42)

# Python
import random
import numpy as np
random.seed(42)     # seeds Python's built-in random module
np.random.seed(42)  # seeds NumPy's legacy global random state

# Set seeds once at the beginning of your analysis
# Document the seed value used (e.g., in your README or lab notebook)
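Note that different libraries maintain their own random state, so each library you use for random draws needs its own seed. If you use NumPy's newer Generator interface, the usual pattern is to create an explicitly seeded generator and pass it around rather than relying on global state; a minimal sketch:
import numpy as np

# Create an explicitly seeded generator and use it for all random draws.
rng = np.random.default_rng(42)
sample = rng.normal(size=10)  # reproducible: same values on every run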
6. Data Versioning (Data Level)
Track versions of input data and intermediate results:
- Use data versioning tools (DVC, Git LFS)
- Document data sources and processing steps
- Include data checksums for verification (see the sketch after this list)
- When possible, store data in human-readable/non-proprietary formats
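A minimal checksum sketch, assuming a local input file at data/raw_counts.csv (a hypothetical path used only for illustration). The resulting digest can be recorded in a README or lab notebook and recomputed later to confirm the data have not changed:
import hashlib

def sha256_checksum(path, chunk_size=8192):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_checksum("data/raw_counts.csv"))  # hypothetical input file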