Open data science

  1. Open data science
    1. Openness
    2. Accessibility
      1. Version Control
      2. Dry lab notebooks
      3. Data/research compendia
    3. Reproducibility
      1. Containerization
        1. Apptainer
      2. Dependency management
        1. Renv (R)
        2. UV (Python)

Openness

Open science means your work can be seen by others. This usually means publishing code and data online. GitHub is used for code management, while Zenodo is useful for general digital storage. Gene Expression Omnibus and Sequence Read Archive are more domain-specific resources for storing genomic/transcriptomic data for publications. If you intend to publish an analysis, you will likely have to make your data public through one of these resources.

Accessibility

Accessibility deals with how understandable your code and workflows are. Well-defined project structures, lab notebooks, thorough READMEs, and modular, documented code will make your work more accessible.

See the GitHub template repo for the preferred project structure. The structure is inspired by the gin-tonic project and the file naming conventions of The Turing Way.

Version Control

Version control is essential for tracking changes, collaborating with others, and maintaining reproducible workflows. You should develop a habit of regularly committing updates to analyses. All scripts used to produce an analysis should be version tracked and ideally backed up to the cloud on GitHub or something similar. This will allow you to revert to an earlier version of an analysis when needed.
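As a minimal sketch of the commit-and-revert habit described above, the commands below create a throwaway repository, commit two versions of a script, and then restore the first version. The file name analysis.R, its contents, and the user identity are hypothetical placeholders, not part of any lab template.

```shell
set -eu

# Work in a throwaway directory so the sketch does not touch a real project.
workdir="$(mktemp -d)"
cd "$workdir"

git init --quiet
# Identity is set for this repository only so the commits succeed on a fresh machine.
git config user.name "Example Analyst"
git config user.email "analyst@example.org"

# First version of a hypothetical analysis script.
echo 'threshold <- 0.05' > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis with threshold 0.05"
first_commit="$(git rev-parse HEAD)"

# A later change that we may want to undo.
echo 'threshold <- 0.5' > analysis.R
git commit --quiet -am "Loosen threshold"

# Restore the file as it was in the first commit, then record that as a new commit.
git checkout --quiet "$first_commit" -- analysis.R
git commit --quiet -m "Revert threshold to 0.05"

restored="$(cat analysis.R)"
commit_count="$(git rev-list --count HEAD)"
echo "$restored"
echo "$commit_count commits"
```

Restoring the old file as a new commit keeps the full history intact, so the loosened threshold can still be inspected later. In a real project you would also push to a remote such as GitHub to get the cloud backup.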

Dry lab notebooks

Data/research compendia

Reproducibility

Reproducibility determines how reliable your results are. If it can’t be reproduced, it isn’t real. Container systems like Docker and Apptainer help manage system-level consistency, such as system libraries, operating systems, and language versions. Language-specific dependency managers like Renv (R) and UV or Poetry (Python) record the package versions a project uses. Setting a random seed ensures that stochastic processes produce the same results when rerun.
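To illustrate the point about random seeds, here is a small sketch using awk's built-in srand() and rand(), chosen only because the other command-line examples here are shell. The helper name draw and the seed values are made up for the example. In an actual analysis the equivalent call would be set.seed() in R or random.seed() in Python.

```shell
# Draw three pseudo-random numbers from a generator seeded with the given value.
draw() {
  awk -v seed="$1" 'BEGIN {
    srand(seed)
    for (i = 0; i < 3; i++) printf "%.6f ", rand()
    print ""
  }'
}

run_a="$(draw 42)"   # first run with seed 42
run_b="$(draw 42)"   # second run with the same seed
run_c="$(draw 7)"    # a run with a different seed

if [ "$run_a" = "$run_b" ]; then
  echo "same seed: identical draws"
fi
if [ "$run_a" != "$run_c" ]; then
  echo "different seed: different draws"
fi
```

The exact numbers depend on the awk implementation, which is one reason seeds are paired with containers and pinned dependencies rather than relied on alone.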

Further reading:

  • https://ubc-dsci.github.io/reproducible-and-trustworthy-workflows-for-data-science/

Containerization

Containers allow you to package an application with all of its dependencies into a standardized, portable unit. This is useful for developing robust workflows and publishing reproducible analyses. Docker is the most common containerization platform. Apptainer (formerly called Singularity) is a container platform designed for HPC environments. It is similar to Docker and can run Docker images without root permissions.

Apptainer

Apptainer (formerly Singularity) is a Docker-like solution for shared computing environments (such as high-performance computing clusters). When working with Rhino/Gizmo/Slurm, you will need to use Apptainer instead of Docker to avoid permissions issues. See the wiki for more info.
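A rough sketch of using a Docker image through Apptainer is shown below. The image reference and output file name are illustrative assumptions, so substitute whatever image your project uses. The check for apptainer on the PATH is there because on a cluster it may first need to be made available, for example through a module. The wiki is the authority on the local setup.

```shell
# Hypothetical image reference; any Docker registry image can be used in its place.
image_uri="docker://rocker/r-ver:4.3.2"
sif_file="r-ver_4.3.2.sif"

if command -v apptainer >/dev/null 2>&1; then
  # Convert the Docker image into a single SIF file; no root permissions needed.
  apptainer pull "$sif_file" "$image_uri"
  # Run a command inside the container.
  apptainer exec "$sif_file" Rscript -e 'cat(R.version.string, "\n")'
else
  echo "apptainer not found on PATH; skipping"
fi
```

The resulting .sif file is an ordinary file, so it can be stored alongside a project or referenced from a Slurm job script.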

Dependency management

Renv (R)

Renv operates within an R console. See the renv documentation for details.

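## Install renv if it is not already available
if (!requireNamespace("renv", quietly = TRUE)) {
  install.packages("renv")
}

## Initialize renv for the current project (creates renv.lock and a project library)
renv::init()

## Record package versions in renv.lock after installing or updating packages
renv::snapshot()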

UV (Python)

UV operates outside of Python and is invoked from the command line. On Gizmo/Rhino (FH HPCs), there is a module with UV installed. See the UV documentation for details.

module load UV

## Initialize a project with UV
uv init

## Add a dependency (replace <package> with the package name)
uv add <package>
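Once a project has been initialized as above, a collaborator can rebuild the same environment from the files UV writes. The sketch below assumes it is run from the project root and that uv is already on the PATH (for example after loading the module). The status variable exists only for this illustration.

```shell
# Recreate the project environment from pyproject.toml and uv.lock.
if command -v uv >/dev/null 2>&1 && [ -f pyproject.toml ]; then
  # Install the locked dependency versions into the project's virtual environment.
  uv sync
  # Run a command inside that environment.
  uv run python --version
  status="synced"
else
  status="skipped"
  echo "uv or pyproject.toml not found; nothing to sync"
fi
```

Committing both pyproject.toml and uv.lock to version control is what makes this step reproducible for others.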