Open data science
Openness
Open science means your work can be seen by others. This usually means publishing code and data online. GitHub is used for code management, while Zenodo is useful for general-purpose digital storage. Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) are more domain-specific resources for storing genomic and transcriptomic data for publications. If you intend to publish an analysis, you will likely have to make your data public through one of these resources.
Accessibility
Accessibility deals with how understandable your code and workflows are. Well-defined project structures, lab notebooks, ample READMEs, and modular, documented code will make your work more accessible.
See the GitHub template repo for the preferred project structure. The structure is inspired by the gin-tonic project and the file naming conventions of The Turing Way.
Version Control
Version control is essential for tracking changes, collaborating with others, and maintaining reproducible workflows. You should develop a habit of regularly committing updates to analyses. All scripts used to produce an analysis should be version tracked and ideally backed up to a remote host such as GitHub. This will allow you to:
- Revert to previous versions of an analysis
- Record failed analyses
- Test different analyses in parallel with branching
- Publish code with papers
- Git branching workflows
- Git with RStudio
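The uses listed above (recording a failed attempt, reverting to a previous version, and testing an alternative on a branch) can be sketched in a throwaway repository. This is a minimal illustration, not a prescribed workflow: the file `analysis.R`, its contents, the branch name `stricter-threshold`, and the user name and email are all hypothetical.

```shell
# Work in a temporary directory so no real project is touched.
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.org"

# First version of a hypothetical analysis script.
echo 'threshold <- 0.05' > analysis.R
git add analysis.R
git commit -q -m "Add analysis with threshold 0.05"

# A later change that turns out to be a mistake.
echo 'threshold <- 0.5' > analysis.R
git commit -q -am "Change threshold to 0.5"

# Restore the file as it was one commit earlier.
# The failed attempt stays recorded in the history.
git checkout HEAD~1 -- analysis.R
git commit -q -m "Restore threshold 0.05"

# Test a different analysis in parallel on its own branch.
git checkout -q -b stricter-threshold
echo 'threshold <- 0.01' > analysis.R
git commit -q -am "Try threshold 0.01"

# Return to the previous branch; its version of the file is untouched.
git checkout -q -
cat analysis.R
```

Restoring the old file with a new commit, rather than rewriting history, keeps the failed analysis visible to anyone reading the log later.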
Dry lab notebooks
Data/research compendia
Reproducibility
Reproducibility determines how reliable your results are. If it can’t be reproduced, it isn’t real. Container systems like Docker and Apptainer keep the system level consistent: system libraries, operating system, and language versions. Language-specific dependency managers like Renv (R) and UV or Poetry (Python) record package versions. Random seed control ensures that stochastic processes produce the same results when rerun.
Further reading:
- https://ubc-dsci.github.io/reproducible-and-trustworthy-workflows-for-data-science/
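As a small illustration of the random seed control mentioned above, the sketch below runs the same seeded draw twice from the command line and compares the output. It assumes `python3` is on the PATH; the helper name `draw_numbers` and the seed value 42 are arbitrary choices for this example.

```shell
# Hypothetical helper: draw three pseudo-random numbers with a fixed seed.
draw_numbers() {
  python3 -c 'import random; random.seed(42); print([round(random.random(), 6) for _ in range(3)])'
}

first_run=$(draw_numbers)
second_run=$(draw_numbers)

# With the seed fixed, both runs should print the same numbers.
if [ "$first_run" = "$second_run" ]; then
  echo "identical output across runs: $first_run"
else
  echo "outputs differ; check for unseeded randomness" >&2
fi
```

Removing the `random.seed(42)` call would make the two runs differ, which is the situation seed control is meant to prevent.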
Containerization
Containers allow you to package an application with all of its dependencies into a standardized unit. This is useful for developing robust workflows and publishing reproducible analyses. Docker is the most common containerization platform. Apptainer (formerly Singularity) is a container platform designed for HPC environments. It is similar to Docker and can run Docker images without root permissions.
Apptainer
Apptainer (formerly Singularity) is a Docker-like solution for shared computing environments (such as high-performance computing clusters). When working with Rhino/Gizmo/Slurm, you will need to use Apptainer instead of Docker to avoid permission issues. See the wiki for more info.
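A minimal sketch of running a Docker image through Apptainer might look like the following. The image `docker://python:3.12-slim` and the output file name are assumptions chosen for illustration; substitute the image your analysis needs. On a cluster you may first have to make `apptainer` available in your session, as described in the wiki.

```shell
# Assumed example image and output file name.
image_uri="docker://python:3.12-slim"
sif_file="python_3.12-slim.sif"

if ! command -v apptainer >/dev/null 2>&1; then
  container_status="apptainer not found"
elif [ -f "$sif_file" ] || apptainer pull "$sif_file" "$image_uri"; then
  # Run a command inside the container; no root permissions are needed.
  apptainer exec "$sif_file" python3 --version
  container_status="ran inside $sif_file"
else
  container_status="pull failed"
fi
echo "$container_status"
```

`apptainer pull` converts the Docker image into a single `.sif` file, which can be stored alongside a project and reused on later runs instead of being downloaded again.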
Dependency management
Renv (R)
Renv operates within an R console. renv Documentation
## Install Renv if needed
if (!requireNamespace("renv", quietly = TRUE)) {
  install.packages("renv")
}
## Initialize Renv for the current project
renv::init()
## Run this after installing or updating packages to record their versions in renv.lock
renv::snapshot()
UV (Python)
UV operates outside of Python and will be invoked at the command line. On Gizmo/Rhino (FH HPCs), there is a module with UV installed. UV Documentation
module load UV
## initialize a project with UV
uv init
## install a dependency
uv add <package>
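Once dependencies have been added, a collaborator can recreate the same environment from the lockfile. The sketch below assumes the current directory is a UV project containing `pyproject.toml` and `uv.lock`; the variable name `uv_status` is only for this example.

```shell
# Recreate the environment recorded in uv.lock (sketch).
if ! command -v uv >/dev/null 2>&1; then
  uv_status="uv not found"
elif [ ! -f uv.lock ]; then
  uv_status="no uv.lock in this directory"
elif uv sync --locked; then
  # --locked makes the sync fail instead of silently updating the lockfile.
  uv_status="environment synced"
else
  uv_status="sync failed"
fi
echo "$uv_status"
```

After a successful sync, `uv run python analysis.py` would run a script (here a hypothetical `analysis.py`) inside the project environment.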