A demonstration of how to work reproducibly using version control, containers, and a workflow manager.

mahesh-panchal/reproducible-research

Reproducible research demo

A demonstration of how to work reproducibly using version control, containers, notebooks, and a workflow manager.

Visit the website for how to use the tools demonstrated here.

Play with this demo in Gitpod.

Open in Gitpod

About

Communication is a cornerstone of effective science.

The aim of this demo is to show how one can efficiently make a publishable dry-lab notebook, with all the details necessary to replicate the various computational analyses performed. An equally important aim is to reduce the work needed to get from code to something published. In this case, the published form is a website that demonstrates with no ambiguity how you got from the beginning to the end of your project. Unlike a journal article, a website can go into more detail about the analyses that led to the novel result without relegating everything to supplementary material. This kind of publishable notebook can also demonstrate the skills you've learned along the way, which would otherwise be hidden away (or worse, forgotten about).

This demo aims to lower the barrier as much as possible for computational biologists and bioinformaticians, but one requirement cannot be avoided: comfort with the Unix command line. Knowing how to navigate the file system is a fundamental first step in using these tools.

Workflow overview

graph TB
A(Fetch data) --> B[Process with Nextflow/Snakemake/Galaxy]
B --> C(Package intermediate results)
B --> D[Notebook Report with Quarto]
B --> E[Website with Quarto]
E --> F[Publish Website on GitHub Pages]
Tools

Here is a brief description of the tools I use. See their corresponding webpages in this demo for more details (Go to demo website).

Version control: git is a tool used for managing file versions that has become ubiquitous. I also use GitHub to publicly host the folder with the managed files, and as a means of hosting the website.

Containers: Installing software can be a pain, and so can sharing software so others can reproduce your work. Containerized software can alleviate that. A container built for a particular piece of software includes all the dependencies needed to run it. There are many container platforms, but here I use Docker, and also reference Singularity.

Notebooks: These are files where you record how, what and why you are doing something. They intermingle code and explanation, keeping relevant parts together for easier understanding. Because the code used to obtain your results sits alongside them, it cannot be misinterpreted. I use Quarto in this demo, which also allows the notebooks to be published in various forms such as a website, report, or slide show. Quarto improves on RMarkdown and Jupyter, which I would otherwise have recommended.

Workflow manager: A workflow manager is used to handle large-scale data processing (or even small-scale processing, as demonstrated here). Notebooks should generally be limited to processing tables of results, running statistical analyses, and producing plots for interpretation. A workflow manager, on the other hand, can take large quantities of data, process them in parallel, and deliver refined tables of results that are easy to process in a notebook. Nextflow is my favoured workflow manager.
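
To make that division of labour concrete, here is a minimal Python sketch of the idea only. It is not Nextflow and not how this demo is implemented. The sample names, sequences, and the functions `gc_fraction`, `process_sample`, and `run_workflow` are all hypothetical, invented for illustration. It shows many inputs processed in parallel and reduced to a small table of the kind a notebook would then read.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def process_sample(item):
    """Turn one (name, sequence) pair into one row of the results table."""
    name, sequence = item
    return {
        "sample": name,
        "length": len(sequence),
        "gc": round(gc_fraction(sequence), 2),
    }


def run_workflow(samples, max_workers=4):
    """Process every sample in parallel and return rows sorted by sample name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = list(pool.map(process_sample, samples.items()))
    return sorted(rows, key=lambda row: row["sample"])


# Hypothetical toy input; a real workflow would read files fetched upstream.
samples = {"sample_b": "GGCC", "sample_a": "ATGC"}
table = run_workflow(samples)
for row in table:
    print(row["sample"], row["length"], row["gc"])
```

A real workflow manager adds much more than this sketch, such as tracking dependencies between steps, resuming after failures, and running each step inside a container, which is why the demo relies on one rather than on hand-written scripts.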

Lastly, this GitHub repository also enables you to use these tools and play with them at your convenience. Click on the "Open in Gitpod" button to open an ephemeral computing environment, with both a file editor and a command-line terminal, to try this for yourself. The following tools are available in this environment:

  • git
  • Docker
  • Quarto
  • R and RStudio (via a Docker container)
  • Python and Jupyter (via a Docker container)
  • Nextflow

Resources available:

  • 16 CPUs
  • 62 GB memory
  • 285 GB disk space

Acknowledgements

This repository was inspired by the Hypocolypse website. A lot of the things I've learned here have come about through my interactions with the nf-core community, particularly on their Slack channels. Data and exercise material were modified from the Data Carpentries - Data Wrangling and Processing for Genomics and Data Carpentries - Intro to R and RStudio for Genomics lessons.
