Improving the scripts for reproducibility
canfirtina committed Jan 29, 2023
1 parent 836fa99 commit 49a5983
Showing 66 changed files with 50,862 additions and 50,462 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -11,6 +11,9 @@ test/data/*/*.tar
test/data/*/*.tar.gz
test/data/*/fast5_files
test/data/*/slurm-*
+
+/test/data/random_community/1_symbolic_links.sh
+
*/*.paf
test/evaluation/*/*/slurm*.out
test/evaluation/*/*/*.paf
3 changes: 3 additions & 0 deletions README.md
@@ -110,6 +110,9 @@ If you have conda you can simply install the following package (`ont_vbz_hdf_plu
```bash
conda install ont_vbz_hdf_plugin
```
+# Reproducing the results
+
+Please follow the instructions in the [README](test/README.md) file in [test](./test/).

# Upcoming Features

122 changes: 122 additions & 0 deletions test/README.md
@@ -0,0 +1,122 @@
# Reproducing the results

## Prerequisites

We compare RawHash with [UNCALLED](https://github.com/skovaka/UNCALLED) and [Sigmap](https://github.com/haowenz/sigmap).

The links below point to the exact revisions of each tool we use; download and compile them from these revisions:

* [UNCALLED](https://github.com/skovaka/UNCALLED/tree/74a5d4e5b5d02fb31d6e88926e8a0896dc3475cb)
* [Sigmap](https://github.com/haowenz/sigmap/commit/c9a40483264c9514587a36555b5af48d3f054f6f)

We use minimap2 to generate the ground truth mapping information by mapping basecalled sequences to their corresponding reference genomes. We use the following minimap2 version:

* [Minimap2 v2.24](https://github.com/lh3/minimap2/releases/tag/v2.24)
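For reference, generating the ground truth mappings follows the sketch below; `ref.fa` and `reads.fasta` match the file names produced by our download scripts, while the output name and any additional flags our evaluation scripts pass to minimap2 are assumptions here:

```bash
#Map the basecalled reads to their reference genome to produce ground truth mappings in PAF format
#-x map-ont is minimap2's preset for noisy Nanopore reads
minimap2 -x map-ont ref.fa reads.fasta > true_mappings.paf
```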

We use various tools to process and analyze the data we generate with each tool. The following must also be installed on your machine. We suggest using conda to install them at their specified versions, as almost all of them are available in the conda repositories.

* [Python v3.6.15](https://www.python.org/downloads/release/python-3615/)
* [pip v20.2.3 -- via conda](https://anaconda.org/conda-forge/pip/20.2.3/download/noarch/pip-20.2.3-py_0.tar.bz2)
* [ont_vbz_hdf_plugin v1.0.1 -- via conda](https://anaconda.org/bioconda/ont_vbz_hdf_plugin/files?version=1.0.1)

Please make sure that all of these tools are in your `PATH`.
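A quick way to check this is the sketch below; the binary names are assumptions based on the tools listed above, so adjust them to match your installation:

```bash
#Report any required tool that cannot be found in PATH
for tool in minimap2 sigmap uncalled python3 pip3; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing from PATH: $tool"
done
```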

## Creating a Virtual Environment (Optional)

You can create a virtual environment using conda to make sure all the prerequisites are met. The following example replicates the virtual environment we use in our evaluations.

```bash
#We will use the rawhash-env directory to download and compile the tools from their repositories
mkdir -p rawhash-env/bin && cd rawhash-env

#Important: If you have already completed Step 0 and Step 1 as described below, you can skip them and only add the binaries to your PATH again (Step 2).
#Re-adding the binaries to your PATH is necessary after you start a new shell session.

#Optional Step 0: Create a conda environment (we highly recommend conda for easy installation of dependencies).
#If not using conda, the packages below with their specified versions (e.g., python=3.6.15) must be installed manually in your environment
conda create -n rawhash-env python=3.6.15 pip=20.2.3 ont_vbz_hdf_plugin=1.0.1
conda activate rawhash-env

#Step 1: Compile the tools
#Cloning and compiling RawHash
git clone --recursive https://github.com/CMU-SAFARI/RawHash.git rawhash && cd rawhash && make && cp ./bin/rawhash ../bin/ && cd ..

#Cloning and compiling Sigmap
git clone --recursive https://github.com/haowenz/sigmap.git sigmap && cd sigmap && make && cp sigmap ../bin/ && cd ..

#Cloning and compiling UNCALLED
git clone --recursive https://github.com/skovaka/UNCALLED.git uncalled && cd uncalled/submods/bwa && git pull origin master && cd ../../ && pip3 install . && cd ..

#Downloading and compiling minimap2
wget https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24.tar.bz2; tar -xf minimap2-2.24.tar.bz2; rm minimap2-2.24.tar.bz2; mv minimap2-2.24 minimap2; cd minimap2 && make && cp minimap2 ../bin/ && cd ..

#Step 2: Add the binaries to PATH
#If you skipped Step 0 and Step 1 because you completed them earlier, uncomment the following line and execute:
# conda activate rawhash-env
export PATH=$PWD/bin:$PATH
cd rawhash/test/
```
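After these steps, you can confirm that the shell resolves the freshly installed binaries; this is a minimal check, and note that `uncalled` is installed by pip rather than copied into `bin`:

```bash
#Each command should print a path; rawhash, sigmap, and minimap2 should resolve to rawhash-env/bin
which rawhash sigmap minimap2 uncalled
```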

# Datasets

The scripts to download the datasets, along with a [README](./data/README.md), can be found in the [data directory](./data/).

You can start the evaluation step after downloading the datasets as described in the [README](./data/README.md).

To start downloading the datasets, enter the [data](./data/) directory:

```bash
cd ./data
```

# Evaluation

## Read Mapping

The scripts to perform read mapping with UNCALLED, Sigmap, and RawHash, along with a [README](./evaluation/read_mapping/README.md), can be found in the [read mapping directory](./evaluation/read_mapping/).

To start the read mapping evaluations, enter [./evaluation/read_mapping](./evaluation/read_mapping/):

```bash
cd ./evaluation/read_mapping
```

## Relative Abundance Estimation

The scripts to perform relative abundance estimation with UNCALLED, Sigmap, and RawHash, along with a [README](./evaluation/relative_abundance/README.md), can be found in the [relative abundance directory](./evaluation/relative_abundance/).

To start the relative abundance estimation evaluations, enter [./evaluation/relative_abundance](./evaluation/relative_abundance/):

```bash
cd ./evaluation/relative_abundance
```

## Contamination Analysis

The scripts to perform the contamination analysis with UNCALLED, Sigmap, and RawHash, along with a [README](./evaluation/contamination/README.md), can be found in the [contamination directory](./evaluation/contamination/).

To start the contamination analysis evaluations, enter [./evaluation/contamination](./evaluation/contamination/):

```bash
cd ./evaluation/contamination/
```

## Sequence Until

The scripts to evaluate the benefits of the Sequence Until mechanism with UNCALLED and RawHash, along with a [README](./evaluation/relative_abundance/sequenceuntil/README.md), can be found in the [sequence until directory](./evaluation/relative_abundance/sequenceuntil/).

To start evaluating Sequence Until, enter [./evaluation/relative_abundance/sequenceuntil](./evaluation/relative_abundance/sequenceuntil/):

```bash
cd ./evaluation/relative_abundance/sequenceuntil
```

# Reproducing the Manuscript

## Generating the Figures

We will provide the scripts to generate the figures we show in the preprint.

## Generating the Manuscript

We will provide the TeX source for generating the preprint PDF based on the results and figures you generate.
22 changes: 22 additions & 0 deletions test/data/README.md
@@ -0,0 +1,22 @@
# Datasets

We assume your current directory is this directory ([`data`](./)).

Download and process each dataset so that it is fully ready to be used in the evaluation step:

```bash
bash download_d1_sars-cov-2_r94.sh #Download Dataset D1 SARS-CoV-2
bash download_d2_ecoli_r94.sh #Download Dataset D2 E. coli
bash download_d3_yeast.sh #Download Dataset D3 Yeast
bash download_d4_green_algae.sh #Download Dataset D4 Green Algae
bash download_d5_human_na12878.sh #Download Dataset D5 Human HG001

#Prepare the contamination dataset
cd contamination && bash generate_fast5_files.sh && cd ..

#Prepare the relative abundance dataset
cd relative_abundance && bash generate_fast5_files.sh && bash generate_ref.sh && cd ..

#Create a random community of randomly ordered reads from D1-D5, listed in read_ids.txt
cd random_community && bash 0_generate_random_ids.sh && bash 1_symbolic_links.sh && cd ..
```
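After the scripts finish, a quick sanity check such as the following sketch can confirm that each dataset directory is populated; the directory names are the ones created by the download scripts above:

```bash
#Count the FAST5 files placed under each dataset's fast5_files directory
for d in d1_sars-cov-2_r94 d2_ecoli_r94 d3_yeast_r94 d4_green_algae_r94 d5_human_na12878_r94; do
    printf '%s: %d fast5 files\n' "$d" "$(find "$d"/fast5_files -name '*.fast5' | wc -l)"
done
```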
4 changes: 2 additions & 2 deletions test/data/download_d1_sars-cov-2_r94.sh
@@ -1,13 +1,13 @@
#!/bin/bash

mkdir d1_sars-cov-2_r94
+mkdir d1_sars-cov-2_r94/fast5_files/
cd d1_sars-cov-2_r94

#Download FAST5 and FASTQ from the CADDE Centre
wget -qO- https://cadde.s3.climb.ac.uk/SP1-raw.tgz | tar -xzv; rm README

#Moving the fast5 files to fast5_files
-mkdir -p ./fast5_files/; find ./SP1-fast5-mapped -type f -name '*.fast5' | xargs -i{} mv {} ./fast5_files/; rm -rf SP1-fast5-mapped;
+find ./SP1-fast5-mapped -type f -name '*.fast5' | xargs -i{} mv {} ./fast5_files/; rm -rf SP1-fast5-mapped;

#Converting FASTQ to FASTA files
awk 'BEGIN{line = 0}{line++; if(line %4 == 1){print ">"substr($1,2)}else if(line % 4 == 2){print $0}}' SP1-mapped.fastq > reads.fasta; rm SP1-mapped.fastq
4 changes: 2 additions & 2 deletions test/data/download_d2_ecoli_r94.sh
@@ -1,10 +1,10 @@
#!/bin/bash

mkdir d2_ecoli_r94
+mkdir d2_ecoli_r94/fast5_files/
cd d2_ecoli_r94

#Download FAST5 from AWS | Unzip; Mv FAST5 files into the fast5_files directory. NCBI SRA Link: https://trace.ncbi.nlm.nih.gov/Traces/?run=ERR9127551
-wget -qO- https://sra-pub-src-2.s3.amazonaws.com/ERR9127551/ecoli_r9.tar.gz.1 | tar xzv; mkdir -p fast5_files; mv r9/f5s/RefStrains210914_NK/f5s/barcode02/*.fast5 fast5_files; rm -rf r9
+wget -qO- https://sra-pub-src-2.s3.amazonaws.com/ERR9127551/ecoli_r9.tar.gz.1 | tar xzv; mv r9/f5s/RefStrains210914_NK/f5s/barcode02/*.fast5 fast5_files; rm -rf r9

#Download FASTQ from SRA (Note: fastq-dump should exist in your path.) | #Processing the FASTA file so that read names contain the read ids as stored in FAST5 files
fastq-dump -Z --fasta 0 ERR9127551 | awk '{if(substr($1,1,1) == ">"){print ">"$2}else{print $0}}' > reads.fasta
7 changes: 4 additions & 3 deletions test/data/download_d3_yeast.sh
@@ -1,19 +1,20 @@
#!/bin/bash

mkdir d3_yeast_r94
+mkdir d3_yeast_r94/fast5_files/
cd d3_yeast_r94

#Download FAST5 from AWS. NCBI SRA Accession: https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR8648503&display=metadata
wget -qO- https://sra-pub-sars-cov2.s3.amazonaws.com/sra-src/SRR8648503/GLU1II_basecalled_fast5_1.tar.gz.1 | tar -xzv;

-mkdir -p ./fast5_files/; find ./GLU1II_basecalled_fast5_1 -type f -name '*.fast5' | head -50000 | xargs -i{} mv {} ./fast5_files/; rm -rf GLU1II_basecalled_fast5_1;
+find ./GLU1II_basecalled_fast5_1 -type f -name '*.fast5' | head -50000 | xargs -i{} mv {} ./fast5_files/; rm -rf GLU1II_basecalled_fast5_1;

#To extract the reads from FAST5 from this dataset, you will need to clone the following repository and make sure you have h5py <= 2.9 (if you have conda you can do the following):
# conda create -n oldh5 h5py=2.9.0; conda activate oldh5;
# git clone https://github.com/rrwick/Fast5-to-Fastq
# Fast5-to-Fastq/fast5_to_fastq.py fast5_files/ | awk 'BEGIN{line = 0}{line++; if(line %4 == 1){print ">"substr($1,2,36)}else if(line % 4 == 2){print $0}}' > reads.fasta

-#We will provide the direct Zenodo link to this reads.fasta file as well to save from the hassle
+#We have provided the Zenodo link to this reads.fasta file to avoid the hassle above
wget https://zenodo.org/record/7582018/files/reads.fasta

#Downloading S.cerevisiae S288c (Yeast) reference genome from UCSC
wget https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.fa.gz; gunzip sacCer3.fa.gz; mv sacCer3.fa ref.fa;
8 changes: 4 additions & 4 deletions test/data/download_d4_green_algae.sh
@@ -1,14 +1,14 @@
#!/bin/bash

mkdir -p d4_green_algae_r94
+mkdir -p d4_green_algae_r94/fast5_files/
cd d4_green_algae_r94

#Download FAST5 from AWS
-wget -qO- https://sra-pub-src-2.s3.amazonaws.com/ERR3237140/Chlamydomonas_0.tar.gz.1 | tar xzv
+wget -qO- https://sra-pub-src-2.s3.amazonaws.com/ERR3237140/Chlamydomonas_0.tar.gz.1 | tar xzv;

-find Chlamydomonas_0/reads/downloads/pass/ -type f -name '*.fast5' | xargs -i{} mv {} fast5_files/
+find ./Chlamydomonas_0/reads/downloads/pass/ -type f -name '*.fast5' | xargs -i{} mv {} fast5_files/

-awk 'BEGIN{line = 0}{line++; if(line %4 == 1){print ">"substr($1,2)}else if(line % 4 == 2){print $0}}' Chlamydomonas_0/reads/downloads/pass/*.fastq > reads.fasta
+awk 'BEGIN{line = 0}{line++; if(line %4 == 1){print ">"substr($1,2)}else if(line % 4 == 2){print $0}}' ./Chlamydomonas_0/reads/downloads/pass/*.fastq > reads.fasta

rm -rf Chlamydomonas_0

6 changes: 4 additions & 2 deletions test/data/download_d5_human_na12878.sh
@@ -1,18 +1,20 @@
#!/bin/bash

mkdir -p d5_human_na12878_r94
+mkdir -p d5_human_na12878_r94/fast5_files/
cd d5_human_na12878_r94

#Download FAST5 from https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md
wget -qO- http:https://s3.amazonaws.com/nanopore-human-wgs/rel6/MultiFast5Tars/FAB42260-4177064552_Multi_Fast5.tar | tar xv

-mkdir -p ./fast5_files/; find ./UBC -type f -name '*.fast5' | xargs -i{} mv {} ./fast5_files/; rm -rf UBC;
+find ./UBC -type f -name '*.fast5' | xargs -i{} mv {} ./fast5_files/;

#Download FASTQ
wget -qO- http:https://s3.amazonaws.com/nanopore-human-wgs/rel6/FASTQTars/FAB42260-4177064552_Multi.tar | tar xv

zcat UBC/FAB42260-4177064552_Multi/fastq/*.fastq.gz | awk 'BEGIN{line = 0}{line++; if(line %4 == 1){print ">"substr($1,2,36)}else if(line % 4 == 2){print $0}}' > reads.fasta
+
+rm -rf UBC;

#Downloading CHM13v2 (hs1) Human reference genome
wget https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz; gunzip hs1.fa.gz; mv hs1.fa ref.fa

4 changes: 3 additions & 1 deletion test/data/random_community/0_generate_random_ids.sh
@@ -2,7 +2,9 @@

echo '#!/bin/bash' > 1_symbolic_links.sh;
echo >> 1_symbolic_links.sh;
+echo 'mkdir -p fast5_files' >> 1_symbolic_links.sh;
+echo >> 1_symbolic_links.sh;
echo 'cd fast5_files' >> 1_symbolic_links.sh;
echo >> 1_symbolic_links.sh;
-shuf -n50000 ../read_ids.txt | awk 'BEGIN{id=0}{print "ln -s "$0" read"count++".fast5"}' >> 1_symbolic_links.sh
+shuf -n50000 read_ids.txt | awk 'BEGIN{id=0}{print "ln -s "$0" read"count++".fast5"}' >> 1_symbolic_links.sh
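For context, the `1_symbolic_links.sh` script generated by the lines above should follow this sketch; the link sources come from `read_ids.txt`, so the paths shown here are placeholders:

```bash
#!/bin/bash

mkdir -p fast5_files

cd fast5_files

#50,000 shuffled entries follow, for example:
ln -s /path/from/read_ids.txt/some_read.fast5 read0.fast5
ln -s /path/from/read_ids.txt/another_read.fast5 read1.fast5
```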

