Improving the scripts for reproducibility
canfirtina committed Jan 29, 2023
1 parent 836fa99 commit 49a5983
Showing 66 changed files with 50,862 additions and 50,462 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -11,6 +11,9 @@ test/data/*/*.tar
test/data/*/*.tar.gz
test/data/*/fast5_files
test/data/*/slurm-*
+
+/test/data/random_community/1_symbolic_links.sh
+
*/*.paf
test/evaluation/*/*/slurm*.out
test/evaluation/*/*/*.paf
3 changes: 3 additions & 0 deletions README.md
@@ -110,6 +110,9 @@ If you have conda you can simply install the following package (`ont_vbz_hdf_plu
```bash
conda install ont_vbz_hdf_plugin
```
+# Reproducing the results
+
+Please follow the instructions in the [README](test/README.md) file in [test](./test/).

# Upcoming Features

122 changes: 122 additions & 0 deletions test/README.md
@@ -0,0 +1,122 @@
# Reproducing the results

## Prerequisites

We compare RawHash with [UNCALLED](https://github.com/skovaka/UNCALLED) and [Sigmap](https://github.com/haowenz/sigmap).

The links below point to the exact revisions of each tool we use; download and compile them from these revisions:

* [UNCALLED](https://github.com/skovaka/UNCALLED/tree/74a5d4e5b5d02fb31d6e88926e8a0896dc3475cb)
* [Sigmap](https://github.com/haowenz/sigmap/commit/c9a40483264c9514587a36555b5af48d3f054f6f)

We use minimap2 to generate the ground truth mapping information by mapping basecalled sequences to their corresponding reference genomes. We use the following minimap2 version:

* [Minimap2 v2.24](https://github.com/lh3/minimap2/releases/tag/v2.24)
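For reference, generating the ground truth mappings follows the sketch below; `ref.fa` and `reads.fasta` match the file names produced by our download scripts, while the output name and any additional flags our evaluation scripts pass to minimap2 are assumptions here:

```bash
#Map the basecalled reads to their reference genome to produce ground truth mappings in PAF format
#-x map-ont is minimap2's preset for noisy Nanopore reads
minimap2 -x map-ont ref.fa reads.fasta > true_mappings.paf
```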

We use various tools to process and analyze the data we generate with each tool. The following must also be installed on your machine. We suggest using conda to install them at their specified versions, as almost all of them are available in the conda repositories.

* [Python v3.6.15](https://www.python.org/downloads/release/python-3615/)
* [pip v20.2.3 -- via conda](https://anaconda.org/conda-forge/pip/20.2.3/download/noarch/pip-20.2.3-py_0.tar.bz2)
* [ont_vbz_hdf_plugin v1.0.1 -- via conda](https://anaconda.org/bioconda/ont_vbz_hdf_plugin/files?version=1.0.1)

Please make sure that all of these tools are in your `PATH`.
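A quick way to check this is the sketch below; the binary names are assumptions based on the tools listed above, so adjust them to match your installation:

```bash
#Report any required tool that cannot be found in PATH
for tool in minimap2 sigmap uncalled python3 pip3; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing from PATH: $tool"
done
```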

## Creating a Virtual Environment (Optional)

You can create a virtual environment using conda to make sure all the prerequisites are met. The following example replicates the virtual environment we use in our evaluations.

```bash
#We will use the rawhash-env directory to download and compile the tools from their repositories
mkdir -p rawhash-env/bin && cd rawhash-env

#Important: If you have already completed Step 0 and Step 1 as described below, you can skip them and only add the binaries to your PATH again (Step 2).
#Re-adding the binaries to your PATH is necessary after you start a new shell session.

#Optional Step 0: Create a conda environment (we highly recommend conda for easy installation of dependencies).
#If not using conda, the packages below with their specified versions (e.g., python=3.6.15) must be installed manually in your environment
conda create -n rawhash-env python=3.6.15 pip=20.2.3 ont_vbz_hdf_plugin=1.0.1
conda activate rawhash-env

#Step 1: Compile the tools
#Cloning and compiling RawHash
git clone --recursive https://github.com/CMU-SAFARI/RawHash.git rawhash && cd rawhash && make && cp ./bin/rawhash ../bin/ && cd ..

#Cloning and compiling Sigmap
git clone --recursive https://github.com/haowenz/sigmap.git sigmap && cd sigmap && make && cp sigmap ../bin/ && cd ..

#Cloning and compiling UNCALLED
git clone --recursive https://github.com/skovaka/UNCALLED.git uncalled && cd uncalled/submods/bwa && git pull origin master && cd ../../ && pip3 install . && cd ..

#Downloading and compiling minimap2
wget https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24.tar.bz2; tar -xf minimap2-2.24.tar.bz2; rm minimap2-2.24.tar.bz2; mv minimap2-2.24 minimap2; cd minimap2 && make && cp minimap2 ../bin/ && cd ..

#Step 2: Add the binaries to PATH
#If you skipped Step 0 and Step 1 because you completed them earlier, uncomment the following line and execute:
# conda activate rawhash-env
export PATH=$PWD/bin:$PATH
cd rawhash/test/
```
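After these steps, you can confirm that the shell resolves the freshly installed binaries; this is a minimal check, and note that `uncalled` is installed by pip rather than copied into `bin`:

```bash
#Each command should print a path; rawhash, sigmap, and minimap2 should resolve to rawhash-env/bin
which rawhash sigmap minimap2 uncalled
```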

# Datasets

The scripts to download the datasets, along with a [README](./data/README.md), can be found in the [data directory](./data/).

You can start the evaluation step after downloading the datasets as described in the [README](./data/README.md).

To start downloading the datasets, enter the [data](./data/) directory:

```bash
cd ./data
```

# Evaluation

## Read Mapping

The scripts to perform read mapping with UNCALLED, Sigmap, and RawHash, along with a [README](./evaluation/read_mapping/README.md), can be found in the [read mapping directory](./evaluation/read_mapping/).

To start the read mapping evaluations, enter [./evaluation/read_mapping](./evaluation/read_mapping/):

```bash
cd ./evaluation/read_mapping
```

## Relative Abundance Estimation

The scripts to perform relative abundance estimation with UNCALLED, Sigmap, and RawHash, along with a [README](./evaluation/relative_abundance/README.md), can be found in the [relative abundance directory](./evaluation/relative_abundance/).

To start the relative abundance estimation evaluations, enter [./evaluation/relative_abundance](./evaluation/relative_abundance/):

```bash
cd ./evaluation/relative_abundance
```

## Contamination Analysis

The scripts to perform the contamination analysis with UNCALLED, Sigmap, and RawHash, along with a [README](./evaluation/contamination/README.md), can be found in the [contamination directory](./evaluation/contamination/).

To start the contamination analysis evaluations, enter [./evaluation/contamination](./evaluation/contamination/):

```bash
cd ./evaluation/contamination/
```

## Sequence Until

The scripts to evaluate the benefits of the Sequence Until mechanism with UNCALLED and RawHash, along with a [README](./evaluation/relative_abundance/sequenceuntil/README.md), can be found in the [sequence until directory](./evaluation/relative_abundance/sequenceuntil/).

To start evaluating Sequence Until, enter [./evaluation/relative_abundance/sequenceuntil](./evaluation/relative_abundance/sequenceuntil/):

```bash
cd ./evaluation/relative_abundance/sequenceuntil
```

# Reproducing the Manuscript

## Generating the Figures

We will provide the scripts to generate the figures we show in the preprint.

## Generating the Manuscript

We will provide the TeX source for generating the preprint PDF based on the results and figures you generate.
22 changes: 22 additions & 0 deletions test/data/README.md
@@ -0,0 +1,22 @@
# Datasets

We assume your current directory is this directory ([`data`](./)).

Download and process each dataset so that it is fully ready to be used in the evaluation step:

```bash
bash download_d1_sars-cov-2_r94.sh #Download Dataset D1 SARS-CoV-2
bash download_d2_ecoli_r94.sh #Download Dataset D2 E. coli
bash download_d3_yeast.sh #Download Dataset D3 Yeast
bash download_d4_green_algae.sh #Download Dataset D4 Green Algae
bash download_d5_human_na12878.sh #Download Dataset D5 Human HG001

#Prepare the contamination dataset
cd contamination && bash generate_fast5_files.sh && cd ..

#Prepare the relative abundance dataset
cd relative_abundance && bash generate_fast5_files.sh && bash generate_ref.sh && cd ..

#Create a random community of randomly ordered reads from D1-D5, listed in read_ids.txt
cd random_community && bash 0_generate_random_ids.sh && bash 1_symbolic_links.sh && cd ..
```
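After the scripts finish, a quick sanity check such as the following sketch can confirm that each dataset directory is populated; the directory names are the ones created by the download scripts above:

```bash
#Count the FAST5 files placed under each dataset's fast5_files directory
for d in d1_sars-cov-2_r94 d2_ecoli_r94 d3_yeast_r94 d4_green_algae_r94 d5_human_na12878_r94; do
    printf '%s: %d fast5 files\n' "$d" "$(find "$d"/fast5_files -name '*.fast5' | wc -l)"
done
```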
4 changes: 2 additions & 2 deletions test/data/download_d1_sars-cov-2_r94.sh
@@ -1,13 +1,13 @@
#!/bin/bash

mkdir d1_sars-cov-2_r94
+mkdir d1_sars-cov-2_r94/fast5_files/
cd d1_sars-cov-2_r94

#Download FAST5 and FASTQ from the CADDE Centre
wget -qO- https://cadde.s3.climb.ac.uk/SP1-raw.tgz | tar -xzv; rm README

#Moving the fast5 files to fast5_files
-mkdir -p ./fast5_files/; find ./SP1-fast5-mapped -type f -name '*.fast5' | xargs -i{} mv {} ./fast5_files/; rm -rf SP1-fast5-mapped;
+find ./SP1-fast5-mapped -type f -name '*.fast5' | xargs -i{} mv {} ./fast5_files/; rm -rf SP1-fast5-mapped;

#Converting FASTQ to FASTA files
awk 'BEGIN{line = 0}{line++; if(line %4 == 1){print ">"substr($1,2)}else if(line % 4 == 2){print $0}}' SP1-mapped.fastq > reads.fasta; rm SP1-mapped.fastq
4 changes: 2 additions & 2 deletions test/data/download_d2_ecoli_r94.sh
@@ -1,10 +1,10 @@
#!/bin/bash

mkdir d2_ecoli_r94
+mkdir d2_ecoli_r94/fast5_files/
cd d2_ecoli_r94

#Download FAST5 from AWS | Unzip; Mv FAST5 files into the fast5_files directory. NCBI SRA Link: https://trace.ncbi.nlm.nih.gov/Traces/?run=ERR9127551
-wget -qO- https://sra-pub-src-2.s3.amazonaws.com/ERR9127551/ecoli_r9.tar.gz.1 | tar xzv; mkdir -p fast5_files; mv r9/f5s/RefStrains210914_NK/f5s/barcode02/*.fast5 fast5_files; rm -rf r9
+wget -qO- https://sra-pub-src-2.s3.amazonaws.com/ERR9127551/ecoli_r9.tar.gz.1 | tar xzv; mv r9/f5s/RefStrains210914_NK/f5s/barcode02/*.fast5 fast5_files; rm -rf r9

#Download FASTQ from SRA (Note: fastq-dump should exist in your path.) | #Processing the FASTA file so that read names contain the read ids as stored in FAST5 files
fastq-dump -Z --fasta 0 ERR9127551 | awk '{if(substr($1,1,1) == ">"){print ">"$2}else{print $0}}' > reads.fasta
7 changes: 4 additions & 3 deletions test/data/download_d3_yeast.sh
@@ -1,19 +1,20 @@
#!/bin/bash

mkdir d3_yeast_r94
+mkdir d3_yeast_r94/fast5_files/
cd d3_yeast_r94

#Download FAST5 from AWS. NCBI SRA Accession: https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR8648503&display=metadata
wget -qO- https://sra-pub-sars-cov2.s3.amazonaws.com/sra-src/SRR8648503/GLU1II_basecalled_fast5_1.tar.gz.1 | tar -xzv;

-mkdir -p ./fast5_files/; find ./GLU1II_basecalled_fast5_1 -type f -name '*.fast5' | head -50000 | xargs -i{} mv {} ./fast5_files/; rm -rf GLU1II_basecalled_fast5_1;
+find ./GLU1II_basecalled_fast5_1 -type f -name '*.fast5' | head -50000 | xargs -i{} mv {} ./fast5_files/; rm -rf GLU1II_basecalled_fast5_1;

#To extract the reads from FAST5 from this dataset, you will need to clone the following repository and make sure you have h5py <= 2.9 (if you have conda you can do the following):
# conda create -n oldh5 h5py=2.9.0; conda activate oldh5;
# git clone https://github.com/rrwick/Fast5-to-Fastq
# Fast5-to-Fastq/fast5_to_fastq.py fast5_files/ | awk 'BEGIN{line = 0}{line++; if(line %4 == 1){print ">"substr($1,2,36)}else if(line % 4 == 2){print $0}}' > reads.fasta

-#We will provide the direct Zenodo link to this reads.fasta file as well to save from the hassle
+#We have provided the Zenodo link to this reads.fasta file to avoid the hassle above
wget https://zenodo.org/record/7582018/files/reads.fasta

#Downloading S.cerevisiae S288c (Yeast) reference genome from UCSC
wget https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.fa.gz; gunzip sacCer3.fa.gz; mv sacCer3.fa ref.fa;
8 changes: 4 additions & 4 deletions test/data/download_d4_green_algae.sh
@@ -1,14 +1,14 @@
#!/bin/bash

mkdir -p d4_green_algae_r94
+mkdir -p d4_green_algae_r94/fast5_files/
cd d4_green_algae_r94

#Download FAST5 from AWS
-wget -qO- https://sra-pub-src-2.s3.amazonaws.com/ERR3237140/Chlamydomonas_0.tar.gz.1 | tar xzv
+wget -qO- https://sra-pub-src-2.s3.amazonaws.com/ERR3237140/Chlamydomonas_0.tar.gz.1 | tar xzv;

-find Chlamydomonas_0/reads/downloads/pass/ -type f -name '*.fast5' | xargs -i{} mv {} fast5_files/
+find ./Chlamydomonas_0/reads/downloads/pass/ -type f -name '*.fast5' | xargs -i{} mv {} fast5_files/

-awk 'BEGIN{line = 0}{line++; if(line %4 == 1){print ">"substr($1,2)}else if(line % 4 == 2){print $0}}' Chlamydomonas_0/reads/downloads/pass/*.fastq > reads.fasta
+awk 'BEGIN{line = 0}{line++; if(line %4 == 1){print ">"substr($1,2)}else if(line % 4 == 2){print $0}}' ./Chlamydomonas_0/reads/downloads/pass/*.fastq > reads.fasta

rm -rf Chlamydomonas_0

6 changes: 4 additions & 2 deletions test/data/download_d5_human_na12878.sh
@@ -1,18 +1,20 @@
#!/bin/bash

mkdir -p d5_human_na12878_r94
+mkdir -p d5_human_na12878_r94/fast5_files/
cd d5_human_na12878_r94

#Download FAST5 from https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md
wget -qO- http:https://s3.amazonaws.com/nanopore-human-wgs/rel6/MultiFast5Tars/FAB42260-4177064552_Multi_Fast5.tar | tar xv

-mkdir -p ./fast5_files/; find ./UBC -type f -name '*.fast5' | xargs -i{} mv {} ./fast5_files/; rm -rf UBC;
+find ./UBC -type f -name '*.fast5' | xargs -i{} mv {} ./fast5_files/;

#Download FASTQ
wget -qO- http:https://s3.amazonaws.com/nanopore-human-wgs/rel6/FASTQTars/FAB42260-4177064552_Multi.tar | tar xv

zcat UBC/FAB42260-4177064552_Multi/fastq/*.fastq.gz | awk 'BEGIN{line = 0}{line++; if(line %4 == 1){print ">"substr($1,2,36)}else if(line % 4 == 2){print $0}}' > reads.fasta
+
+rm -rf UBC;

#Downloading CHM13v2 (hs1) Human reference genome
wget https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz; gunzip hs1.fa.gz; mv hs1.fa ref.fa

4 changes: 3 additions & 1 deletion test/data/random_community/0_generate_random_ids.sh
@@ -2,7 +2,9 @@

echo '#!/bin/bash' > 1_symbolic_links.sh;
echo >> 1_symbolic_links.sh;
+echo 'mkdir -p fast5_files' >> 1_symbolic_links.sh;
+echo >> 1_symbolic_links.sh;
echo 'cd fast5_files' >> 1_symbolic_links.sh;
echo >> 1_symbolic_links.sh;
-shuf -n50000 ../read_ids.txt | awk 'BEGIN{id=0}{print "ln -s "$0" read"count++".fast5"}' >> 1_symbolic_links.sh
+shuf -n50000 read_ids.txt | awk 'BEGIN{id=0}{print "ln -s "$0" read"count++".fast5"}' >> 1_symbolic_links.sh
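For context, the `1_symbolic_links.sh` script generated by the lines above should follow this sketch; the link sources come from `read_ids.txt`, so the paths shown here are placeholders:

```bash
#!/bin/bash

mkdir -p fast5_files

cd fast5_files

#50,000 shuffled entries follow, for example:
ln -s /path/from/read_ids.txt/some_read.fast5 read0.fast5
ln -s /path/from/read_ids.txt/another_read.fast5 read1.fast5
```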

