assemblyStats

----------------------------------------------------
                                    \\---------//       
      ___  ___        _   ___  ___    \\-----//        
       |  (___  |  | / _   |   |_       \-//         
      _|_  ___) |__| \_/  _|_  |        // \        
                                      //-----\\       
                                    //---------\\       
      isugifNF/assemblyStats  v1.0.0       
    ----------------------------------------------------

Genome Informatics Facility

Introduction

isugifNF/assemblyStats is a Nextflow pipeline for assessing the quality of one or more genome assemblies. It runs three separate programs: AssemblyStats, AssemblathonStats, and BUSCO.

AssemblyStats

I found the AssemblathonStats output format hard to read, so I rewrote the script to report the same statistics in an easier-to-read layout, with a few additional outputs such as the five largest and five smallest scaffolds.

Example

```
Number of Scaffolds:                 3
Total Nucleotide content             4658411
Longest Scaffold:                  4553770       >NC_017168.1 Yersinia pestis A1122, complete sequence
Shortest Scaffolds:                  8431        >NC_017170.1 Yersinia pestis A1122 plasmid unnamed, complete sequence
Mean Scaffold Size                   1552803
Median Scaffold length               96210.0
N50 Scaffold length                  4553770
L50 Scaffold length                  1
N90 Scaffold length                  4553770
L90 Scaffold length                  1
                                      #Scaffs   % Scaffolds      Nucleotides     % Nucleotide Content
Number of Scaffolds [0-1K) nt         0          0.0 %             0               0.0 %
Number of Scaffolds [1K-10K) nt       1          33.33 %         8431       0.180 %
Number of Scaffolds [10K-100K) nt     1          33.33 %         96210   2.065 %
Number of Scaffolds [100K-1M) nt      0          0.0 %   0       0.0 %
Number of Scaffolds [1M-10M) nt       1          33.33 %         4553770         97.75 %
Number of Scaffolds > 10M nt          0          0.0 %   0       0.0 %

```
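As a rough illustration of how the N50/L50 and N90/L90 lines in this report are defined, here is a minimal Python sketch. It is not the pipeline's own code; the function name `nx_lx` is hypothetical, and the three scaffold lengths are taken from the example output above.

```python
from statistics import mean, median


def nx_lx(lengths, fraction):
    """Return (Nx, Lx) for a list of scaffold lengths.

    Scaffolds are added longest first until the running total reaches
    `fraction` of the assembly size. Nx is the length of the scaffold that
    crosses the threshold, Lx is how many scaffolds were needed.
    """
    threshold = sum(lengths) * fraction
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")


# Scaffold lengths from the example report (longest, middle, shortest).
lengths = [4553770, 96210, 8431]

print("Total nucleotides:", sum(lengths))
print("Mean scaffold size:", int(mean(lengths)))  # truncated, not rounded
print("Median scaffold length:", median(lengths))
print("N50, L50:", nx_lx(lengths, 0.5))
print("N90, L90:", nx_lx(lengths, 0.9))
```

Because one scaffold holds almost 98% of the assembly, both N50 and N90 equal the longest scaffold and both L50 and L90 are 1, which agrees with the report. The mean is truncated here to match the 1552803 shown above; AssemblathonStats appears to round it instead.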

AssemblathonStats

Technically, this is a modified version of the original AssemblathonStats script (new_Assemblathon.pl in this pipeline) that also reports N90 and L90.

Example

```
---------------- Information for assembly 'Ecoli.fasta' ----------------
                                 Number of scaffolds          3
                             Total size of scaffolds    4658411
                                    Longest scaffold    4553770
                                   Shortest scaffold       8431
                         Number of scaffolds > 1K nt          3 100.0%
                        Number of scaffolds > 10K nt          2  66.7%
                       Number of scaffolds > 100K nt          1  33.3%
                         Number of scaffolds > 1M nt          1  33.3%
                        Number of scaffolds > 10M nt          0   0.0%
                                  Mean scaffold size    1552804
                                Median scaffold size      96210
                                 N50 scaffold length    4553770
                                  L50 scaffold count          1
                                 n90 scaffold length    4553770
                                  L90 scaffold count          1
                                         scaffold %A      26.07
                                         scaffold %C      23.69
                                         scaffold %G      23.94
                                         scaffold %T      26.30
                                         scaffold %N       0.00
                                 scaffold %non-ACGTN       0.00
                     Number of scaffold non-ACGTN nt          0
        Percentage of assembly in scaffolded contigs       0.0%
      Percentage of assembly in unscaffolded contigs     100.0%
              Average number of contigs per scaffold        1.0
              Average length of break (>25 Ns) between contigs in scaffold          0
                                   Number of contigs          3
                      Number of contigs in scaffolds          0
                  Number of contigs not in scaffolds          3
                               Total size of contigs    4658411
                                      Longest contig    4553770
                                     Shortest contig       8431
                           Number of contigs > 1K nt          3 100.0%
                          Number of contigs > 10K nt          2  66.7%
                         Number of contigs > 100K nt          1  33.3%
                           Number of contigs > 1M nt          1  33.3%
                          Number of contigs > 10M nt          0   0.0%
                                    Mean contig size    1552804
                                  Median contig size      96210
                                   N50 contig length    4553770
                                    L50 contig count          1
                                   n90 contig length    4553770
                                    L90 contig count          1
                                           contig %A      26.07
                                           contig %C      23.69
                                           contig %G      23.94
                                           contig %T      26.30
                                           contig %N       0.00
                                   contig %non-ACGTN       0.00
                       Number of contig non-ACGTN nt          0
```
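The two reports summarize scaffold sizes differently: AssemblathonStats counts scaffolds above each cutoff (cumulative), while AssemblyStats places each scaffold in exactly one half-open size bin. The sketch below illustrates the difference using the three scaffold lengths from the examples. The function names are hypothetical, and treating the last bin as "10M or longer" is an assumption.

```python
# Scaffold lengths from the example reports.
lengths = [4553770, 96210, 8431]

# Size cutoffs used by both reports.
CUTOFFS = {"1K": 1_000, "10K": 10_000, "100K": 100_000,
           "1M": 1_000_000, "10M": 10_000_000}


def counts_above(lengths, cutoffs=CUTOFFS):
    """Cumulative counts: how many scaffolds are longer than each cutoff."""
    return {label: sum(1 for n in lengths if n > cutoff)
            for label, cutoff in cutoffs.items()}


def counts_in_bins(lengths, cutoffs=CUTOFFS):
    """Half-open bins: (scaffold count, nucleotide total) per size range."""
    edges = [0] + list(cutoffs.values()) + [float("inf")]
    names = ["0"] + list(cutoffs.keys()) + ["inf"]
    bins = {}
    for i in range(len(edges) - 1):
        low, high = edges[i], edges[i + 1]
        members = [n for n in lengths if low <= n < high]
        bins[f"[{names[i]}-{names[i + 1]})"] = (len(members), sum(members))
    return bins


total = sum(lengths)
for label, count in counts_above(lengths).items():
    print(f"> {label} nt: {count} ({100 * count / len(lengths):.1f}%)")
for label, (count, nucleotides) in counts_in_bins(lengths).items():
    print(f"{label} nt: {count} scaffolds, {100 * nucleotides / total:.3f}% of nucleotides")
```

The cumulative counts come out as 3, 2, 1, 1, 0, while the bins hold 0, 1, 1, 0, 1, 0 scaffolds, so the same three scaffolds explain both tables.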

BUSCO Orthologs

Benchmarking Universal Single-Copy Orthologs (BUSCO)

Example


        ***** Results:*****

        C:100.0%[S:100.0%,D:0.0%],F:0.0%,M:0.0%,n:124      
        124     Complete BUSCOs (C)                        
        124     Complete and single-copy BUSCOs (S)        
        0       Complete and duplicated BUSCOs (D)         
        0       Fragmented BUSCOs (F)                      
        0       Missing BUSCOs (M)                         
        124     Total BUSCO groups searched
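If you want to collect the one-line BUSCO summary from several genomes into a table, a small parser can help. This is only a sketch: the regular expression assumes the summary line looks exactly like the example above (other BUSCO versions may add fields), and `parse_busco_summary` is a hypothetical helper, not part of BUSCO or this pipeline.

```python
import re

# Matches lines such as: C:100.0%[S:100.0%,D:0.0%],F:0.0%,M:0.0%,n:124
SUMMARY = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the percentages (floats) and group count n (int) as a dict."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["n"] = int(match.group("n"))
    return result


scores = parse_busco_summary("C:100.0%[S:100.0%,D:0.0%],F:0.0%,M:0.0%,n:124")
print(scores)

# Sanity checks implied by the category definitions:
# complete = single-copy + duplicated, and C + F + M covers all groups.
assert abs(scores["C"] - (scores["S"] + scores["D"])) < 0.2
assert abs(scores["C"] + scores["F"] + scores["M"] - 100.0) < 0.2
```

The tolerance of 0.2 allows for rounding in the reported percentages.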

In short: run assembly statistics on a genome assembly using BUSCO, assemblyStats, and new_Assemblathon.pl.

Installation

Assuming the system already has Nextflow and Singularity installed, running the following command should automatically pull the pipeline and print the usage statement.

nextflow run isugifNF/assemblyStats --help

Usage:
The typical command for running the pipeline are as follows:
  nextflow run isugifNF/assemblyStats --genomes "*fasta" --outdir newStats3 --threads 16 --options "-l eukaryota_odb10" -profile condo,singularity
  nextflow run isugifNF/assemblyStats --genomes "*fasta" --outdir newStats3 --threads 16 --options "-l mollusca_odb10" -profile ceres,singularity --buscoOnly
Mandatory arguments:
  --genomes                      genome assembly fasta files to run stats on. (./data/*.fasta)
  -profile singularity (docker)  as of now, this workflow only works using singularity or docker and requires this profile [be sure singularity is in your path or loaded by a module]
Optional arguments:
  --outdir                       Output directory to place final output
  --threads                      Number of CPUs to use during the NanoPlot job [default:40]
  --queueSize                    Maximum number of jobs to be queued [default:18]
  --options                      [default:'--auto-lineage'], you may also consider  "--auto-lineage-prok","--auto-lineage-euk", "-l eukaryota_odb10"
  --listDatasets                 Display the list of available BUSCO lineage datasets to use in --options pipeline parameter.
  --buscoOnly                    When you just want to run a different lineage and not rerun the assemblathon stats
  --account                      Some HPCs require you supply an account name for tracking usage.  You can supply that here.
  --help                         This usage statement.

Singularity Container

The programs required for this workflow are provided in two containers, docker://swift and docker://ezlabgva/busco, which Nextflow should pull automatically. You only need to run singularity pull yourself if the network connection is unstable. On HPC machines, the automatic pull may time out, in which case we recommend pulling the Singularity images manually and placing them in the work/singularity folder.

Manually pulling the Singularity images

Run the following to manually pull the two Singularity containers. If you are on an HPC, you may need to do this from the data transfer node (DTN).

singularity pull  --name swift.img docker://swift > /dev/null
singularity pull  --name ezlabgva-busco-v5.1.2_cv1.img docker://ezlabgva/busco:v5.1.2_cv1 > /dev/null

ls *.img
#> ezlabgva-busco-v5.1.2_cv1.img  swift.img

When you first attempted to run this pipeline, Nextflow should have created a work/singularity folder in the current directory. If it did not, create the folder and move the Singularity images into it.

mkdir -p work/singularity
mv *.img work/singularity/.

Example Run

Download two bacterial genomes

# Yersinia pestis
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Yersinia_pestis/representative/GCF_000222975.1_ASM22297v1/GCF_000222975.1_ASM22297v1_genomic.fna.gz
mv GCF_000222975.1_ASM22297v1_genomic.fna.gz Yerpes.fasta.gz
gunzip Yerpes.fasta.gz

# E. coli
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
mv GCF_000005845.2_ASM584v2_genomic.fna.gz Ecoli.fasta.gz
gunzip Ecoli.fasta.gz
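Before launching the pipeline, it can be worth confirming that the downloads unpacked into valid FASTA files. The following is a minimal sketch, not part of the pipeline; `fasta_lengths` is a hypothetical helper that reads plain or gzipped FASTA and reports the length of each sequence.

```python
import gzip
from pathlib import Path


def fasta_lengths(path):
    """Map each sequence ID (first word of the header line) to its length."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    lengths = {}
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Fall back to a positional name if the header is empty.
                words = line[1:].split()
                name = words[0] if words else f"seq{len(lengths) + 1}"
                lengths[name] = 0
            elif line and name is not None:
                lengths[name] += len(line)
    return lengths


# Summarize every FASTA file in the current directory.
for fasta in sorted(Path(".").glob("*.fasta")):
    lengths = fasta_lengths(fasta)
    print(fasta.name, len(lengths), "sequences,", sum(lengths.values()), "nt")
```

Judging by the example reports earlier in this README, the Yersinia pestis assembly should show 3 sequences totalling 4658411 nt.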

Run Nextflow. I am using Singularity on my remote HPC, called Nova. This repo contains config files for several machines that show how to submit jobs through the SLURM scheduler.

nextflow run isugifNF/assemblyStats \
  --genomes "*.fasta"  \
  --options "-l bacteria_odb10" \
  -profile singularity,nova

Expected Output

The default output directory is out_dir; change it by specifying the --outdir parameter on the command line. It contains:

| Folder or file | Description |
| --- | --- |
| BUSCO | Soft links to BUSCO output files, logs, and BLAST results |
| BUSCOResults | short_summary.specific final output |
| assemblathonStats | Text file of AssemblathonStats statistics for the genome |
| assemblyStats | Text file of AssemblyStats statistics for the genome |
| Busco_version.txt | Version of BUSCO used in the container |
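After a run finishes, a quick check that every expected entry exists can catch a failed step early. This sketch assumes the entry names listed above and the default out_dir location; `missing_outputs` is a hypothetical helper, not something the pipeline provides.

```python
from pathlib import Path

# Entry names as listed in this README.
EXPECTED = [
    "BUSCO",
    "BUSCOResults",
    "assemblathonStats",
    "assemblyStats",
    "Busco_version.txt",
]


def missing_outputs(outdir="out_dir"):
    """Return the expected entries that are absent from the output directory."""
    root = Path(outdir)
    return [name for name in EXPECTED if not (root / name).exists()]


missing = missing_outputs("out_dir")
if missing:
    print("Missing from out_dir:", ", ".join(missing))
else:
    print("All expected outputs are present.")
```

Pass a different path if you set --outdir, and note that a run with --buscoOnly may reasonably lack the two statistics folders.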

Dependencies if running locally

Nextflow is written in Groovy, which requires Java version 1.8 or greater (check your version with java -version). Beyond that, Nextflow can be installed from any working Linux command line.

java -version
curl -s https://get.nextflow.io | bash

# Check to see if nextflow is created
ls -ltr nextflow
#> total 32
#> -rwx--x--x  1 username  staff    15K Aug 12 12:47 nextflow

First let's look at the help message

nextflow run isugifNF/assemblyStats --help

Output

              N E X T F L O W  ~  version 20.07.1
              Launching `isugifNF/assemblyStats/main.nf` [magical_colden] - revision: a156628d62
              ----------------------------------------------------
                                                  \\---------//       
                    ___  ___        _   ___  ___    \\-----//        
                     |  (___  |  | / _   |   |_       \-//         
                    _|_  ___) |__| \_/  _|_  |        // \        
                                                    //-----\\       
                                                  //---------\\       
                    isugifNF/nanoQCtrim  v1.0.0       
                  ----------------------------------------------------
              Usage:
                    The typical command for running the pipeline are as follows:

                    nextflow run isugifNF/assemblyStats --genomes "*fasta" --outdir newStats3 --threads 16 --options "-l eukaryota_odb10" -profile condo,singularity
                    nextflow run isugifNF/assemblyStats --genomes "*fasta" --outdir newStats3 --threads 16 --options "-l mollusca_odb10" -profile condo,singularity --buscoOnly

                    Mandatory arguments:

                    --genomes                      genome assembly fasta files to run stats on. (./data/*.fasta)
                    -profile singularity (docker)          as of now, this workflow only works using singularity or docker and requires this profile [be sure singularity is in your path or loaded by a module]

                    Optional arguments:
                    --outdir                       Output directory to place final output
                    --threads                      Number of CPUs to use during the NanoPlot job [16]
                    --queueSize                    Maximum number of jobs to be queued [18]
                    --options                      ["--auto-lineage"], you may also consider  "--auto-lineage-prok","--auto-lineage-euk",""-l eukaryota_odb10"
                    --listDatasets                 Display the list of available BUSCO lineage datasets to use in --options pipeline parameter.
                    buscoOnly                      When you just want to run a different lineage and not rerun the assemblathon stats
                    --help                         This usage statement.

We can list the BUSCO datasets available to us with the following command. The -profile docker option is important: this workflow relies on containers and will error out unless you use docker or singularity.

nextflow run isugifNF/assemblyStats --listDatasets -profile docker

Output

N E X T F L O W  ~  version 20.07.1
Launching `isugifNF/assemblyStats/main.nf` [amazing_colden] - revision: a156628d62
executor >  local (1)
[6c/31848c] process > runBUSCOlist [  0%] 0 of 1
INFO:	Downloading information on latest versions of BUSCO data...
INFO:	Downloading file 'https://busco-data.ezlab.org/v4/data/information/lineages_list.2019-11-27.txt.tar.gz'
INFO:	Decompressing file '/Users/severin/work/6c/31848cd8f040c93f4047d085609d69/busco_downloads/information/lineages_list.2019-11-27.txt.tar.gz'

################################################

Datasets available to be used with BUSCOv4 as of 2019/11/27:

 bacteria_odb10
     - acidobacteria_odb10
     - actinobacteria_phylum_odb10
         - actinobacteria_class_odb10
             - corynebacteriales_odb10
             - micrococcales_odb10
             - propionibacteriales_odb10
             - streptomycetales_odb10
             - streptosporangiales_odb10
         - coriobacteriia_odb10
             - coriobacteriales_odb10
     - aquificae_odb10
     - bacteroidetes-chlorobi_group_odb10
         - bacteroidetes_odb10
             - bacteroidia_odb10
                 - bacteroidales_odb10
             - cytophagia_odb10
                 - cytophagales_odb10
             - flavobacteriia_odb10
                 - flavobacteriales_odb10
             - sphingobacteriia_odb10
         - chlorobi_odb10
     - chlamydiae_odb10
     - chloroflexi_odb10
     - cyanobacteria_odb10
         - chroococcales_odb10
         - nostocales_odb10
         - oscillatoriales_odb10
         - synechococcales_odb10
     - firmicutes_odb10
         - bacilli_odb10
             - bacillales_odb10
             - lactobacillales_odb10
         - clostridia_odb10
             - clostridiales_odb10
             - thermoanaerobacterales_odb10
         - selenomonadales_odb10
         - tissierellia_odb10
             - tissierellales_odb10
     - fusobacteria_odb10
         - fusobacteriales_odb10
     - planctomycetes_odb10
     - proteobacteria_odb10
         - alphaproteobacteria_odb10
             - rhizobiales_odb10
                 - rhizobium-agrobacterium_group_odb10
             - rhodobacterales_odb10
             - rhodospirillales_odb10
             - rickettsiales_odb10
             - sphingomonadales_odb10
         - betaproteobacteria_odb10
             - burkholderiales_odb10
             - neisseriales_odb10
             - nitrosomonadales_odb10
         - delta-epsilon-subdivisions_odb10
             - deltaproteobacteria_odb10
                 - desulfobacterales_odb10
                 - desulfovibrionales_odb10
                 - desulfuromonadales_odb10
             - epsilonproteobacteria_odb10
                 - campylobacterales_odb10
         - gammaproteobacteria_odb10
             - alteromonadales_odb10
             - cellvibrionales_odb10
             - chromatiales_odb10
             - enterobacterales_odb10
             - legionellales_odb10
             - oceanospirillales_odb10
             - pasteurellales_odb10
             - pseudomonadales_odb10
             - thiotrichales_odb10
             - vibrionales_odb10
             - xanthomonadales_odb10
     - spirochaetes_odb10
         - spirochaetia_odb10
             - spirochaetales_odb10
     - synergistetes_odb10
     - tenericutes_odb10
         - mollicutes_odb10
             - entomoplasmatales_odb10
             - mycoplasmatales_odb10
     - thermotogae_odb10
     - verrucomicrobia_odb10
 archaea_odb10
     - thaumarchaeota_odb10
     - thermoprotei_odb10
         - thermoproteales_odb10
         - sulfolobales_odb10
         - desulfurococcales_odb10
     - euryarchaeota_odb10
         - thermoplasmata_odb10
         - methanococcales_odb10
         - methanobacteria_odb10
         - methanomicrobia_odb10
             - methanomicrobiales_odb10
         - halobacteria_odb10
             - halobacteriales_odb10
             - natrialbales_odb10
             - haloferacales_odb10
 eukaryota_odb10
     - alveolata_odb10
         - apicomplexa_odb10
             - aconoidasida_odb10
                 - plasmodium_odb10
             - coccidia_odb10
     - euglenozoa_odb10
     - fungi_odb10
         - ascomycota_odb10
             - dothideomycetes_odb10
                 - capnodiales_odb10
                 - pleosporales_odb10
             - eurotiomycetes_odb10
                 - chaetothyriales_odb10
                 - eurotiales_odb10
                 - onygenales_odb10
             - leotiomycetes_odb10
                 - helotiales_odb10
             - saccharomycetes_odb10
             - sordariomycetes_odb10
                 - glomerellales_odb10
                 - hypocreales_odb10
         - basidiomycota_odb10
             - agaricomycetes_odb10
                 - agaricales_odb10
                 - boletales_odb10
                 - polyporales_odb10
             - tremellomycetes_odb10
         - microsporidia_odb10
         - mucoromycota_odb10
             - mucorales_odb10
     - metazoa_odb10
         - arthropoda_odb10
             - arachnida_odb10
             - insecta_odb10
                 - endopterygota_odb10
                     - diptera_odb10
                     - hymenoptera_odb10
                     - lepidoptera_odb10
                 - hemiptera_odb10
         - mollusca_odb10
         - nematoda_odb10
         - vertebrata_odb10
             - actinopterygii_odb10
                 - cyprinodontiformes_odb10
             - tetrapoda_odb10
                 - mammalia_odb10
                     - eutheria_odb10
                         - euarchontoglires_odb10
                             - glires_odb10
                             - primates_odb10
                         - laurasiatheria_odb10
                             - carnivora_odb10
                             - cetartiodactyla_odb10
                 - sauropsida_odb10
                     - aves_odb10
                         - passeriformes_odb10
     - stramenopiles_odb10
     - viridiplantae_odb10
         - chlorophyta_odb10
         - embryophyta_odb10
             - liliopsida_odb10
                 - poales_odb10
             - eudicots_odb10
                 - brassicales_odb10
                 - fabales_odb10
                 - solanales_odb10
executor >  local (1)
[6c/31848c] process > runBUSCOlist [100%] 1 of 1 ✔
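The indentation in this listing encodes the lineage hierarchy, which helps when choosing between a broad dataset and a more specific one. The sketch below, assuming the listing keeps the indented format shown above, recovers the chain of parent datasets for each entry. `lineage_paths` is a hypothetical helper and the excerpt is a shortened copy of the output.

```python
def lineage_paths(listing):
    """Map each dataset name to its path from the top-level dataset down."""
    paths = {}
    stack = []  # (indent, name) for the current chain of ancestors
    for raw in listing.splitlines():
        name = raw.strip().lstrip("- ").strip()
        if not name.endswith("_odb10"):
            continue  # skip log lines and blanks
        indent = len(raw) - len(raw.lstrip(" "))
        # Drop entries at the same or deeper indentation: they are not ancestors.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, name))
        paths[name] = [ancestor for _, ancestor in stack]
    return paths


# Shortened excerpt of the listing, with the original indentation.
excerpt = "\n".join([
    " bacteria_odb10",
    "     - proteobacteria_odb10",
    "         - gammaproteobacteria_odb10",
    "             - enterobacterales_odb10",
    "             - vibrionales_odb10",
    "     - spirochaetes_odb10",
    " archaea_odb10",
])

paths = lineage_paths(excerpt)
print(" > ".join(paths["enterobacterales_odb10"]))
```

Both test genomes used below belong to the order Enterobacterales, so enterobacterales_odb10 would be a more specific alternative to bacteria_odb10, at the cost of a larger dataset and a longer run.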

Let's use a couple of small bacterial genomes for the test, so we will use bacteria_odb10.

# Yersinia pestis
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Yersinia_pestis/representative/GCF_000222975.1_ASM22297v1/GCF_000222975.1_ASM22297v1_genomic.fna.gz
mv GCF_000222975.1_ASM22297v1_genomic.fna.gz Yerpes.fasta.gz
gunzip Yerpes.fasta.gz

# E. coli
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
mv GCF_000005845.2_ASM584v2_genomic.fna.gz Ecoli.fasta.gz
gunzip Ecoli.fasta.gz

nextflow run isugifNF/assemblyStats --genomes "*.fasta"  --options "-l bacteria_odb10" -profile docker --threads 4

Expected output

N E X T F L O W  ~  version 20.07.1
Launching `isugifNF/assemblyStats` [sad_boyd] - revision: a4272e736e [master]
executor >  local (7)
[-        ] process > runBUSCOlist             -
[a1/5af149] process > runAssemblyStats (1)     [100%] 2 of 2 ✔
[17/bb92c9] process > runAssemblathonStats (2) [100%] 2 of 2 ✔
[28/1103ba] process > setupBUSCO               [100%] 1 of 1 ✔
[ba/962b47] process > runBUSCO (1)             [100%] 2 of 2 ✔
Completed at: 12-Nov-2020 12:55:13
Duration    : 1m 7s
CPU hours   : (a few seconds)
Succeeded   : 7

Credits

This workflow was built by Andrew Severin (@isugif) and Jennifer Chang (@j23414).

Potential Errors

ssh://git@github.com/isugifNF/assemblyStats.git: Auth fail

This occurs if you have not yet set up GitHub authorization on your remote machine. See the Introduction to GitHub tutorial for how to set up an SSH key.

WARN: Singularity cache directory has not been defined

If you plan to run this pipeline more than once, or to run more than one workflow, it is best to set NXF_SINGULARITY_CACHEDIR to a common location:
```
export NXF_SINGULARITY_CACHEDIR=/location/of/singularity/container/folder
```
Place that line in your .bashrc file.

Singularity pull is too slow

If Nextflow takes a very long time to download the Singularity images, you can pull them manually with singularity pull the first time.
