Merge pull request #4 from bacpop/doc_update
Updates docs
samhorsfield96 authored Apr 17, 2024
2 parents 61849b5 + d8433dc commit 9afcdfc
Showing 27 changed files with 762,905 additions and 529,663 deletions.
39 changes: 35 additions & 4 deletions README.md
@@ -1,11 +1,13 @@
# CELEBRIMBOR <img src='celebrimbor_logo.png' align="right" height="250" />

Core ELEment Bias Removal In Metagenome Binned ORthologs

A pipeline written in Snakemake to automatically generate pangenomes from metagenome-assembled genomes (MAGs).

## Dependencies:

* Snakemake
-* mmseqs2
+* MMseqs2
* Bakta
* Biopython
* CheckM
@@ -20,8 +22,8 @@ A pipeline written in Snakemake to automatically generate pangenomes from metage
Install the required packages using [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html)/[mamba](https://github.com/mamba-org/mamba):

```
-git clone https://github.com/bacpop/MAG_pangenome_pipeline.git
-cd MAG_pangenome_pipeline
+git clone git@github.com:bacpop/CELEBRIMBOR.git
+cd CELEBRIMBOR
mamba env create -f environment.yml
mamba activate celebrimbor
```
@@ -78,23 +80,52 @@ Update `config.yaml` to specify workflow and directory paths.
- `cgt_exe`: path to cgt executable.
- `cgt_breaks`: frequency cutoffs for rare and core genes, e.g. `0.1,0.9`, meaning genes predicted at `<0.1` frequency will be `rare`, `0.1<=x<0.9` will be `middle` and `>=0.9` will be `core`.
- `cgt_error`: the false assignment rate of a gene to a particular frequency compartment.
- `clustering_method`: choice of either `mmseqs2` (for speed) or `panaroo` (for accuracy).
- `panaroo_stringency`: Stringency of Panaroo quality control measures. One of `strict`, `moderate` or `sensitive`.
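
As a rough illustration of how the `cgt_breaks` cutoffs partition frequencies, here is a minimal Python sketch. The function names `parse_breaks` and `naive_compartment` are hypothetical and not part of the pipeline. The sketch applies the thresholds directly to an observed frequency, whereas cgt assigns compartments probabilistically and accounts for genome completeness, so treat it only as a reading aid for the `rare`/`middle`/`core` boundaries.

```python
from typing import Tuple


def parse_breaks(value: str) -> Tuple[float, float]:
    """Parse a `cgt_breaks`-style string such as "0.1,0.9" into two cutoffs."""
    parts = [float(part) for part in value.split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected two comma-separated values, got {value!r}")
    rare_cutoff, core_cutoff = parts
    if not 0.0 <= rare_cutoff < core_cutoff <= 1.0:
        raise ValueError(f"cutoffs must satisfy 0 <= rare < core <= 1, got {value!r}")
    return rare_cutoff, core_cutoff


def naive_compartment(frequency: float, breaks: Tuple[float, float]) -> str:
    """Label a frequency using plain thresholds (NOT cgt's probabilistic model)."""
    rare_cutoff, core_cutoff = breaks
    if frequency < rare_cutoff:
        return "rare"      # x < rare cutoff
    if frequency < core_cutoff:
        return "middle"    # rare cutoff <= x < core cutoff
    return "core"          # x >= core cutoff


breaks = parse_breaks("0.1,0.9")
for freq in (0.05, 0.1, 0.5, 0.9):
    print(freq, naive_compartment(freq, breaks))
```

Note that the lower bound of each interval is inclusive, matching `0.1<=x<0.9` for `middle` and `>=0.9` for `core`.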

Run Snakemake from the directory containing the `Snakefile`:

```
snakemake --cores <cores>
```

To test the workflow, download this repository, replace `path/to` with the actual paths on your system, and run:

```
snakemake --cores 1 --config genome_fasta=test/fasta output_dir=test_output bakta_db=path/to/bakta_db/db-light cgt_exe=path/to/cgt_bacpop cgt_breaks=0.05,0.95 cgt_error=0.05 clustering_method=panaroo panaroo_stringency=moderate
```

This test directory contains simulated MAGs from [Kallonen et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5538559/).

The output directory `test_output` will contain:

- `annotated` directory, containing gene annotations from [bakta](https://github.com/oschwengers/bakta).
- `mmseqs2` or `panaroo` directory, containing gene clusters from [mmseqs2](https://github.com/soedinglab/MMseqs2) or [Panaroo](https://github.com/gtonkinhill/panaroo) respectively.
- `presence_absence_matrix.txt`, a tab-separated file describing the presence/absence of genes (rows) in each genome (columns).
- `pangenome_summary.tsv`, a tab-separated file detailing gene annotations, frequencies and pre-adjustment frequency compartments in the pangenome.
- `checkm_out.tsv`, a summary file generated by [CheckM](https://github.com/Ecogenomics/CheckM) describing genome completeness and contamination.
- `cgt_output.txt`, a summary file detailing the observed frequency and adjusted frequency compartment of each gene in the pangenome.
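
To show how observed gene frequencies relate to the presence/absence matrix, here is a minimal Python sketch. It assumes a layout that the README does not spell out: a header row whose first cell is a gene label and whose remaining cells are genome names, followed by one row per gene with `0`/`1` cells. The sample data and the function name `gene_frequencies` are made up for illustration, so check the real `presence_absence_matrix.txt` before relying on this.

```python
import csv
import io

# Hypothetical sample in the assumed layout: genes as rows, genomes as columns.
SAMPLE = (
    "gene\tgenome_A\tgenome_B\tgenome_C\tgenome_D\n"
    "gene_1\t1\t1\t1\t1\n"
    "gene_2\t1\t0\t1\t0\n"
    "gene_3\t0\t0\t0\t1\n"
)


def gene_frequencies(handle):
    """Return {gene: fraction of genomes in which the gene is present}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_genomes = len(header) - 1  # first column holds the gene name
    frequencies = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        present = sum(1 for cell in row[1:] if cell.strip() not in ("", "0"))
        frequencies[row[0]] = present / n_genomes
    return frequencies


frequencies = gene_frequencies(io.StringIO(SAMPLE))
for gene, freq in frequencies.items():
    print(gene, freq)
```

For a real run, the `io.StringIO(SAMPLE)` handle would be replaced by an opened file, e.g. `open("test_output/presence_absence_matrix.txt", newline="")`.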

## Overview of workflow

This workflow annotates genes in metagenome-assembled genomes (MAGs) and uses a probabilistic model to assign each gene to a frequency compartment based on its observed frequency and on genome completeness.

1. Predict genes in all FASTA files in given directory using [bakta](https://github.com/oschwengers/bakta)
-1. Cluster genes using [mmseqs2](https://github.com/soedinglab/MMseqs2) and generate a gene presence/absence matrix
+1. Cluster genes using [mmseqs2](https://github.com/soedinglab/MMseqs2) or [Panaroo](https://github.com/gtonkinhill/panaroo) and generate a gene presence/absence matrix
1. Generate a pangenome summary of observed gene frequencies
1. Calculate genome completeness using [CheckM](https://github.com/Ecogenomics/CheckM)
1. Probabilistically assign each gene family as `core|middle|rare` using [cgt](https://github.com/bacpop/cgt)

## Citations

When using CELEBRIMBOR, please cite:

- [Bakta](https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000685#tab2)
- [mmseqs2](https://www.nature.com/articles/nbt.3988)
- [Panaroo](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02090-4)
- [CheckM](https://genome.cshlp.org/content/25/7/1043)





8,181 changes: 8,181 additions & 0 deletions test/example_output/cgt_output.txt

Large diffs are not rendered by default.

11 changes: 11 additions & 0 deletions test/example_output/checkm_out.tsv
@@ -0,0 +1,11 @@
Bin_Id Marker_lineage genomes markers marker_sets 0 1 2 3 4 5+ Completeness Contamination Strain_heterogeneity
11657_5_17.contigs_spades_comp_0.5948 f__Enterobacteriaceae_(UID5124) 134 1173 336 474 698 1 0 0 0 58.27 0.04 0.00
11657_5_79.contigs_spades_comp_0.9079 f__Enterobacteriaceae_(UID5124) 134 1173 336 73 1099 1 0 0 0 91.93 0.04 0.00
11658_4_24.contigs_spades_comp_0.877 f__Enterobacteriaceae_(UID5124) 134 1173 336 95 1073 5 0 0 0 87.12 0.39 0.00
11679_4_5.contigs_spades_comp_0.9853 f__Enterobacteriaceae_(UID5124) 134 1173 336 20 1151 2 0 0 0 98.05 0.14 50.00
11679_7_49.contigs_spades_comp_0.731 f__Enterobacteriaceae_(UID5124) 134 1173 336 295 873 5 0 0 0 69.61 0.12 0.00
11679_8_27.contigs_spades_comp_0.8756 f__Enterobacteriaceae_(UID5124) 134 1173 336 89 1082 2 0 0 0 88.77 0.08 0.00
11679_8_64.contigs_spades_comp_0.9555 f__Enterobacteriaceae_(UID5124) 134 1173 336 46 1122 5 0 0 0 95.95 0.39 0.00
11679_8_82.contigs_spades_comp_0.7923 f__Enterobacteriaceae_(UID5124) 134 1173 336 234 935 4 0 0 0 79.13 0.35 0.00
11791_3_21.contigs_spades_comp_0.9638 f__Enterobacteriaceae_(UID5124) 134 1173 336 37 1134 2 0 0 0 96.99 0.08 0.00
11791_7_13.contigs_spades_comp_0.9154 f__Enterobacteriaceae_(UID5124) 134 1173 336 90 1076 7 0 0 0 92.53 0.69 0.00
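
The completeness values above suggest why incomplete MAGs bias core gene estimates. The Python sketch below is a back-of-the-envelope illustration, not cgt's model. It assumes that a gene truly present in every genome is recovered from each MAG independently, with probability equal to that MAG's CheckM completeness. Under that assumption, the expected observed frequency of a true core gene is the mean completeness. The function name `expected_core_frequency` is hypothetical.

```python
# Completeness (%) per bin, copied from the checkm_out.tsv rows above.
completeness_percent = {
    "11657_5_17.contigs_spades_comp_0.5948": 58.27,
    "11657_5_79.contigs_spades_comp_0.9079": 91.93,
    "11658_4_24.contigs_spades_comp_0.877": 87.12,
    "11679_4_5.contigs_spades_comp_0.9853": 98.05,
    "11679_7_49.contigs_spades_comp_0.731": 69.61,
    "11679_8_27.contigs_spades_comp_0.8756": 88.77,
    "11679_8_64.contigs_spades_comp_0.9555": 95.95,
    "11679_8_82.contigs_spades_comp_0.7923": 79.13,
    "11791_3_21.contigs_spades_comp_0.9638": 96.99,
    "11791_7_13.contigs_spades_comp_0.9154": 92.53,
}


def expected_core_frequency(completeness):
    """Expected observed frequency of a gene present in every genome.

    Simplifying assumption (not cgt's model): each genome retains the gene
    independently with probability equal to its completeness.
    """
    values = list(completeness.values())
    return sum(values) / len(values) / 100.0


expected = expected_core_frequency(completeness_percent)
core_cutoff = 0.95  # upper value of cgt_breaks=0.05,0.95 in the test command
print(f"expected observed frequency of a true core gene: {expected:.3f}")
print("falls below the core cutoff:", expected < core_cutoff)
```

Under this assumption the expected frequency is roughly 0.858, below the `0.95` cutoff used in the test command. A plain threshold on observed frequency would therefore tend to label true core genes as `middle`, which is the bias the completeness-aware assignment is meant to correct.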