Photo source: https://www.karmactive.com/mount-logan-the-crown-jewel-of-canadas-peaks/
Logan is a dataset of DNA and RNA sequences. It was constructed by performing genome assembly over a December 2023 freeze of the entire NCBI Sequence Read Archive (SRA), which at the time contained 50 petabases of public raw data. Two related sets of assembled sequences are released: unitigs and contigs. Unitigs preserve nearly all the information present in the original sample, whereas contigs discard sequencing errors and biological variation in exchange for longer sequences. Both sets are hosted in compressed form on a public S3 bucket provided by the Registry of Open Data on AWS. Compared to raw reads, downloading unitigs gives access to the wealth of information contained in the SRA 10x more efficiently in time and disk space (40x for contigs), with a minor loss of sensitivity and higher contiguity.
Read more about Logan here: https://www.biorxiv.org/content/10.1101/2024.07.30.605881v1
In December 2023, the public part of the SRA consisted of raw sequencing data totalling 50 petabases across 27 million accessions. Each accession corresponds to a sequencing experiment performed by a biology lab somewhere around the world. As you can see below, the majority of accessions and bases are DNA, followed by COVID samples and RNA, then metagenomes and single-cell.
Note that the y-axis of the second panel is in megabases, as this is the preferred unit for SRA accession sizes. For example, 20 G megabases is 20 petabases.
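The conversion in the note above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `megabases_to_petabases` is hypothetical and only illustrates the SI prefixes involved.

```python
# SI prefixes as powers of ten.
MEGA = 10**6
GIGA = 10**9
PETA = 10**15


def megabases_to_petabases(megabases: int) -> float:
    """Convert a number of megabases into petabases."""
    return megabases * MEGA / PETA


# "20 G megabases" on the y-axis means 20 * 10^9 megabases.
axis_value_megabases = 20 * GIGA
print(megabases_to_petabases(axis_value_megabases))  # 20.0
```

The same helper maps the 50 G megabases total of the December 2023 freeze to the 50 petabases quoted in the text.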
All sequencing experiments from a December 2023 freeze of the SRA have been reconstructed and made available as unitigs and contigs as a v1 release of Logan. See the Stats v1 page for more details on this data.
See the Unitigs and Contigs pages.
Downloading unitigs of several accessions
Search for sequences inside unitigs or contigs
Downloading and mapping many contigs to a gene of interest
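As a rough illustration of the first tutorial topic, the sketch below builds download commands for several accessions. The bucket name (`logan-pub`) and the key layout are assumptions that are not stated on this page, so check them against the Unitigs and Contigs pages before use. The function names and the accession IDs are hypothetical placeholders. The script only prints `aws s3 cp` commands and does not access the network.

```python
# ASSUMPTIONS (verify against the Unitigs and Contigs pages):
#   - the public bucket is named "logan-pub"
#   - keys follow "<u|c>/<accession>/<accession>.<unitigs|contigs>.fa.zst"
BUCKET = "logan-pub"


def logan_s3_uri(accession: str, kind: str = "unitigs") -> str:
    """Build the assumed S3 URI for one accession's unitigs or contigs."""
    if kind not in ("unitigs", "contigs"):
        raise ValueError(f"kind must be 'unitigs' or 'contigs', got {kind!r}")
    prefix = kind[0]  # "u" for unitigs, "c" for contigs (assumed layout)
    return f"s3://{BUCKET}/{prefix}/{accession}/{accession}.{kind}.fa.zst"


def download_commands(accessions, kind: str = "unitigs"):
    """Return one AWS CLI command per accession.

    --no-sign-request lets the AWS CLI read a public bucket without credentials.
    """
    return [
        f"aws s3 cp {logan_s3_uri(acc, kind)} . --no-sign-request"
        for acc in accessions
    ]


# Placeholder accession IDs, for illustration only.
for command in download_commands(["SRR0000001", "SRR0000002"]):
    print(command)
```

Printing the commands rather than running them keeps the sketch free of side effects; the output can be reviewed and then piped to a shell. Since the files are described as compressed, they will likely need decompressing after download.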
Official Logan dataset URL: https://registry.opendata.aws/pasteur-logan/
BibTeX:
@article {logan,
author = {Chikhi, Rayan and Raffestin, Brice and Korobeynikov, Anton and Edgar, Robert and Babaian, Artem},
title = {Logan: Planetary-Scale Genome Assembly Surveys Life{\textquoteright}s Diversity},
elocation-id = {2024.07.30.605881},
year = {2024},
doi = {10.1101/2024.07.30.605881},
publisher = {Cold Spring Harbor Laboratory},
eprint = {https://www.biorxiv.org/content/early/2024/07/31/2024.07.30.605881.full.pdf},
journal = {bioRxiv}
}
- Lead: Rayan Chikhi
- co-Lead: Artem Babaian
- Robert C. Edgar
- Anton Korobeynikov
- Brice Raffestin
- AWS engineering:
- Greg Autric
- Maxime Hugues
- AWS management:
- Dorian Schaal
- Adrien Lainé
- Institut Pasteur IT:
- Thomas Menard
- Stéphane Fournier
- AWS Registry of Open Data
- Peter Schmiedeskamp
- Chris Stoner
- Erin Chu
- NCBI SRA
- Ryan Connor
- Yuriy Skripchenko
- Matthieu Falce
- Institut Pasteur admin:
- Melanie Ridel
- Loïc Orellou
- Florence Percie du Sert
- Institut Pasteur G5 Sequence Bioinformatics, ANR INCEPTION
- ERC Consolidator grant number 101088572 (IndexThePlanet)
- ANR-19-CE45-0008 (SeqDigger), ANR-22-CE45-0007 (Full-RNA)
- EU H2020 Marie Sklodowska-Curie grants agreements No 956229 (Alpaca) and 872539 (Pangaia)
- PaRis Artificial Intelligence Research Institute (PRAIRIE)
- Amazon Web Services
- Registry of Open Data on AWS