NASQQ: Nextflow Automatization and Standardization for Qualitative and Quantitative 1H NMR Metabolomics
NASQQ is a comprehensive pipeline designed to automate the preparation and analysis of 1H NMR metabolomics data. It streamlines the process from raw Bruker FIDs through spectral preprocessing and metabolite identification to data analysis and pathway enrichment, accelerating the interpretation of metabolomics data while reducing the need for specialized domain knowledge.
- Automated Workflow: NASQQ automates the entire metabolomic analysis process, reducing manual intervention and ensuring reproducibility.
- Comprehensive Analysis: The pipeline covers spectral preprocessing, metabolite identification, data analysis, and pathway enrichment, providing a holistic view of the metabolomic data.
- Machine Learning Integration: NASQQ incorporates machine learning methods to bridge the gap between raw spectral information and biological insights.
- Load FIDs: Retrieve raw FIDs from a specified location, extract sample names, and filter by pulse program.
- Group Delay Correction: Eliminate Bruker Group Delay from the FIDs.
- Solvent Suppression: Estimate and eliminate residual solvent signals from the FIDs.
- Apodization: Enhance the Signal-to-Noise ratio in the spectra.
- Zero Filling: Enhance the visual clarity of spectra by inserting zeros.
- Fourier Transformation: Convert FIDs from the time domain to frequency domain spectra using Fourier Transformation.
- Zero Order Phase Correction: Adjust spectra phase to ensure pure absorptive mode in the real part.
- Internal Referencing: Align spectra with an internal reference compound.
- Baseline Correction: Estimate and remove spectral baseline from the spectral profiles.
- Negative Values Zeroing: Set all negative values in spectra to zero.
- (Optional) Warping: Apply Semi-Parametric Time Warping technique to warp and realign spectra.
- Window Selection: Choose the informative segment of spectra.
- (Optional) Bucketing: Simplify density of spectra peaks.
- Normalization: Normalize the spectra.
- Metabolites Quantification: Identify and quantify metabolites based on normalized spectra.
- Add Metadata: Merge metadata with quantified metabolites' relative abundances.
- (Optional) Combine Dataset Batches: Merge batches from the dataset for streamlined analysis.
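The core spectral preprocessing steps above can be sketched in a few lines of NumPy. This is illustrative only, not the pipeline's actual implementation, and every parameter here (spectral width, line-broadening factor, zero-fill factor) is a made-up placeholder:

```python
import numpy as np

def preprocess_fid(fid, sw_hz=12000.0, lb_hz=0.3, zero_fill_factor=2):
    """Sketch of apodization -> zero filling -> Fourier transform ->
    negative-values zeroing -> normalization on a single FID."""
    n = fid.size
    t = np.arange(n) / sw_hz                                # acquisition time axis
    fid = fid * np.exp(-np.pi * lb_hz * t)                  # exponential apodization (S/N boost)
    fid = np.append(fid, np.zeros(n * (zero_fill_factor - 1), dtype=fid.dtype))  # zero filling
    spectrum = np.fft.fftshift(np.fft.fft(fid))             # time domain -> frequency domain
    real = np.real(spectrum).copy()                         # keep the absorptive (real) part
    real[real < 0] = 0.0                                    # negative values zeroing
    return real / real.sum()                                # total-area normalization

# Toy FID: one decaying oscillation plus noise (synthetic stand-in for Bruker data)
rng = np.random.default_rng(0)
t = np.arange(4096) / 12000.0
fid = np.exp(2j * np.pi * 400.0 * t) * np.exp(-5.0 * t) + 0.01 * rng.standard_normal(4096)
spec = preprocess_fid(fid)
```

Steps such as group delay correction, solvent suppression, phasing, referencing, baseline correction, and warping are deliberately omitted here; the pipeline handles all of them internally.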
- Features Processing: Load data and perform sanity checks.
- Exploratory Data Analysis: Conduct Principal Component Analysis and generate exploratory analysis visualizations.
- Univariate Analysis: Identify outliers, assess data normality, and conduct univariate statistical tests.
- Multivariate Analysis: Utilize machine learning models to analyze metabolite data.
- Pathway Analysis: Perform pathway enrichment analysis using KEGG database entries.
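To illustrate the exploratory step, a PCA over a table of metabolite abundances can be sketched with NumPy alone. The data below are synthetic and the helper is hypothetical, not part of NASQQ's analysis module:

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD on mean-centred data: returns sample scores and
    the explained-variance ratio of the leading components."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / (S ** 2).sum()
    return scores, explained[:n_components]

rng = np.random.default_rng(1)
# 20 samples x 50 metabolites; log1p mirrors the pipeline's optional normalization
abundances = np.log1p(rng.gamma(2.0, 1.0, size=(20, 50)))
scores, evr = pca(abundances)
```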
For detailed information on each stage of the analysis and its scripts, refer to the docs folder, where separate README.md files are provided.
Note: NASQQ is an extension of existing solutions, aimed at enhancing the accessibility and efficiency of metabolomic data analysis. The workflow is designed to be system agnostic; however, it was tested only on macOS (M1 chip) and Linux (Ubuntu 22.04). To use the pipeline on Windows, please refer to WSL.
To begin using the pipeline, it's essential to ensure that certain prerequisites are met and the project is properly set up. Please review the following sections:
Clone the project's GitHub repository to your local machine:
git clone https://github.com/ardigen/nasqq
Note: Grant appropriate permissions to the workflow directory:
chmod -R 777 <location>/nasqq
Next, build the Docker images: the workflow requires Docker containers for both the R and Python environments. The R and Python Dockerfiles needed to execute the workflow are compatible with Linux and macOS (M1) systems.
For Linux users, execute:
cd nasqq/docker/Python
./build_docker_linux.sh
cd nasqq/docker/R
./build_docker_linux.sh
For macOS (M1) users, execute:
cd nasqq/docker/Python
./build_docker_macos.sh
cd nasqq/docker/R
./build_docker_macos.sh
After setting up the project, create a comma-separated manifest.csv file with the following structure and headers:
dataset,batch,input_path,metadata_file,selected_sample_names,target_value,referencing_range,window_selection_range
test1,test1,./testthat/data/dataset/dataset1,./testthat/data/metadata/metadata1.csv,500;501;503;504,0,None,0;10
test2,test2,./testthat/data/dataset/dataset2,./testthat/data/metadata/metadata2.csv,all,0,None,0;5
test3,None,./testthat/data/dataset/dataset3,./testthat/data/metadata/metadata3.csv,502;505;507;508;509;510,2,2.5;4.55,0;10
- `dataset` - name of the dataset.
- `batch` - batch name (Default: `None`).
- `input_path` - absolute path to the NMR dataset in Bruker format.
- `metadata_file` - absolute path to the metadata file to be merged with the dataset.
- `selected_sample_names` - selection of sample names, ";"-separated (Default: `all`).
- `target_value` - PPM value of the signal used as the internal reference (Default: `0`).
- `referencing_range` - if `target_value` differs from the default, the range in which the referencing signal will be searched (Default: `None`).
- `window_selection_range` - range of the informative part of the spectra, ";"-separated (Default: `0;10`).
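As an illustration of how these fields decompose, a small Python helper (hypothetical, not part of NASQQ) can parse a manifest row into typed values:

```python
import csv
import io

# First test row from the example manifest above
MANIFEST = """dataset,batch,input_path,metadata_file,selected_sample_names,target_value,referencing_range,window_selection_range
test1,test1,./testthat/data/dataset/dataset1,./testthat/data/metadata/metadata1.csv,500;501;503;504,0,None,0;10
"""

def parse_row(row):
    """Split the ';'-separated fields and resolve the 'None'/'all' defaults."""
    row = dict(row)
    names = row["selected_sample_names"]
    row["selected_sample_names"] = "all" if names == "all" else names.split(";")
    row["window_selection_range"] = [float(x) for x in row["window_selection_range"].split(";")]
    row["referencing_range"] = (None if row["referencing_range"] == "None"
                                else [float(x) for x in row["referencing_range"].split(";")])
    return row

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(MANIFEST))]
```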
Another file that needs to be created is params.yml. This document outlines the required inputs for configuring the data processing pipeline. Make sure to fill in the necessary values according to the table below.
| Input | Description | Datatype |
|---|---|---|
| manifest | Absolute path to the manifest.csv file containing metadata information for the analysis | string |
| outDir | Absolute path to the directory where the output files will be stored | string |
| reportsDir | Absolute path to the directory where the analysis reports will be generated | string |
| workDir | Absolute path to the directory where the intermediate work files will be stored | string |
| launchDir | Absolute path to the directory from which the pipeline is launched | string |
| maxRetries | Number of attempts the pipeline should make to process a task before giving up | integer |
| errorStrategy | The strategy to handle errors during pipeline execution (terminate/ignore/retry) | string |
| check_pulse_samples | The pulse program specified in the manifest file for processing | string |
| run_bucketing | Enable/disable bucketing for simplifying the density of peaks before metabolite quantification | boolean |
| run_warping | Enable/disable warping for spectra re-alignment based on a reference spectrum | boolean |
| run_combine_project_batches | Enable/disable merging of dataset batches for data analysis where batch is not "None" | boolean |
| ncores | The number of threads allocated for the ASICS quantification task | integer |
| log1p | Enable/disable log1p normalization of metabolites before data analysis | boolean |
| metadata_column | The column containing binary state information for the data analysis module | string |
| reverse_axis_samples | Specifies whether to reverse the axis for all samples or for selected samples based on a threshold | string |
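A minimal params.yml might look like the following. Every path and value below is a placeholder to adapt to your own setup, including the pulse program name and the metadata column:

```yaml
manifest: /home/user/nasqq/manifest.csv
outDir: /home/user/nasqq/results
reportsDir: /home/user/nasqq/reports
workDir: /home/user/nasqq/work
launchDir: /home/user/nasqq
maxRetries: 2
errorStrategy: retry
check_pulse_samples: noesygppr1d
run_bucketing: false
run_warping: true
run_combine_project_batches: false
ncores: 4
log1p: true
metadata_column: group
reverse_axis_samples: all
```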
After completing these steps, open run.sh and adjust the paths for the workflow execution, or run it manually with:
nextflow run ../main.nf \
-c ../nextflow.config \
-profile standard \
-params-file params.yml
To run the test data, go to the tests directory and start the test run:
./tests/run.sh
Please remember that your local machine must have enough resources for the number of datasets provided in the manifest (see nextflow-io/nextflow#1787). A lack of resources can lead to incorrect memory allocation in the script. It is recommended to adjust the max_cpus and max_memory params in the nextflow.config file according to the resources available on your local machine.
For example:
*** caught segfault ***
address 0x7ff0000000000003, cause 'memory not mapped'
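The resource caps mentioned above can be tuned in nextflow.config; the values below are illustrative only and should match your machine:

```groovy
// nextflow.config -- illustrative values; adapt to your hardware
params {
    max_cpus   = 8
    max_memory = '16.GB'
}
```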
Be aware that Nextflow is not a resource orchestration system. If you need one, you must configure a custom executor such as AWS Batch or Kubernetes.
Note: The default setting for the computation cannot be lower than:
- cpus = 2
- memory = 2.GB RAM
NASQQ is distributed under the MIT License. See LICENSE.md for more information.
For contact purposes, there is a dedicated email address: [email protected]
The scripts and workflow were originally created as part of Łukasz Pruss's PhD project, a collaboration between Ardigen S.A. and Wrocław University of Science and Technology (WUST). A special acknowledgment goes to Oskar Gniewek, whose expertise and critical feedback significantly contributed to the Nextflow implementation. He also played a crucial role in managing unit and integration tests, as well as handling dependencies across various systems for pipeline execution.
Furthermore, many people were involved in the evolution of the pipeline, turning it from a concept into an end-to-end solution. These contributors include:
Special thanks for assistance in the development process, code reviews, and tips are extended to:
An extensive list of references and packages used by the pipeline can be found in our publication:
NASQQ: Nextflow automatization and standardization for qualitative and quantitative 1H NMR metabolomics data preparation and analysis.
Łukasz Pruss, Oskar Gniewek, Tomasz Jetka, Wojciech Wojtowicz, Kaja Milanowska-Zabel, Piotr Młynarz.
DOI: --
If you want to utilize NASQQ for your analysis, please refer to LICENSE.md.
To cite the nf-core publication, use:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.