
ptrebert/sciddo


SCIDDO: Score-based identification of differential chromatin domains

Publication

Manuscript: DOI: 10.1093/bioinformatics/btaa960

bioRxiv preprint: DOI: 10.1101/441766

Use cases

SCIDDO is a tool for the differential analysis of histone chromatin data. SCIDDO uses chromatin state segmentation maps, e.g., as generated by ChromHMM or EpiCSeg, for identifying regions of differential chromatin state between individual samples or groups of replicated samples.

The detected differential chromatin domains can be expected to overlap largely with regulatory regions or differentially expressed genes (see our manuscript preprint for detailed results). Moreover, the score-based approach implemented in SCIDDO affords a straightforward customization of scoring chromatin state differences to emphasize different aspects of chromatin dynamics.

Code maturity

SCIDDO is currently in BETA status.


Setup

SCIDDO supports only Linux environments (this is unlikely to change) and is developed with Python 3.6. Other Python 3.x versions may work but are not officially supported.

For easy setup, it is highly recommended to install SCIDDO inside a dedicated Conda environment. A suitable environment is specified in environments/sciddo_env.yml.

Otherwise, install the HDF5 library (tested with version 1.8.18) as appropriate for your local environment, and the necessary Python dependencies from the requirements.txt file:

sudo apt-get install libhdf5
sudo pip install -r requirements.txt

In practice, setting up PyTables and HDF5 can be troublesome. If you run into problems, the best advice is to use Conda.

After all dependencies have been installed successfully, run the SCIDDO setup as appropriate for your environment:

[sudo] python setup.py install

Execution

Input and output data formats

SCIDDO supports common text-based input and output data formats. Chromatin state segmentations in tabular (BED-like) files should be compatible as long as they have a fixed bin width of at least 100 bp. Output files from ChromHMM or EpiCSeg are supported out-of-the-box, and SCIDDO is designed to be used immediately downstream of these tools (e.g., SCIDDO knows that ChromHMM segmentation files have the suffix "_segments.bed" and will strip that from file names before determining possible sample labels). Auxiliary files such as chromatin state label or color mappings are supported in the form of simple tab-separated "key-value" text files.
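The suffix handling mentioned above can be illustrated with a minimal Python sketch. This is not SCIDDO's own code: the helper name strip_segmentation_suffix and the example path are hypothetical, and how SCIDDO derives the final sample label from the remaining file name is not shown here.

```python
from pathlib import Path

# Suffix of ChromHMM segmentation files, as named in the text above.
CHROMHMM_SUFFIX = "_segments.bed"


def strip_segmentation_suffix(path, suffix=CHROMHMM_SUFFIX):
    """Return the file name without the segmentation suffix.

    Hypothetical helper for illustration only; file names that do not
    end with the suffix are returned unchanged.
    """
    name = Path(path).name
    if name.endswith(suffix):
        return name[: -len(suffix)]
    return name


# Made-up example path; prints the name that is left after stripping.
print(strip_segmentation_suffix("/data/seg/HepG2_18_segments.bed"))
```

Whatever remains after stripping is only a candidate for a sample label; the tutorial and the convert --help output are the places to check how labels are actually assigned.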

SCIDDO's internal data management is implemented with the popular pandas Python package, and data are stored in HDF5 files (*.h5) that are created with pandas. The main reason for using HDF5 files for storing data and metadata is efficiency, but all contents of an HDF5 file can be dumped to text. The first step in a SCIDDO analysis converts the input data to HDF5; all subsequent operations are performed on this HDF5 file.

When dumping identified differential chromatin domains (DCDs) or raw candidate regions to text, the output adheres to the BED column layout (with header) chromosome, start, end, name, score, plus additional columns containing statistics and sample/group names. If downstream tools cannot work with non-standard BED-like text files, a simple cut -f 1,2,3,4,5 <SCIDDO_TABLE>.tsv > <SCIDDO_TABLE>.bed can be used to restrict the output to the first five, BED-compliant columns.
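If cut is not available, or the filtering should happen inside a Python script, the same column restriction can be sketched as follows. The function name keep_bed_columns, the sixth column extra_stat, and the data row are made up for illustration; only the first five column names are taken from the description above.

```python
def keep_bed_columns(lines, n_columns=5):
    """Yield each tab-separated line truncated to its first n_columns fields.

    Mirrors `cut -f 1,2,3,4,5`: the header line is kept and truncated
    like every other line.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[:n_columns])


# Illustrative input: a header plus one made-up data row.
table = [
    "chromosome\tstart\tend\tname\tscore\textra_stat\n",
    "chr1\t1000\t2400\tregion_1\t12\t0.5\n",
]

for row in keep_bed_columns(table):
    print(row)
```

As with cut, the header line is preserved. Some downstream tools may expect BED files without a header, in which case the first line would have to be skipped as well.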

Getting help

sciddo.py --help or sciddo.py <SUBCOMMAND> --help is your friend.

For step-by-step help on how to use SCIDDO, please refer to the tutorial hosted as part of this repository.

Standard analysis run

A standard SCIDDO analysis run is split into several distinct steps that are implemented by different code modules. Besides module-specific parameters, there are several global parameters to adjust SCIDDO's runtime behavior. Importantly, these global parameters always have to be specified before the subcommand, i.e.,

sciddo.py [GLOBAL_PARAMETERS] <SUBCOMMAND> [MODULE_PARAMETERS]

The global parameters are:

--workers: number of CPUs to use (no sanity checks!)
--debug: print debug messages to stderr; otherwise, SCIDDO operates silently
--config-dump: folder to dump run configuration (JSON); defaults to current working directory
--no-dump: do not dump run configuration
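Why the global parameters must precede the subcommand becomes clearer with a small argparse sketch of this command-line layout. This is an assumption about the general pattern, not SCIDDO's actual parser: the option and subcommand names come from this README, whereas the default of 1 for --workers and the function name build_parser are placeholders.

```python
import argparse
import os


def build_parser():
    """Build a parser with global options and one subparser per step."""
    parser = argparse.ArgumentParser(prog="sciddo.py")
    # Global parameters: defined on the top-level parser only.
    parser.add_argument("--workers", type=int, default=1)  # default is an assumption
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--config-dump", default=os.getcwd())
    parser.add_argument("--no-dump", action="store_true")
    # One subcommand per analysis step; module parameters are omitted here.
    subparsers = parser.add_subparsers(dest="subcommand")
    for name in ("convert", "stats", "score", "scan", "dump"):
        subparsers.add_parser(name)
    return parser


args = build_parser().parse_args(["--workers", "4", "--debug", "scan"])
print(args.workers, args.debug, args.subcommand)
```

In a parser built this way, the subcommand parsers do not know the global options, so an invocation such as sciddo.py scan --workers 4 is rejected as containing unrecognized arguments.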

Step 1: convert

Convert all input data (state segmentations plus metadata) into a binary HDF5 file. Currently, ChromHMM and EpiCSeg output files are supported out-of-the-box. This creates the SCIDDO DATA file.

sciddo.py [GLOBAL_PARAMETERS] convert --help

Step 2: stats

Compute a set of statistics (e.g., state composition per sample) that are potentially needed downstream.

sciddo.py [GLOBAL_PARAMETERS] stats --help

Step 3: score

Add scoring schemes (matrices) to the dataset. These can be derived automatically from the state segmentation model emissions (if provided during the convert step), or can be supplied in the form of a user-defined file. Note that, in principle, an arbitrary number of scoring schemes can be added to a dataset.

sciddo.py [GLOBAL_PARAMETERS] score --help

Step 4: scan

Scan the dataset for differential chromatin domains. As opposed to the previous commands, this creates a separate output file per run, i.e., the SCIDDO RUN file.

sciddo.py [GLOBAL_PARAMETERS] scan --help

Step 5: dump

All data and metadata in the SCIDDO DATA and RUN file can be dumped to text files (e.g., TSV tables or BED files) for downstream analysis.

sciddo.py [GLOBAL_PARAMETERS] dump --help
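When scripting the five steps, it can help to assemble each command line programmatically so that global parameters always end up before the subcommand. The following is a minimal sketch under that assumption; build_command is a hypothetical helper, and module parameters are limited to --help because the real ones are documented only in each subcommand's help output.

```python
import shlex

# The five steps of a standard analysis run, in order.
STEPS = ("convert", "stats", "score", "scan", "dump")


def build_command(subcommand, global_params=(), module_params=()):
    """Return the argument list for one SCIDDO call (hypothetical helper).

    Global parameters are placed before the subcommand, module
    parameters after it.
    """
    if subcommand not in STEPS:
        raise ValueError("unknown subcommand: {}".format(subcommand))
    return ["sciddo.py", *global_params, subcommand, *module_params]


# Print the help call for every step; nothing is executed here.
for step in STEPS:
    cmd = build_command(step, global_params=["--debug"], module_params=["--help"])
    print(shlex.join(cmd))
```

The resulting lists can be passed to subprocess.run once the module parameters for each step have been filled in from the corresponding help text.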