Manuscript: DOI: 10.1093/bioinformatics/btaa960
bioRxiv preprint: DOI: 10.1101/441766
SCIDDO is a tool for the differential analysis of histone chromatin data. SCIDDO uses chromatin state segmentation maps, e.g., as generated by ChromHMM or EpiCSeg, for identifying regions of differential chromatin state between individual samples or groups of replicated samples.
The detected differential chromatin domains can be expected to overlap largely with regulatory regions or differentially expressed genes (see our manuscript preprint for detailed results). Moreover, the score-based approach implemented in SCIDDO affords a straightforward customization of scoring chromatin state differences to emphasize different aspects of chromatin dynamics.
SCIDDO is currently in BETA status.
SCIDDO supports only Linux environments (this is unlikely to change in the future) and is developed using Python 3.6. Other Python 3.x versions may or may not work, but are not officially supported.
For easy setup, it is highly recommended to install SCIDDO inside a dedicated Conda environment.
A suitable environment is specified in environments/sciddo_env.yml.
Otherwise, install the HDF5 library (tested with version 1.8.18) as appropriate for your local environment, and the necessary Python dependencies from the requirements.txt file:
sudo apt-get install libhdf5
sudo pip install -r requirements.txt
Empirically, the setup of PyTables and HDF5 can create some headaches. In this case, the best advice is to use Conda.
After all dependencies have been installed successfully, run the SCIDDO setup as appropriate for your environment:
[sudo] python setup.py install
SCIDDO supports common text-based input and output data formats. Chromatin state segmentations as tabular (BED-like) files should be compatible as long as they have a fixed bin width of at least 100 bp. Output files from ChromHMM or EpiCSeg are supported out-of-the-box, and SCIDDO is designed to be used immediately downstream of these tools (e.g., SCIDDO knows that ChromHMM segmentation files have the suffix "_segments.bed" and will strip that from file names before determining possible sample labels). Auxiliary files such as chromatin state label or color mappings are supported in the form of simple tab-separated "key-value" text files.
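To illustrate the file conventions described above, here is a minimal sketch of how a sample label could be derived from a ChromHMM file name and how a tab-separated key-value file could be read. The helper names `guess_sample_label` and `read_key_value_file`, the example path, and the example state labels are hypothetical and not taken from SCIDDO; the actual label handling inside SCIDDO may differ.

```python
import csv
import io
import os

# Suffix of ChromHMM segmentation files as named in the text above
CHROMHMM_SUFFIX = "_segments.bed"


def guess_sample_label(path):
    """Hypothetical helper: strip the directory and the known suffix."""
    name = os.path.basename(path)
    if name.endswith(CHROMHMM_SUFFIX):
        return name[: -len(CHROMHMM_SUFFIX)]
    # Fall back to removing only the file extension
    return os.path.splitext(name)[0]


def read_key_value_file(handle):
    """Hypothetical helper: read tab-separated key-value lines into a dict."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        mapping[row[0]] = row[1]
    return mapping


if __name__ == "__main__":
    # Made-up file name and state labels, for illustration only
    print(guess_sample_label("/data/HepG2_15_segments.bed"))
    example = io.StringIO("1\tTssA\n2\tEnh\n")
    print(read_key_value_file(example))
```

Keeping the mapping as a plain dictionary mirrors the "key-value" layout of the auxiliary files; any stricter validation (for example, of color codes) is left out of this sketch.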
SCIDDO's internal data management is realized with the popular pandas Python package, and data are stored in HDF5 files (*.h5) that are created with pandas. The main reason for using HDF5 files for storing data and metadata is efficiency, but all contents of an HDF5 file can be dumped to text. After the first step in a SCIDDO analysis, which converts the input data to HDF5, all subsequent operations are performed on this HDF5 file.
When dumping identified differential chromatin domains (DCDs) or raw candidate regions to text, the output adheres to the BED column layout (with header) chromosome, start, end, name, score, plus additional columns containing statistics and sample/group names.
If downstream tools cannot work with non-standard BED-like text files, a simple
cut -f 1,2,3,4,5 <SCIDDO_TABLE>.tsv > <SCIDDO_TABLE>.bed
can be used to restrict the output to the first five, BED-compliant columns.
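If the column restriction has to happen inside a Python script instead of the shell, the same operation can be sketched as follows. This is only an illustration: the function name `restrict_to_bed5` and the optional `drop_header` switch are assumptions and not part of SCIDDO, and the extra column in the example table is made up.

```python
def restrict_to_bed5(lines, drop_header=False):
    """Yield the first five tab-separated columns of each line.

    drop_header is a hypothetical convenience for downstream tools
    that do not accept a header row; the first line is then skipped.
    """
    for idx, line in enumerate(lines):
        if drop_header and idx == 0:
            continue
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[:5])


if __name__ == "__main__":
    # Made-up table with one additional, non-BED column
    table = [
        "chromosome\tstart\tend\tname\tscore\textra\n",
        "chr1\t100\t500\tregion1\t42\t0.01\n",
    ]
    for row in restrict_to_bed5(table, drop_header=True):
        print(row)
```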
sciddo.py --help or sciddo.py <SUBCOMMAND> --help is your friend.
For step-by-step help on how to use SCIDDO, please refer to the tutorial hosted as part of this repository.
A standard SCIDDO analysis run is split into several distinct steps that are realized by different code modules. Besides module-specific parameters, there are several global parameters to adjust SCIDDO's runtime behavior. Importantly, these global parameters always have to be specified before the subcommand, i.e.,
sciddo.py [GLOBAL_PARAMETERS] <SUBCOMMAND> [MODULE_PARAMETERS]
The global parameters are:
--workers: number of CPUs to use (no sanity checks!)
--debug: print debug messages to stderr; otherwise, SCIDDO operates silently
--config-dump: folder to dump run configuration (JSON); defaults to current working directory
--no-dump: do not dump run configuration
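The ordering rule for global parameters is typical of command line interfaces built with subcommands. The following minimal sketch reproduces the behavior with Python's argparse module. It is not SCIDDO's actual parser: the `build_parser` function and the default value for `--workers` are assumptions made for illustration, and only two of the global parameters are modeled.

```python
import argparse


def build_parser():
    """Toy parser: global options live on the top-level parser only."""
    parser = argparse.ArgumentParser(prog="sciddo.py")
    parser.add_argument("--workers", type=int, default=1)  # default is assumed
    parser.add_argument("--debug", action="store_true")
    subparsers = parser.add_subparsers(dest="subcommand")
    for name in ("convert", "stats", "score", "scan", "dump"):
        subparsers.add_parser(name)
    return parser


if __name__ == "__main__":
    parser = build_parser()

    # Global parameters before the subcommand: accepted
    args = parser.parse_args(["--workers", "4", "--debug", "scan"])
    print(args.subcommand, args.workers, args.debug)

    # Global parameter after the subcommand: argparse reports an error
    # and raises SystemExit, because the subparser does not know --workers
    try:
        parser.parse_args(["scan", "--workers", "4"])
    except SystemExit:
        print("global parameter after the subcommand was rejected")
```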
Convert all input data (state segmentations plus metadata) into a binary HDF5 file. Currently, ChromHMM and EpiCSeg output files are supported out-of-the-box. This creates the SCIDDO DATA file.
sciddo.py [GLOBAL_PARAMETERS] convert --help
Compute a bunch of statistics (e.g., state composition per sample) that are potentially needed downstream.
sciddo.py [GLOBAL_PARAMETERS] stats --help
Add scoring schemes (matrices) to the dataset. These can be derived automatically from the state segmentation model emissions (if provided during the convert step), or can be supplied in form of a user-defined file. Note that, in principle, an arbitrary number of scoring schemes can be added to a dataset.
sciddo.py [GLOBAL_PARAMETERS] score --help
Scan the dataset for differential chromatin domains. As opposed to the previous commands, this creates a separate output file per run, i.e., the SCIDDO RUN file.
sciddo.py [GLOBAL_PARAMETERS] scan --help
All data and metadata in the SCIDDO DATA and RUN files can be dumped to text files (e.g., TSV tables or BED files) for downstream analysis.
sciddo.py [GLOBAL_PARAMETERS] dump --help