Warning
A paper about this work is in preparation. If you are considering using this tool, please contact the author for attribution.
pancat implements many functions for working with GFA-like graphs in a single command-line tool, such as extracting subgraphs from a pangenome graph or adding offsets to it. It can compare the topology of graphs that contain the same set of sequences, and it renders pangenome graphs as interactive HTML files. It uses the gfagraphs library to load and manipulate pangenome graphs. Details about the implementation can be found here (in French only, sorry).
Note
Want to contribute? Feel free to open a PR or an issue about a missing, buggy or incomplete feature!
Requires python
Install with the following commands; updates can be run with just
(requires just)
git clone https://github.com/Tharos-ux/pancat.git
cd pancat
pip install -r requirements.txt --upgrade
python -m pip install . --quiet
Warning
This tool is under heavy development, and so is its associated library. I advise running pip install gfagraphs --upgrade
every now and then, whenever you update the tool. Issues on this project are more than welcome, as I could not test every use case! Feel free to open one here if any problem occurs.
This program is a collection of tools. Not every function or script is accessible through the front-end pancat
, but this front-end showcases what the tools can do.
Other tools are in the scripts
folder.
The following commands are available through pancat
:
- offset adds relative position information as a tag in the GFA file
- correct (WIP, experimental) corrects the graph by adding missing information back into it.
- grapher creates an interactive graph representation from a GFA file
- multigrapher creates an interactive graph representation of the differences between two pangenome graphs
- stats gathers basic stats from the input GFA
- complete assesses if the graph is a complete pangenome graph (all genomes fully embedded in the graph)
- reconstruct recreates the linear sequences from the graph
- edit computes an edit distance between variation graphs
- compress (WIP, experimental) losslessly compresses the graph by collapsing substitution bubbles
- unfold (WIP, experimental) breaks cycles in the graph by adding nodes and edges to it
Previously available (and coming back soon):
- isolate extracts a subgraph from positions in the paths
- neighborhood extracts a subgraph from a set of nodes around a node
- cycles detects and (optionally) linearizes all loops in the graph
With this command, you can create an interactive HTML view of your graph, with sequences in the nodes (S-lines) and nodes connected by edges (L-lines). If additional information is given (such as W-lines or P-lines), supplementary edges are drawn to show the paths that the genomes follow in the graph.
pancat grapher [-h] [-b BOUNDARIES [BOUNDARIES ...]] file output
positional arguments:
file Path to a gfa-like file
output Output path for the html graph file.
options:
-h, --help show this help message and exit
-b BOUNDARIES [BOUNDARIES ...], --boundaries BOUNDARIES [BOUNDARIES ...]
One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp
and one for nodes in range 2001-inf bp).
When using this command, please work only with graphs of fewer than 10k nodes. To do so, you may flatten the graph or extract subgraphs (using for instance pancat neighborhood or pancat isolate).
The -b
/--boundaries
option lets you choose the size classes to distinguish. Each class gets its own color, and the number of nodes in each class is computed separately.
The output
argument may be a path to a folder (existing or not) or a path to a file (with or without the .html extension).
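To make the line types mentioned in this section concrete, here is a minimal sketch of how S-lines, L-lines and P-lines map to nodes, edges and genome paths. It is an illustration only, not pancat's or gfagraphs' actual loader: the function name `parse_gfa` and the example graph are made up, and W-lines and optional tags are ignored.

```python
def parse_gfa(text):
    """Split GFA text into nodes (S-lines), edges (L-lines) and paths (P-lines)."""
    nodes, edges, paths = {}, [], {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            # S <name> <sequence>
            nodes[fields[1]] = fields[2]
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            edges.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "P":
            # P <name> <s1+,s2-,...> <overlaps>; each step ends with its orientation
            paths[fields[1]] = [(step[:-1], step[-1]) for step in fields[2].split(",")]
    return nodes, edges, paths


# Hypothetical three-node graph with two embedded genomes.
EXAMPLE = "\n".join([
    "H\tVN:Z:1.0",
    "S\ts1\tACGT",
    "S\ts2\tTT",
    "S\ts3\tGGA",
    "L\ts1\t+\ts2\t+\t0M",
    "L\ts1\t+\ts3\t+\t0M",
    "L\ts2\t+\ts3\t+\t0M",
    "P\tgenomeA\ts1+,s2+,s3+\t*",
    "P\tgenomeB\ts1+,s3+\t*",
])

nodes, edges, paths = parse_gfa(EXAMPLE)

# Consecutive steps of a path are the node pairs a viewer could highlight per genome.
path_edges = {name: list(zip(steps, steps[1:])) for name, steps in paths.items()}
```

In this toy graph, genomeB skips node s2, so its only path edge goes from s1 to s3.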
With this command, you can output basic stats on your graph.
pancat stats [-h] [-b BOUNDARIES [BOUNDARIES ...]] file
positional arguments:
file Path to a gfa-like file
options:
-h, --help show this help message and exit
-b BOUNDARIES [BOUNDARIES ...], --boundaries BOUNDARIES [BOUNDARIES ...]
One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp
and one for nodes in range 2001-inf bp).
This program prints stats to the command line (stdout). You may redirect the output to a file if you want to use it on a cluster, for example: pancat stats graph.gfa > out.txt
The -b
/--boundaries
option lets you choose the size classes to distinguish. The number of nodes in each class is computed separately.
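The size classes described by the help text (with -b 50 2000: 0-50 bp, 51-2000 bp and 2001-inf bp) can be sketched as a simple binning of node lengths. This is an assumption about the binning rule based on the help message, not pancat's actual code; `size_class` and `count_by_class` are hypothetical names.

```python
from bisect import bisect_left
from collections import Counter


def size_class(length, boundaries):
    """Index of the size class a node length falls in (upper bounds are inclusive)."""
    return bisect_left(sorted(boundaries), length)


def count_by_class(lengths, boundaries):
    """Number of nodes in each size class, from the smallest class to the largest."""
    counts = Counter(size_class(length, boundaries) for length in lengths)
    return [counts.get(index, 0) for index in range(len(boundaries) + 1)]


# Hypothetical node lengths, binned with the boundaries from the help text.
node_lengths = [1, 50, 51, 2000, 2001, 10000]
per_class = count_by_class(node_lengths, [50, 2000])
```

With n boundaries there are always n + 1 classes, the last one being open-ended.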
With this command, you can reconstruct linear sequences from the graph.
pancat reconstruct [-h] -r REFERENCE [--start START] [--stop STOP] [-s] file out
positional arguments:
file Path to a gfa-like file
out Output path (without extension)
options:
-h, --help show this help message and exit
-r REFERENCE, --reference REFERENCE
Tells the reference sequence we seek start and stop into
--start START To specifiy a starting node on reference to create a subgraph
--stop STOP To specifiy a ending node on reference to create a subgraph
-s, --split Tells to split in different files
For this function, the -r
/--reference
option is needed only if you specify starting and ending points.
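Conceptually, reconstructing a linear sequence means walking a path and concatenating the sequence of each node, reverse-complemented when the node is traversed in the '-' orientation. The sketch below shows that idea under the assumption that edges carry no overlap; it is not pancat's implementation, and the names and data are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence (A<->T, C<->G)."""
    return sequence.translate(_COMPLEMENT)[::-1]


def reconstruct(nodes, steps):
    """Concatenate node sequences along a path of (node_name, orientation) steps."""
    parts = []
    for name, orientation in steps:
        sequence = nodes[name]
        parts.append(sequence if orientation == "+" else reverse_complement(sequence))
    return "".join(parts)


# Hypothetical graph content: node name -> sequence, and one path through it.
nodes = {"s1": "ACGT", "s2": "TT", "s3": "GGA"}
path = [("s1", "+"), ("s2", "-"), ("s3", "+")]
linear = reconstruct(nodes, path)
```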
With this command, you can add a GFA-compatible JSON string to each S-line of the graph (each node). This field contains the starting position, ending position and orientation of the node for each path in the graph.
pancat offset [-h] file out
positional arguments:
file Path to a gfa-like file
out Output path (with extension)
options:
-h, --help show this help message and exit
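As an illustration of what such a field could hold, the sketch below walks each path, accumulates node lengths, and serialises the result as a JSON-typed GFA tag (type J). The tag name `PO`, the 0-based half-open coordinates and the JSON layout are assumptions made for this example; the actual tag written by pancat offset may differ.

```python
import json


def path_offsets(nodes, paths):
    """For each node, list [start, end, orientation] per path that traverses it."""
    offsets = {name: {} for name in nodes}
    for path_name, steps in paths.items():
        position = 0
        for node, orientation in steps:
            end = position + len(nodes[node])
            offsets[node].setdefault(path_name, []).append([position, end, orientation])
            position = end
    return offsets


def tagged_s_line(name, sequence, node_offsets, tag="PO"):
    """Build an S-line carrying the offsets as a JSON tag (hypothetical tag name)."""
    payload = json.dumps(node_offsets, separators=(",", ":"))
    return f"S\t{name}\t{sequence}\t{tag}:J:{payload}"


# Hypothetical graph: three nodes, two genomes.
nodes = {"s1": "ACGT", "s2": "TT", "s3": "GGA"}
paths = {
    "genomeA": [("s1", "+"), ("s2", "+"), ("s3", "+")],
    "genomeB": [("s1", "+"), ("s3", "+")],
}
offsets = path_offsets(nodes, paths)
line = tagged_s_line("s2", nodes["s2"], offsets["s2"])
```

A node visited several times by the same path (a cycle) simply gets several entries in that path's list.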
To be compared, two graphs must meet two criteria:
- they share at least some paths
- reconstructing those shared paths yields the same sequences in both graphs
If those criteria are met, you may compare your graphs.
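The two criteria above can be checked before running a comparison. The sketch below is one possible way to do it, assuming each graph is given as a pair (nodes, paths) with no overlaps between nodes; `comparable_paths` is a hypothetical helper, not part of pancat.

```python
def reconstruct(nodes, steps):
    """Concatenate node sequences along a path, reverse-complementing '-' steps."""
    complement = str.maketrans("ACGT", "TGCA")
    return "".join(
        nodes[name] if orientation == "+" else nodes[name].translate(complement)[::-1]
        for name, orientation in steps
    )


def comparable_paths(graph_a, graph_b):
    """Return (shared paths with identical sequences, shared paths that differ)."""
    nodes_a, paths_a = graph_a
    nodes_b, paths_b = graph_b
    matching, mismatching = [], []
    for name in sorted(set(paths_a) & set(paths_b)):
        same = reconstruct(nodes_a, paths_a[name]) == reconstruct(nodes_b, paths_b[name])
        (matching if same else mismatching).append(name)
    return matching, mismatching


# Two hypothetical graphs that segment the same genomes differently.
graph_a = (
    {"a1": "ACGT", "a2": "TT"},
    {"g1": [("a1", "+"), ("a2", "+")], "g2": [("a1", "+")]},
)
graph_b = (
    {"b1": "AC", "b2": "GTTT", "b3": "ACGA"},
    {"g1": [("b1", "+"), ("b2", "+")], "g2": [("b3", "+")], "g3": [("b1", "+")]},
)
matching, mismatching = comparable_paths(graph_a, graph_b)
```

Here g1 spells the same sequence in both graphs despite different node boundaries, g2 differs, and g3 is ignored because it exists in only one graph.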
pancat edit [-h] -o OUTPUT_PATH [-p PATTERN] [-g] [-c CORES] [-s [SELECTION ...]] [-t] graph_A graph_B
positional arguments:
graph_A Path to a GFA-like file.
graph_B Path to a GFA-like file.
options:
-h, --help show this help message and exit
-o OUTPUT_PATH, --output_path OUTPUT_PATH
Path to a .json output for results.
-p PATTERN, --pattern PATTERN
Regexp to filter if present in path/walks names.
-g, --graph_level Asks to perform edition computation at graph level.
-c CORES, --cores CORES
Number of cores for computing edition
-s [SELECTION ...], --selection [SELECTION ...]
Names of the paths you want to compute edition on.
-t, --trace_memory Print to log file memory usage of data structures.
It now also supports regular expressions to easily match paths whose names differ between graphs, for instance in HPRC files, where pancat edit $CACTUS $PGGB --output_path $WD"hprc_21_edition.json" --graph_level --cores 16 --pattern "^(.+?)#" --trace_memory
can be used to compare individual chromosomes.
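To see what the pattern from the command above captures, the sketch below applies it to a few path names with Python's re module. The names are hypothetical examples in a sample#haplotype#contig style; how pancat uses the captured text internally is not shown here.

```python
import re

# Same expression as in the command above: lazily capture everything before the first '#'.
PATTERN = re.compile(r"^(.+?)#")


def captured_prefix(path_name):
    """Return the text captured by the pattern, or None if the name has no '#'."""
    match = PATTERN.match(path_name)
    return match.group(1) if match else None


# Hypothetical path names as they might appear in two differently built graphs.
names = ["HG002#1#chr21", "HG002#2#chr21", "CHM13#chr21", "chr21"]
prefixes = [captured_prefix(name) for name in names]
```

Because the quantifier is lazy, only the part before the first '#' is captured, so names that differ after that point still yield the same prefix.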