Sergey Aganezov jr edited this page Oct 30, 2016 · 6 revisions

CAMSA works with assemblies represented as sets of individual assembly points (links between (oriented) scaffolds). This format allows a very broad range of scaffold assemblies, obtained with different techniques, to be processed within a single framework. In the text below, the terms scaffold and fragment are used interchangeably.

CAMSA looks at the order and orientation of scaffolds along chromosomes, so all input scaffold assemblies are expected to be built on the same set of input scaffolds (some scaffolds might be missing from some of the assemblies). It is also expected that each genomic region is represented uniquely by a scaffold or a gap, and that all scaffolders worked by ordering and orienting scaffolds.

Format specs

As mentioned above, CAMSA expects each assembly to be represented as a set of individual assembly points between pairs of scaffolds; every input scaffold assembly is composed of such points.

The standard CAMSA input file is a tab-separated text file with a header followed by a list of assembly points (one per line). An example is shown below:

origin    seq1    seq1_or    seq2    seq2_or
A1        s1      +          s2      -      
A1        s2      -          s3      +      
...
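As an illustration of how such a file could be produced, here is a minimal Python sketch that turns one ordered chain of oriented scaffolds into assembly points and writes them in the layout shown above. The helper names `chain_to_points` and `write_points` are hypothetical and not part of CAMSA; the sketch only assumes that each pair of neighbouring scaffolds in a chain yields one assembly point.

```python
import csv
import io

# Mandatory columns, in the order recommended on this page.
FIELDS = ["origin", "seq1", "seq1_or", "seq2", "seq2_or"]


def chain_to_points(origin, chain):
    """Turn an ordered chain of (scaffold_id, orientation) pairs into rows.

    Hypothetical helper: every pair of neighbouring scaffolds in the chain
    becomes one assembly point attributed to the assembly `origin`.
    """
    rows = []
    for (seq1, seq1_or), (seq2, seq2_or) in zip(chain, chain[1:]):
        rows.append({
            "origin": origin,
            "seq1": seq1, "seq1_or": seq1_or,
            "seq2": seq2, "seq2_or": seq2_or,
        })
    return rows


def write_points(rows, handle):
    """Write rows as a tab-separated table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# The chain s1(+) -> s2(-) -> s3(+) gives the two points of the example.
chain = [("s1", "+"), ("s2", "-"), ("s3", "+")]
rows = chain_to_points("A1", chain)
buffer = io.StringIO()
write_points(rows, buffer)
print(buffer.getvalue(), end="")
```

Writing to an in-memory buffer keeps the sketch self-contained; passing a file opened with `open(path, "w", newline="")` to `write_points` would produce a file on disk instead.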

Fields description:

mandatory

  • origin: id of the assembly that produced the corresponding assembly point.
  • seq1: id of the first scaffold that participates in the assembly point.
  • seq1_or: relative orientation (+/-/?) of the first scaffold in the assembly point.
  • seq2: id of the second scaffold that participates in the assembly point.
  • seq2_or: relative orientation (+/-/?) of the second scaffold in the assembly point.

optional

  • gap_size: integer value (>=0/?) giving the size of the gap between the two assembled scaffolds.
  • cw: confidence weight of the reported assembly point ([0, 1]/?). By default, cw=1 for oriented assembly points, while cw=0.75 for realizations of semi-oriented and non-oriented assembly points. These defaults can be overridden in CAMSA; please refer to the usage wiki page for how to do so.

The order of fields is determined by the header, so in principle there are no restrictions on how you organize the input file, but we recommend sticking with the order shown above for the main fields. Additional columns beyond those listed can be included freely; they are simply ignored.
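The header-driven layout described above can be sketched with a small Python reader. This is not CAMSA's own parser: the function names `read_points` and `default_cw` are hypothetical, and treating a missing or "?" cw value as "fall back to the default" is an assumption made for this illustration. The sketch looks columns up by header name, so the column order does not matter and unknown extra columns are dropped.

```python
import csv
import io

MANDATORY = ("origin", "seq1", "seq1_or", "seq2", "seq2_or")


def default_cw(seq1_or, seq2_or):
    # Defaults quoted in the field description: 1 for fully oriented points,
    # 0.75 when at least one orientation is unknown ("?").
    return 0.75 if "?" in (seq1_or, seq2_or) else 1.0


def read_points(handle):
    """Read assembly points from a tab-separated table with a header line."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in MANDATORY if name not in header]
    if missing:
        raise ValueError("missing mandatory columns: " + ", ".join(missing))
    points = []
    for row in reader:
        # Keep only the known fields; any other column is ignored.
        point = {name: (row[name] or "").strip() for name in MANDATORY}
        gap = (row.get("gap_size") or "?").strip()
        cw = (row.get("cw") or "?").strip()
        # Assumption: "?" or an absent column means "value not provided".
        point["gap_size"] = None if gap == "?" else int(gap)
        if cw == "?":
            point["cw"] = default_cw(point["seq1_or"], point["seq2_or"])
        else:
            point["cw"] = float(cw)
        points.append(point)
    return points


# Columns in a non-standard order, plus an extra "note" column.
sample = (
    "seq1\tseq1_or\tseq2\tseq2_or\torigin\tcw\tnote\n"
    "s1\t+\ts2\t-\tA1\t0.9\tfrom paired reads\n"
    "s2\t-\ts3\t?\tA1\t?\tunsure\n"
)
points = read_points(io.StringIO(sample))
for point in points:
    print(point)
```

Using `csv.DictReader` keeps the lookup by column name explicit, which mirrors the statement that the header, not the column position, defines the meaning of each field.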

Preparing input

The CAMSA input format is described on the input wiki page. This format is not commonly produced by conventional scaffolders, so some data preparation may be needed before CAMSA can process their output. We include multiple built-in conversion scripts that automate the translation of the more common scaffold assembly formats (FASTA, AGPv2.0, GRIMM, etc.) into the CAMSA format.

The general usage pattern for these conversion utilities is as follows:

xxx2camsa_points.py args

where xxx stands for the format of the input scaffold assembly. For details on each conversion script and the supported formats, please refer to the utils wiki page.