Skip to content

Commit

Permalink
update readme
Browse files Browse the repository at this point in the history
  • Loading branch information
ocxtal committed Jun 6, 2018
1 parent 0b23847 commit 6cf9115
Showing 1 changed file with 33 additions and 30 deletions.
63 changes: 33 additions & 30 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ SIMD-parallel BLAST X-drop alignment implementation. (work in progress...)
* BLAST X-drop DP algorithm ([semi-global **extension** aligmnent of affine-gap penalty](https://github.com/elucify/blast-docs/wiki/Gapped-Alignment-Phase)); expected to be sensitive enough for most usecases
* Vectorized with SSE4.1 (run on Core2 (2007) or later; 16-bit signed int is used for each cell)
* Linear-to-graph alignment (first developed for [vg map](https://github.com/vgteam/vg) subcommand; [POA](https://academic.oup.com/bioinformatics/article/18/3/452/236691)-like method)
* Header-only
* Header-only and C++-compatible

## Example

Expand Down Expand Up @@ -36,57 +36,58 @@ int8_t const score_matrix[16] = {
/* T */ X, X, X, M
};
struct dz_s *dz = dz_init(
score_matrix,
GI, GE,
xdrop_threshold
score_matrix,
GI, GE,
xdrop_threshold
);

/* pack query */
char const *query = "ACACTTCTAGACTTTACCACTA";
struct dz_query_s const *q = dz_pack_query_forward(
dz,
query, /* uint8_t const *seq: query sequence */
strlen(query) /* length */
query, /* char const *seq: query sequence */
strlen(query) /* length */
);

/* init node array */
struct dz_forefront_s const *ff[10] = { 0 };

/* fill root: node 0 */
ff[0] = dz_extend(
dz,
dz, q,
dz_root(dz), 1, /* struct dz_forefront_s const *ff[], uint64_t n_ffs; incoming forefront and degree; dz_root(dz) for root node */
"ACAC", strlen("ACAC"), 0 /* reference-side sequence, its length, and node id */
);

/* branching paths: 1, 2 -> 3 */
ff[1] = dz_extend(dz, &ff[0], 1, "TTGT", strlen("TTGT"), 1);
ff[2] = dz_extend(dz, &ff[0], 1, "ATCC", strlen("ATCC"), 2);
ff[3] = dz_extend(dz, &ff[1], 2, "AGAC", strlen("AGAC"), 3); /* "&ff[1], 2" indicates ff[1] and ff[2] are incoming nodes */
ff[1] = dz_extend(dz, q, &ff[0], 1, "TTGT", strlen("TTGT"), 1);
ff[2] = dz_extend(dz, q, &ff[0], 1, "ATCC", strlen("ATCC"), 2);
ff[3] = dz_extend(dz, q, &ff[1], 2, "AGAC", strlen("AGAC"), 3); /* "&ff[1], 2" indicates ff[1] and ff[2] are incoming nodes */

/* insertion: -, 4 -> 5 */
ff[4] = dz_extend(dz, &ff[3], 1, "T", strlen("T"), 4);
ff[5] = dz_extend(dz, &ff[3], 2, "TTCTA", strlen("TTCTA"), 5); /* "&ff[3], 2" indicates ff[3] and ff[4] are incoming nodes */
ff[4] = dz_extend(dz, q, &ff[3], 1, "T", strlen("T"), 4);
ff[5] = dz_extend(dz, q, &ff[3], 2, "TTCTA", strlen("TTCTA"), 5); /* "&ff[3], 2" indicates ff[3] and ff[4] are incoming nodes */

/* SNVs: 6, 7, 8 -> 9 */
ff[6] = dz_extend(dz, &ff[5], 1, "A", strlen("A"), 6);
ff[7] = dz_extend(dz, &ff[5], 1, "C", strlen("C"), 7);
ff[8] = dz_extend(dz, &ff[5], 1, "G", strlen("G"), 8);
ff[9] = dz_extend(dz, &ff[6], 3, "CACTA", strlen("CACTA"), 9);
ff[6] = dz_extend(dz, q, &ff[5], 1, "A", strlen("A"), 6);
ff[7] = dz_extend(dz, q, &ff[5], 1, "C", strlen("C"), 7);
ff[8] = dz_extend(dz, q, &ff[5], 1, "G", strlen("G"), 8);
ff[9] = dz_extend(dz, q, &ff[6], 3, "CACTA", strlen("CACTA"), 9);

/* traceback */
struct dz_alignment_s const *aln = dz_trace(
dz,
ff[9]
dz,
ff[9]
);
printf("ref_length(%u), query_length(%u), path(%s)", aln->ref_length, aln->query_length, aln->path);

printf("ref_length(%u), query_length(%u), path(%s)\n", aln->ref_length, aln->query_length, aln->path);
for(size_t i = 0; i < aln->span_length; i++) {
struct dz_path_span_s const *s = &aln->span[i];
printf("node_id(%u), subpath_length(%u), subpath(%.*s)\n",
s->id,
s[1].offset - s[0].offset,
s[1].offset - s[0].offset, &aln->path[s->offset]
);
struct dz_path_span_s const *s = &aln->span[i];
printf("node_id(%u), subpath_length(%u), subpath(%.*s)\n",
s->id,
s[1].offset - s[0].offset,
s[1].offset - s[0].offset, &aln->path[s->offset]
);
}

/* clean up */
Expand Down Expand Up @@ -121,14 +122,16 @@ Query sequence is first packed into an array of SIMD vectors with its encoding c
For the detailed description of the BLAST X-drop DP algorithm, see the [document by @elucify](https://github.com/elucify/blast-docs/wiki/Gapped-Alignment-Phase). The following a simplified illustration of POA-like graph-capable extension of the algorithm.
[figure]
![An illustration of linear-to-graph SIMD-parallel X-drop DP](https://raw.githubusercontent.com/ocxtal/dozeu/master/fig/xdrop.png)
*An illustration of the SIMD-parallel X-drop DP aligning a sequence to a graph.*
The fill-in function takes an array of preceding nodes. The function first merges the incoming vectors by taking the maximum for each DP cell. The DP matrix is extended column by column until the end of the reference sequence. A reference sequence with negative length is treated as its reverse-complemented scaned from the last character, for example, (seq, len) = (&"AACG"[4], -4) gives the same result as ("CGTT", 4).
The maximum score position is updated when newer cell value takes greater than or equal to the current maximum. It result in reporting longer path for the reference direction. On the other hand, the shortest path is reported for the query direction. (The former cmpeq behavior can be overwritten by defining `dz_cmp_max` macro).
The maximum-scoring node is not kept tracked internally; it must be tracked by caller. The reason the library is unaware of maximum-scoring node is that the score cannot be compared between two diverging paths (i.e. there is no way to know what the maximum of a leaf node is from another leaf when each node-filling function does not have knowledge about the whole tree structure).
The maximum-scoring node is not kept tracked internally; it must be tracked by caller. The reason the library is unaware of maximum-scoring node is that the scores cannot be compared between two diverging paths (i.e. there is no way to know what the maximum of another leaf node is from a leaf when each node-filling function does not have knowledge about the whole tree structure).
The traceback function takes a node (forefront object) pointer. It first determines the maximum-scoring cell for the node then begin traceback from the cell. It automatically roll the node chain up to the root and returns a pair of alignment path and mapping section array.
Expand All @@ -143,17 +146,17 @@ The query-sequence packing part can be removed when the 2-bit encoding (A, C, G,
#### Larger score matrix
The on-the-fly score profile calculation algorithm is unable to be applied to larger score matrix such as BLOSUM and PAM. These large matrices, which cannot be kept on SIMD registers, are better handled by the precalculated score profile vector algorithm. The modified alogirhtm would be much similar to the Rognes's composition. The array should be allocated in the local memory arena just after the `struct dz_query_s` object by `dz_mem_malloc` so that the overhead of allocation and access to be as small as possible. For typical protein score matrix we need to build 20 lanes, each of which length is equal to the query length rounded up by the SIMD register width.
The on-the-fly score profile calculation algorithm is unable to be applied to larger score matrices such as BLOSUM and PAM. These large matrices, which cannot be kept on SIMD registers, are better handled by the precalculated score profile vector algorithm. The modified alogirthm would be much similar to the Rognes's composition. The array should be allocated in the local memory arena just after the `struct dz_query_s` object by `dz_mem_malloc` so that the overhead of allocation and access to be as small as possible. For typical protein score matrix we need to build 20 lanes, each of which length is equal to the query length rounded up by the SIMD register width.
#### *k*-mer x *k*-mer matrix for Nanopore read polishing
We can also extend the precalculated score matrix algorithm to *k*-mer x *k*-mer matrix (e.g. 4096 x 4096 for *k* = 6), though, the number of vectors, 4096, is too huge to be filled on initialization. The vectors are also unable to be kept inside the last-level cache (4096 x 1 kbp = 4 MB > 3 MB). I believe we would get better performance by not creating the precalculated vectors but just fetching each score from the matrix when it is needed (in the unparallelized manner). It consumes a lot of CPU cycles waiting for the data to arrive however the situation is inevitable when we adopt such huge score matrix. So it would be even reasonable to hand the cache over to the DP matrix. It is also worth considering to keep locality of the score matrix with regards to the overlapping *k*-mers by adopting de Bruijn sequence ordering for each axis but the expected cache hit rate will still be low (I have not verified it yet).
In contrast, the 4 x *k*-mer matrix to align Nanopore signal to consensus sequences, the situation is almost the same as the original 4 x 4 score matrix. We still need the precalculated score profile vectors because the entire score matrix cannot be kept on the SIMD register. However, the vector construction is not a heavy task because the number of lanes is small.
In contrast, the 4 x *k*-mer matrix to align Nanopore signal to consensus sequences is treated in almost the same way as the original 4 x 4 score matrix. The difference is that we still need the precalculated score profile vectors because the entire score matrix cannot be kept on the SIMD register. However, the vector construction is not a heavy task because the number of lanes is small.
## Acknowledgements
I would like to give thank to Erik Garrison for his description on the internals of the vg toolkit, and insightful comments about alignment algorithms. I also thank very much to Toshiaki Katayama and DBCLS staffs to hold RDF3 summit at Kyoto where this work has begun.
I would like to give thank to Erik Garrison for his description on the internals of the vg toolkit, and insightful comments about alignment algorithms. I also thank Toshiaki Katayama and DBCLS staffs for holding RDF3 summit at Kyoto where this work has begun.
## License
Expand Down

0 comments on commit 6cf9115

Please sign in to comment.