Indexing corpora

Once a corpus has been Manatee'd, add it to the interface and index it with these steps:

If there are interesting subcorpora, run e.g. mksubc ~/storage/registry/dan_twitter ~/storage/corpora/dan_twitter/subc/ ~/storage/registry/dan_twitter.subc

mkdir -pv ~/storage/corpora/dan_twitter/meta  ~/storage/corpora/dan_twitter/tmp
cd ~/storage/corpora/dan_twitter/tmp

# Count tokens, absolute frequencies, and histograms. Use -total if the corpus has no lstamp values with years
~/public_html/_bin/decodevert-word-lex-pos ~/storage/registry/dan_twitter | time ~/public_html/_src/build/index-corpus-year-lstamp
cat commands.sql | time sqlite3 stats.sqlite

# Calculate relative frequencies
time ~/public_html/_bin/stats-calc ~/storage/corpora/dan_twitter/tmp/stats.sqlite

mv -v ~/storage/corpora/dan_twitter/tmp/stats.sqlite ~/storage/corpora/dan_twitter/meta/stats.sqlite
rm -rf ~/storage/corpora/dan_twitter/tmp
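The commands above hard-code dan_twitter. As a minimal sketch, assuming the same directory layout applies to every corpus, the sequence can be generated for any corpus name and reviewed before it is run. The function name print_index_commands is hypothetical and not part of corp-ui; it only prints the commands and executes nothing.

```shell
#!/usr/bin/env bash
# Sketch only: print the per-corpus stats commands from the steps above
# for a given corpus name. print_index_commands is a hypothetical helper.
print_index_commands() {
    local corpus="$1"
    # Paths below follow the layout used in the steps above (assumed to be
    # the same for every corpus).
    local base="$HOME/storage/corpora/$corpus"
    local registry="$HOME/storage/registry/$corpus"
    local bin="$HOME/public_html/_bin"
    local build="$HOME/public_html/_src/build"

    # Unquoted heredoc delimiter: variables expand, double quotes are printed as-is.
    cat <<EOF
mkdir -pv "$base/meta" "$base/tmp"
cd "$base/tmp"
"$bin/decodevert-word-lex-pos" "$registry" | time "$build/index-corpus-year-lstamp"
cat commands.sql | time sqlite3 stats.sqlite
time "$bin/stats-calc" "$base/tmp/stats.sqlite"
mv -v "$base/tmp/stats.sqlite" "$base/meta/stats.sqlite"
rm -rf "$base/tmp"
EOF
}

# Example: print the command sequence for dan_twitter.
print_index_commands dan_twitter
```

Printing rather than executing keeps the rm -rf step visible for inspection; the output can be pasted into a shell once the paths look right.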

Edit _inc/config.php to add the corpus and all of its subcorpora to the $GLOBALS['-corpora'] array.
Update the global stats for the language, e.g. time ~/public_html/_bin/stats-combine dan
If there are group-by attributes, index those, passing a colon-separated list of attributes:

cd ~/storage/corpora/dan_literature/meta
~/public_html/_bin/decodevert-word-lex ~/storage/registry/dan_literature | grep -v '===NONE===' | time ~/public_html/_bin/group-by group-by.sqlite 'author:title:year'
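The group-by step takes its attributes as a single colon-separated string. A small sketch of building that string from separate attribute names, so that a long list is easier to edit; join_attrs is a hypothetical helper, not part of corp-ui.

```shell
#!/usr/bin/env bash
# join_attrs (hypothetical helper): join all arguments with ':'.
join_attrs() {
    local IFS=':'          # "$*" joins arguments with the first character of IFS
    printf '%s\n' "$*"
}

attrs="$(join_attrs author title year)"
echo "$attrs"
```

The resulting "$attrs" value can then be passed as the last argument to group-by in place of the literal 'author:title:year'.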

TODO

(Frequencies should include POS) - No, restrict it in the search and compare manually instead
Share corpus search without password
If no corpora are selected, pick the largest unprotected ones
Adjustable context size
Implement sibling search
Highlight parents if searched for
Per-language help links at the top to CG grammar docs
Break down Group By hits into per-s histogram
Annotate Group-By bars with unique column values not part of the group-by
Group-By type-token relation via lex_POS
- FIX sparse calculation
View the whole work (for open corpora)
2D queries as scatter plots (e.g., Q+/- and a semantic class)
- Fields to limit on absolute X/Y value
- User-defined cutoff, default 0.1 or 0.05
- Colors by max of all, x, or y
- Toggle text
Sparse show only in table
Use semantic vector model to disambiguate semantics
Double-check c_words + c_numbers + c_alnums in stats-combine

GrammarSoft/corp-ui
