evyatarmeged/wiki-dumps-word-counter

Wiki-dumps-word-counter

Counts word occurrences in hewiki dumps downloaded from https://dumps.wikimedia.org/hewiki/.
Uses WikiExtractor to extract text from the XML dump, then parses each article with regular expressions to strip out all non-Hebrew characters.
Finally, writes the results to a CSV file.
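
The strip, count, and write steps can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `count_hebrew_words` and `write_counts`, the CSV column names, and the choice to treat only the letters U+05D0 to U+05EA as Hebrew are all assumptions. It also assumes the article text has already been extracted from the XML dump.

```python
import csv
import re
from collections import Counter

# Assumption: "Hebrew characters" means the letters alef..tav (U+05D0..U+05EA).
# Every run of other characters (Latin, digits, punctuation, niqqud) is
# treated as a word separator.
NON_HEBREW = re.compile(r"[^\u05D0-\u05EA]+")


def count_hebrew_words(articles):
    """Count Hebrew words across an iterable of article strings (hypothetical helper)."""
    counts = Counter()
    for text in articles:
        cleaned = NON_HEBREW.sub(" ", text)  # replace non-Hebrew runs with a space
        counts.update(cleaned.split())       # split on whitespace and tally
    return counts


def write_counts(counts, path):
    """Write word counts to a CSV file, most frequent first (hypothetical helper)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])  # assumed header row
        writer.writerows(counts.most_common())
```

Because non-Hebrew characters become separators in this sketch, a word containing an apostrophe or vowel points is split into pieces; a different policy (for example, deleting such characters instead) would give different counts.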

Update

Added a Python version in the "python" branch for comparison/benchmarking purposes.
