A tool for extracting plain text from Wikipedia dumps

verotne/wikiextractor_edited

 
 

Remove Multiprocessing dependency from WikiExtractor.py

This fork modifies the original WikiExtractor.py so that it can be executed on Windows.

Multiprocessing is not used, so extraction runs in a single process and performance is lower than it could be.
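To illustrate what single-process extraction looks like in general, here is a minimal sketch using only the Python standard library. It is not the code in this repository: `iter_pages`, `extract_sequential`, and the `process_page` callback are hypothetical names chosen for illustration, and the sketch assumes each `<page>` and `</page>` tag sits on its own line, as is typical in the dumps.

```python
import bz2


def iter_pages(stream):
    """Yield the raw text of each <page>...</page> block, one at a time.

    Assumption: the opening and closing tags each appear on their own line.
    """
    page = []
    in_page = False
    for line in stream:
        if "<page>" in line:
            in_page = True
            page = []
        if in_page:
            page.append(line)
        if in_page and "</page>" in line:
            in_page = False
            yield "".join(page)


def extract_sequential(dump_path, process_page):
    """Apply process_page to every page in the calling process.

    No worker pool is created, so nothing here depends on the
    multiprocessing module; pages are handled strictly one after another.
    """
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as stream:
        return [process_page(page) for page in iter_pages(stream)]
```

Because every page is handled in the same process, throughput is bounded by a single CPU core, which is the trade-off described above.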

Usage: python -m WikiExtractor.py enwiki-20231201-pages-articles-multistream1.xml-p1p41242.bz2 --output C:/output/

Caveat: this has not been tested on other use cases.

Wikipedia database backup dump
