Python script to split the text generated by 'wikipedia parallel title extractor' into separate text files (separate file for each language)

moodser/splitter-transliteration

Splitter for generating a transliteration corpus

Description

  • This is a Python script that takes as input the text file generated by the Wikipedia Parallel Title Extractor (https://github.com/clab/wikipedia-parallel-titles).
  • The script processes that input file to generate a parallel corpus.
  • The output (the parallel corpus) can be used to train a transliteration model with Moses.
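
The processing step described above can be sketched roughly as follows. This is a minimal illustration and not the code in splitter.py: it assumes each input line holds one title pair separated by `|||`, and the names `DELIMITER` and `split_pairs` are hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumption: the two titles on each input line are separated by "|||".
DELIMITER = "|||"


def split_pairs(
    lines: Iterable[str], delimiter: str = DELIMITER
) -> Tuple[List[str], List[str]]:
    """Split 'source ||| target' lines into two aligned lists of titles."""
    source: List[str] = []
    target: List[str] = []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        if len(parts) != 2:
            # Skip malformed lines so both sides stay line-aligned.
            continue
        left, right = parts[0].strip(), parts[1].strip()
        if left and right:
            source.append(left)
            target.append(right)
    return source, target
```

Writing each returned list to its own file, one title per line, would give the two line-aligned sides of a parallel corpus.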

Author

Acknowledgement

Special thanks to Dr. Rao Muhammad Adeel Nawab and Sir Muhammad Sharjeel for their continuous support.

Usage

  • Download the script file (splitter.py).
  • Copy the input file (generated by the Wikipedia parallel titles script) into the same directory.
  • Run the terminal/cmd command 'python splitter.py '
  • Two output files are generated, one for each language.
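
After running the script, it may be worth confirming that the two output files are line-aligned, since a parallel corpus needs line N of one file to correspond to line N of the other. The sketch below is only a suggestion; the helper names `count_lines` and `is_aligned` are hypothetical, and the paths passed in are whatever file names the script produced.

```python
def count_lines(path: str) -> int:
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def is_aligned(source_path: str, target_path: str) -> bool:
    """Return True when both files contain the same number of lines."""
    return count_lines(source_path) == count_lines(target_path)
```

A mismatch in line counts usually means some input lines were split incorrectly and the corpus should not be used for training until that is resolved.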

Caution
