Word Segmentation for Sanskrit (Unsandhied output from Sandhied sentences)

kpiyush16/sanskrit


Sanskrit Word Segmentation by CopyNet + nmt (TensorFlow)

Requirements:

  • Python 2.7
  • subword-nmt
pip install subword-nmt
  • TensorFlow > 1.5.0 (consider using the latest Anaconda release)

Instructions to generate Byte Pair Encoded (BPE) data

  • Change directory to nmt/BPE_2ch
  • Unzip "data_split.tar.gz" and "combined.tar.gz"
  • Run the "inps.sh" file
./inps.sh

Testing while Training

python2.7 -m nmt.nmt \
  --copynet=True --share_vocab=True --attention=scaled_luong \
  --src=src --tgt=trg \
  --vocab_prefix=nmt/BPE_2ch/vocab \
  --train_prefix=nmt/BPE_2ch/train_BPE \
  --test_prefix=nmt/BPE_2ch/test_BPE \
  --dev_prefix=nmt/BPE_2ch/valid_BPE \
  --out_dir=nmt/models_2ch/ \
  --num_train_steps=80000 --steps_per_stats=100 \
  --encoder_type=bi --num_layers=4 --num_units=128 --dropout=0.4 \
  --metrics=bleu --num_keep_ckpts=80 --decay_scheme=luong10 \
  --batch_size=64 --src_max_len=150 --tgt_max_len=150

Only Training

python2.7 -m nmt.nmt \
  --copynet=True --share_vocab=True --attention=scaled_luong \
  --src=src --tgt=trg \
  --vocab_prefix=nmt/BPE_2ch/vocab \
  --train_prefix=nmt/BPE_2ch/train_BPE \
  --dev_prefix=nmt/BPE_2ch/valid_BPE \
  --out_dir=nmt/models_2ch/ \
  --num_train_steps=80000 --steps_per_stats=100 \
  --encoder_type=bi --num_layers=4 --num_units=128 --dropout=0.4 \
  --metrics=bleu --num_keep_ckpts=80 --decay_scheme=luong10 \
  --batch_size=64 --src_max_len=150 --tgt_max_len=150

Only Testing

python2.7 -m nmt.nmt --out_dir=nmt/models_2ch --inference_input_file=nmt/infer_file.L1 --inference_output_file=nmt/models_2ch/output_infer
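Because the model is trained on BPE data, its output still contains subword units. By the standard subword-nmt convention, continuation is marked with "@@ ", which must be stripped to recover whole words. The helper below is a sketch of that post-processing step; the function name is hypothetical and not part of this codebase.

```python
def undo_bpe(line):
    """Rejoin subword units produced by subword-nmt.

    subword-nmt marks a subword that continues into the next token with a
    trailing '@@'. Removing '@@ ' (and any trailing '@@') restores words.
    """
    return line.replace("@@ ", "").replace("@@", "")

print(undo_bpe("rama@@ sya gra@@ me"))  # -> ramasya grame
```

Applying this to output_infer yields the final unsandhied word sequence.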

Vocabulary Setting

Since in CopyNet scenarios the target sequence contains words taken from the source sentence, the best choice is to share a single vocabulary between source and target. We also use a parameter, the generated vocabulary size (the number of target vocabulary entries excluding words copied from the source sequence), to indicate that the first N (= generated vocabulary size) words in the shared vocabulary are in generate mode, while target word indexes larger than N are copied.

In this codebase, the variables vocab_size and gen_vocab_size hold the shared vocabulary size and the generated vocabulary size, respectively.
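The generate/copy split described above can be sketched as follows. The toy vocabulary and the helper mode_for_index are hypothetical illustrations; only the variable names vocab_size and gen_vocab_size mirror the codebase.

```python
# Illustrative sketch of the shared-vocabulary generate/copy split
# (toy data; only vocab_size and gen_vocab_size mirror the codebase).
shared_vocab = ["<unk>", "<s>", "</s>", "asti", "iti", "rama", "grama"]
vocab_size = len(shared_vocab)   # full shared vocabulary size
gen_vocab_size = 5               # first N entries can be generated

def mode_for_index(idx):
    """Indexes below gen_vocab_size are generated; the rest must be copied."""
    return "generate" if idx < gen_vocab_size else "copy"

for idx, word in enumerate(shared_vocab):
    print(idx, word, mode_for_index(idx))
```

Here "rama" and "grama" (indexes 5 and 6) can only reach the output by being copied from the source sentence, while the first five entries are available to the decoder's generation softmax.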
