Small bugfix for yomdle_zh #2791

Merged 29 commits on Oct 19, 2018
29 commits
76dcd08
initial commit of yomdle farsi
Sep 13, 2018
0dee6e1
added README
Sep 13, 2018
dadb232
added some more comments
Sep 13, 2018
317b5b8
added option to use utf8 to prepend words
Sep 13, 2018
8ecd648
changed normalized scoring to use data/test/text.old for ref files. A…
Sep 18, 2018
18970f0
adding normalization scripts to local/wer_output_filter
Sep 19, 2018
56f8dad
merged upstream and fixed conflicts with utils/lang/bpe/prepend_words…
Sep 19, 2018
d74410d
minor bug fix
Sep 19, 2018
97d23e5
initial commit for yomdle_zh
Sep 19, 2018
98cbe82
forgot to flip augment data
Sep 19, 2018
e3cb43e
fixed problems with nbsp and ideographic space
Sep 19, 2018
e2d5e84
Merge remote-tracking branch 'ChunChiehChang/yomdle2' into yomdle
Sep 20, 2018
98d538f
add changjie mapping
Sep 20, 2018
4d5d221
Merge remote-tracking branch 'ChunChiehChang/yomdle2' into yomdle
Sep 20, 2018
fe7e607
decrease number of leaves and minibatch size
Sep 21, 2018
a45b94b
added results to top of script and fixed bug in run_end2end.sh
Sep 24, 2018
b82b9a1
Merge remote-tracking branch 'ChunChiehChang/yomdle2' into yomdle
Sep 24, 2018
9d3156d
fixed minor bug
Sep 24, 2018
5dca66a
removed unused local/normalized_scoring and unused commented out code
Oct 1, 2018
e0af59e
modified README
Oct 1, 2018
65a18fe
changed file names
Oct 1, 2018
6679a9d
added examples to gedi2csv and yomdle2csv scripts. Also added code to…
Oct 1, 2018
654992b
removed unused comment
Oct 1, 2018
cc87549
minor change
Oct 1, 2018
1736397
Merge remote-tracking branch 'upstream/master' into yomdle2
Oct 15, 2018
4cebf12
fix minor changes
Oct 15, 2018
3a2a88d
merged yomdle into yomdle2
Oct 15, 2018
cacada4
don't use gpu for align
Oct 15, 2018
3f9135e
adding bidi script to bpe, this is an alternative to /utils/lang/bpe/…
Oct 16, 2018
58 changes: 58 additions & 0 deletions egs/wsj/s5/utils/lang/bpe/bidi.py
@@ -0,0 +1,58 @@
#!/usr/bin/env python3
# Copyright 2018 Chun-Chieh Chang

# This script is largely written by Stephen Rawls
# and uses the python package https://pypi.org/project/PyICU_BiDi/
# The code leaves right to left text alone and reverses left to right text.

import icu_bidi
import io
import sys
import unicodedata
# R=strong right-to-left; AL=strong arabic right-to-left
rtl_set = set(chr(i) for i in range(sys.maxunicode)
              if unicodedata.bidirectional(chr(i)) in ['R','AL'])
def determine_text_direction(text):
    # Easy case first
    for char in text:
        if char in rtl_set:
            return icu_bidi.UBiDiLevel.UBIDI_RTL
    # If we made it here we did not encounter any strongly rtl char
    return icu_bidi.UBiDiLevel.UBIDI_LTR

def utf8_visual_to_logical(text):
    text_dir = determine_text_direction(text)

    bidi = icu_bidi.Bidi()
    bidi.inverse = True
    bidi.reordering_mode = icu_bidi.UBiDiReorderingMode.UBIDI_REORDER_INVERSE_LIKE_DIRECT
    bidi.reordering_options = icu_bidi.UBiDiReorderingOption.UBIDI_OPTION_DEFAULT # icu_bidi.UBiDiReorderingOption.UBIDI_OPTION_INSERT_MARKS

    bidi.set_para(text, text_dir, None)

    res = bidi.get_reordered(0 | icu_bidi.UBidiWriteReorderedOpt.UBIDI_DO_MIRRORING | icu_bidi.UBidiWriteReorderedOpt.UBIDI_KEEP_BASE_COMBINING)

    return res

def utf8_logical_to_visual(text):
    text_dir = determine_text_direction(text)

    bidi = icu_bidi.Bidi()

    bidi.reordering_mode = icu_bidi.UBiDiReorderingMode.UBIDI_REORDER_DEFAULT
    bidi.reordering_options = icu_bidi.UBiDiReorderingOption.UBIDI_OPTION_DEFAULT #icu_bidi.UBiDiReorderingOption.UBIDI_OPTION_INSERT_MARKS

    bidi.set_para(text, text_dir, None)

    res = bidi.get_reordered(0 | icu_bidi.UBidiWriteReorderedOpt.UBIDI_DO_MIRRORING | icu_bidi.UBidiWriteReorderedOpt.UBIDI_KEEP_BASE_COMBINING)

    return res


##main##
sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf8")
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf8")
for line in sys.stdin:
    line = line.strip()
    line = utf8_logical_to_visual(line)[::-1]
    sys.stdout.write(line + '\n')
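Note on the direction check: determine_text_direction treats a line as right-to-left as soon as it contains one character whose Unicode bidi class is R or AL. The sketch below shows the same test using only the standard library, without precomputing a set over all code points. The name has_strong_rtl is hypothetical and is not part of bidi.py; it returns a boolean rather than an icu_bidi level.

```python
import unicodedata

# Bidi classes that mark a strongly right-to-left character:
# "R" (for example Hebrew letters) and "AL" (Arabic letters).
STRONG_RTL = {"R", "AL"}


def has_strong_rtl(text):
    """Return True if any character in text is strongly right-to-left.

    Hypothetical helper for illustration; bidi.py itself returns
    icu_bidi.UBiDiLevel values instead of a boolean.
    """
    return any(unicodedata.bidirectional(ch) in STRONG_RTL for ch in text)


print(has_strong_rtl("hello"))                     # False
print(has_strong_rtl("abc \u05d0"))                # True (Hebrew alef)
print(has_strong_rtl("\u0633\u0644\u0627\u0645"))  # True (Arabic letters)
```

Looking up each character on demand avoids the start-up cost of scanning every code point, at the price of one unicodedata call per character of input.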
1 change: 1 addition & 0 deletions egs/yomdle_zh/v1/local/create_download.sh
@@ -43,3 +43,4 @@ local/create_line_image_from_page_image.py \

echo "Downloading table for CangJie."
wget -P $download_dir/ $cangjie_url || exit 1;
+sed -ie '1,8d' $download_dir/cj5-cc.txt
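Note on the added line: the sed expression '1,8d' deletes lines 1 through 8 of the downloaded table in place, presumably to drop a header block before the mapping entries. A rough Python equivalent is sketched below. The function name drop_leading_lines and the demo file are hypothetical; the sketch works on bytes so it makes no assumption about the encoding of cj5-cc.txt.

```python
import tempfile
from pathlib import Path


def drop_leading_lines(path, count=8):
    """Rewrite the file at path in place without its first `count` lines."""
    p = Path(path)
    lines = p.read_bytes().splitlines(keepends=True)
    p.write_bytes(b"".join(lines[count:]))


# Demo on a throwaway ten-line file (not the real CangJie table).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "table.txt"
    demo.write_text("".join(f"line {i}\n" for i in range(1, 11)), encoding="utf-8")
    drop_leading_lines(demo, count=8)
    remaining = demo.read_text(encoding="utf-8")

print(remaining, end="")  # prints "line 9" and "line 10"
```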
2 changes: 1 addition & 1 deletion egs/yomdle_zh/v1/local/train_lm_lr.sh
@@ -59,7 +59,7 @@ if [ $stage -le 0 ]; then

rm ${dir}/data/text/* 2>/dev/null || true

-cat ${extra_lm} | local/bidi.py | utils/lang/bpe/prepend_words.py --encoding 'utf-8' | python3 utils/lang/bpe/apply_bpe.py -c $data_dir/train/bpe.out | sed 's/@@//g' > ${dir}/data/text/extra_lm.txt
+cat ${extra_lm} | utils/lang/bpe/prepend_words.py | python3 utils/lang/bpe/apply_bpe.py -c $data_dir/train/bpe.out | sed 's/@@//g' > ${dir}/data/text/extra_lm.txt

# Note: the name 'dev' is treated specially by pocolm, it automatically
# becomes the dev set.
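Note on the last pipeline stage: sed 's/@@//g' removes every occurrence of "@@" from the BPE output and leaves the spaces between pieces, so each subword unit stays a separate token in extra_lm.txt. The sketch below mirrors that one step in Python. It assumes that "@@" is the continuation marker written by apply_bpe.py; the function name strip_bpe_markers and the sample string are hypothetical.

```python
def strip_bpe_markers(line, marker="@@"):
    """Remove every BPE continuation marker, keeping the spaces between pieces."""
    return line.replace(marker, "")


# Made-up sample of BPE-segmented text; the pieces remain space-separated.
print(strip_bpe_markers("he@@ llo wor@@ ld"))  # he llo wor ld
```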
2 changes: 1 addition & 1 deletion egs/yomdle_zh/v1/run.sh
@@ -98,7 +98,7 @@ fi
if [ $stage -le 6 ]; then
echo "$0: Aligning the training data using the e2e chain model..."
echo "Date: $(date)."
-steps/nnet3/align.sh --nj $nj --cmd "$cmd" \
+steps/nnet3/align.sh --nj $nj --cmd "$cmd" --use-gpu false \
--scale-opts '--transition-scale=1.0 --acoustic-scale=1.0 --self-loop-scale=1.0' \
$data_dir/train_aug $data_dir/lang $exp_dir/chain/e2e_cnn_1a $exp_dir/chain/e2e_ali_train
fi