Iam #2658
Merged (38 commits, Sep 12, 2018)
Changes shown below are from 1 commit (the first of the 38 in this pull request).

Commits (38):
a3a18e2  adding changes for language modelling (aarora8, Aug 30, 2018)
91508b5  adding modifications for augmentation, topology, shearing, run.sh (aarora8, Aug 31, 2018)
5f273d6  fixing bugs (aarora8, Aug 31, 2018)
2645f14  fixing bug (aarora8, Aug 31, 2018)
6ebfdb2  adding parameter tuning (aarora8, Sep 1, 2018)
b532978  cosmetic fixes and updating results (aarora8, Sep 1, 2018)
f383334  cosmetic fixes (aarora8, Sep 1, 2018)
44c9e58  adding results (aarora8, Sep 1, 2018)
2d11672  removing local/prepare_lang and adding gen_topo in run.sh (aarora8, Sep 1, 2018)
4fc6705  fixing bugs (aarora8, Sep 1, 2018)
8877530  updating result (aarora8, Sep 2, 2018)
59e2c8b  updating documentation, results and parameter tuning (aarora8, Sep 2, 2018)
5fc0d17  fixing chain scripts (aarora8, Sep 2, 2018)
1138ee3  updating parameters (aarora8, Sep 2, 2018)
b3532ce  updating parameters and results (aarora8, Sep 3, 2018)
9b67d9d  adding overwrite option and punctuation topology (aarora8, Sep 3, 2018)
89c9ec7  adding overwrite option (aarora8, Sep 4, 2018)
c05cd4d  adding aachen splits (aarora8, Sep 4, 2018)
5dfe8fc  fixing bugs (aarora8, Sep 4, 2018)
d7448df  modification from review (aarora8, Sep 5, 2018)
d7d5c22  updating parameter and result (aarora8, Sep 6, 2018)
43e9af9  updating parameter and result (aarora8, Sep 6, 2018)
17c506b  adding data preprocessing in test and val (aarora8, Sep 7, 2018)
d640742  updating results (aarora8, Sep 7, 2018)
7dfd0b5  Merge branch 'master' of https://github.com/kaldi-asr/kaldi into iam_4 (aarora8, Sep 7, 2018)
94a80ad  replacing prepend words with common prepend words (aarora8, Sep 7, 2018)
711c3c9  updating remove_test_utterances_from_lob for aachen split (aarora8, Sep 7, 2018)
5f2d960  removing data/val/text from train_lm (aarora8, Sep 7, 2018)
7f2ad0b  cosmetic fixes in unk arc decoding (aarora8, Sep 7, 2018)
8f2ac25  adding val data for decoding (aarora8, Sep 7, 2018)
b8e71b2  modification from the review (aarora8, Sep 10, 2018)
e9a75f6  modification from review (aarora8, Sep 10, 2018)
ae674ed  modification from review (aarora8, Sep 10, 2018)
7651f37  modification for downloading aachen splits (aarora8, Sep 10, 2018)
417d97c  fixing bug in rescoring (aarora8, Sep 11, 2018)
6a86531  hardcoding for removing only remaining long utterence (aarora8, Sep 12, 2018)
ba07ff0  fix in hardcoding (aarora8, Sep 12, 2018)
5398412  modification from review (aarora8, Sep 12, 2018)
adding changes for language modelling
aarora8 committed Aug 30, 2018
commit a3a18e2527f911d5d3425b7caa0911b868d31691
egs/iam/v2/cmd.sh: empty file, mode changed 100644 → 100755.
egs/iam/v2/local/chain/run_cnn_e2eali.sh (2 changes: 1 addition & 1 deletion)
egs/iam/v2/local/chain/tuning/run_cnn_e2eali_1c.sh (2 changes: 1 addition & 1 deletion)
@@ -26,7 +26,7 @@ stage=0
nj=30
train_set=train
nnet3_affix= # affix for exp dirs, e.g. it was _cleaned in tedlium.
-affix=_1b6 #affix for TDNN+LSTM directory e.g. "1a" or "1b", in case we change the configuration.
+affix=_1c #affix for TDNN+LSTM directory e.g. "1a" or "1b", in case we change the configuration.
e2echain_model_dir=exp/chain/e2e_cnn_1a
common_egs_dir=
reporting_email=
egs/iam/v2/local/chain/tuning/run_cnn_e2eali_1d.sh (246 changes: 246 additions & 0 deletions)
@@ -0,0 +1,246 @@
#!/bin/bash

# e2eali_1c is the same as e2eali_1b but has fewer CNN layers, smaller
# l2-regularize, more epochs and uses dropout.


# local/chain/compare_wer.sh exp/chain/cnn_e2eali_1b exp/chain/cnn_e2eali_1c
# System                      cnn_e2eali_1b  cnn_e2eali_1c
# WER                             10.33          10.05
# WER (rescored)                  10.10           9.75
# CER                              5.00           4.76
# CER (rescored)                   4.88           4.68
# Final train prob              -0.0428        -0.0317
# Final valid prob              -0.0666        -0.0630
# Final train prob (xent)       -0.9210        -0.5413
# Final valid prob (xent)       -1.0264        -0.7096
# Parameters                       3.98M          5.12M

# steps/info/chain_dir_info.pl exp/chain/cnn_e2eali_1c
# exp/chain/cnn_e2eali_1c: num-iters=21 nj=2..4 num-params=5.1M dim=40->392 combine=-0.034->-0.034 (over 1) xent:train/valid[13,20,final]=(-0.953,-0.800,-0.541/-1.03,-0.933,-0.710) logprob:train/valid[13,20,final]=(-0.069,-0.048,-0.032/-0.091,-0.078,-0.063)

set -e -o pipefail

stage=0

nj=30
train_set=train
nnet3_affix= # affix for exp dirs, e.g. it was _cleaned in tedlium.
affix=_1d #affix for TDNN+LSTM directory e.g. "1a" or "1b", in case we change the configuration.
e2echain_model_dir=exp/chain/e2e_cnn_1a
common_egs_dir=
reporting_email=

# chain options
train_stage=-10
xent_regularize=0.1
frame_subsampling_factor=4
# training chunk-options
chunk_width=340,300,200,100
num_leaves=500
# we don't need extra left/right context for TDNN systems.
chunk_left_context=0
chunk_right_context=0
tdnn_dim=550
# training options
srand=0
remove_egs=true
#lang_decode=data.new/lang
#lang_rescore=data.new/lang_rescore_6g
lang_decode=data/lang
lang_rescore=data/lang_rescore_6g

dropout_schedule='0,[email protected],[email protected],0'
# End configuration section.
echo "$0 $@" # Print the command line for logging


. ./cmd.sh
. ./path.sh
. ./utils/parse_options.sh


if ! cuda-compiled; then
  cat <<EOF && exit 1
This script is intended to be used with GPUs but you have not compiled Kaldi with CUDA
If you want to use GPUs (and have them), go to src/, and configure and make on a machine
where "nvcc" is installed.
EOF
fi

ali_dir=exp/chain/e2e_ali_train
lat_dir=exp/chain${nnet3_affix}/e2e_${train_set}_lats
dir=exp/chain${nnet3_affix}/cnn_e2eali${affix}
#dir=exp/chain/cnn_e2eali_1c
train_data_dir=data/${train_set}
tree_dir=exp/chain${nnet3_affix}/tree_e2e

# the 'lang' directory is created by this script.
# If you create such a directory with a non-standard topology
# you should probably name it differently.
lang=data/lang_chain
for f in $train_data_dir/feats.scp $ali_dir/ali.1.gz $ali_dir/final.mdl; do
  [ ! -f $f ] && echo "$0: expected file $f to exist" && exit 1
done


if [ $stage -le 1 ]; then
  echo "$0: creating lang directory $lang with chain-type topology"
  # Create a version of the lang/ directory that has one state per phone in the
  # topo file. [note, it really has two states.. the first one is only repeated
  # once, the second one has zero or more repeats.]
  if [ -d $lang ]; then
    if [ $lang/L.fst -nt data/lang/L.fst ]; then
      echo "$0: $lang already exists, not overwriting it; continuing"
    else
      echo "$0: $lang already exists and seems to be older than data/lang..."
      echo " ... not sure what to do. Exiting."
      exit 1;
    fi
  else
    cp -r data/lang $lang
    silphonelist=$(cat $lang/phones/silence.csl) || exit 1;
    nonsilphonelist=$(cat $lang/phones/nonsilence.csl) || exit 1;
    # Use our special topology... note that later on may have to tune this
    # topology.
    steps/nnet3/chain/gen_topo.py $nonsilphonelist $silphonelist >$lang/topo
  fi
fi
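# For illustration (a sketch, not verbatim gen_topo.py output and not part of
# the committed file): the chain-type topology written above typically gives
# each phone an entry of roughly the following shape in $lang/topo, matching
# the comment in stage 1: the first state/pdf-class is traversed exactly once,
# while the second has a self-loop and so repeats zero or more times.
#   <State> 0 <PdfClass> 0 <Transition> 1 0.5 <Transition> 2 0.5 </State>
#   <State> 1 <PdfClass> 1 <Transition> 1 0.5 <Transition> 2 0.5 </State>
#   <State> 2 </State>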

if [ $stage -le 2 ]; then
  # Get the alignments as lattices (gives the chain training more freedom).
  # use the same num-jobs as the alignments
  steps/nnet3/align_lats.sh --nj $nj --cmd "$cmd" \
    --acoustic-scale 1.0 \
    --scale-opts '--transition-scale=1.0 --self-loop-scale=1.0' \
    ${train_data_dir} data/lang $e2echain_model_dir $lat_dir
  echo "" >$lat_dir/splice_opts
fi

if [ $stage -le 3 ]; then
  # Build a tree using our new topology. We know we have alignments for the
  # speed-perturbed data (local/nnet3/run_ivector_common.sh made them), so use
  # those. The num-leaves is always somewhat less than the num-leaves from
  # the GMM baseline.
  if [ -f $tree_dir/final.mdl ]; then
    echo "$0: $tree_dir/final.mdl already exists, refusing to overwrite it."
    exit 1;
  fi

  steps/nnet3/chain/build_tree.sh \
    --frame-subsampling-factor $frame_subsampling_factor \
    --alignment-subsampling-factor 1 \
    --context-opts "--context-width=2 --central-position=1" \
    --cmd "$cmd" $num_leaves ${train_data_dir} \
    $lang $ali_dir $tree_dir
fi


if [ $stage -le 4 ]; then
  mkdir -p $dir
  echo "$0: creating neural net configs using the xconfig parser";

  num_targets=$(tree-info $tree_dir/tree | grep num-pdfs | awk '{print $2}')
  learning_rate_factor=$(echo "print 0.5/$xent_regularize" | python)
  cnn_opts="l2-regularize=0.03 dropout-proportion=0.0"
  tdnn_opts="l2-regularize=0.03"
  output_opts="l2-regularize=0.04"
  common1="$cnn_opts required-time-offsets= height-offsets=-2,-1,0,1,2 num-filters-out=36"
  common2="$cnn_opts required-time-offsets= height-offsets=-2,-1,0,1,2 num-filters-out=70"
  common3="$cnn_opts required-time-offsets= height-offsets=-1,0,1 num-filters-out=90"
  mkdir -p $dir/configs
  cat <<EOF > $dir/configs/network.xconfig
  input dim=40 name=input

  conv-relu-batchnorm-dropout-layer name=cnn1 height-in=40 height-out=40 time-offsets=-3,-2,-1,0,1,2,3 $common1
  conv-relu-batchnorm-dropout-layer name=cnn2 height-in=40 height-out=20 time-offsets=-2,-1,0,1,2 $common1 height-subsample-out=2
  conv-relu-batchnorm-dropout-layer name=cnn3 height-in=20 height-out=20 time-offsets=-4,-2,0,2,4 $common2
  conv-relu-batchnorm-dropout-layer name=cnn4 height-in=20 height-out=20 time-offsets=-4,-2,0,2,4 $common2
  conv-relu-batchnorm-dropout-layer name=cnn5 height-in=20 height-out=10 time-offsets=-4,-2,0,2,4 $common3 height-subsample-out=2
  relu-batchnorm-dropout-layer name=tdnn1 input=Append(-4,-2,0,2,4) dim=$tdnn_dim $tdnn_opts dropout-proportion=0.0
  relu-batchnorm-dropout-layer name=tdnn2 input=Append(-4,0,4) dim=$tdnn_dim $tdnn_opts dropout-proportion=0.0
  relu-batchnorm-dropout-layer name=tdnn3 input=Append(-4,0,4) dim=$tdnn_dim $tdnn_opts dropout-proportion=0.0

  ## adding the layers for chain branch
  relu-batchnorm-layer name=prefinal-chain dim=$tdnn_dim target-rms=0.5 $tdnn_opts
  output-layer name=output include-log-softmax=false dim=$num_targets max-change=1.5 $output_opts
  relu-batchnorm-layer name=prefinal-xent input=tdnn3 dim=$tdnn_dim target-rms=0.5 $tdnn_opts
  output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5 $output_opts
EOF
  steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig --config-dir $dir/configs/
fi


if [ $stage -le 5 ]; then
  if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d $dir/egs/storage ]; then
    utils/create_split_dir.pl \
      /export/b0{3,4,5,6}/$USER/kaldi-data/egs/iam-$(date +'%m_%d_%H_%M')/s5/$dir/egs/storage $dir/egs/storage
  fi

  steps/nnet3/chain/train.py --stage=$train_stage \
    --cmd="$cmd" \
    --feat.cmvn-opts="--norm-means=false --norm-vars=false" \
    --chain.xent-regularize $xent_regularize \
    --chain.leaky-hmm-coefficient=0.1 \
    --chain.l2-regularize=0.00005 \
    --chain.apply-deriv-weights=true \
    --chain.lm-opts="--ngram-order=2 --no-prune-ngram-order=1 --num-extra-lm-states=1000" \
    --chain.frame-subsampling-factor=$frame_subsampling_factor \
    --chain.alignment-subsampling-factor=1 \
    --chain.left-tolerance 3 \
    --chain.right-tolerance 3 \
    --trainer.srand=$srand \
    --trainer.max-param-change=2.0 \
    --trainer.num-epochs=8 \
    --trainer.frames-per-iter=1500000 \
    --trainer.optimization.num-jobs-initial=2 \
    --trainer.optimization.num-jobs-final=4 \
    --trainer.dropout-schedule $dropout_schedule \
    --trainer.optimization.initial-effective-lrate=0.001 \
    --trainer.optimization.final-effective-lrate=0.0001 \
    --trainer.optimization.shrink-value=1.0 \
    --trainer.num-chunk-per-minibatch=32,16 \
    --trainer.optimization.momentum=0.0 \
    --egs.chunk-width=$chunk_width \
    --egs.chunk-left-context=$chunk_left_context \
    --egs.chunk-right-context=$chunk_right_context \
    --egs.chunk-left-context-initial=0 \
    --egs.chunk-right-context-final=0 \
    --egs.dir="$common_egs_dir" \
    --egs.opts="--frames-overlap-per-eg 0 --constrained false" \
    --cleanup.remove-egs=$remove_egs \
    --use-gpu=true \
    --reporting.email="$reporting_email" \
    --feat-dir=$train_data_dir \
    --tree-dir=$tree_dir \
    --lat-dir=$lat_dir \
    --dir=$dir || exit 1;
fi

if [ $stage -le 6 ]; then
  # The reason we are using data/lang here, instead of $lang, is just to
  # emphasize that it's not actually important to give mkgraph.sh the
  # lang directory with the matched topology (since it gets the
  # topology file from the model). So you could give it a different
  # lang directory, one that contained a wordlist and LM of your choice,
  # as long as phones.txt was compatible.

  utils/mkgraph.sh \
    --self-loop-scale 1.0 $lang_decode \
    $dir $dir/graph || exit 1;
fi
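# For illustration (not part of the committed file): if you do give mkgraph.sh
# a different lang directory, its phones.txt must still match the one used for
# training. A sketch of one way to verify that, using the checker shipped with
# Kaldi which exits non-zero on a mismatch:
#   utils/lang/check_phones_compatible.sh data/lang/phones.txt $lang/phones.txt || exit 1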

if [ $stage -le 7 ]; then
  frames_per_chunk=$(echo $chunk_width | cut -d, -f1)
  steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
    --extra-left-context $chunk_left_context \
    --extra-right-context $chunk_right_context \
    --extra-left-context-initial 0 \
    --extra-right-context-final 0 \
    --frames-per-chunk $frames_per_chunk \
    --nj $nj --cmd "$cmd" \
    $dir/graph data/test $dir/decode_test || exit 1;

  steps/lmrescore_const_arpa.sh --cmd "$cmd" $lang_decode $lang_rescore \
    data/test $dir/decode_test{,_rescored} || exit 1
fi
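As a usage sketch (the option values below are simply the script's own defaults, and the comparison directories follow the naming used in the header comment rather than anything prescribed by the PR), the tuning script above is run from the egs/iam/v2 recipe directory and its result can be compared against the previous configuration with compare_wer.sh:

  cd egs/iam/v2
  local/chain/tuning/run_cnn_e2eali_1d.sh --stage 0 --nj 30
  local/chain/compare_wer.sh exp/chain/cnn_e2eali_1c exp/chain/cnn_e2eali_1d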
egs/iam/v2/local/process_corpus.py (30 changes: 30 additions & 0 deletions)
@@ -0,0 +1,30 @@
#!/usr/bin/env python3
# Copyright 2018 Ashish Arora
# Apache 2.0
# This script reads valid phones and removes the lines in the corpus
# which have any other phone.

import os
import sys, io

phone_file = os.path.join('data/local/phones.txt')
infile = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
output = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
phone_dict = dict()
with open(phone_file, 'r', encoding='utf-8') as phone_fh:
    for line in phone_fh:
        line = line.strip().split()[0]
        phone_dict[line] = line

phone_dict[' '] = ' '
corpus_text = list()
for line in infile:
    text = line.strip()
    skip_text = False
    for phone in text:
        if phone not in phone_dict.keys():
            skip_text = True
            break
    if not skip_text:
        output.write(text+ '\n')
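A brief usage sketch (the corpus file names here are illustrative, not from the PR): the script filters a text corpus read on stdin, keeping only lines whose characters all appear in data/local/phones.txt, so it has to be run from the recipe root where that relative path resolves:

  local/process_corpus.py < data/local/corpus.txt > data/local/corpus_filtered.txt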