Iam #2658

Merged
merged 38 commits on Sep 12, 2018
Commits
a3a18e2
adding changes for language modelling
aarora8 Aug 30, 2018
91508b5
adding modifications for augmentation, topology, shearing, run.sh
aarora8 Aug 31, 2018
5f273d6
fixing bugs
aarora8 Aug 31, 2018
2645f14
fixing bug
aarora8 Aug 31, 2018
6ebfdb2
adding parameter tuning
aarora8 Sep 1, 2018
b532978
cosmetic fixes and updating results
aarora8 Sep 1, 2018
f383334
cosmetic fixes
aarora8 Sep 1, 2018
44c9e58
adding results
aarora8 Sep 1, 2018
2d11672
removing local/prepare_lang and adding gen_topo in run.sh
aarora8 Sep 1, 2018
4fc6705
fixing bugs
aarora8 Sep 1, 2018
8877530
updating result
aarora8 Sep 2, 2018
59e2c8b
updating documentation, results and parameter tuning
aarora8 Sep 2, 2018
5fc0d17
fixing chain scripts
aarora8 Sep 2, 2018
1138ee3
updating parameters
aarora8 Sep 2, 2018
b3532ce
updating parameters and results
aarora8 Sep 3, 2018
9b67d9d
adding overwrite option and punctuation topology
aarora8 Sep 3, 2018
89c9ec7
adding overwrite option
aarora8 Sep 4, 2018
c05cd4d
adding aachen splits
aarora8 Sep 4, 2018
5dfe8fc
fixing bugs
aarora8 Sep 4, 2018
d7448df
modification from review
aarora8 Sep 5, 2018
d7d5c22
updating parameter and result
aarora8 Sep 6, 2018
43e9af9
updating parameter and result
aarora8 Sep 6, 2018
17c506b
adding data preprocessing in test and val
aarora8 Sep 7, 2018
d640742
updating results
aarora8 Sep 7, 2018
7dfd0b5
Merge branch 'master' of https://github.com/kaldi-asr/kaldi into iam_4
aarora8 Sep 7, 2018
94a80ad
replacing prepend words with common prepend words
aarora8 Sep 7, 2018
711c3c9
updating remove_test_utterances_from_lob for aachen split
aarora8 Sep 7, 2018
5f2d960
removing data/val/text from train_lm
aarora8 Sep 7, 2018
7f2ad0b
cosmetic fixes in unk arc decoding
aarora8 Sep 7, 2018
8f2ac25
adding val data for decoding
aarora8 Sep 7, 2018
b8e71b2
modification from the review
aarora8 Sep 10, 2018
e9a75f6
modification from review
aarora8 Sep 10, 2018
ae674ed
modification from review
aarora8 Sep 10, 2018
7651f37
modification for downloading aachen splits
aarora8 Sep 10, 2018
417d97c
fixing bug in rescoring
aarora8 Sep 11, 2018
6a86531
hardcoding for removing only remaining long utterence
aarora8 Sep 12, 2018
ba07ff0
fix in hardcoding
aarora8 Sep 12, 2018
5398412
modification from review
aarora8 Sep 12, 2018
cosmetic fixes in unk arc decoding
aarora8 committed Sep 7, 2018
commit 7f2ad0ba4b4b33c6b9cd43d2a31ce7672b04d5db
137 changes: 78 additions & 59 deletions egs/iam/v1/local/unk_arc_post_to_transcription.py
@@ -1,88 +1,107 @@
#!/usr/bin/env python3

# Copyright 2017 Ashish Arora
#Copyright 2017 Ashish Arora

""" This module will be used by scripts for open vocabulary setup.
If the hypothesis transcription contains <unk>, then it will replace the
<unk> with the word predicted by <unk> model by concatenating phones decoded
from the unk-model. It is currently supported only for triphone setup.
Args:
phones: File name of a file that contains the phones.txt, (symbol-table for phones).
phone and phoneID, Eg. a 217, phoneID of 'a' is 217.
words: File name of a file that contains the words.txt, (symbol-table for words).
word and wordID. Eg. ACCOUNTANCY 234, wordID of 'ACCOUNTANCY' is 234.
unk: ID of <unk>. Eg. 231.
one-best-arc-post: A file in arc-post format, which is a list of timing info and posterior
of arcs along the one-best path from the lattice.
E.g. 506_m01-049-00 8 12 1 7722 282 272 288 231
<utterance-id> <start-frame> <num-frames> <posterior> <word> [<ali>]
[<phone1> <phone2>...]
output-text: File containing hypothesis transcription with <unk> recognized by the
unk-model.
E.g. A move to stop mr. gaitskell.

Eg. local/unk_arc_post_to_transcription.py lang/phones.txt lang/words.txt
data/lang/oov.int
"""
import argparse
import os
import sys

parser = argparse.ArgumentParser(description="""uses phones to convert unk to word""")
parser.add_argument('phones', type=str, help='phones and phonesID')
parser.add_argument('words', type=str, help='word and wordID')
parser.add_argument('unk', type=str, default='-', help='location of unk file')
parser.add_argument('--input-ark', type=str, default='-', help='where to read the input data')
parser.add_argument('--out-ark', type=str, default='-', help='where to write the output data')
parser.add_argument('phones', type=str, help='File name of a file that contains the'
'symbol-table for phones. Each line must be: <phone> <phoneID>')
parser.add_argument('words', type=str, help='File name of a file that contains the'
'symbol-table for words. Each line must be: <word> <word-id>')
parser.add_argument('unk', type=str, default='-', help='File name of a file that'
'contains the ID of <unk>. The content must be: <oov-id>, e.g. 231')
parser.add_argument('--one-best-arc-post', type=str, default='-', help='A file in arc-post'
'format, which is a list of timing info and posterior of arcs'
'along the one-best path from the lattice')
parser.add_argument('--output-text', type=str, default='-', help='File containing'
'hypothesis transcription with <unk> recognized by the unk-model')
args = parser.parse_args()


### main ###
phone_fh = open(args.phones, 'r', encoding='latin-1')
word_fh = open(args.words, 'r', encoding='latin-1')
unk_fh = open(args.unk, 'r', encoding='latin-1')
if args.input_ark == '-':
input_fh = sys.stdin
phone_handle = open(args.phones, 'r', encoding='latin-1') # Create file handles
word_handle = open(args.words, 'r', encoding='latin-1')
unk_handle = open(args.unk,'r', encoding='latin-1')
if args.one_best_arc_post == '-':
arc_post_handle = sys.stdin
else:
input_fh = open(args.input_ark, 'r', encoding='latin-1')
if args.out_ark == '-':
out_fh = sys.stdout
arc_post_handle = open(args.one_best_arc_post, 'r', encoding='latin-1')
if args.output_text == '-':
output_text_handle = sys.stdout
else:
out_fh = open(args.out_ark, 'w', encoding='latin-1')
output_text_handle = open(args.output_text, 'w', encoding='latin-1')

phone_dict = dict() # Stores phoneID and phone mapping
phone_data_vect = phone_fh.read().strip().split("\n")
for key_val in phone_data_vect:
id2phone = dict() # Stores the mapping from phone_id (int) to phone (char)
phones_data = phone_handle.read().strip().split("\n")

for key_val in phones_data:
key_val = key_val.split(" ")
phone_dict[key_val[1]] = key_val[0]
id2phone[key_val[1]] = key_val[0]

word_dict = dict()
word_data_vect = word_fh.read().strip().split("\n")
word_data_vect = word_handle.read().strip().split("\n")

for key_val in word_data_vect:
key_val = key_val.split(" ")
word_dict[key_val[1]] = key_val[0]
unk_val = unk_fh.read().strip().split(" ")[0]
unk_val = unk_handle.read().strip().split(" ")[0]

utt_word_dict = dict()
utt_phone_dict = dict() # Stores utteranceID and phoneID
unk_word_dict = dict()
count=0
for line in input_fh:
utt_word_dict = dict() # Dict of list, stores mapping from utteranceID(int) to words(str)
for line in arc_post_handle:
line_vect = line.strip().split("\t")
if len(line_vect) < 6:
print("Bad line: '{}' Expecting 6 fields. Skipping...".format(line),
if len(line_vect) < 6: # Check for 1best-arc-post output
print("Error: Bad line: '{}' Expecting 6 fields. Skipping...".format(line),
file=sys.stderr)
continue
uttID = line_vect[0]
utt_id = line_vect[0]
word = line_vect[4]
phones = line_vect[5]
if uttID in utt_word_dict.keys():
utt_word_dict[uttID][count] = word
utt_phone_dict[uttID][count] = phones
else:
count = 0
utt_word_dict[uttID] = dict()
utt_phone_dict[uttID] = dict()
utt_word_dict[uttID][count] = word
utt_phone_dict[uttID][count] = phones
if word == unk_val: # Get character sequence for unk
phone_key_vect = phones.split(" ")
phone_val_vect = list()
for pkey in phone_key_vect:
phone_val_vect.append(phone_dict[pkey])
if utt_id not in list(utt_word_dict.keys()):
utt_word_dict[utt_id] = list()

if word == unk_val: # Get the 1best phone sequence given by the unk-model
phone_id_seq = phones.split(" ")
phone_seq = list()
for pkey in phone_id_seq:
phone_seq.append(id2phone[pkey]) # Convert the phone-id sequence to a phone sequence.
phone_2_word = list()
for phone_val in phone_val_vect:
phone_2_word.append(phone_val.split('_')[0])
phone_2_word = ''.join(phone_2_word)
utt_word_dict[uttID][count] = phone_2_word
for phone_val in phone_seq:
phone_2_word.append(phone_val.split('_')[0]) # Removing the word-position markers (e.g. _B)
phone_2_word = ''.join(phone_2_word) # Concatenate phone sequence
utt_word_dict[utt_id].append(phone_2_word) # Store word from unk-model
else:
if word == '0':
if word == '0': # Store space/silence
word_val = ' '
else:
word_val = word_dict[word]
utt_word_dict[uttID][count] = word_val
count += 1
utt_word_dict[utt_id].append(word_val) # Store word from 1best-arc-post

transcription = ""
for key in sorted(utt_word_dict.keys()):
transcription = key
for index in sorted(utt_word_dict[key].keys()):
value = utt_word_dict[key][index]
transcription = transcription + " " + value
out_fh.write(transcription + '\n')
transcription = "" # Output transcription
for utt_key in sorted(utt_word_dict.keys()):
transcription = utt_key
for word in utt_word_dict[utt_key]:
transcription = transcription + " " + word
output_text_handle.write(transcription + '\n')
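
For readers skimming the diff, the following is a small illustrative sketch (not part of the PR) of the <unk>-replacement step the script performs. The symbol-table entries, IDs, and the arc-post line below are invented examples in the format the docstring describes, not values from the IAM recipe.

#!/usr/bin/env python3
# Illustrative only: the core <unk> replacement done by unk_arc_post_to_transcription.py,
# shown on hypothetical symbol-table entries and one hypothetical arc-post line.

unk_id = '231'                                            # contents of data/lang/oov.int (example)
id2phone = {'282': 'm_B', '272': 'r_I', '288': 's_E'}     # phones.txt entries: "<phone> <phone-id>"
id2word = {'0': '<eps>', '234': 'ACCOUNTANCY'}            # words.txt entries: "<word> <word-id>"

# Tab-separated arc-post fields: <utt-id> <start-frame> <num-frames> <posterior> <word-id> <phone-ids>
line = "506_m01-049-00\t8\t12\t1\t231\t282 272 288"
utt_id, _start, _nframes, _post, word_id, phone_ids = line.strip().split("\t")

if word_id == unk_id:
    # <unk>: spell the word from its decoded phones, dropping the
    # word-position markers (_B/_I/_E) before concatenating.
    phones = [id2phone[p] for p in phone_ids.split(" ")]  # e.g. ['m_B', 'r_I', 's_E']
    word = ''.join(ph.split('_')[0] for ph in phones)     # -> 'mrs'
elif word_id == '0':
    word = ' '                                            # word-id 0 is treated as space/silence
else:
    word = id2word[word_id]

print(utt_id + ' ' + word)                                # -> 506_m01-049-00 mrs
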
141 changes: 81 additions & 60 deletions egs/uw3/v1/local/unk_arc_post_to_transcription.py
@@ -1,86 +1,107 @@
#!/usr/bin/env python
#!/usr/bin/env python3

# Copyright 2017 Ashish Arora
#Copyright 2017 Ashish Arora

""" This module will be used by scripts for open vocabulary setup.
If the hypothesis transcription contains <unk>, then it will replace the
<unk> with the word predicted by <unk> model by concatenating phones decoded
from the unk-model. It is currently supported only for triphone setup.
Args:
phones: File name of a file that contains the phones.txt, (symbol-table for phones).
phone and phoneID, Eg. a 217, phoneID of 'a' is 217.
words: File name of a file that contains the words.txt, (symbol-table for words).
word and wordID. Eg. ACCOUNTANCY 234, wordID of 'ACCOUNTANCY' is 234.
unk: ID of <unk>. Eg. 231.
one-best-arc-post: A file in arc-post format, which is a list of timing info and posterior
of arcs along the one-best path from the lattice.
E.g. 506_m01-049-00 8 12 1 7722 282 272 288 231
<utterance-id> <start-frame> <num-frames> <posterior> <word> [<ali>]
[<phone1> <phone2>...]
output-text: File containing hypothesis transcription with <unk> recognized by the
unk-model.
E.g. A move to stop mr. gaitskell.

Eg. local/unk_arc_post_to_transcription.py lang/phones.txt lang/words.txt
data/lang/oov.int
"""
import argparse
import os
import sys

parser = argparse.ArgumentParser(description="""uses phones to convert unk to word""")
parser.add_argument('phones', type=str, help='phones and phonesID')
parser.add_argument('words', type=str, help='word and wordID')
parser.add_argument('unk', type=str, default='-', help='location of unk file')
parser.add_argument('--input-ark', type=str, default='-', help='where to read the input data')
parser.add_argument('--out-ark', type=str, default='-', help='where to write the output data')
parser.add_argument('phones', type=str, help='File name of a file that contains the'
'symbol-table for phones. Each line must be: <phone> <phoneID>')
parser.add_argument('words', type=str, help='File name of a file that contains the'
'symbol-table for words. Each line must be: <word> <word-id>')
parser.add_argument('unk', type=str, default='-', help='File name of a file that'
'contains the ID of <unk>. The content must be: <oov-id>, e.g. 231')
parser.add_argument('--one-best-arc-post', type=str, default='-', help='A file in arc-post'
'format, which is a list of timing info and posterior of arcs'
'along the one-best path from the lattice')
parser.add_argument('--output-text', type=str, default='-', help='File containing'
'hypothesis transcription with <unk> recognized by the unk-model')
args = parser.parse_args()

### main ###
phone_fh = open(args.phones, 'r')
word_fh = open(args.words, 'r')
unk_fh = open(args.unk,'r')
if args.input_ark == '-':
input_fh = sys.stdin
phone_handle = open(args.phones, 'r', encoding='latin-1') # Create file handles
word_handle = open(args.words, 'r', encoding='latin-1')
unk_handle = open(args.unk,'r', encoding='latin-1')
if args.one_best_arc_post == '-':
arc_post_handle = sys.stdin
else:
input_fh = open(args.input_ark,'r')
if args.out_ark == '-':
out_fh = sys.stdout
arc_post_handle = open(args.one_best_arc_post, 'r', encoding='latin-1')
if args.output_text == '-':
output_text_handle = sys.stdout
else:
out_fh = open(args.out_ark,'wb')
output_text_handle = open(args.output_text, 'w', encoding='latin-1')

phone_dict = dict()# stores phoneID and phone mapping
phone_data_vect = phone_fh.read().strip().split("\n")
for key_val in phone_data_vect:
id2phone = dict() # Stores the mapping from phone_id (int) to phone (char)
phones_data = phone_handle.read().strip().split("\n")

for key_val in phones_data:
key_val = key_val.split(" ")
phone_dict[key_val[1]] = key_val[0]
id2phone[key_val[1]] = key_val[0]

word_dict = dict()
word_data_vect = word_fh.read().strip().split("\n")
word_data_vect = word_handle.read().strip().split("\n")

for key_val in word_data_vect:
key_val = key_val.split(" ")
word_dict[key_val[1]] = key_val[0]
unk_val = unk_fh.read().strip().split(" ")[0]
unk_val = unk_handle.read().strip().split(" ")[0]

utt_word_dict = dict()
utt_phone_dict = dict()# stores utteranceID and phoneID
unk_word_dict = dict()
count=0
for line in input_fh:
utt_word_dict = dict() # Dict of list, stores mapping from utteranceID(int) to words(str)
for line in arc_post_handle:
line_vect = line.strip().split("\t")
if len(line_vect) < 6:
print "IndexError"
print line_vect
if len(line_vect) < 6: # Check for 1best-arc-post output
print("Error: Bad line: '{}' Expecting 6 fields. Skipping...".format(line),
file=sys.stderr)
continue
uttID = line_vect[0]
utt_id = line_vect[0]
word = line_vect[4]
phones = line_vect[5]
if uttID in utt_word_dict.keys():
utt_word_dict[uttID][count] = word
utt_phone_dict[uttID][count] = phones
else:
count = 0
utt_word_dict[uttID] = dict()
utt_phone_dict[uttID] = dict()
utt_word_dict[uttID][count] = word
utt_phone_dict[uttID][count] = phones
if word == unk_val: # get character sequence for unk
phone_key_vect = phones.split(" ")
phone_val_vect = list()
for pkey in phone_key_vect:
phone_val_vect.append(phone_dict[pkey])
if utt_id not in list(utt_word_dict.keys()):
utt_word_dict[utt_id] = list()

if word == unk_val: # Get the 1best phone sequence given by the unk-model
phone_id_seq = phones.split(" ")
phone_seq = list()
for pkey in phone_id_seq:
phone_seq.append(id2phone[pkey]) # Convert the phone-id sequence to a phone sequence.
phone_2_word = list()
for phone_val in phone_val_vect:
phone_2_word.append(phone_val.split('_')[0])
phone_2_word = ''.join(phone_2_word)
utt_word_dict[uttID][count] = phone_2_word
for phone_val in phone_seq:
phone_2_word.append(phone_val.split('_')[0]) # Removing the word-position markers (e.g. _B)
phone_2_word = ''.join(phone_2_word) # Concatenate phone sequence
utt_word_dict[utt_id].append(phone_2_word) # Store word from unk-model
else:
if word == '0':
if word == '0': # Store space/silence
word_val = ' '
else:
word_val = word_dict[word]
utt_word_dict[uttID][count] = word_val
count += 1
utt_word_dict[utt_id].append(word_val) # Store word from 1best-arc-post

transcription = ""
for key in sorted(utt_word_dict.iterkeys()):
transcription = key
for index in sorted(utt_word_dict[key].iterkeys()):
value = utt_word_dict[key][index]
transcription = transcription + " " + value
out_fh.write(transcription + '\n')
transcription = "" # Output transcription
for utt_key in sorted(utt_word_dict.keys()):
transcription = utt_key
for word in utt_word_dict[utt_key]:
transcription = transcription + " " + word
output_text_handle.write(transcription + '\n')
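
For context, both copies of the script read their phones.txt and words.txt arguments as plain "<symbol> <id>" tables. Below is a minimal sketch of that loading step, using throwaway in-memory strings rather than real Kaldi lang files; the table contents are invented examples in the format the docstring gives.

#!/usr/bin/env python3
# Illustrative only: building the id -> symbol maps that both scripts construct from
# their positional arguments. The table contents below are invented examples.

phones_txt = "a 217\nm_B 282\nr_I 272\ns_E 288"      # phones.txt format: "<phone> <phone-id>"
words_txt = "<eps> 0\nACCOUNTANCY 234\n<unk> 231"    # words.txt format: "<word> <word-id>"

id2phone = dict()
for entry in phones_txt.strip().split("\n"):
    phone, phone_id = entry.split(" ")
    id2phone[phone_id] = phone

id2word = dict()
for entry in words_txt.strip().split("\n"):
    word, word_id = entry.split(" ")
    id2word[word_id] = word

assert id2phone['217'] == 'a' and id2word['231'] == '<unk>'

In a real run, the docstring's example invocation (local/unk_arc_post_to_transcription.py lang/phones.txt lang/words.txt data/lang/oov.int) would supply actual files in this same format.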