This repository contains the code for the paper "A DNN-HMM-DNN Hybrid Model for Discovering Word-like Units from Spoken Captions and Image Regions".
Requirement: PyTorch 0.3 for pretraining the VGG-16/ResNet-34 networks
- Download the MSCOCO 2k image features from here and the MSCOCO 2k phone sequences from here, and put both under the directory data/mscoco
- Download the pretrained image classifier weights from here
- Example: Run the linear softmax model with Res 34 image features on MSCOCO 2k:
python run_image2phone.py --dataset mscoco2k --feat_type res34 --model_type linear --image_posterior_weights_file classifier_weights.npz --lr 0.01
- Run the following for help on more customized experiments:
python run_image2phone.py --help
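Before passing a file to --image_posterior_weights_file, it can help to check what the .npz archive actually contains. The following is a minimal sketch using only NumPy. The helper name describe_npz is hypothetical and not part of this repository, and the array names and shapes in the demo (weight, bias, 512 x 65) are made up for illustration; the real classifier_weights.npz may store different keys and shapes.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Return a dict mapping each array name in a .npz archive to its shape."""
    # np.load on a .npz file returns a lazy archive; use it as a context
    # manager so the underlying file handle is closed afterwards.
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


# Demo on a throwaway archive. The array names and shapes below are
# assumptions made up for this example, not the layout used by the repository.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_weights.npz")
    np.savez(demo_path, weight=np.zeros((512, 65)), bias=np.zeros(65))
    demo_shapes = describe_npz(demo_path)
    print(demo_shapes)
```

To inspect the downloaded weights instead, call describe_npz("classifier_weights.npz") from the directory that holds the file.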
Please consider citing the following paper if you use the code:
@inproceedings{WH-interspeech2020,
author = {Liming Wang and Mark Hasegawa-Johnson},
title = {A {DNN-HMM-DNN} Hybrid Model for Discovering Word-like Units from Spoken Captions and Image Regions},
booktitle = {Interspeech},
year = {2020}
}