Python scripts for extracting textual features from given transcripts.
The extractor is a BERT-based model from the sentence-transformers repository; the pretrained weights are `distiluse-base-multilingual-cased-v2`, introduced in the paper *Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation*. Advantages of this model according to the paper (an aligned feature space): 1) vector spaces are aligned across languages, i.e., identical sentences in different languages are close; 2) vector space properties of the original source language from the teacher model M are adopted and transferred to other languages.
- Python: 3.6+
- PyTorch: 1.7+
- Download `training_data_transcripts` into the root directory:

  ```
  training_data_transcripts/
  ├── animals_transcripts1_train
  ├── ...
  ├── 025157
  │   ├── 025157_animals.srt
  ...
  ```
- Use `process_srt_files` in `preprocess.py` to generate the raw data `raw_data.npy`, which contains the list of chunks. Each chunk contains `text`, `duration`, `talk_type`, and `participant_id`. One can use/modify `ChunksDataset` to load the raw data.
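The chunk list in `raw_data.npy` can also be inspected with plain NumPy; a minimal sketch (the dict keys follow the fields listed above, the example values are made up):

```python
import numpy as np

# A chunk is a dict with the four fields described above (values are illustrative).
chunks = [
    {"text": "hello world", "duration": 2.5,
     "talk_type": "animals", "participant_id": "025157"},
    {"text": "another chunk", "duration": 1.8,
     "talk_type": "animals", "participant_id": "025158"},
]
np.save("raw_data.npy", np.array(chunks, dtype=object))

# allow_pickle=True is required because the array holds Python dicts.
raw = np.load("raw_data.npy", allow_pickle=True)
for chunk in raw:
    print(chunk["participant_id"], chunk["duration"])
```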
- Run `feature_extraction.py` to extract the features and obtain `embeddings.npz`. Each item contains a feature embedding and a participant label. One can use/modify `EmbsDataset` to load the extracted features.
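A sketch of reading the extracted features directly with NumPy (the array names `embeddings`/`labels` and the 512-dim feature size are assumptions; check the actual keys of your `embeddings.npz` via `data.files`):

```python
import numpy as np

# Fabricate a small archive in the assumed layout: one embedding row per item,
# plus a parallel array of participant labels.
embs = np.random.randn(4, 512).astype(np.float32)
labels = np.array(["025157", "025157", "025158", "025158"])
np.savez("embeddings.npz", embeddings=embs, labels=labels)

data = np.load("embeddings.npz")
print(data.files)               # array names stored in the archive
print(data["embeddings"].shape)
```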