Tested both BERT and Word Vector models for predicting the category of an uploaded video. I found that spaCy's en_core_md Word Vector model, with stopwords removed and lemmatization applied, worked best for my use case. BERT was trained both with and without stopword removal and lemmatization and performed worse in each case, most notably making a consistent, hilarious error of confusing videos marked as children's entertainment with significantly more adult comedy. Ideally, BERT would be run without stopword removal or lemmatization, since the model can make bidirectional use of all the context provided. However, given the imperfect nature of speech transcription, unclear audio and incorrect transcriptions must be expected at times; my hypothesis is that such mistakes were magnified in the BERT model because excess emphasis was placed on incorrect context. The Word Vector model was faster, much more memory efficient, and more consistently accurate for this classification task.
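As a rough illustration of the approach described above, here is a minimal sketch of a word-vector pipeline: lemmatize a transcript, drop stopwords, average the remaining spaCy word vectors, and feed the result to a simple classifier. The model name (en_core_web_md), the logistic regression classifier, the example categories, and the transcript_to_vector helper are illustrative assumptions, not the project's actual code.

```python
import numpy as np
import spacy
from sklearn.linear_model import LogisticRegression

# Medium English model with word vectors (model name is an assumption;
# the write-up refers to it as en_core_md).
nlp = spacy.load("en_core_web_md")

def transcript_to_vector(text: str) -> np.ndarray:
    """Drop stopwords/punctuation, lemmatize, and average the remaining word vectors."""
    doc = nlp(text)
    kept = [tok for tok in doc if not tok.is_stop and not tok.is_punct]
    if not kept:
        return np.zeros(nlp.vocab.vectors_length)
    # Look each kept token's lemma back up in the vocab so the
    # lemmatized form is what gets embedded.
    return np.mean([nlp.vocab[tok.lemma_].vector for tok in kept], axis=0)

# Hypothetical training data: (transcript snippet, category) pairs.
train_texts = [
    "the comedian opened with a long bit about airports",
    "today we review the alphabet song and count to ten together",
]
train_labels = ["comedy", "childrens_entertainment"]

X = np.stack([transcript_to_vector(t) for t in train_texts])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

print(clf.predict([transcript_to_vector("stand up special full of crowd work")]))
```

Averaging vectors over the whole transcript also tends to dampen the effect of isolated transcription errors, which lines up with the hypothesis above about why the simpler model held up better on noisy audio.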
Yet Another Video Site - full stack video sharing webapp with automatic video category classification and insight generation. Live at: