Wang et al., 2024 - Google Patents

SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition

Document ID: 9564441537629455797
Author: Wang H; Kurita S; Shimizu S; Kawahara D
Publication year: 2024
Publication venue: arXiv preprint arXiv:2401.09759

Snippet

Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR) that uses video as a complement to audio. In AVSR, considerable effort has been directed at datasets for facial features such as lip reading, while they often fall …

Classifications

- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30784—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
- G06F17/30796—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre using original textual content or text extracted from visual content or transcript of audio data
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30634—Querying
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30784—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
- G06F17/30799—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre using low-level visual features of the video content
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30817—Information retrieval; Database structures therefor; File system structures therefor of video data using information manually generated or using information not derived from the video content, e.g. time and location information, usage information, user ratings
- G06F17/3082—Information retrieval; Database structures therefor; File system structures therefor of video data using information manually generated or using information not derived from the video content, e.g. time and location information, usage information, user ratings using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/62—Methods or arrangements for recognition using electronic means
- G06K9/6217—Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass

Similar Documents

Publication	Publication Date	Title
US10277946B2 (en)	2019-04-30	Methods and systems for aggregation and organization of multimedia data acquired from a plurality of sources
Chen	2017	Efficient vector representation for documents through corruption
US20190043500A1 (en)	2019-02-07	Voice based realtime event logging
Pavel et al.	2015	SceneSkim: Searching and browsing movies using synchronized captions, scripts and plot summaries
Martinez et al.	2019	Violence rating prediction from movie scripts
Raychev et al.	2019	Language-independent sentiment analysis using subjectivity and positional information
Han et al.	2023	AutoAD II: The Sequel - Who, When, and What in Movie Audio Description
US20100329563A1 (en)	2010-12-30	System and Method for Real-time New Event Detection on Video Streams
Hussein et al.	2017	Unified embedding and metric learning for zero-exemplar event detection
Wang et al.	2024	SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
Rouvier et al.	2015	Audio-based video genre identification
Bost et al.	2020	Serial speakers: a dataset of TV series
Azab et al.	2018	Speaker naming in movies
Qi et al.	2016	Automated coding of political video ads for political science research
Langlois et al.	2010	VIRUS: video information retrieval using subtitles
Bourlard et al.	2013	Processing and linking audio events in large multimedia archives: The EU inEvent project
Hansen et al.	2005	Towards cognitive component analysis
Diamantini et al.	2021	Automatic annotation of corpora for emotion recognition through facial expressions analysis
Jitaru et al.	2020	LRRo: a lip reading data set for the under-resourced Romanian language
Vishwakarma et al.	2021	Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers
Rouvier et al.	2010	On-the-fly video genre classification by combination of audio features
Im et al.	2023	Multilayer CARU model for text summarization
Kannao et al.	2020	A system for semantic segmentation of TV news broadcast videos
Ngo et al.	2014	VIREO-TNO@ TRECVID 2014: multimedia event detection and recounting (MED and MER)
Wöllmer et al.	2013	YouTube Movie Reviews: In, Cross, and Open-Domain Sentiment Analysis in an Audiovisual Context