Wang et al., 2024 - Google Patents
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech RecognitionWang et al., 2024
View PDF- Document ID
- 9564441537629455797
- Author
- Wang H
- Kurita S
- Shimizu S
- Kawahara D
- Publication year
- Publication venue
- arXiv preprint arXiv:2401.09759
External Links
Snippet
Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio. In AVSR, considerable efforts have been directed at datasets for facial features such as lip-readings, while they often fall …
- 230000001815 facial effect 0 abstract description 3
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30784—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
- G06F17/30796—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30634—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30784—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre
- G06F17/30799—Information retrieval; Database structures therefor; File system structures therefor of video data using features automatically derived from the video content, e.g. descriptors, fingerprints, signatures, genre using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30781—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F17/30817—Information retrieval; Database structures therefor; File system structures therefor of video data using information manually generated or using information not derived from the video content, e.g. time and location information, usage information, user ratings
- G06F17/3082—Information retrieval; Database structures therefor; File system structures therefor of video data using information manually generated or using information not derived from the video content, e.g. time and location information, usage information, user ratings using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/62—Methods or arrangements for recognition using electronic means
- G06K9/6217—Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10277946B2 (en) | Methods and systems for aggregation and organization of multimedia data acquired from a plurality of sources | |
Chen | Efficient vector representation for documents through corruption | |
US20190043500A1 (en) | Voice based realtime event logging | |
Pavel et al. | Sceneskim: Searching and browsing movies using synchronized captions, scripts and plot summaries | |
Martinez et al. | Violence rating prediction from movie scripts | |
Raychev et al. | Language-independent sentiment analysis using subjectivity and positional information | |
Han et al. | Autoad ii: The sequel-who, when, and what in movie audio description | |
US20100329563A1 (en) | System and Method for Real-time New Event Detection on Video Streams | |
Hussein et al. | Unified embedding and metric learning for zero-exemplar event detection | |
Wang et al. | SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition | |
Rouvier et al. | Audio-based video genre identification | |
Bost et al. | Serial speakers: a dataset of tv series | |
Azab et al. | Speaker naming in movies | |
Qi et al. | Automated coding of political video ads for political science research | |
Langlois et al. | VIRUS: video information retrieval using subtitles | |
Bourlard et al. | Processing and linking audio events in large multimedia archives: The eu inevent project | |
Hansen et al. | Towards cognitive component analysis | |
Diamantini et al. | Automatic annotation of corpora for emotion recognition through facial expressions analysis | |
Jitaru et al. | Lrro: a lip reading data set for the under-resourced romanian language | |
Vishwakarma et al. | Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers | |
Rouvier et al. | On-the-fly video genre classification by combination of audio features | |
Im et al. | Multilayer CARU model for text summarization | |
Kannao et al. | A system for semantic segmentation of TV news broadcast videos | |
Ngo et al. | VIREO-TNO@ TRECVID 2014: multimedia event detection and recounting (MED and MER) | |
Wöllmer et al. | YouTube Movie Reviews: In, Cross, and Open-Domain Sentiment Analysis in an Audiovisual Context |