Multimodal transcribers (v2) #5366

ZanSara · 2023-07-14T14:51:48Z

Multimodal transcribers convert image, audio, and video documents into text documents.

The main open question is which input and output types these components should use so that they work in both indexing and query scenarios.

input path, output document: works well for indexing, clumsy for query (the document needs to be converted back to a string)
input document, output document: same trade-off as above
input path, output string: works for query, doesn't work for indexing (metadata is likely lost, for example Whisper timestamps)
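
To make the three options concrete, here is a minimal sketch of what each signature could look like. Everything in it is an assumption for illustration: `Document` is a hypothetical stand-in dataclass (not the Haystack class), `fake_transcribe` is a placeholder for a real model call, and the function names are made up.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict


@dataclass
class Document:
    """Hypothetical stand-in for a text document; not the Haystack class."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def fake_transcribe(path: Path) -> Dict[str, Any]:
    """Placeholder for a real model call: returns text plus segment timestamps."""
    return {"text": f"transcript of {path.name}", "segments": [(0.0, 1.5)]}


def path_to_document(path: Path) -> Document:
    # Option 1: metadata such as timestamps survives, which suits indexing.
    result = fake_transcribe(path)
    meta = {"source": str(path), "segments": result["segments"]}
    return Document(content=result["text"], meta=meta)


def document_to_document(doc: Document) -> Document:
    # Option 2: same output, but the path is read from the input document's metadata.
    return path_to_document(Path(doc.meta["source"]))


def path_to_string(path: Path) -> str:
    # Option 3: convenient for queries, but the timestamps are dropped.
    return fake_transcribe(path)["text"]
```

With the first two shapes a query pipeline still has to pull `content` out of the returned document, while the third shape gives an indexing pipeline nothing to store besides the text.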

Currently, the WhisperTranscribers for v2 follow the path --> document pattern, which makes them awkward to use in query pipelines.
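
The awkwardness in query pipelines can be sketched as follows. This is a toy illustration under stated assumptions, not the actual WhisperTranscriber API: `transcribe_to_document`, `retrieve`, and `document_to_query` are hypothetical names, and the transcript text is hard-coded instead of coming from a model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a text document; not the Haystack class."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def transcribe_to_document(path: str) -> Document:
    """Hypothetical path -> document transcriber (placeholder, no real model)."""
    return Document(content="haystack framework", meta={"source": path})


def retrieve(query: str, corpus: List[str]) -> List[str]:
    """Toy retriever that expects a plain string query."""
    terms = query.lower().split()
    return [text for text in corpus if any(term in text.lower() for term in terms)]


def document_to_query(doc: Document) -> str:
    # The extra glue step a query pipeline needs with the path -> document pattern.
    return doc.content


corpus = ["Haystack is an LLM framework.", "Bananas are yellow."]
spoken_query = transcribe_to_document("question.wav")
hits = retrieve(document_to_query(spoken_query), corpus)
```

The `document_to_query` step carries no logic of its own; it exists only because the transcriber's output type does not match what the downstream query component accepts.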

Once we decide on a strategy, all transcribers should work similarly:

Tasks

ImageTranscriber (v2)
AudioTranscriber (v2)
VideoTranscriber (v2)

Existing work:

ZanSara mentioned this issue Jul 14, 2023

Migrate Components to Pipeline v2 #5265

Closed

ZanSara changed the title ~~Multi modal transcribers (v2)~~ Multimodal transcribers (v2) Jul 14, 2023

ZanSara added the 2.x Related to Haystack v2.0 label Aug 25, 2023

Timoeller added the P3 Low priority, leave it in the backlog label Sep 29, 2023

masci added the epic label Oct 2, 2023
