`WhisperTranscriber` to add filename to document metadata #5716

TuanaCelik · 2023-09-04T22:25:57Z

It would be great if we provided an option to add the filename to the metadata of the documents that the `WhisperTranscriber` creates. Currently there's no good way of doing this. This would really help when building RAG pipelines where you want to query videos and also reference the source video in the response.

TuanaCelik · 2023-09-05T08:53:17Z

Additional finding, together with @anakin87:
It seems that even if we pass the meta through an indexing pipeline, as shown below, it gets ignored. I think this might be because the root node (Whisper) ignores the meta.

The indexing pipeline:

whisper = WhisperTranscriber(api_key=api_key)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=whisper, name="Whisper", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Whisper"])
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])

videos = ["https://www.youtube.com/watch?v=h5id4erwD4s", "https://www.youtube.com/watch?v=iFUeV3aYynI"]

# for video in videos:
file_path1 = youtube2audio("https://www.youtube.com/watch?v=h5id4erwD4s")
file_path2 = youtube2audio("https://www.youtube.com/watch?v=iFUeV3aYynI")
doc1 = {'file_path': file_path1, "url": "https://www.youtube.com/watch?v=h5id4erwD4s"}
doc2 = {'file_path': file_path2, "url": "https://www.youtube.com/watch?v=iFUeV3aYynI"}

indexing_pipeline.run(file_paths=[doc1['file_path'], doc2['file_path']], meta=[{"url": doc['url']} for doc in [doc1, doc2]])
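
A side note on the shape of `meta`: the intent is one dict per file path, so the bracket placement in the comprehension matters. A small sketch in plain Python (the file paths are hypothetical placeholders, not the output of `youtube2audio`):

```python
docs = [
    {"file_path": "audio_1.mp3", "url": "https://www.youtube.com/watch?v=h5id4erwD4s"},
    {"file_path": "audio_2.mp3", "url": "https://www.youtube.com/watch?v=iFUeV3aYynI"},
]

# Dict comprehension wrapped in a list: a single dict whose "url" key is
# overwritten on every iteration, so only the last URL survives.
collapsed = [{"url": doc["url"] for doc in docs}]

# List comprehension of dicts: one meta dict per file, in the same order
# as the file paths.
per_file = [{"url": doc["url"]} for doc in docs]
file_paths = [doc["file_path"] for doc in docs]
```

Either way, this only affects what is passed in; whether the node uses it is a separate question.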

anakin87 · 2023-09-05T09:06:54Z

As Tuana said, meta is ignored.

See, for example, the run method:

haystack/haystack/nodes/audio/whisper_transcriber.py

Lines 176 to 186 in a5b8156

        :param meta: Ignored
        """
        transcribed_documents: List[Document] = []
        if file_paths:
            for file_path in file_paths:
                transcription = self.transcribe(file_path)
                d = Document.from_dict(transcription, field_map={"text": "content"})
                transcribed_documents.append(d)

        output = {"documents": transcribed_documents}
        return output, "output_1"
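
To illustrate what the requested behaviour could look like, here is a minimal sketch of the same loop that attaches the file name and any per-file meta to each document. This is not the Haystack implementation: `Document` is a stand-in dataclass, `transcribe_with_meta` and `fake_transcribe` are hypothetical helpers, and the `file_name` meta key is an assumption chosen for the example.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Union


@dataclass
class Document:
    """Stand-in for Haystack's Document, reduced to the fields used here."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def transcribe_with_meta(
    file_paths: List[Union[str, Path]],
    transcribe: Callable[[Union[str, Path]], Dict[str, Any]],
    meta: Optional[List[Dict[str, Any]]] = None,
) -> List[Document]:
    """Transcribe each file and keep its name plus optional per-file meta."""
    if meta is not None and len(meta) != len(file_paths):
        raise ValueError("meta must contain one dict per file path")

    documents: List[Document] = []
    for index, file_path in enumerate(file_paths):
        transcription = transcribe(file_path)
        # Always record the file name (hypothetical key), then merge user meta.
        doc_meta: Dict[str, Any] = {"file_name": Path(file_path).name}
        if meta is not None:
            doc_meta.update(meta[index])
        documents.append(Document(content=transcription["text"], meta=doc_meta))
    return documents


def fake_transcribe(file_path: Union[str, Path]) -> Dict[str, Any]:
    """Hypothetical transcriber so the sketch runs without an API key."""
    return {"text": f"transcript of {Path(file_path).name}"}


docs = transcribe_with_meta(
    ["audio/first.mp3", "audio/second.mp3"],
    fake_transcribe,
    meta=[{"url": "https://example.com/a"}, {"url": "https://example.com/b"}],
)
```

Each resulting document carries both its file name and its URL in `meta`, so a downstream RAG pipeline could cite the source video in its answer.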

TuanaCelik added the topic:LLM label Sep 4, 2023

TuanaCelik added the type:bug and topic:metadata labels Sep 5, 2023

ZanSara added the 2.x label Sep 5, 2023

Timoeller added the type:feature and P3 labels and removed the type:bug label Oct 9, 2023
