WhisperTranscriber to add filename to document metadata #5716

Open
TuanaCelik opened this issue Sep 4, 2023 · 2 comments
Labels
2.x (Related to Haystack v2.0), P3 (Low priority, leave it in the backlog), topic:LLM, topic:metadata, type:feature (New feature or request)

Comments

@TuanaCelik
Member

It would be great if we provided an option to add the filename to the metadata of the documents that the WhisperTranscriber creates. Currently there's no good way of doing this. This would really help when building RAG pipelines where you want to query videos and also reference the source video in the response.
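Until such an option exists, one possible workaround is to call the transcriber outside the pipeline and attach the file name yourself. The sketch below is only an illustration: `Doc` is a hypothetical stand-in for Haystack's `Document`, `transcribe_with_filenames` is a made-up helper, and the `transcribe` callable is assumed to return a dict with a `"text"` key, as the `field_map={"text": "content"}` mapping in the transcriber's `run` method suggests.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document (content and meta only)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def transcribe_with_filenames(
    file_paths: List[str],
    transcribe: Callable[[str], Dict[str, Any]],
) -> List[Doc]:
    """Transcribe each file and record its file name under meta["name"]."""
    docs = []
    for file_path in file_paths:
        # Assumption: the callable returns a dict containing a "text" key.
        transcription = transcribe(file_path)
        docs.append(
            Doc(content=transcription["text"], meta={"name": Path(file_path).name})
        )
    return docs
```

With a real transcriber, the `transcribe` argument would presumably be something like `whisper.transcribe`; the resulting documents could then be handed to the rest of the indexing pipeline.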

@TuanaCelik
Member Author

TuanaCelik commented Sep 5, 2023

Additional learning with @anakin87:
It seems that even if we want to add the meta via an indexing pipeline, as shown below, the meta will get ignored. I think this might be because the root node (Whisper) ignores the meta.

The indexing pipeline:

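# api_key, preprocessor, embedder, document_store and youtube2audio are defined elsewhere
from haystack import Pipeline
from haystack.nodes.audio import WhisperTranscriber

whisper = WhisperTranscriber(api_key=api_key)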

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=whisper, name="Whisper", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Whisper"])
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])
videos = ["https://www.youtube.com/watch?v=h5id4erwD4s", "https://www.youtube.com/watch?v=iFUeV3aYynI"]

# for video in videos:
file_path1 = youtube2audio("https://www.youtube.com/watch?v=h5id4erwD4s")
file_path2 = youtube2audio("https://www.youtube.com/watch?v=iFUeV3aYynI")
doc1 = {'file_path': file_path1, "url": "https://www.youtube.com/watch?v=h5id4erwD4s"}
doc2 = {'file_path': file_path2, "url": "https://www.youtube.com/watch?v=iFUeV3aYynI"}

# Pass one meta dict per file
indexing_pipeline.run(file_paths=[doc1['file_path'], doc2['file_path']], meta=[{"url": doc['url']} for doc in [doc1, doc2]])
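The `meta` argument in a call like this is meant to carry one dictionary per file. A minimal plain-Python sketch of building such a list follows; the paths and URLs are placeholders, not the videos from this issue. Note that the braces have to close before `for`: otherwise the expression is a dict comprehension wrapped in a list, which yields a single dictionary holding only the last URL.

```python
# Placeholder records: each pairs a local audio file with its source URL.
docs = [
    {"file_path": "audio/first.mp3", "url": "https://example.com/first"},
    {"file_path": "audio/second.mp3", "url": "https://example.com/second"},
]

file_paths = [doc["file_path"] for doc in docs]

# List comprehension of dicts: one meta entry per file.
meta = [{"url": doc["url"]} for doc in docs]

# Dict comprehension inside a list: every iteration overwrites the "url" key,
# so only one dictionary (with the last URL) survives.
collapsed = [{"url": doc["url"] for doc in docs}]
```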

TuanaCelik added the type:bug (Something isn't working) and topic:metadata labels on Sep 5, 2023
@anakin87
Member

anakin87 commented Sep 5, 2023

As Tuana said, meta is ignored.

See, for example, the run method:

        :param meta: Ignored
        """
        transcribed_documents: List[Document] = []
        if file_paths:
            for file_path in file_paths:
                transcription = self.transcribe(file_path)
                d = Document.from_dict(transcription, field_map={"text": "content"})
                transcribed_documents.append(d)
        output = {"documents": transcribed_documents}
        return output, "output_1"
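For discussion, here is a rough sketch of how a `run`-style loop could honor `meta` instead of ignoring it. This is not Haystack's implementation: `run_with_meta` is a hypothetical free function, documents are represented as plain dicts, and the choice to accept either one shared dict or one dict per file is an assumption about what would be convenient.

```python
from typing import Any, Callable, Dict, List, Optional, Union

Meta = Dict[str, Any]


def run_with_meta(
    file_paths: List[str],
    transcribe: Callable[[str], Dict[str, Any]],
    meta: Optional[Union[Meta, List[Meta]]] = None,
) -> List[Dict[str, Any]]:
    """Return one document dict per file, with the matching meta attached."""
    if meta is None:
        metas: List[Meta] = [{} for _ in file_paths]
    elif isinstance(meta, dict):
        # A single dict is copied onto every document.
        metas = [dict(meta) for _ in file_paths]
    else:
        if len(meta) != len(file_paths):
            raise ValueError("meta must contain one entry per file path")
        metas = [dict(m) for m in meta]

    documents = []
    for file_path, file_meta in zip(file_paths, metas):
        # Assumption: the callable returns a dict containing a "text" key.
        transcription = transcribe(file_path)
        documents.append({"content": transcription["text"], "meta": file_meta})
    return documents
```

Copying each meta dict keeps the caller's input from being mutated if later pipeline steps add keys to a document's meta.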

ZanSara added the 2.x (Related to Haystack v2.0) label on Sep 5, 2023
Timoeller added the type:feature (New feature or request) and P3 (Low priority, leave it in the backlog) labels and removed the type:bug (Something isn't working) label on Oct 9, 2023