Get file names from InMemoryDocumentStore #211

anirbansaha96 · 2020-07-09T04:36:34Z

I'm using InMemoryDocumentStore as my document_store. At one point, I run PDFToTextConverter and then write the converted documents into the document store.

Is there any way to get a list of the files in the InMemoryDocumentStore()?

Also, in my answer output from print_answers(prediction, details="all"), I'm getting a document_id for each document. Is there any way to use this information to get the filename?


tanaysoni · 2020-07-09T12:30:19Z

Hi @anirbansaha96,

Is there any way to get a list of the files in the InMemoryDocumentStore()?

You can use InMemoryDocumentStore.get_all_documents() method.

document_id ... Is there any way to use this information to get the filename?

You can use InMemoryDocumentStore.get_document_by_id(). The returned Document object should have a name if it was supplied during indexing.

anirbansaha96 · 2020-07-09T15:23:21Z

@tanaysoni The output for my command InMemoryDocumentStore.get_all_documents(document_store) is
[Document(id='6257a1cdb8e5f13804b65b3b8125d509', text='Security best practices for Azure solutions', external_source_id=None, question=None, query_score=None, meta={}, tags=None)]

But when I call InMemoryDocumentStore.get_document_by_id(document_store,'6257a1cdb8e5f13804b65b3b8125d509') or InMemoryDocumentStore.get_document_by_id(document_store,id='6257a1cdb8e5f13804b65b3b8125d509'), I expect to get the filename back, as you mentioned. Instead, it raises the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-15-ce74eccb6d51> in <module>()
----> 1 InMemoryDocumentStore.get_document_by_id(document_store,'62b8c57494d145acdcecf3950754fa61')

1 frames
/content/haystack/haystack/database/memory.py in _convert_memory_hit_to_document(self, hit, doc_id)
     69         document = Document(
     70             id=doc_id,
---> 71             text=hit[0].get('text', None),
     72             meta=hit[0].get('meta', {}),
     73             query_score=hit[1],

KeyError: 0

anirbansaha96 · 2020-07-11T08:21:31Z

@tanaysoni, I wanted to add some information in case it helps to pinpoint the issue.
With document_store.get_all_documents(), I get the output id='8096d30a73bca646151018cbd42fa977'.
However, with print_answers(prediction, details="all"), the same document has 'document_id': '1843'.

anirbansaha96 · 2020-07-15T07:49:09Z

@tanaysoni Have there been changes made to print_answers( #something , details="all")? It is no longer showing document_id like before.

tanaysoni · 2020-07-15T09:01:04Z

Hi @anirbansaha96,

With #217, you can now get file names using the ID for all document stores. Here's an example:

from haystack.database.memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

test_docs = [
    {"name": "testing the finder 1", "text": "testing the finder with pyhton unit test 1", 'meta': {'url': 'url'}},
    {"name": "testing the finder 2", "text": "testing the finder with pyhton unit test 2", 'meta': {'url': 'url'}},
    {"name": "testing the finder 3", "text": "testing the finder with pyhton unit test 3", 'meta': {'url': 'url'}}
]
document_store.write_documents(test_docs)

print(document_store.get_document_by_id("e97e6fbebbc591fe7214e0bf26ec5dbf").meta["name"])
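To list every stored file name instead of looking up a single ID, the same `meta["name"]` field can be read from each document returned by `get_all_documents()`. Below is a minimal sketch of that idea. `ToyDocument`, `ToyStore`, and `list_file_names` are hypothetical stand-ins (not Haystack classes) so the snippet runs without Haystack installed; they only assume the `get_all_documents()` and `meta["name"]` shape shown in the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List
from uuid import uuid4


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document: only id, text, and meta."""
    id: str
    text: str
    meta: Dict[str, str] = field(default_factory=dict)


class ToyStore:
    """Hypothetical stand-in for an in-memory document store."""

    def __init__(self) -> None:
        self.docs: Dict[str, ToyDocument] = {}

    def write_documents(self, dicts: List[dict]) -> None:
        for d in dicts:
            meta = dict(d.get("meta", {}))
            # Assumption: the name ends up under meta["name"], as in the example above.
            if "name" in d:
                meta["name"] = d["name"]
            doc_id = uuid4().hex
            self.docs[doc_id] = ToyDocument(id=doc_id, text=d["text"], meta=meta)

    def get_all_documents(self) -> List[ToyDocument]:
        return list(self.docs.values())


def list_file_names(store) -> Dict[str, str]:
    """Map document id -> file name, skipping documents indexed without a name."""
    return {
        doc.id: doc.meta["name"]
        for doc in store.get_all_documents()
        if "name" in doc.meta
    }
```

With a real document store, the same dictionary comprehension should work as long as the documents were written with a name; documents indexed without one are simply skipped.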

tanaysoni · 2020-07-15T09:02:22Z

Have there been changes made to print_answers( #something , details="all")? It is no longer showing document_id like before.

~~I am not able to reproduce this issue with tutorials 1 & 3 on the latest master. If possible, could you share a code snippet to help reproduce?~~

The issue is that print_answers() deletes keys from the results dicts passed to it. It should be resolved by #230.
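To illustrate the kind of side effect described here, the following sketch contrasts a printer that removes keys in place with one that filters a deep copy. Both functions are hypothetical and are not the actual print_answers() implementation or the change made in #230; they only show why the caller's dict loses document_id after the first call.

```python
import copy
from typing import Any, Dict


def print_answers_mutating(results: Dict[str, Any]) -> None:
    """Hypothetical printer that strips keys in place (the problematic pattern)."""
    for answer in results["answers"]:
        # pop() modifies the caller's dict, so the key is gone after this call.
        answer.pop("document_id", None)
    print(results["answers"])


def print_answers_safe(results: Dict[str, Any]) -> None:
    """Hypothetical printer that filters a deep copy and leaves the input untouched."""
    filtered = copy.deepcopy(results)
    for answer in filtered["answers"]:
        answer.pop("document_id", None)
    print(filtered["answers"])
```

Until a fix is available, passing `copy.deepcopy(prediction)` to the printing function is one way to keep the original prediction intact.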

anirbansaha96 · 2020-07-15T12:08:16Z

@tanaysoni I still get the same error. print_answers() prints a document ID as 'document_id': '742', but print(document_store.get_document_by_id("742").meta["name"]) raises a KeyError:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-b74f2b2ca29a> in <module>()
----> 1 print(document_store_Customer_Content.get_document_by_id("742").meta["name"])

/content/haystack/haystack/database/memory.py in get_document_by_id(self, id)
     65 
     66     def get_document_by_id(self, id: str) -> Document:
---> 67         document = self._convert_memory_hit_to_document(self.docs[id], doc_id=id)
     68         return document
     69 

KeyError: '742'

anirbansaha96 · 2020-07-15T12:24:16Z

@tanaysoni one update to pinpoint this issue:

document_store.get_all_documents() gives id='30af7155e4a229c32317768ec16954eb', and print(document_store.get_document_by_id("30af7155e4a229c32317768ec16954eb").meta["name"]) gives me the correct name.
However, print_answers() reports document ID 742, which doesn't work with print(document_store.get_document_by_id("742").meta["name"]).

anirbansaha96 · 2020-07-15T12:25:33Z

Thank you. Hopefully #232 will solve this issue.

tanaysoni · 2020-07-15T12:55:53Z

Hi @anirbansaha96, thank you for raising the issue. It is now resolved with #232.

anirbansaha96 · 2020-07-15T13:37:50Z

Thank you, I've checked it. It works fine now. Thank you!

sophgit · 2020-12-15T10:55:05Z

Hi!

I'm trying to use the elasticsearch retriever like this:
retriever.retrieve("When is Britta on vacation?")
and would like to return the document name and/or text. I know that I can get the name and text by doing this
document_store.get_document_by_id("5c4fc733-a69b-479f-bce6-2fc517623cd9").text
document_store.get_document_by_id("5c4fc733-a69b-479f-bce6-2fc517623cd9").meta["name"]

However, the retriever.retrieve returns something like [<haystack.schema.Document at 0x7efcd596b510>].
How can I get the document ID of the retrieved document, so I can access the document's text or name?

Thank you!!

anirbansaha96 · 2020-12-15T11:13:18Z

@tanaysoni

tanaysoni · 2020-12-15T11:38:17Z

Hi @sophgit, you can use document.id to get the document ids of the retrieved documents.

It seems you're using an earlier version of Haystack. On the current master branch, the representation of a document (in the debugger, console, etc.) has been changed to be human-readable rather than the cryptic object notation.

To upgrade to the latest master branch, you can follow the installation guide.
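As a rough sketch of using document.id on retrieved documents: the snippet below collects the ID, name, and text of each result in one pass. `RetrievedDoc` and `summarize` are hypothetical names used so the code runs on its own; it assumes each retrieved object exposes `id`, `text`, and a `meta` dict with an optional "name" key, as in the get_document_by_id() calls quoted earlier in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RetrievedDoc:
    """Hypothetical stand-in for a document returned by a retriever."""
    id: str
    text: str
    meta: Dict[str, str] = field(default_factory=dict)


def summarize(docs: List[RetrievedDoc]) -> List[Dict[str, Optional[str]]]:
    """Collect id, name, and text for each retrieved document."""
    return [
        # meta.get() returns None when no name was supplied at indexing time.
        {"id": doc.id, "name": doc.meta.get("name"), "text": doc.text}
        for doc in docs
    ]
```

With a real retriever, something like `summarize(retriever.retrieve("When is Britta on vacation?"))` would be the analogous call, provided the returned objects have those attributes.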

sophgit · 2020-12-15T12:35:42Z

aah, thank you @tanaysoni, it worked with the current master branch.

tanaysoni self-assigned this Jul 9, 2020

tanaysoni added the question label Jul 9, 2020

tanaysoni closed this as completed Jul 9, 2020

tanaysoni added type:bug Something isn't working and removed question labels Jul 9, 2020

tanaysoni reopened this Jul 9, 2020

tanaysoni closed this as completed Jul 15, 2020
