Get file names from InMemoryDocumentStore #211

Closed
anirbansaha96 opened this issue Jul 9, 2020 · 15 comments
Assignees
Labels
type:bug Something isn't working

Comments

@anirbansaha96
Contributor

I'm using InMemoryDocumentStore as my document_store. At one point, I convert files with PDFToTextConverter and then write the results into the document store.

Is there any way to get a list of files in the InMemoryDocumentStore()?

Also, in my answer output from print_answers(prediction, details="all"), I'm getting a document_id. Is there any way to leverage this information to get the filename?

@tanaysoni tanaysoni self-assigned this Jul 9, 2020
@tanaysoni
Contributor

Hi @anirbansaha96,

Is there any way to get a list of files in the InMemoryDocumentStore()?

You can use the InMemoryDocumentStore.get_all_documents() method.

document_id, is there any way to leverage this information to perhaps get the filename?

You can use InMemoryDocumentStore.get_document_by_id(). The returned Document object should have a name if one was supplied during indexing.
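
To make the two suggestions concrete, here is a minimal, self-contained sketch. It does not import Haystack: `Doc` and `TinyStore` are hypothetical stand-ins that only mimic the two method names mentioned above, and keeping the name under `meta["name"]` is an assumption borrowed from the example given later in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a stored document (not Haystack's Document)."""
    id: str
    text: str
    meta: Dict[str, str] = field(default_factory=dict)


class TinyStore:
    """Hypothetical in-memory store mimicking the two method names from the reply."""

    def __init__(self) -> None:
        self._docs: Dict[str, Doc] = {}

    def write(self, doc: Doc) -> None:
        self._docs[doc.id] = doc

    def get_all_documents(self) -> List[Doc]:
        return list(self._docs.values())

    def get_document_by_id(self, id: str) -> Doc:
        return self._docs[id]


store = TinyStore()
store.write(Doc(id="a1", text="Security best practices", meta={"name": "security.pdf"}))
store.write(Doc(id="b2", text="No name was supplied for this one"))

# List every file name; documents indexed without a name yield None.
names: List[Optional[str]] = [doc.meta.get("name") for doc in store.get_all_documents()]
print(names)

# Resolve a single ID back to its file name.
print(store.get_document_by_id("a1").meta.get("name"))
```

Using `meta.get("name")` rather than `meta["name"]` keeps the listing from failing on documents that were written without a name.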

@anirbansaha96
Contributor Author

anirbansaha96 commented Jul 9, 2020

@tanaysoni The output of InMemoryDocumentStore.get_all_documents(document_store) is:
[Document(id='6257a1cdb8e5f13804b65b3b8125d509', text='Security best practices for Azure solutions', external_source_id=None, question=None, query_score=None, meta={}, tags=None)]

But when I call InMemoryDocumentStore.get_document_by_id(document_store, '6257a1cdb8e5f13804b65b3b8125d509') or InMemoryDocumentStore.get_document_by_id(document_store, id='6257a1cdb8e5f13804b65b3b8125d509'), which should ideally give me the filename as you mentioned, I get the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-15-ce74eccb6d51> in <module>()
----> 1 InMemoryDocumentStore.get_document_by_id(document_store,'62b8c57494d145acdcecf3950754fa61')

1 frames
/content/haystack/haystack/database/memory.py in _convert_memory_hit_to_document(self, hit, doc_id)
     69         document = Document(
     70             id=doc_id,
---> 71             text=hit[0].get('text', None),
     72             meta=hit[0].get('meta', {}),
     73             query_score=hit[1],

KeyError: 0
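
A note on reading this traceback: `KeyError: 0` on `hit[0]` is what Python raises when `hit` is a dict rather than a sequence, because `0` is then treated as a key. The sketch below reproduces that behaviour in isolation. The shape of `hit` is an assumption inferred from the traceback, not taken from the Haystack source, and the score value is arbitrary.

```python
# Assumption: the stored entry is a plain dict, while the converter
# expects a (document_dict, score) pair and indexes it with [0].
hit_as_dict = {"text": "Security best practices for Azure solutions", "meta": {}}
hit_as_pair = (hit_as_dict, 0.87)  # 0.87 is an arbitrary illustrative score

# Indexing the pair works the way the converter expects.
text_from_pair = hit_as_pair[0].get("text", None)

# Indexing the dict with 0 looks up the *key* 0, which does not exist.
try:
    hit_as_dict[0].get("text", None)
    error = None
except KeyError as exc:
    error = exc

print(text_from_pair)
print(repr(error))
```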

@tanaysoni tanaysoni added type:bug Something isn't working and removed question labels Jul 9, 2020
@tanaysoni tanaysoni reopened this Jul 9, 2020
@anirbansaha96
Contributor Author

anirbansaha96 commented Jul 11, 2020

@tanaysoni, wanted to add some information in case it helps to pinpoint the issue.
With the command document_store.get_all_documents(), I get the output id='8096d30a73bca646151018cbd42fa977'.
However, with print_answers(prediction, details="all"), the same document has 'document_id': '1843'.

@anirbansaha96
Contributor Author

anirbansaha96 commented Jul 15, 2020

@tanaysoni Have there been changes made to print_answers( #something , details="all")? It no longer shows document_id like before.

@tanaysoni
Contributor

Hi @anirbansaha96,

With #217, you can now get file names using the ID for all document stores. Here's an example:

from haystack.database.memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

test_docs = [
    {"name": "testing the finder 1", "text": "testing the finder with pyhton unit test 1", 'meta': {'url': 'url'}},
    {"name": "testing the finder 2", "text": "testing the finder with pyhton unit test 2", 'meta': {'url': 'url'}},
    {"name": "testing the finder 3", "text": "testing the finder with pyhton unit test 3", 'meta': {'url': 'url'}}
]
document_store.write_documents(test_docs)

print(document_store.get_document_by_id("e97e6fbebbc591fe7214e0bf26ec5dbf").meta["name"])

@tanaysoni
Contributor

tanaysoni commented Jul 15, 2020

Have there been changes made to print_answers( #something , details="all")? It no longer shows document_id like before.

I am not able to reproduce this issue with tutorials 1 & 3 on the latest master. If possible, could you share a code snippet to help reproduce it?

The issue is that print_answers() deletes keys from the results dicts passed to it. It should be resolved with #230.
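
The behaviour described here, a helper that deletes keys from the dict it was given, is easy to reproduce in plain Python. The two functions below are hypothetical illustrations, not Haystack's `print_answers()`, and the dict layout is made up for the example. They show why the caller loses `document_id` after the first call, and how working on a copy avoids it.

```python
import copy


def print_summary_mutating(results: dict) -> None:
    """Prints a trimmed view but deletes keys from the caller's dict."""
    for answer in results["answers"]:
        del answer["document_id"]  # side effect: the caller loses this key
        print(answer)


def print_summary_safe(results: dict) -> None:
    """Prints the same trimmed view while working on a deep copy."""
    trimmed = copy.deepcopy(results)
    for answer in trimmed["answers"]:
        del answer["document_id"]
        print(answer)


prediction = {"answers": [{"answer": "Azure", "document_id": "742"}]}

# The safe variant leaves the original dict untouched.
print_summary_safe(prediction)
still_there = "document_id" in prediction["answers"][0]

# The mutating variant removes the key from the caller's data.
print_summary_mutating(prediction)
gone = "document_id" not in prediction["answers"][0]
```

A deep copy is used because the keys being removed live in nested dicts; a shallow `dict(results)` would still share those inner dicts with the caller.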

@anirbansaha96
Contributor Author

@tanaysoni The same error is still there. print_answers() prints a document ID as 'document_id': '742', but print(document_store.get_document_by_id("742").meta["name"]) raises a KeyError:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-b74f2b2ca29a> in <module>()
----> 1 print(document_store_Customer_Content.get_document_by_id("742").meta["name"])

/content/haystack/haystack/database/memory.py in get_document_by_id(self, id)
     65 
     66     def get_document_by_id(self, id: str) -> Document:
---> 67         document = self._convert_memory_hit_to_document(self.docs[id], doc_id=id)
     68         return document
     69 

KeyError: '742'

@anirbansaha96
Contributor Author

@tanaysoni one update to pinpoint this issue:

  1. document_store.get_all_documents() gives id='30af7155e4a229c32317768ec16954eb', and print(document_store.get_document_by_id("30af7155e4a229c32317768ec16954eb").meta["name"]) gives me the correct name.

  2. However, print_answers() provides the document ID 742, which doesn't work with print(document_store.get_document_by_id("742").meta["name"]).
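
While the two IDs disagree, a lookup can be guarded so that an unknown ID produces a readable message instead of a bare `KeyError`. This is a minimal sketch over a plain dict. `docs_by_id` and `name_for` are hypothetical names, the file name is invented, and the assumption is that the store is keyed by the hash-style ID from point 1.

```python
from typing import Dict, Optional

# Assumption: the store is keyed by the hash-style ID seen in get_all_documents().
docs_by_id: Dict[str, dict] = {
    "30af7155e4a229c32317768ec16954eb": {
        "text": "placeholder text",
        "meta": {"name": "example.pdf"},  # invented file name
    },
}


def name_for(doc_id: str) -> Optional[str]:
    """Return the stored name for doc_id, or None if the ID is not in the store."""
    doc = docs_by_id.get(doc_id)
    if doc is None:
        print(f"ID {doc_id!r} not found; known IDs: {sorted(docs_by_id)}")
        return None
    return doc["meta"].get("name")


found = name_for("30af7155e4a229c32317768ec16954eb")  # ID from get_all_documents()
missing = name_for("742")                              # ID from print_answers()
```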

@anirbansaha96
Contributor Author

Thank you. Hopefully #232 will solve this issue.

@tanaysoni
Contributor

Hi @anirbansaha96, thank you for raising the issue. It is now resolved with #232.

@anirbansaha96
Contributor Author

Thank you, I've checked it. It works fine now. Thank you!

@sophgit

sophgit commented Dec 15, 2020

Hi!

I'm trying to use the Elasticsearch retriever like this:
retriever.retrieve("When is Britta on vacation?")
and would like to return the document name and/or text. I know that I can get the name and text by doing this:
document_store.get_document_by_id("5c4fc733-a69b-479f-bce6-2fc517623cd9").text
document_store.get_document_by_id("5c4fc733-a69b-479f-bce6-2fc517623cd9").meta["name"]

However, retriever.retrieve returns something like [<haystack.schema.Document at 0x7efcd596b510>].
How can I get the document ID of the retrieved document, so I can access the document's text or name?

Thank you!!

@anirbansaha96
Contributor Author

@tanaysoni

@tanaysoni
Contributor

tanaysoni commented Dec 15, 2020

Hi @sophgit, you can use document.id to get the document ids of the retrieved documents.

It seems you're using an earlier version of Haystack. On the current master branch, the representation of a document (in the debugger, console, etc.) has been changed to be human-readable rather than the cryptic object notation.

To upgrade to the latest master branch, you can follow the installation guide.
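
As a sketch of the `document.id` suggestion, the IDs, names, and texts can be read straight off the returned list. The example uses a hypothetical `Doc` dataclass in place of Haystack's `Document` and a hard-coded list in place of the result of `retriever.retrieve(...)`. Only the `id`, `text`, and `meta["name"]` attributes come from this thread; the text and file name are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in exposing the attributes mentioned in this thread."""
    id: str
    text: str
    meta: Dict[str, str] = field(default_factory=dict)


# Stand-in for the list returned by retriever.retrieve("...").
retrieved: List[Doc] = [
    Doc(
        id="5c4fc733-a69b-479f-bce6-2fc517623cd9",
        text="Illustrative text",              # invented content
        meta={"name": "vacation_plan.txt"},    # invented file name
    ),
]

# Each retrieved document carries its own ID, text, and metadata.
for doc in retrieved:
    print(doc.id, doc.meta.get("name"), doc.text)

ids = [doc.id for doc in retrieved]
```

Since each returned object already carries `text` and `meta`, a second `get_document_by_id()` call is not needed in this sketch.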

@sophgit

sophgit commented Dec 15, 2020

Aah, thank you @tanaysoni, it worked with the current master branch.
