Skip to content
This repository has been archived by the owner on Oct 20, 2022. It is now read-only.

Add pdf ocr usage #140

Closed
wants to merge 5 commits into from
Closed

Add pdf ocr usage #140

wants to merge 5 commits into from

Conversation

shahrukhx01
Copy link

Based on OCR PR

@vercel
Copy link

vercel bot commented Sep 2, 2021

@shahrukhx01 is attempting to deploy a commit to the Deepset Team on Vercel.

A member of the Team first needs to authorize it.

@shahrukhx01
Copy link
Author

shahrukhx01 commented Sep 2, 2021

Hi @PiffPaffM FYI. please review when you’ve time.

@vercel
Copy link

vercel bot commented Sep 2, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/deepset-overnice/haystack-website/3XBLVdVQvE4G7Fp6mBEskKFvAzpv
✅ Preview: https://haystack-website-git-fork-shahrukhx01-a-f7d6bb-deepset-overnice.vercel.app

</code>
<code>
converter = PDFToTextOCRConverter(remove_numeric_tables=False,
valid_languages=["deu","eng"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct to use "deu" here or does it need to be "de"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is "deu" since we are using tesseract, should we also add a link for language codes from tesseract?
https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be cool. I was a bit confused.

Copy link
Author

@shahrukhx01 shahrukhx01 Sep 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PiffPaffM I have updated the documentation for this part

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The code box for PDF looks a bit weird: https://haystack-website-git-fork-shahrukhx01-a-f7d6bb-deepset-overnice.vercel.app/usage/preprocessing#file-conversion. I think the comment is not interpreted correctly. Can you have a look?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed it into a code comment, but where should we put the link for Tessaract language codes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest you put the link in that same comment. However in the vercel preview, it seems like the import statement is on the same line as the comment.
https://haystack-website-hv1bt4525-deepset-overnice.vercel.app/usage/preprocessing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we just merged a new branch in that moves this file from docs/latest/usage/preprocessing.mdx to docs/latest/components/preprocessing.mdx. Could you also make this change so that this PR can be merged?

@PiffPaffM
Copy link
Contributor

@shahrukhx01 Thanks for the contribution to our docs! That is great! Looking forward to merging it after these minor changes.

@shahrukhx01
Copy link
Author

Hi @PiffPaffM following up on this :)

Copy link
Contributor

@PiffPaffM PiffPaffM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

</code>
<code>
converter = PDFToTextOCRConverter(remove_numeric_tables=False,
valid_languages=["deu","eng"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The code box for PDF looks a bit weird: https://haystack-website-git-fork-shahrukhx01-a-f7d6bb-deepset-overnice.vercel.app/usage/preprocessing#file-conversion. I think the comment is not interpreted correctly. Can you have a look?

</code>
<code>
converter = PDFToTextOCRConverter(remove_numeric_tables=False,
valid_languages=["deu","eng"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest you put the link in that same comment. However in the vercel preview, it seems like the import statement is on the same line as the comment.
https://haystack-website-hv1bt4525-deepset-overnice.vercel.app/usage/preprocessing

</code>
<code>
converter = PDFToTextOCRConverter(remove_numeric_tables=False,
valid_languages=["deu","eng"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we just merged a new branch in that moves this file from docs/latest/usage/preprocessing.mdx to docs/latest/components/preprocessing.mdx. Could you also make this change so that this PR can be merged?

@brandenchan brandenchan mentioned this pull request Sep 20, 2021
@brandenchan
Copy link
Contributor

Hi @shahrukhx01, I realise that the restructuring of the repo might have made this PR a little tricky to finish off. Would it be alright if I incorporate your changes into another PR of mine (#164)? This will save the effort of rebasing / merging branches. We will of course credit your work in the new PR :)

@shahrukhx01
Copy link
Author

@brandenchan yes definitely no problem.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants