Change Text extraction is not allowed error to warning #453

madhurcodes · 2020-07-03T01:11:30Z

I was using this library to parse a large set of files, and it raised an error halfway through because it encountered a file that does not allow text extraction. Users should not have to handle this manually in a library meant for parsing PDFs.

Changes the "Text extraction is not allowed" error to a descriptive warning.

Fixes #350

How Has This Been Tested?

I tested this on an example PDF which threw the mentioned error with the original code and gave the correct text after the change as expected. I cannot share the PDF because I don't own copyright to it.

Checklist

I have added tests that prove my fix is effective or that my feature works
I have added docstrings to newly created methods and classes
I have optimized the code at least one time after creating the initial version
I have updated the README.md or I verified that this is not necessary
I have updated the readthedocs documentation or I verified that this is not necessary
I have added a concise human-readable description of the change to CHANGELOG.md

pietermarsman

Hi @madhurcodes,

Thanks for your work. Could you make sure that the changes are like the suggested changes in the issue:

Set the default value for check_extractable to False.

If check_extractable is True we raise an error; if False we issue a warning.

Remove the explicit arguments for check_extractable from the high_level module.
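
For illustration, the requested behavior could be sketched roughly as below. This is a minimal standalone sketch, not the actual pdfminer.six code: `ExtractionNotAllowedError`, `ExtractionNotAllowedWarning`, and `ensure_extractable` are hypothetical names, and the `is_extractable` flag stands in for whatever the document object reports.

```python
import warnings


class ExtractionNotAllowedError(Exception):
    """Hypothetical stand-in for the library's error type."""


class ExtractionNotAllowedWarning(UserWarning):
    """Hypothetical stand-in for a dedicated warning category."""


def ensure_extractable(is_extractable: bool,
                       check_extractable: bool = False) -> None:
    """Raise or warn when a document disallows text extraction.

    check_extractable defaults to False, so by default the caller only
    gets a warning and processing can continue.
    """
    if is_extractable:
        return
    message = "Text extraction is not allowed"
    if check_extractable:
        # Strict mode: the caller explicitly asked for a hard failure.
        raise ExtractionNotAllowedError(message)
    # Lenient mode (the default): report the problem but keep going.
    warnings.warn(message, ExtractionNotAllowedWarning, stacklevel=2)
```

With the default of False, a high-level helper would not need to pass the argument explicitly, which is the point of the third suggestion.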

pdfminer/pdfpage.py

pietermarsman

Looks good.

Last thing: can you add a test that checks that the warning is issued and, more importantly, that the code completes without errors?
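
A test along these lines might look roughly like the sketch below. It is only an assumption-laden outline: `extract_text_stub` is a hypothetical stand-in for the real extraction call, and an actual regression test would run the library against a sample PDF that disallows extraction.

```python
import unittest
import warnings


def extract_text_stub(is_extractable: bool) -> str:
    """Hypothetical stand-in for the extraction call under test."""
    if not is_extractable:
        warnings.warn("Text extraction is not allowed", UserWarning)
    return "extracted text"


class TestNotExtractable(unittest.TestCase):
    def test_warns_and_completes(self):
        # The warning must be issued...
        with self.assertWarns(UserWarning):
            text = extract_text_stub(is_extractable=False)
        # ...and, more importantly, extraction must still return a result.
        self.assertEqual(text, "extracted text")
```

`assertWarns` fails the test if no matching warning is issued, and the final assertion covers the "completes without errors" half of the request.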

madhurcodes · 2020-07-07T19:46:41Z

Hey, as mentioned in the PR, the PDFs I have that trigger this warning are copyright-protected, so I can't upload them here for testing. I have tested on my machine that it raises the correct warning.

The code is really basic, and it should be clear that there is no functional change, so IMO it should not require a new test, since the earlier error-throwing code was also untested.

pietermarsman · 2020-07-07T21:10:11Z

I tried changing the content of a sample encrypted document, but apparently it's hard to change the password of an encrypted file using only vim 😖

I should have looked at the issue earlier: an example PDF was posted there by @Recursing, and you could use that one. I also noticed that the encrypted samples already in the repo are never used for testing; these can be used as well.

A lot of functionality of pdfminer.six is still untested. Having a test suite for all functionality makes it easier to add code, and to change/improve implementations in the future. That's why I'm asking. I do believe that this code works as intended 😄

If you have time to add the sample pdf from the issue, please do so. Otherwise I will find some time to do it myself and then merge this. Thanks for all the efforts so far!

pietermarsman · 2020-07-11T14:04:36Z

Thanks @madhurcodes

madhurcodes added 3 commits July 3, 2020 06:03

Changed error to warning for 'Text extraction is not allowed'

9cfb34e

updated changelog

9059c3d

fix lint

5eeba27

madhurcodes mentioned this pull request Jul 3, 2020

Add check_extractable argument to high_level.extract_text #350

Closed

pietermarsman requested changes Jul 5, 2020

pdfminer/pdfpage.py (resolved)

pdfminer/pdfpage.py (outdated, resolved)

made changes suggested in review

d4fb41f

madhurcodes requested a review from pietermarsman July 5, 2020 18:08

pietermarsman requested changes Jul 7, 2020

Update CHANGELOG.md

82c8150

Add regression test for failing pdf

61dccd9

pietermarsman approved these changes Jul 11, 2020

Reduce line length to <80

7a2220f

pietermarsman merged commit 6a9269b into pdfminer:develop Jul 11, 2020

dufferzafar mentioned this pull request Nov 18, 2021

Raise warning instead of error when doc is not extractable metebalci/pdftitle#27

Closed

DaBeIDS mentioned this pull request Sep 8, 2023

Documentation added and extraction issue resolved os-climate/corporate_data_extraction#24

Merged
