Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filename arguments with spaces don't work #31

Open
aslehigh opened this issue Jul 19, 2019 · 4 comments
Open

Filename arguments with spaces don't work #31

aslehigh opened this issue Jul 19, 2019 · 4 comments
Labels

Comments

@aslehigh
Copy link
Contributor

aslehigh commented Jul 19, 2019

If my filename contains one or more spaces (or, I suppose, any character the shell considers special), the program throws an error:

In [19]: fieldlist = pypdftk.dump_data_fields("test file.pdf")
Error: Unable to find file.
Error: Failed to open PDF file:
   test
Error: Unable to find file.
Error: Failed to open PDF file:
   file.pdf
Done.  Input errors, so no output created.
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-19-7f19f024450d> in <module>()
----> 1 fieldlist = p.dump_data_fields("test file.pdf")

/usr/local/lib/python3.5/dist-packages/pypdftk.py in dump_data_fields(pdf_path)
     94     # Or return bytes with : (will break tests)
     95     #    field_data = map(lambda x: x.split(b': ', 1), run_command(cmd, True))
---> 96     field_data = map(lambda x: x.decode("utf-8").split(': ', 1), run_command(cmd, True))
     97     fields = [list(group) for k, group in itertools.groupby(field_data, lambda x: len(x) == 1) if not k]
     98     return [dict(f) for f in fields]

/usr/local/lib/python3.5/dist-packages/pypdftk.py in run_command(command, shell)
     41 def run_command(command, shell=False):
     42     ''' run a system command and yield output '''
---> 43     p = check_output(command, shell=shell)
     44     return p.split(b'\n')
     45

/usr/local/lib/python3.5/dist-packages/pypdftk.py in check_output(*popenargs, **kwargs)
     35         if cmd is None:
     36             cmd = popenargs[0]
---> 37         raise subprocess.CalledProcessError(retcode, cmd, output=output)
     38     return output
     39

CalledProcessError: Command '/usr/bin/pdftk test file.pdf dump_data_fields' returned non-zero exit status 1

From the last line there I could see what the problem was: the filename is plopped into the command line unquoted. So I can work around the problem by including quote characters in the argument:

In [19]: fieldlist = pypdftk.dump_data_fields("'test file.pdf'")

I think this really should not be necessary. This behaviour would be confusing to new users, and makes the library more difficult to use.

After skimming through the source code, it looks like a minimal solution could be to simply add quote characters around the user-supplied filename arguments everywhere they are used. Quite likely there are other sorts of arguments that should be quoted on the command line also.

But I have another suggestion: why not just quote everything that goes on the command line? The run_command() function could take a list of items or tokens, and simply join them with a space between after adding the quote characters. Furthermore, it appears to me that the run_command() function is called with the same first argument every time. If this is true, why not include it in the function instead of requiring the caller to pass it? Obviously this would involve a lot more modification to the code, but I think it would simplify it on the whole, and potentially eliminate a whole class of possible bugs/user errors. I guess that would be a breaking change though for existing code that already adds quotes.

@revolunet revolunet added the bug label Sep 22, 2019
@revolunet
Copy link
Owner

Yes, indeed, this is confusing

@revolunet
Copy link
Owner

Tried to add quotes from the code but this doesnt seem to work :/

Curious how other libs manage this

@aslehigh
Copy link
Contributor Author

For the simplest possible solution I would suggest replacing command in the check_output function call in the run_command function (currently line 43 in pypdftk.py) with map(shlex.quote, command) ... and of course add the shlex import at the top.

This would accomplish my suggestion of quoting everything (and address a whole class of potentially security-related bugs in code using this library). It would be a breaking change though for anyone already quoting the arguments in their own code. But I'm guessing maybe not too many people have run into this or you would have heard about it before now........?

@tomasagjr
Copy link

Tried to add quotes from the code but this doesnt seem to work :/

Curious how other libs manage this

Possibly using f strings as follows... filename = f'"{os.path.join(path, "filename with spaces.pdf")}"'

Note, however, that if your f string starts (and ends) with single quotes, then any strings (incl. key names), inside the curly braces must be in double quotes, and vice versa. Otherwise, you'll raise a SyntaxError.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants