UTF-8 characters in authors leads to a crash #161

Open
CFoltin opened this issue Oct 19, 2017 · 14 comments

@CFoltin

CFoltin commented Oct 19, 2017

Hi,

We had author names containing German letters such as 'ö'. They crashed the application during text output.

To correct the bug, I've changed line 151 in changesoutput.py to the following line:

			print(str(i.encode(sys.stdout.encoding, errors='replace')).ljust(20), end=" ")

and imported sys.
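
For illustration, the effect of `errors='replace'` can be sketched in Python 3 as follows. This is a minimal sketch rather than the patch above: the helper name `to_terminal_safe` and the sample name are made up, and the bytes are decoded again so that `print` receives text instead of a bytes representation.

```python
import sys


def to_terminal_safe(text, encoding):
    """Return `text` with characters that `encoding` cannot represent replaced by '?'."""
    # encode(..., errors="replace") substitutes '?' for each unencodable character;
    # decoding with the same codec turns the bytes back into printable text.
    return text.encode(encoding, errors="replace").decode(encoding)


if __name__ == "__main__":
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = sys.stdout.encoding or "ascii"
    print(to_terminal_safe("Jörg Müller", encoding).ljust(20), end=" ")
    print()
```

On a terminal that cannot represent 'ö', the name is printed with '?' in its place, so the output is lossy but the run continues.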

HTH, Chris

@adam-waldenberg
Member

Hi.

What encoding is the terminal using?
What version of Python are you using?
What exception are you getting?

You are doing errors="replace" here which works around the problem and doesn't really "solve" it. I suspect your terminal is not really set to UTF-8 and this is the actual reason for your issues.
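
The first two questions can be answered with a few lines of Python 3. This is a minimal sketch; the function name `describe_environment` is hypothetical and not part of gitinspector.

```python
import locale
import platform
import sys


def describe_environment():
    """Collect the Python version and the encodings that affect terminal output."""
    return {
        "python_version": platform.python_version(),
        # Encoding Python uses when writing to stdout; may be None if undetected.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings (LC_ALL, LANG, ...).
        "preferred_encoding": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("{}: {}".format(key, value))
```

If `stdout_encoding` is not a Unicode encoding such as UTF-8, characters outside that encoding cannot be printed without replacement.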

@CFoltin
Author

CFoltin commented Nov 2, 2017

Hi, sorry for the late reply.

Encoding:
$ python -c "import sys; print(sys.stdout.encoding)"
ISO-8859-1

Python version:
$ python --version
Python 2.7.6

Error message:

Traceback (most recent call last):
  File "/home/user/Downloads/gitinspector-master/gitinspector.py", line 24, in <module>
    gitinspector.main()
  File "/home/user/Downloads/gitinspector-master/gitinspector/gitinspector.py", line 206, in main
    run.process(repos)
  File "/home/user/Downloads/gitinspector-master/gitinspector/gitinspector.py", line 83, in process
    outputable.output(ChangesOutput(summed_changes))
  File "/home/user/Downloads/gitinspector-master/gitinspector/output/outputable.py", line 43, in output
    outputable.output_text()
  File "/home/user/Downloads/gitinspector-master/gitinspector/output/changesoutput.py", line 151, in output_text
    print(terminal.ljust(i, 20)[0:20 - terminal.get_excess_column_count(i)], end=" ")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\ufffd' in position 2: ordinal not in range(256)

With the following settings the wrong character is displayed, but the run no longer crashes:
$ export LC_ALL=de_DE.utf8
$ export LANG="$LC_ALL"
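
The codec named in the error message can be exercised directly. The sketch below (Python 3; `can_encode` is a hypothetical helper) shows that Latin-1 can represent 'ö' but not U+FFFD, the Unicode replacement character. That the failing character is U+FFFD rather than 'ö' may mean the name had already been mis-decoded before it reached `print`, although nothing in the traceback confirms this.

```python
def can_encode(text, encoding):
    """Return True if every character in `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # 'ö' is U+00F6, which lies inside the 256 code points of Latin-1.
    print(can_encode("ö", "latin-1"))
    # U+FFFD lies outside that range, so strict encoding fails.
    print(can_encode("\ufffd", "latin-1"))
```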

Thanks, Chris

@amorphius

I have the same issue, even though I have a UTF-8 locale.

$ gitinspector --version
...
...
    raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8

@M3kH

M3kH commented Nov 21, 2017

I would say that this could solve it.

Add to ~/.bash_profile

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

or whatever preference you want.
Seems to be just a generic issue with Python on Mac.

@adam-waldenberg
Member

@CFoltin Hi.

Sorry. I forgot about this issue. In any case - it's completely normal. If your terminal can't support the character, python has no way of outputting it.

The export should do the trick though. But maybe something is still not set to UTF-8. You can try setting PYTHONIOENCODING to utf8 or redirecting to a file - in which case these problems should never occur.
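
The effect of the `PYTHONIOENCODING` environment variable can be checked from Python itself. The sketch below (Python 3; the function name `child_stdout_encoding` is hypothetical) starts a child interpreter with the variable set and returns the stdout encoding that the child reports.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and return its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    reported = result.stdout.decode("ascii").strip()
    # Normalise spellings such as "utf8" and "UTF-8" to the codec's canonical name.
    return codecs.lookup(reported).name


if __name__ == "__main__":
    print(child_stdout_encoding("utf8"))
```

The same variable can be set in the shell before starting gitinspector, which overrides whatever encoding Python would otherwise derive from the locale.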

@devcurmudgeon

Ironically I just hit this same issue today. I don't think it's "invalid", tbh.

imo gitinspector shouldn't crash just because it hits an odd character in git metadata...

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/gitinspector/blame.py", line 113, in run
    self.handle_blamechunk_content(row)
  File "/usr/local/lib/python2.7/dist-packages/gitinspector/blame.py", line 81, in handle_blamechunk_content
    author = self.changes.get_latest_author_by_email(self.blamechunk_email)
  File "/usr/local/lib/python2.7/dist-packages/gitinspector/changes.py", line 186, in get_latest_author_by_email
    name = name.decode("unicode_escape", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0153' in position 15: ordinal not in range(128)

@adam-waldenberg
Member

adam-waldenberg commented Nov 24, 2017

@devcurmudgeon

There are only a few options here.

  1. Ignore it and replace all characters that can not be outputted - this is something I'd rather not do, as the output will be invalid.
  2. Catch it and print out exactly the same error message ;)... Or a similar one. No point in that.
  3. Catch it, inform the user and output it with invalid characters replaced. Problem is that this will garble the output in the terminal (even if you use stderr for the warning).

This has been discussed so many times before (and not only in this project, mind you). I think it's better to just leave it, as these exceptions are very informative in python - it's also a common issue. If you plan on outputting unicode characters you best have a terminal set up to handle it. In your case, it's configured for ascii output and you are trying to output a œ character.

We actually have the following function,

def check_terminal_encoding():

This warns on mis-configured terminals that return "None" as encoding. However, it does not warn on ascii. Ascii may actually be OK if you happen to have a repo that only outputs standard ascii characters when you run gitinspector.
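
The behaviour described here can be sketched roughly as follows. This is not the actual body of `check_terminal_encoding()`; the function names and the warning text below are assumptions made for illustration (Python 3).

```python
import sys


def terminal_encoding_warning(stream=None):
    """Return a warning if the stream reports no encoding, otherwise None."""
    stream = sys.stdout if stream is None else stream
    if getattr(stream, "encoding", None) is None:
        return "WARNING: The terminal encoding could not be detected."
    # An ASCII terminal is deliberately not flagged: it is fine for ASCII-only output.
    return None


def first_unencodable(names, encoding):
    """Return the first name that `encoding` cannot represent, or None if all fit."""
    for name in names:
        try:
            name.encode(encoding)
        except UnicodeEncodeError:
            return name
    return None
```

The second helper illustrates why ASCII alone is not a reliable signal: whether output fails depends on the names in the repository, not only on the terminal.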

@CFoltin
Author

CFoltin commented Nov 25, 2017

Hi,
this problem cost me about 3 hours to fix, so I would propose improving the user experience. Catching the error and printing a remedy would be far better than leaving the user with a stack trace, which in my case appeared only after the ~20 min the tool needed to analyze the repo.

Just, my 2 cents.

BTW: In any case, this error message seems to be found....

@devcurmudgeon

devcurmudgeon commented Nov 25, 2017

@CFoltin i'm with you. I was thinking of using gitinspector as part of a CI pipeline that builds a custom Linux distro ... approximately 700 repos. it worked well on some samples... but then crashed our pipeline, on a linux-api-headers repo, a couple of hours into the run.

Upstream folks obviously can make their own choices about what they want to fix, but as a user I'm not interested in informative python stack-traces, I just want working software :-)

Note - i've taken plenty of heat from users moaning about stack traces in my own projects :-)

@devcurmudgeon

@adam-waldenberg thanks for your reply. I may be wrong but I think you missed at least a couple of options:

  • force utf-8 if the environment is not set
  • skip any data that you can't handle but still generate results for all the rest of the data

Crashing a whole run because of an unexpected (but valid) character in a git repo's metadata doesn't seem like correct behaviour to me
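
The two suggestions could be combined roughly as sketched below. This is an illustration under stated assumptions, not gitinspector code: the helper name `open_text_output` is hypothetical, and replacing unencodable characters stands in for "skipping" them (Python 3).

```python
import io


def open_text_output(binary_stream, detected_encoding):
    """Wrap a binary stream, falling back to UTF-8 when no encoding was detected."""
    encoding = detected_encoding or "utf-8"
    # errors="replace" means a single unencodable character cannot abort the run.
    return io.TextIOWrapper(binary_stream, encoding=encoding, errors="replace")


if __name__ == "__main__":
    # In a real program the binary stream would typically be sys.stdout.buffer.
    buffer = io.BytesIO()
    out = open_text_output(buffer, None)
    out.write("Jörg")
    out.flush()
    print(buffer.getvalue())
```

The trade-off is that the output can silently differ from the repository metadata whenever a character has to be replaced.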

@adam-waldenberg
Member

@devcurmudgeon UTF-8 is forced on redirection. I won't be forcing UTF-8 on terminal output, because it's not always needed. Output also needs to work on other environments with extended UTF-8, UTF-16 etc. Strictly speaking, it's only author names (and sometimes filenames) that can be an issue. Again, skipping data would mean you get an invalid output, which is not an option either.

@CFoltin You can only catch it once you encounter it, so even if you catch it, it would still take time before you know about it. Also, depending on what is wrong with the environment, there are a number of fixes that may or may not work.

In the end, it comes down to the fact that you can't know for sure what character set you may encounter in the repository. It can even be several ones.

One option I could live with is to catch it and print it out with replaced characters. We could then add a disclaimer at the end of the output stating that the output is not 100% correct and that it had to be modified to accommodate the terminal charset. However, I'm afraid it would raise even more questions, as you no longer have the Python exception to search on. Alternatively, the first/last exception encountered could also be included in the disclaimer.
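
A rough sketch of that option follows. The class name `LossyWriter` and the wording of the disclaimer are assumptions for illustration, not code from the project (Python 3).

```python
import io


class LossyWriter:
    """Write text to a stream, replacing unencodable characters and remembering the first error."""

    def __init__(self, stream):
        self.stream = stream
        self.first_error = None

    def write(self, text):
        try:
            self.stream.write(text)
        except UnicodeEncodeError as error:
            if self.first_error is None:
                self.first_error = error
            encoding = self.stream.encoding or "ascii"
            # Retry with unencodable characters replaced by '?'.
            self.stream.write(text.encode(encoding, errors="replace").decode(encoding))

    def disclaimer(self):
        """Return the closing note, or None if nothing had to be replaced."""
        if self.first_error is None:
            return None
        return ("NOTE: some characters were replaced to fit the terminal charset. "
                "First error: {}".format(self.first_error))


if __name__ == "__main__":
    raw = io.BytesIO()
    writer = LossyWriter(io.TextIOWrapper(raw, encoding="ascii"))
    writer.write("Jörg")
    writer.stream.flush()
    print(raw.getvalue())
    print(writer.disclaimer())
```

Including the first exception in the disclaimer keeps the original error text available for searching, which addresses the concern raised above.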

@adam-waldenberg
Member

I have decided to catch this exception and let the error message point to some of the issues here on the project page. This should help people who run into this problem understand and remedy it more effectively.
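
In outline, such a handler might look like the sketch below. The function name, the message text, and the placeholder URL are all assumptions; the thread does not say which pages the real message links to (Python 3).

```python
import sys

# Placeholder only: the actual link target is not given in this thread.
HELP_URL = "https://example.invalid/encoding-help"


def run_with_encoding_hint(main, stderr=None):
    """Run `main`, turning a UnicodeEncodeError into a short hint and a non-zero exit code."""
    stderr = sys.stderr if stderr is None else stderr
    try:
        main()
    except UnicodeEncodeError as error:
        stderr.write("Output failed: {}\n".format(error))
        stderr.write("The terminal encoding may not support these characters. "
                     "See {}\n".format(HELP_URL))
        return 1
    return 0
```

Keeping the original exception text in the message means users can still search for it, while the link gives them a starting point for fixing their environment.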

@banbar

banbar commented Nov 29, 2020

I have received a similar error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u011f' in position 380: character maps to <undefined>
Python version: 3.7

Found the solution: https://stackoverflow.com/a/57134096/1959766

@adam-waldenberg
Member

@banbar Thank you. I don't think that Windows-specific solution has been covered anywhere on the issue tracker so far. I know it's more of a Python and terminal thing than a gitinspector thing, but I'm considering creating a FAQ/wiki covering common environment-related issues that can be encountered, maybe linking into the issue tracker, etc.

Now and again this (or a related issue) keeps coming up.
