Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change narrow char load_file() interface to assume UTF-8 on Windows. #45

Open
ybungalobill opened this issue Jun 27, 2015 · 7 comments
Open

Comments

@ybungalobill
Copy link

  • Narrow-char interface on Windows is useless because it automatically makes its users Unicode-broken.
  • This would enable supporting Unicode seamlessly in portable code.
  • Would let us open XMLs referenced from another XML document, external databases or configuration files (all UTF-8), without platform dependent re-encoding logic.
  • See how SQLite does it. I had never ever worried about Unicode with SQLite, because of this.
@zeux
Copy link
Owner

zeux commented Jun 27, 2015

Using UTF-8 instantly breaks compatibility with any CRT functions.

For example, on Windows, in a portable program, you can do doc.load_file(argv[1]). And it would instantly fail for any file name that contains localized characters. Without UTF-8 conversion (aka 'right now') it would succeed for any file name that contains characters in the system locale - which is worse than supporting any file name but better than nothing.

In general it's extremely surprising to have the function that loads/saves the file behave differently from the two mechanisms available for that in C++ (fopen and std::fstream). When I get reports of load_file() returning file_not_found but the file being there, I ask to use fopen to show that the issue is not with pugixml but with the calling environment.

Having an option of loading a file by UTF-8 path makes sense. You can use doc.load_file(pugi::as_wide(...).c_str()) right now, but this is of course a bit more cumbersome than doc.load_file_utf8() or some equivalent.

@ybungalobill
Copy link
Author

Yes, this is true. But for main() or fopen(), solutions do exist. Using as_wide(...) manually is a pain, because I want it to work this way on Windows, but on Unixes it is the wrong thing to do. Today I always work with a patched version of pugixml on Windows.

I am aware of the compatibility problems with existing code. Personally I say that such code is broken anyway -- what's the difference of being 99.9% broken or 99.7% broken?

Yet, from pugi's POV, it would be enough to have it as a non-default compile time switch (i.e. load_file() assumes legacy broken ANSI versus UTF-8). I appreciate your interest of keeping this functionality as a defensive measure against people who don't understand encodings, but keep in mind that some libraries are capable to manage it anyway.

@zeux
Copy link
Owner

zeux commented Jun 27, 2015

on Unixes it is the wrong thing to do

Why? as_wide will work on Unixes as well. (of course, it's slightly less efficient than just passing the UTF-8 path directly to fopen)

Personally I say that such code is broken anyway -- what's the difference of being 99.9% broken or 99.7% broken

I disagree with the percentages here. The amount of applications that use Windows ANSI APIs is staggering. A lot of them also work "good enough" for actual users wrt path encoding. pugixml takes a stance on being practical and solving issues that people encounter as opposed to spearheading UTF8 (I would love for UTF8 to be the C++ encoding, believe me).

In practical terms this means that I'm happy to provide "advanced" functionality for those who need it but I don't want to force all users into something they may get confused by.

@ybungalobill
Copy link
Author

Why? as_wide will work on Unixes as well.

The problem is not in as_wide. It's just that Unix treats filenames as blobs. See Boost.Nowide FAQ or this blog about Python v3 Unicode fail.

I disagree with this. The amount of applications that use Windows ANSI facilities is staggering. A lot of them also work "good enough" for actual users wrt file encoding.

It is true that there is simply a lot of broken code. But the reason this code works "good enough" is because it is fed with ASCII most of the time. Give it a directory of eBooks and it will fail. Change the system locale and it will fail. And with load_file as UTF-8 it will continue working with ASCII as before.

pugixml takes a stance on being practical and solving issues that people encounter as opposed to spearheading UTF8 (I would love for UTF8 to be the C++ encoding, believe me).

How helping those who use UTF-8 for narrow strings in C++ isn't practical? Besides you gave none arguments against supporting this as a compile-time flag.

I also would like to add that the argument that load_file would behave differently from fopen is weak. Note that current pugixml, when compiled with narrow chars, uses UTF-8 for almost all of its interfaces (the exceptions are filenames and the parsing which detects the encoding). So in some sense this code

cout << n.child_value();
fopen(n.child_value(), "rb");

suffers from the same compatibility with CRT problems as you claim load_file would. And, btw, compiling for wide-chars and using wcout won't solve the above.

@zeux
Copy link
Owner

zeux commented Jun 27, 2015

It's just that Unix treats filenames as blobs.

Right. But then load_file won't use UTF-8 as an input! (e.g. it will accept malformed UTF-8 sequences) Right now the contract is "it takes the file name in whatever encoding the underlying system uses". Which is not UTF-8 on Unix systems (hopefully it's a superset) and not UTF-8 on Windows. The question is what's of more practical use, have default load/save functions behave as all other file access functions do or try to have a UTF-8 superset on all platforms.

But the reason this code works "good enough" is because it is fed with ASCII most of the time.

Most of the files that users will create will use their default locale. This is not true for some cases but this is overwhelmingly true. This of course is not an argument for Windows ANSI over UTF-8 - this is just to say that this is a reasonable compromise given the history of those APIs.

So in some sense this code suffers from the same compatibility with CRT problems as you claim load_file would.

If you use pugixml you can either know about the issues that Unicode brings or ignore them.
If you open any files without using pugixml in the same program, the difference will become apparent.

Sure, fopen on n.child_value() will fail to open localized paths if the document is encoded using UTF-8 (some users use pugixml to decode ANSI-encoded documents, but the majority probably sticks to UTF-8). But so will doc.load_file. And getting the filename through any other means will fail in the same manner. Essentially, narrow-char load_file is as "broken" as any other file opening API in C++.

Now, if you want to solve this problem, you must know you have it. You would not use fopen or std::fstream to open files. Maybe you'd use Boost or some other library or home-grown conversion primitives - regardless, if you know about this problem it's easy to see why load_file won't work - hey, no other API does!

If you don't know anything about these problems however, I'm not sure that making load_file take UTF-8 paths solves any issues for you. Because if you open any other files they may not work. So sure, if your XML file contains a reference to an XML file, that would work. But if it contains a reference to a .PNG file that you have to decode or a .log file that is opened separately, that won't. Which would force you to learn about the problems of file access anyway.

Right now it feels that the approach of transparently handling file access as UTF-8 just introduces extra incompatibilities. Are you aware of any libraries other than SQlite (and of course Boost.NoWide which is a different case - when you use that you solve that exact problem) that use UTF8-encoded paths on Windows?

And, btw, compiling for wide-chars and using wcout won't solve the above.

This is probably not relevant to this issue, but why not?

@ybungalobill
Copy link
Author

Right. But then load_file won't work on UTF-8! [...]

It would work on UTF-8 on Windows, because no native narrow encoding on Windows is capable of representing all valid filenames supported by that same system. For other systems it simply does not matter as much, because it's user's choice to use something that isn't Unicode compatible.

[...] some users use pugixml to decode ANSI-encoded documents, but the majority probably sticks to UTF-8

Don't latest versions of pugixml detect and convert ISO-8859-1 to UTF-8?

[...] Which would force you to learn about the problems of file access anyway.

Yes. But after you learnt, what's next? You would have to reinvent the wheel and hope for the best.

Are you aware of any libraries other than SQlite [...] that use UTF8-encoded paths on Windows?

GDAL, say.

Sure, most libraries out there simply write standard C and don't provide any Windows support at all. We have to patch them manually or, if they're closed source, revert to other hacks.

@zeux
Copy link
Owner

zeux commented Jun 27, 2015

Yes. But after you learnt, what's next? You would have to reinvent the wheel and hope for the best.

You use as_wide. The fact that it can't open some valid paths on Unix seems irrelevant since the XML should only contain Unicode data anyway.

I want to think about this more. I also created a SO question http:https://stackoverflow.com/questions/31092351/should-the-file-opening-interface-in-a-c-library-use-utf-8-on-windows to gather more opinions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants