Change narrow char load_file() interface to assume UTF-8 on Windows. #45
Using UTF-8 instantly breaks compatibility with any CRT functions. For example, on Windows, in a portable program, you can do doc.load_file(argv[1]). It would instantly fail for any file name that contains localized characters. Without UTF-8 conversion (i.e. right now) it succeeds for any file name whose characters are in the system locale - which is worse than supporting any file name, but better than nothing.

In general it's extremely surprising to have the function that loads/saves the file behave differently from the two mechanisms available for that in C++ (fopen and std::fstream). When I get reports of load_file() returning file_not_found with the file clearly there, I ask the user to try fopen to show that the issue is not with pugixml but with the calling environment.

Having an option of loading a file by a UTF-8 path makes sense. You can use doc.load_file(pugi::as_wide(...).c_str()) right now, but this is of course a bit more cumbersome than doc.load_file_utf8() or some equivalent.
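As a concrete illustration of the workaround mentioned above: pugi::as_wide boils down to a UTF-8 to wide-character conversion. Below is a minimal standalone sketch of such a conversion - not pugixml's actual implementation - which decodes to UTF-32 for simplicity and rejects malformed input by throwing:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Minimal UTF-8 -> UTF-32 decoder, illustrating the kind of conversion a
// helper like pugi::as_wide has to perform before a narrow path can be
// handed to a wide-char API such as _wfopen on Windows.
// Sketch only: no overlong/surrogate validation, throws on malformed input.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        unsigned char c = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;
        if (c < 0x80)             { cp = c;        extra = 0; } // ASCII
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; } // 2-byte
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; } // 3-byte
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; } // 4-byte
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            unsigned char cc = static_cast<unsigned char>(in[i + k]);
            if ((cc & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cc & 0x3F); // accumulate 6 payload bits
        }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

On Windows one extra step (UTF-32 to UTF-16 with surrogate pairs) would be needed before calling _wfopen, which is exactly the part that makes doing this by hand at every call site cumbersome.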
Yes, this is true. But for main() or fopen(), solutions do exist. Using as_wide(...) manually is a pain, because I want it to work this way on Windows, but on Unixes it is the wrong thing to do. Today I always work with a patched version of pugixml on Windows.

I am aware of the compatibility problems with existing code. Personally I'd say such code is broken anyway - what's the difference between being 99.9% broken and 99.7% broken? Still, from pugi's point of view, it would be enough to have this as a non-default compile-time switch (i.e. load_file() assumes the legacy broken ANSI encoding versus UTF-8). I appreciate your interest in keeping the current behavior as a defensive measure for people who don't understand encodings, but keep in mind that some libraries manage to handle it anyway.
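The proposed non-default compile-time switch could look something like the sketch below. The macro name PUGIXML_LOAD_FILE_UTF8 and the helper open_file are invented here purely for illustration; pugixml defines neither:

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical sketch: when PUGIXML_LOAD_FILE_UTF8 is defined, a narrow
// path is treated as UTF-8 and converted to UTF-16 for _wfopen on Windows.
// Otherwise the path goes straight to fopen - the current ANSI behavior.
// On non-Windows platforms both variants are just fopen, since the
// filesystem stores path bytes as-is.
FILE* open_file(const char* path, const char* mode)
{
#if defined(_WIN32) && defined(PUGIXML_LOAD_FILE_UTF8)
    // Convert the UTF-8 path to UTF-16 using the Win32 API.
    int wlen = MultiByteToWideChar(CP_UTF8, 0, path, -1, nullptr, 0);
    if (wlen <= 0) return nullptr; // malformed UTF-8 or empty input
    std::wstring wpath(static_cast<size_t>(wlen), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, path, -1, &wpath[0], wlen);

    // Mode strings are ASCII, so widening byte-by-byte is safe.
    std::wstring wmode(mode, mode + std::strlen(mode));
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    return fopen(path, mode); // legacy/ANSI behavior, or POSIX byte path
#endif
}
```

Under this scheme existing code keeps its current behavior, and opting in is a single preprocessor definition rather than an as_wide call at every load_file site.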
Why? as_wide will work on Unixes as well. (of course, it's slightly less efficient than just passing the UTF-8 path directly to fopen)
I disagree with the percentages here. The amount of applications that use Windows ANSI APIs is staggering. A lot of them also work "good enough" for actual users with respect to path encoding. pugixml takes a stance on being practical and solving issues that people actually encounter, as opposed to spearheading UTF-8 (I would love for UTF-8 to be the C++ encoding, believe me). In practical terms this means that I'm happy to provide "advanced" functionality for those who need it, but I don't want to force all users into something they may get confused by.
The problem is not in as_wide. It's just that Unix treats filenames as blobs. See Boost.Nowide FAQ or this blog about Python v3 Unicode fail.
It is true that there is simply a lot of broken code. But the reason this code works "good enough" is because it is fed with ASCII most of the time. Give it a directory of eBooks and it will fail. Change the system locale and it will fail. And with load_file as UTF-8 it will continue working with ASCII as before.
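The "Unix treats filenames as blobs" point above can be demonstrated on a POSIX system: a path containing bytes that are never valid UTF-8 is still a perfectly legal file name, which is why a strict "narrow strings are UTF-8" contract cannot cover every Unix path. A small sketch (POSIX-only; the specific 0xFF byte and the /tmp location are assumptions for demonstration):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Creates and reopens a file whose name contains 0xFF - a byte that can
// never appear in valid UTF-8. On POSIX this works because the kernel
// compares path names as raw bytes, without interpreting any encoding.
bool roundtrip_blob_filename()
{
    std::string name = "/tmp/blob_\xFF_demo"; // 0xFF: invalid in UTF-8

    FILE* f = fopen(name.c_str(), "wb");
    if (!f) return false;
    fputs("x", f);
    fclose(f);

    f = fopen(name.c_str(), "rb"); // reopens fine: byte-for-byte match
    bool ok = (f != nullptr);
    if (f) fclose(f);
    remove(name.c_str());
    return ok;
}
```

On Windows the situation is inverted: such a name could not be expressed as UTF-8 at all, while NTFS file names (arbitrary UTF-16 units) cannot all be expressed in any narrow ANSI code page - which is the asymmetry this whole thread revolves around.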
How is helping those who use UTF-8 for narrow strings in C++ not practical? Besides, you gave no arguments against supporting this as a compile-time flag. I would also add that the argument that load_file would behave differently from fopen is weak. Note that current pugixml, when compiled with narrow chars, uses UTF-8 for almost all of its interfaces (the exceptions being filenames and the parsing step, which detects the encoding). So in some sense this code already suffers from the same CRT compatibility problems you claim load_file would. And, by the way, compiling for wide chars and using wcout won't solve the above.
Right. But then load_file won't actually require UTF-8 as input (e.g. it will accept malformed UTF-8 sequences). Right now the contract is "it takes the file name in whatever encoding the underlying system uses", which is not UTF-8 on Unix systems (hopefully it's a superset) and not UTF-8 on Windows. The question is what's of more practical use: have the default load/save functions behave as all other file access functions do, or try to have a UTF-8 superset on all platforms.
Most of the files that users will create will use their default locale. This is not true for some cases but this is overwhelmingly true. This of course is not an argument for Windows ANSI over UTF-8 - this is just to say that this is a reasonable compromise given the history of those APIs.
If you use pugixml you can either know about the issues that Unicode brings or ignore them. Sure, fopen on n.child_value() will fail to open localized paths if the document is encoded using UTF-8 (some users use pugixml to decode ANSI-encoded documents, but the majority probably sticks to UTF-8). But so will doc.load_file. And getting the filename through any other means will fail in the same manner. Essentially, narrow-char load_file is as "broken" as any other file opening API in C++.

Now, if you want to solve this problem, you must first know you have it. You would not use fopen or std::fstream to open files. Maybe you'd use Boost or some other library or home-grown conversion primitives - regardless, if you know about this problem it's easy to see why load_file won't work - hey, no other API does! If you don't know anything about these problems, however, I'm not sure that making load_file take UTF-8 paths solves any issues for you, because any other files you open may not work. So sure, if your XML file contains a reference to an XML file, that would work. But if it contains a reference to a .PNG file that you have to decode, or a .log file that is opened separately, that won't - which would force you to learn about the problems of file access anyway.

Right now it feels that the approach of transparently handling file access as UTF-8 just introduces extra incompatibilities. Are you aware of any libraries other than SQLite (and of course Boost.Nowide, which is a different case - when you use that, you solve that exact problem) that use UTF-8-encoded paths on Windows?
This is probably not relevant to this issue, but why not?
UTF-8 would work on Windows, because no native narrow encoding on Windows is capable of representing all valid filenames supported by that same system. For other systems it simply does not matter as much, because it's the user's choice to use something that isn't Unicode compatible.
Don't latest versions of pugixml detect and convert ISO-8859-1 to UTF-8?
Yes. But after you've learned about it, what's next? You would have to reinvent the wheel and hope for the best.
GDAL, say. Sure, most libraries out there simply write standard C and don't provide any Windows support at all. We have to patch them manually or, if they're closed source, resort to other hacks.
I want to think about this more. I also created a SO question https://stackoverflow.com/questions/31092351/should-the-file-opening-interface-in-a-c-library-use-utf-8-on-windows to gather more opinions.