Important issue with non-latin filenames of attachments #1736
Comments
I have created a pull request with a fix for this issue (it adds a protected safeBasename() method). |
This doesn't seem to be true for me: $a = 'Звіт з дослідження.txt';
var_dump(basename($a), pathinfo($a)); gives:
Can you point at a PHP bug report that confirms this? |
Honestly, I did not search for a PHP bug report, but it's a well-known issue. Just check the canonical manual at https://php.net/manual/function.basename.php and see the user comments with many workarounds like mb_basename(), or the anonymous report of exactly the same issue
(Although rawurldecode() is actually not a solution.) And the problem appears only in a Unix environment. Everything works fine on my local Windows PC. |
OK, I did some more research. The problem has been described a lot in the Cyrillic segment of the Internet: https://php.ru/forum/threads/basename-vrjot-kogda-v-imenax-fajla-russkie-bukvy.56576/
In English the problem is less well known, but it still exists: https://www.drupal.org/project/drupal/issues/278425
The solution is to forcibly set the locale with "setlocale()". The following line can solve my problem:
However, it will not work if the locale is not set, or was reset with
Also, Cyrillic filenames will not work with Chinese locales, although only the first character is removed instead of the first word:
Output: віт з результатами.txt The correct and locale safe |
It concerns me that your workaround uses only non-UTF-8-aware functions, treating the string as a stream of bytes and not characters. I think it needs many more test cases to be certain - for example, does it make a difference if the string ends with multibyte chars, or contains multibyte chars elsewhere in the path, or escaped slashes - for example, does this (a valid path) work correctly?:
That works correctly for me using basename:
|
\ and / characters are always one byte long (/ is 0x2F, \ is 0x5C), so the proposed function is 100% Unicode-safe. There is no need to use mb_strrpos/mb_substr. UPD. I see the problem. Yes, safeBasename("Звіт з\ результатами.txt\") will return an empty string because of the trailing slash character. I'm going to fix it and propose a function which acts the same way as basename()... |
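A minimal sketch of the byte-oriented approach discussed above, with the trailing-separator case handled. The name sketchBasename() is hypothetical; this is neither the function from the pull request nor PHPMailer's code. It assumes both "/" and "\" should count as separators on every platform, which differs from basename() on Unix, where a backslash is a valid filename character.

```php
<?php
// Hypothetical helper: returns the last path component without consulting the locale.
function sketchBasename(string $path): string
{
    // Drop trailing separators so "dir/file.txt/" still yields "file.txt".
    $trimmed = rtrim($path, "/\\");
    if ($trimmed === '') {
        return '';
    }
    // Find the last "/" or "\" byte; -1 means "no separator found".
    $slash = strrpos($trimmed, '/');
    $backslash = strrpos($trimmed, '\\');
    $pos = max($slash === false ? -1 : $slash, $backslash === false ? -1 : $backslash);

    // Everything after the last separator is the filename.
    return substr($trimmed, $pos + 1);
}

echo sketchBasename('/tmp/Звіт з дослідження.txt'), PHP_EOL;
```

Because the function only searches for two single-byte ASCII values, it does not need the mb_* functions and does not depend on setlocale().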
I suspect that whatever you write will need to take locale into account, which is extremely difficult - for example some locales use
FWIW, I'm on a Unix system (macOS) and I have no problems with these strings. |
I found a PHP bug report about this, and the conclusion from that was essentially:
So if your locale is not set correctly, then basename may not work correctly either - and nor will anything else that relies on locale, so fixing basename is hiding the problem, not fixing it, and is thus not an appropriate solution. |
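To see which locale a script is actually running under, something like the following sketch can help. The helper names and the candidate locale names ('C.UTF-8', 'en_US.UTF-8') are assumptions; which locales exist depends on the server, and setlocale() changes the setting for the whole process.

```php
<?php
// Hypothetical helper: report the current LC_CTYPE locale without changing it.
// Passing "0" as the locale asks setlocale() to return the current setting only.
function currentCtypeLocale(): string
{
    return (string) setlocale(LC_CTYPE, '0');
}

// Hypothetical helper: try each candidate in order.
// Returns the name of the locale that was set, or false if none is available.
function tryCtypeLocales(array $candidates)
{
    return setlocale(LC_CTYPE, $candidates);
}

echo 'Current LC_CTYPE: ', currentCtypeLocale(), PHP_EOL;
$result = tryCtypeLocales(['C.UTF-8', 'en_US.UTF-8']);
echo $result === false ? 'No UTF-8 locale available' : "Switched to $result", PHP_EOL;
```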
You said:
That's true, but those bytes can also appear within valid UTF-8 characters - for example ȯ is 0x022F, and Ŝ is 0x015C. |
No, all characters below 0x80 are single-byte characters. In UTF-8, Ŝ is actually 0xC59C and ȯ is 0xC8AF. |
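The byte values are easy to check. The sketch below (helper names are illustrative) prints the UTF-8 bytes of the two characters mentioned above and confirms that neither contains a 0x2F or 0x5C byte; it uses \u{...} escapes, so it assumes PHP 7 or later.

```php
<?php
// Illustrative helper: the bytes of a string as lowercase hex.
function utf8Hex(string $text): string
{
    return bin2hex($text);
}

// Illustrative helper: true if a "/" (0x2F) or "\" (0x5C) byte occurs anywhere in the string.
function containsSeparatorByte(string $text): bool
{
    return strpbrk($text, "/\\") !== false;
}

echo utf8Hex("\u{015C}"), PHP_EOL; // c59c (U+015C, Ŝ)
echo utf8Hex("\u{022F}"), PHP_EOL; // c8af (U+022F, ȯ)
var_dump(containsSeparatorByte("\u{015C}\u{022F}")); // bool(false)
```

Every byte of a multibyte UTF-8 sequence is 0x80 or higher, so a byte-wise search for "/" or "\" cannot match in the middle of a character.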
OK, this makes sense. The specific locale is really not configured on my server. I just tested However... my default locale is correct and not misconfigured. :) It's set to "C" by default. C (ISO C) is the absolutely standard default locale. See https://www.gnu.org/software/libc/manual/html_node/Standard-Locales.html#Standard-Locales
I'm completely aware that my CentOS doesn't support filenames which contain non-Latin characters, and I never keep files with Cyrillic characters on the physical drive. So "C" (ISO C) is the correct locale for my internal environment (filenames and paths), and I have no reason to change it.
However, my website allows visitors to download, and send by email, files with names that contain any UTF-8 characters. Not only Cyrillic characters; "Звіт з результатами.txt" is just a particular case. Even emoji characters should be OK in filenames. And these files do not actually exist in the file system. They are all generated on the fly and can be downloaded or sent by email.
I actually don't care whether the path is stripped from the filename or not. I never provide a full path to So let's just take the filename after the slashes. Or better, let's remove any path before slash or backslash characters and remove all characters that are not allowed in filenames: <>:"/\|?* |
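A minimal sketch of the suggestion above: drop everything up to the last slash or backslash, then remove the listed characters. The name sketchAttachmentName() is hypothetical and this is not PHPMailer code. The regex deliberately omits the /u modifier so that it works on raw bytes and does not fail on invalid UTF-8.

```php
<?php
// Hypothetical helper: reduce a user-supplied name to a bare, conservative filename.
function sketchAttachmentName(string $name): string
{
    // Remove any path: everything up to and including the last "/" or "\".
    $name = (string) preg_replace('~^.*[/\\\\]~s', '', $name);

    // Remove the characters listed above: < > : " / \ | ? *
    return str_replace(['<', '>', ':', '"', '/', '\\', '|', '?', '*'], '', $name);
}

echo sketchAttachmentName('/var/tmp/Звіт з результатами.txt'), PHP_EOL;
```

Note that a name ending in a separator, such as "a/b/", is reduced to an empty string here, so a caller would probably want a fallback name for that case.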
I spotted that we've been here before - PHPMailer already has an
You're quite right on the UTF-8 thing - I was mixing it up with the code points. |
Yes! Confirming that But it's not enough, now that I've realized that attached filenames may contain invalid characters. :) Here is an alternative solution:
Many email clients will fail to open files whose names contain <>:"/\|?* |
I'm not sure that's a good idea. The email RFCs do not impose any such limits on names, and you should be free to include those chars if you want - I certainly don't think we should remove them by default. PHPMailer should be transparent to this kind of issue - that some email clients may have a problem with it is not PHPMailer's fault. |
No problem! This is just my suggestion to make the filenames really safe, and I will use it in my wrapper over PHPMailer, just to be sure. In any case UPD. I just read in the docs that <>:"/\|?* are reserved characters only on Windows; on Linux, OS X, etc. all of them are allowed except 0x00 and the slash (/). I will use it for my own purposes in the wrapper, but it's really not required in canonical PHPMailer. |
Thanks for reporting and your help with this. I've pushed the changes for this, also cleaned up the regex a little. Historically macOS used |
Hello! It's a known issue that the standard "basename()" under Unix OSes (just tested again on CentOS to be sure) doesn't support Cyrillic filenames.
For example, basename("Звіт з дослідження.txt") returns " з дослідження.txt". It loses the first word. Thus all attachments get incorrect/broken filenames if they were specified in a non-Latin character set, like Cyrillic or Greek.
To avoid basename(), programmers use different workarounds. Personally, I use the following self-written custom function to get the filename without the path:
Please implement this in PHPMailer.php so that I can use it without kludges and workarounds, like prepending an extra space to the filename.
Thanks!