Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Important issue with non-latin filenames of attachments #1736

Closed
utilmind opened this issue May 26, 2019 · 16 comments
Closed

Important issue with non-latin filenames of attachments #1736

utilmind opened this issue May 26, 2019 · 16 comments

Comments

@utilmind
Copy link
Contributor

Hello! It's known issue that standard "basename()" under Unix OS's (just tested again for sure on CentOS) doesn't supports cyrillic filenames.

For example, basename("Звіт з дослідження.txt") returns " з дослідження.txt". It's losing the first word. Thus all attachments have incorrect/broken filenames if they was specified in some non-latin characters set, like cyrillic or greek.

To avoid basename() programmers using different workarounds. Personally I use following self-written custom function to get the filename without path:

function mybasename($f) { // it works both for Unix and Windows pathes with backslashes.
  return (($p = strrpos($f, '/')) === false) && (($p = strrpos($f, '\\')) === false) ? $f : substr($f, $p+1);
}

Please implement this into PHPMailer.php to let me use it from without kludges and workarounds, like prepending an extra-space to filename.

Thanks!

@utilmind
Copy link
Contributor Author

I have created pull request with the fix of this issue (added safeBasename() protected method).

@Synchro
Copy link
Member

Synchro commented May 27, 2019

This doesn't seem to be true for me:

$a = 'Звіт з дослідження.txt';
var_dump(basename($a), pathinfo($a));

gives:

string(38) "Звіт з дослідження.txt"
array(4) {
  'dirname' =>
  string(1) "."
  'basename' =>
  string(38) "Звіт з дослідження.txt"
  'extension' =>
  string(3) "txt"
  'filename' =>
  string(34) "Звіт з дослідження"
}

Can you point at a PHP bug report that confirms this?

@utilmind
Copy link
Contributor Author

utilmind commented May 27, 2019

Honestly I did not searched for PHP bug report, but it's very known issue.

Just check out canonical manual at https://php.net/manual/function.basename.php and see user comments with many workrounds like mb_basename() or anonymous report with exactly the same issue

instead of the name «ФЫВА-1234.doc», is displayed «-1234.doc».
Solution: rawurldecode(basename($full_name)).

(Although actually rawurldecode() is not solution.)

And the problem appears only in Unix environment. Everything works good in my local Windows PC.
I going to make some tests to figure out how to reporoduce this problem on any environment...

@utilmind
Copy link
Contributor Author

utilmind commented May 27, 2019

Ok, I did some more research. Found that problem was described a lot in cyrillic segment of Internet: https://php.ru/forum/threads/basename-vrjot-kogda-v-imenax-fajla-russkie-bukvy.56576/

In English the problem is less known but still exists: https://www.drupal.org/project/drupal/issues/278425

Solution is to forcibly set the locale with "setlocale()". Following line can solve my problem:
setlocale(LC_ALL, 'uk_UA');

However it will not work if locale is not set, or was reset with setlocale(LC_ALL, '');.

Also cyrillic filenames will not work on Chinese locales, but only first character will be removed instead of the first word:

setlocale(LC_ALL, 'zh_CN');
print basename('Звіт з результатами.txt');

Output: віт з результатами.txt
The first character lost in Chinese locales, which is bug too.

The correct and locale safe basename() should be independant of any locales and environments and should act the same everywhere.

@Synchro
Copy link
Member

Synchro commented May 27, 2019

It concerns me that your workaround uses all non-UTF-8 aware functions, treating it as a stream of bytes and not characters. I think it needs many more test cases to be certain - for example does it make a difference if the string ends with multibyte chars, or contains multibyte chars elsewhere in the path, or escaped slashes - for example does this (a valid path) work correctly?:

/Звіт з результатами.txt/Звіт з\\ результатами.txt\/

That works correctly for me using basename:

string(42) "Звіт з\ результатами.txt\"
array(4) {
  'dirname' =>
  string(41) "/Звіт з результатами.txt"
  'basename' =>
  string(42) "Звіт з\ результатами.txt\"
  'extension' =>
  string(4) "txt\"
  'filename' =>
  string(37) "Звіт з\ результатами"
}

@utilmind
Copy link
Contributor Author

utilmind commented May 27, 2019

\ and / characters always have 1 character length (\ is 0x2F, / is 0x5C), so it's proposed function is 100% unicode-aware. No need to use mb_strrpos/mb_substr.

UPD. I see the problem. Yes, safeBasename("Звіт з\ результатами.txt\") will return empty string because of last slash character. Going to fix it and propose function which acts the same way as basename()...

@Synchro
Copy link
Member

Synchro commented May 27, 2019

I suspect that whatever you write will need to take locale into account, which is extremely difficult - for example some locales use + as an escape character, but you can't tell if that's the case by looking at the string. This is one of the things that led to https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2016-10033 so you need to be extremely careful about doing this kind of stuff to avoid security holes. It might be better to throw an error if certain locales are used rather than trying to patch around what is essentially a configuration error.

FWIW, I'm on a Unix system (macOS) and I have no problems with these strings.

@Synchro
Copy link
Member

Synchro commented May 27, 2019

I found a PHP bug report about this, and the conclusion from that was essentially:

For better or worse, basename is affected by the locale. The encoding used in the filenames must match the set locale.

Bogus.

So if your locale is not set correctly, then basename may not work correctly either - and nor will anything else that relies on locale, so fixing basename is hiding the problem, not fixing it, and is thus not an appropriate solution.

@Synchro
Copy link
Member

Synchro commented May 27, 2019

You said:

\ and / characters always have 1 character length (\ is 0x2F, / is 0x5C)

That's true, but those bytes can also appear within valid UTF-8 characters - for example ȯ is 0x022F, and Ŝ is 0x015C.

@utilmind
Copy link
Contributor Author

That's true, but those bytes can also appear within valid UTF-8 characters - for example ȯ is 0x022F, and Ŝ is 0x015C.

No, all characters below 0x80 is single-byte characters. Ŝ is actually 0xC59C, ȯ is 0xC8AF.

@utilmind
Copy link
Contributor Author

utilmind commented May 27, 2019

So if your locale is not set correctly, then basename may not work correctly either - and nor will anything else that relies on locale, so fixing basename is hiding the problem, not fixing it, and is thus not an appropriate solution.

Ok, this make sense. The specific locale is really not configured on my server. I just tested basename() with prior call of setlocale(LC_CTYPE, 'en_US.utf-8'); and confirming that basename() works good with cyrillic (russian/ukrainian) characters even with "en_US" locale.

However... My default locale is correct and not misconfigured. :) It's set to "C" by default. C (ISO C) is absolutely standard default locale. See https://www.gnu.org/software/libc/manual/html_node/Standard-Locales.html#Standard-Locales

I'm completely aware that my CentOS doesn't supports filenames which contains non-latin characters and I never keep files with cyrillic characters on physical drive. So, "C" (ISO C) is correct locale for my internal environment (filenames and pathes) and I don't have a reasons to change it.

However my website allows visitors to download and send by email files with names that contain any utf-8 characters. Not only cyrillic characters, "Звіт з результатами.txt" is just particular case. Even emoji characters should be OK for filenames.

And these files is actually not exists in the file system. They all generated on the fly, allowed to be downloaded or sent by email. I'm actually don't care if the path will be stripped from filename or not. I never provide full path to AddStringAttachment(). The only thing I want is to be sure that filename will not broken, and the first word will not stripped. Adding extra space, or slash as prefix to the file name is not good solution. Changing the locale is not good too, because ISO C (which not unicode aware) is correct locale for my internal environment. The only purpose of that basename() is to get the pure filename without path.

So let's just get the filename after slashes. Or better, let's remove any path before slash or backslash characters and remove all characters that not related to filenames: <>:"/\|?*

@Synchro
Copy link
Member

Synchro commented May 27, 2019

I spotted that we've been here before - PHPMailer already has an mb_pathinfo function! Does it work for you if you change addStringAttachment to this:

2 => static::mb_pathinfo($filename, PATHINFO_BASENAME),

You're quite right on the UTF-8 thing - I was mixing it up with the code points.

@utilmind
Copy link
Contributor Author

utilmind commented May 27, 2019

Yes! Confirming that static::mb_pathinfo($filename, PATHINFO_BASENAME) works good with cyrillic filenames.

But it's not enough, after I realized that attached filenames may have invalid characters. :)

Here is an alternative solution:

function safeBasename($fn) {
  // This works the same as mb_pathinfo(.., PATHINFO_BASENAME) but ~ 4 times faster because doesn't use regexp.
  $fn = rtrim($fn, '/\\');
  $p = max(strrpos($fn, '/'), strrpos($fn, '\\'));
  if ($p !== false)
    $fn = substr($fn, $p+1);

  // now let's replace invalid characters to "_".
  static $bad_chars = '<>:"|?*';
  static $bad_charsc= '_______';

  return strtr($fn, $bad_chars, $bad_charsc);
}

Many email clients will fail to open files which names contains <>:"/\|?*

@Synchro
Copy link
Member

Synchro commented May 27, 2019

I'm not sure that's a good idea. The email RFCs do not impose any such limits on names, and you should be free to include those chars if you want - I certainly don't think we should remove them by default. PHPMailer should be transparent to this kind of issue - that some email clients may have a problem with it is not PHPMailer's fault.

@utilmind
Copy link
Contributor Author

utilmind commented May 27, 2019

No problem! This is just my suggestion to make the filenames really safe and I will use it in my wrapper over PHPMailer, just for sure.

In any case mb_pathinfo() will solve the issue described in topic (although it's slower than my suggested code), so I just awaiting for an update. :)

UPD. I just read in docs that <>:"/|?* is control characters only in Windows and except 0x00 and slash (/) characters they are allowed in Linux, OS-X etc. I will use it for my purposes in wrapper, but it really not required in canonical PHPMailer.

@Synchro
Copy link
Member

Synchro commented May 27, 2019

Thanks for reporting and your help with this. I've pushed the changes for this, also cleaned up the regex a little.

Historically macOS used : as a path separator, and it used to not allow it in filenames, but it does now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants