-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ins subscriber name validation fixes see #7477 #7478
ins subscriber name validation fixes see #7477 #7478
Conversation
This puts in multibyte case insensitive validation on the subscriber previous name fields. It does not fully fix the issue for all unicode names as apparently some unicode character sets can have multiple uppercase characters that correspond to a single lowercase and viceversa with many lowercase to a single uppercase. Some research shows we need to normalize the characters and then do a comparison but need to do more research to see if this is an actual valid assumption or not before implementing. This will almost fix openemr#7477 for nearly all issues, but the edge cases still need to be addressed.
($previousName['previous_name_last'] == $values['subscriber_lname'] | ||
|| mb_strtolower($previousName['previous_name_last']) == mb_strtolower($values['subscriber_lname']) | ||
) | ||
) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could just bite bullet and put following function in sanitize.inc.php (this file always including in all scripts):
function mb_is_string_equal_ci($string1, $string2): bool
{
if ($string1 == $string2) {
return true;
}
$string1_normalized = Normalizer::normalize($string1, Normalizer::FORM_KC);
$string2_normalized = Normalizer::normalize($string2, Normalizer::FORM_KC);
return mb_strtolower($string1_normalized) === mb_strtolower($string2_normalized)
|| mb_strtoupper($string1_normalized) === mb_strtoupper($string2_normalized);
}
edit: also good make a quick util class for this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then have a nice function to use for when doing name lookups etc. that will support internationalization (and it will get tested over time)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could also throw bunch of basic unit testing on function at ci akin to this (nice way to test things without needing to manually run them through the api to test):
https://github.com/openemr/openemr/blob/master/tests/Tests/Unit/PasswordVerifyTest.php
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look at building out the unit tests and getting this in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bradymiller I went ahead and wrote a unit test and implemented the mb_is_string_equal_ci function. Let me know what you think.
Did a bunch of research on the unicode normalization process to make sure we were using the correct normalization form. Implemented a unit test to verify various test cases. Everything appears to be working correctly.
@@ -81,7 +81,8 @@ | |||
}, | |||
"config": { | |||
"platform": { | |||
"php": "8.1" | |||
"php": "8.1", | |||
"ext-intl": "8.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking if above line is needed. Generally less is better in these composer.json things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Normalizer class is built into the php international extension. Without the extension the functionality will fatal error. I thought it'd be best to be explicit about the dependency than to have it work for some people (you install it automatically w/ docker) and not work for others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bradymiller are you ok with keeping this in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @adunsulag , sorry for the delayed reply on this. I generally just use platform to force the build to be compatible with the lowest supported php version (in this case 8.1), so it does not use the php version that is on the server where the build is happening. No need for packages here. Can force dependencies in the require section, but don't do this since the build is generally not done on the actual server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bradymiller php-intl isn't a package, its a php extension, you can't actually install it w/ composer, the intent is to warn anyone that does a composer install that their server is not setup properly as it is missing the php extension that OpenEMR relies upon, people can bypass it by doing:
composer install --ignore-platform-reqs
We can remove the warning but if someone is missing the php extension anywhere that extension is used will fail.
$string2_normalized = Normalizer::normalize($string2, Normalizer::FORM_KC); | ||
return mb_strtolower($string1_normalized) === mb_strtolower($string2_normalized) | ||
|| mb_strtoupper($string1_normalized) === mb_strtoupper($string2_normalized); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Validators/CoverageValidator.php
Outdated
// we need to do some normalizations as per this stackoverflow post: https://stackoverflow.com/a/38855868 | ||
// I'm not incorporating the normalization techniques yet as we need to better understand their implications before | ||
// applying the fix | ||
// TODO: look at normalizing the multi-byte strings to better handle international languages where multiple unicode characters correspond to the same uppercase or lowercase character |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like above comments need updating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
Hi @adunsulag , just noted couple minor issues. Otherwise looks great! |
/** | ||
* Compares a multibyte unicode string identifier in a case insensitive way to see if the two strings | ||
* are semantically identical. Note that NFKC will treat several 'similar' semantically meaning texts as the same and so | ||
* should be used for identifiers (things such as proper nouns, etc). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @adunsulag, should this comment say should not be used
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephenwaite no, I believe from my research this is correctly stated. From the unicode website:
Q: Which forms of normalization should I support?
The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns (see UTR #36). NFD and NFKD are most useful for internal processing.
Also the security concerns are more clearly discussed here:
https://www.unicode.org/reports/tr36/#Text_Comparison
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about having two separate functions one to be used for general text and one to be used for identifiers, but I decided to just go with the single function. If you see value in it, I can make the two functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NFKC
looks good, thanks
This puts in multibyte case insensitive validation on the subscriber previous name fields. Due to unicode character sets having multiple uppercase characters that correspond to a single lowercase and viceversa with many lowercase to a single uppercase, name identifiers are normalized into unicode NFKC form before comparison.
Fixes #7477