Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ins subscriber name validation fixes see #7477 #7478

Conversation

adunsulag
Copy link
Sponsor Member

@adunsulag adunsulag commented Jun 13, 2024

This puts in multibyte case insensitive validation on the subscriber previous name fields. Due to unicode character sets having multiple uppercase characters that correspond to a single lowercase and viceversa with many lowercase to a single uppercase, name identifiers are normalized into unicode NFKC form before comparison.

Fixes #7477

This puts in multibyte case insensitive validation on the subscriber
previous name fields.  It does not fully fix the issue for all unicode
names as apparently some unicode character sets can have multiple
uppercase characters that correspond to a single lowercase and viceversa
with many lowercase to a single uppercase.  Some research shows we need
to normalize the characters and then do a comparison but need to do more
research to see if this is an actual valid assumption or not before
implementing.

This will almost fix openemr#7477 for nearly all issues, but the edge cases
still need to be addressed.
($previousName['previous_name_last'] == $values['subscriber_lname']
|| mb_strtolower($previousName['previous_name_last']) == mb_strtolower($values['subscriber_lname'])
)
) {
Copy link
Sponsor Member

@bradymiller bradymiller Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could just bite bullet and put following function in sanitize.inc.php (this file always including in all scripts):

function mb_is_string_equal_ci($string1, $string2): bool
{
    if ($string1 == $string2) {
        return true;
    }

    $string1_normalized = Normalizer::normalize($string1, Normalizer::FORM_KC);
    $string2_normalized = Normalizer::normalize($string2, Normalizer::FORM_KC);
    return mb_strtolower($string1_normalized) === mb_strtolower($string2_normalized)
            || mb_strtoupper($string1_normalized) === mb_strtoupper($string2_normalized);
}

edit: also good make a quick util class for this function

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then have a nice function to use for when doing name lookups etc. that will support internationalization (and it will get tested over time)

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also throw bunch of basic unit testing on function at ci akin to this (nice way to test things without needing to manually run them through the api to test):
https://github.com/openemr/openemr/blob/master/tests/Tests/Unit/PasswordVerifyTest.php

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look at building out the unit tests and getting this in.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bradymiller I went ahead and wrote a unit test and implemented the mb_is_string_equal_ci function. Let me know what you think.

Did a bunch of research on the unicode normalization process to make
sure we were using the correct normalization form.  Implemented a unit
test to verify various test cases.  Everything appears to be working
correctly.
@@ -81,7 +81,8 @@
},
"config": {
"platform": {
"php": "8.1"
"php": "8.1",
"ext-intl": "8.1"
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking if above line is needed. Generally less is better in these composer.json things.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Normalizer class is built into the php international extension. Without the extension the functionality will fatal error. I thought it'd be best to be explicit about the dependency than to have it work for some people (you install it automatically w/ docker) and not work for others.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bradymiller are you ok with keeping this in?

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @adunsulag , sorry for the delayed reply on this. I generally just use platform to force the build to be compatible with the lowest supported php version (in this case 8.1), so it does not use the php version that is on the server where the build is happening. No need for packages here. Can force dependencies in the require section, but don't do this since the build is generally not done on the actual server.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bradymiller php-intl isn't a package, its a php extension, you can't actually install it w/ composer, the intent is to warn anyone that does a composer install that their server is not setup properly as it is missing the php extension that OpenEMR relies upon, people can bypass it by doing:

composer install --ignore-platform-reqs

We can remove the warning but if someone is missing the php extension anywhere that extension is used will fail.

$string2_normalized = Normalizer::normalize($string2, Normalizer::FORM_KC);
return mb_strtolower($string1_normalized) === mb_strtolower($string2_normalized)
|| mb_strtoupper($string1_normalized) === mb_strtoupper($string2_normalized);
}
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dance-bear

// we need to do some normalizations as per this stackoverflow post: https://stackoverflow.com/a/38855868
// I'm not incorporating the normalization techniques yet as we need to better understand their implications before
// applying the fix
// TODO: look at normalizing the multi-byte strings to better handle international languages where multiple unicode characters correspond to the same uppercase or lowercase character
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like above comments need updating

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved

@bradymiller
Copy link
Sponsor Member

Hi @adunsulag , just noted couple minor issues. Otherwise looks great!

/**
* Compares a multibyte unicode string identifier in a case insensitive way to see if the two strings
* are semantically identical. Note that NFKC will treat several 'similar' semantically meaning texts as the same and so
* should be used for identifiers (things such as proper nouns, etc).
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @adunsulag, should this comment say should not be used?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephenwaite no, I believe from my research this is correctly stated. From the unicode website:

Q: Which forms of normalization should I support?
The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns (see UTR #36). NFD and NFKD are most useful for internal processing.

Also the security concerns are more clearly discussed here:
https://www.unicode.org/reports/tr36/#Text_Comparison

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about having two separate functions one to be used for general text and one to be used for identifiers, but I decided to just go with the single function. If you see value in it, I can make the two functions.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NFKC looks good, thanks

@adunsulag adunsulag merged commit fdadc05 into openemr:master Jun 28, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: Insurance coverage save fails for payer relationship self when subscriber name has different casing
3 participants