Micronesia regexes #354

Open
mattkerlogue opened this issue Feb 15, 2024 · 5 comments

@mattkerlogue

Related to #289: I've recently been working with a table that lists Micronesia (the country) solely as "Micronesia", not "Federated States of Micronesia", so countrycode returns NA.

I noticed in the discussion at #289 a reference to making a distinction between the subregion and the country. However, on further inspection of the codelist dataset, this distinction seems to be applied only in the English regex; the French, German and Italian regexes test only for the name of the subregion.

I've certainly seen datasets where the country is just referred to as Micronesia, but I've also seen it abbreviated as "FS Micronesia" or "F.S. Micronesia" which the current English regex would also miss. Moreover, country.name.de is simply a reference to the subregion "Mikronesien" rather than the full country name (e.g. "Mikronesien (Föderierten Staaten von)").

countrycode::codelist |>
  dplyr::filter(iso3c == "FSM") |>
  dplyr::select(
    country.name.en, country.name.fr, country.name.de, country.name.it,
    country.name.en.regex, country.name.fr.regex,
    country.name.de.regex, country.name.it.regex) |>
  dplyr::glimpse()

#>  Rows: 1
#>  Columns: 8
#>  $ country.name.en       <chr> "Micronesia (Federated States of)"
#>  $ country.name.fr       <chr> "Micronésie (États fédérés de)"
#>  $ country.name.de       <chr> "Mikronesien"
#>  $ country.name.it       <chr> NA
#>  $ country.name.en.regex <chr> "fed.*micronesia|micronesia.*fed"
#>  $ country.name.fr.regex <chr> "micron(é|e)sie"
#>  $ country.name.de.regex <chr> "mikronesien"
#>  $ country.name.it.regex <chr> "micronesia"
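
To make the gap concrete, here is a minimal base-R sketch that applies the English regex shown above to a few spellings. It assumes case-insensitive, Perl-compatible matching (`ignore.case = TRUE, perl = TRUE`), which may differ in detail from how countrycode applies its regexes internally; the input spellings are illustrative.

```r
# English regex for FSM, copied from the codelist output above
en_regex <- "fed.*micronesia|micronesia.*fed"

# Illustrative spellings of the country name
spellings <- c(
  "Federated States of Micronesia",
  "Micronesia (Federated States of)",
  "Micronesia",
  "FS Micronesia",
  "F.S. Micronesia"
)

# Assumption: case-insensitive, Perl-compatible matching
en_matches <- grepl(en_regex, spellings, ignore.case = TRUE, perl = TRUE)
names(en_matches) <- spellings
print(en_matches)
```

Only the first two spellings contain both "fed" and "micronesia", so the bare and abbreviated forms are not matched.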

In my personal experience it's rare that I've come across lists/situations which include continents/continental subregions alongside countries, and if they do I'd ordinarily remove those from a list before trying to use countrycode() on it. So it did surprise me that "Micronesia" didn't return a country code.

Given that "Micronesia" is the only geographic term that can so closely be attributed to either a country or a region, my expectation would be that it returns the country code rather than NA.

@stefgehrig

stefgehrig commented Mar 28, 2024

This is a common issue for me as well, and I work around it by using a custom matching from "Micronesia" to "Micronesia (Federated States of)" in all my applications. If it doesn't create problems in other situations, the suggestion by @mattkerlogue would be an improvement for my own use of the package (and probably for many others).

@cjyetman
Collaborator

I would consider this a "bug" in the non-English regexes and try to fix that. I realize that solution would likely not be very satisfying to the OP, but at least the behavior would be consistent between languages.
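
One way to picture that fix is to have the non-English regexes require a "federated" token, mirroring the English one. The patterns below are hypothetical (they are not taken from the package) and would need checking against real-world name variants; `strict_regex` and `matches_fsm()` are names made up for this sketch, and matching is assumed to be case-insensitive and Perl-compatible.

```r
# Hypothetical stricter patterns (NOT the package's current regexes)
strict_regex <- c(
  fr = "f(é|e)d.*micron(é|e)sie|micron(é|e)sie.*f(é|e)d",
  de = "f(ö|oe)d.*mikronesien|mikronesien.*f(ö|oe)d",
  it = "fed.*micronesia|micronesia.*fed"
)

# Hypothetical helper: does `name` match the pattern for language `lang`?
matches_fsm <- function(name, lang) {
  grepl(strict_regex[[lang]], name, ignore.case = TRUE, perl = TRUE)
}

# Full country names versus the bare subregion name
full_names <- c(
  fr = matches_fsm("États fédérés de Micronésie", "fr"),
  de = matches_fsm("Mikronesien (Föderierte Staaten von)", "de"),
  it = matches_fsm("Stati Federati di Micronesia", "it")
)
bare_names <- c(
  fr = matches_fsm("Micronésie", "fr"),
  de = matches_fsm("Mikronesien", "de"),
  it = matches_fsm("Micronesia", "it")
)
print(rbind(full_names, bare_names))
```

With patterns of this shape the bare subregion name would stop matching in every language, which is the consistent (if stricter) behavior described above.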

I would also suggest using the custom_match arg to work around this.
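
A minimal sketch of that workaround, assuming the countrycode package is installed. `custom_match` takes a named vector whose names are the input values and whose values are the codes to return; the input vector and the name `fsm_override` are made up for illustration.

```r
# Illustrative input; "Micronesia" alone is not matched by the English regex
countries <- c("Micronesia", "Palau", "Federated States of Micronesia")

# Names are values in the source vector, values are the desired destination codes
fsm_override <- c("Micronesia" = "FSM")

# Only run the conversion when the package is available
if (requireNamespace("countrycode", quietly = TRUE)) {
  codes <- countrycode::countrycode(
    countries,
    origin = "country.name",
    destination = "iso3c",
    custom_match = fsm_override
  )
  print(codes)
}
```

The override applies only to inputs that are exactly "Micronesia"; other values still go through the regular regex matching.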

@NilsEnevoldsen
Contributor

I don't have a strong opinion. Happy to defer to @cjyetman's opinion.

What similar situations do we have in English?

> countrycode::countrycode("Korea", "country.name", "iso3c")
[1] "KOR"
> countrycode::countrycode("Sudan", "country.name", "iso3c")
[1] "SDN"
> countrycode::countrycode("America", "country.name", "iso3c")
[1] NA
Warning message:
Some values were not matched unambiguously: America 
> countrycode::countrycode("Congo", "country.name", "iso3c")
[1] "COG"
> countrycode::countrycode("Macedonia", "country.name", "iso3c")
[1] "MKD"
> countrycode::countrycode("Cyprus", "country.name", "iso3c")
[1] "CYP"

None of these are exactly the same situation. Maybe "America" is a weakly similar example.

FWIW, the UNGEGN official short name is Federated States of Micronesia (the), same as the formal name.

@NilsEnevoldsen
Contributor

One alternate suggestion: we could put in custom error messages for a couple of the uniquely troublesome cases. For example, a conversion from Micronesia as a country.name to anything else would return NA but also a suggestion to use the custom_match argument. I know some people don't like wordy error messages, but I think they can improve accessibility.
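
A rough base-R sketch of what such a message could look like. Everything here is hypothetical: `ambiguous_hints` and `warn_if_ambiguous()` are made-up names, not part of countrycode, and a real implementation would live inside the package's matching code rather than in a standalone helper.

```r
# Hypothetical lookup of troublesome inputs and the hint to show for each
ambiguous_hints <- c(
  micronesia = paste(
    "\"Micronesia\" may mean the subregion or the Federated States of",
    "Micronesia; consider using the custom_match argument."
  )
)

# Hypothetical helper: warn once per troublesome input found in `x`
warn_if_ambiguous <- function(x) {
  key <- tolower(trimws(x))
  hits <- unique(key[key %in% names(ambiguous_hints)])
  for (h in hits) {
    warning(ambiguous_hints[[h]], call. = FALSE)
  }
  invisible(hits)
}

# Usage: surface the hint as a message and carry on
flagged <- withCallingHandlers(
  warn_if_ambiguous(c("Micronesia", "Palau")),
  warning = function(w) {
    message("Hint: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

Keeping the hints in a named vector would make it easy to add other uniquely troublesome names later without touching the matching logic.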

@vincentarelbundock
Owner

Sorry for the delayed response.

I don't have a super strong view, but I guess I'd lean toward being stricter.

I personally like wordy error messages, and would be very happy to include that in a future version.

For transparency though, I'm not sure I'll get to it myself soon. But I'd be happy to review and merge a Pull Request if someone wants to implement.

5 participants