Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[1.x] Fix incorrect calculation results when the C locale is set to a locale that uses commas as the decimal separator #17177

Merged
merged 6 commits into from
Apr 24, 2020

Conversation

nickguletskii
Copy link
Contributor

@nickguletskii nickguletskii commented Dec 26, 2019

Description

Currently, many operators utilize std::stod to convert strings into floating point numbers. This causes incorrect calculations (#17140, #16134) when the C locale is set to a locale which uses commas (,) as decimal separators.

This pull request replaces calls to std::stod and std::stof to the locale-invariant dmlc::stod and dmlc::stof respectively.

The scope of this patch

This patch should fix a large portion of interactions through Python and JVM frontends, since they use locale-invariant serialization in order to pass parameters into MXNet's C API.

However, frontends which utilize C locale-aware serialization (i.e. call sprintf or similar) may break when using locales which don't use . as the decimal separator. They will have to be fixed in a separate patch. I also suspect that they are already broken, because operators utilising dmlc-core parameter parsing were already using locale-invariant serialization.

Further steps

STL streams are heavily used within the codebase, both for serialization and for forming user-friendly messages. Fortunately, they don't seem to be affected by the C standard library locale settings. However, if someone sets the STL locale by calling std::locale::global, MXNet's API will be broken. In order to ensure that this doesn't happen, all streams which are used for serialization will have to be imbued with the "C" locale (not the locale set in the C standard library).

It would be nice to see a more principled approach to serialization in MXNet 2.0, e.g. using a binary format for communication between the frontend and the backend. In addition to solving locale-related issues, this would probably result in a smaller invocation overhead.

Locale-invariant serialization vs locale-aware serialization

As a side-note, using locale-aware serialization is not an option, simply because using , as the decimal separator adds ambiguities to tuple serialization, e.g. (4,4,3) can be a tuple of 3 integers, or a tuple of 2 floats.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Replace std::stod with dmlc::stod and std::stof with dmlc::stof`.
  • Add a test that tests locale invariance for scalar ops.

Comments

  • May break Julia and R code when using a locale that uses , as the decimal separator. However, since there was no consistency between the various operators before, it is not unlikely that the code was already broken.

@nickguletskii nickguletskii changed the title Bugfix/locale invariant stod Fix incorrect calculation results when the C locale is set to a locale that uses commas as the decimal separator Dec 26, 2019
@leezu
Copy link
Contributor

leezu commented Apr 16, 2020

@nickguletskii can resolve the conflicts so this PR may be merged?

Are you interested in working on the "more principled approach to serialization in MXNet 2.0, e.g. using a binary format for communication between the frontend and the backend. In addition to solving locale-related issues, this would probably result in a smaller invocation overhead."? Part of this may (or may not) be done already via the FFI work lead by @hzfan?

@stu1130
Copy link
Contributor

stu1130 commented Apr 17, 2020

+1 need this fix on MXNet 1.7

@ciyongch
Copy link
Contributor

@nickguletskii, as we're doing 1.7 release recently, can you also help to backport this PR to v1.x branch as @stu1130 mentioned, which will be included in MXNet 1.7 as well? Thanks!

@nickguletskii
Copy link
Contributor Author

Sorry for accidentally adding so many reviewers: this was a side effect from rebasing onto the v1.x branch, which didn't exist when I first created this PR.

@leezu Judging by the documentation for the new FFI, it does seem that using it would solve the problem. However, that would require all operators, optimizers and other parts of the codebase to be ported to the new FFI. Therefore, I don't think this patch is necessary on the master branch.

@nickguletskii nickguletskii changed the title Fix incorrect calculation results when the C locale is set to a locale that uses commas as the decimal separator [1.x] Fix incorrect calculation results when the C locale is set to a locale that uses commas as the decimal separator Apr 17, 2020
@marcoabreu
Copy link
Contributor

Shall we add a check which throws an error in sanity if people use the STD method?

@nickguletskii
Copy link
Contributor Author

@marcoabreu I think it would be better to add a (nightly) CI job that will run the whole test suite with a locale that is vastly different from en_US.UTF-8. While this PR solves the issues related to the usage of std::stod and std::stof, there's no guarantee that there aren't any more subtle locale-related issues that are quietly truncating/changing values inside MXNet.

@nickguletskii
Copy link
Contributor Author

@mxnet-bot run ci windows-cpu

@mxnet-bot
Copy link

Undefined action detected.
Permissible actions are : run ci [all], run ci [job1, job2]
Example : @mxnet-bot run ci [all]
Example : @mxnet-bot run ci [centos-cpu, clang]

@nickguletskii
Copy link
Contributor Author

@mxnet-bot run ci [windows-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [windows-gpu]

@marcoabreu
Copy link
Contributor

Sure, but a text based scan is a lot quicker and less costly than running the fully suite. They are not exclusive, but enforcing the use of the dmlc version as part of the sanity check should bring a lot of value for low investment.

@leezu
Copy link
Contributor

leezu commented Apr 17, 2020

We already run the testsuite on 2 Linux platforms: Ubuntu and CentOS7. I suggest to switch the locale for the CentOS7 platform instead of introducing another nighlty build. @nickguletskii would you like to help open a PR to the master branch? You'd need to change this Dockerfile https://github.com/apache/incubator-mxnet/blob/master/ci/docker/Dockerfile.build.centos7_cpu https://github.com/apache/incubator-mxnet/blob/master/ci/docker/Dockerfile.build.centos7_gpu

@ciyongch
Copy link
Contributor

Adding to 1.7.0 roadmap #16864

@nickguletskii
Copy link
Contributor Author

We already run the testsuite on 2 Linux platforms: Ubuntu and CentOS7. I suggest to switch the locale for the CentOS7 platform instead of introducing another nighlty build. @nickguletskii would you like to help open a PR to the master branch? You'd need to change this Dockerfile https://github.com/apache/incubator-mxnet/blob/master/ci/docker/Dockerfile.build.centos7_cpu https://github.com/apache/incubator-mxnet/blob/master/ci/docker/Dockerfile.build.centos7_gpu

Good idea! I've made a pull request that updates the CentOS CI jobs and changes the tests to respect the locale environment variables: #18097

@leezu
Copy link
Contributor

leezu commented Apr 19, 2020

@mxnet-bot run ci [windows-gpu, unix-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [windows-gpu, unix-gpu]

@ciyongch
Copy link
Contributor

@nickguletskii , can you help to take a look at/solve the failed CI? As the code freeze date is postponed to April 25 PST, please make sure this PR is included both in v1.x and v1.7.x branch as well before the freeze date. Thanks!

@stu1130
Copy link
Contributor

stu1130 commented Apr 22, 2020

@ciyongch seems like the PR is blocked by
#18097 and #18025 @nickguletskii right?

@leezu
Copy link
Contributor

leezu commented Apr 22, 2020

@stu1130 these shouldn't be blockers

@mxnet-bot run ci [windows-gpu, unix-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [windows-gpu, unix-gpu]

@ciyongch
Copy link
Contributor

ciyongch commented Apr 22, 2020

Thanks @stu1130 and @leezu to help tracking this PR.
Gentle ping @nickguletskii if we can catch up the freeze date for this feature? Thanks!

@nickguletskii
Copy link
Contributor Author

@ciyongch From what I've seen, the CI failures are from CI timing out. I don't think this PR is causing the CI failures, since the CI seems to hang up on different tests each time...

@leezu
Copy link
Contributor

leezu commented Apr 22, 2020

@mxnet-bot run ci [all]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [edge, sanity, miscellaneous, centos-gpu, website, windows-cpu, windows-gpu, unix-cpu, clang, unix-gpu, centos-cpu]

@ciyongch
Copy link
Contributor

@nickguletskii thanks for your information, then re-trigger the failed CI would be helpful for merging. Can you help to backport this PR to v1.7.x branch as well?

@nickguletskii
Copy link
Contributor Author

@mxnet-bot run ci [unix-gpu]

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-gpu]

@nickguletskii
Copy link
Contributor Author

@ciyongch Sorry, the CI failures seem to be caused by this bug: #18090
Until this deadlock is fixed, I doubt re-triggering the CI will do much good.

@leezu
Copy link
Contributor

leezu commented Apr 23, 2020

@mxnet-bot run ci [unix-gpu]

@nickguletskii the deadlock is not deterministic, so until the root cause is fixed / the respective CI tests are disabled (#18151) to at least have a stable CI, retriggering will still help.

@mxnet-bot
Copy link

Jenkins CI successfully triggered : [unix-gpu]

@leezu
Copy link
Contributor

leezu commented Apr 24, 2020

@mxnet-bot run ci [unix-gpu]

@ciyongch
Copy link
Contributor

The back ported PR to v1.7.x branch #18147 passed CI :)

@leezu leezu merged commit 770d49e into apache:v1.x Apr 24, 2020
leezu pushed a commit that referenced this pull request Apr 24, 2020
…when the C locale is set to a locale that uses commas as the decimal separator) (#18147)

* Add a test for floating point parsing locale invariance

* Use locale-invariant dmlc:stod/stof instead of std:stod/stof

* Change the new operator tutorial to use dmlc:stod instead of std::stod

* Rename locale invariance test

* Skip test_scalarop_locale_invariance if the locales aren't available

* Fix linter errors due to incorrect include order
@leezu
Copy link
Contributor

leezu commented Apr 24, 2020

Merge based on the CI results on v1.7.x.

Thanks @nickguletskii

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants