Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix jq load_probabilities precision error using python #621

Merged
merged 2 commits into from
Jan 13, 2017

Conversation

zifeishan
Copy link
Contributor

jq has a precision bug while loading probabilities with big vids, due to a known issue that it's using double floating numbers to store integers. There seems no good way in jq to support better precision:
jqlang/jq#369

We've tried using awk with --bignum but versioning seems a problem.

Here we just try to use python3 to do the job.

`jq` has a precision bug while loading probabilities with big vids:

```
$ printf '1 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294272    200    300
$ printf '2 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294274    200    300
$ printf '3 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294276    200    300
$ printf '4 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294276    200    300
$ printf '5 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294276    200    300
$ printf '6 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294278    200    300
$ printf '7 200 300\n' | jq -R -r --argjson SHARD_BASE $((37 << 48))  '
                    split(" ") |
                    [ (.[0] | tonumber + $SHARD_BASE | tostring)
                    , .[1:][] ] | join("\t")'
10414574138294280    200    300
```

this is a known issue and shockingly there seems no good way in jq to support better precision:
jqlang/jq#369

We've tried using awk but versioning seems a problem. Here we just try to use python3 to do the job.
Just used a poor
split(\" \") |
[ (.[0] | tonumber + $SHARD_BASE | tostring)
, .[1:][] ] | join(\"\\t\")' |
python3 -c "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to escape the double quote (with backslash). Travis is complaining about jq compilation.

Also, @netj should we just use python here -- which should be compatible with both 2 and 3.

@zifeishan let's do some perf test for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't resist doing a benchmark notebook
and the winner is.... Perl!

jq was actually the slowest, ~7x slower than Perl and 3x than AWK. Python is ~1.7x, AWK is ~2.2x.
jq and AWK uses little memory ~10MB, ~1/3 of Python (35-45MB), but so does Perl (14MB).
So Perl is a clear winner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If perl can handle all 2^64 numbers and that's not subject to version differences on different OSes, we can go with perl.

@netj netj merged commit d1fe52a into master Jan 13, 2017
@netj netj deleted the hotfix-jq-load-prob branch January 13, 2017 20:08
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants