Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

histogram produces wrong plots #5343

Closed
seisman opened this issue Jun 17, 2021 · 5 comments
Closed

histogram produces wrong plots #5343

seisman opened this issue Jun 17, 2021 · 5 comments
Labels
bug Something isn't working

Comments

@seisman
Copy link
Member

seisman commented Jun 17, 2021

Description of the problem

Example data: test.txt

To reproduce the bug, run:

gmt histogram test.txt -W1p -T5.5/7.0/0.1 -Baf -png map
GMT 6.1.1 GMT 6.2.0
image image

I think the GMT 6.1.1 plot is correct and the GMT 6.2.0 plot is worng.

System information

  • Operating system: macOS
  • GMT version (gmt --version): 6.2.0
@seisman seisman added the bug Something isn't working label Jun 17, 2021
@maxrjones
Copy link
Member

@PaulWessel, the problem is with two lines changed in https://github.com/GenericMappingTools/gmt/pull/4454/files in gmtlib_var_inc in src/gmt_support.c (L17007-17008 in that commit).

Changing

		dx = x[k] - x[k-1];
		if (fabs ((fix_inc - dx) / fix_inc) > GMT_CONV8_LIMIT)

back to

		if (!doubleAlmostEqual (fabs (x[k] - x[k-1]), fix_inc))

produces the old plot.

@PaulWessel
Copy link
Member

Thanks, I will reconsider what was implemented and what it was trying to fix.

@PaulWessel
Copy link
Member

Before I dig deeper: I see that function returning false meaning the spacing is not variable, which it is not. Does the previous code return true (which would not be correct) and somehow this makes a correct plot?

@maxrjones
Copy link
Member

Before I dig deeper: I see that function returning false meaning the spacing is not variable, which it is not. Does the previous code return true (which would not be correct) and somehow this makes a correct plot?

Yes, before it returned true. So a whack-a-mole situation probably.

@PaulWessel
Copy link
Member

So this is classic rounding behavior. If we use the pshistogram_get_bin function that assumes the bins have a fixed interval we do things like bin = (int64_t)floor ((x - T->min) / T->inc) and this may give 7.99999999 and return 7. If we used the function that thinks it is not equidistant we basically loop and compare x to the array bounds and that gives the more consistent result. The reason is that the array[] values are all flawed in the same way as the input since 8 cannot be represented, only 7.999999, so we "floor" to the same side each time. With the division we are creating a new floating point that varies every other bin, either giving k.9999999999 or (k+1).0000000001111.
Perhaps the simplest solution here is to not fall into the rounding trap by using the pshistogram_get_variable_bin function. However, this is also bad user behavior trying to bin data that all fall on the bin boundaries exactly. This is what -F is for.
I remember last year we had some discussions regarding histogram with @joa-quim regarding speed. The division is obviously faster than searching the array unless the data are already relatively sorted. Maybe instead of

pshistogram_get_bin = (F->T->var_inc) ? &pshistogram_get_variable_bin : &pshistogram_get_constant_bin;

we should do

pshistogram_get_bin = (F->T->var_inc || n < BIG_NUMBER) ? &pshistogram_get_variable_bin : &pshistogram_get_constant_bin;

meaning we only encounter the rounding for very large data sets? Say BIG_NUMBER is 10e6?

PaulWessel added a commit that referenced this issue Jun 18, 2021
See #5343 for background.  By comparing a value to the array of bin boundaries we can consistently round in the same direction.  However, for large data sets of more than 1 million points we will use a faster approach for equidstant bins where values that fall exactly on bin boundaries will be rounded in alternate directions per standard rounding conventions.
PaulWessel added a commit that referenced this issue Jun 18, 2021
See #5343 for background.  By comparing a value to the array of bin boundaries we can consistently round in the same direction.  However, for large data sets of more than 1 million points we will use a faster approach for equidstant bins where values that fall exactly on bin boundaries will be rounded in alternate directions per standard rounding conventions.
@seisman seisman closed this as completed Jun 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants