MDEV-19556: InnoDB native sampling #2117
base: 10.9
Conversation
I think that we should try to make the statistics scan smarter, and make it cover every index. In which way do the collected statistics differ from what `dict_stats_analyze_index()` is collecting? That method would be much more accurate.
If we really want to have random sampling, then it could make more sense to introduce an API call to position the cursor to a specific position in an index (for example, indicated by a floating point number between 0 and 1, referring to the smallest and largest key). Such APIs were discussed in MDEV-21895.
@@ -9489,6 +9490,62 @@ ha_innobase::rnd_next(
	DBUG_RETURN(error);
}

#include "../row/row0sel.cc"
This would seem to duplicate quite a bit of code, and possibly lead to some violations of the one-definition rule.
if (m_prebuilt->clust_index_was_generated) {
	err = change_active_index(MAX_KEY);
} else {
	err = change_active_index(m_primary_key);
}
Is this condition really needed? Would a simple `change_active_index(m_primary_key)` work?
Why would we be sampling only in the clustered index? What about other indexes?
offsets= rec_get_offsets(rec, index, offsets, index->n_core_fields, ULINT_UNDEFINED, &heap);
ut_ad(offsets != NULL);
ut_ad(heap == NULL);
The assertion ought to fail if the table contains enough columns.
Good, then; we had some argument about that, namely whether we should handle `heap` or not.
@MagHErmit, you should then construct a test to check this out :) And don't throw it out afterwards, we will include it in the test suite.
The minimum plan is to free it correctly, but, Marko, maybe we should cache this heap in `handler` or `prebuilt` to avoid reallocations on each `sample_next` for wide tables?
@@ -4030,6 +4030,10 @@ class handler :public Sql_alloc
	virtual int ft_read(uchar *buf) { return HA_ERR_WRONG_COMMAND; }
	virtual int rnd_next(uchar *buf)=0;
	virtual int rnd_pos(uchar * buf, uchar *pos)=0;

	virtual int sample_next(uchar *buf) { return 0;}
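As a hedged illustration of what an engine-side `sample_next` could do for Bernoulli sampling, here is a self-contained toy: each row is kept independently with a fixed probability during a sequential scan. All names here (`Row`, `ToyHandler`, the `-1` end-of-scan code) are hypothetical stand-ins, not the real MariaDB handler API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Toy stand-in for a table row; not an InnoDB structure.
struct Row { std::uint64_t id; };

class ToyHandler {
public:
  ToyHandler(std::vector<Row> rows, double fraction, std::uint32_t seed)
      : rows_(std::move(rows)), accept_(fraction), rng_(seed) {}

  // Fill *buf with the next sampled row and return 0, or return -1 at
  // end of scan (the real API would return an HA_ERR_* code instead).
  int sample_next(Row *buf) {
    while (pos_ < rows_.size()) {
      const Row &r = rows_[pos_++];
      if (accept_(rng_)) {  // keep each row independently with prob. `fraction`
        *buf = r;
        return 0;
      }
    }
    return -1;
  }

private:
  std::vector<Row> rows_;
  std::size_t pos_ = 0;
  std::bernoulli_distribution accept_;
  std::mt19937 rng_;
};
```

Since each row's acceptance is an independent coin flip, the sample size is only approximately `fraction * n`, which is exactly the Bernoulli-sampling behavior discussed in this thread.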
I would suggest an interface that tells the storage engine beforehand how many rows (percentage-wise or number-wise) will be sampled. That opens up a number of optimizations, such as allowing the storage engine to pre-fetch those rows before sample_next takes place.
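One concrete optimization that a pre-announced sample size enables is returning exactly k rows in a single pass via reservoir sampling (Algorithm R). This is a generic sketch of the technique under a hypothetical API where the server passes k up front; none of these names exist in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Reservoir sampling (Algorithm R): one sequential scan, exactly
// min(k, n) rows kept, and each row is equally likely to be sampled.
template <typename Row>
std::vector<Row> reservoir_sample(const std::vector<Row> &table,
                                  std::size_t k, std::uint32_t seed) {
  std::mt19937 rng(seed);
  std::vector<Row> sample;
  sample.reserve(k);
  for (std::size_t i = 0; i < table.size(); ++i) {
    if (sample.size() < k) {
      sample.push_back(table[i]);  // fill the reservoir first
    } else {
      // replace a random reservoir slot with probability k / (i + 1)
      std::uniform_int_distribution<std::size_t> d(0, i);
      std::size_t j = d(rng);
      if (j < k) sample[j] = table[i];
    }
  }
  return sample;
}
```

Unlike per-row Bernoulli acceptance, this produces a fixed-size sample, but it requires knowing k before the scan starts, which is the point of the suggested interface.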
It can be a separate call, like a condition pushdown (as an index_* alternative, I guess).
Anyway, using it is out of this task's scope, so this parameter will hang dangling until somebody implements it, or not.
@dr-m Our goal is to implement Bernoulli sampling, not a more sophisticated statistics collection. For the purpose of sampling we need to access the whole record -- this is why the clustered index is used.
I could not find any tests; a rule of thumb is that a feature committed with no tests usually has problems and/or can be broken later. There should be some tests which prevent regressions and show that statistics collection works.
@sanja https://buildbot.mariadb.org/#/builders/146/builds/12397 My plan was to add them at the last stage. For now we wanted to collect some remarks on the API and the rough implementation. Also, I don't yet know a good way to test this sampling. Mostly my plan is to upgrade
There should be special tests of the difference from the current method (as we agreed).
…method to (random + acceptance/rejection) sampling method
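For the "random + acceptance/rejection" sampling the commit message mentions, a common scheme is to draw a uniform (page, slot) pair using the maximum per-page capacity and reject draws that land past a page's real record count; accepted records are then uniform over all records regardless of how unevenly pages are filled. This is a generic sketch of that technique with toy structures, not the patch's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Toy page of varying fill; `Page` and `sample_one` are illustrative
// names only, not InnoDB structures.
struct Page { std::vector<int> records; };

int sample_one(const std::vector<Page> &pages, std::size_t max_per_page,
               std::mt19937 &rng) {
  std::uniform_int_distribution<std::size_t> pick_page(0, pages.size() - 1);
  std::uniform_int_distribution<std::size_t> pick_slot(0, max_per_page - 1);
  for (;;) {
    const Page &p = pages[pick_page(rng)];
    std::size_t slot = pick_slot(rng);
    if (slot < p.records.size())  // accept; otherwise reject and redraw
      return p.records[slot];
  }
}
```

Without the rejection step, records on sparsely filled pages would be over-represented; rejecting out-of-range slots corrects that bias at the cost of some redraws.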
Description
Native sampling in InnoDB could improve histogram collection.
How can this PR be tested?
Call

ANALYZE TABLE [table_name] PERSISTENT FOR ALL;

for InnoDB and another storage engine, then compare the histograms.

Basing the PR against the correct MariaDB version
Backward compatibility
Should not have any backward-compatibility issues.