MDEV-19556: InnoDB native sampling #2117

Open
wants to merge 6 commits into base: 10.9

Conversation

MagHErmit

  • The Jira issue number for this PR is: MDEV-19556

Description

Native sampling in InnoDB could improve histogram collection.

How can this PR be tested?

Run ANALYZE TABLE [table_name] PERSISTENT FOR ALL; for InnoDB and for another storage engine, then compare the resulting histograms.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch
  • This is a bug fix and the PR is based against the earliest branch in which the bug can be reproduced

Backward compatibility

Should not have any backward compatibility issues.

@CLAassistant

CLAassistant commented May 15, 2022

CLA assistant check
All committers have signed the CLA.

Contributor

@dr-m dr-m left a comment


I think that we should try to make the statistics scan smarter and make it cover every index. How do the collected statistics differ from what dict_stats_analyze_index() is collecting? That method would be much more accurate.

If we really want to have random sampling, then it could make more sense to introduce an API call to position the cursor to a specific position in an index (for example, indicated by a floating point number between 0 and 1, referring to the smallest and largest key). Such APIs were discussed in MDEV-21895.
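
For illustration only, a minimal sketch of what such a positioning call might look like at the handler level; the names sample_position and read_current are hypothetical and not taken from this PR or from the MDEV-21895 discussion:

// Hypothetical sketch only: position an index cursor at a fractional
// offset, where 0.0 means the smallest key and 1.0 the largest key.
class handler_positioning_sketch {
public:
  virtual ~handler_positioning_sketch() = default;
  // Move the cursor of index `index_no` to roughly the given fraction
  // of its key range; subsequent reads continue from that point.
  virtual int sample_position(unsigned index_no, double fraction) = 0;
  // Read the record the cursor currently points to into `buf`.
  virtual int read_current(unsigned char *buf) = 0;
};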

@@ -9489,6 +9490,62 @@ ha_innobase::rnd_next(
DBUG_RETURN(error);
}

#include "../row/row0sel.cc"
Contributor


This would seem to duplicate quite a bit of code, and could lead to violations of the one-definition rule.

Comment on lines +9503 to +9507
if (m_prebuilt->clust_index_was_generated) {
err = change_active_index(MAX_KEY);
} else {
err = change_active_index(m_primary_key);
}
Contributor


Is this condition really needed? Would a simple change_active_index(m_primary_key) work?

Why would we be sampling only in the clustered index? What about other indexes?

Comment on lines 9539 to 9541
offsets= rec_get_offsets(rec, index, offsets, index->n_core_fields, ULINT_UNDEFINED, &heap);
ut_ad(offsets != NULL);
ut_ad(heap == NULL);
Contributor


The assertion ought to fail if the table contains sufficiently many columns.

Contributor


Good, then -- we had some debate about whether we should handle the heap or not.
@MagHErmit, you should then construct a test to check this out :) And don't throw it away afterwards; we will include it in the test suite.

The minimum plan is to free it correctly, but, Marko, maybe we should cache this heap in the handler or in prebuilt to avoid reallocations on each sample_next call for wide tables?
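
To illustrate the caching idea only (this is not the PR's code, and a std::vector stands in for InnoDB's mem_heap_t): a per-handler scratch buffer could be reused across sample_next calls and released when the scan ends:

#include <cstddef>
#include <vector>

// Illustrative stand-in: a scratch buffer cached in the handler object,
// reused by every sample_next call instead of being reallocated per row.
class sampling_scratch_sketch {
  std::vector<unsigned char> buf;   // cached scratch area, grows lazily
public:
  unsigned char *reserve(std::size_t bytes_needed) {
    if (buf.size() < bytes_needed)
      buf.resize(bytes_needed);     // grow only when a wider row needs it
    return buf.data();
  }
  void release() {                  // called once, at the end of the scan
    buf.clear();
    buf.shrink_to_fit();
  }
};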

@@ -4030,6 +4030,10 @@ class handler :public Sql_alloc
virtual int ft_read(uchar *buf) { return HA_ERR_WRONG_COMMAND; }
virtual int rnd_next(uchar *buf)=0;
virtual int rnd_pos(uchar * buf, uchar *pos)=0;

virtual int sample_next(uchar *buf) { return 0;}
Member


I would suggest an interface that tells the storage engine beforehand how many rows (percentage-wise or number-wise) will be sampled. That opens up a number of optimizations, such as allowing the storage engine to pre-fetch those rows before sample_next takes place.
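
A minimal sketch of one possible shape for such an interface; sample_init, sample_end and the unit enum are made-up names, not part of this PR:

// Hypothetical interface sketch: announce the requested sample size up
// front so the engine can pre-fetch pages or pick a cheaper access path.
enum class sample_unit_sketch { FRACTION, ROW_COUNT };

class handler_sampling_sketch {
public:
  virtual ~handler_sampling_sketch() = default;
  // Called once before the scan with the requested amount of sampling.
  virtual int sample_init(sample_unit_sketch unit, double amount)
  { return 0; }
  // Same contract as rnd_next(): fill buf with the next sampled row.
  virtual int sample_next(unsigned char *buf) = 0;
  // Called once after the scan; release any pre-fetch resources.
  virtual int sample_end() { return 0; }
};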

Contributor


It could be a separate call, like a condition pushdown (as an index_* alternative, I guess).
Anyway, using it is out of this task's scope, so such a parameter would be left dangling until somebody implements it, or not.

@FooBarrior
Contributor

@dr-m Our goal is to implement Bernoulli sampling. How can a more sophisticated statistics collection from dict_stats_analyze_index help with that?

For the purposes of sampling we need access to the whole record -- this is why the clustered index is used
(as far as I understand; @cvicentiu can give a better strategic view on the long-term goals).
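
For context, Bernoulli sampling keeps each scanned row independently with a fixed probability p; here is a self-contained sketch of that per-row decision (illustrative names, not the PR's code):

#include <random>

// Illustrative only: the per-row Bernoulli decision. Every row seen by
// the scan is returned to the caller independently with probability p.
class bernoulli_sampler_sketch {
  std::mt19937_64 gen;
  std::bernoulli_distribution keep;
public:
  explicit bernoulli_sampler_sketch(double p, unsigned long long seed = 42)
    : gen(seed), keep(p) {}
  // Called once per scanned row; true means "include this row".
  bool sample_this_row() { return keep(gen); }
};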

@sanja-byelkin
Member

I could not find any tests; the rule of thumb is that a feature committed without tests usually has problems and/or can be broken later. There should be some tests which prevent regressions and show that statistics collection works.

@FooBarrior
Contributor

@sanja https://buildbot.mariadb.org/#/builders/146/builds/12397
So there are some tests:-)

My plan was to add them at the last stage. For now we wanted to collect some remarks on the API and the rough implementation.
Right now the top priority is to adjust the sampling probabilities.

Also, I don't yet know a good way to test this sampling. My plan is mostly to upgrade main.statistics* to test both sampling methods (through a switch). There is one test I would like to add in particular, but it is for a table with 300+ columns, so I'm not sure we can add it normally. At the least, --big-test will be implied, I think.

@sanja-byelkin
Member

There should be dedicated tests of the difference from the current method (as we agreed).

@spetrunia spetrunia self-requested a review May 20, 2022 17:25
Labels
None yet
6 participants