MDEV-19556: InnoDB native sampling #2117

Open
wants to merge 6 commits into base: 10.9

Conversation

MagHErmit

  • The Jira issue number for this PR is: MDEV-19556

Description

Native sampling in InnoDB could improve histogram collection.

How can this PR be tested?

Run ANALYZE TABLE [table_name] PERSISTENT FOR ALL; for InnoDB and for another storage engine, then compare the resulting histograms.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch
  • This is a bug fix and the PR is based against the earliest branch in which the bug can be reproduced

Backward compatibility

Should not have any backward compatibility issues.

@CLAassistant

CLAassistant commented May 15, 2022

CLA assistant check
All committers have signed the CLA.

Contributor

@dr-m dr-m left a comment


I think that we should try to make the statistics scan smarter and make it cover every index. How do the collected statistics differ from what dict_stats_analyze_index() is collecting? That method would be much more accurate.

If we really want to have random sampling, then it could make more sense to introduce an API call to position the cursor to a specific position in an index (for example, indicated by a floating point number between 0 and 1, referring to the smallest and largest key). Such APIs were discussed in MDEV-21895.
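
For illustration only, a minimal sketch of what such a positioning call might look like at the handler level; the names sample_position and read_current are hypothetical and not taken from this PR or from the MDEV-21895 discussion:

// Hypothetical sketch only: position an index cursor at a fractional
// offset, where 0.0 means the smallest key and 1.0 the largest key.
class handler_positioning_sketch {
public:
  virtual ~handler_positioning_sketch() = default;
  // Move the cursor of index `index_no` to roughly the given fraction
  // of its key range; subsequent reads continue from that point.
  virtual int sample_position(unsigned index_no, double fraction) = 0;
  // Read the record the cursor currently points to into `buf`.
  virtual int read_current(unsigned char *buf) = 0;
};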

@@ -9489,6 +9490,62 @@ ha_innobase::rnd_next(
DBUG_RETURN(error);
}

#include "../row/row0sel.cc"
Contributor


This would seem to duplicate quite a bit of code, and could lead to violations of the one-definition rule.

Comment on lines +9503 to +9507
if (m_prebuilt->clust_index_was_generated) {
err = change_active_index(MAX_KEY);
} else {
err = change_active_index(m_primary_key);
}
Contributor


Is this condition really needed? Would a simple change_active_index(m_primary_key) work?

Why would we be sampling only in the clustered index? What about other indexes?

Comment on lines 9539 to 9541
offsets= rec_get_offsets(rec, index, offsets, index->n_core_fields, ULINT_UNDEFINED, &heap);
ut_ad(offsets != NULL);
ut_ad(heap == NULL);
Contributor


The assertion ought to fail if the table contains sufficiently many columns.

Contributor


Good, then -- we had some debate about whether we should handle the heap or not.
@MagHErmit, you should then construct a test to check this out :) And don't throw it away afterwards; we will include it in the test suite.

The minimum plan is to free it correctly, but, Marko, maybe we should cache this heap in the handler or in prebuilt to avoid reallocations on each sample_next call for wide tables?
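
To illustrate the caching idea only (this is not the PR's code, and a std::vector stands in for InnoDB's mem_heap_t): a per-handler scratch buffer could be reused across sample_next calls and released when the scan ends:

#include <cstddef>
#include <vector>

// Illustrative stand-in: a scratch buffer cached in the handler object,
// reused by every sample_next call instead of being reallocated per row.
class sampling_scratch_sketch {
  std::vector<unsigned char> buf;   // cached scratch area, grows lazily
public:
  unsigned char *reserve(std::size_t bytes_needed) {
    if (buf.size() < bytes_needed)
      buf.resize(bytes_needed);     // grow only when a wider row needs it
    return buf.data();
  }
  void release() {                  // called once, at the end of the scan
    buf.clear();
    buf.shrink_to_fit();
  }
};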

@@ -4030,6 +4030,10 @@ class handler :public Sql_alloc
virtual int ft_read(uchar *buf) { return HA_ERR_WRONG_COMMAND; }
virtual int rnd_next(uchar *buf)=0;
virtual int rnd_pos(uchar * buf, uchar *pos)=0;

virtual int sample_next(uchar *buf) { return 0;}
Member


I would suggest an interface that tells the storage engine beforehand how many rows (percentage-wise or number-wise) will be sampled. That opens up a number of optimizations, such as allowing the storage engine to pre-fetch those rows before sample_next takes place.
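
A minimal sketch of one possible shape for such an interface; sample_init, sample_end and the unit enum are made-up names, not part of this PR:

// Hypothetical interface sketch: announce the requested sample size up
// front so the engine can pre-fetch pages or pick a cheaper access path.
enum class sample_unit_sketch { FRACTION, ROW_COUNT };

class handler_sampling_sketch {
public:
  virtual ~handler_sampling_sketch() = default;
  // Called once before the scan with the requested amount of sampling.
  virtual int sample_init(sample_unit_sketch unit, double amount)
  { return 0; }
  // Same contract as rnd_next(): fill buf with the next sampled row.
  virtual int sample_next(unsigned char *buf) = 0;
  // Called once after the scan; release any pre-fetch resources.
  virtual int sample_end() { return 0; }
};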

Contributor


It could be a separate call, like a condition pushdown (as an index_* alternative, I guess).
Anyway, using it is out of this task's scope, so such a parameter would be left dangling until somebody implements it, or not.

@FooBarrior
Contributor

@dr-m Our goal is to implement Bernoulli sampling. How can a more sophisticated statistics collection from dict_stats_analyze_index help with that?

For the purposes of sampling we need access to the whole record -- this is why the clustered index is used
(as far as I understand; @cvicentiu can give a better strategic view on the long-term goals).
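
For context, Bernoulli sampling keeps each scanned row independently with a fixed probability p; here is a self-contained sketch of that per-row decision (illustrative names, not the PR's code):

#include <random>

// Illustrative only: the per-row Bernoulli decision. Every row seen by
// the scan is returned to the caller independently with probability p.
class bernoulli_sampler_sketch {
  std::mt19937_64 gen;
  std::bernoulli_distribution keep;
public:
  explicit bernoulli_sampler_sketch(double p, unsigned long long seed = 42)
    : gen(seed), keep(p) {}
  // Called once per scanned row; true means "include this row".
  bool sample_this_row() { return keep(gen); }
};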

@sanja-byelkin
Member

I could not find any tests; the rule of thumb is that a feature committed without tests usually has problems and/or can be broken later. There should be some tests which prevent regressions and show that statistics collection works.

@FooBarrior
Contributor

@sanja https://buildbot.mariadb.org/#/builders/146/builds/12397
So there are some tests:-)

My plan was to add them at the last stage. For now we wanted to collect some remarks on the API and the rough implementation.
Right now the top priority is to adjust the sampling probabilities.

Also, I don't yet know a good way to test this sampling. My plan is mostly to upgrade main.statistics* to test both sampling methods (through a switch). There is one test I would like to add in particular, but it is for a table with 300+ columns, so I'm not sure we can add it normally. At the least, --big-test will be implied, I think.

@sanja-byelkin
Member

There should be dedicated tests of the difference from the current method (as we agreed).

@spetrunia spetrunia self-requested a review May 20, 2022 17:25
Labels
None yet
6 participants