Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MDEV-29209] Introduce the connection delay feature #2393

Open
wants to merge 48 commits into
base: 11.0
Choose a base branch
from

Conversation

dingweiqings
Copy link
Contributor

@dingweiqings dingweiqings commented Dec 22, 2022

  • The Jira issue number for this PR is: MDEV-29209

Description

  1. MySQL server should provide a way to introduce delay in authentication process based on consecutive unsuccessful login attempts. This would slow down brute force attack on user password.
    It should be possible to configure:
    1.1 Number of consecutive unsuccessful attempts before which delay is introduced
    2.2 Maximum amount of delay introduced
  2. The Main part of code in mysql pr is this commit c20c1c62415ca9c5bca85cd8ba77c9ed62c3b007, other commits fix some build issues .

Feature Requirements

The same as mysql worklog FR. Look at https://dev.mysql.com/worklog/task/?id=8885

High Level Design

  1. Add one global threshold variables ,
  2. Store user@host information and count failed login times somewhere.
  3. To decide if the connection has a delay based on information collected in step 2 .

Low Level Design

  1. Add global variables named failed_attempts_before_delay.
  2. The failed_attempts_before_delay is also a feature switch , when its be set to 0 then will disable connection delay .
  3. Use current password errors as failed login information on acl user .
  4. Calculate delay time based on net_read_timeout/(max_password_errors - password_errors_before_delay)*(failed_count -password_errors_before_delay) .
  5. Use mysql_mutex_t and mysql_cond_t to implement thread wait.

Note and Limitation

  1. Status variables Connection_control_delay_generated and
    information_schema.CONNECTION_CONTROL_FAILED_LOGIN_ATTEMPTS is not included in this PR , which upstream mysql implemented.
  2. Connection delay is disabled by default , it can be enabled by seting password_errors_before_delay to a positive integer.
  3. Connection delay rely on acl user@host ** not connect user@host** . It means it all connect user@host will count to one matched user@host .
  4. If a user is authed with a plugin , the plugin ** must return CR_AUTH_USER_CREDENTIALS ** when password is not matched.
  5. net_read_timeout/(max_password_errors - password_errors_before_delay)*(failed_count -password_errors_before_delay) is delay time calculated formula. As a result, max delay time is equal to net_read_timeout.

How can this PR be tested?

./mtr --suite=connection_delay

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch
  • This is a bug fix and the PR is based against the earliest branch in which the bug can be reproduced

Backward compatibility

  1. on-disk format change
  2. login fail will get a delay response if enable connection delay

@dingweiqings dingweiqings force-pushed the topic/kurt.ding/connection-control-plugin branch from b901a50 to 6355d2d Compare December 22, 2022 12:58
@dingweiqings dingweiqings marked this pull request as ready for review December 22, 2022 13:06
@dingweiqings dingweiqings force-pushed the topic/kurt.ding/connection-control-plugin branch from a0c9c84 to 2f0e9a9 Compare December 22, 2022 13:16
@dingweiqings
Copy link
Contributor Author

@vuvova Would you like to have a look at code and design?

@dingweiqings
Copy link
Contributor Author

link pr #2331

@dingweiqings
Copy link
Contributor Author

@vuvova Any comments would be great.

@LinuxJedi
Copy link
Contributor

@dingweiqings (just an FYI whilst it is in review) some of our builders are failing with this:

../libmariadbd.a(sql_acl.cc.o): In function `clear_user_login_faild_hash()':
/home/buildbot/amd64-centos-7-rpm-autobake/build/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/sql/sql_acl.cc:64: undefined reference to `user_login_failed_hash'
../libmariadbd.a(sql_acl.cc.o): In function `acl_authenticate(THD*, unsigned int)':
/home/buildbot/amd64-centos-7-rpm-autobake/build/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/sql/sql_acl.cc:14547: undefined reference to `get_or_create_user_login_failed_record(THD*, user_login_failed_record**)'
/home/buildbot/amd64-centos-7-rpm-autobake/build/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/sql/sql_acl.cc:14548: undefined reference to `check_connection_delay_for_user(THD*, user_login_failed_record*)'
collect2: error: ld returned 1 exit status

@dingweiqings
Copy link
Contributor Author

@LinuxJedi Fix it already .

Copy link
Member

@vuvova vuvova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.tags is checked in by mistake

more comments inline.

@@ -299,6 +299,8 @@ The following specify which files/extra groups are read (specified before remain
--extra-port=# Extra port number to use for tcp connections in a
one-thread-per-connection manner. 0 means don't use
another port
--failed-connections-threshold=#
The failed login threshold connection control
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this help text doesn't really help. I don't understand what "failed login threshold connection control" mean.
better something like "The number of failed authentication attempts before the delay is applied.

But even better, I'd say, just hard-code some reasonable value, like 3. And remove the variable.

#define FAILED_ATTEMPTS_BEFORE_DELAY 3

@@ -592,6 +594,8 @@ The following specify which files/extra groups are read (specified before remain
If there is more than this number of interrupted
connections from a host this host will be blocked from
further connections
--max-connection-delay=#
The maximum number of connection response delay
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better: "The maximum connection response delay in seconds"

@@ -1655,6 +1662,7 @@ max-binlog-cache-size 18446744073709547520
max-binlog-size 1073741824
max-binlog-stmt-cache-size 18446744073709547520
max-connect-errors 100
max-connection-delay 18446744073709551615
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rather long, may be use some realistic value by default? Like, a minute.

Copy link
Member

@vuvova vuvova Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that a delay larger than a connection timeout practically means that the account is blocked — connecting is no longer possible. So, this creates an interesting dependency with --max-connect-errors value.

First the user can do --failed-connections-threshold connections without a delay. Then user gets increasingly longer delays and at --max-password-errors he can no longer connect. No longer connect means the delay is equal to the timeout.

Considering this, you need neither --max-connection-delay nor --min-connection-delay nor hard-coded step of 1000ms. The delay is calculated as

timeout * (failure_count - failed_connections_threshold)/(max_password_errors - failed_connections_threshold)

and in this logic you cannot use hard-coded FAILED_ATTEMPTS_BEFORE_DELAY=3, as you need at least some way to enable/disable this feature, so there must be at least one variable to control it.

Don't forget to make sure you don't lock out SUPER users. They will always have to have delay < timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for reviewing code . Yes you are right , when failure_count = max_password_errors , the max delay is timeout*1=timeout. But here timeout value maybe set to

min(net_wait_timeout,net_interactive_timeout) - 10

not connect timeout from client , minus 10 to keep the max delay is less than timeout , which server send access denied response then server close this connection .

VARIABLE_TYPE INT UNSIGNED
VARIABLE_COMMENT The maximum number of connection response delay
NUMERIC_MIN_VALUE 0
NUMERIC_MAX_VALUE 4294967295
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't match what mysqld--help has. How comes?

sql/mysqld.cc Outdated
@@ -3962,7 +3965,7 @@ static int init_common_variables()
exit(1);
}
#endif

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please, try to avoid invisible whitespaces

sql/sql_acl.cc Outdated
check_connection_delay_for_user(thd, record);
delete_user_login_failed_record(thd, record);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user_resource aren't authentication in my opinion, so this code should be before user_resource checks. May be next to handle_password_errors that has a clean comment Login succeeded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it'd be better to have this code inside handle_password_errors() function. You will have it all in one place, not split over two if() as now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here maybe is a limitation , it is depending on return result for auth plugin that calling handle_password_errors. For example, auth_test_plugin just return CR_ERROR , and will not call handle_password_errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it'd be more better to put two if() into one function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this feature is a protection against brute force. So only delaying on CR_AUTH_USER_CREDENTIALS looks appropriate. It's a plugin bug to return incorrect status, you cannot do much if the plugin has a bug.

PSI_mutex_key key_connection_delay_mutex= PSI_NOT_INSTRUMENTED;
PSI_cond_key key_connection_delay_wait= PSI_NOT_INSTRUMENTED;
PSI_stage_info stage_waiting_in_user_login_failed_delay= {
0, "Waiting in connection_control stage", 0};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no connection_control plugin, so better rename everything that carries this name to match the new design.


/*
Get structure for logging connection data for the current user
*/

#ifndef NO_EMBEDDED_ACCESS_CHECKS
static HASH hash_user_connections;
HASH user_login_failed_hash;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps you could use a C++ wrapper of HASH to get type checks and other C++ goodies

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, do you need a hash at all? failed password attempts are counted in ACL_USER already, you can simply use that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, use password errors in ACL_USER is very simple, and don't to introduce additional HASH and less lock . But here maybe one problem , ACL_USER is count failed password attempts by acl user not connect user/host . In this case , acl user is app@% , if someone login as app user with many wrong times , it will delay application server in production connect. I don't know this is acceptable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is. I'd say it's important to behave consistently in all cases of "wrong password" reactions. If the whole ACL_USER is locked, then the timeout should also apply to the ACL_USER.

Supposedly, it was done to prevent distributed password brute-forcing from many different IPs.

my_printf_error(ME_NOTE, "%s have login failed %u times",
MYF(ME_NOTE | ME_ERROR_LOG), record->key,
record->failed_count);
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you need to wrap my_printf_error in #ifndef DBUG_OFF it means you should simply use DBUG_PRINT()

int delete_user_login_failed_record(THD *thd,
st_user_login_failed_record *record)
{
my_hash_delete(&user_login_failed_hash, (uchar *) record);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure that FLUSH PRIVILEGES empties the hash

@dingweiqings dingweiqings reopened this Feb 18, 2023
@dingweiqings dingweiqings marked this pull request as draft February 23, 2023 02:10
@dingweiqings dingweiqings force-pushed the topic/kurt.ding/connection-control-plugin branch from 9079add to 6c7911f Compare March 3, 2023 08:59
@dingweiqings dingweiqings changed the title [MDEV-29209] Introduce the Connection Control plugin [MDEV-29209] Introduce the connection delay feature Mar 7, 2023
@dingweiqings dingweiqings marked this pull request as ready for review March 7, 2023 10:04
@dingweiqings dingweiqings requested a review from vuvova March 7, 2023 10:05
mysql-test/main/mysqld--help.result Outdated Show resolved Hide resolved
/**
dialog test plugin mimicking the ordinary auth mechanism. Used to test the clear text plugin API.
The only difference between auth_cleartext_plugin and auth_cleartext_plugin2 is that the second plugin return CR_AUTH_USER_CREDENTIALS when password isn't matched.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to do that. Simply change auth_cleartext_plugin().

sql/sys_vars.cc Outdated Show resolved Hide resolved
sql/sql_acl.cc Outdated Show resolved Hide resolved
sql/sys_vars.cc Outdated Show resolved Hide resolved
sql/sql_connect.cc Outdated Show resolved Hide resolved
sql/sql_connect.cc Outdated Show resolved Hide resolved
/* Time unit for delay step is seconds , convert to milliseconds*/
uint64 timeout= (p->net_wait_timeout < p->net_interactive_timeout
? p->net_wait_timeout
: p->net_interactive_timeout) *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better use

  uint64 timeout= thd->net->read_timeout;

but please test that it's the correct timeout to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that read timeout value is 30 seconds , which means max delay is 30 seconds . Think from another point , client send login packet by network socket and server create a thread serving for connection. So in this thread context only when preceding instructions are executed poll function will be called to check if timeout. Simplely first delay response and then check read timeout. As a result , delay time and timeout is independent. Maybe I dont't get full picture of vio framework. Would you give me some explanation? @vuvova

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, you're right. The connection will not be automatically aborted because of the timeout. And indeed delay time and timeout are thus independent.

Still, connect_timeout (which is also vio->read_timeout and vio->write_timeout until the user is fully authenticated) is supposed to limit the duration of the login phase, so it kind of makes sense to artificially limit delay by vio->read_timeout. As a bonus it reduces the number of degrees of freedom, independent variables to configure, and makes two separate features parts of one larger feature

DBUG_PRINT("info",("The step for connection delay is %d milliseconds",step));
/* min delay time is 1 seconds and max delay time is timeout */
uint64 delay=
(step > 1000 ? step : 1000) * (count - p->failed_attempts_before_delay);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you cap the value with 1 second?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have calculated this case : if max_password_errors is UNIT_MAX and timeout is 24h , delay time is nearly =24*3600*1000/40_0000_0000=24*36/40000=216/10000=0.0216 ms . The result is too small , which influence nothing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I personally wouldn't do it, but up to you. Just, please, explain this logic in a comment.

sql/sql_connect.cc Outdated Show resolved Hide resolved
@dingweiqings dingweiqings marked this pull request as draft March 13, 2023 02:41
@dingweiqings dingweiqings marked this pull request as draft March 14, 2023 01:59
@dingweiqings dingweiqings force-pushed the topic/kurt.ding/connection-control-plugin branch from bae59a5 to fda926c Compare March 14, 2023 03:10
@dingweiqings dingweiqings marked this pull request as ready for review March 15, 2023 03:56
Copy link
Member

@vuvova vuvova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! That was quite good, I didn't have any comments about the actual implementation logic.
just some minor style comments, like, fix the spacing, rephrase the comment, redundant definition. in other words, making it ready to be pushed :)

mysql-test/main/mysqld--help.result Outdated Show resolved Hide resolved
sql/sql_connect.h Outdated Show resolved Hide resolved
sql/sys_vars.cc Outdated Show resolved Hide resolved
sql/sys_vars.cc Outdated Show resolved Hide resolved
sql/sql_connect.cc Outdated Show resolved Hide resolved
/*
* Based on that invalid password errors will cause user can't connect server
ever and server close connection if it got message from client after
wait_timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as you've found out, this is not correct, the delay can be longer than wait_timeout and the server will not disconnect. We artificially limit the delay by wait_timeout. So, better reflect that in the comment, for example

/*
  connect_timeout is designed to limit the connect time, every single network read and
  write is limited by it. Thus we'll say here that the maximal delay is limited by the
  connect_timeout too, and if it takes longer - the connection times out and the user
  cannot connect at all. This means that the delay must reach the connect_timeout
  when the count reaches max_password_errors
*/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very sorry for late response. Recently , I have some personal business.
Hmm, what do you think about this?

/*
  Connect_timeout is designed to limit the connect time, every single network read and
  write is limited by it. Thus we'll say here that the maximal delay is limited by the
  connect_timeout too. As a bonus it reduces the number of degrees of freedom, independent variables to
  configure 
*/

Here , my main point is the delay time can be longer than connect timeout , it could be better from reducing config variables perspective.

sql/sql_connect.cc Show resolved Hide resolved
sql/sql_connect.cc Outdated Show resolved Hide resolved
sql/sql_connect.cc Outdated Show resolved Hide resolved
sql/sql_connect.cc Outdated Show resolved Hide resolved
read_timeout / (max_password_errors - password_errors_before_delay);
/* If max_password_errors is large, step is very small which
* influence nothing, so 1 second is minim delay step */
step= step > 1000 ? step : 1000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking more about it, I believe this is not a good idea. Yes, the delay will be growing slowly if max_password_errors is big. But with 1s step, the delay will very quickly reach connect_timeout (for example, if connect_timeout is 10, then the delay will be 10s after 10 wrong passwords) and then it will not grow at all. Constant linear growth seems to be more adequate here

--source ../inc/check_connection_delay.inc

let $SERVER_RESPONSE_TIME= $DELAY_MIN_RESPONSE_TIME;
--source ../inc/check_response_delay_time.inc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not very convincing. If you look for the wait to be not found, better make it

let $SERVER_RESPONSE_TIME= .*;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have redesigned test cases. check_delay.inc is used to check expect delay, check_no_delay.inc is used to check expect not delay. In check_no_delay.inc file , it use

let $SERVER_RESPONSE_TIME= \d+;

@vuvova
Copy link
Member

vuvova commented Mar 17, 2023

I was reviewing your test cases, and when looking at connection_delay_invalid_users.test I realized that we have a problem.
MariaDB tries very hard to not leak the information about what usernames correspond to real users.

It was implemented in MDEV-3909 remote user enumeration. See also fa7d0c4 and this comment

--max-password-errors feature did not break that logic, because it still goes through the full authentication process and only at the end returns "access denied". And it doesn't increment u->password_errors if the username did not correspond to an existing user.

But this delay feature does. There is no delay for non-existent users, so by trying the same username few times and observing the delay one can determine whether such a user exists or not. At the moment I don't yet know how to solve this

@dingweiqings dingweiqings marked this pull request as draft March 20, 2023 09:48
@dingweiqings dingweiqings marked this pull request as ready for review March 22, 2023 10:26
@dingweiqings
Copy link
Contributor Author

--max-password-errors feature did not break that logic, because it still goes through the full authentication process and only at the end returns "access denied". And it doesn't increment u->password_errors if the username did not correspond to an existing user.

We could add a hash map for count non-existent users, so we could call handle_password_errors for all users. This design expand max_password_errors and delay response feature for all users .

@vuvova
Copy link
Member

vuvova commented Apr 3, 2023

I'm afraid, yes, it has to be a hash. I would've been great and very simply to use ACL_USER, but we need to have the same behavior for non-existent usernames.

But if there will be a hash, there must be some mechanism to limit its growth to prevent it from eating up all the memory. One way of doing it would be to keep only the latest N (configurable) usernames. If the N+1 username connects it simply gets the max possible delay without being added to the hash. Another approach would be to remove the oldest username from the hash, but it's not worth the extra complexity.

@dingweiqings
Copy link
Contributor Author

I'm afraid, yes, it has to be a hash. I would've been great and very simply to use ACL_USER, but we need to have the same behavior for non-existent usernames.

But if there will be a hash, there must be some mechanism to limit its growth to prevent it from eating up all the memory. One way of doing it would be to keep only the latest N (configurable) usernames. If the N+1 username connects it simply gets the max possible delay without being added to the hash. Another approach would be to remove the oldest username from the hash, but it's not worth the extra complexity.

I find that we can simply make non-existent users get max possible delay . For example ,

  if (res > CR_OK && mpvio.status != MPVIO_EXT::SUCCESS)
  {
    Host_errors errors;
    switch (res)
    {
    case CR_AUTH_PLUGIN_ERROR:
      errors.m_auth_plugin= 1;
      break;
    case CR_AUTH_HANDSHAKE:
      errors.m_handshake= 1;
      break;
    case CR_AUTH_USER_CREDENTIALS:
      errors.m_authentication= 1;
      if (thd->password && !mpvio.make_it_fail)
        handle_password_errors(thd, acl_user->user.str, acl_user->host.hostname, PASSWD_ERROR_INCREMENT);
      break;
    case CR_ERROR:
    default:
      /* Unknown of unspecified auth plugin error. */
      errors.m_auth_plugin= 1;
      break;
    }

    inc_host_errors(mpvio.auth_info.thd->security_ctx->ip, &errors);
    if (!thd->is_error())
      login_failed_error(thd);
    if(mpvio.make_it_fail)
      delay_user_for_max_time(user, host);
    DBUG_RETURN(1);
  }

@vuvova
Copy link
Member

vuvova commented Apr 22, 2023

No, of course, we cannot just use max possible delay for non-existent users. Because then it will be a property different between existing users and non-existent. Like, if you connect, and get a "wrong password" error after a delay which is less than max possible delay then the user must definitely exist. And this is exactly what we need to prevent, a way to tell whether a user exists.

@dingweiqings
Copy link
Contributor Author

No, of course, we cannot just use max possible delay for non-existent users. Because then it will be a property different between existing users and non-existent. Like, if you connect, and get a "wrong password" error after a delay which is less than max possible delay then the user must definitely exist. And this is exactly what we need to prevent, a way to tell whether a user exists.
Behaviour for non-existent user is a little complicated. I write a new version , which try best effort to make two types of users have the same behaviour.

@@ -14393,6 +14398,41 @@ static bool check_password_lifetime(THD *thd, const ACL_USER &acl_user)
return false;
}

static void handle_nonexistent_user_login_failed(THD *thd,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vuvova What do you think about this ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vuvova following up on this, it looks like it needs a reply from you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants