Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the error message when frontend RPC to datanode timed out #3447

Open
killme2008 opened this issue Mar 7, 2024 · 4 comments
Open
Assignees

Comments

@killme2008
Copy link
Contributor

What type of enhancement is this?

User experience

What does the enhancement do?

Currently the db returns an internal error instead of timeout error:

2024-03-05 23:01:54,235 ERROR sender_thread Operational error: (MySQLdb.OperationalError) (1815, 'Internal error: 1000')
[SQL: INSERT INTO keyboard_monitor (hits, ts) VALUES (%s, now())]
[parameters: ('Key.cmd+Key.left',)]

We'd better improve it.

Implementation challenges

No response

@tisonkun
Copy link
Contributor

tisonkun commented Mar 7, 2024

Thanks for raising this! I hit this issue, and I suggest the server-side logs help also.

@shuiyisong May you attach the logs (in text, so we can copy and search) here with the necessary desensitization?

@tisonkun
Copy link
Contributor

tisonkun commented Mar 7, 2024

Note: timezone is biased with (+08:00).

2024-03-06 22:44:20,078 ERROR sender_thread Operational error: (MySQLdb.OperationalError) (1815, 'Internal error: 1000')
[SQL: INSERT INTO keyboard_monitor (hits, ts) VALUES (%s, now())]
[parameters: ("'a'",)]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

And I hit it again. I'd change the code logic to retry on such error also in my App now.

@shuiyisong
Copy link
Contributor

shuiyisong commented Mar 7, 2024

2024-03-06 22:44:20,078 ERROR sender_thread Operational error: (MySQLdb.OperationalError) (1815, 'Internal error: 1000')
[SQL: INSERT INTO keyboard_monitor (hits, ts) VALUES (%s, now())]
[parameters: ("'a'",)]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

And I hit it again. I'd change the code logic to retry on such error also in my App now.

Logs on Frontend

2024-03-06T14:44:19.951468Z ERROR on_query{protocol="mysql"}:mysql::do_query:do_query: frontend::instance: Failed to execute query: INSERT INTO keyboard_monitor (hits, ts) VALUES ('\'a\'', now()) err=0: Table operation error, at greptimedb/src/frontend/src/instance.rs:271:141: Failed to execute logical plan, at /home/runner/work/gtc/gtc/greptimedb/src/operator/src/statement.rs:241:14
2: Table mutation error, at greptimedb/src/query/src/datafusion.rs:217:14
3: Failed to do table mutation, at greptimedb/src/operator/src/table.rs:60:14
4: Failed to insert data, at greptimedb/src/operator/src/insert.rs:217:26
5: Retry later
6: Failed to request RegionServer, code: The operation was cancelled
7: Timeout expired
2024-03-06T14:44:19.954555Z ERROR on_query{protocol="mysql"}: servers::mysql::writer: Failed to handle mysql query err=0: Failed to execute query, query: INSERT INTO keyboard_monitor (hits, ts) VALUES ('\'a\'', now()), at /home/runner/work/gtc/gtc/greptimedb/src/servers/src/query_handler/sql.rs:89:22
1: Table operation error, at greptimedb/src/frontend/src/instance.rs:271:14
2: Failed to execute logical plan, at /home/runner/work/gtc/gtc/greptimedb/src/operator/src/statement.rs:241:14
3: Table mutation error, at greptimedb/src/query/src/datafusion.rs:217:14
4: Failed to do table mutation, at greptimedb/src/operator/src/table.rs:60:14
5: Failed to insert data, at greptimedb/src/operator/src/insert.rs:217:26
6: Retry later
7: Failed to request RegionServer, code: The operation was cancelled
8: Timeout expired
2024-03-06T14:44:20.569265Z  WARN servers::mysql::server: Internal error occurred during query exec, server actively close the channel to let client try next time err=0: Internal IO error
1: Custom { kind: UnexpectedEof, error: "peer closed connection without sending TLS close_notify: https://docs.rs/rustls/latest/rustls/manual/_03_howto/index.html#unexpected-eof" }
2024-03-06T14:44:20.618607Z ERROR servers::error: Failed to handle HTTP request:  err=0: Execute gRPC query error, at greptimedb/src/frontend/src/instance/prom_store.rs:194:18
1: Table operation error, at greptimedb/src/frontend/src/instance/grpc.rs:203:14
2: Failed to insert data, at /home/runner/work/gtc/gtc/greptimedb/src/operator/src/insert.rs:217:26
3: Retry later
4: Failed to request RegionServer, code: The operation was cancelled
5: Timeout expired
2024-03-06T14:44:20.618871Z ERROR tower_http::trace::on_failure: response failed classification=Status code: 500 Internal Server Error latency=10023 ms

Logs on Datanode

2024-03-06T14:44:19.906650Z  INFO mito2::request: Successfully compacted region: 263599322824704(61374, 0)2024-03-06T14:44:19.935821Z  INFO mito2::request: Successfully compacted region: 264475496153088(61578, 0)
2024-03-06T14:44:19.960967Z  INFO mito2::request: Successfully compacted region: 264552805564416(61596, 0)
2024-03-06T14:44:19.993690Z  INFO mito2::request: Successfully compacted region: 264449726349312(61572, 0)
2024-03-06T14:44:20.029620Z  INFO mito2::request: Successfully compacted region: 263655157399552(61387, 0)
2024-03-06T14:44:20.076301Z  INFO mito2::request: Successfully compacted region: 263616502693888(61378, 0)
2024-03-06T14:44:20.151376Z  INFO mito2::request: Successfully compacted region: 264325172297728(61543, 0)
2024-03-06T14:44:20.218329Z  INFO mito2::request: Successfully compacted region: 263736761778176(61406, 0)
2024-03-06T14:44:20.245113Z  INFO mito2::request: Successfully compacted region: 264423956545536(61566, 0)
2024-03-06T14:44:20.283950Z  INFO mito2::request: Successfully compacted region: 113876762886144(26514, 0)
2024-03-06T14:44:20.333990Z  INFO mito2::request: Successfully compacted region: 264119013867520(61495, 0)
2024-03-06T14:44:20.359387Z  INFO mito2::request: Successfully compacted region: 263732466810880(61405, 0)
2024-03-06T14:44:20.396025Z  INFO mito2::request: Successfully compacted region: 263620797661184(61379, 0)
2024-03-06T14:44:20.421128Z  INFO mito2::request: Successfully compacted region: 263869905764352(61437, 0)
2024-03-06T14:44:20.482840Z  INFO mito2::request: Successfully compacted region: 264419661578240(61565, 0)
2024-03-06T14:44:20.523033Z  INFO mito2::request: Successfully compacted region: 263852725895168(61433, 0)
2024-03-06T14:44:20.549548Z  INFO mito2::request: Successfully compacted region: 263826956091392(61427, 0)
2024-03-06T14:44:20.599248Z  INFO mito2::request: Successfully compacted region: 113966957199360(26535, 0)
2024-03-06T14:44:20.634219Z  INFO mito2::request: Successfully compacted region: 264776143863808(61648, 0)
2024-03-06T14:44:20.641370Z  INFO mito2::compaction::twcs: Compaction region 263814071189504(61424, 0) output [e7babfcc-253a-4ce7-8a43-5eb22da78d7d,1ade1ae5-2cb3-4b0c-9f6c-0a352f6376f2]-> 47953fe6-2633-45ca-a596-d9e0e5fc7077
2024-03-06T14:44:20.642141Z  INFO mito2::compaction::twcs: Compaction region 264806208634880(61655, 0) output [040afffc-3708-401b-ad6a-d8a4bcd63332,1d54a324-5ab7-4193-a192-40f171a2a870]-> e245cd4d-f61a-48c2-87c0-9d1a5c5a7f45
2024-03-06T14:44:20.642747Z  INFO mito2::compaction::twcs: Compaction region 263998754783232(61467, 0) output [35224ec3-0cb3-4d8c-96a4-385015d7a5af,1604f53a-fd3e-444d-b17b-7c74a5da4fd5]-> 1261ad5f-898c-4ec5-a834-bb8480594629
2024-03-06T14:44:20.717214Z  INFO mito2::request: Successfully compacted region: 113979842101248(26538, 0)
2024-03-06T14:44:20.742845Z  INFO mito2::request: Successfully compacted region: 113932597460992(26527, 0)
2024-03-06T14:44:20.768225Z  INFO mito2::request: Successfully compacted region: 113838108180480(26505, 0)
2024-03-06T14:44:20.842690Z  INFO mito2::request: Successfully compacted region: 264771848896512(61647, 0)
2024-03-06T14:44:20.865614Z  INFO mito2::compaction::twcs: Compacted SST files, input: [FileMeta { region_id: 263814071189504(61424, 0), file_id: FileId(e7babfcc-253a-4ce7-8a43-5eb22da78d7d), time_range: (Timestamp { value: 1709731723913, unit: Millisecond }, Timestamp { value: 1709736128216, unit: Millisecond }), level: 0, file_size: 3885, available_indexes: [InvertedIndex], index_file_size: 451 }, FileMeta { region_id: 263814071189504(61424, 0), file_id: FileId(1ade1ae5-2cb3-4b0c-9f6c-0a352f6376f2), time_range: (Timestamp { value: 1709722032889, unit: Millisecond }, Timestamp { value: 1709731425855, unit: Millisecond }), level: 1, file_size: 4964, available_indexes: [InvertedIndex], index_file_size: 452 }], output: [FileMeta { region_id: 263814071189504(61424, 0), file_id: FileId(47953fe6-2633-45ca-a596-d9e0e5fc7077), time_range: (Timestamp { value: 1709722032889, unit: Millisecond }, Timestamp { value: 1709736128216, unit: Millisecond }), level: 1, file_size: 5212, available_indexes: [InvertedIndex], index_file_size: 452 }], window: Some(43200)
2024-03-06T14:44:20.870613Z  INFO mito2::compaction::twcs: Compacted SST files, input: [FileMeta { region_id: 264806208634880(61655, 0), file_id: FileId(040afffc-3708-401b-ad6a-d8a4bcd63332), time_range: (Timestamp { value: 1709731715996, unit: Millisecond }, Timestamp { value: 1709736120303, unit: Millisecond }), level: 0, file_size: 5441, available_indexes: [InvertedIndex], index_file_size: 617 }, FileMeta { region_id: 264806208634880(61655, 0), file_id: FileId(1d54a324-5ab7-4193-a192-40f171a2a870), time_range: (Timestamp { value: 1709722039975, unit: Millisecond }, Timestamp { value: 1709731417941, unit: Millisecond }), level: 1, file_size: 7751, available_indexes: [InvertedIndex], index_file_size: 616 }], output: [FileMeta { region_id: 264806208634880(61655, 0), file_id: FileId(e245cd4d-f61a-48c2-87c0-9d1a5c5a7f45), time_range: (Timestamp { value: 1709722039975, unit: Millisecond }, Timestamp { value: 1709736120303, unit: Millisecond }), level: 1, file_size: 6926, available_indexes: [InvertedIndex], index_file_size: 617 }], window: Some(43200)
2024-03-06T14:44:20.893455Z  INFO mito2::request: Successfully compacted region: 263668042301440(61390, 0)
2024-03-06T14:44:20.903840Z  INFO mito2::compaction::twcs: Compacted SST files, input: [FileMeta { region_id: 263998754783232(61467, 0), file_id: FileId(35224ec3-0cb3-4d8c-96a4-385015d7a5af), time_range: (Timestamp { value: 1709731715996, unit: Millisecond }, Timestamp { value: 1709736120303, unit: Millisecond }), level: 0, file_size: 3392, available_indexes: [InvertedIndex], index_file_size: 364 }, FileMeta { region_id: 263998754783232(61467, 0), file_id: FileId(1604f53a-fd3e-444d-b17b-7c74a5da4fd5), time_range: (Timestamp { value: 1709722039975, unit: Millisecond }, Timestamp { value: 1709731417941, unit: Millisecond }), level: 1, file_size: 3583, available_indexes: [InvertedIndex], index_file_size: 365 }], output: [FileMeta { region_id: 263998754783232(61467, 0), file_id: FileId(1261ad5f-898c-4ec5-a834-bb8480594629), time_range: (Timestamp { value: 1709722039975, unit: Millisecond }, Timestamp { value: 1709736120303, unit: Millisecond }), level: 1, file_size: 3688, available_indexes: [InvertedIndex], index_file_size: 365 }], window: Some(43200)
2024-03-06T14:44:20.917369Z  INFO mito2::request: Successfully compacted region: 264131898769408(61498, 0)
2024-03-06T14:44:20.975908Z  INFO mito2::request: Successfully compacted region: 114052856545280(26555, 0)

@waynexia @evenyag is it possible that datanode is doing compaction so that requests are hold until timeout?

@evenyag
Copy link
Contributor

evenyag commented Mar 7, 2024

datanode is doing compaction so that requests are hold until timeout?

The engine does compaction in the background so it won't hold requests.

The time spent on flushing the memtable is short, but the total flush time is quite long. So the region worker was blocked for nearly 1 min. To ensure consistency of manifest files we write them in the worker loop. This might block the worker if there are many manifest files to write. I will create an issue for it and fix it in v0.8.
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants