🐛 Bug Report: maintenance task must not kill executions that have not reached the timeout defined in the function #5233

Open
2 tasks done
rafagazani opened this issue Mar 15, 2023 · 15 comments
Labels: bug (Something isn't working); product / functions (Fixes and upgrades for the Appwrite Functions.)

Comments

@rafagazani

👟 Reproduction steps

  1. Create a function with a timeout greater than 30 seconds.
  2. Make the execution run until close to the defined timeout.

👍 Expected behavior

The execution should be allowed to run until the defined timeout.

👎 Actual Behavior

If an execution has been running for more than 30 seconds when the maintenance task starts, the task kills it.

🎲 Appwrite version

Version 1.1.x

💻 Operating system

Linux

🧱 Your Environment

python appwrite==1.1.0

👀 Have you spent some time to check if this issue has been raised before?

  • I checked and didn't find similar issue

🏢 Have you read the Code of Conduct?

rafagazani added the "bug" (Something isn't working) label Mar 15, 2023
@rafagazani
Author

rafagazani commented Mar 15, 2023

The findings were recorded in this Discord conversation: https://discord.com/channels/564160730845151244/1078437335844266085/1085587879817924768

stnguyen90 added the "product / functions" (Fixes and upgrades for the Appwrite Functions.) label Mar 15, 2023
@stnguyen90
Contributor

stnguyen90 commented Mar 15, 2023

@rafagazani, thanks for creating this issue! 🙏🏼 To provide some more context:

When a function executes, a runtime container is created (if one does not exist already) to execute your function code. To save resources, Appwrite has a maintenance task that runs every 30 minutes to remove "inactive" runtime containers:

appwrite/app/executor.php

Lines 738 to 758 in e8c74d7

    Timer::tick(MAINTENANCE_INTERVAL * 1000, function () use ($orchestrationPool, $activeRuntimes) {
        Console::warning("Running maintenance task ...");
        foreach ($activeRuntimes as $runtime) {
            $inactiveThreshold = \time() - App::getEnv('_APP_FUNCTIONS_INACTIVE_THRESHOLD', 60);
            if ($runtime['updated'] < $inactiveThreshold) {
                go(function () use ($runtime, $orchestrationPool, $activeRuntimes) {
                    try {
                        $orchestration = $orchestrationPool->get();
                        $orchestration->remove($runtime['name'], true);
                        $activeRuntimes->del($runtime['name']);
                        Console::success("Successfully removed {$runtime['name']}");
                    } catch (\Throwable $th) {
                        Console::error('Inactive Runtime deletion failed: ' . $th->getMessage());
                    } finally {
                        $orchestrationPool->put($orchestration);
                    }
                });
            }
        }
    });
});

An "inactive" runtime container is one that has not completed successfully in the last _APP_FUNCTIONS_INACTIVE_THRESHOLD seconds (60 by default).

So, if you have a long-running function, there is a higher chance it will still be processing when the maintenance task runs. Because it hasn't finished successfully yet, Appwrite can remove its runtime, leading to an error like:

An internal curl error has occurred within the executor! Error Msg: Connection reset by peer
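
To make the check concrete, here is a minimal Python model of the comparison in the snippet quoted above. It is only a sketch, not Appwrite code: `runtimes_to_remove`, the runtime name, and the timestamps are made up for illustration, and the only things taken from the thread are the `updated` field and the 60-second default.

```python
INACTIVE_THRESHOLD = 60  # seconds; mirrors the _APP_FUNCTIONS_INACTIVE_THRESHOLD default


def runtimes_to_remove(active_runtimes, now, threshold=INACTIVE_THRESHOLD):
    """Return the names a maintenance pass at time `now` would remove.

    Hypothetical helper: it models only the `updated < now - threshold`
    comparison, not the container removal itself.
    """
    cutoff = now - threshold
    return [name for name, rt in active_runtimes.items() if rt["updated"] < cutoff]


# The runtime last finished an execution at t=0 and then started a new,
# long-running execution at t=10 that is still in progress.
active_runtimes = {"long-runner": {"updated": 0}}

# Maintenance fires at t=70. The execution is still running, but `updated`
# still holds the old timestamp, so the runtime is selected for removal.
print(runtimes_to_remove(active_runtimes, now=70))
```

In this model the runtime is chosen purely from its last `updated` timestamp, so an execution that is still in flight offers no protection.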

Let me talk to the team about this.

stnguyen90 self-assigned this Mar 15, 2023
@Meldiron
Contributor

Meldiron commented Mar 16, 2023

We noticed a race condition in Appwrite Cloud, most likely caused by the same problem.
In the executor, we added a change that re-marks a runtime as "active" at the beginning of the execution as well as at the end.

PR: https://github.com/open-runtimes/executor/pull/24/files
(not released yet)
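
A rough Python model of what marking the runtime active at the start of an execution changes. This is a sketch under stated assumptions and not the code from the PR: `looks_inactive`, `removed_mid_run`, and all the timings are hypothetical, and the maintenance pass is assumed to land exactly halfway through the execution.

```python
INACTIVE_THRESHOLD = 60  # seconds, mirroring the default quoted earlier in the thread


def looks_inactive(runtime, now, threshold=INACTIVE_THRESHOLD):
    """Same comparison the maintenance task makes against 'updated'."""
    return runtime["updated"] < now - threshold


def removed_mid_run(idle_for, duration, touch_at_start):
    """Simulate one execution and report whether a maintenance pass
    landing halfway through it would treat the runtime as inactive."""
    runtime = {"updated": 0}      # the previous execution finished at t=0
    now = idle_for                # the next execution starts here
    if touch_at_start:
        runtime["updated"] = now  # modelled "re-mark active at the beginning"
    now += duration / 2           # maintenance fires mid-execution
    return looks_inactive(runtime, now)


# Short execution on a runtime that has been idle for 100 s.
print(removed_mid_run(idle_for=100, duration=2, touch_at_start=False))   # True
print(removed_mid_run(idle_for=100, duration=2, touch_at_start=True))    # False

# Execution much longer than the threshold, even with the start-time touch.
print(removed_mid_run(idle_for=100, duration=200, touch_at_start=True))  # True
```

Under these assumptions the start-time update protects short executions on an idle runtime, while an execution that runs longer than the threshold can still look inactive to a maintenance pass that lands late in the run.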

@garvitomer

@Meldiron Hi, will the changes you mentioned above fix that curl error in version 0.15.3 if I apply them to the source files? Most of my functions run within 1 second.
Also, is this everything, or is there anything else to change?

@Meldiron
Contributor

Hi @garvitomer 👋

This issue is regarding long-running functions, so I don't think it's related to your issue.
The issue we solved with my PR was regarding Appwrite Cloud, and the problem was found with many concurrent executions. We never got this error reported by the community, so I don't think this is relevant for you either unless you have really high load on your function.

Could you please elaborate on your issue? Are you also getting the Connection reset by peer error? Did you manage to find what could trigger it?

@garvitomer

@Meldiron We discussed it on the Discord server.

This is the GitHub issue:
#3776

Discord link:
https://discord.com/channels/564160730845151244/1075555914540650526/1082968681841172523

@garvitomer

garvitomer commented Mar 16, 2023

@Meldiron This is the error I usually get. I can't find any set pattern that causes it. I restart Docker every day to minimize the error, as it generally starts happening from day 2.

  • All my functions execute in under 1 second.
  • I use both the Dart and Python SDKs.
  • The error happens in normal functions triggered by the Appwrite server or the user's frontend.
  • It happens on cron schedules too.

An internal curl error has occurred within the executor! Error Msg: Could not resolve host: 630ffeafdee98bacd072-63e63da0ef7c2e3ea5af

@rafagazani
Author

@garvitomer As I reported on Discord, I saw a lot of this type of random error on my previous server. I don't know why, but after changing servers I no longer came across it.
Just out of curiosity: it's been running for over 24 hours, 119k runs, zero errors. I'll leave it running until it has finished reading the entire collection, to check that not even one run fails.

@garvitomer

@rafagazani Hi, please keep tracking it and let us know your results after 4-5 days, as the errors usually start after 24-48 hours and not every function is affected. I personally don't think it is due to the server, but if that is the case I will also switch from Hetzner to another provider. Which one are you using now?

@rafagazani
Author

@garvitomer I've been watching since the 15th at 00:26. It has passed 197k executions with zero errors. I believe this process will go on for a few more days and should reach about 650 thousand executions. I'm using DigitalOcean.

@rafagazani
Author

@Meldiron, could you please indicate which version the fix will be included in?

@suikodev

This issue is not specific to long-running functions, as it can occur even if your function's running time is less than 1 s. Anyone who uses Appwrite Functions could face this issue. Take a look at this line, which only runs at the end of the Appwrite function execution flow:

$runtime['updated'] = \time();

If a maintenance task runs between the start of the Appwrite function execution flow and the execution of $runtime['updated'] = \time();, you might encounter this problem. Here is my Appwrite executor log for reference:

Executing Runtime: dreams-648ecd347888ec43f653
Running maintenance task ...
[Error] Type: Exception
[Error] Message: An internal curl error has occurred within the executor! Error Msg: Connection reset by peer
[Error] File: /usr/src/code/app/executor.php
[Error] Line: 544
Successfully removed dreams-648ecd347888ec43f653
Successfully removed dreams-648ecd092f0bf133fff9

@BChip

BChip commented Aug 22, 2023

Hello, I am facing this issue as well.
An internal curl error has occurred within the executor! Error Msg: Connection reset by peer

It happens the first time I call the cloud function; it fails after 2-3 seconds. The second request always works and doesn't return this error. It's like there is some kind of cold-start issue or something?

@garvitomer

Hi, I stopped using Appwrite 6 months ago because of this issue, learned Golang, and am currently building my server in it, all due to this single major issue. I now think that is the best way to do whatever you want with complete freedom. The Appwrite community did help me a lot in understanding the basics of APIs, though, and I will always be grateful to them.

@rafagazani
Author

I got around this problem using these tips:

  1. Always wrap the function body in try/catch.
  2. Ensure that when the function ends, it returns a response via response.json or response.send.
  3. Functions that may run for more than 30 seconds should not be executed as Appwrite functions. In this case I created a separate container to call the function.
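
Tips 1 and 2 could look roughly like this for a Python function. This is only a sketch: it assumes the `main(req, res)` entrypoint with `req.payload` and `res.json` used by the Python runtime before 1.4, and `do_work`, `StubRequest`, and `StubResponse` are hypothetical names added so the example can be run locally.

```python
import json


def do_work(payload):
    """Hypothetical business logic; raises on malformed input."""
    data = json.loads(payload or "{}")
    return {"count": len(data)}


def main(req, res):
    # Tip 1: wrap the whole body so no exception escapes the function.
    try:
        result = do_work(req.payload)
        # Tip 2: every code path ends by returning a response.
        return res.json({"ok": True, "result": result})
    except Exception as exc:
        return res.json({"ok": False, "error": str(exc)})


class StubRequest:
    """Local stand-in for the runtime's request object (assumption)."""

    def __init__(self, payload):
        self.payload = payload


class StubResponse:
    """Local stand-in for the runtime's response object (assumption)."""

    def json(self, body, status=200):
        return body


print(main(StubRequest('{"a": 1}'), StubResponse()))
print(main(StubRequest("not json"), StubResponse()))
```

The stubs are there only for local experimentation; inside Appwrite the runtime supplies the real request and response objects.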

I'm waiting for version 1.4.0, which I believe will resolve the issue. I also thank the Appwrite core team and its community for the project and support.

6 participants