🐛 Bug Report: Appwrite main container freezes when executing cloud-functions. #5629
Comments
Hi
Any update on this?
I have revisited the issue, and all of it is very technical and correct. I have one concern. Considering 6 workers (1 core), you seem to get functions to freeze within 4 executions. I can't see a logical reason for that, as even with 5 function executions everything should remain functional, just slower. My theory is that if at least 1 worker is handling any request other than a synchronous function execution, the stack should not freeze, because such a worker always finishes and becomes available for another request very quickly.
Hey Matej, thank you for taking the time to go over the issue. The problem is that the second a function calls Appwrite again when there's no available worker, the function waits until its timeout passes, which causes a waterfall of timeouts one after another. For example: Execution 4 -> can't even start, as there is no worker available to process the first request. After a few executions that bottleneck gets bigger and bigger, and the running functions block the existing workers.
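The case described in this reply, where a function calls back in while no worker is available, can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: a Python thread pool stands in for the Swoole workers, and `run_executions`, `execute_function` and `api_call` are hypothetical names, not Appwrite code. The barrier exists only to make the outcome independent of thread timing.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_executions(workers: int, executions: int, timeout: float = 0.5) -> list:
    """Run synchronous 'functions' that call back into the pool serving them."""
    if not 1 <= executions <= workers:
        raise ValueError("this sketch expects 1 <= executions <= workers")

    # Assumption: all executions arrive at once, so each one holds a worker
    # before any of them issues its nested call.
    in_step = threading.Barrier(executions)

    with ThreadPoolExecutor(max_workers=workers) as pool:

        def api_call() -> str:
            # Stand-in for an API request such as createDocument.
            return "document created"

        def execute_function() -> str:
            in_step.wait()
            nested = pool.submit(api_call)  # needs a free worker in the SAME pool
            try:
                nested.result(timeout=timeout)
                outcome = "completed"
            except FutureTimeout:
                outcome = "failed"
            # Keep the worker until every execution has its outcome, so the
            # result does not depend on thread scheduling.
            in_step.wait()
            return outcome

        futures = [pool.submit(execute_function) for _ in range(executions)]
        return [f.result() for f in futures]
```

With two workers, `run_executions(2, 2, timeout=0.2)` returns `['failed', 'failed']`, because both workers are held by executions and the nested calls sit in the queue until the timeout. `run_executions(2, 1)` returns `['completed']`, since one worker stays free to serve the nested call.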
👟 Reproduction steps
The problem
The Appwrite main container freezes and Swoole crashes for roughly 30 seconds when the Appwrite API is accessed from a cloud function.
Take this JavaScript function for example (the same applies to other platforms):
When running this function, it works perfectly.
But when trying to run it more than 3 times per second (with a benchmark or simple browser requests), the function runs until it hits its own `timeout` and responds with a `failed` status.
A workaround
When a function runs in synchronous mode, it goes through the same route as any Appwrite API endpoint, plus the additional hop required to go from the executor to the function (open-runtime) itself, like so:
User -> Traefik -> Appwrite -> Executor -> function
Then the function itself calls an Appwrite service, let's say the database one, and we have this path:
function -> internet --> Traefik -> Appwrite -> database endpoint
Something like this (the dashed line marks the sequential request).
So, what happens, from what I've observed, is that the 6 Swoole workers per core get blocked by their own executions.
Meaning, if 3 function executions happen simultaneously, then half of the workers are busy handling the `createExecute` function while the other half are handling the `createDocument` function.
But if more than half of the available workers are busy handling the `createExecute` function, for example 4 out of 6, then only 2 are left for handling the `createDocument` function, and the last 2 calls are left stuck in midair, causing a chain reaction of function timeout errors.
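The arithmetic behind this can be written down as a small helper. This is a simplified model of the description above, assuming one nested API call per running execution; `worker_demand` is a hypothetical name, and the model says nothing about how Swoole actually schedules requests.

```python
def worker_demand(total_workers: int, concurrent_executions: int) -> dict:
    """Split a worker pool between executions and their nested API calls."""
    # Workers occupied by the outer createExecute requests.
    held = min(concurrent_executions, total_workers)
    # Workers still free to serve the nested createDocument requests.
    free = total_workers - held
    # Assumption: each running execution issues exactly one nested call.
    waiting = max(0, held - free)
    return {"held": held, "free": free, "waiting": waiting}


print(worker_demand(6, 3))  # {'held': 3, 'free': 3, 'waiting': 0}
print(worker_demand(6, 4))  # {'held': 4, 'free': 2, 'waiting': 2}
print(worker_demand(6, 6))  # {'held': 6, 'free': 0, 'waiting': 6}
```

In this model, "waiting" only means the call has no worker at the moment it is issued. Whether it then times out depends on how quickly the free workers finish and pick it up.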
It also happens when `_APP_WORKER_PER_CORE` is set to a higher value. In this case Swoole freezes once the number of executions per second crosses half of the workers.
In order to prove it, I ran a stress benchmark on the server against the `createDocument` endpoint, and because no worker was taken from the inside, the requests could wait in line and were executed one after another.
Also, I deployed an Appwrite-free function and ran a stress benchmark on that one. As with `createDocument`, Appwrite was able to process all the requests in order, like a charm.
Semi-solution
The final solution for this issue will probably be decided by the Appwrite team.
In this solution I've added another container to `docker-compose.yml` that uses the same `appwrite` image as the rest. This container acts as a duplicate of the main container.
The container is also connected to the `runtimes` network, so functions can connect to the container using its internal address.
Then the user can set the function endpoint as follows:
Using this method makes sure that the function and the API the function points to use completely different sets of Swoole workers, which avoids recursive calls to the same workers. Something like this:
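The effect of separating the two sets of workers can be sketched with two thread pools. This is an illustration under assumptions, not the Appwrite setup itself: `main_pool` stands in for the main container's workers, `functions_pool` for the duplicated container's workers, and `run_with_separate_pool` is a hypothetical name.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_separate_pool(workers: int, executions: int, timeout: float = 2.0) -> list:
    """Run executions on one pool while nested API calls go to a second pool."""
    if not 1 <= executions <= workers:
        raise ValueError("this sketch expects 1 <= executions <= workers")

    # Assumption: all executions hold their worker before calling the API.
    all_holding = threading.Barrier(executions)

    with ThreadPoolExecutor(max_workers=workers) as main_pool, \
            ThreadPoolExecutor(max_workers=workers) as functions_pool:

        def api_call() -> str:
            # Stand-in for an API request such as createDocument.
            return "document created"

        def execute_function() -> str:
            all_holding.wait()
            # The nested call is served by the other pool, so it never
            # competes with the workers that are busy running executions.
            nested = functions_pool.submit(api_call)
            try:
                nested.result(timeout=timeout)
                return "completed"
            except FutureTimeout:
                return "failed"

        futures = [main_pool.submit(execute_function) for _ in range(executions)]
        return [f.result() for f in futures]
```

Even when every worker of the main pool is busy with an execution, for example `run_with_separate_pool(2, 2)`, each nested call finds a worker in the second pool and all executions complete.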
Same solution in a separate file
Many won't want to edit the `docker-compose.yml` file, as it can make the Appwrite upgrade process really annoying. In that case you can do as follows:
Create a separate file alongside `docker-compose.yml`; name it `docker-compose-functions.yml`, for example.
Use the `-f` flag to start/stop this file.
Using this method lets you upgrade Appwrite as much as you want.
External snippet
Another possible solution.
In order to avoid connecting the function container generated by open-runtime to the `appwrite_appwrite` network, the new duplicated container can be added in a similar way to the `realtime` one.
Tests
To check this assumption I ran these tests with the oha benchmark tool.
Create Document API
Command
Execute function - Just JSON
Command
Executing this function that doesn't connect to Appwrite at all.
Execute function - With Appwrite
Command:
Same as that one
Function code - connects to instance main domain.
Execute function - With Appwrite (Tunneling)
Command:
Same as that one
Function code - tunnels the client request through the internal docker-host domain `appwrite-functions`
Benchmarks table
Request per second | Just JSON | With Appwrite | With Appwrite (Tunneled) | % Faster
As you can see, the `Tunneled` way is faster in every case, and it's a must on low-budget to medium servers.
P.S. This is partially related to #4626
👍 Expected behavior
Appwrite should be able to handle all calls, including on low-budget servers.
👎 Actual Behavior
Up to `8vCPU 16GB`, Swoole is not able to handle the recursive worker calls.
🎲 Appwrite version
Different version (specify in environment)
💻 Operating system
Linux
🧱 Your Environment
I've tested from version `1.1.2` up to `1.3.5`
👀 Have you spent some time to check if this issue has been raised before?
🏢 Have you read the Code of Conduct?