🐛 Bug Report: Appwrite main container freezes when executing cloud-functions. #5629

Closed
2 tasks done
byawitz opened this issue Jun 2, 2023 · 4 comments
Labels
bug Something isn't working product / functions Fixes and upgrades for the Appwrite Functions.

Comments

@byawitz
Member

byawitz commented Jun 2, 2023

👟 Reproduction steps

The problem

The main Appwrite container freezes and Swoole crashes for roughly 30 seconds when a cloud function calls the Appwrite API.

Take this JavaScript function as an example (the same applies to other runtimes):

const sdk = require("node-appwrite");

module.exports = async function (req, res) {
    const client = new sdk.Client();

    const database = new sdk.Databases(client);

    client
        .setEndpoint('https://appwrite.domain.com/v1')
        .setProject('projectid')
        .setKey('KEY')
        .setSelfSigned(true);

    await database.createDocument('DATABASE_ID', 'COLLECTION_ID', sdk.ID.unique(), {time: (+new Date()).toString()});

    res.json({'success': true});
};

Running this function once works perfectly.

But when it is run more than 3 times per second, whether with a benchmark tool or plain browser requests, each execution runs until it hits its own timeout and responds with a failed status.

A workaround

When a function runs in synchronous mode, the request goes through the same route as any other Appwrite API endpoint, plus the extra hop from the executor to the function (open-runtime) itself, like so:

User -> Traefik -> Appwrite -> Executor -> function

The function itself then calls an Appwrite service, say the database one, which gives this path:

function -> Internet -> Traefik -> Appwrite -> database endpoint

Something like this, where the dashed lines mark the follow-up request made by the function:

flowchart BT
    A[User] --> I((Internet))
    I --> tr
    I -.-> tr
    function -.-> I

    subgraph Appwrite
        tr[[Traefik]] --> ar
        tr -.-> ar
        ar{{appwrite}} --> executor
        executor --> function
        ar -.-> database
    end

From what I've observed, the Swoole workers (6 per core) get blocked by their own executions.

Meaning, if 3 function executions happen simultaneously, half of the workers are busy handling the createExecution request while the other half handle the createDocument request.

But if more than half of the available workers are busy handling createExecution, only 2 are left to handle createDocument, and the last 2 executions are left stuck in midair, causing a chain reaction of function timeout errors.

It also happens when _APP_WORKER_PER_CORE is set to a higher value. In that case Swoole freezes once the number of executions per second crosses half of the workers.

To prove it, I ran a stress benchmark on the server against the createDocument endpoint. Because no worker was taken from the inside, the requests could wait in line and were executed one after another.

I also deployed a function that doesn't call Appwrite at all and ran a stress benchmark on that one. As with createDocument, Appwrite was able to process all the requests in order, like a charm.
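The blocking mechanism can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a model of Swoole's real scheduler: a fixed pool of workers serves a FIFO queue, all executions arrive at once, and each execution holds its worker until the API call it makes has been served by another worker. The `simulate` function and its parameters are hypothetical names chosen for this illustration.

```javascript
// Toy model of a worker pool where every "outer" request (a function
// execution) holds its worker until its "inner" request (the API call made
// by the function) has been served by another worker from the same pool.
function simulate({ workers, executions }) {
  let free = workers;   // idle workers
  const queue = [];     // FIFO queue of pending requests
  let completed = 0;    // executions that finished

  // Assumption: all executions arrive at the same moment.
  for (let id = 0; id < executions; id++) {
    queue.push({ type: "outer", id });
  }

  // Hand free workers to queued requests until nothing can move anymore.
  while (free > 0 && queue.length > 0) {
    const request = queue.shift();
    free--;
    if (request.type === "outer") {
      // The function starts and immediately queues its call back to the API.
      queue.push({ type: "inner", id: request.id });
    } else {
      // The inner call is served: its worker and the outer worker are released.
      free += 2;
      completed++;
    }
  }

  return { completed, stuck: executions - completed };
}

console.log(simulate({ workers: 6, executions: 3 }));
console.log(simulate({ workers: 6, executions: 6 }));
```

In this idealized model the pool only starves once the simultaneous executions reach the worker count, because every worker is then held by an execution waiting for a call that no worker can serve. The freeze reported above shows up earlier than that, so treat the sketch as an illustration of the mechanism rather than a reproduction of the measured threshold.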

Semi-solution

The final solution for this issue will probably be decided by the Appwrite team.

In this solution I've added another container to docker-compose.yml that uses the same appwrite image as the rest. This container acts as a duplicate of the main container.

The container is also connected to the runtimes network, so functions can connect to the container using its internal address.

  appwrite-functions:
    image: appwrite/appwrite:1.3.5
    container_name: appwrite-functions
    restart: unless-stopped
    networks:
      - appwrite
      - runtimes
    volumes:
      - appwrite-uploads:/storage/uploads:rw
      - appwrite-cache:/storage/cache:rw
      - appwrite-config:/storage/config:rw
      - appwrite-certificates:/storage/certificates:rw
      - appwrite-functions:/storage/functions:rw
    depends_on:
      - mariadb
      - redis
      - influxdb
    environment:
      - _APP_ENV
      # Same variables as in the main Appwrite container (this list can be reduced in the future)

Then the user can set the function endpoint as follows:

    client
    .setEndpoint('http://appwrite-functions/v1')
    .setProject('projectid')
    .setKey('KEY')
    .setSelfSigned(true);

This makes sure that the function and the API it points to use completely separate sets of Swoole workers, which prevents the freeze by avoiding recursive calls into the same workers. Something like this:

flowchart LR
    A[User] --> I((Internet))
    I --> tr
    function -.-> f

    subgraph Appwrite
        f{{appwrite-functions}}
        tr[[Traefik]] --> ar
        ar{{appwrite}} --> executor
        executor --> function
        f -.-> database
    end   
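To avoid hard-coding the internal address in every function, the endpoint could be read from configuration. The following is a minimal sketch, assuming two hypothetical variable names (`APPWRITE_INTERNAL_ENDPOINT` and `APPWRITE_PUBLIC_ENDPOINT`) that you would define yourself; they are not provided by Appwrite.

```javascript
// Hypothetical helper: pick the endpoint a function should call.
// Prefers the internal duplicate container and falls back to the public one.
// Both variable names are assumptions made for this example.
function resolveEndpoint(variables) {
  const internal = (variables.APPWRITE_INTERNAL_ENDPOINT || "").trim();
  if (internal !== "") {
    return internal;
  }

  const external = (variables.APPWRITE_PUBLIC_ENDPOINT || "").trim();
  if (external !== "") {
    return external;
  }

  throw new Error("No Appwrite endpoint configured");
}

// Example with a plain object standing in for the function's variables.
console.log(resolveEndpoint({
  APPWRITE_INTERNAL_ENDPOINT: "http://appwrite-functions/v1",
  APPWRITE_PUBLIC_ENDPOINT: "https://appwrite.domain.com/v1",
}));
```

The result can then be passed to `client.setEndpoint(...)`, so the same function code runs with or without the duplicate container.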
Same solution in a separate file

Many won't want to edit the `docker-compose.yml` file, as doing so can make upgrading Appwrite really annoying.

For that you can do as follows:

  1. Create another file next to your docker-compose.yml; name it docker-compose-functions.yml, for example.
  2. Add the external snippet below.
  3. Now you can use the Docker Compose -f flag to start/stop this file:
docker compose -f docker-compose-functions.yml up -d
docker compose -f docker-compose-functions.yml down

With this method you can upgrade Appwrite as often as you want.

External snippet

version: '3'

services:
  appwrite-functions:
    image: appwrite/appwrite:1.3.7
    container_name: appwrite-functions
    restart: unless-stopped
    networks:
      - appwrite
      - runtimes
    volumes:
      - appwrite-uploads:/storage/uploads:rw
      - appwrite-cache:/storage/cache:rw
      - appwrite-config:/storage/config:rw
      - appwrite-certificates:/storage/certificates:rw
      - appwrite-functions:/storage/functions:rw
    depends_on:
      - mariadb
      - redis
      - influxdb
    environment:
      - _APP_ENV

volumes:
  appwrite-cache:
    external: true
    name: appwrite-cache
  appwrite-uploads:
    external: true
    name: appwrite-uploads
  appwrite-certificates:
    external: true
    name: appwrite-certificates
  appwrite-functions:
    external: true
    name: appwrite-functions
  appwrite-config:
    external: true
    name: appwrite-config
  appwrite-executor:
    external: true
    name: appwrite-executor

networks:
  appwrite:
    name: appwrite_appwrite
    external: true
  runtimes:
    name: appwrite_runtimes
    external: true

Another possible solution

To avoid connecting the function containers generated by open-runtimes to the appwrite_appwrite network, the new duplicate container can be set up in a similar way to the realtime one.

  appwrite-functions:
    image: appwrite/appwrite:1.3.5
    container_name: appwrite-functions
    restart: unless-stopped
    labels:
      - "traefik.enable=true"
      - "traefik.constraint-label-stack=appwrite"
      - "traefik.docker.network=appwrite"
      - "traefik.http.services.appwrite_function.loadbalancer.server.port=80"
      # http
      - traefik.http.routers.appwrite_function_http.entrypoints=appwrite_web
      - traefik.http.routers.appwrite_function_http.rule=PathPrefix(`/v1/functions`)
      - traefik.http.routers.appwrite_function_http.service=appwrite_function
      # https
      - traefik.http.routers.appwrite_function_https.entrypoints=appwrite_websecure
      - traefik.http.routers.appwrite_function_https.rule=PathPrefix(`/v1/functions`)
      - traefik.http.routers.appwrite_function_https.service=appwrite_function
      - traefik.http.routers.appwrite_function_https.tls=true
      - traefik.http.routers.appwrite_function_https.tls.certresolver=dns
    networks:
      - appwrite

Tests

To check this assumption I ran these tests with the oha benchmark tool.

Create Document API

Command

oha https://appwrite.domain.com/v1/databases/DATABASE_ID/collections/COLLECTION_ID/documents \
  -z 180s \
  -m POST \
  -H "Content-Type: application/json" \
  -H "x-appwrite-project: projectid" \
  -H "X-Appwrite-Key: key" \
  -d '{"documentId": "unique()",   "data": {"time": "1"}}'  

Execute function - Just JSON

Command

oha https://appwrite.domain.com/v1/functions/63eae4aedfd2b97fef68/executions \
  -z 60s \
  -m POST \
  -H "Content-Type: application/json" \
  -H "x-appwrite-project: projectid" \
  -H "X-Appwrite-Key: key" \
  -d '{"data":"function-data"}'

This executes a function that doesn't connect to Appwrite at all:

const sdk = require("node-appwrite");

module.exports = async function (req, res) {
    res.json({success: true});
};

Execute function - With Appwrite

Command:

Same as the command used for "Execute function - Just JSON" above.

Function code: connects to the instance's main domain.

const sdk = require("node-appwrite");

module.exports = async function (req, res) {
    const client = new sdk.Client();

    const database = new sdk.Databases(client);

    client
        .setEndpoint('https://appwrite.domain.com/v1')
        .setProject('projectid')
        .setKey('KEY')
        .setSelfSigned(true);

    await database.createDocument('DATABASE_ID', 'COLLECTION_ID', sdk.ID.unique(), {time: (+new Date()).toString()});

    res.json({'success': true});
};

Execute function - With Appwrite (Tunneling)

Command:

Same as the command used for "Execute function - Just JSON" above.

Function code - tunnels the client request through the internal docker-host domain appwrite-functions

const sdk = require("node-appwrite");

module.exports = async function (req, res) {
    const client = new sdk.Client();

    const database = new sdk.Databases(client);

    client
        .setEndpoint('http://appwrite-functions/v1')
        .setProject('projectid')
        .setKey('KEY')
        .setSelfSigned(true);

    await database.createDocument('DATABASE_ID', 'COLLECTION_ID', sdk.ID.unique(), {time: (+new Date()).toString()});

    res.json({'success': true});
};

Benchmarks table

Requests per second

| Server | Create Document API | Execute function (Just JSON) | Execute function (With Appwrite) | Execute function (With Appwrite, Tunneled) | Tunneled is % faster |
|---|---|---|---|---|---|
| 1vCPU 1GB | 42.81 | 30.03 | 0.30 | 16.23 | 5400% |
| 2vCPU 2GB | 61.64 | 52.24 | 0.61 | 32.03 | 5250% |
| 4vCPU 8GB | 184.01 | 122.03 | 1.20 | 68.68 | 5720% |
| 8vCPU 16GB | 483.07 | 285.06 | 113.53 | 173.48 | 152% |
| 16vCPU 128GB | 1947.54 | 745.16 | 343.44 | 486.80 | 141% |
| 32vCPU 256GB | 3182.35 | 1389.30 | 436.87 | 893.83 | 204% |

As you can see, the tunneled setup is faster in every case, and it is a must on low-budget to medium servers.
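The formula behind the last column is not stated. The figures look like the tunneled throughput expressed as a percentage of the direct ("With Appwrite") throughput, rounded down; that reading is an assumption. A short sketch to recompute it from the values in the table (the `tunneledPercent` helper is a name made up for this example):

```javascript
// Requests per second copied from the benchmarks table.
const results = [
  { server: "1vCPU 1GB", direct: 0.30, tunneled: 16.23 },
  { server: "2vCPU 2GB", direct: 0.61, tunneled: 32.03 },
  { server: "4vCPU 8GB", direct: 1.20, tunneled: 68.68 },
  { server: "8vCPU 16GB", direct: 113.53, tunneled: 173.48 },
  { server: "16vCPU 128GB", direct: 343.44, tunneled: 486.80 },
  { server: "32vCPU 256GB", direct: 436.87, tunneled: 893.83 },
];

// Tunneled throughput as a percentage of direct throughput.
function tunneledPercent(direct, tunneled) {
  return (tunneled / direct) * 100;
}

for (const r of results) {
  console.log(`${r.server}: ${tunneledPercent(r.direct, r.tunneled).toFixed(1)}%`);
}
```

Read this way, 152% on the 8vCPU server means the tunneled setup handles about 1.5 times the requests of the direct one, while on the three smallest servers it handles more than 50 times as many.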

P.S. This is partially related to #4626

👍 Expected behavior

Appwrite should be able to handle all calls, including on low-budget servers.

👎 Actual Behavior

On servers below 8vCPU 16GB, Swoole is not able to handle the recursive worker calls.

🎲 Appwrite version

Different version (specify in environment)

💻 Operating system

Linux

🧱 Your Environment

I've tested from version 1.1.2 up to 1.3.5

👀 Have you spent some time to check if this issue has been raised before?

  • I checked and didn't find similar issue

🏢 Have you read the Code of Conduct?

@byawitz byawitz added the bug Something isn't working label Jun 2, 2023
@stnguyen90 stnguyen90 added the product / functions Fixes and upgrades for the Appwrite Functions. label Jun 2, 2023
@joeyouss

joeyouss commented Jun 5, 2023

Hi,
As discussed on Discord, we have started looking into this and discussing it. More updates soon!

@obiwanzenobi

Any update on this?

@Meldiron
Contributor

I have revisited the issue and all of it is very technical and correct.

I have one concern 🤔 Considering 6 workers (1 core), you seem to get functions to freeze within 4 executions. I can't see a logical reason for that, as even with 5 function executions, everything should remain functional, just slower.

My theory is that if at least 1 worker is handling any request other than a synchronous function execution, the stack should not freeze, because such a worker always finishes and becomes available for another request very quickly.

@byawitz
Member Author

byawitz commented Jan 30, 2024

Hey Matej, thank you for taking the time to go over the issue.

The problem is that the second a function calls Appwrite again while no worker is available, the function waits until its timeout passes, which causes a waterfall of timeouts one after another.

For example:
4 executions per second, all in the first nanosecond.
Execution 1 -> Appwrite -> Occupying first worker.
Execution 2 -> Appwrite -> Occupying second worker.
Execution 3 -> Appwrite -> Occupying third worker.
At that exact moment, all workers are occupied.

Execution 4 -> Can't even start the execution, as no worker is available to process its first request.
But even if one worker is released and the function starts to execute, it is still unable to get a worker for its own call to Appwrite, so it starts waiting for the timeout.

Like so 👇
image

After a few executions that bottleneck gets bigger and bigger, and the running functions block the existing workers.
