Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare invocation namespaces when handling a cycle and recovering a queue #5432

Merged
merged 3 commits into from
Aug 14, 2023

Conversation

style95
Copy link
Member

@style95 style95 commented Aug 7, 2023

Description

This is to fix a bug in that ETCD data for non-existent containers persist forever.

This bug happens when a shared action is invoked.
When the queue manager receives an activation message, it finds a queue based on the docId only and this is wrong.
As a result, the activation message could be sent to the wrong queue as long as the docId is same.
But the docId is same for all shared actions.

ex)

  • original action: /whisk.system/sharedPackage/hello
  • bound action1: /style95/myPackage/hello (docId: /whisk.system/sharedPackage/hello)
  • bound action2: /bdoyle/yourPackage/hello (docId: /whisk.system/sharedPackage/hello)

So an activation for /style95/myPackage/hello could be sent to /bdoyle/yourPackage/hello.
Then a memory queue will send the activation to a container for /style95/myPackage/hello.
The container is initialized with /style95/myPackage/hello and it registers the ETCD data for the running container.
But after executing the activation for /bdoyle/yourPackage/hello, it tries to remove the ETCD data for the running container using the key prefix with /bdoyle/yourPackage/hello because the key now resides in the container data(WarmData).
Accordingly, the original ETCD data for the running container is not deleted forever.

Please refer to the following logs that I found.
(The shared action is whisk.system/sharedPackage/hello and I replaced the name of two namespace to style95 and bdoyle from the original logs just to hide proprietary information.)

[2023-08-04T16:31:49.392Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [QueueManager] Got activation message 98481e836089419d881e836089d19d00 for Namespace(style95,07e316f4-9393-4777-a1d3-650da22a83f0)/whisk.system/sharedPackage/[email protected] from kafka.
[2023-08-04T16:31:49.392Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [QueueManager] [98481e836089419d881e836089d19d00] the key whisk/queue/style95/whisk.system/sharedPackage/hello/leader is not in the initRevisionMap. revision: 38-5e7fc51a452c685030fdfcec61e3bdf1
[2023-08-04T16:31:49.392Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [QueueManager] [98481e836089419d881e836089d19d00] send activation to remote queue, key: whisk/queue/style95/whisk.system/sharedPackage/hello/leader revision: 38-5e7fc51a452c685030fdfcec61e3bdf1
[2023-08-04T16:31:49.419Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [QueueManager] add a new actor selection to a map with key: whisk/queue/style95/whisk.system/sharedPackage/hello/leader
[2023-08-04T16:31:49.419Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [QueueManager] Got activation message 98481e836089419d881e836089d19d00 for Namespace(style95,07e316f4-9393-4777-a1d3-650da22a83f0)/whisk.system/sharedPackage/[email protected] from remote queue manager.
[2023-08-04T16:31:49.419Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [QueueManager] Queue for action whisk.system/sharedPackage/[email protected] is already recovered, skip
[2023-08-04T16:31:49.419Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [MemoryQueue] [bdoyle:whisk.system/sharedPackage/[email protected]:Running] got a new activation message 98481e836089419d881e836089d19d00

As you can see above the activation(98481e836089419d881e836089d19d00) is originally for style95 but it is finally sent to the queue for bdoyle.

And the queue sent this activation to a container with an ID(f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff).

[2023-08-04T16:31:49.686Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [MemoryQueue] [bdoyle:whisk.system/sharedPackage/[email protected]:Running] Get activation request f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff, send one message: 98481e836089419d881e836089d19d00

This container was originally created for bdoyle.

[2023-08-04T16:31:49.623Z] [INFO] [#tid_sid_unknown] [FunctionPullingContainerPool] received a container creation message: f3dc23dbb34c43889c23dbb34c93882e
[2023-08-04T16:31:49.683Z] [INFO] [#tid_sid_unknown] [WatcherService] watch endpoint: WatchEndpoint(whisk/namespace/bdoyle/whisk.system/sharedPackage/hello/38-5e7fc51a452c685030fdfcec61e3bdf1/invoker0/container/f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff,,false,data-management-service,Set(DeleteEvent))
[2023-08-04T16:31:49.685Z] [INFO] [#tid_sid_unknown] [FPCInvokerReactive] Posted success ack of container creation f3dc23dbb34c43889c23dbb34c93882e for bdoyle/whisk.system/sharedPackage/[email protected]

But when the same container(f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff) is being paused, it tries to unwatch endpoint based on the style95 key because it executed an activation for style95 namespace.

[2023-08-04T16:31:49.688Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [FunctionPullingContainerProxy] received a message 98481e836089419d881e836089d19d00 for whisk.system/sharedPackage/[email protected] in ClientCreated
[2023-08-04T16:31:49.688Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [DockerContainer] sending initialization to ContainerId(f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff) ContainerAddress(172.17.0.34,8080)
[2023-08-04T16:31:49.888Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [DockerContainer] initialization result: ok
[2023-08-04T16:31:49.888Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [DockerContainer] sending arguments to /whisk.system/sharedPackage/hello at ContainerId(f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff) ContainerAddress(172.17.0.34,8080)
[2023-08-04T16:31:49.992Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [DockerContainer] running result: ok
[2023-08-04T16:31:49.992Z] [INFO] [#tid_5de45169ee33b5678a8067431ba09ca8] [MessagingActiveAck] posted completion of activation 98481e836089419d881e836089d19d00


[2023-08-04T16:32:00.007Z] [INFO] [#tid_sid_unknown] [FunctionPullingContainerProxy] No more run activation is coming in state: Running, action: ExecutableWhiskAction/whisk.system/sharedPackage/[email protected], container: ContainerId(f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff)
[2023-08-04T16:32:00.472Z] [INFO] [#tid_sid_invokerNanny] [RuncClient] running /usr/bin/docker-runc pause f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff (timeout: 10 seconds)
[2023-08-04T16:32:00.544Z] [INFO] [#tid_sid_unknown] [WatcherService] watch endpoint: WatchEndpoint(whisk/warmed/style95/whisk.system/sharedPackage/hello/38-5e7fc51a452c685030fdfcec61e3bdf1/invoker/0/container/f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff,,false,data-management-service,Set(DeleteEvent))
[2023-08-04T16:32:00.544Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(whisk/namespace/style95/whisk.system/sharedPackage/hello/38-5e7fc51a452c685030fdfcec61e3bdf1/invoker0/container/f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff,false,data-management-service,true)

Consequently, the data for bdoyle(whisk/namespace/bdoyle/whisk.system/sharedPackage/hello/38-5e7fc51a452c685030fdfcec61e3bdf1/invoker0/container/f372eb3f960da72c4bff75110d24b0c5b72d6d306d3c31e59f4c10f90ccefdff) is not removed forever unless the invoker is restarted.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@style95 style95 added the bug label Aug 7, 2023
@style95 style95 requested a review from bdoyle0182 August 7, 2023 22:23
@bdoyle0182
Copy link
Contributor

Huge! This was the only remaining bug I had identified from my use.

@bdoyle0182
Copy link
Contributor

Is there anything that can be unit tested here?

@style95
Copy link
Member Author

style95 commented Aug 7, 2023

Is there anything that can be unit tested here?

I tried to add a unit test but it seems it's not easy to reproduce the situation based on a unit test.

@codecov-commenter
Copy link

codecov-commenter commented Aug 7, 2023

Codecov Report

Merging #5432 (bedd28d) into master (54564cb) will decrease coverage by 0.28%.
The diff coverage is 100.00%.

❗ Current head bedd28d differs from pull request most recent head 9566e00. Consider uploading reports for the commit 9566e00 to get more accurate results

@@            Coverage Diff             @@
##           master    #5432      +/-   ##
==========================================
- Coverage   76.81%   76.54%   -0.28%     
==========================================
  Files         241      241              
  Lines       14630    14632       +2     
  Branches      616      616              
==========================================
- Hits        11238    11200      -38     
- Misses       3392     3432      +40     
Files Changed Coverage Δ
.../openwhisk/core/scheduler/queue/QueueManager.scala 81.61% <100.00%> (+0.11%) ⬆️

... and 10 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bdoyle0182
Copy link
Contributor

Is there anything that can be unit tested here?

I tried to add a unit test but it seems it's not easy to reproduce the situation based on a unit test.

LGTM then. This is a pretty critical bug since an activation can be delivered to a wrong container so should get this merged as quickly as possible.

@style95
Copy link
Member Author

style95 commented Aug 8, 2023

It seems there is a problem in multi-runtime tests.

@bdoyle0182
Copy link
Contributor

I'm not sure why the multiruntime tests are failing now

@style95
Copy link
Member Author

style95 commented Aug 12, 2023

As I shared via the dev list, this change is urgent and not related to anything about multi-runtime tests, I would merge this PR without passing the tests in 24 hours unless there is any objection.

@rabbah
Copy link
Member

rabbah commented Aug 12, 2023

Change lgtm.

@style95 style95 merged commit 9371d62 into apache:master Aug 14, 2023
5 of 6 checks passed
mtt-merz pushed a commit to mtt-merz/openwhisk that referenced this pull request Oct 22, 2023
…queue (apache#5432)

* Compare invocation namespaces when handling a cycle and recovering a queue

* Temporarily enable upterm session for debugging

* Revert the upterm change

(cherry picked from commit 9371d62)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants