Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement an ElasticSearchActivationStore #4724

Merged
merged 20 commits into from
Mar 3, 2020

Conversation

jiangpengcheng
Copy link
Contributor

@jiangpengcheng jiangpengcheng commented Nov 14, 2019

Implement an ElasticSearchActivationStore

Description

Although currently we can use ArtifactWithFileStorageActivationStore to store activations into ES, but this need configure logstash manually, and it will still save activations in CouchDb

This PR will save activations into ElasticSearch directly, an then CouchDb will only serve for subjects&actions, someone may need this

It also provides ansible scripts to deploy a simple ES cluster for test/dev purpose

The activation stored in ES is like below(the _source field):

{
  "_index": "openwhisk-test_@shared_1911",
  "_type": "_doc",
  "_id": "jiangpengcheng/6138d67781a34b9fb8d67781a35b9f80",
  "_version": 1,
  "_score": 11.635652,
  "_source": {
    "@timestamp": "2019-11-20T06:57:04.282Z",
    "activationId": "6138d67781a34b9fb8d67781a35b9f80",
    "annotations": {
      "path": "jiangpengcheng/hello",
      "waitTime": 78,
      "kind": "nodejs:6",
      "timeout": false,
      "limits": {
        "concurrency": 1,
        "logs": 1,
        "memory": 256,
        "timeout": 60000
      },
      "initTime": 42
    },
    "duration": 57,
    "end": 1574233024339,
    "entityType": "activation",
    "logs": [
      "2019-11-20T06:57:04.337733554Z stdout: undefined"
    ],
    "name": "hello",
    "namespace": "jiangpengcheng",
    "path": "jiangpengcheng/hello",
    "publish": false,
    "response": {
      "result": "{}",
      "statusCode": 0
    },
    "start": 1574233024282,
    "subject": "jiangpengcheng",
    "updated": 1574233024340,
    "version": "0.0.1"
  },
  "fields": {
    "@timestamp": [
      "2019-11-20T06:57:04.282Z"
    ]
  },
  "highlight": {
    "namespace": [
      "@kibana-highlighted-field@jiangpengcheng@/kibana-highlighted-field@"
    ]
  }
}

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@style95
Copy link
Member

style95 commented Nov 14, 2019

FTR, with this change, we can take advantage of ElasticSearch.
So we can do full-text searching against Activation including logs, drawing graphs or diagrams with Kibana.
And users can have their own custom dashboards with Kibana and so on.

dir:
become: "{{ elastic_dir_become | default(false) }}"
base_volume: "{{ elastic_base_volume | default('esdata') }}"
cluster_name: "{{ elastic_cluster_name | default('lambda') }}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be changed.

@@ -0,0 +1,20 @@
# Licensed to the Apache Software Foundation (ASF) under one or more contributor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use the short version of the Apache license header anymore.

with ActivationStoreBehavior {

override def checkDeleteActivation(activation: WhiskActivation)(implicit transid: TransactionId): Unit = {
retry(super.checkDeleteActivation(activation), 10)
Copy link
Contributor Author

@jiangpengcheng jiangpengcheng Nov 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sometimes delete will get a not found error even we can get a 200 response with a get, didn't know the reason

export ELASTIC_INDEX_PATTERN="unittest-%s"
export ELASTIC_USERNAME="admin"
export ELASTIC_PASSWORD="admin"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether it is ok to add codes here, but without these lines, tests for ElasticSearchActivationStore will be skipped and it can introduce some risk to the system

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For test we can also look into using TestContainers ElasticSearch support. Then you would not need to deploy an ES container via ansible just for test runs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, this looks great! I will look into it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while using TestContainers, the gradle process will be time out cause of GC overhead limit exceed issue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiangpengcheng could you please give more details? Even better if you can push a branch.
Testcontainers is not very memory greedy, so I believe this is something else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bsideup sorry to misleading you, actually the error is related to my environment, after do ./gradlew clean, it worked

new Batcher(500, maxOpenDbRequests)(doStore(_)(TransactionId.dbBatcher))

private val minStart = 0L
private val maxStart = Instant.now.toEpochMilli + 100L * 365 * 24 * 60 * 60 * 1000 //100 years from now
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used for range query when since or upto is a None

}

private def generateIndex(namespace: String): String = {
elasticSearchConfig.indexPattern.dropWhile(_ == '/') format namespace.toLowerCase
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can provide a custom pattern to store activations in special indices

for example, with pattern "openwhisk-%s", activations in different namespaces will be saved in to different indices

or use "openwhisk" with out a "%s", then all activations will be saved into the "openwhisk" index

we can even use index alias like:

  • openwhisk-whisk.system(this is an alias and below are indices it connects to)

    • openwhisk-whisk.system_201911(write index)(activations will be write to this index only while list/get activations search all indices)
    • openwhisk-whisk.system_201910
    • openwhisk-whisk.system_201909
  • openwhisk-ns1

    • openwhisk-ns1_201911(write index)
    • openwhisk-ns1_201910
    • openwhisk-ns1_201909
  • openwhisk-ns2(we can use shared indices for namespaces which are not very active)

    • openwhisk@shared_201911(write index)
    • openwhisk@shared_201910
    • openwhisk@shared_201909
  • openwhisk-ns3

    • openwhisk@shared_201911(write index)
    • openwhisk@shared_201910
    • openwhisk@shared_201909

@codecov-io
Copy link

codecov-io commented Nov 15, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@809514e). Click here to learn what that means.
The diff coverage is 70%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #4724   +/-   ##
=========================================
  Coverage          ?   78.62%           
=========================================
  Files             ?      198           
  Lines             ?     9156           
  Branches          ?      631           
=========================================
  Hits              ?     7199           
  Misses            ?     1957           
  Partials          ?        0
Impacted Files Coverage Δ
...openwhisk/common/tracing/OpenTracingProvider.scala 21.15% <ø> (ø)
...che/openwhisk/core/yarn/YARNContainerFactory.scala 87.5% <ø> (ø)
...e/loadBalancer/ShardingContainerPoolBalancer.scala 88.7% <ø> (ø)
...che/openwhisk/core/loadBalancer/LeanBalancer.scala 0% <ø> (ø)
...apache/openwhisk/core/entitlement/Collection.scala 87.5% <ø> (ø)
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala 7.69% <ø> (ø)
...isk/core/controller/actions/PrimitiveActions.scala 87.24% <ø> (ø)
...rg/apache/openwhisk/core/controller/Triggers.scala 94.16% <ø> (ø)
.../openwhisk/core/entity/ActivationEntityLimit.scala 100% <ø> (ø)
.../scala/org/apache/openwhisk/core/entity/Exec.scala 90.62% <ø> (ø)
... and 60 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 809514e...34b7e27. Read the comment docs.

Copy link
Member

@chetanmeh chetanmeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass on the code part (excluding the ansible setup). Looks good! Wanted to confirm if caching support is actually needed for activation store?

common/scala/build.gradle Outdated Show resolved Hide resolved
ErrorLevel))
}

override def delete(activationId: ActivationId, context: UserContext)(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later we should review the usage for delete as activation records are immutable and removed via expiry instead of explicit removal. So technically we can drop the delete support

@chetanmeh
Copy link
Member

Would be useful to have the actual record structure in ElasticSearch documented as part of PR description

@style95
Copy link
Member

style95 commented Nov 29, 2019

@jiangpengcheng Could you address @chetanmeh's comment?
#4724 (comment)

@style95
Copy link
Member

style95 commented Dec 10, 2019

@chetanmeh Any other opinion on this?

@rabbah
Copy link
Member

rabbah commented Dec 20, 2019

This is fantastic!

@chetanmeh
Copy link
Member

@chetanmeh Any other opinion on this?

Missed reviewing it. Added few more comments

@style95
Copy link
Member

style95 commented Dec 23, 2019

@chetanmeh Thank you for the comments.

@style95
Copy link
Member

style95 commented Jan 16, 2020

@jiangpengcheng Could you handle the review points?

@jiangpengcheng
Copy link
Contributor Author

already updated @style95

Copy link
Member

@style95 style95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@chetanmeh chetanmeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. Added few more comments.

Looking at coverage report one thing looks odd

image

2 lines which do not have any coverage somehow indicates that flow hits the Try part first but due to some exception fallbacks to else part. Is that some issue on coverage side or test only exercise one part of the flow?

start,
s"[PUT] 'activations' failed to put document: '${activation.docid}'; http status: '${res.status}'",
ErrorLevel)
throw new Exception("Unexpected http response code: " + res.status)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use a specific exception here. In general when store impls throw exception they throw exception extending ArtifactStoreException such that in StoreUtils.reportFailure those exception are not rehandled. In current impl transid.failed would be invoked twice and thus would result in wrong metrics being recorded

Copy link
Contributor Author

@jiangpengcheng jiangpengcheng Jan 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, great point, sorry I missed it

about the odd coverage report, currently it only apply unittests from org.apache.openwhisk.core.database.test.behavior.ActivationStoreBehavior, in which mock activations' annotations and response.result fields are all null, so Try will failed and fallbacks to else part, I will add some fake result and annotations to mock activations

@chetanmeh
Copy link
Member

@jiangpengcheng Thanks for taking care of the review feedback. Overall (barring ansible part which I was not able to have a closer look) it looks good to me.

Feel free to merge once you are done with this

Copy link
Member

@rabbah rabbah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mind adding a short set of instructions for how to try this with a local deployment?

@@ -36,6 +36,9 @@ invoker1 ansible_host={{ docker_machine_ip }}
[apigateway]
{{ docker_machine_ip }} ansible_host={{ docker_machine_ip }}

[elasticsearch:children]
db
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the meaning of db here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means elasticsearch nodes will be deployed on the same hosts specified in the db section.

@jiangpengcheng
Copy link
Contributor Author

do you mind adding a short set of instructions for how to try this with a local deployment?

sorry for the late response, shall I write the instruction to a document or in this PR's description?

@style95
Copy link
Member

style95 commented Feb 11, 2020

@jiangpengcheng I think we anyway need a document to describe how to use it.

ansible/README.md Outdated Show resolved Hide resolved
ansible/README.md Outdated Show resolved Hide resolved
ansible/README.md Outdated Show resolved Hide resolved
ansible/README.md Outdated Show resolved Hide resolved
ansible/README.md Outdated Show resolved Hide resolved
@rabbah
Copy link
Member

rabbah commented Feb 19, 2020

Thanks for adding the instructions. I have restarted Travis.

Copy link
Member

@rabbah rabbah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Remove vagrant change since we removed all vagrant mentions in the project. Also did you consider adding a new target for wskdev to start es?

ansible/environments/vagrant/hosts Outdated Show resolved Hide resolved
@style95
Copy link
Member

style95 commented Feb 27, 2020

It seems it is ready to merge.

@style95 style95 merged commit 05ed4e1 into apache:master Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants