Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Chore] Update Prometheus #7416

Merged
merged 8 commits into from
Jun 10, 2024
Merged

[Chore] Update Prometheus #7416

merged 8 commits into from
Jun 10, 2024

Conversation

alanprot
Copy link
Contributor

@alanprot alanprot commented Jun 4, 2024

Changes

We pinned github.com/sercand/kuberesolver/v4 => github.com/sercand/kuberesolver/v5 v5.1.1 in order to be able to update google.golang.org/grpc needed by opentelemetry-go (and consequently removed the replace for google.golang.org/grpc)

Breaking Changes on Prometheus

Verification

Breaking Changes on the new go-grpc:

  • From here FailOnNonTempDialError, WithBlock, and WithReturnConnectionError are three DialOptions that are only supported by Dial because they only affect the behavior of Dial itself. WithBlock causes Dial to wait until the ClientConn reports its State as connectivity.Connected. The other two deal with returning connection errors before the timeout (WithTimeout or on the context when using DialContext).
    • Those options were added to work around an regression on GRPC: see this and this

@alanprot alanprot changed the title update thanos [Chore] Update Prometheus Jun 4, 2024
@alanprot alanprot force-pushed the update-prometheus branch 3 times, most recently from 3c55ace to 6c34cfa Compare June 4, 2024 23:31
@alanprot alanprot marked this pull request as ready for review June 4, 2024 23:53
@yeya24
Copy link
Contributor

yeya24 commented Jun 5, 2024

@alanprot OK I just checked and I think the E2E test failure is related. The test tries to spin up a Prometheus locally using Prometheus binary.

I logged the command in my local path.

prometheus --storage.tsdb.retention=2d --storage.tsdb.path=/var/folders/rd/vh0cfrk13jsf8pnnqgz9yrg40000gr/T/prometheus-test3289092244 --web.listen-address=localhost:54196 --web.route-prefix= --web.enable-admin-api --config.file=/var/folders/rd/vh0cfrk13jsf8pnnqgz9yrg40000gr/T/prometheus-test3289092244/prometheus.yml

And it failed with error below.

ts=2024-06-05T06:07:29.433Z caller=main.go:521 level=error msg="Error loading config (--config.file=/var/folders/rd/vh0cfrk13jsf8pnnqgz9yrg40000gr/T/prometheus-test3289092244/prometheus.yml)" file=/var/folders/rd/vh0cfrk13jsf8pnnqgz9yrg40000gr/T/prometheus-test3289092244/prometheus.yml err="parsing YAML file /var/folders/rd/vh0cfrk13jsf8pnnqgz9yrg40000gr/T/prometheus-test3289092244/prometheus.yml: yaml: unmarshal errors:\n  line 2: field rule_query_offset not found in type config.plain"

The config is:

global:
    rule_query_offset: 0s
    external_labels:
        az: "1"
        region: eu-west

rule_query_offset is a new config and it doesn't have omit_empty so it makes old Prometheus version fail. https://github.com/prometheus/prometheus/blob/main/config/config.go#L402.

I think what we can do is, rather than using global config to write the Prometheus config file here, we can write with plain yaml similar to this

@yeya24
Copy link
Contributor

yeya24 commented Jun 5, 2024

Let's update Prometheus version to prometheus/prometheus#14216 and see if it fixes it?

@alanprot
Copy link
Contributor Author

alanprot commented Jun 5, 2024

Let's update Prometheus version to prometheus/prometheus#14216 and see if it fixes it?

Will do.. but i think there is another braking change that we will need to PR on prometheus.. I will push the change with the new prometheus for now but the build will fail.

prometheus/common#538 (comment)

I pinned the common to work around this issue.

@alanprot
Copy link
Contributor Author

alanprot commented Jun 5, 2024

=== NAME  TestRangeQueryShardingWithRandomData
    testutil.go:91: query_frontend_test.go:667: ""
        
        	exp: model.Matrix(nil)
        
        	got: model.Matrix{}
        
        Diff

Seems related to: prometheus/prometheus#13993

=/
Need to understand why.

@alanprot alanprot force-pushed the update-prometheus branch 2 times, most recently from 472b5d7 to bf38079 Compare June 5, 2024 21:20
@alanprot
Copy link
Contributor Author

alanprot commented Jun 5, 2024

Ok..

We are not using jsoniter to serialize response as the upstream prometheus is... it explains why the null is being serialized as null as the fix for prometheus/prometheus#13993 was actually done prometheus/prometheus#13997 and it was a fix on the json serializer...

I'm changing thanos to use jsoniter as it has multiple benefits anyway.

@alanprot alanprot force-pushed the update-prometheus branch 3 times, most recently from bc1338a to 15f804e Compare June 5, 2024 22:05
yeya24
yeya24 previously approved these changes Jun 7, 2024
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@alanprot
Copy link
Contributor Author

alanprot commented Jun 7, 2024

@GiedriusS @fpetkovski Can you guys take a peek on this one?

@fpetkovski
Copy link
Contributor

We are seeing problems with grpc 1.64.0: grpc/grpc-go#7314. The issue in Thanos happens when we try to close stale connections here: https://github.com/thanos-io/thanos/blob/main/pkg/query/endpointset.go#L442.

Maybe it's better to pin grpc to 1.63.2?

@alanprot
Copy link
Contributor Author

We are seeing problems with grpc 1.64.0: grpc/grpc-go#7314. The issue in Thanos happens when we try to close stale connections here: https://github.com/thanos-io/thanos/blob/main/pkg/query/endpointset.go#L442.

Maybe it's better to pin grpc to 1.63.2?

Ok.. lemme do this.

yeya24
yeya24 previously approved these changes Jun 10, 2024
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Merge on green

@yeya24 yeya24 enabled auto-merge (squash) June 10, 2024 20:54
Signed-off-by: alanprot <[email protected]>
auto-merge was automatically disabled June 10, 2024 21:31

Head branch was pushed to by a user without write access

Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yeya24 yeya24 merged commit 882d6a1 into thanos-io:main Jun 10, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants