
store: crashing after upgrade to 0.3.0 #829

Closed
R4scal opened this issue Feb 10, 2019 · 16 comments

R4scal commented Feb 10, 2019

Hi

thanos, version 0.3.0 (branch: HEAD, revision: 837e9671737698bf1778a4a9abfebbf96117a0be)
  build user:       root@986454de7a63
  build date:       20190208-15:23:51
  go version:       go1.11.5

I have 24h of storage in Prometheus and use Thanos for long-term storage. After upgrading Thanos to 0.3.0, querying an interval of more than 24h crashes the Thanos store:

 level=debug ts=2019-02-10T06:54:56.452875653Z caller=bucket.go:653 msg="Blocks source resolutions" blocks=6 mint=1549608897000 maxt=1549781697000 lset="{environment=\"prod\",replica=\"A\",service=\"sys\"}" spans="Range: 1549497600000-1549771200000 Resolution: 0"
 panic: runtime error: slice bounds out of range
 goroutine 981 [running]:
 github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).loadChunks(0xc05f438900, 0x127d1e0, 0xc003f29080, 0xc01ac84280, 0x4, 0x20, 0x0, 0x5f48c8305ecfa1d, 0x0, 0x0)
         /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1573 +0x6d3
 github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).preload.func3(0x4346e9, 0x11757b0)
         /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1544 +0xb2
 github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc08bdd0c60, 0xc08bdd0b40, 0xc0686b08e0)
         /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
 created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
         /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xbe
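
For context, this panic is the generic runtime error Go raises when a slice expression's bounds exceed the underlying buffer; a minimal (non-Thanos) reproduction of the same class of failure:

```go
// Minimal reproduction of the panic class seen above (illustration only, not
// Thanos code): slicing past the end of a byte buffer panics at runtime.
package main

func main() {
	buf := make([]byte, 16)
	// The index says the chunk ends at byte 32, but only 16 bytes were read:
	// panics with "runtime error: slice bounds out of range".
	_ = buf[8:32]
}
```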

GiedriusS commented Feb 10, 2019

Did this happen pre-0.3.0? Are you sure you have enough RAM on that box to execute this query? If you execute sysctl vm.overcommit_memory=2 (which disables overcommitting) and perform the same action, what happens? (I assume you run on Linux.)


R4scal commented Feb 10, 2019

I tried setting sysctl -w vm.overcommit_memory=2, with no success. I have a lot of free memory.
I also tried downgrading Thanos on the store node to 0.2.1 and it works fine, no crashes on queries.

@thomasriley

Also seeing the same panic with Thanos Store after upgrading to 0.3.0:

goroutine 978 [running]:
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).loadChunks(0xc4218485a0, 0x11e87c0, 0xc4f9cbd280, 0xc45c5f6800, 0x2f, 0x100, 0x1, 0x1e9185281e8dc144, 0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1573 +0x6bf
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).preload.func3(0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1544 +0xab
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc43e7d8180, 0xc43e7d80c0, 0xc43f936510)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

I can see that Store uses a fair amount of memory during bucket initialisation and then drops off to more conservative usage. As you can see below, it does not run out of memory at the moment it crashes (7.61GB / 20GB @ 11:08:30):

[Screenshot: Thanos Store memory usage graph, 2019-02-11 11:13:04]

bwplotka added the bug label Feb 11, 2019
@bwplotka

Hm.. the problematic code path looks exactly like this: #816

This means that we ask for more bytes from object storage and the reader gives us less. We probably need some check anyway (as mentioned in the discussion in the linked ticket). But the overall state looks like a malformed block. Why would the index point to non-existent bytes? Unless we have a bug in the postings code, which was touched recently.

Does this happen on a particular block or on all of them? How often?
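
As a rough sketch of the kind of check discussed above (hypothetical helper, not the actual Thanos loadChunks code), the reader of the fetched range could fail gracefully instead of panicking when the buffer is shorter than the index claims:

```go
// Hypothetical sketch: bounds-check a chunk reference against the bytes that
// were actually fetched from object storage, instead of slicing blindly.
package main

import "fmt"

// loadChunk returns the bytes at [off, off+length) from a fetched buffer, or
// an error when the buffer is too short (e.g. the object-storage reader
// returned fewer bytes than were requested).
func loadChunk(buf []byte, off, length int) ([]byte, error) {
	if off < 0 || length < 0 || off+length > len(buf) {
		return nil, fmt.Errorf("chunk [%d:%d] outside fetched range of %d bytes", off, off+length, len(buf))
	}
	return buf[off : off+length], nil
}

func main() {
	buf := make([]byte, 16) // pretend the reader handed back only 16 bytes
	if _, err := loadChunk(buf, 8, 32); err != nil {
		fmt.Println("refusing to slice:", err) // error instead of a panic
	}
}
```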


R4scal commented Feb 11, 2019

In my case, all queries to remote storage on 0.3.0 crash. I downgraded only the store node to 0.2.1 and everything is working fine now.

@PsychoSid

I have also downgraded my storage processes to 0.2.1 and all is fine now (I have left query and compactor running as 0.3.0).

@domgreen

Is the block on which this is happening a partially uploaded block?

From the code, this should only happen if we are trying to get data that we expect to be in the block but that has not been written.


R4scal commented Feb 11, 2019

No, it's not. I don't have partially uploaded blocks (at least according to the sidecar or compact logs). But in one S3 bucket (a local Minio cluster) I have blocks from multiple Prometheus instances with different labels (replica, dc, service), and store queries to all of them failed on Thanos 0.3.0 but work fine on version 0.2.1.
I downgraded only the store service to 0.2.1. Sidecar, query and compact are running on 0.3.0.

@PsychoSid

What @R4scal says almost exactly mirrors my issue as well, although I have 2 separate environments (different buckets etc.). Store queries failed on 0.3.0, so I downgraded the store to 0.2.1 and all is OK. Everything else is at 0.3.0. I can switch versions at will if anything needs testing.


bwplotka commented Feb 11, 2019 via email

@PsychoSid

It doesn't seem tied to a particular block, but it's hard to say for sure. Anything I can run to help pinpoint it? Running it in debug shows a bunch of blocks but nothing to indicate a problem with any of them.

@bwplotka

I think this is related to this change: #753

@bwplotka

Important question. What queries are you doing exactly?


R4scal commented Feb 12, 2019

Example queries (from Grafana) that crash Thanos 0.3.0 in my case:

telegraf_internal_gather_metrics_gathered{input="disk",environment="$env",ms="$ms",service="mon"}
rate(net_bytes_recv{ms="$ms",host=~"$host", interface=~"bond[0-9]+$",environment="$env",service="sys"}[$inter])*8

bwplotka added a commit that referenced this issue Feb 13, 2019
Moved bucket e2e tests to table test.

Signed-off-by: Bartek Plotka <[email protected]>

bwplotka commented Feb 13, 2019

Thanks for all the info! Just a couple of days after the release we found the cause (thanks to you all) and hopefully fixed it: #837

(: We need to fix some issues with the negative matcher and then we will do a patch release that includes this.

domgreen added a commit that referenced this issue Feb 13, 2019
* setting the start and end to prior posting changes

* really need some tests data but this may also be the fix

* moving the start and end inside the loop, so they are not updated as we iterate over items

* Added regressions tests for #829.

Moved bucket e2e tests to table test.

Signed-off-by: Bartek Plotka <[email protected]>

* Fixed overestimation for fetching chunks and series.

Signed-off-by: Bartek Plotka <[email protected]>

* Removed wrong comment.

Signed-off-by: Bartek Plotka <[email protected]>

* changing func to match interface
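
To illustrate the loop issue described in the commit message above (a hypothetical sketch, not the actual Thanos code): if a single start/end pair is widened while iterating over items, every chunk ends up mapped to one oversized byte range; recomputing the bounds inside the loop keeps each range tied to its own chunk reference.

```go
package main

import "fmt"

type ref struct{ off, length int }

// sharedRange keeps widening one [start, end) across all refs, which
// overestimates how many bytes must be fetched for any single chunk.
func sharedRange(refs []ref) (start, end int) {
	start, end = refs[0].off, refs[0].off+refs[0].length
	for _, r := range refs {
		if r.off < start {
			start = r.off
		}
		if r.off+r.length > end {
			end = r.off + r.length
		}
	}
	return start, end
}

// perItemRanges recomputes start and end inside the loop, so each chunk gets
// exactly the range its index entry describes.
func perItemRanges(refs []ref) [][2]int {
	out := make([][2]int, 0, len(refs))
	for _, r := range refs {
		start, end := r.off, r.off+r.length
		out = append(out, [2]int{start, end})
	}
	return out
}

func main() {
	refs := []ref{{off: 0, length: 16}, {off: 4096, length: 32}}
	s, e := sharedRange(refs)
	fmt.Println("shared range:", s, e)                    // 0 4128
	fmt.Println("per-item ranges:", perItemRanges(refs)) // [[0 16] [4096 4128]]
}
```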
@bwplotka

Fixed by this: #837 (:
