Follower server redownloads segments upon server restart due to CRC change #11004
Comments
You're right, I think it's a coincidence that the clusters where we saw the issue are running a newer version of Pinot. We've jumped ahead to a commit that includes both fixes and we're still seeing the behavior. It looks like the Lucene index may also not be created deterministically:
I think this issue would also appear if we use certain transform functions (e.g. one that stores ingestion time).
What necessitates computing the CRC for all segments on each restart? Assuming no data corruption happened, it seems that any data difference was already being 'served' as valid data. How do others feel about adding a config to skip the CRC check?
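To illustrate the transform-function concern, here is a minimal sketch (not Pinot code) of how a value derived at build time changes the checksum. `build_segment_bytes` is a hypothetical stand-in for a segment build, and the CSV-like encoding is an assumption made purely for illustration.

```python
import zlib


def build_segment_bytes(rows, ingestion_time_ms):
    """Hypothetical stand-in for a segment build (not Pinot's real format).

    Each row gets an extra column holding the ingestion time, mimicking a
    transform function evaluated independently on each server.
    """
    lines = [f"{key},{value},{ingestion_time_ms}" for key, value in rows]
    return "\n".join(lines).encode("utf-8")


def segment_crc(data: bytes) -> int:
    """CRC32 of the serialized bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


rows = [("a", 1)]
# Same rows and offsets, but each server stamps its own ingestion time.
leader = build_segment_bytes(rows, 1_700_000_000_000)
follower = build_segment_bytes(rows, 1_700_000_000_250)
print(segment_crc(leader) == segment_crc(follower))
```

Both builds consume the same rows, yet the bytes differ, so a CRC comparison against the committed copy fails even though nothing is corrupted.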
Good point on the non-deterministic index and functions. We should consider always downloading from deep store for non-committing servers, or fetching the CRC from ZK and setting it into the local segment (this is kind of hacky).
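The options discussed so far can be sketched as a small decision function. This is only an illustration of the trade-off, not Pinot's actual load path: `SegmentState`, `LoadAction`, and the `skip_crc_check` flag are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LoadAction(Enum):
    LOAD_LOCAL = "load_local"
    DOWNLOAD = "download_from_deep_store"


@dataclass(frozen=True)
class SegmentState:
    zk_crc: int                # CRC recorded in ZK by the committing server
    local_crc: Optional[int]   # CRC of the local copy, None if there is none


def decide_load_action(state: SegmentState, skip_crc_check: bool = False) -> LoadAction:
    """Decide what a restarting server does with one segment.

    skip_crc_check is a hypothetical config flag: when set, a local copy is
    trusted even if its CRC differs from the one in ZK.
    """
    if state.local_crc is None:
        return LoadAction.DOWNLOAD
    if skip_crc_check or state.local_crc == state.zk_crc:
        return LoadAction.LOAD_LOCAL
    return LoadAction.DOWNLOAD


# A follower-built segment whose CRC differs from the committed one.
follower_built = SegmentState(zk_crc=0x1234, local_crc=0x5678)
print(decide_load_action(follower_built))
print(decide_load_action(follower_built, skip_crc_check=True))
```

With the check enabled, every follower-built segment with a differing CRC is redownloaded on restart; skipping the check avoids the download at the cost of no longer detecting a genuinely corrupted local copy.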
When restarting servers in our clusters running on 0.12+, we observe numerous Helix pending messages due to "Failed to Load LLC Segment". This is caused by CRC mismatch between ZK and the starting server.
Initial investigation showed that the 'leader' server commits and updates ZK, while the 'follower' server catches up and builds the segment locally with a different CRC. When a server is restarted all the segments that it 'followed' fail to load and are redownloaded from our deep store. We've confirmed that startOffset/endOffset match and the difference between two segments lies in the columns.psf file.
Logs for a segment:
This behavior isn't seen on our clusters running on a 0.11 base. Is it possible that some non-determinism was introduced in the segment build process?
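For anyone trying to reproduce the comparison, one way to narrow a mismatch down to a single file is to checksum every file in two copies of the same segment. This is a generic sketch, not a Pinot tool; the function names and the example directory paths are placeholders.

```python
import zlib
from pathlib import Path


def file_crcs(segment_dir):
    """Map each file's path (relative to segment_dir) to its CRC32."""
    root = Path(segment_dir)
    crcs = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; read in chunks for large files.
            crcs[path.relative_to(root).as_posix()] = zlib.crc32(path.read_bytes()) & 0xFFFFFFFF
    return crcs


def differing_files(leader_dir, follower_dir):
    """Return the relative paths whose CRC differs or that exist on one side only."""
    leader = file_crcs(leader_dir)
    follower = file_crcs(follower_dir)
    names = leader.keys() | follower.keys()
    return sorted(name for name in names if leader.get(name) != follower.get(name))


if __name__ == "__main__":
    # Placeholder paths: point these at the two untarred copies of one segment.
    print(differing_files("leader_segment", "follower_segment"))
```

Run against the committed copy from deep store and the follower's local build, this lists only the files that differ between the two.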