
Fixed handling of interval tree clocks #85

Merged 2 commits into bitwalker:master from tschmittni:refactored-itc-usage on Aug 25, 2018

Conversation

@tschmittni (Contributor)

These rules have to be followed to make ITCs work properly (a short sketch follows the list):

- Never store a remote clock in the local registry. There were many
  places where remote clocks were stored in the local registry; this
  is always wrong, because a remote clock either has no identity
  (if peeked) or carries the identity of another node.

- Use Clock.join() to receive causal information from remote (peeked)
  clocks. Clock.join() was not used anywhere in the Tracker, which
  means that causal information from other nodes was never merged in
  after a broadcast.

- Always call Clock.peek() before sending clocks to other replicas.
  This avoids a couple of bugs where a node ends up storing the wrong
  clock identity.

- Make sure that every node gets a properly forked clock when
  joining the cluster.
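
Below is a minimal sketch of those four rules in isolation. It assumes the ITC API that Swarm wraps in Swarm.IntervalTreeClock (seed/0, fork/1, event/1, peek/1, join/2); the variable names are illustrative, not taken from the Tracker:

```elixir
alias Swarm.IntervalTreeClock, as: Clock

# Rule 4: a joining node gets its own fork of the seed clock, so every
# node owns a distinct identity.
{local_clock, remote_clock} = Clock.fork(Clock.seed())

# Record a local event before broadcasting a change.
local_clock = Clock.event(local_clock)

# Rule 3: peek before sending -- a peeked clock carries only causal
# history, never this node's identity.
message_clock = Clock.peek(local_clock)

# Rule 2: the receiving node merges the remote causal history into its
# own clock with join/2; it never replaces its clock with the remote one.
remote_clock = Clock.join(remote_clock, message_clock)

# Rule 1 follows from the above: only peeked, identity-free clocks ever
# leave a node or get stored alongside registry entries.
```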

Removed the handle_replica calls for :add_meta and :remove_meta: clocks
are stored with the metadata itself, and those handlers cannot determine
the causal ordering of add and remove events. :update_meta is now used
for all replica events.
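
To illustrate why a single :update_meta replica event can cover all cases, here is a hedged sketch of per-entry resolution using the clock stored with the metadata. The module, the map-based registry, and handle_replica_update/4 are hypothetical names, assuming only the leq/2 comparison from Swarm.IntervalTreeClock:

```elixir
defmodule ReplicaSketch do
  alias Swarm.IntervalTreeClock, as: Clock

  # Registry modeled as a plain map of name => {meta, clock}; this shape
  # and handle_replica_update/4 are illustrative, not Swarm's actual API.
  def handle_replica_update(registry, name, new_meta, remote_clock) do
    case Map.get(registry, name) do
      nil ->
        # First time we see this entry: store the meta with its clock.
        Map.put(registry, name, {new_meta, remote_clock})

      {_old_meta, local_clock} ->
        if Clock.leq(local_clock, remote_clock) do
          # The remote event causally dominates our copy: accept it.
          Map.put(registry, name, {new_meta, remote_clock})
        else
          # Our copy is newer or concurrent: keep it (a real
          # implementation would resolve concurrent updates explicitly).
          registry
        end
    end
  end
end
```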

Changed broadcast_event and added a get_registry_snapshot wrapper to
make sure that only peeked clocks are sent over to other nodes.
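
A minimal sketch of that snapshot idea, assuming a registry modeled as name => {meta, clock} pairs; the module and function shape are illustrative, not the actual wrapper:

```elixir
defmodule SnapshotSketch do
  alias Swarm.IntervalTreeClock, as: Clock

  # Strip the clock identity from every entry before the snapshot
  # leaves this node, so no identity ever crosses the wire.
  def get_registry_snapshot(registry) do
    for {name, {meta, clock}} <- registry, into: %{} do
      {name, {meta, Clock.peek(clock)}}
    end
  end
end
```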

The TrackerState clock is now only used to hold the clock identity.
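
A hedged sketch of that separation, with illustrative field names rather than the actual TrackerState struct:

```elixir
defmodule TrackerStateSketch do
  # The clock field exists only to hold this node's ITC identity; the
  # causal history of each entry travels with the entry itself.
  defstruct clock: nil,
            registry: %{}
end
```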

@tschmittni (Contributor, Author)

This change should address the issues I outlined here:
#21

- Add tests for handling replication events:
  track, untrack and update_meta

- Add tests for syncing registries
@tschmittni (Contributor, Author)

Added Tracker tests to validate handling of replication events and syncing registries

@radu-iviniciu (Contributor)

We have been using this change in production for a couple of weeks now, without any problems. It really fixes the occasional issues we saw with the clock handling.

@tschmittni (Contributor, Author)

@bitwalker have you had time to look at this change?

@arjan (Collaborator) commented Jul 23, 2018

@tschmittni I'm trying to get all Swarm tests passing, and some previously failing tests seem to be fixed on your ITC branch, which is awesome :-)

However, in your most recent commit, in which you added test/tracker_sync_test.exs, that particular test file has 4 out of 7 tests failing. Are you still working on this, or should I investigate myself?

@tschmittni (Contributor, Author)

Thank you!

It looks like the tests are not working in "clustered" mode. But you can run them with:
mix test test/tracker_sync_test.exs --exclude clustered

Let me know if you want me to change anything with the tests. Otherwise, we can also just drop them if you prefer another type of test.

@arjan (Collaborator) commented Jul 24, 2018

Thanks, that helps. The correct branch is itc/refactored-itc-usage-tests2, right?

@tschmittni (Contributor, Author)

The latest branch should be the one in this pull request: https://github.com/tschmittni/swarm/tree/refactored-itc-usage

@arjan (Collaborator) commented Jul 24, 2018

I'll see what I can do to get them running with :clustered; they're the only tests right now that fail.

arjan mentioned this pull request on Jul 24, 2018
@arjan (Collaborator) commented Jul 24, 2018

Fixed the tests; see my PR #94.

bitwalker merged commit a606566 into bitwalker:master on Aug 25, 2018
arjan mentioned this pull request on Sep 16, 2018
tschmittni added a commit to tschmittni/swarm that referenced this pull request on Nov 14, 2018:

This is a follow-up to bitwalker#85.

I forgot to change the handle_topology_change method, which is, as far
as I can tell, also the root cause of bitwalker#106.