
WIP: Storm db manager #481

Closed · wants to merge 91 commits

Conversation

@lfdversluis (Contributor) commented May 25, 2016

I'm opening this PR so people can view the progress on this work and provide feedback and/or suggestions.

This will become the database manager used in Tribler/Dispersy to do IO on a separate thread and gain performance.
A small performance indicator: the current synchronous, blocking tests take 635-640 seconds to complete. The new code, with 18 more tests, takes 375 seconds.
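The separate-IO-thread idea can be illustrated with a small stdlib-only sketch; a single-worker `ThreadPoolExecutor` stands in for Twisted's thread pool and `concurrent.futures.Future` for a Deferred, and all names here are illustrative rather than the PR's actual API:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

class ThreadedDBManager:
    """Sketch: run every SQLite call on one dedicated worker thread."""

    def __init__(self, path=":memory:"):
        # One worker only: a default sqlite3 connection must be used
        # from the thread that created it, and a single worker also
        # serializes all queries.
        self._pool = ThreadPoolExecutor(max_workers=1)
        # Create the connection on the worker thread itself.
        self._conn = self._pool.submit(sqlite3.connect, path).result()

    def execute(self, sql, params=()):
        # Returns a Future, analogous to the Deferreds described above.
        return self._pool.submit(self._conn.execute, sql, params)

    def fetchall(self, sql, params=()):
        return self._pool.submit(
            lambda: self._conn.execute(sql, params).fetchall())

    def close(self):
        self._pool.submit(self._conn.close).result()
        self._pool.shutdown()
```

Limiting the pool to one worker plays the same role here as the PR's DeferredLock: it keeps database access serialized while the caller's thread stays free.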

Adds the Storm DB manager (feature).
Adds 18 tests

Fixes #488 Fixes #484

Refactoring progress:

File | Uses Stormdb? (synchronous and blocking)
--- | ---
dispersy.py |
community.py |
multichain DB | x
DispersyDatabase |
statistics.py |
node.py |
test_classification.py |
test_sync.py |
test_undo.py |
tracker/community.py |
database.py |

Database function | Returning deferred? | Running on threadpool?
--- | --- | ---
execute | |
execute_many | |
insert | |
fetchone | |
fetchall | |
executescript | |
insert_many | |
delete | |
count | |

Impact changes for Tribler et al.

dispersy_auto_load was previously a getter and setter using @property. I've left the getter untouched, but it now returns a deferred. The setter, however, is now a function, set_dispersy_auto_load, which also returns a deferred. This setter was only used in a test, but is probably used more in Tribler.
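This getter/setter change can be sketched as follows; a stdlib `Future` stands in for Twisted's Deferred, and the `Community` class and `_auto_load` attribute are hypothetical stand-ins, not the actual dispersy code:

```python
from concurrent.futures import Future

class Community:
    def __init__(self):
        self._auto_load = True

    @property
    def dispersy_auto_load(self):
        # The getter keeps its property form, but now hands back a
        # deferred-like object instead of the raw value.
        f = Future()
        f.set_result(self._auto_load)
        return f

    def set_dispersy_auto_load(self, value):
        # A property setter cannot return anything to the caller, so
        # the setter becomes an explicit method that returns a
        # deferred-like object too.
        f = Future()
        self._auto_load = value
        f.set_result(value)
        return f
```

Callers that previously wrote `community.dispersy_auto_load = False` would now call `community.set_dispersy_auto_load(False)` and, if needed, wait on the returned result.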

```python
# Create a DeferredLock that should be used by callers to schedule their call.
self.db_lock = DeferredLock()
```

```python
self._version = 1
```
Contributor

On the current code, the initial database version is 0. Are you changing this for any particular reason?

Contributor Author

In sqlitecachedb they set it to 1 in the _open_connection function if the query fails. That's why I went for 1 here.

Contributor

I was looking at the dispersy db manager.

Contributor

You will have to be careful to keep it consistent when refactoring. I don't know if 0/1 are actually checked and acted upon in Tribler and/or dispersy.

@lfdversluis (Contributor Author), May 27, 2016

If the version is below 17 in Tribler, the database is apparently too old to be migrated, so 0 or 1 will make no difference. In dispersy, the database_version method (property) is used and checked for 0; if it's 0, it reads "# setup new database with current database_version".

Contributor

That's what I meant: if you refactor Tribler/dispersy to use the same db manager, and one of them used to give special meaning to 0, 1, or whatnot, you will need to make extra sure the meaning gets changed everywhere to mean the same. Just a friendly reminder.

@whirm (Contributor) commented May 27, 2016

On Monday I will set up an AllChannel experiment runner for dispersy too (it will use the latest Tribler devel with the dispersy from the PR being tested).

This way you will be able to compare it with the upstream results.

@qstokkink This will be very useful for your PRs too.

@lfdversluis (Contributor Author)

Awesome, it's good to have more assurance that changes do not break the current workings.

@qstokkink (Contributor)

@whirm 🎉 Nice, this leaves me more time to do other things.

@lfdversluis force-pushed the storm-db-manager branch 2 times, most recently from 30ea219 to 5da399c (May 30, 2016).
@whirm (Contributor) commented May 30, 2016

retest this please

@whirm (Contributor) commented Jun 6, 2016

retest this please

@lfdversluis (Contributor Author) commented Jun 7, 2016

@whirm (Contributor) commented Jun 7, 2016

The Allchannel experiment failed so it aborted the whole stage.

I moved the experiment to a second phase until I have time to deploy the automatic cluster selection stuff I made.

@lfdversluis (Contributor Author)

Quite the timing; I was about to ask whether the allchannel experiment was timing out, and therefore the build. Thanks for your reply :D

@lfdversluis (Contributor Author) commented Jun 7, 2016

@whirm Why does one push trigger two builds? https://gyazo.com/771ba958cf9dfc517da4095d3807ce69
Here one succeeded and one failed, so does that mean something is still off?

https://jenkins.tribler.org/job/dispersy/job/GH/job/PR/job/Unit_tests_linux/1073/
and https://jenkins.tribler.org/job/dispersy/job/GH/job/PR/job/Unit_tests_linux/1074/ were both triggered by one push: both report commit 8e1e722

@whirm (Contributor) commented Jun 7, 2016

Turns out both the master multijob and the Linux tester job had the GHPRB trigger enabled, so it has been running twice since the beginning.

@whirm (Contributor) commented Jun 7, 2016

(fixed now)

@lfdversluis (Contributor Author)

Thanks man!

@lfdversluis (Contributor Author)

Time to start activating the Deferred lock on one of the functions!

```diff
@@ -237,7 +199,7 @@ def requests(self, node_count, expected_responses, *pairs):
other.send_identity(node)

messages = [other.create_sequence_text("Sequence message #%d" % i, i + 10, i) for i in range(1, 11)]
yield other.store(messages)
```
Contributor Author

This can still be a yield, I guess.

@lfdversluis (Contributor Author) commented Jul 21, 2016

Allchannel succeeded.

For those interested: some very interesting observations from the last two weeks of debugging.

I have managed to find plenty of bugs thanks to the allchannel experiment. It turns out that once you create and close a connection per database transaction (any execute, insert, etc.), you are killing your performance. Currently we only flush once per minute, or when we send messages that are our own. This saves a LOT of IO; if you generate a graph of this behaviour, you will see notable differences.

After putting this commit-once-per-minute structure back, I got the following results (20 nodes, 1k clients, i.e. the real allchannel scenario): https://jenkins.tribler.org/job/pers/job/allchannel_laurens_v2/115/artifact/output/, an (almost) perfect match with https://jenkins.tribler.org/job/Tribler/job/Experiments/job/AllChannel+Channelcommunity_devel/975/artifact/output/

Some interesting and notable differences: the regular run shows clear peaks in the reads and writes, while in my case it is more of a curve, though still with peaks.

rchars (figures omitted: "current allchannel rchars" vs. "my work rchars")

The download, upload, and more are very similar, however. For example, the upload (figures omitted: "current allchannel send" vs. "my send").

@lfdversluis (Contributor Author)

So in conclusion: batching your SQL statements and flushing once in a while really gives a LOT of performance. This will be an interesting problem to tackle once IO is moved to the threadpool, because there you may not be able to batch as easily; perhaps have a dedicated thread for the IO that can batch and block, so that the main Twisted thread can continue.
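The batch-and-flush approach described above can be sketched with plain sqlite3; the class name, the 60-second interval, and the method names are illustrative, not the PR's actual code:

```python
import sqlite3
import time

class BatchingDB:
    """Sketch: let statements pile up, commit only periodically."""

    FLUSH_INTERVAL = 60.0  # seconds; the PR flushes roughly once per minute

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._last_flush = time.monotonic()

    def execute(self, sql, params=()):
        # The statement runs inside the open transaction but is not
        # committed (flushed to disk) individually.
        self._conn.execute(sql, params)
        self._maybe_flush()

    def _maybe_flush(self):
        if time.monotonic() - self._last_flush >= self.FLUSH_INTERVAL:
            self.flush()

    def flush(self):
        # One commit covers the whole batch, saving per-statement IO.
        self._conn.commit()
        self._last_flush = time.monotonic()
```

Flushing when a node sends one of its own messages, as the comment describes, would simply be another call site for the same `flush()`.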

Next up: gumby refactoring and Tribler.

@lfdversluis (Contributor Author) commented Jul 21, 2016

Also, it looks like my version is able to do (slightly) more IO than the current one, while yielding the same results. I also noticed that at the start of the experiment quite a few messages are dropped, yet there are no dropped messages after that (like the current allchannel, but with more drops). @whirm thoughts?

@synctext (Member)

sooooo... you reduced the number of commits by batching stuff up?

Dramatic changes in the graphs.

@synctext (Member)

A lot more IO in your version. Do you commit after creating each message?

@lfdversluis (Contributor Author) commented Jul 21, 2016

@synctext Yes, I am now batching them (or rather, SQLite is), and once every minute, or when you send messages created by yourself (so your global time advances), the changes are flushed to the actual disk.

In all my previous attempts I would commit after every database change; this generated a lot more IO and was killing the performance. Basically, the owner of the channel in the allchannel experiment would go flat.

@devos50 (Contributor) commented Jul 22, 2016

@lfdversluis good job!

Indeed, most graphs look very similar. I was expecting the spiked pattern for the wchar graph, but instead it is a line; so you have IO operations ongoing all the time?

Regarding the dropped messages: they peak around 100 seconds. If you take a look at https://jenkins.tribler.org/job/pers/job/allchannel_laurens_v2/116/artifact/output//wchars.png, you see that the IO also peaks at around 100 seconds, so I think you run into the same problem again: too much IO, causing dropped messages. I think this peak occurs the moment clients start to discover each other and send messages. Can you somehow investigate/reduce this peak?

@synctext (Member)

Try disabling all commits and see what happens.

@devos50 (Contributor) commented Mar 12, 2017

Closing in favour of #530

@devos50 closed this on Mar 12, 2017