Shard Redis. #539

Merged
merged 52 commits into ray-project:master from redis-shards on May 19, 2017
Conversation

@stephanie-wang
Contributor

Continued from #509.

@AmplabJenkins

Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/728/

@AmplabJenkins

Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/737/

@robertnishihara left a comment (Collaborator)

This looks great! I have a few comments/questions!

src/common/net.h Outdated
@@ -1,6 +1,9 @@
#ifndef NET_H
#define NET_H

#include <string>
#include <vector>
Collaborator

These includes are no longer necessary, right?

Collaborator

Or maybe they need to be included in a different file (like redis.cc)?

@@ -4,13 +4,18 @@
#include "common.h"
#include "event_loop.h"

#include <vector>
Collaborator

Do we need this include?

if ((status == REDIS_ERR) || db->context->err) {
LOG_REDIS_DEBUG(db->context,
"error in redis_object_table_subscribe_to_notifications");
redisAsyncContext *context = get_redis_context(db, object_ids[i]);
Collaborator

Why this change? This is turning a single redis command into potentially a million separate redis commands. Couldn't that be really expensive?

Collaborator

Oh, I realize now that you're doing this because the different requests need to go to different Redis shards. However, in the case where we are requesting 1 million objects, this may be too expensive.

Contributor

Yeah, it is possible to batch everything that goes into a single shard together. Let's do that as an optimization later.
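
A minimal sketch of what that per-shard batching could look like (assuming the DBHandle, ObjectID, and get_redis_context() names from this PR; send_notification_request is a hypothetical helper that would pack one shard's object IDs into a single notification-request command):

    #include <unordered_map>
    #include <vector>

    /* Group the requested object IDs by the Redis shard they hash to, then
     * issue one request per shard instead of one request per object. */
    void subscribe_in_batches(DBHandle *db,
                              const std::vector<ObjectID> &object_ids) {
      std::unordered_map<redisAsyncContext *, std::vector<ObjectID>> per_shard;
      for (const ObjectID &id : object_ids) {
        per_shard[get_redis_context(db, id)].push_back(id);
      }
      for (const auto &shard : per_shard) {
        /* shard.first is the shard's async context; shard.second holds every
         * object ID routed to that shard, sent as one batched command. */
        send_notification_request(shard.first, shard.second);
      }
    }

With something like this, subscribing to a million objects costs at most one round trip per shard rather than one per object.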

@@ -779,6 +903,15 @@ void redis_task_table_test_and_update_callback(redisAsyncContext *c,
redisReply *reply = (redisReply *) r;
/* Parse the task from the reply. */
Task *task = parse_and_construct_task_from_redis_reply(reply);
if (task == NULL) {
Collaborator

What does this change do? It seems like it changes the behavior of the system. Wouldn't it make sense to still call the callback and let the caller decide how to handle this case in the callback?

Also, if this is only possible with actor tasks, then we should assert that the task is an actor task.

Contributor Author

If the task entry is not found in the table, the callback is not called. Previously, it would crash on the next line since reply->str was NULL.

It should only happen for actor tasks at the moment, but the problem is that we can't tell whether it's an actor task or not without the task entry.
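
In other words, the guard being discussed amounts to something like this (a sketch; any cleanup performed before returning is not shown in the excerpt and is omitted here):

    if (task == NULL) {
      /* The task entry was not found in the table (currently this can only
       * happen for actor tasks). Return without invoking the user callback,
       * rather than reading reply->str, which is NULL here and previously
       * caused a crash. */
      return;
    }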

/* The IP address of the node that this global scheduler is running on. */
char *node_ip_address = NULL;
int c;
- while ((c = getopt(argc, argv, "h:r:")) != -1) {
+ while ((c = getopt(argc, argv, "h:r:t:")) != -1) {
Collaborator

The -t is unused, right?

@@ -1193,13 +1192,13 @@ int main(int argc, char *argv[]) {
char *num_workers_str = NULL;
int c;
bool global_scheduler_exists = true;
- while ((c = getopt(argc, argv, "s:r:p:m:ga:h:c:w:n:")) != -1) {
+ while ((c = getopt(argc, argv, "s:r:t:p:m:ga:h:c:w:n:")) != -1) {
Collaborator

The t is unused.

int c;
- while ((c = getopt(argc, argv, "s:m:h:p:r:")) != -1) {
+ while ((c = getopt(argc, argv, "s:m:h:p:r:t:")) != -1) {
Collaborator

The t is unused.

@@ -1538,7 +1538,8 @@ def testGlobalStateAPI(self):
task_table = ray.global_state.task_table()
Collaborator

Let's add another test in runtest.py that tests calling ray.init with num_redis_shards=10 or 20 or some large number and then runs some remote functions and gets the results. I don't think we're testing the changes to the ray.init code path anywhere.

@robertnishihara
Collaborator

I'm looking into the jenkins failure now.

@robertnishihara
Collaborator

Ok, I observed the following crash when running the many_driver_test jenkins test.

STDOUT

/opt/conda/lib/python2.7/site-packages/ray-0.0.1-py2.7.egg/ray/local_scheduler/../core/src/local_scheduler/local_scheduler[0x40f879]
/opt/conda/lib/python2.7/site-packages/ray-0.0.1-py2.7.egg/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(_Z41redis_task_table_test_and_update_callbackP17redisAsyncContextPvS1_+0x88)[0x42b658]
/opt/conda/lib/python2.7/site-packages/ray-0.0.1-py2.7.egg/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(redisProcessCallbacks+0x96)[0x44d9c6]
/opt/conda/lib/python2.7/site-packages/ray-0.0.1-py2.7.egg/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(aeProcessEvents+0x130)[0x434c20]
/opt/conda/lib/python2.7/site-packages/ray-0.0.1-py2.7.egg/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(aeMain+0x2b)[0x43503b]
/opt/conda/lib/python2.7/site-packages/ray-0.0.1-py2.7.egg/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(main+0x2f4)[0x40e924]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f783c860830]
/opt/conda/lib/python2.7/site-packages/ray-0.0.1-py2.7.egg/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(_start+0x29)[0x40ec49]

STDERR

[FATAL] (/ray/src/local_scheduler/local_scheduler.cc:565: errno: Operation now in progress) Check failure: ActorID_equal(TaskSpec_actor_id(spec), NIL_ID)

@AmplabJenkins

Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/745/

@robertnishihara
Collaborator

Also, the most recent Jenkins run failed with https://amplab.cs.berkeley.edu/jenkins/job/Ray-PRB/745/console. I've never seen this error on Jenkins before, and the Jenkins tests don't seem to be flaky, so it's possible there is a real problem.

@AmplabJenkins

Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/773/

@AmplabJenkins

Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/774/

@AmplabJenkins

Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/789/

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/793/

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/794/

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/796/

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/798/

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/799/

@robertnishihara
Collaborator

This PR will break the web UI. That's completely fine because we need to rewrite the web UI anyway.

/* This reply will be received by redis_task_table_update_callback or
* redis_task_table_add_task_callback in redis.cc, which will then reissue
* the command. */
RedisModule_ReplyWithError(ctx, "No subscribers received message.");
Collaborator

This is a bug; a return statement was forgotten here.
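
For reference, the fix amounts to returning immediately after the error reply so the module command does not keep executing (a sketch; only the comment and the reply string come from the excerpt above, and the return value follows the usual Redis modules convention of returning REDISMODULE_OK once a reply has been sent):

    /* This reply will be received by redis_task_table_update_callback or
     * redis_task_table_add_task_callback in redis.cc, which will then reissue
     * the command. */
    RedisModule_ReplyWithError(ctx, "No subscribers received message.");
    return REDISMODULE_OK; /* the forgotten return statement */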

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/824/

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/825/

@shaneknapp
Contributor

test this please

@AmplabJenkins

Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/826/

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/828/

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/829/

@AmplabJenkins

Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/830/

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/831/

@AmplabJenkins

Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/833/

@robertnishihara changed the title from "Use multiple Redis shards to reduce load." to "Shard Redis." on May 19, 2017
@pcmoritz merged commit ee08c82 into ray-project:master on May 19, 2017
@pcmoritz deleted the redis-shards branch on May 19, 2017 at 00:40