-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shard Redis. #539
Shard Redis. #539
Conversation
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build finished. Test FAILed. |
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I have a few comments/questions!
src/common/net.h
Outdated
@@ -1,6 +1,9 @@ | |||
#ifndef NET_H | |||
#define NET_H | |||
|
|||
#include <string> | |||
#include <vector> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These includes are no longer necessary, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe they need to be included in a different file (like redis.cc
)?
src/common/state/db.h
Outdated
@@ -4,13 +4,18 @@ | |||
#include "common.h" | |||
#include "event_loop.h" | |||
|
|||
#include <vector> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this include?
if ((status == REDIS_ERR) || db->context->err) { | ||
LOG_REDIS_DEBUG(db->context, | ||
"error in redis_object_table_subscribe_to_notifications"); | ||
redisAsyncContext *context = get_redis_context(db, object_ids[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? This is turning a single redis command into potentially a million separate redis commands. Couldn't that be really expensive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I realize now that you're doing this because the different requests need to go to different redis shards. However, in the case where we are requesting 1 million objects, this may be too expensive..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it is possible to batch everything that goes into a single shard together. Let's do that as an optimization later.
@@ -779,6 +903,15 @@ void redis_task_table_test_and_update_callback(redisAsyncContext *c, | |||
redisReply *reply = (redisReply *) r; | |||
/* Parse the task from the reply. */ | |||
Task *task = parse_and_construct_task_from_redis_reply(reply); | |||
if (task == NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this change do, it seems like it changes the behavior of the system? Wouldn't it make sense to still call the callback and let the caller decide how to handle this case in the callback?
Also, if this is only possible with actor tasks, then we should assert that the task is an actor task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the task entry is not found in the table, the callback is not called. Previously, it would crash on the next line since reply->str
was NULL
.
It should only happen for actor tasks at the moment, but the problem is that we can't tell whether it's an actor task or not without the task entry.
/* The IP address of the node that this global scheduler is running on. */ | ||
char *node_ip_address = NULL; | ||
int c; | ||
while ((c = getopt(argc, argv, "h:r:")) != -1) { | ||
while ((c = getopt(argc, argv, "h:r:t:")) != -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The -t
is unused, right?
@@ -1193,13 +1192,13 @@ int main(int argc, char *argv[]) { | |||
char *num_workers_str = NULL; | |||
int c; | |||
bool global_scheduler_exists = true; | |||
while ((c = getopt(argc, argv, "s:r:p:m:ga:h:c:w:n:")) != -1) { | |||
while ((c = getopt(argc, argv, "s:r:t:p:m:ga:h:c:w:n:")) != -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The t
is unused.
src/plasma/plasma_manager.cc
Outdated
int c; | ||
while ((c = getopt(argc, argv, "s:m:h:p:r:")) != -1) { | ||
while ((c = getopt(argc, argv, "s:m:h:p:r:t:")) != -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The t
is unused.
@@ -1538,7 +1538,8 @@ def testGlobalStateAPI(self): | |||
task_table = ray.global_state.task_table() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add another test in runtest.py
that tests calling ray.init
with num_redis_shards=10
or 20
or some large number and then runs some remote functions and gets the results. I don't think we're testing the changes to the ray.init
code path anywhere.
I'm looking into the jenkins failure now. |
Ok, I observed the following crash when running the STDOUT
STDERR
|
Merged build finished. Test FAILed. |
Test FAILed. |
Also, the most recent jenkins run failed with https://amplab.cs.berkeley.edu/jenkins/job/Ray-PRB/745/console. I've never seen this error before on jenkins and the jenkins don't seem to be flaky, so it's possible there is a problem. |
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build finished. Test FAILed. |
Test FAILed. |
a55bc69
to
4c0a2ce
Compare
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
This PR will break the web UI. That's completely fine because we need to rewrite the web UI anyway. |
/* This reply will be received by redis_task_table_update_callback or | ||
* redis_task_table_add_task_callback in redis.cc, which will then reissue | ||
* the command. */ | ||
RedisModule_ReplyWithError(ctx, "No subscribers received message."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bug, forgot a return statement here.
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
test this please |
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Merged build finished. Test PASSed. |
Test PASSed. |
Continued from #509.