Fix a few flaky tests #9709

Merged
merged 5 commits into from
Jul 26, 2020

Conversation

robertnishihara
Collaborator

No description provided.

        return ray.worker.global_worker.node.unique_id

    @ray.remote(resources={"CustomResource": 1})
    def h():
        ray.get([f.remote() for _ in range(5)])
        return ray.worker.global_worker.node.unique_id

    # The f tasks should be scheduled on both raylets.
    assert len(set(ray.get([f.remote() for _ in range(500)]))) == 2
Collaborator Author

This condition is not necessarily true, so removing it.

@@ -416,24 +414,6 @@ def unique_name_3():
"'ray stack'")


def test_pandas_parquet_serialization():
Collaborator Author

This test seems unnecessary.


# Delete the resource
ray.get(delete_res.remote(res_name, target_node_id))

wait_for_condition(
Collaborator Author

Haven't fully verified it yet, but I suspect the issue was that we need to wait for the update to take effect. That is, delete_res happens asynchronously. Is that correct?
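To illustrate the suspected race, here is a minimal sketch of polling for an asynchronous update instead of asserting right away. Everything in it is a hypothetical stand-in: `poll_until` is not Ray's `wait_for_condition`, and `resources` and `delete_resource_async` only simulate a cluster whose state changes some time after the delete call returns.

```python
import threading
import time


def poll_until(predicate, timeout_s=5.0, interval_s=0.05):
    """Call `predicate` repeatedly until it returns True.

    Raises TimeoutError if the condition is still false after `timeout_s`.
    (Hypothetical helper for illustration, not Ray's implementation.)
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1fs" % timeout_s)
        time.sleep(interval_s)


# Stand-in for cluster state that is updated by a background thread.
resources = {"res1": 1.0}


def delete_resource_async(name, delay_s=0.2):
    """Simulate a delete that takes effect only after `delay_s` seconds."""
    timer = threading.Timer(delay_s, resources.pop, args=(name, None))
    timer.start()
    return timer


delete_resource_async("res1")
# Checking `"res1" not in resources` immediately would race with the timer,
# so the check is retried until the update becomes visible.
poll_until(lambda: "res1" not in resources)
```

Raising on timeout, rather than returning quietly, means a condition that never becomes true fails the test at the wait itself.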

     outputs = subprocess.check_output(
         [sys.executable, __file__, "_ray_instance"],
         stderr=subprocess.STDOUT).decode()
     lines = outputs.split("\n")
-    assert len(lines) == 3
+    assert len(lines) == 3, lines
Collaborator Author

Didn't actually fix this test, but should get enough information to debug it from this assert now.
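A small sketch of why the extra operand helps: the second expression of an `assert` statement becomes the argument of the raised `AssertionError`, so a failure reports the offending value rather than only the failed comparison. The `check_line_count` helper and its sample input are hypothetical, not taken from test_output.py.

```python
def check_line_count(output, expected=3):
    """Split `output` on newlines and assert on the number of lines.

    The list itself is passed as the assertion message, so a failure
    shows exactly which lines were produced.
    """
    lines = output.split("\n")
    assert len(lines) == expected, lines
    return lines


try:
    check_line_count("a\nb")  # only two lines, so the assertion fails
except AssertionError as exc:
    # The message operand is available as the exception's first argument.
    failure_detail = exc.args[0]
```

Here `failure_detail` holds the list of lines that was actually seen, which is the information needed to debug an intermittent failure from CI logs alone.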

@AmplabJenkins

Can one of the admins verify this patch?

@robertnishihara robertnishihara merged commit a8efb21 into ray-project:master Jul 26, 2020
@robertnishihara robertnishihara deleted the tests branch July 26, 2020 00:11
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28939/

Edilmo added a commit to BonsaiAI/ray that referenced this pull request Aug 20, 2020
* [Core] Enhance common client connection (ray-project#9367)

* enhance client connection

* add write buffer async

* read message

* add test

* Bazel move more shell to native rules (ray-project#9314)

Co-authored-by: Mehrdad <[email protected]>

* [tune] Fix github readme (ray-project#9365)

Co-authored-by: Amog Kamsetty <[email protected]>

* Combine different severities into the same log files (ray-project#9230)

* Combine different severities into the same log files

Co-authored-by: Mehrdad <[email protected]>

* [core] Pass owner address from the workers to the raylet (ray-project#9299)

* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Add ObjectRefs to task dependency manager, pass from task spec args

* tmp

* tmp

* Fix

* Add ownership info for task arguments

* Convert WaitForDirectActorCallArgs

* lint

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (ray-project#9063)"

This reverts commit 275da2e.

* Fix free

* fix tests

* Fix tests

* build

* build

* fix

* Change assertion to warning to fix java

* [Core] Add placement group scheduler and some api in resource scheduler (ray-project#9039)

* Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (ray-project#8984).

* change the bundle id and delete unit count in bundle

change vector<bundle_spec> to vector<shared_ptr<bundle_spec>>

Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (ray-project#8984).

change the bundle id and delete unit count in bundle

remove CheckIfSchedulable()

add comments and fix the bug in resource

* fix placement group schedule

* add placement group scheduler and change some api in resource scheduler

* fix by the comments

* fix conflict

* fix lint

* fix lint

* fix bug in merge

* fix lint

Co-authored-by: Lingxuan Zuo <[email protected]>

* [Core] New scheduler fixes (ray-project#9186)

* .

* test_args passes

* .

* test_basic.py::test_many_fractional_resources causes ray to hang

* test_basic.py::test_many_fractional_resources causes ray to hang

* .

* .

* useful

* test_many_fractional_resources fails instead of hanging now :)

* Passes test_fractional_resources

* .

* .

* Some cleanup

* git is hard

* cleanup

* Fixed scheduling tests

* .

* .

* [Core] put small objects in memory store (ray-project#8972)

* remove the put in memory store

* put small objects directly in memory store

* cast data type

* fix another place that uses Put to spill to plasma store

* fix multiple tests related to memory limits

* partially fix test_metrics

* remove not functioning codes

* fix core_worker_test

* refactor put to plasma codes

* add a flag for the new feature

* add flag to more places

* do a warmup round for the plasma store

* lint

* lint again

* fix warmup store

* Update _raylet.pyx

Co-authored-by: Eric Liang <[email protected]>

* [autoscaler] Move command runners into separate file and clean up interface. (ray-project#9340)

* cleanup

* wip

* fix imports

* fix lint

* [docs][rllib] Recommended workflow for training, saving, and testing (ray-project#9319)

* [autoscaler] Allow users to disable the cluster config cache (ray-project#8117)

* [autoscaler] Remove autoscaler config cache.

* [autoscaler] Add flag allowing users to explicitly disable the config cache.

* Update hiredis and remove Windows patches (ray-project#9289)

Co-authored-by: Mehrdad <[email protected]>

* Fix flaky test_dynres.py (ray-project#9310)

* Fix gcs_table_storage testcase bug (ray-project#9393)

Co-authored-by: 灵洵 <[email protected]>

* [HOTFIX] Fix compile direct_actor_transport_test on mac (ray-project#9403)

* Change Python's `ObjectID` to `ObjectRef` (ray-project#9353)

* [Java] Improve JNI performance when submitting and executing tasks (ray-project#9032)

* Remove the RAY_CHECK in Worker::Port() (ray-project#9348)

* [RLlib] Issue ray-project#9366 (DQN w/o dueling produces invalid actions). (ray-project#9386)

* Fix macos compilation bug (ray-project#9391)

* Fix.

* [Core] Plasma RAII support (ray-project#9370)

* [Serve] Merge router with HTTPProxy (ray-project#9225)

* Pass run args to DockerCommandRunner (ray-project#9411)

* Fix copy to workspace (ray-project#9400)

* [RLlib] Tf2.x native. (ray-project#8752)

* Update conda and ray wheel on GCP images (ray-project#9388)

* [Core] Simplify Raylet Client (ray-project#9420)

* Masking error. With t*valid_mask, we get the error np.inf*0 = np.inf (ray-project#9407)

* [RLLib] WindowStat bug fix (ray-project#9213)

* WindowStat error catching, which processes NaNs properly instead of erroring. This ought to resolve issue ray-project#7910.
ray-project#7910

* [tune] handling nan values (ray-project#9381)

* TRAVIS_PULL_REQUEST is false for non-PRs, not empty (ray-project#9439)

Co-authored-by: Mehrdad <[email protected]>

* [GCS] Fix the bug about raylet receiving duplicate actor creation tasks (ray-project#9422)

* [Tune] Trainable documentation fix (ray-project#9448)

* Allow --lru-evict to be passed into `ray start` (ray-project#8959)

* GCP authentication using oauth tokens (ray-project#9279)

* Bazel selects compiler flags based on compiler (ray-project#9313)



Co-authored-by: Mehrdad <[email protected]>

* [Core] Build raylet client as an independent component (ray-project#9434)

* [tune] sklearn comment out (ray-project#9454)

* Add ability to specify SOCKS proxy for SSH connections (ray-project#8833)

* [docs] Render ActorPool documentation, etc (ray-project#9433)

* [tune] Put examples under proper version control (ray-project#9427)

Co-authored-by: krfricke <[email protected]>

* Fix test-multi-node (ray-project#9453)

* Machine View Sorting / Grouping (ray-project#9214)

* Convert NodeInfo.tsx to a functional component

* Update NodeRowGroup to be a functional component

* lint

* Convert TotalRow to functional component.

* lint

* move node info over to using the sortable table head component. spacing is still a little wonky.

* Factor a NoewWorkerRow class out of NodeRowGroup that will be usable when grouping / ungrouping

* Compilation checkpoint, I factored the worker filtering logic out of node info into the reducer

* Add sort accessors for CPU

* Add sort accessors for Disk

* Add sort accessors for RAM

* add a table sort util for function based accessors (rather than flat attribute-based accessor)

* wip refactor node info features

* wip

* Rendering Checkpoint. I've refactored the features and how they are called to add sorting support. Also reworks the way error counts and log counts are passed to the front-end to remove some ugly logic

* wip

* wip

* wip

* Finish adding sorting and grouping of machine view

* lint

* fix bug in filtration of logs and errors by worker from recent refactor.

* Add export of Cluster Disk feature

* fix some merge issues

Co-authored-by: Max Fitton <[email protected]>

* [RLlib] Layout of Trajectory View API (new class: Trajectory; not used yet). (ray-project#9269)

* [RLlib] Issue 9402 MARWIL producing nan rewards. (ray-project#9429)

* Fix gcs_pubsub_test bug(ray-project#9438)

Co-authored-by: 灵洵 <[email protected]>

* change error code name of boost timer (ray-project#9417)

* [tune] PyTorch CIFAR10 example (ray-project#9338)

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>

* Remove legacy C++ code (ray-project#9459)

* Fix ObjectRef and ActorHandle serialization (ray-project#9462)

* [Stats] metrics agent exporter (ray-project#9361)

* [Core] Support GCS server port assignment. (ray-project#8962)

* Add scripts symlink back (ray-project#9219) (ray-project#9475)

(cherry picked from commit 77933c9)

Co-authored-by: Simon Mo <[email protected]>

* [tune] Issue 8821: ExperimentAnalysis doesn't expand user (ray-project#9461)

* [docker] Include base-deps image in rayproject Docker Hub (ray-project#9458)

* [Core] remove create_and_seal and create_and_seal_batch (ray-project#9457)

* Speedups for GitHub Actions (ray-project#9343)

Co-authored-by: Mehrdad <[email protected]>

* Fix flaky test_object_manager.py (ray-project#9472)

* [Java] fix redis-server binary path (ray-project#9398)

* [core] Handle out-of-order actor table notifications (ray-project#9449)

* Drop stale actor table notifications

* build

* Add num_restarts to disconnect handler

* Unit test and increment num_restarts on ALIVE, not RESTARTING

* Wait for pid to exit

* Fix name clash on Windows (ray-project#9412)

Co-authored-by: Mehrdad <[email protected]>

* Add job configs to gcs (ray-project#9374)

* Make pip install verbose (ray-project#9496)

Co-authored-by: Mehrdad <[email protected]>

* Make more tests compatible with Windows (ray-project#9303)

* [tune] extend PTL template (GPU, typing fixes, tensorboard) (ray-project#9451)

Co-authored-by: Kai Fricke <[email protected]>

* [core] Replace task resubmission in raylet with ownership protocol (ray-project#9394)

* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Add ObjectRefs to task dependency manager, pass from task spec args

* tmp

* tmp

* Fix

* Add ownership info for task arguments

* Convert WaitForDirectActorCallArgs

* lint

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (ray-project#9063)"

This reverts commit 275da2e.

* Fix free

* Regression tests - shorten timeouts in reconstruction unit tests

* Remove timeout for non-actor tasks

* Modify tests using ray.internal.free

* Clean up future resolution code

* Raylet polls the owner

* todo

* comment

* Update src/ray/core_worker/core_worker.cc

Co-authored-by: Edward Oakes <[email protected]>

* Drop stale actor table notifications

* Fix bug where actor restart hangs

* Revert buggy code for duplicate tasks

* build

* Fix errors for lru_evict and internal.free

* Revert "Drop stale actor table notifications"

This reverts commit 193c5d2.

* Revert "build"

This reverts commit 5644edb.

* Fix free test

* Fixes for freed objects

Co-authored-by: Edward Oakes <[email protected]>

* release gil in global state accessor (ray-project#9357)

* [Java] Named java actor (ray-project#9037)

* Fix clang-cl build (ray-project#9494)

Co-authored-by: Mehrdad <[email protected]>

* [GCS Actor Management] Gcs actor management broken detached actor (ray-project#9473)

* [RLlib] Issue ray-project#9437 (PyTorch converts to CPU tensor, even if on GPU). (ray-project#9497)

* Get rid of build shell scripts and move them to Python (ray-project#6082)

* Fix broken test_raylet_info_endpoint (ray-project#9511)

* Fix. (ray-project#9464)

* [Autoscaler] Making bootstrap config part of the node provider interface (ray-project#9443)

* supporting custom bootstrap config for external node providers

* bootstrap config

* renamed config to cluster_config

* lint

* remove 2 args from importer

* complete move of bootstrap to node_provider

* renamed provider_cls

* move imports outside functions

* lint

* Update python/ray/autoscaler/node_provider.py

Co-authored-by: Eric Liang <[email protected]>

* final fixes

* keeping lines to reduce diff

* lint

* lamba config

* filling in -> adding for lint

Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Eric Liang <[email protected]>

* Fix flaky test_actor_failures::test_actor_restart (ray-project#9509)

* Fix flaky test

* os exit

* [rllib] MAML Transform (ray-project#9463)

* MAML Transform

* Moved Inner Adapt to Method in Execution Plan

* Cleanup Plasma Store (hash utilities) (ray-project#9524)

* [Serve] Improve buffering for simple cases (ray-project#9485)

* [Serve] Use pickle instead of cloudpickle (ray-project#9479)

* Fix pip and Bazel interaction messing up CI (ray-project#9506)

Co-authored-by: Mehrdad <[email protected]>

* [Core] Fix Java detached error (ray-project#9526)

* fix java createActor NPE bug (ray-project#9532)

* [RLlib] Issue 9218: PyTorch Policy places Model on GPU even with num_gpus=0 (ray-project#9516)

* [Stats] Fix metric exporter test (ray-project#9376)

* Hotfix Lint for Serve (ray-project#9535)

* Windows cleanup (ray-project#9508)

* Remove unneeded code for Windows

* Get rid of usleep()

* Make platform_shims includes non-transitive

Co-authored-by: Mehrdad <[email protected]>

* [RLlib] Issue 8384: QMIX doesn't learn anything. (ray-project#9527)

* Add placement group manager and some code in core_worker (ray-project#9120)

Co-authored-by: Lingxuan Zuo <[email protected]>

* [core] Add flag to enable object reconstruction during ray start (ray-project#9488)

* Add flag

* doc

* Fix tests

* Pipelining task submission to workers (ray-project#9363)

* first step of pipelining

* pipelining tests & default configs
- added pipelining unit tests in direct_task_transport_test.cc
- added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in flight to each worker
- consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_

* post-review revisions

* linting, following naming/style convention

* linting

* [New scheduler] Queueing refactor (ray-project#9491)

* .

* test_args passes

* .

* test_basic.py::test_many_fractional_resources causes ray to hang

* test_basic.py::test_many_fractional_resources causes ray to hang

* .

* .

* useful

* test_many_fractional_resources fails instead of hanging now :)

* Passes test_fractional_resources

* .

* .

* Some cleanup

* git is hard

* cleanup

* .

* .

* .

* .

* .

* .

* .

* cleanup

* address reviews

* address reviews

* more refactor

* :)

* travis pls

* .

* travis pls

* .

* [Serve] Add internal instruction for running benchmarks (ray-project#9531)

* MADDPG learning confirmation test. (ray-project#9538)

* Fix Bazel in Docker (ray-project#9530)

Co-authored-by: Mehrdad <[email protected]>

* Fix bug that `test_multi_node.py::test_multi_driver_logging` hangs when GCS actor management is turned on (ray-project#9539)

Co-authored-by: 灵洵 <[email protected]>

* [tune] Unflattened lookup for ProgressReporter (ray-project#9525)

Co-authored-by: Kai Fricke <[email protected]>

* Add plasma store benchmark for small objects (ray-project#9549)

* [Tune] Copy default_columns in new ProgressReporter instances (ray-project#9537)

* quickfix (ray-project#9552)

* [tune] pin tune-sklearn (ray-project#9498)

* [cli] ray memory: added redis_password (ray-project#9492)

* [GCS]Fix lease worker leak bug when gcs server restarts (ray-project#9315)

* add part code

* fix compile bug

* fix review comments

* fix review comments

* fix review comments

* fix review comments

* fix review comment

* fix ut bug

* fix lint error

* fix review comment

* fix review comments

* add testcase

* add testcase

* fix bug

* fix review comments

* fix review comment

* fix review comment

* refine comments

Co-authored-by: 灵洵 <[email protected]>
Co-authored-by: Hao Chen <[email protected]>

* [tune] fix pbt checkpoint_freq (ray-project#9517)

* Only delete old checkpoint if it is not the same as the new one

* Return early if old checkpoint value coincides with new checkpoint value

Co-authored-by: Kai Fricke <[email protected]>

* [Core] Remove socket pair exchange in Plasma Store (ray-project#9565)

* try use boost::asio for notification processing

* [Metric] new cython interface for python worker metric (ray-project#9469)

* Bazel fixes (ray-project#9519)

* GCS client add fetch operation before subscribe (ray-project#9564)

* [RLlib] Fix combination of lockstep and multiple agents controlled by the same policy. (ray-project#9521)

* Change aggregation when lockstep is activated.

Modification of MultiAgentBatch.timeslices to support the combination of lockstep and multiple agents controlled by the same policy.

fix ray-project#9295

* Line too long.

* [Core] Replace the Plasma eventloop with boost::asio (ray-project#9431)

* Fix Java named actor bug (ray-project#9580)

* Fix setup.py bug (ray-project#9581)

Co-authored-by: Mehrdad <[email protected]>

* [Serve] Serialize Query object directly (ray-project#9490)

* Add dashboard dependencies to default ray installation (ray-project#9447)

* Dashboard next-version API support in backend (ray-project#9345)

* Fix log losses (ray-project#9559)

* Close log on shutdown

* Disable log buffering

Co-authored-by: Mehrdad <[email protected]>

* [docker] run Ubuntu 20.04 as base image (ray-project#9556)

* Add PTL to README.rst (ray-project#9594)

Co-authored-by: Richard Liaw <[email protected]>

* Skip unneeded steps on CI (ray-project#9582)

Co-authored-by: Mehrdad <[email protected]>

* Fix Windows CI (ray-project#9588)

Co-authored-by: Mehrdad <[email protected]>

* [serve] Rename to `Controller` (ray-project#9566)

* Handle warnings in core (ray-project#9575)

* [New scheduler] Fix new scheduler bug (ray-project#9467)

* fix new scheduler bug

* add testcase for soft resource allocation

* modify RemoveNode

* Ensure unique log file names across same-node raylets. (ray-project#9561)

* fix tag key typo (ray-project#9606)

* Rename path variable due to zsh conflict (ray-project#9610)

* [doc] [minor] Make API docs easier to find. (ray-project#9604)

* Issue 9568: `rllib train` framework in config gets overridden with tf. (ray-project#9572)

* Use UTF-8 for encoding of python code for collision hashing (ray-project#9586)

Co-authored-by: Arne Sachtler <[email protected]>
Co-authored-by: simon-mo <[email protected]>

* Add bazel to the PATH in setup.py (ray-project#9590)

Co-authored-by: Mehrdad <[email protected]>

* Fix Lint in setup.py (ray-project#9618)

Co-authored-by: Mehrdad <[email protected]>

* Shellcheck comments (ray-project#9595)

* [Serve] Document Metric Infrastructure (ray-project#9389)

* [CI] Do not run jenkins test on GHA (ray-project#9621)

* Support ray task type checking (ray-project#9574)

* [Metrics] Java metric API (ray-project#9377)

* [GCS] fix the fault tolerance about gcs node manager (ray-project#9380)

* Shellcheck quoting (ray-project#9596)

* Fix SC2006: Use $(...) notation instead of legacy backticked `...`.

* Fix SC2016: Expressions don't expand in single quotes, use double quotes for that.

* Fix SC2046: Quote this to prevent word splitting.

* Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching.

* Fix SC2068: Double quote array expansions to avoid re-splitting elements.

* Fix SC2086: Double quote to prevent globbing and word splitting.

* Fix SC2102: Ranges can only match single chars (mentioned due to duplicates).

* Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"?

* Fix SC2145: Argument mixes string and array. Use * or separate argument.

* Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string).

Co-authored-by: Mehrdad <[email protected]>

* Fix bug in Bazel version check (ray-project#9626)

Co-authored-by: Mehrdad <[email protected]>

* [Java] Avoid data copy from C++ to Java for ByteBuffer type (ray-project#9033)

* Revert "Dashboard next-version API support in backend (ray-project#9345)" (ray-project#9639)

This reverts commit fca1fb1.

* [Autoscaler] Command Line Interface improvements (ray-project#9322)

Co-authored-by: Richard Liaw <[email protected]>

* [Core] GCS Actor management on by default. (ray-project#8845)

* GCS Actor management on by default.

* Fix travis config.

* Change condition.

* Remove unnecessary CI.

* [Core] Fix concurrency issues in plasma store runner (ray-project#9642)

* fix window jni unhappy compiler (ray-project#9635)

* Fix TestObjectTableResubscribe testcase bug (ray-project#9650)

* fix named actor single process mode bug (ray-project#9652)

* [core] Fix Ray service startup when logging redirection is disabled. (ray-project#9547)

* Fix TorchDeterministic (ray-project#9241)

* [RaySGD] revised existing transformer example to work with transformers>=3.0 (ray-project#9661)

Co-authored-by: Kai Fricke <[email protected]>

* [rllib] Fix torch TD error, IMPALA LR updates (ray-project#9477)

* update

* add test

* lint

* fix super call

* speed es test up

* Auto-cancel build when a new commit is pushed (ray-project#8043)

Co-authored-by: Mehrdad <[email protected]>

* Fix lint in remote-watch.py (ray-project#9668)

* [Core] Remove unnecessary windows syscall in plasma store (ray-project#9602)

* Remove unused windows shims (ray-project#9583)

* Temporarily disable remote watcher (ray-project#9669)

* Drop support for Python 3.5. (ray-project#9622)

* Drop support for Python 3.5.

* Update setup.py

* [Core] WorkerInterface refactor (ray-project#9655)

* .

* .

* refactor WorkerInterface

* .

* Basic unit test structure complete?

* .

* .

* .

* .

* Fixed tests

* Fixed tests

* .

* [core] Enable object reconstruction for retryable actor tasks (ray-project#9557)

* Test actor plasma reconstruction

* Allow resubmission of actor tasks

* doc

* Test for actor constructor

* Kill PID before removing node

* Kill pid before node

* fix java coreworker crash (ray-project#9674)

* use help proto-init-macro for streaming config (ray-project#9272)

* Update release information from 0.8.6. (ray-project#9124)

* [BRING BACK TO MASTER] Update release information.

* [MERGE TO MASTER] Add microbenchmark result.

* Update asan tests to the doc.

* Refinements to the Serve documentation (ray-project#9587)

Co-authored-by: Dean Wampler <[email protected]>

* [tune] survey (ray-project#9670)

* Fix ERROR logging not being printed to standard error (ray-project#9633)

Co-authored-by: Mehrdad <[email protected]>

* [Tune Docs] Logging doc fix (ray-project#9691)

* [rllib] Type annotations for model classes (ray-project#9646)

* [Serve] Allow multiple HTTP servers. (ray-project#9523)

* Issue 9631: Tf1.14 does not have tf.config.list_physical_devices. (ray-project#9681)

* [Serve] Fix Formatting, stale docs (ray-project#9617)

* fixed simplex initialisation seeding bug (ray-project#9660)

Co-authored-by: Petros Christodoulou <[email protected]>

* Switch from GitHub checkout@v2 to checkout@v1 due to bugs in checkout (ray-project#9697)

Co-authored-by: Mehrdad <[email protected]>

* Add Ray Serve to README.rst (ray-project#9688)

* Shellcheck rewrites (ray-project#9597)

* Fix SC2001: See if you can use ${variable//search/replace} instead.

* Fix SC2010: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames.

* Fix SC2012: Use find instead of ls to better handle non-alphanumeric filenames.

* Fix SC2015: Note that A && B || C is not if-then-else. C may run when A is true.

* Fix SC2028: echo may not expand escape sequences. Use printf.

* Fix SC2034: variable appears unused. Verify use (or export if used externally).

* Fix SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options.

* Fix SC2071: > is for string comparisons. Use -gt instead.

* Fix SC2154: variable is referenced but not assigned

* Fix SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

* Fix SC2188: This redirection doesn't have a command. Move to its command (or use 'true' as no-op).

* Fix SC2236: Use -n instead of ! -z.

* Fix SC2242: Can only exit with status 0-255. Other data should be written to stdout/stderr.

* Fix SC2086: Double quote to prevent globbing and word splitting.

Co-authored-by: Mehrdad <[email protected]>

* [Autoscaler] CLI Logger docs (ray-project#9690)

Co-authored-by: Richard Liaw <[email protected]>

* Update rllib-algorithms.rst (ray-project#9640)

* [tune] move jenkins tests to travis (ray-project#9609)

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>

* [RLlib] Implement DQN PyTorch distributional head. (ray-project#9589)

* Add placement group java api (ray-project#9611)

* add part code

* add part code

* add part code

* fix code style

* fix review comment

* fix review comment

* add part code

* add part code

* add part code

* add part code

* fix review comment

* fix review comment

* fix code style

* fix review comment

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <[email protected]>

* [Stats] Improve Stats::Init & Add it to GCS server (ray-project#9563)

* [Core] Try remove all windows compat shims (ray-project#9671)

* try remove compat for arrow

* remove unistd.h

* remove socket compat

* delete arrow windows patch

* Fix a few flaky tests (ray-project#9709)

Fix test_custom_resources, Remove test_pandas_parquet_serialization, Better error message for test_output.py, Potentially fix test_dynres::test_dynamic_res_creation_scheduler_consistency

* [GCS]Open test_gcs_fault_tolerance testcase (ray-project#9677)

* enable test_gcs_fault_tolerance

* fix lint error

Co-authored-by: 灵洵 <[email protected]>

* [Tests]lock vector to avoid potential flaky test (ray-project#9656)

* [tune] distributed torch wrapper (ray-project#9550)

* changes

* add-working

* checkpoint

* ccleanu

* fix

* ok

* formatting

* ok

* tests

* some-good-stuff

* fix-torch

* ddp-torch

* torch-test

* sessions

* add-small-test

* fix

* remove

* gpu-working

* update-tests

* ok

* try-test

* formgat

* ok

* ok

* [GCS] Fix actor task hang when its owner exits before local dependencies resolved (ray-project#8045)

* Only update raylet map when autoscaler configured (ray-project#9435)

* [Dashboard] New dashboard skeleton (ray-project#9099)

* Fixing multiple building issues

* Make wait_for_condition raise exception when timing out. (ray-project#9710)

* [GCS]GCS client support multi-thread subscribe&resubscribe&unsubscribe (ray-project#9718)

* Package and upload ray cross-platform jar (ray-project#9540)

* Revert "Package and upload ray cross-platform jar (ray-project#9540)" (ray-project#9730)

This reverts commit 8810325.

* Only build docker wheels in LINUX_WHEELS env (ray-project#9729)

* Keep build-autoscaler-images.sh alive in CI (ray-project#9720)

* [core] Removes Error when Internal Config is not set (ray-project#9700)

* [Cluster Launcher] Re Org the cluster launcher pages. (ray-project#9687)

* [RLlib] Offline Type Annotations (ray-project#9676)

* Offline Annotations

* Modifications

* Fixed circular dependencies

* Linter fix

* Python api of placement group (ray-project#9243)

* Include open-ssh-client for transparency (ray-project#9693)

* Fix remote-watch.py (ray-project#9625)

Co-authored-by: Mehrdad <[email protected]>

* [docker] Uses Latest Conda & Py 3.7 (ray-project#9732)

* Fix broken actor failure tests. (ray-project#9737)

* [Stats] fix stats shutdown crash if opencensus exporter not initialized (ray-project#9727)

* Fix package and upload ray jar (ray-project#9742)

* Introduce file_mounts_sync_continuously cluster option (ray-project#9544)

* Separate out file_mounts contents hashing into its own separate hash

Add an option to continuously sync file_mounts from head node to worker nodes:
monitor.py will re-sync file mounts whenever contents change but will only run setup_commands if the config also changes

* add test and default value for file_mounts_sync_continuously

* format code

* Update comments

* Add param to skip setup commands when only file_mounts content changed during monitor.py's update tick

Fixed so setup commands run when ray up is run and file_mounts content changes

* Refactor so that runtime_hash retains previous behavior

runtime_hash is almost identical as before this PR. It is used to determine if setup_commands need to run
file_mounts_contents_hash is an additional hash of the file_mounts content that is used to detect when only file syncing has to occur.

Note: runtime_hash value will have changed from before the PR because we hash the hash of the contents of the file_mounts as a performance optimization

* fix issue with hashing a hash

* fix bug where trying to set contents hash when it wasn't generated

* Fix lint error

Fix bug in command_runner where check_output was no longer returning the output of the command

* clear out provider between tests to get rid of flakiness

* reduce chance of race condition from node_launcher launching a node in the middle of an autoscaler.update call

* [dist] swap mac/linux wheel build order (ray-project#9746)

* [RLlib] Enhance reward clipping test; add action_clipping tests. (ray-project#9684)

* [RLlib] Issue 9667 DDPG Torch bugs and enhancements. (ray-project#9680)

* [Metrics]Ray java worker metric registry (ray-project#9636)

* ray worker metrics gauge init

* ray java metric mapping

* add jni source files for gauge and tagkey

* mapping all metric classes to stats object

* check non-null for tags and name

* lint

* add symbol for native metric JNI

* extern c for symbol

* add tests for all metrics

* Update Metric.java

use metricNativePointer instead.

* unify metric native stuff to one class

* fix jni file

* add comments for metric transform function in jni utils

* move metric function to native metric file

* remove unused disconnect jni

* Add a metric registry for java metrics

* Restore install-bazel.sh

* Add some comments for metric registry

* Fix thread safe problem of metrics

* Fix metric tests and remove sleep code from tests

* Fix comments of metrics

Co-authored-by: lingxuan.zlx <[email protected]>

* fix windows compile bug (ray-project#9741)

Co-authored-by: 灵洵 <[email protected]>

* Run _with_interactive in Docker (ray-project#9747)

* [New scheduler] First unit test for task manager (ray-project#9696)

* .

* .

* refactor WorkerInterface

* .

* Basic unit test structure complete?

* .

* bad git >:-(

* small clean up

* CR

* .

* .

* One more fixture

* One more fixture

* .

* .

* bazel-format

* .

* [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (ray-project#9607)

* [Release] Fix release tests (ray-project#9733)

* Register function race (ray-project#9346)

* Revert "[dist] swap mac/linux wheel build order (ray-project#9746)" and "Fix package and upload ray jar (ray-project#9742)" (ray-project#9758)

* Revert "[dist] swap mac/linux wheel build order (ray-project#9746)"

This reverts commit a934056.

* Revert "Fix package and upload ray jar (ray-project#9742)"

This reverts commit c290c30.

* Fix some Windows CI issues (ray-project#9708)

Co-authored-by: Mehrdad <[email protected]>

* Pin pytest version (ray-project#9767)

* [Java] Use test groups to filter tests of different run modes (ray-project#9703)

* [Java] Fix MetricTest.java due to incomplete changes from ray-project#9703 (ray-project#9770)

* Fix leased worker leak bug if lease worker requests that are still waiting to be scheduled when GCS restarts (ray-project#9719)

* [Stats] enable core worker stats (ray-project#9355)

* [GCS]Use a separate thread in node failure detector to handle heartbeat (ray-project#9416)

* use a sole thread to handle heartbeat

* separate signal thread

* use work to avoid exiting when task is underway

* protect shared data structure to avoid deadlock

* add comments

* decrease io service num

* minor changes

* fix test

* per stephanie's comments

* use single io service instead of 1-size io service pool

* typo

* [GCS Actor Management] Fix flaky test_dead_actors. (ray-project#9715)

* Fix.

* Add logs.

* Add a unit test.

* [TUNE] Tune Docs re-organization (ray-project#9600)

Co-authored-by: Richard Liaw <[email protected]>

* [RLlib] Trajectory View API (preparatory cleanup and enhancements). (ray-project#9678)

* [Core] Socket creation race condition bug fixes (ray-project#9764)

* fix issues

* hot fixes

* test

* test

* Always info log

* Fixed stderr logging (9765)

* [Core] Custom socket name (ray-project#9766)

* fix issues

* hot fixes

* test

* test

* socket name change only

* Fix src/ray/core_worker/common.h deleted constructor (ray-project#9785)

Co-authored-by: Mehrdad <[email protected]>

* [Stats] Fix harvestor threads + Fix flaky stats shutdown. (ray-project#9745)

* More fixes

* Applying latest changes in travis.yml

* Fixing fixture data exclusions

* Disable some java tests

* Fix some CI errors

* Update hash

* Fixing more build issues

* Fixing more build issues

* Fix pipeline cache path

* More fixes

* Fix bazel test command

* Fix bazel test

* Fix general info steps

* Custom env var for docker build

* Trying a different way to install bazel

* Bazel fix

* Updating hash

Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]>
Co-authored-by: mehrdadn <[email protected]>
Co-authored-by: Mehrdad <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Alisa <[email protected]>
Co-authored-by: Lingxuan Zuo <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Stefan Schneider <[email protected]>
Co-authored-by: Patrick Ames <[email protected]>
Co-authored-by: Hao Chen <[email protected]>
Co-authored-by: fangfengbin <[email protected]>
Co-authored-by: 灵洵 <[email protected]>
Co-authored-by: Tao Wang <[email protected]>
Co-authored-by: Kai Yang <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Ian Rodney <[email protected]>
Co-authored-by: Henk Tillman <[email protected]>
Co-authored-by: Tanay Wakhare <[email protected]>
Co-authored-by: Nicolaus93 <[email protected]>
Co-authored-by: Vasily Litvinov <[email protected]>
Co-authored-by: krfricke <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: kisuke95 <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Michael Mui <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: chaokunyang <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Michael Luo <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: Tom <[email protected]>
Co-authored-by: jerrylee.io <[email protected]>
Co-authored-by: Raphael Avalos <[email protected]>
Co-authored-by: William Falcon <[email protected]>
Co-authored-by: Clark Zinzow <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
Co-authored-by: Arne Sachtler <[email protected]>
Co-authored-by: Arne Sachtler <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: ZhuSenlin <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Maksim Smolin <[email protected]>
Co-authored-by: Dean Wampler <[email protected]>
Co-authored-by: Dean Wampler <[email protected]>
Co-authored-by: Bill Chambers <[email protected]>
Co-authored-by: Petros Christodoulou <[email protected]>
Co-authored-by: Petros Christodoulou <[email protected]>
Co-authored-by: Justin Terry <[email protected]>
Co-authored-by: Tao Wang <[email protected]>
Co-authored-by: fyrestone <[email protected]>
Co-authored-by: Alan Guo <[email protected]>
Co-authored-by: bermaker <[email protected]>
Edilmo added a commit to BonsaiAI/ray that referenced this pull request May 14, 2021
* Set up CI with Azure Pipelines

Specifically, we are setting up a
Travis-like ADO pipeline that follows
what is already present in the .travis.yml
file in the root of the repo.

* Separating travis like pipeline from main pipeline

* Adding Jenkins jobs equivalent

* Making some improvements

* Adding validation of the upstream CI

* Disabling Tune and large memory tests

* Changing threshold for simple reservoir sampling test

* Addressing comments

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with more travis updates

* Updating CI with new cpp worker tests

* Setting code owners

* Fixing the version number generation

* Making main pipeline also our release pipeline

* Updating Azure Pipelines with travis updates

* Fixing wheels test

* Fixing codeowners

* Updating Azure Pipelines with travis updates

* Bumping up MACOSX_DEPLOYMENT_TARGET

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with travis updates

* Disabling Serve tests

* Making explicit which branches GitHubActions workflows should watch

* Disabling Ray serve tests

* Installing numpy explicitly

* consolidating Ray test steps in one yml

* Syncing with upstream master 2020-07-30 (#21)

* [Core] Enhance common client connection (#9367)

* enhance client connection

* add write buffer async

* read message

* add test

* Bazel move more shell to native rules (#9314)

Co-authored-by: Mehrdad <[email protected]>

* [tune] Fix github readme (#9365)

Co-authored-by: Amog Kamsetty <[email protected]>

* Combine different severities into the same log files (#9230)

* Combine different severities into the same log files

Co-authored-by: Mehrdad <[email protected]>

* [core] Pass owner address from the workers to the raylet (#9299)

* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Add ObjectRefs to task dependency manager, pass from task spec args

* tmp

* tmp

* Fix

* Add ownership info for task arguments

* Convert WaitForDirectActorCallArgs

* lint

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (#9063)"

This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1.

* Fix free

* fix tests

* Fix tests

* build

* build

* fix

* Change assertion to warning to fix java

* [Core] Add placement group scheduler and some api in resource scheduler (#9039)

* Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (#8984).

* change the bundle id and delete unit count in bundle

change vector<bundle_spec> to vector<shared_ptr<bundle_spec>>

Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (#8984).

change the bundle id and delete unit count in bundle

remove CheckIfSchedulable()

add comments and fix the bug in resource

* fix placement group schedule

* add placement group scheduler and change some api in resource scheduler

* fix by the comments

* fix conflict

* fix lint

* fix lint

* fix bug in merge

* fix lint

Co-authored-by: Lingxuan Zuo <[email protected]>

* [Core] New scheduler fixes (#9186)

* .

* test_args passes

* .

* test_basic.py::test_many_fractional_resources causes ray to hang

* test_basic.py::test_many_fractional_resources causes ray to hang

* .

* .

* useful

* test_many_fractional_resources fails instead of hanging now :)

* Passes test_fractional_resources

* .

* .

* Some cleanup

* git is hard

* cleanup

* Fixed scheduling tests

* .

* .

* [Core] put small objects in memory store (#8972)

* remove the put in memory store

* put small objects directly in memory store

* cast data type

* fix another place that uses Put to spill to plasma store

* fix multiple tests related to memory limits

* partially fix test_metrics

* remove not functioning codes

* fix core_worker_test

* refactor put to plasma codes

* add a flag for the new feature

* add flag to more places

* do a warmup round for the plasma store

* lint

* lint again

* fix warmup store

* Update _raylet.pyx

Co-authored-by: Eric Liang <[email protected]>

* [autoscaler] Move command runners into separate file and clean up interface. (#9340)

* cleanup

* wip

* fix imports

* fix lint

* [docs][rllib] Recommended workflow for training, saving, and testing (#9319)

* [autoscaler] Allow users to disable the cluster config cache (#8117)

* [autoscaler] Remove autoscaler config cache.

* [autoscaler] Add flag allowing users to explicitly disable the config cache.

* Update hiredis and remove Windows patches (#9289)

Co-authored-by: Mehrdad <[email protected]>

* Fix flaky test_dynres.py (#9310)

* Fix gcs_table_storage testcase bug (#9393)

Co-authored-by: 灵洵 <[email protected]>

* [HOTFIX] Fix compile direct_actor_transport_test on mac (#9403)

* Change Python's `ObjectID` to `ObjectRef` (#9353)

* [Java] Improve JNI performance when submitting and executing tasks (#9032)

* Remove the RAY_CHECK in Worker::Port() (#9348)

* [RLlib] Issue #9366 (DQN w/o dueling produces invalid actions). (#9386)

* Fix macos compilation bug (#9391)

* Fix.

* [Core] Plasma RAII support (#9370)

* [Serve] Merge router with HTTPProxy (#9225)

* Pass run args to DockerCommandRunner (#9411)

* Fix copy to workspace (#9400)

* [RLlib] Tf2.x native. (#8752)

* Update conda and ray wheel on GCP images (#9388)

* [Core] Simplify Raylet Client (#9420)

* Masking error. With t*valid_mask, we get the error np.inf*0 = np.nan (#9407)
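
The masking pitfall named in this entry can be reproduced with plain floats. Under IEEE 754 arithmetic, multiplying infinity by zero yields NaN rather than zero, so multiplying by a mask does not remove non-finite entries. The sketch below is only an illustration of that arithmetic; the helper names are hypothetical and it does not reproduce the RLlib code touched by the PR.

```python
import math


def masked_sum_by_multiplication(values, valid_mask):
    # inf * 0.0 evaluates to nan, which then poisons the whole sum.
    return sum(v * m for v, m in zip(values, valid_mask))


def masked_sum_by_selection(values, valid_mask):
    # Masked-out entries are skipped, so they never enter the arithmetic.
    return sum(v for v, m in zip(values, valid_mask) if m)


values = [1.0, math.inf, 2.0]
valid_mask = [1.0, 0.0, 1.0]

by_multiplication = masked_sum_by_multiplication(values, valid_mask)
by_selection = masked_sum_by_selection(values, valid_mask)
```

With these inputs, `by_multiplication` is NaN while `by_selection` is `3.0`.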

* [RLLib] WindowStat bug fix (#9213)

* WindowStat error catching, which processes NaNs properly instead of erroring. This ought to resolve issue #7910.
https://github.com/ray-project/ray/issues/7910

* [tune] handling nan values (#9381)

* TRAVIS_PULL_REQUEST is false for non-PRs, not empty (#9439)

Co-authored-by: Mehrdad <[email protected]>

* [GCS] Fix the bug about raylet receiving duplicate actor creation tasks (#9422)

* [Tune] Trainable documentation fix (#9448)

* Allow --lru-evict to be passed into `ray start` (#8959)

* GCP authentication using oauth tokens (#9279)

* Bazel selects compiler flags based on compiler (#9313)



Co-authored-by: Mehrdad <[email protected]>

* [Core] Build raylet client as an independent component (#9434)

* [tune] sklearn comment out (#9454)

* Add ability to specify SOCKS proxy for SSH connections (#8833)

* [docs] Render ActorPool documentation, etc (#9433)

* [tune] Put examples under proper version control (#9427)

Co-authored-by: krfricke <[email protected]>

* Fix test-multi-node (#9453)

* Machine View Sorting / Grouping (#9214)

* Convert NodeInfo.tsx to a functional component

* Update NodeRowGroup to be a functional component

* lint

* Convert TotalRow to functional component.

* lint

* move node info over to using the sortable table head component. spacing is still a little wonky.

* Factor a NoewWorkerRow class out of NodeRowGroup that will be usable when grouping / ungrouping

* Compilation checkpoint, I factored the worker filtering logic out of node info into the reducer

* Add sort accessors for CPU

* Add sort accessors for Disk

* Add sort accessors for RAM

* add a table sort util for function based accessors (rather than flat attribute-based accessor)

* wip refactor node info features

* wip

* Rendering Checkpoint. I've refactored the features and how they are called to add sorting support. Also reworks the way error counts and log counts are passed to the front-end to remove some ugly logic

* wip

* wip

* wip

* Finish adding sorting and grouping of machine view

* lint

* fix bug in filtration of logs and errors by worker from recent refactor.

* Add export of Cluster Disk feature

* fix some merge issues

Co-authored-by: Max Fitton <[email protected]>

* [RLlib] Layout of Trajectory View API (new class: Trajectory; not used yet). (#9269)

* [RLlib] Issue 9402 MARWIL producing nan rewards. (#9429)

* Fix gcs_pubsub_test bug (#9438)

Co-authored-by: 灵洵 <[email protected]>

* change error code name of boost timer (#9417)

* [tune] PyTorch CIFAR10 example (#9338)

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>

* Remove legacy C++ code (#9459)

* Fix ObjectRef and ActorHandle serialization (#9462)

* [Stats] metrics agent exporter (#9361)

* [Core] Support GCS server port assignment. (#8962)

* Add scripts symlink back (#9219) (#9475)

(cherry picked from commit 77933c922d5136c5c2e2f0ac2edb4da67111d690)

Co-authored-by: Simon Mo <[email protected]>

* [tune] Issue 8821: ExperimentAnalysis doesn't expand user (#9461)

* [docker] Include base-deps image in rayproject Docker Hub (#9458)

* [Core] remove create_and_seal and create_and_seal_batch (#9457)

* Speedups for GitHub Actions (#9343)

Co-authored-by: Mehrdad <[email protected]>

* Fix flaky test_object_manager.py (#9472)

* [Java] fix redis-server binary path (#9398)

* [core] Handle out-of-order actor table notifications (#9449)

* Drop stale actor table notifications

* build

* Add num_restarts to disconnect handler

* Unit test and increment num_restarts on ALIVE, not RESTARTING

* Wait for pid to exit

* Fix name clash on Windows (#9412)

Co-authored-by: Mehrdad <[email protected]>

* Add job configs to gcs (#9374)

* Make pip install verbose (#9496)

Co-authored-by: Mehrdad <[email protected]>

* Make more tests compatible with Windows (#9303)

* [tune] extend PTL template (GPU, typing fixes, tensorboard) (#9451)

Co-authored-by: Kai Fricke <[email protected]>

* [core] Replace task resubmission in raylet with ownership protocol (#9394)

* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Add ObjectRefs to task dependency manager, pass from task spec args

* tmp

* tmp

* Fix

* Add ownership info for task arguments

* Convert WaitForDirectActorCallArgs

* lint

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (#9063)"

This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1.

* Fix free

* Regression tests - shorten timeouts in reconstruction unit tests

* Remove timeout for non-actor tasks

* Modify tests using ray.internal.free

* Clean up future resolution code

* Raylet polls the owner

* todo

* comment

* Update src/ray/core_worker/core_worker.cc

Co-authored-by: Edward Oakes <[email protected]>

* Drop stale actor table notifications

* Fix bug where actor restart hangs

* Revert buggy code for duplicate tasks

* build

* Fix errors for lru_evict and internal.free

* Revert "Drop stale actor table notifications"

This reverts commit 193c5d20e5577befd43f166e16c972e2f9247c91.

* Revert "build"

This reverts commit 5644edbac906ff6ef98feb40b6f62c9e63698c29.

* Fix free test

* Fixes for freed objects

Co-authored-by: Edward Oakes <[email protected]>

* release gil in global state accessor (#9357)

* [Java] Named java actor (#9037)

* Fix clang-cl build (#9494)

Co-authored-by: Mehrdad <[email protected]>

* [GCS Actor Management] Gcs actor management broken detached actor (#9473)

* [RLlib] Issue #9437 (PyTorch converts to CPU tensor, even if on GPU). (#9497)

* Get rid of build shell scripts and move them to Python (#6082)

* Fix broken test_raylet_info_endpoint (#9511)

* Fix. (#9464)

* [Autoscaler] Making bootstrap config part of the node provider interface (#9443)

* supporting custom bootstrap config for external node providers

* bootstrap config

* renamed config to cluster_config

* lint

* remove 2 args from importer

* complete move of bootstrap to node_provider

* renamed provider_cls

* move imports outside functions

* lint

* Update python/ray/autoscaler/node_provider.py

Co-authored-by: Eric Liang <[email protected]>

* final fixes

* keeping lines to reduce diff

* lint

* lambda config

* filling in -> adding for lint

Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Eric Liang <[email protected]>

* Fix flaky test_actor_failures::test_actor_restart (#9509)

* Fix flaky test

* os exit

* [rllib] MAML Transform (#9463)

* MAML Transform

* Moved Inner Adapt to Method in Execution Plan

* Cleanup Plasma Store (hash utilities) (#9524)

* [Serve] Improve buffering for simple cases (#9485)

* [Serve] Use pickle instead of cloudpickle (#9479)

* Fix pip and Bazel interaction messing up CI (#9506)

Co-authored-by: Mehrdad <[email protected]>

* [Core] Fix Java detached error (#9526)

* fix java createActor NPE bug (#9532)

* [RLlib] Issue 9218: PyTorch Policy places Model on GPU even with num_gpus=0 (#9516)

* [Stats] Fix metric exporter test (#9376)

* Hotfix Lint for Serve (#9535)

* Windows cleanup (#9508)

* Remove unneeded code for Windows

* Get rid of usleep()

* Make platform_shims includes non-transitive

Co-authored-by: Mehrdad <[email protected]>

* [RLlib] Issue 8384: QMIX doesn't learn anything. (#9527)

* Add placement group manager and some code in core_worker (#9120)

Co-authored-by: Lingxuan Zuo <[email protected]>

* [core] Add flag to enable object reconstruction during ray start (#9488)

* Add flag

* doc

* Fix tests

* Pipelining task submission to workers (#9363)

* first step of pipelining

* pipelining tests & default configs
- added pipelining unit tests in direct_task_transport_test.cc
- added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in flight to each worker
- consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_

* post-review revisions

* linting, following naming/style convention

* linting

* [New scheduler] Queueing refactor (#9491)

* .

* test_args passes

* .

* test_basic.py::test_many_fractional_resources causes ray to hang

* test_basic.py::test_many_fractional_resources causes ray to hang

* .

* .

* useful

* test_many_fractional_resources fails instead of hanging now :)

* Passes test_fractional_resources

* .

* .

* Some cleanup

* git is hard

* cleanup

* .

* .

* .

* .

* .

* .

* .

* cleanup

* address reviews

* address reviews

* more refactor

* :)

* travis pls

* .

* travis pls

* .

* [Serve] Add internal instruction for running benchmarks (#9531)

* MADDPG learning confirmation test. (#9538)

* Fix Bazel in Docker (#9530)

Co-authored-by: Mehrdad <[email protected]>

* Fix bug that `test_multi_node.py::test_multi_driver_logging` hangs when GCS actor management is turned on (#9539)

Co-authored-by: 灵洵 <[email protected]>

* [tune] Unflattened lookup for ProgressReporter (#9525)

Co-authored-by: Kai Fricke <[email protected]>

* Add plasma store benchmark for small objects (#9549)

* [Tune] Copy default_columns in new ProgressReporter instances (#9537)

* quickfix (#9552)

* [tune] pin tune-sklearn (#9498)

* [cli] ray memory: added redis_password (#9492)

* [GCS]Fix lease worker leak bug when gcs server restarts (#9315)

* add part code

* fix compile bug

* fix review comments

* fix review comments

* fix review comments

* fix review comments

* fix review comment

* fix ut bug

* fix lint error

* fix review comment

* fix review comments

* add testcase

* add testcase

* fix bug

* fix review comments

* fix review comment

* fix review comment

* refine comments

Co-authored-by: 灵洵 <[email protected]>
Co-authored-by: Hao Chen <[email protected]>

* [tune] fix pbt checkpoint_freq (#9517)

* Only delete old checkpoint if it is not the same as the new one

* Return early if old checkpoint value coincides with new checkpoint value

Co-authored-by: Kai Fricke <[email protected]>

* [Core] Remove socket pair exchange in Plasma Store (#9565)

* try use boost::asio for notification processing

* [Metric] new cython interface for python worker metric (#9469)

* Bazel fixes (#9519)

* GCS client add fetch operation before subscribe (#9564)

* [RLlib] Fix combination of lockstep and multiple agents controlled by the same policy. (#9521)

* Change aggregation when lockstep is activated.

Modification of MultiAgentBatch.timeslices to support the combination of lockstep and multiple agents controlled by the same policy.

fix ray-project/ray#9295

* Line too long.

* [Core] Replace the Plasma eventloop with boost::asio (#9431)

* Fix Java named actor bug (#9580)

* Fix setup.py bug (#9581)

Co-authored-by: Mehrdad <[email protected]>

* [Serve] Serialize Query object directly (#9490)

* Add dashboard dependencies to default ray installation (#9447)

* Dashboard next-version API support in backend (#9345)

* Fix log losses (#9559)

* Close log on shutdown

* Disable log buffering

Co-authored-by: Mehrdad <[email protected]>

* [docker] run Ubuntu 20.04 as base image (#9556)

* Add PTL to README.rst (#9594)

Co-authored-by: Richard Liaw <[email protected]>

* Skip unneeded steps on CI (#9582)

Co-authored-by: Mehrdad <[email protected]>

* Fix Windows CI (#9588)

Co-authored-by: Mehrdad <[email protected]>

* [serve] Rename to `Controller` (#9566)

* Handle warnings in core (#9575)

* [New scheduler] Fix new scheduler bug (#9467)

* fix new scheduler bug

* add testcase for soft resource allocation

* modify RemoveNode

* Ensure unique log file names across same-node raylets. (#9561)

* fix tag key typo (#9606)

* Rename path variable due to zsh conflict (#9610)

* [doc] [minor] Make API docs easier to find. (#9604)

* Issue 9568: `rllib train` framework in config gets overridden with tf. (#9572)

* Use UTF-8 for encoding of python code for collision hashing (#9586)

Co-authored-by: Arne Sachtler <[email protected]>
Co-authored-by: simon-mo <[email protected]>

* Add bazel to the PATH in setup.py (#9590)

Co-authored-by: Mehrdad <[email protected]>

* Fix Lint in setup.py (#9618)

Co-authored-by: Mehrdad <[email protected]>

* Shellcheck comments (#9595)

* [Serve] Document Metric Infrastructure (#9389)

* [CI] Do not run jenkins test on GHA (#9621)

* Support ray task type checking (#9574)

* [Metrics] Java metric API (#9377)

* [GCS] fix the fault tolerance about gcs node manager (#9380)

* Shellcheck quoting (#9596)

* Fix SC2006: Use $(...) notation instead of legacy backticked `...`.

* Fix SC2016: Expressions don't expand in single quotes, use double quotes for that.

* Fix SC2046: Quote this to prevent word splitting.

* Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching.

* Fix SC2068: Double quote array expansions to avoid re-splitting elements.

* Fix SC2086: Double quote to prevent globbing and word splitting.

* Fix SC2102: Ranges can only match single chars (mentioned due to duplicates).

* Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"?

* Fix SC2145: Argument mixes string and array. Use * or separate argument.

* Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string).

Co-authored-by: Mehrdad <[email protected]>

* Fix bug in Bazel version check (#9626)

Co-authored-by: Mehrdad <[email protected]>

* [Java] Avoid data copy from C++ to Java for ByteBuffer type (#9033)

* Revert "Dashboard next-version API support in backend (#9345)" (#9639)

This reverts commit fca1fb18f366ebff6016978cb6440dd1ed8637fe.

* [Autoscaler] Command Line Interface improvements (#9322)

Co-authored-by: Richard Liaw <[email protected]>

* [Core] GCS Actor management on by default. (#8845)

* GCS Actor management on by default.

* Fix travis config.

* Change condition.

* Remove unnecessary CI.

* [Core] Fix concurrency issues in plasma store runner (#9642)

* fix window jni unhappy compiler (#9635)

* Fix TestObjectTableResubscribe testcase bug (#9650)

* fix named actor single process mode bug (#9652)

* [core] Fix Ray service startup when logging redirection is disabled. (#9547)

* Fix TorchDeterministic (#9241)

* [RaySGD] revised existing transformer example to work with transformers>=3.0 (#9661)

Co-authored-by: Kai Fricke <[email protected]>

* [rllib] Fix torch TD error, IMPALA LR updates (#9477)

* update

* add test

* lint

* fix super call

* speed es test up

* Auto-cancel build when a new commit is pushed (#8043)

Co-authored-by: Mehrdad <[email protected]>

* Fix lint in remote-watch.py (#9668)

* [Core] Remove unnecessary windows syscall in plasma store (#9602)

* Remove unused windows shims (#9583)

* Temporarily disable remote watcher (#9669)

* Drop support for Python 3.5. (#9622)

* Drop support for Python 3.5.

* Update setup.py

* [Core] WorkerInterface refactor (#9655)

* .

* .

* refactor WorkerInterface

* .

* Basic unit test structure complete?

* .

* .

* .

* .

* Fixed tests

* Fixed tests

* .

* [core] Enable object reconstruction for retryable actor tasks (#9557)

* Test actor plasma reconstruction

* Allow resubmission of actor tasks

* doc

* Test for actor constructor

* Kill PID before removing node

* Kill pid before node

* fix java coreworker crash (#9674)

* use help proto-init-macro for streaming config (#9272)

* Update release information from 0.8.6. (#9124)

* [BRING BACK TO MASTER] Update release information.

* [MERGE TO MASTER] Add microbenchmark result.

* Update asan tests to the doc.

* Refinements to the Serve documentation (#9587)

Co-authored-by: Dean Wampler <[email protected]>

* [tune] survey (#9670)

* Fix ERROR logging not being printed to standard error (#9633)

Co-authored-by: Mehrdad <[email protected]>

* [Tune Docs] Logging doc fix (#9691)

* [rllib] Type annotations for model classes (#9646)

* [Serve] Allow multiple HTTP servers. (#9523)

* Issue 9631: Tf1.14 does not have tf.config.list_physical_devices. (#9681)

* [Serve] Fix Formatting, stale docs (#9617)

* fixed simplex initialisation seeding bug (#9660)

Co-authored-by: Petros Christodoulou <[email protected]>

* Switch from GitHub checkout@v2 to checkout@v1 due to bugs in checkout (#9697)

Co-authored-by: Mehrdad <[email protected]>

* Add Ray Serve to README.rst (#9688)

* Shellcheck rewrites (#9597)

* Fix SC2001: See if you can use ${variable//search/replace} instead.

* Fix SC2010: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames.

* Fix SC2012: Use find instead of ls to better handle non-alphanumeric filenames.

* Fix SC2015: Note that A && B || C is not if-then-else. C may run when A is true.

* Fix SC2028: echo may not expand escape sequences. Use printf.

* Fix SC2034: variable appears unused. Verify use (or export if used externally).

* Fix SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options.

* Fix SC2071: > is for string comparisons. Use -gt instead.

* Fix SC2154: variable is referenced but not assigned

* Fix SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

* Fix SC2188: This redirection doesn't have a command. Move to its command (or use 'true' as no-op).

* Fix SC2236: Use -n instead of ! -z.

* Fix SC2242: Can only exit with status 0-255. Other data should be written to stdout/stderr.

* Fix SC2086: Double quote to prevent globbing and word splitting.

Co-authored-by: Mehrdad <[email protected]>

* [Autoscaler] CLI Logger docs (#9690)

Co-authored-by: Richard Liaw <[email protected]>

* Update rllib-algorithms.rst (#9640)

* [tune] move jenkins tests to travis (#9609)

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>

* [RLlib] Implement DQN PyTorch distributional head. (#9589)

* Add placement group java api (#9611)

* add part code

* add part code

* add part code

* fix code style

* fix review comment

* fix review comment

* add part code

* add part code

* add part code

* add part code

* fix review comment

* fix review comment

* fix code style

* fix review comment

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <[email protected]>

* [Stats] Improve Stats::Init & Add it to GCS server (#9563)

* [Core] Try remove all windows compat shims (#9671)

* try remove compat for arrow

* remove unistd.h

* remove socket compat

* delete arrow windows patch

* Fix a few flaky tests (#9709)

Fix test_custom_resources, Remove test_pandas_parquet_serialization, Better error message for test_output.py, Potentially fix test_dynres::test_dynamic_res_creation_scheduler_consistency

* [GCS]Open test_gcs_fault_tolerance testcase (#9677)

* enable test_gcs_fault_tolerance

* fix lint error

Co-authored-by: 灵洵 <[email protected]>

* [Tests]lock vector to avoid potential flaky test (#9656)

* [tune] distributed torch wrapper (#9550)

* changes

* add-working

* checkpoint

* ccleanu

* fix

* ok

* formatting

* ok

* tests

* some-good-stuff

* fix-torch

* ddp-torch

* torch-test

* sessions

* add-small-test

* fix

* remove

* gpu-working

* update-tests

* ok

* try-test

* formgat

* ok

* ok

* [GCS] Fix actor task hang when its owner exits before local dependencies resolved (#8045)

* Only update raylet map when autoscaler configured (#9435)

* [Dashboard] New dashboard skeleton (#9099)

* Fixing multiple building issues

* Make wait_for_condition raise exception when timing out. (#9710)
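
A polling helper that fails loudly on timeout, as this entry describes, might look like the sketch below. It is a standalone illustration under stated assumptions: the name `wait_for_condition_or_raise`, its parameters, and the use of `RuntimeError` are hypothetical and are not taken from Ray's test utilities.

```python
import time


def wait_for_condition_or_raise(condition, timeout=10.0, retry_interval_s=0.1):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Raises RuntimeError on timeout instead of returning False, so a test
    that forgets to check the return value cannot pass silently.
    """
    deadline = time.monotonic() + timeout
    while True:
        # The condition is always checked at least once.
        if condition():
            return
        if time.monotonic() >= deadline:
            raise RuntimeError("The condition was not met before the timeout.")
        time.sleep(retry_interval_s)


# Usage: the condition becomes true on the third poll.
calls = {"count": 0}


def ready():
    calls["count"] += 1
    return calls["count"] >= 3


wait_for_condition_or_raise(ready, timeout=5.0, retry_interval_s=0.01)
```

Raising rather than returning a boolean means an asynchronous update that never takes effect shows up as a test failure at the wait itself.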

* [GCS]GCS client support multi-thread subscribe&resubscribe&unsubscribe (#9718)

* Package and upload ray cross-platform jar (#9540)

* Revert "Package and upload ray cross-platform jar (#9540)" (#9730)

This reverts commit 881032593d3c1b9360ea641c24d50a022677a25e.

* Only build docker wheels in LINUX_WHEELS env (#9729)

* Keep build-autoscaler-images.sh alive in CI (#9720)

* [core] Removes Error when Internal Config is not set (#9700)

* [Cluster Launcher] Re Org the cluster launcher pages. (#9687)

* [RLlib] Offline Type Annotations (#9676)

* Offline Annotations

* Modifications

* Fixed circular dependencies

* Linter fix

* Python api of placement group (#9243)

* Include open-ssh-client for transparency (#9693)

* Fix remote-watch.py (#9625)

Co-authored-by: Mehrdad <[email protected]>

* [docker] Uses Latest Conda & Py 3.7 (#9732)

* Fix broken actor failure tests. (#9737)

* [Stats] fix stats shutdown crash if opencensus exporter not initialized (#9727)

* Fix package and upload ray jar (#9742)

* Introduce file_mounts_sync_continuously cluster option (#9544)

* Separate out file_mounts contents hashing into its own separate hash

Add an option to continuously sync file_mounts from head node to worker nodes:
monitor.py will re-sync file mounts whenever contents change but will only run setup_commands if the config also changes

* add test and default value for file_mounts_sync_continuously

* format code

* Update comments

* Add param to skip setup commands when only file_mounts content changed during monitor.py's update tick

Fixed so setup commands run when ray up is run and file_mounts content changes

* Refactor so that runtime_hash retains previous behavior

runtime_hash is almost identical to what it was before this PR. It is used to determine if setup_commands need to run
file_mounts_contents_hash is an additional hash of the file_mounts content that is used to detect when only file syncing has to occur.

Note: runtime_hash value will have changed from before the PR because we hash the hash of the contents of the file_mounts as a performance optimization

* fix issue with hashing a hash

* fix bug where trying to set contents hash when it wasn't generated

* Fix lint error

Fix bug in command_runner where check_output was no longer returning the output of the command

* clear out provider between tests to get rid of flakiness

* reduce chance of race condition from node_launcher launching a node in the middle of an autoscaler.update call
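
The two-hash idea described in the commits above can be sketched roughly as follows. This is an illustration under stated assumptions, not the autoscaler's code: `config_hash`, `contents_hash`, and `decide_update` are hypothetical names, SHA-1 is an arbitrary choice, and the real `runtime_hash` may cover different inputs.

```python
import hashlib
import json
import os


def config_hash(config):
    """Hash the configuration itself (stand-in for runtime_hash)."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def contents_hash(paths):
    """Hash the bytes of every file under `paths`
    (stand-in for file_mounts_contents_hash)."""
    hasher = hashlib.sha1()
    for path in sorted(paths):
        if os.path.isdir(path):
            for root, dirs, files in os.walk(path):
                dirs.sort()  # make the traversal order deterministic
                for name in sorted(files):
                    full = os.path.join(root, name)
                    hasher.update(full.encode("utf-8"))
                    with open(full, "rb") as f:
                        hasher.update(f.read())
        else:
            hasher.update(path.encode("utf-8"))
            with open(path, "rb") as f:
                hasher.update(f.read())
    return hasher.hexdigest()


def decide_update(old, new):
    """Compare (config_hash, contents_hash) pairs and pick an action."""
    if old[0] != new[0]:
        return "sync_and_setup"  # config changed: re-sync and run setup commands
    if old[1] != new[1]:
        return "sync_only"  # only file contents changed: re-sync files
    return "noop"
```

Keeping the two hashes separate is what lets a periodic update tick re-sync files without re-running setup commands.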

* [dist] swap mac/linux wheel build order (#9746)

* [RLlib] Enhance reward clipping test; add action_clipping tests. (#9684)

* [RLlib] Issue 9667 DDPG Torch bugs and enhancements. (#9680)

* [Metrics]Ray java worker metric registry (#9636)

* ray worker metrics gauge init

* ray java metric mapping

* add jni source files for gauge and tagkey

* mapping all metric classes to stats object

* check non-null for tags and name

* lint

* add symbol for native metric JNI

* extern c for symbol

* add tests for all metrics

* Update Metric.java

use metricNativePointer instead.

* unify metric native stuff to one class

* fix jni file

* add comments for metric transform function in jni utils

* move metric function to native metric file

* remove unused disconnect jni

* Add a metric registry for java metrics

* Restore install-bazel.sh

* Add some comments for metric registry

* Fix thread safe problem of metrics

* Fix metric tests and remove sleep code from tests

* Fix comments of metrics

Co-authored-by: lingxuan.zlx <[email protected]>

* fix windows compile bug (#9741)

Co-authored-by: 灵洵 <[email protected]>

* Run _with_interactive in Docker (#9747)

* [New scheduler] First unit test for task manager (#9696)

* .

* .

* refactor WorkerInterface

* .

* Basic unit test structure complete?

* .

* bad git >:-(

* small clean up

* CR

* .

* .

* One more fixture

* One more fixture

* .

* .

* bazel-format

* .

* [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (#9607)

* [Release] Fix release tests (#9733)

* Register function race (#9346)

* Revert "[dist] swap mac/linux wheel build order (#9746)" and "Fix package and upload ray jar (#9742)" (#9758)

* Revert "[dist] swap mac/linux wheel build order (#9746)"

This reverts commit a9340565ff46626b18fd36f22a37d0380ae18d85.

* Revert "Fix package and upload ray jar (#9742)"

This reverts commit c290c308fe1e496480db5c37489df619cff6168f.

* Fix some Windows CI issues (#9708)

Co-authored-by: Mehrdad <[email protected]>

* Pin pytest version (#9767)

* [Java] Use test groups to filter tests of different run modes (#9703)

* [Java] Fix MetricTest.java due to incomplete changes from #9703 (#9770)

* Fix leased worker leak bug when lease worker requests are still waiting to be scheduled as GCS restarts (#9719)

* [Stats] enable core worker stats (#9355)

* [GCS]Use a separate thread in node failure detector to handle heartbeat (#9416)

* use a sole thread to handle heartbeat

* separate signal thread

* use work to avoid exiting when task is underway

* protect shared data structure to avoid deadlock

* add comments

* decrease io service num

* minor changes

* fix test

* per stephanie's comments

* use single io service instead of 1-size io service pool

* typo

* [GCS Actor Management] Fix flaky test_dead_actors. (#9715)

* Fix.

* Add logs.

* Add a unit test.

* [TUNE] Tune Docs re-organization (#9600)

Co-authored-by: Richard Liaw <[email protected]>

* [RLlib] Trajectory View API (preparatory cleanup and enhancements). (#9678)

* [Core] Socket creation race condition bug fixes (#9764)

* fix issues

* hot fixes

* test

* test

* Always info log

* Fixed stderr logging (9765)

* [Core] Custom socket name (#9766)

* fix issues

* hot fixes

* test

* test

* socket name change only

* Fix src/ray/core_worker/common.h deleted constructor (#9785)

Co-authored-by: Mehrdad <[email protected]>

* [Stats] Fix harvester threads + Fix flaky stats shutdown. (#9745)

* More fixes

* Applying latest changes in travis.yml

* Fixing fixture data exclusions

* Disable some java tests

* Fix some CI errors

* Update hash

* Fixing more build issues

* Fixing more build issues

* Fix pipeline cache path

* More fixes

* Fix bazel test command

* Fix bazel test

* Fix general info steps

* Custom env var for docker build

* Trying a different way to install bazel

* Bazel fix

* Updating hash

Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]>
Co-authored-by: mehrdadn <[email protected]>
Co-authored-by: Mehrdad <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Alisa <[email protected]>
Co-authored-by: Lingxuan Zuo <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Stefan Schneider <[email protected]>
Co-authored-by: Patrick Ames <[email protected]>
Co-authored-by: Hao Chen <[email protected]>
Co-authored-by: fangfengbin <[email protected]>
Co-authored-by: 灵洵 <[email protected]>
Co-authored-by: Tao Wang <[email protected]>
Co-authored-by: Kai Yang <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Ian Rodney <[email protected]>
Co-authored-by: Henk Tillman <[email protected]>
Co-authored-by: Tanay Wakhare <[email protected]>
Co-authored-by: Nicolaus93 <[email protected]>
Co-authored-by: Vasily Litvinov <[email protected]>
Co-authored-by: krfricke <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: kisuke95 <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Michael Mui <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: chaokunyang <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Michael Luo <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: Tom <[email protected]>
Co-authored-by: jerrylee.io <[email protected]>
Co-authored-by: Raphael Avalos <[email protected]>
Co-authored-by: William Falcon <[email protected]>
Co-authored-by: Clark Zinzow <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
Co-authored-by: Arne Sachtler <[email protected]>
Co-authored-by: Arne Sachtler <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: ZhuSenlin <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Maksim Smolin <[email protected]>
Co-authored-by: Dean Wampler <[email protected]>
Co-authored-by: Dean Wampler <[email protected]>
Co-authored-by: Bill Chambers <[email protected]>
Co-authored-by: Petros Christodoulou <[email protected]>
Co-authored-by: Petros Christodoulou <[email protected]>
Co-authored-by: Justin Terry <[email protected]>
Co-authored-by: Tao Wang <[email protected]>
Co-authored-by: fyrestone <[email protected]>
Co-authored-by: Alan Guo <[email protected]>
Co-authored-by: bermaker <[email protected]>

* Sync Upstream master (#50)

* [core] Pull Manager exponential backoff (#13024)

* [RLlib] Issue 12789: RLlib throws the warning "The given NumPy array is not writeable" (#12793)

* [release tests] test_many_tasks fix (#12984)

* Add "beta" documentation for enabling object spilling manually (#13047)

* [Serve] Handle Bug Fixes (#12971)

* [Dashboard] Add GET /logical/actors API (#12913)

* [GCS]Decouple gcs resource manager and gcs node manager (#13012)

* [ray_client]: Insert decorators into the real ray module to allow for client mode (#13031)

* [GCS] Delete redis gcs client and redis_xxx_accessor (#12996)

* [RLlib] Fix broken unity3d_env import in example server script. (#13040)

* [RLlib] TorchPolicies: Accessing "infos" dict in train_batch causes `TypeError`. (#13039)

* [joblib] Fix flaky joblib test. (#13046)

* [Tune]Add integer loguniform support (#12994)

* Add integer quantization and loguniform support

* Fix hyperopt qloguniform not being np.log'd first

* Add tests, __init__

* Try to fix tests, better exceptions

* Tweak docstrings

* Type checks in SearchSpaceTest

* Update docs

* Lint, tests

* Update doc/source/tune/api_docs/search_space.rst

Co-authored-by: Kai Fricke <[email protected]>

Co-authored-by: Kai Fricke <[email protected]>

* [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048)

* Add index for tasks to dispatch

* Task dependency manager interface

* Unsubscribe dependencies and tests

* NodeManager

* Revert "Add index for tasks to dispatch"

This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea.

* tmp

* Move back to waiting if args not ready

* update

* Update to new form of brew cask install command

* [Autoscaler] New output log format (#12772)

* Fix typo RMSProp -> RMSprop (#13063)

* [serve] Centralize HTTP-related logic in HTTPState (#13020)

* Remove suppress output to see why wheel is not building

* Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006)

* New dependency manager

* Switch raylet to new DependencyManager

* PullManager accepts bundles

* Cleanup, remove old task dependency manager

* x

* PullManager unit tests

* lint

* Unit tests

* Rename

* lint

* test

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <[email protected]>

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <[email protected]>

* x

* lint

Co-authored-by: SangBin Cho <[email protected]>

* [docs] Fix args + kwargs instead of docstrings (#13068)

* functools wraps

* Fix typo (functoools -> functools)

* Fix OS X Wheel Build - Update brew cask install (#13062)

Co-authored-by: Richard Liaw <[email protected]>

* speed up local mode object store get (#13052)

Co-authored-by: senlin.zsl <[email protected]>

* [RLlib] Execution Annotation (#13036)

* [RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943)

* [C++ API] Added reference counting to ObjectRef (#13058)

* Added reference counting to ObjectRef

* Addressed the comments

* [Core] Remove cuda support in plasma store (#13070)

* remove cuda support in plasma store

* [Core] Remove outdated external store (#13080)

* remove outdated external store

* [GCS] Move resource usage info to gcs resource manager (#13059)

* [RLlib] JAXPolicy prep. PR #1. (#13077)

* [RLlib] Preprocessor fixes (multi-discrete) and tests. (#13083)

* [RLlib] BC/MARWIL/recurrent nets minor cleanups and bug fixes. (#13064)

* [Collective][PR 3.5/6] Send/Recv calls and some initial code for communicator caching (#12935)

* other collectives all work

* auto-linting

* manual linting #1

* manual linting 2

* bugfix

* add send/recv point-to-point calls

* add some initial code for communicator caching

* auto linting

* optimize imports

* minor fix

* fix unpassed tests

* support more dtypes

* rerun some distributed tests for send/recv

* linting

* [Serve] [Doc] Front page update (#13032)

* Deprecate experimental / dynamic resources (#13019)

* [docs] fix wandb url (#13094)

* [Serve] Implement Graceful Shutdown (#13028)

* [Serve] Use ServeHandle in HTTP proxy (#12523)

* [Java] Format ray java code (#13056)

* [docker] Fix restart behavior with Docker (#12898)

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: ijrsvt <[email protected]>

* Disable broken streaming tests (#13095)

* [autoscaler] Make placement groups bypass max launch limit (#13089)

* Serve metrics docs (#13096)

* [RLlib] run_regression_tests.py: --framework flag (instead of --torch). (#13097)

* [RLLib] Readme.md Documentation for Almost All Algorithms in rllib/agents (#13035)

* [Doc] Fix Sphinx.add_stylesheet deprecation (#13067)

* Fix streaming ci failure (#12830)

* [RLlib] New Offline RL Algorithm: CQL (based on SAC) (#13118)

* [Bugfix][Dashboard] Fix undefined logCount, errorCount UI crash (#13113)

* [RLlib] Deflake test case: 2-step game MADDPG. (#13121)

* [RLlib] Trajectory view API docs. (#12718)

* Job module without submission (#13081)

Co-authored-by: 刘宝 <[email protected]>

* [RLlib] JAXPolicy prep PR #2 (move get_activation_fn (backward-compatibly), minor fixes and preparations). (#13091)

* [Java] Avoid failure of serializing a user-defined unserializable exception. (#13119)

* [Tune] Update URL to fix 403 not found error in PBT transformers test case (#13131)

* [serve] Async controller (#13111)

* [dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948)

* [Serve] Use a small object to track requests (#13125)

* [docs][kubernetes][minor] Update K8s examples in docs (#13129)

* [RLlib] Support easy `use_attention=True` flag for using the GTrXL model. (#11698)

* [docs] Documentation + example for the C++ language API (#13138)

* [Java] Support `wasCurrentActorRestarted` in actor task. (#13120)

* Remove check.

* Add test

* fix lint

* lint

* Fix spotless lint

* Address comments.

* Fix lint

Co-authored-by: Qing Wang <[email protected]>

* [docs] Minor change to formatting C++ docs. (#13151)

* Deprecate setResource java api (#13117)

* [docs] Small fix in C++ documentation. (#13154)

* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first commit

* lint

* example

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: root <[email protected]>

* [Serve] [Doc] Add existing web server integration ServeHandle tutorial (#13127)

* [kubernetes][docs][minor] Kubernetes version warning (#13161)

* [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817)

* Locality-aware leasing for owned refs (pinned locations).

* LessorPicker --> LeasePolicy.

* Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects.

* Update comments.

* Turn on locality-aware leasing feature flag by default.

* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy.

* Add lease policy consulting assertions to the direct task submitter tests.

* Add lease policy tests.

* LocalityLeasePolicy --> LocalityAwareLeasePolicy.

* Add missing const declarations.

Co-authored-by: SangBin Cho <[email protected]>

* Add RAY_CHECK for raylet address nullptr when creating lease client.

* Make the fact that LocalLeasePolicy always returns the local node more explicit.

* Flatten GetLocalityData conditionals to make it more readable.

* Add ReferenceCounter::GetLocalityData() unit test.

* Add data-intensive microbenchmarks for single-node perf testing.

* Add data-intensive microbenchmarks for simulated cluster perf testing.

* Remove redundant comment.

* Remove data-intensive benchmarks.

* Add locality-aware leasing Python test.

* Formatting changes in ray_perf.py.

Co-authored-by: SangBin Cho <[email protected]>

* Enabling the cancellation of non-actor tasks in a worker's queue (#12117)

* wrote code to enable cancellation of queued non-actor tasks

* minor changes

* bug fixes

* added comments

* rev1

* linting

* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

* bug fix

* added two unit tests

* linting

* iterating through pending_normal_tasks starting from end

* fixup! iterating through pending_normal_tasks starting from end

* fixup! fixup! iterating through pending_normal_tasks starting from end

* post merge fixes

* added debugging instructions, pulled Accept() out of guarded loop

* removed debugging instructions, linting

* [Serve] Bug in Serve node memory-related resources calculation #11198 (#13061)

* [Release] Update Release Process Documentation (#13123)

* [Core] Remove Arrow dependencies (#13157)

* remove arrow ubsan

* remove arrow build depend

* remove arrow buffer

* [XGboost] Update Documentation (#13017)

Co-authored-by: Richard Liaw <[email protected]>

* [SGD] Fix Docstring for `as_trainable` (#13173)

* Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178)

This reverts commit b4d688b4a64c595a071e8c7380b653e0bfea4ad2.

* Surface object store spilling statistics in `ray memory` (#13124)

* [ray_client]: Move from experimental to util (#13176)

Change-Id: I9f054881f0429092d265cd6944d89804cce9d946

* Remove unused file(object_manager_integration_test.cc) (#12989)

* Notify listeners after registered node stored (#13069)

* [build]Update description and add some keywords (#13163)

* [Collective][PR 2/6] Driver program declarative interfaces (#12874)

* scaffold of the code

* some scratch and options change

* NCCL mostly done, supporting API#1

* interface 2.1 2.2 scratch

* put code into ray and fix some importing issues

* add an additional Rendezvous class to safely meet at named actor

* fix some small bugs in nccl_util

* some small fix

* scaffold of the code

* some scratch and options change

* NCCL mostly done, supporting API#1

* interface 2.1 2.2 scratch

* put code into ray and fix some importing issues

* add an additional Rendezvous class to safely meet at named actor

* fix some small bugs in nccl_util

* some small fix

* add a Backend class to make Backend string more robust

* add several useful APIs

* add some tests

* added allreduce test

* fix typos

* fix several bugs found via unittests

* fix and update torch test

* changed back actor

* rearrange a bit before importing distributed test

* add distributed test

* remove scratch code

* auto-linting

* linting 2

* linting 2

* linting 3

* linting 4

* linting 5

* linting 6

* 2.1 2.2

* fix small bugs

* minor updates

* linting again

* auto linting

* linting 2

* final linting

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <[email protected]>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <[email protected]>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <[email protected]>

* added actor test

* lint

* remove local sh

* address most of richard's comments

* minor update

* remove the actor.option() interface to avoid changes in ray core

* minor updates

Co-authored-by: YLJALDC <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>

* [serve] Merge ActorReconciler and BackendState (#13139)

* [tune] better signature check for `tune.sample_from` (#13171)

* [tune] better signature check for `tune.sample_from`

* Update python/ray/tune/sample.py

Co-authored-by: Sumanth Ratna <[email protected]>

Co-authored-by: Sumanth Ratna <[email protected]>

* Disable atexit test on windows (#13207)

* [serve] Move controller state into separate files (#13204)

* Update multi_agent_independent_learning.py (#13196)

pettingzoo.utils.error.DeprecatedEnv: waterworld_v0 is now depreciated, use waterworld_v2 instead

* [Collective] Some necessary abstraction of collective calls before introducing stream management (#13162)

* [Tune] Fix PBT Transformers Example (#13174)

* [Serve] HTTPOptions for deployment modes (#13142)

* [tests] Fix Autoscaler Test failure on Windows (#13211)

* skip create_or_update tests

* Update python/ray/tests/test_autoscaler.py

Co-authored-by: Ameer Haj Ali <[email protected]>

Co-authored-by: Ameer Haj Ali <[email protected]>

* [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158)

* [GCS]Fix TestActorSubscribeAll bug (#13193)

* [Metrics] Record per node and raylet cpu / mem usage (#12982)

* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.

* [Tune] Fix tune serve integration example (#13233)

* [Redis] Note that each Redis Connect retry takes two minutes (#12183)

* Slightly alter error message so it's the same in both cases.

* Each retry takes about two minutes.

* [Log] fix spdlog init race (#12973)

* fix spdlog init race

* use global logger

* refine logger name and constructor

* [Release] Add 1.1.0 release test logs (#13054)

* Add microbenchmark to release logs

* check in many_tasks stress test result

* Add results of placement group stress test for 1.1.0

* Add result for test_dead_actors test and correct the name of test_many_tasks.txt

* Add rllib regression test result

* Add pytorch test results for rllib

* remove extraneous log entries

* [Core] Fix incorrect comment (#13228)

* [Serialization] Fix cloudpickle (#13242)

* [GCS]Fix gcs table storage `GetAll` and `GetByJobId` api bug (#13195)

* Start ray client server with 'ray start' (#13217)

* [GCS]Add gcs actor schedule strategy (#13156)

* Publish job/worker info with Hex format instead of Binary (#13235)

* [RLlib] SquashedGaussians should throw error when entropy or kl are called. (#13126)

* [Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247)

Now that `HeadOnly` is the new default HTTP location, we can
re-enable the long running tests to use local multi-clusters.
(Also updated the controller's API usage to be up to date; we should
have caught these, and I will open issues for this.)

* Update autoscaler-cluster yaml files for release tests (#13114)

* [Release] Use ray-ml image for long running test (#13267)

* [RLlib] Fix missing "info_batch" arg (None) in `compute_actions` calls. (#13237)

* [Tune] Improve error message for Session Detection (#13255)

* Improve error message

* log once

* [Tune] Pin Tune Dependencies (#13027)

Co-authored-by: Ian <[email protected]>

* [Dependabot] Add Dependabot (#13278)

Co-authored-by: Ian <[email protected]>

* [docker] Pull if image is not present (#13136)

* [GCS] Remove old lightweight resource usage report code path (#13192)

* [Dashboard] Add GET /log_proxy API (#13165)

* Fix a crash problem caused by GetActorHandle in ActorManager (#13164)

* [ray_client] Add metadata to gRPC requests (#13167)

* [RLlib] Preparatory PR for: Documentation on Model Building. (#13260)

* [tune](deps): Bump mlflow from 1.13.0 to 1.13.1 in /python/requirements (#13286)

* [tune](deps): Bump gluoncv from 0.9.0 to 0.9.1 in /python/requirements (#13287)

* Remove top-level ray.connect() and ray.disconnect() APIs (#13273)

* [Pull manager] Only pull once per retry period (#13245)

* .

* docs

* cleanup

* .

* .

* .

* .

Co-authored-by: Alex <[email protected]>

* [Cancellation] Make Test Cancel Easier to Debug (#13243)

* first commit

* lint-fix

* [ray_client]: first draft of documentation (#13216)

* Do not give an error if both `RAY_ADDRESS` and `address` are specified on initialization (#13305)

* Finalize handling of RAY_ADDRESS

* lint

* [serve] Clean up EndpointState interface, move checkpointing inside of EndpointState (#13215)

* [RLlib] SlateQ Documentation (#13266)

* [RLlib] Add more detailed Documentation on Model building API (#13261)

* [tune] convert search spaces: parse spec before flattening (#12785)

* Parse spec before flattening

* flatten after parse

* Test for ValueError if grid search is passed to search algorithms

* remove empty extras streaming deps (#12933)

* add the method annotation and a comment explaining what's happening (#13306)

Change-Id: I848cc2f0beaed95340d9de7cca19a50c78d9da9a

* Use wait_for_condition to reduce flakiness in test_queue.py::test_custom_resources (#13210)

* [RLlib] Issue 13330: No TF installed causes crash in `ModelCatalog.get_action_shape()` (#13332)

* [serve] Cleanup backend state, move checkpointing and async goal logic inside (#13298)

* fix removal of task dependencies (#13333)

Co-authored-by: senlin.zsl <[email protected]>

* [Serve] Support Starlette streaming response (#13328)

* [RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)

* [client] Report number of currently active clients on connect (#13326)

* wip

* update

* update

* reset worker

* fix conn

* fix

* disable pycodestyle

* Implement internal kv in ray client (#13344)

* kv internal

* fix

* [Tune] Rename MLFlow to MLflow (#13301)

* Forgot overwrite parameter in Ray client internal kv

* Fix typo in Tune Docs (Checkpointing) (#13348)

See issue #13299

* [Kubernetes][Docs] GPU usage (#13325)

* gpu-note

* gpu-note

* More info

* lint?

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* GKE->Kubernetes

Co-authored-by: Richard Liaw <[email protected]>

* Revert "[RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)" (#13361)

This reverts commit e2b2abb88b82c0c2402a338bba51e5dbd1739419.

* [Dependabot] [CI] Re-configure Dependabot and disable duplicate builds (#13359)

* [tune] buffer trainable results (#13236)

* Working prototype

* Pass buffer length, fix tests

* Don't buffer per default

* Dispatch and process save in one go, added tests

* Fix tests

* Pass adaptive seconds to train_buffered, stop result processing after STOP decision

* Fix tests, add release test

* Update tests

* Added detailed logs for slow operations

* Update python/ray/tune/trial_runner.py

Co-authored-by: Richard Liaw <[email protected]>

* Apply suggestions from code review

* Revert tests and go back to old tuning loop

* nit

Co-authored-by: Richard Liaw <[email protected]>

* [Serve] Add dependency management support for driver not running in a conda env (#13269)

* [RLlib] Add `__len__()` method to SampleBatch (#13371)

* [Serve] Backend state unit tests (#13319)

* trigger doc build for serve updates (#13373)

* [Object Spilling] Long running object spilling test (#13331)

* done.

* formatting.

* Remove unimplemented GetAll method in actor info accessor (#13362)

* [Doc] Remove trailing whitespaces (#13390)

* Enable Ray client server by default (#13350)

* update

* fix

* fix test

* update

* [RLlib] Trajectory View API: Atari framestacking. (#13315)

* [ray_client]: Wait for ready and retry on ray.connect() (#13376)

* [ray_client]: wait until connection ready

Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6

* lint

Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0

* docs and retry minimum

Change-Id: I43f5378322029267ddd69f518ce8206876e2129d

* [Dashboard] Fix missing actor pid (#13229)

* [ray_client]: Fix multiple attempts at checking connection (#13422)

* Plumb retries update (#13411)

* [Serve] [Doc] Improve batching doc (#13389)

* [autoscaler/k8s] [CI] Kubernetes test ray up, exec, down (#12514)

* Fix Serve release test (#13385)

* Add bazel logs upload to GHA (#13251)

* [tune] Fix f-string in error message (#13423)

* [serve] Pull out goal management logic into AsyncGoalManager class (#13341)

* Make request_resources() use internal kv instead of redis pub sub (#13410)

* Remove unused handler methods (#13394)

* [Tune] Pin Transitive Dependencies (#13358)

* Split out the part of get_node_ip_address for which the docstring is correct (#12796)

* Fix raylet::MockWorker::GetProcess crashes (#13440)

Co-authored-by: 刘宝 <[email protected]>

* Revert "Enable Ray client server by default (#13350)" (#13429)

This reverts commit 912d0cbbf912d5b52d6176155bdff02f504b657d.

* Fix linter error (#13451)

* [GCS]Add gcs resource scheduler (#13072)

* [RLlib] Redo: Make TFModelV2 fully modular like TorchModelV2 (soft-deprecate register_variables, unify var names wrt torch). (#13363)

* [Core]Fix raylet scheduling bug (#13452)

* [Core]Fix raylet scheduling bug

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <[email protected]>

* [joblib] joblib strikes again but this time on windows (#13212)

* [ray_client]: fix exceptions raised while executing on the server on behalf of the client (#13424)

* [kubernetes][minor] Operator garbage collection fix (#13392)

* [Core][CLI] `ray status` and `ray memory` no longer start a new job (#13391)

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Job 38482.1 should now pass

* Resolve merge conflict

* [RLlib] Deflake 2x remote & local inference tests (external env). (#13459)

* [docs] Add more guideline on using ray in slurm cluster (#12819)

Co-authored-by: Sumanth Ratna <[email protected]>
Co-authored-by: PENG Zhenghao <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>

* [Dashboard] Fix GPU resource rendering issue (#13388)

* [Release] Fix Serve release test (#13303)

The Docker image we were using now runs as the `ray` user, so we have to
call sudo.

* [serve] Properly obey SERVE_LOG_DEBUG=0 (#13460)

* Fix getting runtime context dict in driver (#13417)

* [xgb] re-enable xgboost_ray tests (#13416)

* re-enable

* fix

* update xgb_ray version

* [Serialization] New custom serialization API (#13291)

* new serialization API with doc & test

* add more notes

* refine notes

* doc

* [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220)

* Added owned object reference before Plasma put on Create() + Seal() path.

* Consolidated location table and reference table in reference counter.

* Restore type in definition.

* Clean up owned reference on failed Seal().

* Added RemoveOwnedObject test for reference counter.

* Guard against ref going out of scope before location RPCs.

* Add 'owner must have ref in scope' precondition to documentation for object location methods.

* Move to separate Create() + Seal() methods for existing objects.

* Clearer distinction between Create() and Seal() methods.

* Make it clear that references will normally be cleaned up by reference counting.

* [ray_client]: Support runtime_context as metadata (#13428)

* [GCS]Remove unused class variable (#13454)

* [Object Spilling] Dedup restore objects (#13470)

* done.

* Addressed code review.

* [CI] Enable Dashboard tests for master (#13425)

* [docker/dashboard] Fix ray dashboard (#12899)

* [CI] Fix Windows Bazel Upload (#13436)

* Return version info from Ray client connect, to allow for discovering version mismatches

* Update ID specification doc (#13356)

* [ray_client]: fix wrong reference in server_pickler (#13474)

Change-Id: Ie3d219541b1875e986e72e3ae73ece145c715acf

* Bump dev branch to 2.0 to avoid endless version bump toil (#13497)

* wip

* fix

* fix

* Remove an unnecessary file (#13499)

* [Tests] Skip failing windows tests (#13495)

* skip failing windows tests

* skip more

* remove

* updates

* [tune] fix small docs typo (#13355)

Signed-off-by: Richard Liaw <[email protected]>

* move message to debug (#13472)

* Minimal version of piping autoscaler events to driver logs (#13434)

* sync write internal config in gcs (#13197)

* Refactor node manager to eliminate `new_scheduler_enabled_` (#12936)

* [GCS]Only publish changed field when node dead (#13364)

* Only update changed field when node dead

* node_id missed

* [CI] Buildkite PR Environment for Simple Tests (#13130)

* [GCS] Remove task info publish as nowhere uses it (#13509)

* Remove task info publish as nowhere uses it

* simplify right publish channel

* [RLlib] Solve PyTorch/TF-eager A3C async race condition between calling model and its value function. (#13467)

* [tune] placement group support (#13370)

* [Serve] Allow ObjectRef for Composition (#12592)

* Add Dashboard Python Test to Buildkite (#13530)

* Add ability to not start Monitor when calling `ray start` (#13505)

* [tune] support experiment checkpointing for grid search (#13357)

* Fix typo (#13098)

* Remove PYTHON_MODE that is not defined in Ray so that import * will work from other packages. (#13544)

* [RLlib] MARWIL loss function test case and cleanup. (#134…

3 participants