
TL/UCP: Add support for active_set broadcast with knomial and dbt for size greater than 2 #926

Merged: 1 commit into openucx:master on May 23, 2024

Conversation

@nsarka (Collaborator) commented Feb 15, 2024

Active set is a sort of lightweight subcommunicator for TL/UCP and TL/NCCL. It was originally introduced as a workaround to enable point-to-point communication. This PR:

  • Removes the restriction that the active set comm size is 2
  • Fixes a bug where the user tag was not being read correctly
  • Adds support for active set double binary tree broadcast in TL/UCP
  • Updates the active set gtest to more thoroughly test all of the above

@swx-jenkins3 commented
Can one of the admins verify this patch?

@samnordmann (Collaborator) left a comment

looks great to me! thanks

src/components/tl/ucp/tl_ucp_coll.h (review thread resolved)
test/gtest/active_set/test_active_set.cc (review thread resolved)
@nsarka force-pushed the nsarka/bcast-active-set branch 2 times, most recently from b7820a2 to e86c6da on February 20, 2024 at 20:27
@Sergei-Lebedev (Contributor) commented

ok to test

@manjugv (Contributor) commented Apr 3, 2024

@nsarka What is the use case for enabling this? And if you're removing the restriction, why limit it to broadcast only?

@nsarka (Collaborator, Author) commented Apr 3, 2024

> @nsarka What is the use case for enabling this? If removing the restriction - why only limit to broadcast?

This was requested by Almog from the cuBLAS team; I'm not sure of the details beyond that. @Sergei-Lebedev, do you know what Almog wanted with this?

@almogsegal commented Apr 4, 2024

The use case is to be able to perform a "broadcast" on row/col communicators without having to create those communicators.
For linear algebra functionality with a 2DBC (2D block-cyclic) data layout, using row/col communicators simplifies the code and improves performance. The problem is that NCCL communicator creation is very expensive.
This way, we can use the active_set broadcast and the stride to perform a "row/col bcast" without paying the comm-split price, while still enjoying the ncclGroupStart/End functionality that is not exposed through UCC.

@nsarka force-pushed the nsarka/bcast-active-set branch 3 times, most recently from 0a9a5d7 to 8e5bac3 on April 16, 2024 at 15:45
@janjust (Collaborator) commented Apr 16, 2024

@manjugv @Sergei-Lebedev @samnordmann
Tested in @almogsegal's container; works as intended.

@janjust (Collaborator) commented Apr 22, 2024

bot:retest

@Sergei-Lebedev (Contributor) commented

bot:retest

@Sergei-Lebedev (Contributor) commented

@nsarka can you please check why gtest failed in CI?

@nsarka (Collaborator, Author) commented May 8, 2024

> @nsarka can you please check why gtest failed in CI?

@Sergei-Lebedev it seems the gtest actually passed. However, the UCC test failed with "Cancelling nested steps due to timeout". It ran for 2 hours, passed the build and codestyle stages, and then hung for 4 hours in the UCC / Torch-UCC test until the 6-hour timeout.

@janjust (Collaborator) commented May 9, 2024

@Sergei-Lebedev , it seems we're facing the same issue (something with containers) on several PRs

@Sergei-Lebedev (Contributor) commented

@janjust I think we're seeing two different problems. Here it's:

[2024-05-07T10:35:28.799Z] [----------] 3 tests from test_context
[2024-05-07T10:35:28.799Z] [ RUN      ] test_context.create_destroy
[2024-05-07T10:35:28.799Z] [       OK ] test_context.create_destroy (236 ms)
[2024-05-07T10:35:28.799Z] [ RUN      ] test_context.init_multiple
[2024-05-07T10:35:29.727Z] [       OK ] test_context.init_multiple (983 ms)
[2024-05-07T10:35:29.727Z] [ RUN      ] test_context.global
[2024-05-07T11:13:33.528Z]  For more details, please look at: 
[2024-05-07T11:13:33.528Z]     /scrap/jenkins/workspace/ucc/.ci/scripts/cov-build/build-log.txt
[2024-05-07T11:13:33.528Z] Running Coverity analysis...
[2024-05-07T11:13:33.528Z] Coverity Static Analysis version 2021.12.1 on Linux 5.4.0-107-generic x86_64
[2024-05-07T11:13:33.528Z] Internal version numbers: 5269ff0e19 p-2021.12-push-625
[2024-05-07T11:13:33.528Z] 

Note that the test_context.global test hangs.
In the other PR the failure is different (probably a CI issue):

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown

@nsarka (Collaborator, Author) commented May 10, 2024

> @janjust I think we see two different problems. [log quoted above] Note that test_context.global test hangs. In another PR it's different (probably CI issue)

Hi Sergey, in the other PR (#958), @B-a-S fixed and reran the failing pipeline. He posted a link to the new log (http:https://hpc-master.lab.mtl.com:8080/job/ucc/3282/), which is hanging in the same test_context.global test as this PR.

@nsarka (Collaborator, Author) commented May 10, 2024

@Sergei-Lebedev I ran the hanging gtest manually on my nsarka/active-set-broadcast branch and it passed, which suggests this is a CI issue:

[ RUN      ] test_context.global
[       OK ] test_context.global (2061 ms)

@nsarka force-pushed the nsarka/bcast-active-set branch 6 times, most recently from a7cd88b to b122da3 on May 22, 2024 at 20:00
@nsarka (Collaborator, Author) commented May 23, 2024

@Sergei-Lebedev Hey Sergey, I updated the PR so that the root is interpreted in terms of the ucc_team only when the active_set flag was set on coll_args. This fixed the hang in tl/mlx5: its ucc_service_coll_bcast, used for broadcasting protection-domain details, hardcodes the root to 0 in terms of the node-level subset rather than using that rank's position in the UCC team. tl/sharp and, I think, some other components do this too.

@Sergei-Lebedev merged commit 5ced790 into openucx:master on May 23, 2024. 11 checks passed.
7 participants