
Enable Intel xpu as a new backend of PyTorch-Lightning #16834

Closed
wants to merge 8 commits into from

Conversation

jingxu10

@jingxu10 jingxu10 commented Feb 21, 2023

Enable Intel xpu as a new backend of PyTorch-Lightning.
Contributed by [email protected] and [email protected].

Fixes #<issue_number>

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, follow this checklist:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@github-actions github-actions bot added fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package labels Feb 21, 2023
@jingxu10 jingxu10 force-pushed the jingxu10/ipex branch 2 times, most recently from eae506e to 6dad88e Compare February 23, 2023 20:25
@mergify mergify bot removed the has conflicts label Feb 23, 2023
@github-actions github-actions bot added the app (removed) Generic label for Lightning App package label Feb 25, 2023
@jingxu10 jingxu10 force-pushed the jingxu10/ipex branch 4 times, most recently from 8004d6f to eb2023a Compare February 25, 2023 12:08
@jingxu10 jingxu10 force-pushed the jingxu10/ipex branch 2 times, most recently from ffe2f1b to db300a9 Compare February 25, 2023 23:06
@jingxu10 jingxu10 marked this pull request as draft February 25, 2023 23:28
@jingxu10 jingxu10 changed the title WIP: enable Intel xpu as a new backend of PyTorch-Lightning enable Intel xpu as a new backend of PyTorch-Lightning Feb 25, 2023
weishi-deng and others added 3 commits February 26, 2023 10:57
[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

change version comparison to base version number

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

enable bf16 for xpu, enable ccl

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

switch from deprecated set_default_tensor_type to set_default_dtype

switch to info to print ipex and torch-ccl version number

fix set_default_dtype incorrect argument error
fix crashes when ipex and/or torch-ccl are not installed

fix crashes when ipex and/or torch-ccl are not installed

fix crashes when ipex and/or torch-ccl are not installed
update docs

update docs
@jingxu10 jingxu10 changed the title enable Intel xpu as a new backend of PyTorch-Lightning Enable Intel xpu as a new backend of PyTorch-Lightning Feb 26, 2023
@lantiga
Collaborator

lantiga commented Feb 27, 2023

Hi @jingxu10 thank you for the contribution, lots of work went into that.

We are currently looking into making device integrations something you can add externally, so that the Lightning core stays lean. This PR looks like a very good case to study when enabling that kind of mechanism. This will likely happen after 2.0 lands.

So we can't merge the PR as is, but you're encouraged to reach out in our Slack at pytorch-lightning.slack.com so we can coordinate on how we can make it happen.

@abhilash1910

@lantiga Thanks for the info; what is the tentative plan for the 2.0 release? I do see some refactorings in progress.


@abhilash1910 abhilash1910 left a comment


@abhilash1910 to check DeepSpeed/DDP for workloads.

Any torch.xpu.empty_cache() calls should be placed in try blocks to ensure safety.

return torch.xpu.memory_stats(device)

def teardown(self) -> None:
# clean up memory


    try:
        torch.xpu.empty_cache()
    except AttributeError:
        pass

To ensure safe teardowns

def setup(self, trainer: "pl.Trainer") -> None:
# TODO refactor input from trainer to local_rank @four4fish
# self.set_intel_flags(trainer.local_rank)
# clear cache before training


    try:
        torch.xpu.empty_cache()
    except AttributeError:
        pass

To ensure a safe setup
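The suggestion above can be factored into a small helper; a sketch, where the `safe_empty_cache` name and the duck-typed `backend` argument are illustrative rather than part of the PR:

```python
def safe_empty_cache(backend) -> bool:
    """Call backend.empty_cache() if the backend provides it.

    Partially installed or older XPU runtimes may lack empty_cache;
    swallowing AttributeError keeps setup/teardown from crashing.
    Returns True if the cache was cleared, False if unsupported.
    """
    try:
        backend.empty_cache()
        return True
    except AttributeError:
        return False
```

In the PR's context the backend would be `torch.xpu`; any object without an `empty_cache` attribute is simply skipped.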

abhilash1910 and others added 3 commits March 3, 2023 12:15
- After some sanity checks, multiprocessing does not appear to work well.
- Need to incorporate a separate function for XPU forking.
Multiprocessing issue with XPU backend
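The note about XPU forking touches on multiprocessing start methods: `fork` duplicates the parent's initialized device state into the child, which GPU/XPU runtimes often cannot survive, while `spawn` starts workers in a fresh interpreter. A minimal sketch of selecting the safer start method (the helper name is illustrative):

```python
import multiprocessing as mp


def xpu_spawn_context():
    """Return a 'spawn' multiprocessing context for XPU workers.

    'fork' would copy the parent's device runtime into the child,
    which typically deadlocks or crashes; 'spawn' avoids that by
    re-importing the worker module in a clean process.
    """
    return mp.get_context("spawn")
```

Worker processes created from this context (`ctx.Process(...)`) then start from a clean interpreter instead of a forked copy of the parent.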
@stale stale bot added the won't fix This will not be worked on label Mar 18, 2023
@Lightning-AI Lightning-AI deleted a comment from stale bot Mar 18, 2023
@Borda
Member

Borda commented Mar 18, 2023

Will be ported to separate repo, stay tuned! 🐿️

@stale stale bot removed the won't fix This will not be worked on label Mar 18, 2023
@stale

stale bot commented Apr 2, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Apr 2, 2023
@stale

stale bot commented Apr 13, 2023

This pull request is going to be closed. Please feel free to reopen it or create a new one based on top of the 'master' branch.

@stale stale bot closed this Apr 13, 2023
@Borda
Member

Borda commented Apr 13, 2023

Will be implemented as a standalone extension.

Labels
app (removed) Generic label for Lightning App package fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package won't fix This will not be worked on
5 participants