
posts/intel-pytorch-extension-tutorial/native-ubuntu/ #38

Open
utterances-bot opened this issue Jul 3, 2023 · 28 comments
Comments

@utterances-bot

Christian Mills - Getting Started with Intel’s PyTorch Extension for Arc GPUs on Ubuntu

This tutorial provides a step-by-step guide to setting up Intel’s PyTorch extension on Ubuntu to train models with Arc GPUs.

https://christianjmills.com/posts/intel-pytorch-extension-tutorial/native-ubuntu/


The Mamba part doesn't work for me. It installs Mamba, but with a warning:

please verify that your PYTHONPATH only points to
directories of packages that are compatible with the Python interpreter
in Mambaforge: /home/daaz/mambaforge

and afterward the shell does not recognize the mamba command.

Owner

cj-mills commented Jul 4, 2023

Hi @Danyal-sab,

Thanks for pointing that out. I forgot to include the line after running the Mambaforge install script to initialize Mamba. I updated the post with the missing line.

You can run the following commands to initialize Mamba and relaunch the current bash shell to apply the changes:

~/mambaforge/bin/mamba init
bash

I saw you posted another comment earlier, but it got deleted before I could respond. Did you resolve your previous issue?

@Danyal-sab
Copy link

Danyal-sab commented Jul 4, 2023

Hi @cj-mills,

Thanks for your speedy reply and for your help.

Yes, I posted about a minor issue with executing the commands in the "Install oneAPI Base Toolkit" section, where the second line gave a typing error. I resolved it by simply removing the backslashes (\) at the ends of the lines, after which it ran correctly; that's why I deleted the post.
A similar issue happened in the "Apply OneAPI Patch" section, which gave an error on the 6th line. Again, I just removed the backslashes and it went through.

Thanks again for your great help


Hi @cj-mills,

Is there any tool for monitoring Arc GPU memory usage (like nvidia-smi for NVIDIA)?

I checked a few tools, such as intel_gpu_top, Intel VTune, and Intel GPA, but they either weren't compatible with Ubuntu 23.04 or don't offer GPU memory monitoring.

Are there any other tools we could use?

@cj-mills
Owner

cj-mills commented Jul 7, 2023

Hi @Danyal-sab,

The only one I know of is the sysmon tool included with Intel's Profiling Tools Interfaces for GPU (PTI for GPU) GitHub project.

Unfortunately, you would need to compile the tool from source.

Also, it does not seem fully functional on my system, as it does not show any running processes:

$ sudo sysmon
=====================================================================================
GPU 0: Intel(R) Arc(TM) A770 Graphics    PCI Bus: 0000:03:00.0
Vendor: Intel(R) Corporation    Driver Version: 1.3.26241    Subdevices: 0
EU Count: 512    Threads Per EU: 8    EU SIMD Width: 8    Total Memory(MB): 15473.6
Core Frequency(MHz): 2000.0 of 2400.0    Core Temperature(C): unknown
=====================================================================================
Running Processes: unknown
=====================================================================================
GPU 1: Intel(R) UHD Graphics 750    PCI Bus: 0000:00:02.0
Vendor: Intel(R) Corporation    Driver Version: 1.3.26241    Subdevices: 0
EU Count: 32    Threads Per EU: 7    EU SIMD Width: 8    Total Memory(MB): 25360.9
Core Frequency(MHz): 350.0 of 1300.0    Core Temperature(C): unknown
=====================================================================================
Running Processes: unknown
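For monitoring from inside a training script, Intel's PyTorch extension mirrors part of the torch.cuda memory API under torch.xpu. Here is a hedged sketch: it assumes intel_extension_for_pytorch is installed (which registers the torch.xpu namespace) and an XPU device is visible; on any other machine it simply reports "skipped".

```python
# Sketch: query Arc GPU memory use from Python via IPEX's torch.xpu API.
# Assumption: intel_extension_for_pytorch is installed; it registers the
# torch.xpu namespace, which mirrors torch.cuda's memory helpers.
status = "skipped"
try:
    import torch
    import intel_extension_for_pytorch as ipex  # noqa: F401 (registers torch.xpu)

    if hasattr(torch, "xpu") and torch.xpu.is_available():
        x = torch.ones(1024, 1024, device="xpu")  # allocate something to measure
        allocated_mb = torch.xpu.memory_allocated() / 1024**2
        print(f"allocated: {allocated_mb:.1f} MB")
        status = "ok"
except ImportError:
    pass
print(status)
```

This won't replace a system-wide monitor like nvidia-smi, but it can report per-process allocations during training.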


ricable commented Jul 9, 2023

Great tutorial that helped me a lot in setting up an environment for ML/DL with an Arc GPU. It really saved my life, and I hope to read more excellent material like this. Thanks again, much appreciated.


I really appreciate your work! I am trying to set the GPU up for the scikit-learn monkey-patch (https://github.com/intel/scikit-learn-intelex), but I am struggling to get beyond CPU acceleration. I have no idea how to 1. list the device and 2. point to that device. Do you have any experience with that?

@cj-mills
Owner

Hi @psmgeelen,

I have not tried Intel's Scikit-learn extension, so I don't know if it even supports Arc GPUs. The DPC++ compiler runtime does support Arc GPUs, meaning it should work in theory.

Have you tried the example code for performing computations on the GPU in the extension's documentation?

Based on the example code, the Arc GPU should be the "gpu:0" device, assuming it is the only discrete GPU installed on the system. The integrated graphics should be the "gpu:1" device.
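To make the device selection above concrete, here is a hedged sketch based on the extension's documented patch_sklearn and config_context(target_offload=...) API. The "gpu:0" device string follows the assumption above (Arc as the only discrete GPU); the script reports "skipped" when the packages or device are unavailable.

```python
# Sketch: offloading scikit-learn-intelex computations to an Arc GPU.
# Assumptions: the scikit-learn-intelex and dpctl packages are installed,
# and "gpu:0" is the discrete Arc card (integrated graphics would be "gpu:1").
status = "skipped"
try:
    from sklearnex import patch_sklearn, config_context
    patch_sklearn()  # monkey-patch scikit-learn with Intel-optimized versions

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 8).astype(np.float32)
    with config_context(target_offload="gpu:0"):
        labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
    status = "ok"
except Exception:
    # ImportError when the packages are missing, or a runtime error
    # when no compatible GPU device is found.
    status = "skipped"
print(status)
```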


Danyal-sab commented Jul 12, 2023

Hi @cj-mills,
I have an issue that may be slightly unrelated to this topic.
When I first started working with my A770, it used to crash the system after roughly 10 epochs of a training session under medium load. (I don't know exactly how heavy the load was, since I couldn't monitor the GPU at all; however, on my old NVIDIA 1070 the same code used less than 2 GB of GPU RAM.)

Then I started running heavier code, close to the A770's limit, and the crashes stopped for the day.
After two more days, it stopped crashing entirely.
Now it still occasionally crashes the computer while a deep-learning training session is running.
Do you have any idea what the cause could be and why it isn't consistent?
I looked online, and quite a few people have experienced crashes with this card, but I didn't find anyone reporting only occasional crashes.
The card also makes a noise that changes from time to time. I suppose it is coil whine; is it safe?

@psmgeelen

@cj-mills, I have, and it's not finding the device for whatever reason. I created a ticket at intelex here: intel/scikit-learn-intelex#1357 (comment)


Hi @cj-mills,
I see that the tutorial has been updated to use the new extension. I saw in your fastai forum thread that you concluded the extension has a bug. Is it safe to install now?

Owner

@Danyal-sab
It depends on what you need it for. The code for my image classification tutorial works fine, but the training code for my YOLOX tutorial does not reach usable performance with the Intel extension on the Arc GPU.

I have not tested the YOLOX training code with the previous extension because the code requires torchvision 0.15+ (which requires PyTorch 2.0+).

I updated the tutorial because everything I tested that worked with the previous extension version still works with the new one, and the current Ubuntu LTS now ships with a kernel that supports Arc GPUs.


@cj-mills,
Thanks for updating the tutorial.
Just a minor change is needed in the "Update PyTorch Imports" section:
In the sample code, one of the import lines (from torcheval.tools import get_module_summary) should be replaced with this:

from torchtnt.utils import get_module_summary

Thanks again for your great help
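As a quick sanity check of the corrected import, a minimal sketch (hedged: it assumes torch and torchtnt are installed, and reports "skipped" otherwise):

```python
# Sketch: verify that get_module_summary comes from torchtnt.utils,
# not torcheval.tools. The tiny model below is illustrative only.
status = "skipped"
try:
    import torch.nn as nn
    from torchtnt.utils import get_module_summary  # corrected import

    model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
    summary = get_module_summary(model)  # returns a ModuleSummary object
    status = "ok"
except ImportError:
    status = "skipped"
print(status)
```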

Owner

@Danyal-sab
Thanks for catching that!

@Danyal-sab

@cj-mills,
After upgrading to the newer version, it worked well. Yesterday I also updated the GPU drivers (Ubuntu offers software updates when they become available). After that, the performance dropped significantly: the same code now takes almost three times as long to run as it did before the driver update.
Have you tested that?

Owner

cj-mills commented Oct 4, 2023

@Danyal-sab
I don't run the Arc GPU as my daily driver, so I have not used it for nearly a month. I was not planning to install it back into my desktop until Intel's PyTorch extension gets a new update.

It sounds like a similar performance difference to not having the IPEX_XPU_ONEDNN_LAYOUT environment variable set. I don't know if that's related to your issue, but maybe try setting that environment variable to 0 and 1 to see if it impacts performance.
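A hedged sketch of that comparison (train.py is a placeholder for whatever training script you run, and the variable's effect may differ between extension versions):

```shell
# Compare performance with the oneDNN blocked-layout path on and off.
# "train.py" is a placeholder for your own training script.
IPEX_XPU_ONEDNN_LAYOUT=1 python train.py
IPEX_XPU_ONEDNN_LAYOUT=0 python train.py
```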

It might also just be a bad driver update. Can you roll back to the previous driver version?


@cj-mills,
Thanks again for your support.
That time, I went back to the previous version; however, since there has recently been an upgrade to the extension, I decided to try it again.
After a quick web search, it seems to me that this extension still does not fully support Python 3.11.
Could that be the issue? What do you think?

@cj-mills
Owner

cj-mills commented Jan 6, 2024

@Danyal-sab,
I've been meaning to go in-depth with the most recent release of the extension and Intel's BigDL-LLM library, but I have not had time yet.

I briefly swapped in the Arc card a couple of weeks ago, and the training notebooks that worked in the previous versions no longer produced usable models. It was the same issue I described here, but it occurred even with the baseline image classification notebook.

I think I tried with Python 3.9, 3.10, and 3.11, and I had the same issue with all of them. I did not have time to investigate, so I held off making a post about it.


Danyal-sab commented Jan 6, 2024

@cj-mills,
Thanks for your response.
Are you going to try the newest version sometime soon?

@cj-mills
Owner

cj-mills commented Jan 6, 2024

@Danyal-sab,
That was with version 2.1.10+xpu. I don't currently know if the source of the issue is the extension or the oneAPI Base Toolkit (or both).


@cj-mills,
You are right: 2.1.10+xpu doesn't seem to be a stable version yet, as the repository recommends 2.0.110+xpu for installation.
https://github.com/intel/intel-extension-for-pytorch

@Danyal-sab

@cj-mills,
Alright, then.
I am still using version 1.13.0a+xpu.
Do you think it makes sense for me to move to 2.0.110+xpu? And if so, should I use Python 3.10 or 3.11?


Hi @cj-mills and everyone,
First of all, thanks for the doc.
I was able to install everything as described, but with the latest versions (of the oneAPI Base Toolkit and the Python packages), and when I run the notebook it is as fast as your example (around 12 minutes). However, the accuracy plateaued around 0.18 and did not improve even after 3 epochs.
I also tried changing the following line:

model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

to

model, optimizer = ipex.optimize(model, optimizer=optimizer)

but that did not make things better.
Do you have any idea what could be the problem?

Thanks in advance!

@cj-mills
Owner

Hi @contryboy,
Your experience matches my brief testing of the v2.1.10+xpu release. I did not have time to investigate the issue further, so I did not make a post about it. It's the same issue I described with the 2.0.110+xpu release for my YOLOX training notebook. However, with v2.1.10+xpu, it occurred even with the baseline image classification notebook.

I have not had a chance to investigate the source of the issue, but I plan to give it another shot when the next xpu release comes out.


Hi @cj-mills,
Thanks for the quick reply. I also tried the sample code in the official docs [1], and it has the same problem, so I created an issue [2] in their GitHub project to see if there are any findings.

[1] https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/examples.html#float32
[2] intel/intel-extension-for-pytorch#537

@cj-mills
Owner

@contryboy Nice! It would certainly be more convenient for me if they resolved the issue for the next release.


Hi @cj-mills,
Are you planning to update the tutorial with the latest version?

@cj-mills
Owner

@Danyal-sab, I will when I have enough time to swap my Arc card into my desktop and test the latest version. I've been too busy with work projects lately to swap out my NVIDIA card.
