Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DIPU] mock_cuda设为true后,torch.device导致guard检查失败造成重新编译的问题。 #607

Open
Reinerzhou opened this issue Jan 8, 2024 · 0 comments

Comments

@Reinerzhou
Copy link
Member

背景

编译模式下,在燧原和华为机器上运行llama_finetune。

问题描述

在mock_cuda为true的情况下,若在模型运行前调用torch.set_default_device,分析timeline发现每次都会触发dynamo的compile,查看日志调查发现:

torch._dynamo.convert_frame.__recompiles: [DEBUG] ('Recompiling function _is_fp16_bf16_tensor in /home/cse/zhousl/accelerate/src/accelerate/utils/operations.py', "triggered by the following guard failure: utils_device.CURRENT_DEVICE == device(type='cuda', index=0)")

是utils_device.CURRENT_DEVICE == device(type='cuda', index=0) 检查失败导致每次都会重新编译。

分析发现:

  • 调用torch.set_default_device时,会触发torch中的这段代码对CURRENT_DEVICE进行设置,这时候用到的torch.device为torch原本的api。
    https://github.com/DeepLink-org/pytorch/blob/eb31e39bb07f00cf1c917a3bc83867981b2c04cd/torch/utils/_device.py#L57C1-L65C35

  • 而在运行过程中对utils_device.CURRENT_DEVICE == device(type='cuda', index=0) 做检查时,调用的是dipu mock后的torch.device api,mock前后的torch.device调用结果并不一致。

    class _DIPUDevice(metaclass=_MetaDeviceType):
    @staticmethod
    def __replacedipu(arg):
    if (__dipu__ in arg):
    arg = arg.replace(__dipu__, __dipu_device_type__)
    if (mockcuda and "cuda" in arg):
    arg = arg.replace("cuda", __dipu_device_type__)
    return arg
    def __new__(cls, *args, **kwargs):
    if len(args) == 1 and isinstance(args[0], int) and mockcuda:
    # modify default int device type only when "mock cuda".
    dev_name = __dipu_device_type__ + ":" + str(args[0])
    _device = _MetaDeviceType._torch_device(dev_name)
    return _device
    # handle device as str
    if len(args) >= 1 and isinstance(args[0], str):
    argList = list(args)
    argList[0] = cls.__replacedipu(args[0])
    args = tuple(argList)
    # handle parameter type: str, not support int type but str and device
    deviceValue = kwargs.get("type", None)
    if deviceValue != None and isinstance(deviceValue, str):
    kwargs["type"] = cls.__replacedipu(deviceValue)
    _device = _MetaDeviceType._torch_device(*args, **kwargs)
    return _device
    # always patch: device class is immutable, cannot directly patch __new__ method on python layer.
    torch.device = _DIPUDevice

  • 简单来说就是torch_dipu中缺少torch中从torch.set_default_device到CURRENT_DEVICE 的完整逻辑链路,且torch.device mock前后在做guards检查时并不一致。

  • 可通过如下代码复现:
    'import torch
    t1 = torch.device(type='cuda', index=0)
    import torch_dipu
    t2 = torch.device(type='cuda', index=0)
    print(t1 == t2) # 结果为False'

  • 目前可通过不调用torch.set_default_device,或者在import torch之前先import torch_dipu(这样可以在设置CURRENT_DEVICE前就让dipu mock掉torch.device api)来避免guards检查失败导致重编这个问题。

NeosZhang pushed a commit to DeepLink-org/deeplink.framework.dev that referenced this issue Jan 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant