
Fix: change the dtype of self._kernel when input args have a different dtype #3034

Merged: 10 commits, merged into pytorch:master on Aug 22, 2023
Conversation

@MarcBresson (Contributor) commented Aug 21, 2023

When the SSIM metric is given float16 (Half) or float64 (Double) inputs, it fails because self._kernel is created with the default dtype.
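A minimal repro of that failure, as a sketch (shapes and the exact error text are illustrative):

import torch
from ignite.metrics import SSIM

# float64 inputs against the kernel created with the float32 default dtype
y_pred = torch.rand(4, 3, 28, 28, dtype=torch.float64)
y = y_pred * 0.8

ssim = SSIM(data_range=1.0)
ssim.update((y_pred, y))  # before this fix: RuntimeError from F.conv2d (dtype mismatch)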

There are two ways around that:

  1. set the default dtype to Half or Double: torch.set_default_dtype(torch.float16).
  2. change self._kernel to take the dtype of the input tensor.

This pull request puts the second option into action. The result is transparent to the user, who can now use float16 or float64 on GPU directly.
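A minimal sketch of what the second option looks like around the convolution in SSIM.update (illustrative; the merged code may differ slightly):

# Sketch: align the cached Gaussian kernel's dtype with the input so the
# subsequent F.conv2d call no longer mismatches on float16/float64 inputs.
if y_pred.dtype != self._kernel.dtype:
    self._kernel = self._kernel.to(dtype=y_pred.dtype)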

A new test has been added to verify the behaviour introduced by this change.

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

…t dtype

If both dtypes do not match the torch default dtype, F.conv2d will fail because self._kernel is of the default dtype.

This solves the issue by switching the dtype of self._kernel.
@github-actions bot added the label module: metrics on Aug 21, 2023
Review thread on ignite/metrics/ssim.py (outdated; resolved)
It may be clearer this way. But _kernel should always have the PyTorch default dtype, so the two solutions are equivalent.

Co-authored-by: vfdev <[email protected]>
@vfdev-5 (Collaborator) commented Aug 21, 2023

@MarcBresson thanks for the fix, can you also write a test here: https://github.com/pytorch/ignite/blob/master/tests/ignite/metrics/test_ssim.py? Let's create a separate test function like test_ssim_variable_batchsize.

@MarcBresson (Contributor, Author):

Nice idea to write the test! I just discovered that a few operations are not available in float16 on CPU.

Working on a fix.

@MarcBresson (Contributor, Author) commented Aug 22, 2023

Just need a little help: how do I exclude a test from being run on CPU? I don't see any exclusion pattern in run_cpu_tests.sh.

And by the way, how is it possible that some tests failed when I haven't touched anything in this remote repo yet? ☺️

@vfdev-5 (Collaborator) commented Aug 22, 2023

> Just need a little help: how do I exclude a test from being run on CPU? I don't see any exclusion pattern in run_cpu_tests.sh.

I think the simplest way to do that is to just call pytest directly:

pytest -vvv tests/ignite/metrics/test_ssim.py -k <pattern>

> And by the way, how is it possible that some tests failed when I haven't touched anything in this remote repo yet?

GH CI is rather flaky so some jobs are failing due to issues with distributed configs for the tests. Do not worry about that.

Some functions used in SSIM.update do not exist for the half dtype on CPU, so CPU tests are excluded.
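For reference, a common way to express that exclusion in pytest (a sketch: the marker condition, test name, and body are illustrative, not necessarily what the PR uses):

import pytest
import torch

from ignite.metrics import SSIM


# Sketch: run the half-precision test only where float16 ops are available,
# i.e. skip it on CPU-only machines. Name and body are hypothetical.
@pytest.mark.skipif(not torch.cuda.is_available(), reason="float16 ops used by SSIM are unavailable on CPU")
def test_ssim_float16_cuda():
    y_pred = torch.rand(4, 3, 28, 28, device="cuda", dtype=torch.float16)
    y = y_pred * 0.8

    ssim = SSIM(data_range=1.0, device="cuda")
    ssim.update((y_pred, y))
    assert -1.0 <= ssim.compute() <= 1.0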
Review thread on ignite/metrics/ssim.py (outdated; resolved)
@MarcBresson (Contributor, Author):

Here is a little analysis of the difference between ignite and scikit-image:

[Histograms: SSIM difference between ignite and scikit-image, per dtype and device]

code sample:

import torch
from ignite.metrics import SSIM

import tqdm

from skimage.metrics import structural_similarity as ski_ssim
import numpy as np

from matplotlib import pyplot as plt


def test_ssim(available_device, shape, dtype):
    y_pred = torch.rand(shape, device=available_device, dtype=dtype)
    y = y_pred * 0.8

    sigma = 1.5
    data_range = 1.0
    ssim = SSIM(data_range=data_range, sigma=sigma, device=available_device)
    ssim.update((y_pred, y))
    ignite_ssim = ssim.compute()

    if y_pred.dtype == torch.bfloat16:
        y_pred = y_pred.to(dtype=torch.float16)

    skimg_pred = y_pred.cpu().numpy()
    skimg_y = skimg_pred * 0.8
    skimg_ssim = ski_ssim(
        skimg_pred,
        skimg_y,
        win_size=y_pred.shape[0]-1,
        sigma=sigma,
        channel_axis=1,
        data_range=data_range,
    )

    difference = ignite_ssim - skimg_ssim
    return difference

diff = {}

for device in [torch.device("cpu"), torch.device("cuda")]:
    available_dtypes = [torch.float32, torch.float64]
    if device == torch.device("cuda"):
        available_dtypes.extend((torch.bfloat16, torch.float16))

    for dtype in available_dtypes:

        diff[str(dtype) + " on " + str(device)] = []

        for _ in tqdm.tqdm(range(1000)):
            difference_ssim = test_ssim(device, (12, 3, 28, 28), dtype=dtype)
            diff[str(dtype) + " on " + str(device)].append(difference_ssim)

for key, values in diff.items():
    # they are too far apart, so we exclude them
    if key in ["torch.float16 on cuda", "torch.bfloat16 on cuda"]:
        continue

    plt.hist(values, bins=10, label=key)

axes = plt.gca()
axes.ticklabel_format(style="scientific", scilimits=(0, 3))

plt.title("SSIM difference between ignite and scikit-image")

plt.xlabel("difference value")
plt.ylabel("frequency")

plt.legend()
plt.show()

@vfdev-5 (Collaborator) commented Aug 22, 2023

@MarcBresson can you run the following to fix the formatting issue:

bash ./tests/run_code_style.sh install
bash ./tests/run_code_style.sh fmt

@vfdev-5 (Collaborator) left a review:

Looks good to me, thanks for the PR @MarcBresson!

I'll merge it once the required CI jobs are done.
By the way, while checking your PR I saw in the actual code that we are doing rather strange things with the device of self._kernel. Would you like to tackle this problem?
Basically, self._kernel should respect self._device, but there may be a trade-off in running all the padding and conv ops on CUDA before moving the result to self._device. We may want to measure that and make an appropriate decision.

@MarcBresson (Contributor, Author):

@vfdev-5 you are welcome! Happy to help.

I think that self._kernel is already on the right device. It originates either from https://github.com/MarcBresson/ignite/blob/master/ignite/metrics/ssim.py#L127 or https://github.com/MarcBresson/ignite/blob/master/ignite/metrics/ssim.py#L116, and both create it directly on the device.

The .to(device) call (https://github.com/MarcBresson/ignite/blob/master/ignite/metrics/ssim.py#L162) originates from the initial SSIM commit (MarcBresson@4a7428c). I think it was simply forgotten and not removed in MarcBresson@f86d42e.

@vfdev-5 merged commit b687f1e into pytorch:master on Aug 22, 2023
17 of 20 checks passed
@vfdev-5 (Collaborator) commented Aug 22, 2023

Well, this code:

        if len(self._kernel.shape) < 4:
            self._kernel = self._kernel.expand(channel, 1, -1, -1).to(device=y_pred.device)

seems a bit weird: only a kernel with ndim < 4 will be moved to y_pred.device, and do we really need to change the device of self._kernel from self._device to y_pred.device...
I think we should handle the device separately from the shape expand...
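A sketch of that separation (hypothetical, not the merged code), where device and dtype are aligned on every update and the expand only reshapes:

# Hypothetical refactor: always align the kernel with the inputs,
# and expand the kernel shape only when it is still 2D.
if self._kernel.device != y_pred.device or self._kernel.dtype != y_pred.dtype:
    self._kernel = self._kernel.to(device=y_pred.device, dtype=y_pred.dtype)
if self._kernel.ndim < 4:
    self._kernel = self._kernel.expand(channel, 1, -1, -1)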

@MarcBresson (Contributor, Author):

Here is a new analysis on dtype. These plots compare the precision of float16, bfloat16 and float64 against float32. The error is signed rather than absolute, so we can see whether a dtype overshoots compared to float32.

[Histograms: SSIM difference between float32 and each of float16, bfloat16 and float64]

import tqdm
import torch
from ignite.metrics import SSIM
from matplotlib import pyplot as plt


def test_ssim_dtype_error(available_device, shape, dtype):
    y_pred_float32 = torch.rand(shape, device=available_device, dtype=torch.float32)
    y_float32 = y_pred_float32 * 0.8

    y_pred_typed = torch.clone(y_pred_float32)
    y_pred_typed = y_pred_typed.to(dtype=dtype)
    y_typed = y_pred_typed * 0.8

    sigma = 1.5
    data_range = 1.0
    ssim = SSIM(data_range=data_range, sigma=sigma, device=available_device)
    ssim.update((y_pred_float32, y_float32))
    ssim_float32 = ssim.compute()

    ssim_ty = SSIM(data_range=data_range, sigma=sigma, device=available_device)
    ssim_ty.update((y_pred_typed, y_typed))
    ssim_typed = ssim_ty.compute()

    difference = ssim_float32 - ssim_typed
    return difference

diff = {}

for device in [torch.device("cuda")]:
    available_dtypes = [torch.float32, torch.float64]
    if device == torch.device("cuda"):
        available_dtypes.extend((torch.bfloat16, torch.float16))

    for dtype in available_dtypes:

        diff[str(dtype) + " on " + str(device)] = []

        for _ in tqdm.tqdm(range(1000)):
            difference_ssim = test_ssim_dtype_error(device, (12, 3, 28, 28), dtype=dtype)
            diff[str(dtype) + " on " + str(device)].append(difference_ssim)

for dtype in diff:
    plt.hist(diff[dtype], bins=50)
    axes = plt.gca()
    axes.ticklabel_format(axis='x', style="scientific", scilimits=(0, 0))

    plt.title(f"SSIM difference between float32 and a {dtype}")

    plt.xlabel("difference value")
    plt.ylabel("frequency")

    plt.show()
