Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get struck when change parallel to cone in single-GPU #560

Open
GreameLee opened this issue Jun 20, 2024 · 27 comments
Open

Get struck when change parallel to cone in single-GPU #560

GreameLee opened this issue Jun 20, 2024 · 27 comments

Comments

@GreameLee
Copy link

GreameLee commented Jun 20, 2024

I can run iterative algorithms such as ossart 2D parallel beam CT reconstruction smooth on ubuntu. But when it comes for cone-beam( for 2d it is fanbeam) The running just gets stuck and prints nothing, even the estimation time.
This is the code:

import tigre
import numpy as np
from tigre.utilities import sample_loader
from tigre.utilities import CTnoise
import tigre.algorithms as algs
from scipy.io import loadmat
import matplotlib.pyplot as plt
import torch

geo = tigre.geometry()
# VARIABLE                                   DESCRIPTION                    UNITS
# -------------------------------------------------------------------------------------
# Distances
geo.DSD = 1536  # Distance Source Detector      (mm)
geo.DSO = 1000 # Distance Source Origin        (mm)
# Image parameters
geo.nVoxel = np.array([1, 256, 256])  # number of voxels              (vx)
geo.sVoxel = np.array([1, 256, 256])  # total size of the image       (mm)
geo.dVoxel = geo.sVoxel / geo.nVoxel  # size of each voxel            (mm)
print(geo.dVoxel)
# Detector parameters
geo.nDetector = np.array([1, 1512])  # number of pixels              (px)
geo.dDetector = np.array([geo.dVoxel[0], 0.8])  # size of each pixel            (mm)
# geo.dDetector = np.array([1, 0.1])
geo.sDetector = geo.nDetector * geo.dDetector  # total size of the detector    (mm)
# Offsets
geo.offOrigin = np.array([0, 0, 0])  # Offset of image from origin   (mm)
geo.offDetector = np.array([0, 0])  # Offset of Detector            (mm)
# MAKE SURE THAT THE DETECTOR PIXELS SIZE IN V IS THE SAME AS THE IMAGE!

geo.mode = "cone"

#%% Define angles of projection and load phatom image

angles = np.linspace(0, 2 * np.pi, 180)


img = sample_loader.load_head_phantom(geo.nVoxel)
limited_angle = np.linspace(0, 2*np.pi, 90)
limited_projection = tigre.Ax(img, geo, limited_angle)
tigre.plotSinogram(limited_projection, 0)

niter=500
print("start ossart_tv")
rec_img_ossart_tv = algs.ossart_tv(limited_projection, geo, limited_angle, niter)
plt.imshow(rec_img_ossart_tv.squeeze(), cmap='gray')
plt.savefig('ossart_tv_90.png', bbox_inches='tight', pad_inches=0)
plt.axis('off')
  • python version: 3.9
  • OS:Ubuntu 18.04.6 LTS
  • CUDA version:12.3
@GreameLee GreameLee changed the title struck when change parallel to cone Get struck when change parallel to cone Jun 20, 2024
@GreameLee GreameLee changed the title Get struck when change parallel to cone Get struck when change parallel to cone in single-GPU Jun 20, 2024
@AnderBiguri
Copy link
Member

That is strange! if you do it for 3D, does it work well?
Can you try to figure out where does it get stuck?

Probably related to #552

@GreameLee
Copy link
Author

GreameLee commented Jun 20, 2024

I try to make a breakpoint and it got stuck in this column:

194        self.set_w()

of iterative_recon_alg.py

@AnderBiguri
Copy link
Member

@GreameLee and inside there, do you know where?

@GreameLee
Copy link
Author

GreameLee commented Jun 20, 2024

@AnderBiguri yes, I step inside and it is in

222        W = Ax(
223            np.ones(geox.nVoxel, dtype=np.float32), geox, self.angles, "Siddon", gpuids=self.gpuids
224        )

@AnderBiguri
Copy link
Member

I see! I will change this function soon. For now, if you give this function geo instead of geox, I believe it should work.

@GreameLee
Copy link
Author

GreameLee commented Jun 21, 2024

I see! I will change this function soon. For now, if you give this function geo instead of geox, I believe it should work.

Do you mean that replace all geox with geo in iterative_recon_alg.py? Cause there is an assignment about geox before this line:

216     geo = copy.deepcopy(self.geo)

I think geox is geo already

@AnderBiguri
Copy link
Member

Yes, in particular in the set_w function.

@AnderBiguri
Copy link
Member

AnderBiguri commented Jun 21, 2024

I meant:

def set_w(self):
        """
        Calculates value of W if this is not given.
        :return: None
        """
        geo=self.geo
        W = Ax(
            np.ones(geo.nVoxel, dtype=np.float32), geo, self.angles, "Siddon", gpuids=self.gpuids
        )
        W[W <= min(self.geo.dVoxel / 2)] = np.inf
        W = 1.0 / W
        setattr(self, "W", W)

@GreameLee
Copy link
Author

Yes, I tried as

def set_w(self):
        """
        Calculates value of W if this is not given.
        :return: None
        """
        geo=self.geo
        W = Ax(
            np.ones(geo.nVoxel, dtype=np.float32), geo, self.angles, "Siddon", gpuids=self.gpuids
        )
        W[W <= min(self.geo.dVoxel / 2)] = np.inf
        W = 1.0 / W
        setattr(self, "W", W)

it but still got struck

@AnderBiguri
Copy link
Member

Hum, confusing. I don't really know why it happens then.... I will investigate, but its hard, as I can not make it happen in any of my 5 computers.

@GreameLee
Copy link
Author

GreameLee commented Jun 21, 2024 via email

@GreameLee
Copy link
Author

I go deep inside that step and go into the Ax function in utilities, and find it got struck in gpu.py with this line:

31   def __len__(self):
32           return len(self.devices)

@AnderBiguri
Copy link
Member

Hum that is strange! What if you change it to return your_number_of_gpus , however many those are?

@GreameLee
Copy link
Author

GreameLee commented Jun 22, 2024

yes, I edit it as :

def __len__(self):
        return 1

But it still get struck in following line of Ax.py:

35  return _Ax_ext(img, geox, geox.angles, projection_type, geox.mode, gpuids=gpuids)

can I ask what is the "_Ax_ext" function? I see that

from _Ax import _Ax_ext

But I did not see any function named _Ax_ext
The strange thing is that before ossarttv, there is tigre.Ax using Ax function and that line goes smooth.

@AnderBiguri
Copy link
Member

Its _Ax_ext that fails indeed, its the CUDA code for the forward projection. I still not really understand why it fails sometimes only in some machines. If you are using exactly the same geometry any algorithm should work the same. Its really confusing.

@GreameLee
Copy link
Author

Its _Ax_ext that fails indeed, its the CUDA code for the forward projection. I still not really understand why it fails sometimes only in some machines. If you are using exactly the same geometry any algorithm should work the same. Its really confusing.

I think this may happen on supercomputers based on Ubuntu. I tried it on another supercomputer but got struck, too. It is kind of serious issue

@AnderBiguri
Copy link
Member

I have used ~15 Ubuntu based machines and I can't reproduce :(
It is indeed a serious issue.

I suggest the following: replace set_w() such that it returns geo.sVoxel(1). Its not perfect, but it should make the algorithms work.

@GreameLee
Copy link
Author

I set:

def set_w(self):
    self.W = self.geo.sVoxel[1]
    return self.W

But this bug happen when it comes to "self.W[ang_index]" of this part in iterative_recon_alg.py:

 ang_index = self.angle_index[iteration].astype(np.int32)

        self.res += (
            self.lmbda
            * 1.0
            / self.V[iteration]
            * Atb(
                self.W[ang_index]

The error shows like:

IndexError: invalid index to scalar variable.

@AnderBiguri
Copy link
Member

Ah yes, sorry I made a mistake. its more or less that but not exactly what I said. I don't have time now to fix it.

If you want to fix it yourself:

  • Understand what W is in size.
  • Replace it by something of the same size but of value 1/geo.svoxel[1]
    I'll try to come back and fix this at some point. Apologies, end of the year gets a bit busy .

@GreameLee
Copy link
Author

GreameLee commented Jun 26, 2024

hello, Ander, @AnderBiguri I try to use Windows to retry this, and interesting, it runs smooth for the ossart-tv but when I try to copy the output from CPU to GPU, this error will happen:

RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

In the same codes, for parallel-beam CT, it does not have this error.
I guess when using Ax for cone-beam CT. The _Ax_ext makes the data have wrong leakage so the data can not transfer from CPU to GPU

@AnderBiguri
Copy link
Member

AnderBiguri commented Jun 26, 2024 via email

@GreameLee
Copy link
Author

GreameLee commented Jun 26, 2024 via email

@AnderBiguri
Copy link
Member

AnderBiguri commented Jun 26, 2024

TIGRE variables you have access too are always in the CPU.

There have been problems in the past relating TIGRE and pytorch (e.g. #509) which we will look into solving

@GreameLee
Copy link
Author

GreameLee commented Jun 30, 2024

I can run cone-beam ossart-tv by running the example code in UBUNTU now.
But when I insert this process in my diffusion model based on Pytorch and tensor in GPU. This reconstruction of ossart-tv will get struck and even I want to use "ctrl+C" to stop it can not be killed. I think I get struck due to the same reason as WINDOWS.
And interestingly, all stuff goes smoothly in parallel-beam CT. And I guess this error is same as #509.
The _Ax_ext works wrong for cone-beam CT reconstruction. This issue may be because there is some problem when transferring data

@AnderBiguri
Copy link
Member

AnderBiguri commented Jun 30, 2024

@GreameLee That is actually quite useful information yes.

Just to clarify, TIGRE+pytorch is not suppored yet, as I have not investigated in detail how to make it work. Likely Ubuntu/windows has nothing to do with this problem.

But if it works for parallel and not cone, this is something I can dig into.

@GreameLee
Copy link
Author

Look forward to your exploring

@AnderBiguri
Copy link
Member

Related: #563

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants