Circuit Aligner with CUDA
- Rotation-invariant template matching considering GPU architecture
- Providing a web frontend with Django
- Available at 1/10 the price of traditional products
Accelerating rotation-invariant template matching with CUDA parallel algorithms while considering GPU architecture
- Red crosshairs: Characteristic parts of aligned circuits saved in the aligner
- Blue crosshairs: Characteristic parts of the misaligned circuit that entered the aligner
Rotation-invariant template matching is necessary to compute the amount of rotation and translation of a circuit.
The user wants to align the misaligned circuit by rotating and translating it in two dimensions. The aligner already stores the characteristic patches of the aligned circuit. It finds the same patches in the misaligned circuit and calculates the rotation and translation of the circuit from the displacement between corresponding patches.
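As a rough illustration of the last step, the sketch below recovers a 2D rotation and translation from two matched patch centres. It is a minimal CPU example in C++, not the aligner's actual code: the names `Point`, `RigidTransform`, and `estimateTransform` are hypothetical, and the use of exactly two patches is an assumption.

```cpp
#include <cmath>

struct Point { double x, y; };

// Hypothetical result type: rotation in radians plus a translation.
struct RigidTransform { double angle_rad, tx, ty; };

// Estimates the rotation and translation that map the reference (aligned)
// patch centres onto the observed (misaligned) patch centres.
// Assumption: two matched patches are enough to define the transform.
RigidTransform estimateTransform(Point refA, Point refB, Point obsA, Point obsB) {
    // Rotation: change in direction of the segment joining the two patches.
    const double refAngle = std::atan2(refB.y - refA.y, refB.x - refA.x);
    const double obsAngle = std::atan2(obsB.y - obsA.y, obsB.x - obsA.x);
    double angle = obsAngle - refAngle;
    // Wrap the angle into (-pi, pi].
    angle = std::atan2(std::sin(angle), std::cos(angle));

    // Translation: offset that moves the rotated refA onto obsA.
    const double c = std::cos(angle);
    const double s = std::sin(angle);
    const double tx = obsA.x - (c * refA.x - s * refA.y);
    const double ty = obsA.y - (s * refA.x + c * refA.y);
    return {angle, tx, ty};
}
```

With more than two patches, a least-squares fit over all correspondences would be the usual generalisation.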
Template matching performs the following coarse-to-fine search:
- Coarse search: Fast search based on pixel statistics, taking advantage of the GPU architecture.
- Fine search: Precise search using circular sums.
Step 1, the coarse search, is a fast search based on pixel statistics that takes advantage of the GPU architecture. It is performed as follows:
- Find the 1st-4th order moments of a reference patch.
- Construct an ROI around a reference patch in a misaligned circuit image.
- Traverse all pixels in the ROI to calculate the 1st-4th order moments. The region where the moments are calculated is centred on the pixel being traversed and has the same size as the reference patch.
- Compare the reference patch moment vector to the moment vector of all patches in the ROI.
- Extract candidate regions using an adaptive threshold.
The 1st to 4th order moments represent the mean, variance, skewness, and kurtosis of the pixels in the patch. These statistics depend on the pixel values in the patch, not on where each pixel sits, so they are invariant to rotation or translation of the patch content. For example, a patch with a mean of 3 will still have a mean of 3 if the same pixels are moved slightly to the right.
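The invariance can be checked on the CPU with a small sketch. The function below computes the four statistics for a flat list of pixel values. The names `Moments` and `computeMoments` are hypothetical, and the use of standardised skewness and kurtosis (rather than raw moments) is an assumption about the exact definition.

```cpp
#include <cmath>
#include <vector>

struct Moments { double mean, variance, skewness, kurtosis; };

// Computes mean, variance, skewness, and kurtosis of the pixel values.
// Only the values matter, not their positions, so any rearrangement of
// the same pixels (for example a 90-degree rotation) gives the same result.
Moments computeMoments(const std::vector<double>& pixels) {
    if (pixels.empty()) return {0.0, 0.0, 0.0, 0.0};
    const double n = static_cast<double>(pixels.size());

    double sum = 0.0;
    for (double p : pixels) sum += p;
    const double mean = sum / n;

    // Central moments of order 2, 3, and 4.
    double m2 = 0.0, m3 = 0.0, m4 = 0.0;
    for (double p : pixels) {
        const double d = p - mean;
        m2 += d * d;
        m3 += d * d * d;
        m4 += d * d * d * d;
    }
    m2 /= n; m3 /= n; m4 /= n;

    // Guard against flat patches, where the variance is zero.
    const double sd = std::sqrt(m2);
    const double skewness = sd > 0.0 ? m3 / (sd * sd * sd) : 0.0;
    const double kurtosis = m2 > 0.0 ? m4 / (m2 * m2) : 0.0;
    return {mean, m2, skewness, kurtosis};
}
```

Feeding a 3 x 3 patch and its 90-degree rotation into `computeMoments` yields the same four values up to floating-point rounding, which is what lets the coarse search compare patches without knowing the rotation.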
On the GPU, a kernel (function) is launched as a grid. A grid is made up of blocks, and each block is made up of threads. Threads within the same block can exchange data through shared memory. Shared memory is much faster than global memory, so it should be actively used to improve performance.
In Step 1, the aligner's grid consists of (columns/16 x rows/16) blocks, and each block consists of (16 x 16) threads. Each thread computes the moments of one patch.
Within each block, threads share a pixel area of (16 + margin) x (16 + margin). The area is kept this small because the Jetson Nano cannot allocate a large amount of shared memory.
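The launch geometry described above can be worked out on the host as in the sketch below. This is an illustration only: `LaunchConfig` and `makeLaunchConfig` are hypothetical names, rounding the block count up for image sizes that are not multiples of 16 is an assumption, and so is the 8-bit pixel format used to size the shared tile.

```cpp
#include <cstddef>

constexpr int kBlockDim = 16;  // 16 x 16 threads per block, as in Step 1

struct LaunchConfig {
    int blocksX;
    int blocksY;
    int threadsPerBlock;
    std::size_t sharedBytes;  // shared-memory tile size per block
};

// Derives the grid size and the shared tile size for an image of
// cols x rows pixels. `margin` is the extra border each block loads so
// that patches centred near the block edge are fully covered.
LaunchConfig makeLaunchConfig(int cols, int rows, int margin) {
    // Ceiling division so partial tiles at the image border get a block.
    const int blocksX = (cols + kBlockDim - 1) / kBlockDim;
    const int blocksY = (rows + kBlockDim - 1) / kBlockDim;

    // Shared tile of (16 + margin) x (16 + margin) pixels.
    // Assumption: one byte per pixel (8-bit greyscale).
    const std::size_t tile = static_cast<std::size_t>(kBlockDim + margin);
    const std::size_t sharedBytes = tile * tile * sizeof(unsigned char);

    return {blocksX, blocksY, kBlockDim * kBlockDim, sharedBytes};
}
```

For a 640 x 480 image and a margin of 16, this gives 40 x 30 blocks of 256 threads, each with a 32 x 32 tile of 1024 bytes.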
For speed, all GPU memory, including shared memory, is pre-allocated and the image is uploaded.
Step 2, the fine search, accumulates pixel values along circular paths. Each radius yields one accumulated value, so n radii produce an n-dimensional vector of accumulated values. This vector encodes the rotation-invariant features of the image.
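A CPU sketch of the circular sum is shown below. It is not the CUDA kernel itself: `Image` and `circularSums` are hypothetical names, and sampling each circle at a fixed number of angles with nearest-pixel rounding is an assumption about how the path is traced.

```cpp
#include <cmath>
#include <vector>

// Minimal row-major greyscale image (hypothetical helper type).
struct Image {
    int width;
    int height;
    std::vector<double> data;
    double at(int x, int y) const { return data[y * width + x]; }
};

// Returns one accumulated value per radius (1..numRadii) around (cx, cy).
// Rotating the image about (cx, cy) moves pixels along each circle but
// keeps them on the same circle, so the sums stay (nearly) unchanged.
std::vector<double> circularSums(const Image& img, int cx, int cy,
                                 int numRadii, int samplesPerCircle = 360) {
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<double> sums(numRadii, 0.0);
    for (int r = 1; r <= numRadii; ++r) {
        for (int s = 0; s < samplesPerCircle; ++s) {
            const double theta = twoPi * s / samplesPerCircle;
            // Nearest-pixel sampling along the circle.
            const int x = cx + static_cast<int>(std::lround(r * std::cos(theta)));
            const int y = cy + static_cast<int>(std::lround(r * std::sin(theta)));
            // Skip samples that fall outside the image.
            if (x >= 0 && x < img.width && y >= 0 && y < img.height) {
                sums[r - 1] += img.at(x, y);
            }
        }
    }
    return sums;
}
```

A bright pixel three pixels to the right of the centre and one three pixels below it land on the same circle, so both images produce the same vector.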
- Find the circular sum of the reference patches.
- Compute circular sums in parallel, as in Step 1, for the candidates that passed the adaptive threshold in Step 1. Each thread is responsible for two circular paths.
- Apply normalized cross-correlation (NCC) between the reference and candidate vectors and pick the pixel with the maximum value.
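The final comparison can be sketched as follows. The functions `ncc` and `bestCandidate` are hypothetical names, and the zero-mean form of NCC used here is an assumption, since the exact variant is not stated.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Zero-mean normalized cross-correlation of two equal-length vectors.
// Returns a value in [-1, 1]; 1 means identical up to gain and offset.
double ncc(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size();
    if (n == 0 || b.size() != n) return 0.0;

    double meanA = 0.0, meanB = 0.0;
    for (std::size_t i = 0; i < n; ++i) { meanA += a[i]; meanB += b[i]; }
    meanA /= static_cast<double>(n);
    meanB /= static_cast<double>(n);

    double num = 0.0, varA = 0.0, varB = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double da = a[i] - meanA;
        const double db = b[i] - meanB;
        num += da * db;
        varA += da * da;
        varB += db * db;
    }
    const double denom = std::sqrt(varA * varB);
    return denom > 0.0 ? num / denom : 0.0;  // flat vectors score 0
}

// Returns the index of the candidate whose circular-sum vector
// correlates best with the reference vector.
std::size_t bestCandidate(const std::vector<double>& reference,
                          const std::vector<std::vector<double>>& candidates) {
    std::size_t best = 0;
    double bestScore = -2.0;  // below the minimum possible NCC
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const double score = ncc(reference, candidates[i]);
        if (score > bestScore) { bestScore = score; best = i; }
    }
    return best;
}
```

Because the score is normalized, a candidate whose circular sums are a scaled copy of the reference (for example under brighter lighting) still scores close to 1.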
As mentioned in Step 2, each thread is responsible for two circular paths. One thread handles the k-th circle and the (n-k)-th circle to reduce bottlenecks between threads: pairing a short inner path with a long outer path gives every thread a similar amount of work.
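The effect of this pairing can be illustrated with a small sketch. It uses zero-based circle indices, so thread t takes circles t and n-1-t, which is the zero-based equivalent of pairing k with n-k. The names `CirclePair`, `circlesForThread`, and `pathCost` are hypothetical, and modelling the cost of a circle as proportional to its radius is an assumption.

```cpp
// Pair of circle indices handled by one thread.
struct CirclePair { int inner, outer; };

// Thread t handles circle t and its mirror n-1-t (zero-based indices).
// Assumes n is even, so n/2 threads cover all n circles exactly once.
CirclePair circlesForThread(int t, int n) {
    return {t, n - 1 - t};
}

// Approximate work for one circle: the number of pixels on the path grows
// with the circumference, which is proportional to the radius (index + 1).
int pathCost(int circleIndex) {
    return circleIndex + 1;
}
```

Under this model every thread does `pathCost(t) + pathCost(n - 1 - t) = n + 1` units of work. Assigning adjacent circles instead (2t and 2t+1) would leave the last thread with several times the work of the first, and the whole block would wait for it.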
Additionally, a parallel reduction algorithm is used to speed up the extraction of maxima and sums.
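The access pattern of such a reduction can be emulated on the CPU as below. This is a sketch of the general tree-reduction technique, not the aligner's kernel: `treeReduce` is a hypothetical name, and the halving-stride layout is one common variant.

```cpp
#include <cstddef>
#include <vector>

// Emulates a tree reduction: in each step the active half of the array
// combines with the other half, halving the problem size. On the GPU each
// inner-loop iteration would be one thread, with a barrier between steps,
// so n values are reduced in about log2(n) steps instead of n.
template <typename Op>
double treeReduce(std::vector<double> values, double identity, Op op) {
    // Pad to a power of two with the identity element of the operation.
    std::size_t n = 1;
    while (n < values.size()) n <<= 1;
    values.resize(n, identity);

    for (std::size_t stride = n / 2; stride > 0; stride >>= 1) {
        for (std::size_t i = 0; i < stride; ++i) {  // one GPU thread per i
            values[i] = op(values[i], values[i + stride]);
        }
    }
    return values[0];
}
```

The same routine yields a sum when `op` is addition with identity 0, and a maximum when `op` returns the larger argument with a very small identity value.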
Django provides a web interface, so users can access the aligner from any device, not just a PC. The web page is responsive and is also convenient to use on mobile.
The aligner backend and the Django frontend send and receive data through shared memory. For image sharing, a reference-only object is used to minimise latency.
At the time the aligner was developed, existing circuit alignment PCs cost around $1,000, whereas this aligner, based on the Jetson Nano, costs around $100. Despite the low price, there is no reduction in performance (search time < 50 ms).
- AlignerLauncher: main()
- Aligner: Sets up and tears down shared memory, sets up and tears down cameras, processes images, and communicates with the web.
- PatternMatching: Manages GPU memory, runs the CUDA kernel, and performs image processing using OpenCV.
- CudaSupporter: Image processing, finding the moment vector and circular sum vector for each pixel using CUDA.
- Grabber: Pylon camera classes, setting up and destroying cameras and grabbers
- MemorySharing: Sharing memory with the web.
- Webcam: Enables the webcam if there is no Pylon camera.
- AlignerConsts: Constants
- kbhit: Checks for key presses