Research Objective
To develop a GPU-accelerated implementation of the OCT signal processing chain using CUDA to achieve video frame rates of 25 frames/s, and to derive a performance model for predicting execution times including compute and data transfer aspects.
Research Findings
The GPU implementation was substantially faster than the CPU baseline: the synchronous variant (OCT sync) ran 5-7 times faster than the serial CPU version, and the asynchronous variant (OCT async) ran 8-21 times faster. The performance model predicted runtimes with deviations below 15%, which made it possible to estimate that data sets of up to 2048×24576 px can be processed at video rates. This work demonstrates the feasibility of using GPUs for real-time OCT signal processing in medical and industrial applications. Further optimizations remain possible, such as displaying results directly from the GPU to eliminate copy operations.
Research Limitations
The study is limited to consumer GPUs with Pascal architecture. The performance model may underestimate runtimes for small data sizes because it uses maximum bandwidth values as parameters. The model also assumes contiguous memory access patterns, which may not fully capture the scattered accesses in some kernels. Future work could extend the approach to other GPU architectures and optimize it for volumetric 3D scans.
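The underestimate for small data sizes can be illustrated with a simple latency-plus-bandwidth transfer model. This is a minimal sketch and not the model derived in the study: the 10 µs latency, the 12 GB/s peak bandwidth, and the 4-byte sample size are hypothetical placeholder values chosen only for illustration.

```python
def transfer_time(num_bytes, latency_s, peak_bw):
    """Time for one transfer: fixed latency plus bytes over peak bandwidth."""
    return latency_s + num_bytes / peak_bw


def effective_bandwidth(num_bytes, latency_s, peak_bw):
    """Bandwidth actually achieved once the fixed latency is included."""
    return num_bytes / transfer_time(num_bytes, latency_s, peak_bw)


LATENCY_S = 10e-6      # hypothetical per-transfer latency in seconds
PEAK_BW = 12e9         # hypothetical peak host-device bandwidth in bytes/s
BYTES_PER_SAMPLE = 4   # assumption: 4-byte samples

# B-scan sizes taken from the test datasets described in this summary.
for width, height in [(1120, 256), (2048, 512), (2048, 8192)]:
    num_bytes = width * height * BYTES_PER_SAMPLE
    eff = effective_bandwidth(num_bytes, LATENCY_S, PEAK_BW)
    print(f"{width}x{height} px: {eff / PEAK_BW:.1%} of peak bandwidth")
```

Under these assumptions the smallest image reaches a noticeably lower fraction of the peak bandwidth than the largest one, so a model parameterized with the peak value alone predicts transfers that are too fast for small inputs.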
1:Experimental Design and Method Selection:
The study involved designing a CUDA-based GPU implementation of the OCT signal processing chain, leveraging high-performance libraries such as cuBLAS and cuFFT. The performance model was derived from the Boat Hull Model and from data transfer models, and it covers both synchronous and asynchronous operation using NVIDIA's CUDA streams.
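How compute and data transfer terms combine in the synchronous case can be sketched as follows. This is an illustrative roofline-style estimate, assuming each kernel is bounded by either arithmetic throughput or device memory bandwidth; it is not the model from the study. The `Device` class, all device parameters, and the per-kernel operation counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    peak_flops: float  # arithmetic throughput in FLOP/s
    mem_bw: float      # device memory bandwidth in bytes/s
    pcie_bw: float     # host-device transfer bandwidth in bytes/s


def kernel_time(flops, bytes_moved, dev):
    """Roofline-style bound: the slower of compute and memory traffic."""
    return max(flops / dev.peak_flops, bytes_moved / dev.mem_bw)


def sync_frame_time(bytes_in, bytes_out, kernels, dev):
    """Synchronous case: upload, all kernels, and download run back to back."""
    h2d = bytes_in / dev.pcie_bw
    d2h = bytes_out / dev.pcie_bw
    compute = sum(kernel_time(f, b, dev) for f, b in kernels)
    return h2d + compute + d2h


# Hypothetical device parameters, not measured values.
dev = Device(peak_flops=10e12, mem_bw=300e9, pcie_bw=12e9)

n = 2048 * 512                 # pixels in one B-scan from the test datasets
bytes_in = bytes_out = n * 4   # assumption: 4-byte samples in and out

# Placeholder (flops, bytes moved) pairs standing in for the chain's kernels.
kernels = [(55 * n, 16 * n), (2 * n, 8 * n)]

t = sync_frame_time(bytes_in, bytes_out, kernels, dev)
print(f"predicted {t * 1e3:.3f} ms per frame, budget {1e3 / 25:.1f} ms at 25 frames/s")
```

The 40 ms budget follows directly from the 25 frames/s target; every other number in the example is a placeholder.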
2:Sample Selection and Data Sources:
Real-life OCT images were used as test datasets, including B-scans of a pill (1120×256 px, 1120×500 px) and cancerous tissue (2048×512 px). An artificially enlarged dataset (up to 2048×8192 px) was also created to test performance on larger data sizes.
3:List of Experimental Equipment and Materials:
Two NVIDIA GPUs with Pascal architecture were used: GeForce GTX Titan X and GeForce 1050 Ti. The CPUs were an Intel i7 3820 (Sandy Bridge) and an Intel E5-2650v4 (Broadwell EP). Software included CUDA 8.0, Microsoft Visual Compiler 14.0, Intel Compiler 17.0, OpenBLAS, MKL, FFTW, and custom implementations.
4:Experimental Procedures and Operational Workflow:
The signal processing chain was implemented in CUDA, with data transferred to and from the GPU using pinned memory. For synchronous transfers (OCT sync), operations were sequential; for asynchronous transfers (OCT async), multiple CUDA streams were used to overlap computations and data transfers. Performance was measured using CUDA events and boost::timer, with 100 runs averaged for each configuration.
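The benefit of overlapping transfers and computation can be sketched with a small timeline simulation. It assumes the frame is split into equal chunks, that uploads, kernels, and downloads each run on their own engine, and that per-chunk launch overhead is negligible. These are simplifying assumptions for illustration, not a description of the study's implementation, and the function name `async_frame_time` is hypothetical.

```python
def async_frame_time(h2d, compute, d2h, num_chunks):
    """Estimate the frame time when work is split into chunks that overlap.

    h2d, compute and d2h are the total times of the three stages for the
    whole frame. Each stage processes chunks in order on its own engine,
    and a chunk enters a stage only after it has left the previous one.
    """
    a = h2d / num_chunks
    b = compute / num_chunks
    c = d2h / num_chunks
    h2d_done = compute_done = d2h_done = 0.0
    for _ in range(num_chunks):
        h2d_done = h2d_done + a                          # uploads run back to back
        compute_done = max(compute_done, h2d_done) + b   # kernel waits for its upload
        d2h_done = max(d2h_done, compute_done) + c       # download waits for its kernel
    return d2h_done


# Hypothetical stage times in milliseconds.
sequential = async_frame_time(3.0, 6.0, 3.0, num_chunks=1)
overlapped = async_frame_time(3.0, 6.0, 3.0, num_chunks=4)
print(f"sequential: {sequential} ms, overlapped with 4 chunks: {overlapped} ms")
```

With a single chunk the estimate reduces to the sequential sum of the three stages. With more chunks the total approaches the duration of the slowest stage, which is why overlap helps most when transfer and compute times are of similar size.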
5:Data Analysis Methods:
Runtime measurements were compared against the derived performance model. Statistical analysis ensured standard deviations within 10% of the mean, and speed-ups were calculated relative to serial and parallel CPU implementations.
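The quantities used in this analysis can be computed as in the sketch below. The runtimes are made-up placeholder values, not measurements from the study, and the use of the sample standard deviation is an assumption, since the summary does not say which estimator was used.

```python
import statistics


def summarize(runtimes_ms):
    """Return the mean runtime and the standard deviation relative to it."""
    mean = statistics.mean(runtimes_ms)
    rel_std = statistics.stdev(runtimes_ms) / mean  # sample standard deviation
    return mean, rel_std


def speed_up(baseline_ms, accelerated_ms):
    """How many times faster the accelerated version is than the baseline."""
    return baseline_ms / accelerated_ms


def model_deviation(predicted_ms, measured_ms):
    """Relative deviation of the model prediction from the measurement."""
    return abs(predicted_ms - measured_ms) / measured_ms


# Placeholder runtimes in milliseconds; the study averaged 100 runs.
gpu_runs = [10.2, 9.8, 10.1, 9.9, 10.0]
cpu_runs = [70.5, 69.5, 70.2, 69.8, 70.0]

gpu_mean, gpu_rel_std = summarize(gpu_runs)
cpu_mean, _ = summarize(cpu_runs)

print(f"GPU mean {gpu_mean:.2f} ms, relative std {gpu_rel_std:.1%}")
print(f"speed-up {speed_up(cpu_mean, gpu_mean):.1f}x")
print(f"model deviation {model_deviation(9.0, gpu_mean):.1%}")
```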