I was able to port yesterday's summed area table (SAT) program from CUDA to OpenCL today. The code is in my SVN repository under sat_cl/. The porting took a little longer than anticipated. OpenCL is more akin to the CUDA Driver API than to the CUDA Runtime API: the kernel and its parameters are configured through API objects rather than with a function-call syntax. It reminded me a lot of the GLSL shaders we used at the beginning of CIS 565, where the GPU code is taken as a string and compiled at runtime. While the interface is a bit quirky compared to the can't-believe-it's-not-C CUDA Runtime API, both the OpenCL API and its documentation are well done.
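To give a flavor of that object-based setup, here is a rough sketch of the flow I mean. The trivial "scale" kernel and the sizes are made up for illustration, and error checking is omitted for brevity:

    #include <CL/cl.h>

    int main(void)
    {
        /* Like a GLSL shader, the GPU code is handed over as a string
         * and compiled at runtime. */
        const char *src =
            "__kernel void scale(__global float *data, float s) {"
            "    data[get_global_id(0)] *= s;"
            "}";

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "scale", NULL);

        /* Arguments are bound one at a time through the kernel object,
         * Driver API style, instead of appearing in a call. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    256 * sizeof(cl_float), NULL, NULL);
        cl_float s = 2.0f;
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(kernel, 1, sizeof(cl_float), &s);

        size_t global = 256;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                               0, NULL, NULL);
        clFinish(queue);
        return 0;
    }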
Here are a few places that tripped me up in the porting process, in case anyone else runs into the same issues:
- The error codes returned by nearly every API call are very nice for debugging. NVIDIA has an oclErrorString() function in their SDK that translates the codes into readable names; I highly recommend copying it (first sketch after this list).
- cl_float4 and CUDA's float4 are very different. The former is implemented on top of an SSE vector from what I can tell, while the latter is a plain struct. In Visual Studio 2008, I had to set the components of a cl_float4 with .s[0], .s[1], etc. Apparently this might not be the case in other dev environments (second sketch below).
- In OpenCL, the size of the NDRange (aka the CUDA grid) is specified as the total number of work-items/threads, whereas in CUDA the grid is specified as a number of thread blocks (third sketch below).
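First, the error-string helper. This is a cut-down version of the kind of table oclErrorString() implements; the real SDK version covers all of the roughly fifty codes, and the CL_CHECK macro is just my own wrapper idea, not part of any SDK:

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *clErrorString(cl_int err)
    {
        switch (err) {
        case CL_SUCCESS:                 return "CL_SUCCESS";
        case CL_DEVICE_NOT_FOUND:        return "CL_DEVICE_NOT_FOUND";
        case CL_OUT_OF_RESOURCES:        return "CL_OUT_OF_RESOURCES";
        case CL_INVALID_KERNEL_ARGS:     return "CL_INVALID_KERNEL_ARGS";
        case CL_INVALID_WORK_GROUP_SIZE: return "CL_INVALID_WORK_GROUP_SIZE";
        /* ...the real list goes on... */
        default:                         return "(unrecognized error code)";
        }
    }

    /* Wrap API calls so a failure reports the file and line. */
    #define CL_CHECK(call)                                           \
        do {                                                         \
            cl_int _err = (call);                                    \
            if (_err != CL_SUCCESS) {                                \
                fprintf(stderr, "%s:%d: %s\n",                       \
                        __FILE__, __LINE__, clErrorString(_err));    \
                exit(EXIT_FAILURE);                                  \
            }                                                        \
        } while (0)

Used like CL_CHECK(clBuildProgram(program, 1, &device, NULL, NULL, NULL));.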
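Second, the cl_float4 components. A helper along these lines works (the name make_cl_float4 is just for illustration); the .s[] array member is what compiled for me under Visual Studio 2008, while other headers may also expose .x/.y/.z/.w:

    #include <CL/cl.h>

    /* cl_float4 is a union in the OpenCL headers (possibly backed by an
     * SSE vector type), not a plain struct like CUDA's float4. */
    static cl_float4 make_cl_float4(float x, float y, float z, float w)
    {
        cl_float4 v;
        v.s[0] = x;   /* .x in CUDA */
        v.s[1] = y;   /* .y in CUDA */
        v.s[2] = z;   /* .z in CUDA */
        v.s[3] = w;   /* .w in CUDA */
        return v;
    }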
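Third, the launch-size translation. A CUDA launch of kernel<<<numBlocks, threadsPerBlock>>> maps to something like this (the numbers here are placeholders):

    #include <CL/cl.h>

    void launch(cl_command_queue queue, cl_kernel kernel)
    {
        size_t threadsPerBlock = 256;   /* CUDA blockDim.x */
        size_t numBlocks       = 64;    /* CUDA gridDim.x  */

        size_t local  = threadsPerBlock;              /* work-group size  */
        size_t global = numBlocks * threadsPerBlock;  /* TOTAL work-items */

        /* Note: in OpenCL 1.x, global must be evenly divisible by local. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global, &local, 0, NULL, NULL);
    }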
I was unfortunately a bit too sick to travel to the SIG lab today, but hopefully Fermi and I will have some benchmarking numbers soon to at least compare CUDA and OpenCL performance. One thing I will have to do first is find the best timer to use. My CUDA code uses the high-resolution GPU timer, but my thought is that its results will not be comparable once I am also running code on AMD hardware. My next step is to find a good CPU timer to use for both CUDA and OpenCL. I'd appreciate any suggestions.
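In the meantime, one candidate is a plain wall-clock timer that both APIs can share. This is only a sketch of the idea, not settled code; the key detail is draining the GPU (clFinish() or cudaThreadSynchronize()) before reading the stop time, or the timer only measures launch overhead:

    #ifdef _WIN32
    #include <windows.h>
    static double wallSeconds(void)
    {
        LARGE_INTEGER freq, now;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&now);
        return (double)now.QuadPart / (double)freq.QuadPart;
    }
    #else
    #include <sys/time.h>
    static double wallSeconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }
    #endif

    /* Usage:
     *     double t0 = wallSeconds();
     *     ...enqueue kernels...
     *     clFinish(queue);           // or cudaThreadSynchronize()
     *     double elapsed = wallSeconds() - t0;
     */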