I spent today refactoring the OpenCL summed area table implementation so that it outputs detailed timing information and takes the thread block size and the input matrix size as parameters. I will port those changes to the CUDA implementation soon, and then the performance of the two runtimes can be compared. Since both compile to PTX code in the end on NVIDIA hardware, it will be a fight between NVCC and the OpenCL compiler (which, according to Wikipedia, is based on the Clang/LLVM project).
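For checking the GPU output and getting a baseline number, a CPU reference is handy. The following is only a minimal sketch in Python, not the OpenCL code discussed here; the function names `summed_area_table` and `timed_reference` are made up for illustration, and the matrix size is the only parameter since thread block size has no meaning on the CPU path.

```python
import time


def summed_area_table(matrix):
    """Return the inclusive summed area table of a 2D list of numbers."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    table = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(cols):
            row_sum += matrix[y][x]
            above = table[y - 1][x] if y > 0 else 0
            table[y][x] = row_sum + above
    return table


def timed_reference(size):
    """Build a size x size matrix of ones and time the CPU reference."""
    matrix = [[1] * size for _ in range(size)]
    start = time.perf_counter()
    table = summed_area_table(matrix)
    elapsed = time.perf_counter() - start
    return table, elapsed


if __name__ == "__main__":
    table, elapsed = timed_reference(256)
    # For an all-ones input, entry (y, x) equals (y + 1) * (x + 1).
    assert table[255][255] == 256 * 256
    print(f"256x256 reference: {elapsed * 1e3:.3f} ms")
```

An all-ones input makes the expected result easy to verify by hand, which is useful when comparing against whatever the GPU kernel returns for the same matrix size.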
I ran into quite a few oddities while working with the program. For some reason, a block size of 384 crashed the display driver. That was rather disheartening, especially since the program was running on a Tesla C2070. The other good chunk of problems came from running this code on an ATI Mobility Radeon HD 4650. I was basing my device setup on an AMD guide for revision 2.3 of their Stream SDK (now called AMD APP). I was annoyed to find that the vendor string changed from "Advanced Micro Devices, Inc." to "AMD APP...", which completely breaks the device checking. I would have thought that such identifiers would be a bit more stable. The 4650 also could not handle thread block sizes of 128 or even 64, which I find hard to believe. I hope, though, that the program runs a lot better on newer cards.
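One way to make the device checking less brittle would be to match on substrings instead of the exact vendor string, and to validate the requested block size against the limit the device reports before launching anything. The sketch below is an assumption-laden illustration, not the code from this project: `is_amd_platform` and `block_size_ok` are hypothetical helpers, the strings would come from `clGetPlatformInfo` (`CL_PLATFORM_VENDOR`, `CL_PLATFORM_NAME`), and the limit from `clGetDeviceInfo` with `CL_DEVICE_MAX_WORK_GROUP_SIZE`. The numbers in the usage lines are placeholders, not measured limits of either card, and a size check like this may well not explain the driver crash at 384.

```python
import re


def is_amd_platform(vendor, name=""):
    """Loosely detect an AMD platform from its vendor and name strings.

    Matches either the long company name or a standalone "AMD" token,
    so a renamed SDK does not break the check.
    """
    text = f"{vendor} {name}".lower()
    if "advanced micro devices" in text:
        return True
    return re.search(r"\bamd\b", text) is not None


def block_size_ok(requested, max_work_group_size):
    """Return True if the requested block size fits the reported limit."""
    return 0 < requested <= max_work_group_size


if __name__ == "__main__":
    print(is_amd_platform("Advanced Micro Devices, Inc."))
    print(is_amd_platform("AMD APP"))
    print(is_amd_platform("NVIDIA Corporation"))
    # 256 is a placeholder limit used only for this demonstration.
    print(block_size_ok(384, 256))
```

Substring matching trades strictness for robustness: it tolerates renames like the one described above, at the cost of possibly accepting a platform string that merely mentions AMD.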
Next, I have to find a benchmark with high arithmetic intensity to analyze. We press on.