Tuesday, April 19, 2011

SAT Testing

Hello and sorry for the long delay between posts. I only wanted to post next when I had something to share. I have a few things this time around.

First is my SVN repository. It is hosted on Google Code here: http://code.google.com/p/cbench-cis565s11/ The code is under an MIT license, so feel free to use or modify it in any way you please.

First up in my SVN repository is my super flexible templated forwards and backwards scan. This is located in the scan/ folder in scan_kernel.cu. I wrote it after thinking I would need it for summed area table calculation (more on that later). Turns out I was grievously wrong, but the code is tested and may be useful to someone. The kernel can do both forwards and backwards scans for arbitrary sizes.

The more important program in there is my summed area table implementation in the sat/ folder. I based this implementation on the work of Harris, Sengupta, and Owens in GPU Gems 3: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html. I implemented the naive scan algorithm instead of the work-efficient one (which is pretty complex for an inclusive scan). After a good night's stay in the SIG lab, I managed to squash the bugs in the implementation. The key things I learned are that naive scans just don't like to work all the time in-place, and printf() sometimes fixes bugs all by itself ;)

The code is broken up into a few kernels. One is a SAT scan kernel that scans all the rows of the 2D matrix in parallel. Instead of applying this kernel to the columns as well (which would have severe memory coalescing issues), the matrix is transposed and then the kernel is again applied to the rows. The result is transposed again to arrive at the final answer.

I implemented SAT at the suggestion of my course instructor, Patrick Cozzi. Since scans and transposes are heavily bandwidth-limited, this program will give good insight into memory performance for both Fermi and Cayman GPUs.

Next on the to-do list is to port this program to OpenCL so it can run on AMD cards. Afterwards I can do some benchmarking and performance analysis, and then move on to picking a high-arithmetic intensity application. One week to go.

1 comment:

  1. I'm glad you worked through the issues in your SAT implementation.