Learning to Engineer: April 2011

Tuesday, April 26, 2011

Monday, April 25, 2011

Black-Scholes

As promised, here is the post about the high arithmetic intensity kernel I used for benchmarking. On the advice of course TA Jon McCaffrey, I looked into Black-Scholes option pricing. Black-Scholes is a model developed by Fischer Black and Myron Scholes for pricing European-style call and put stock options. A call option gives the buyer the right to buy a stock in the future at an agreed-upon price (the strike price). A put option is the opposite: the buyer has the right to sell a stock in the future at the strike price.

Econ lesson aside, the closed-form solution for pricing these options involves a lot of floating-point arithmetic. A lot. You can see this by perusing the short kernel. Here is the OpenCL version: http://code.google.com/p/cbench-cis565s11/source/browse/trunk/bs_cl/bs_kernel.cl. Each option is priced independently, so the problem is embarrassingly parallel. Lots of independent floating-point instructions, very few memory operations, and no control flow at all. Is this AMD's GPU dream come true?

In the interest of time, I used the kernels and CPU implementations provided by the NVIDIA GPU Computing SDK for benchmarking (although I used my own test framework). All of the benchmark data is available at the SVN repository linked above, and I will be posting pretty graphs of this data soon.

Saturday, April 23, 2011

Data Collection Complete

I spent today finalizing the CUDA implementations for the SAT and my high compute kernel, a Black-Scholes option pricer. With that, all of my benchmarking programs are as complete as I can make them for this project. I ran the CUDA and OpenCL versions of both programs on the Fermi Tesla card in the lab, and have CSV files of data.

The other half of the equation was getting data from AMD cards for comparison. Very fortunately, Patrick found a friend, Aleksandar Dimitrijevic, who knew someone with a Radeon HD 6870. While this is not a Cayman-class GPU with a VLIW4 architecture, it is still the latest VLIW5 chip (codename Barts), and would definitely give interesting competition. Aleksandar sent me results for both the SAT and Black-Scholes kernels this morning, less than a day after I sent him the Black-Scholes implementation. A hearty thanks to both Aleksandar and Patrick for helping so much in making this project possible.

Without further delay, data for both my low compute and high compute kernels is available in the project SVN repository here: http://code.google.com/p/cbench-cis565s11/source/browse/#svn%2Ftrunk%2Fdata. I will post later on why the Black-Scholes kernel was selected for high arithmetic intensity and some implementation details.

I have Radeon HD 6870 Data!

I am a couple of posts behind in explaining my progress over the last few days, so I'll keep this brief. I have data for my two tests, and from a brief perusal, it looks great. More updates coming over this final weekend.

Thursday, April 21, 2011

SAT OpenCL Port Upgraded and Complete

I spent today refactoring the OpenCL summed area table implementation to output plenty of timing information and to take the thread block size and the input matrix size as parameters. I will port those changes to the CUDA implementation soon, and then performance can be compared between the two runtimes. Since they both compile to PTX code in the end, it will be a fight between NVCC and the OpenCL compiler (which I hear from Wikipedia is based on the Clang/LLVM project).

I ran into quite a few oddities when working with the program. For some reason, a block size of 384 crashed the display driver. That was rather disheartening, especially when running this program on a Tesla C2070. The other good chunk of problems came from running this code on an ATI Mobility Radeon HD 4650. I was basing my device setup on an AMD guide for revision 2.3 of their Stream SDK (now called AMD APP). I was annoyed to find that the vendor string changed from "Advanced Micro Devices, Inc." to "AMD APP...", which completely breaks the device checking. I would think that such systems would be a bit more robust. The 4650 also could not handle thread block sizes of 128 or even 64, which I find hard to believe. I do hope that the program runs a lot better on newer cards.
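One way to make a vendor check less brittle is to match on a prefix instead of comparing the whole string. The sketch below is a hypothetical helper in plain C, not the check from the AMD guide or the repository. It assumes the string has already been read from the OpenCL platform, and it only knows about the two spellings mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if str begins with prefix. */
static bool starts_with(const char *str, const char *prefix)
{
    return strncmp(str, prefix, strlen(prefix)) == 0;
}

/* Hypothetical helper: accepts both vendor spellings seen in this post.
 * Matching a prefix tolerates suffix changes such as ", Inc." or a
 * longer product name, which an exact strcmp() would reject. */
static bool is_amd_vendor(const char *vendor)
{
    assert(vendor != NULL);
    return starts_with(vendor, "Advanced Micro Devices") ||
           starts_with(vendor, "AMD");
}
```

An alternative that avoids string matching altogether is to ask each platform for devices of type CL_DEVICE_TYPE_GPU and take whatever comes back, at the cost of not knowing which vendor you got.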

I next have to find a high arithmetic intensity benchmark to analyze. We press on.

Wednesday, April 20, 2011

SAT OpenCL Port

I was able to port yesterday's summed area table (SAT) program from CUDA to OpenCL today. The code is in my SVN repository under sat_cl/. The porting took a little longer than anticipated. OpenCL is more akin to the CUDA Driver API than to the CUDA Runtime API: the kernel and its parameters are configured through objects rather than with a function call. It reminded me a lot of the GLSL shaders we used at the beginning of CIS 565, where the GPU code is taken as a string and compiled at runtime. While the interface is a bit quirky compared to the can't-believe-it's-not-C CUDA Runtime API, both the OpenCL API and documentation are well done.

Here are a few places that tripped me up in the porting process, in case anyone else runs into the same issues:

API error codes are very nice. NVIDIA has an oclErrorString() function in their SDK that I highly recommend copying.
cl_float4 and CUDA float4 are very different. The former is implemented as an SSE vector from what I can tell, while the latter is a struct. In Visual Studio 2008, I had to set the components of the cl_float4 with .s[0], .s[1], etc. Apparently this might not be the case in other dev environments though.
In OpenCL, the size of the NDRange (aka the CUDA grid) is specified as the total number of work-items (threads), whereas in CUDA it is specified as the number of thread blocks.
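The last point is easy to get wrong when porting launch code, so here is a small sketch of the two calculations side by side. The function names are made up for illustration. The rounding assumes the OpenCL 1.x rule that the global work size must be a multiple of the local work size.

```c
#include <assert.h>
#include <stddef.h>

/* CUDA-style grid size: number of thread blocks needed to cover n elements. */
static size_t cuda_grid_blocks(size_t n, size_t block_size)
{
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;  /* ceiling division */
}

/* OpenCL-style global work size: total number of work-items, rounded up
 * to a whole number of work-groups of local_size items each. */
static size_t opencl_global_size(size_t n, size_t local_size)
{
    return cuda_grid_blocks(n, local_size) * local_size;
}
```

For 1000 elements and a block size of 256, CUDA is told 4 blocks while OpenCL is told 1024 work-items. In both cases the kernel still needs a bounds check so the 24 extra threads do nothing.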

Unfortunately, I was a bit too sick to travel to the SIG lab today, but hopefully Fermi and I will have some benchmarking numbers soon to at least compare CUDA and OpenCL performance. One thing I will have to do first is find the best timer to use. My CUDA code uses the high-res GPU timer, but my thought is that results from it will not be comparable when running code on AMD hardware too. My next step is to find a good CPU timer to use for both CUDA and OpenCL. I'd appreciate any suggestions.
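For reference, one possible shape for such a CPU timer is sketched below. It assumes a C11 toolchain with timespec_get(), which Visual Studio 2008 does not provide, so treat it as an illustration rather than what the benchmarks used. All of the function names are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds, read with C11 timespec_get(). */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: average seconds per call of run() over reps calls.
 * For GPU work, run() has to block until the device is done (for example
 * by calling clFinish() or cudaDeviceSynchronize() after the launch);
 * otherwise only the launch overhead is measured. */
static double time_average(void (*run)(void *), void *arg, int reps)
{
    assert(reps > 0);
    double start = wall_seconds();
    for (int i = 0; i < reps; ++i)
        run(arg);
    return (wall_seconds() - start) / reps;
}

/* Stand-in for "launch the kernel, then wait for it": counts its calls. */
static void count_calls(void *arg)
{
    ++*(int *)arg;
}
```

Averaging over many repetitions matters because a single kernel launch can be shorter than the timer's resolution.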

Tuesday, April 19, 2011

SAT Testing

Hello, and sorry for the long delay between posts. I wanted to wait until I had something to share before posting again. I have a few things this time around.

First is my SVN repository. It is hosted on Google Code here: http://code.google.com/p/cbench-cis565s11/ The code is under an MIT license, so feel free to use or modify it in any way you please.

First up in my SVN repository is my super flexible templated forwards and backwards scan. This is located in the scan/ folder in scan_kernel.cu. I wrote it after thinking I would need it for summed area table calculation (more on that later). Turns out I was grievously wrong, but the code is tested and may be useful to someone. The kernel can do both forwards and backwards scans for arbitrary sizes.

The more important program in there is my summed area table implementation in the sat/ folder. I based this implementation on the work of Harris, Sengupta, and Owens in GPU Gems 3: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html. I implemented the naive scan algorithm instead of the work-efficient one (which is pretty complex for an inclusive scan). After a good night's stay in the SIG lab, I managed to squash the bugs in the implementation. The key things I learned are that naive scans don't always work in place (one thread can read an element that another thread has already overwritten in the same pass), and that printf() sometimes fixes bugs all by itself ;)
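To show why the naive scan wants two buffers, here is a sequential C sketch of the naive (Hillis-Steele) inclusive scan. It is a CPU illustration with made-up names, not the CUDA kernel in sat/. The inner loop stands in for the GPU threads, and each pass reads only from one buffer and writes only to the other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive (Hillis-Steele) inclusive scan of n floats.
 * The result ends up in data; scratch is workspace of the same size.
 * Each pass reads src and writes dst, then the two buffers swap roles,
 * so no pass ever reads a value written during that same pass. */
static void naive_inclusive_scan(float *data, float *scratch, size_t n)
{
    assert(data != NULL && scratch != NULL);

    float *src = data;
    float *dst = scratch;
    for (size_t offset = 1; offset < n; offset *= 2) {
        for (size_t i = 0; i < n; ++i)  /* one GPU thread per i */
            dst[i] = (i >= offset) ? src[i] + src[i - offset] : src[i];
        float *tmp = src;
        src = dst;
        dst = tmp;
    }
    if (src != data)  /* odd number of passes: copy the result back */
        memcpy(data, src, n * sizeof(float));
}
```

If dst and src were the same array, element i - offset might already hold this pass's sum by the time element i reads it, which is the kind of bug that comes and goes with thread timing.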

The code is broken up into a few kernels. One is a SAT scan kernel that scans all the rows of the 2D matrix in parallel. Instead of applying this kernel to the columns as well (which would have severe memory coalescing issues), the matrix is transposed and then the kernel is again applied to the rows. The result is transposed again to arrive at the final answer.
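The scan, transpose, scan, transpose pipeline described above can be sketched on the CPU as follows. This is an illustration with hypothetical function names, not the kernels from the repository. A plain sequential prefix sum stands in for the parallel row scan, since the point here is the order of the steps.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive prefix sum along each row of a rows x cols row-major matrix. */
static void scan_rows(float *m, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 1; c < cols; ++c)
            m[r * cols + c] += m[r * cols + c - 1];
}

/* out (cols x rows) = transpose of in (rows x cols). */
static void transpose(const float *in, float *out, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            out[c * rows + r] = in[r * cols + c];
}

/* Summed area table of m, computed in place; tmp is same-size workspace. */
static void summed_area_table(float *m, float *tmp, size_t rows, size_t cols)
{
    assert(m != NULL && tmp != NULL && m != tmp);
    scan_rows(m, rows, cols);       /* scan every row */
    transpose(m, tmp, rows, cols);  /* columns become rows */
    scan_rows(tmp, cols, rows);     /* scan the former columns */
    transpose(tmp, m, cols, rows);  /* back to the original layout */
}
```

For the 2x3 input {1, 2, 3; 4, 5, 6} this produces {1, 3, 6; 5, 12, 21}: each entry is the sum of everything above and to the left of it, inclusive.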

I implemented SAT at the suggestion of my course instructor, Patrick Cozzi. Since scans and transposes are heavily bandwidth-limited, this program will give good insight into memory performance for both Fermi and Cayman GPUs.

Next on the to-do list is to port this program to OpenCL so it can run on AMD cards. Afterwards I can do some benchmarking and performance analysis, and then move on to picking a high-arithmetic-intensity application. One week to go.

Saturday, April 9, 2011

Larrabee

On Wednesday, I gave a talk on Intel's Larrabee project. After reading through the SIGGRAPH 08 paper and presentations given by Tom Forsyth, I became very impressed with the work Intel's engineers did in advancing manycore computing. It's a shame that a Larrabee GPU won't be in our desktops and laptops. I hope that Intel's Many Integrated Core (MIC) project does at least bring a commercial product to the HPC developers currently confined to CUDA and OpenCL.

Slides to my presentation are available here: http://static.vsampath.com/cis565/s11/larrabee.pdf

The slides are a bit sparse, but here's a video with the slides and voice-over from the presentation: http://static.vsampath.com/cis565/s11/larrabee.wmv

Feel free to post any questions and/or mourn the loss of the Larrabee GPU in the comments.
