Wednesday, September 21, 2011

An Intro to Hardware Hacking for CS folk: Part 1, What's a microcontroller?

N.B. You can find a prelude to this post here.

Microcontrollers are everywhere. They engage the brakes in your car, open the doors of your building's elevator, and yes, cook the frozen dinner in your microwave.  But what are they, and how can you, the budding hardware hacker, take advantage of them? Hopefully this post will serve as an introduction to the most popular computing devices around.

An Intro to Hardware Hacking for CS folk. A Prelude with µWave

I'd like to start off this post with a big thanks to the PennApps team. They put together a fantastic hackathon last weekend, with 40 teams developing crazy apps in only 48 hours. I was fortunate to be a part of one of those teams, and we worked on a project called µWave. µWave is a microwave (how punny is that name) that we hacked to figure out how long you are cooking food and to transmit that data over HTTP. A server then uses that information to find a YouTube video of suitable length to play for you while you wait, and to @mention you in a tweet and text you when that food is ready. I am happy to report that this project won the competition, which is especially cool since it was the only hardware hack around. While we µWave creators are a team of electrical engineers (except for me, the computer engineering student), I refuse to believe that you need an EE degree to tinker with electronics and devices. The next post will introduce microcontrollers, the bridge between hardware and software development. In a later post, we can discuss how to work with analog and digital signals to interact with different devices.

TL;DR prelude: CS people can (and should) hack hardware too. Let's talk about it.

Monday, June 20, 2011

Intel MIC/Larrabee is Coming!

(Note: in this post, like many others, I unfortunately don't give too much background info. If you would like to learn more about Larrabee and the Intel MIC, you are welcome to check out the slides and video lecture I gave on the subject here. Larrabee architect Tom Forsyth also links to a wealth of sources on Larrabee at his home page.)

Intel Vice President Kirk Skaugen presented the Intel Many Integrated Core (MIC) platform at the International Supercomputing Conference (ISC) today. He unveiled the platform at the same event last year, showcasing then the Knights Ferry development kit and announcing the Knights Corner product. While I don't see it in the presentation slides, news sites are reporting that Knights Corner products will be available (or go into production?) next year. Kirk's slides at least confirm that Knights Corner will be a > 50 core part built with Intel's new 22nm tri-gate process.

The presentation at ISC 2011 also showed a variety of benchmarks from different research centers using the Knights Ferry development board. The GFLOPS numbers for the year-old Knights Ferry don't look too impressive to me; you can get 2 TFLOPS of SGEMM performance out of an AMD Radeon HD 5870 (released in 2009), for instance. However, we must keep in mind that those 2 TFLOPS came out of writing the routine entirely in AMD intermediate language (IL), which is 5-wide VLIW assembly. Larrabee's true importance is in its ease of use. Take a look at the second page of this PDF to see how simple it can be to write parallel code for Larrabee. One could even argue that it's nicer than Microsoft's unreleased C++ AMP.

2012 is not yet here though, and time is of the essence in the extremely fast-moving GPU world. NVIDIA is preparing Fermi's successor, and AMD's Graphics Core Next is also around the corner (I hope to write a post on that soon). The semiconductor industry knows to never underestimate Intel, though, and with 50 Larrabee cores, the advantages of the tri-gate process, and the might of Intel's software development tools, Knights Corner has the potential to shake up the GPU compute industry.

Thursday, June 16, 2011

Microsoft's C++ AMP (Accelerated Massive Parallelism)

Microsoft has just announced a new parallel programming technology called C++ AMP (which stands for Accelerated Massive Parallelism). It was unveiled in a keynote by Herb Sutter at AMD's Fusion Developer Summit 11. Video and slides from the keynote are available on MSDN Channel 9 here (Herb begins talking about C++ AMP around a half hour into the keynote).

The purpose of C++ AMP is to tackle the problem of heterogeneous computing. Herb argues for a single programming platform that can account for the differences in processing ability and memory models of CPUs, GPUs, and Infrastructure-as-a-Service (IaaS) cloud platforms. By being based on C++0x, such a platform could provide the abstractions necessary for productivity, but also allow the best performance and hand-tuning ability. Let's dive straight into the code with an example given during Herb's keynote:

void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
		int M, int N, int W )
{
	// Wrap the host data: a is M x W, b is W x N, and c is the M x N output.
	array_view<const float,2> a(M,W,A), b(W,N,B);
	array_view<writeonly<float>,2> c(M,N,C);
	// Run the lambda once for every index in the output grid.
	parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
		float sum = 0;
		for(int i = 0; i < a.x; i++)
			sum += a(idx.y, i) * b(i, idx.x);
		c[idx] = sum;
	} );
}

This is a function that performs floating-point matrix multiplication. I'll try a bottom-up approach and go line by line to see what's new with C++ AMP. There is certainly nothing different from regular C++ in the function argument list (Disclaimer: my knowledge of C++ is minimal; school has caused me to stick with C). The next few lines, though, introduce a class called an array_view. Herb described it in the keynote as an iterable array abstraction. We need this abstraction because we have no idea about the underlying memory model for the system our code is executing on. For example, if we are developing for an x86-64 CPU, then we have one coherent 64-bit address space. But if we are using a discrete GPU, then that GPU may have its own completely different address space(s). With IaaS platforms, we may be dealing with incoherent memory as well. The array_view will perform any memory copies or synchronization actions for us, so that our code is cleaner and can run on multiple platforms.

Next up is the parallel_for_each loop. This is surprisingly not a language extension by Microsoft, but just a function. Microsoft's engineers determined that by using lambda functions (a new feature of C++0x) as objects to define their compute kernels, they can avoid extending C++ to include all sorts of data-parallel for loops. In this case, a lambda function is executed that calculates the dot product of a row of a and a column of b over the grid defined by the output array_view c. It seems that the lambda function takes a 2D index (the index<2> argument) that identifies which element of the arrays each invocation works on.
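
To make the work done by each lambda invocation concrete, here is a plain sequential sketch of the same computation in Python. It is not C++ AMP and says nothing about how the runtime schedules work; the function name matrix_mult and the flat row-major list layout are my own assumptions for illustration.

```python
def matrix_mult(A, B, M, N, W):
    """Multiply an M x W matrix A by a W x N matrix B (flat row-major lists)."""
    C = [0.0] * (M * N)
    # Each (row, col) pair plays the role of one index<2> value
    # handed to the lambda in the C++ AMP example.
    for row in range(M):
        for col in range(N):
            total = 0.0
            # Dot product of one row of A with one column of B.
            for i in range(W):
                total += A[row * W + i] * B[i * N + col]
            C[row * N + col] = total
    return C
```

Every (row, col) iteration is independent of the others, which is why the two outer loops can be handed to a data-parallel runtime as a single grid.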

There is one keyword that I didn't explain, which is restrict. Herb says in the keynote that this is the only extension they had to make to C++0x to realize C++ AMP. restrict provides a compile-time check to ensure that code can execute on platforms of different compute capability. For instance, restrict(direct3d) ensures that the defined function will not attempt to execute any code that a DirectX 11-class GPU could not execute (such as throwing an exception or using function pointers). With this keyword, C++ AMP can have one body of code that runs on multiple platforms despite varying processor designs.

The ideas presented in this example alone make me excited about this platform. We only have to write whatever data-parallel code we need and the runtime can take care of the details for us. This was the promise of OpenCL, but C++ AMP does take the concept further. There is no new language subset to account for the threading and memory models of GPUs. There is no need to worry about which compute node's memory space the data is in. It also seems from this example that there is no need to size our workload for different thread and block counts like in CUDA; the runtime will handle that too. Microsoft showed an impressive demo of an n-body collision simulation program that could run on one core of a CPU, the on-die GPU of Fusion APUs, discrete GPUs, or even a discrete GPU and Fusion GPU at the same time, all using one executable. They simply changed an option from a GUI dropdown list to choose the compute resource to use.

There are plenty of details left to be answered, though. While Herb said in the keynote that developers will be free to performance tune, we don't know how much we can control execution resources like thread blocks. We also don't know what else is available in the C++ AMP API. Additionally, while Microsoft promises C++ AMP will be an open specification, the dependence on DirectCompute calls into question the prospect of quality implementations on non-Windows platforms. Hopefully the hands-on session given at the Fusion summit by Daniel Moth will be posted online soon, and we can see what details were uncovered then.

The announcement by Soma Somasegar notes that C++ AMP is expected to be part of the next Visual C++ and Visual Studio release. Herb announced in the keynote that AMD will release a compiler supporting C++ AMP for both Windows and, interestingly, non-Windows platforms. NVIDIA also announced their support, while noting that CUDA and Thrust are still the way to go ;). With the support of the discrete GPU vendors (note: nothing from Intel yet...) and the most popular development environment, C++ AMP has the potential to bring heterogeneous computing to a much larger developer market than what CUDA or OpenCL can do in their current form. I won't underestimate the ability of CUDA or OpenCL to catch up in ease-of-use by the time C++ AMP is released, though. In any case, I look forward to simpler GPU computing times ahead.

Thursday, June 2, 2011

Hack: Texting Myself New Emails

A project idea came to me today because of a predicament. I am beginning an internship, and I was told that I could not access personal email while working. I unfortunately do not own a smartphone, so this really meant that I would be away from my precious Gmail for at least 8 hours a day. I do have an unlimited text messaging plan, though, so I thought it would be great if I received a text message every time a new email came to my inbox.

The service that immediately came to mind for sending SMS was Google Voice. I was lucky to find an excellent API for Google Voice written in Python. With this, sending an SMS is literally 7 lines of code.

I also needed a way to find any new unread messages. I could parse Gmail's Atom feed and keep track of the most recent unread message, but then I would need to deal with login credentials. I am no Python expert, and I was too lazy to look up how to do this.

I realized, though, that I was already using a program called gm-notify with Ubuntu 10.04 to notify me of new unread emails with the messaging icon. I also realized that this program was written in Python. Time to hack.

I edited the file in /usr/bin and added a new send_sms() method to the CheckMail class. The method simply copied the code from the PyGoogleVoice SMS example to send a text message (a minor change I made was to use the Gmail credentials that gm-notify already retrieved from the GNOME keyring; all I had to do was read the username and password from the self.creds tuple). gm-notify nicely concatenates the sender and subject of every email it hasn't already notified about into a string. I simply call send_sms() every time the program launches a notification bubble because of a new string.
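
The control flow of the hack can be sketched roughly as below. This is not gm-notify's actual code: notify_new, already_sent, and the send_sms callable are hypothetical names, and the real send_sms() would wrap the PyGoogleVoice calls from its SMS example (log in with the keyring credentials, then send the text).

```python
def notify_new(summaries, already_sent, send_sms):
    """Send one SMS per summary string that has not been sent before."""
    sent_now = []
    for summary in summaries:
        if summary in already_sent:
            continue  # this email was already announced
        send_sms(summary)  # stand-in for the PyGoogleVoice-backed method
        already_sent.add(summary)
        sent_now.append(summary)
    return sent_now
```

Passing the sender in as a callable keeps the sketch testable without a Google Voice account; for a dry run, any function that accepts a string (such as a list's append method) will do.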

In the end, I've written about 1 line of code and reused the rest. It has also taken far longer to write this blog post than to research and implement the whole hack. In my limited testing so far, though, it's worked well. The caveat is that I have to leave my laptop on and running Ubuntu, but hopefully I can get a server someday and avoid that. Or I could use that money and get a data plan. In any case, I'm pretty amazed with what can be done with mashing some open-source Python code together. Now I can happily avoid inbox withdrawal...

Wednesday, June 1, 2011

Ed: AMD & OpenCL vs CUDA

Edit (6/1/12): 2012 edition now available here

An interview of AMD Fusion marketing managers by bit-tech was recently posted on Slashdot. The interviewees predicted the death of CUDA, discussed the importance of GPU acceleration for consumer applications, and had no comment on developing ARM-based Fusion products. I wasn't very impressed with a lot of the answers. My opinions on what was said about OpenCL and the demise of CUDA are after the break. I'd like to make some comments about the role of Fusion in another post.

Sunday, May 22, 2011

VLSI Project Results

The semester is over, but I thought I should post results from another final project. For ESE 570 at Penn, two fellow juniors and I implemented a 4-bit signed Wallace Tree multiplier in 0.6μm CMOS. The design uses two CSAs for reduction and an 8-bit ripple-carry adder to deliver the final product. In the end, 952 transistors were used for implementation, with layout dimensions of 253.95μm x 499.95μm (area of 0.127 square millimeters). I'm happy to say it passed all of our tests too, and had a propagation delay of 3.1 nanoseconds, which we didn't feel was too shabby given our time constraints. I'm sure we could do better if we spent the time to size transistors appropriately. An image of the layout is shown below.
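
For readers who have not met carry-save adders (CSAs), here is a bit-level sketch in Python of the same dataflow: four partial products, two CSA stages, and a final 8-bit add. How the partial products and the sign are handled here (sign-extended rows, with the last row negated) is my assumption of one common approach, not a description of our actual layout, and the function names are made up for illustration.

```python
MASK = 0xFF  # the product is kept to 8 bits

def carry_save_add(x, y, z):
    """3:2 compressor: reduce three 8-bit words to a sum word and a carry word."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum & MASK, carry & MASK

def signed_multiply_4bit(a, b):
    """Multiply two 4-bit two's-complement values (-8..7) via two CSA stages."""
    assert -8 <= a <= 7 and -8 <= b <= 7
    a_ext = a & MASK  # sign-extend a to 8 bits
    rows = []
    for bit in range(4):
        row = (a_ext << bit) & MASK if (b >> bit) & 1 else 0
        if bit == 3:
            row = (-row) & MASK  # the sign bit of b carries weight -8
        rows.append(row)
    s1, c1 = carry_save_add(rows[0], rows[1], rows[2])  # 4 rows -> 3
    s2, c2 = carry_save_add(s1, c1, rows[3])            # 3 rows -> 2
    total = (s2 + c2) & MASK  # stands in for the 8-bit ripple-carry adder
    return total - 256 if total & 0x80 else total
```

The point of the CSA stages is that no carry has to ripple until the very last addition; each stage is only one full-adder delay deep.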

Implementing digital logic at this level really gives some extra insight into all the bit arithmetic learned in freshman/sophomore-level digital design courses. Sign-extension and shifting really are just dragging the wires around (and maybe a little bit of buffering). I can now say that I have some Cadence experience too. Phew...

Monday, April 25, 2011


As stated before, here is the post about my high arithmetic intensity kernel used for benchmarking. At the advice of course TA Jon McCaffrey, I looked into Black-Scholes option pricing. Black-Scholes is a model developed by Fischer Black and Myron Scholes for pricing European-style call and put stock options. A call option gives the buyer the right to buy a stock in the future for an agreed-upon price (the strike price). A put option is the opposite: the buyer has the right to sell a stock in the future for the strike price.

Econ lesson aside, the closed form solution for pricing these options has a lot of floating point arithmetic. A lot. You can see this by perusing the short kernel. Here is the OpenCL version: Each option pricing is distinct, so the problem is embarrassingly parallel. Lots of independent floating-point instructions, very few memory operations, and no control flow at all; is this AMD's GPU dream come true?
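
For reference, the closed-form solution can be written out in a few lines of Python. This is a sketch of the textbook formula for a non-dividend-paying stock, not the NVIDIA SDK kernel: it evaluates the normal CDF with math.erf, and all of the function and parameter names are my own.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes(spot, strike, years, rate, volatility):
    """Return (call, put) prices of a European option on a non-dividend stock."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * years) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t
    discounted_strike = strike * math.exp(-rate * years)
    call = spot * normal_cdf(d1) - discounted_strike * normal_cdf(d2)
    put = discounted_strike * normal_cdf(-d2) - spot * normal_cdf(-d1)
    return call, put
```

Note the mix of log, exp, sqrt, and several multiplies and adds per option, against only five inputs and two outputs; that ratio is what makes the kernel compute-bound.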

In the interest of time, I used the kernels and CPU implementations provided by the NVIDIA GPU Computing SDK for benchmarking (although I used my own test framework). All of the benchmark data is available at the SVN repository linked above, and I will be posting pretty graphs of this data soon.

Saturday, April 23, 2011

Data Collection Complete

I spent today finalizing the CUDA implementations for the SAT and my high compute kernel, a Black-Scholes option pricer. With that, all of my benchmarking programs are as complete as I can make them for this project. I ran the CUDA and OpenCL versions of both programs on the Fermi Tesla card in the lab, and have CSV files of data.

The other half of the equation was getting data for AMD cards for comparison. Very fortunately, Patrick found a friend, Aleksandar Dimitrijevic, who knew someone with a Radeon HD 6870. While this is not a Cayman-class GPU with a VLIW4 architecture, it is still the latest VLIW5 chip (codename Barts), and definitely would give interesting competition. Aleksandar sent me results for both the SAT and Black-Scholes kernels this morning, which was less than a day after I sent him the Black-Scholes implementation. A hearty thanks to both Aleksandar and Patrick for helping out so much in making this project possible.

Without further delay, data for both my low compute and high compute kernels is available in the project SVN repository here: I will post later on why the Black-Scholes kernel was selected for high arithmetic intensity and some implementation details.

I have Radeon HD 6870 Data!

I am a couple posts behind in explaining my progress over the last few days, so I'll keep this brief. I have data for my two tests, and from my brief perusal, it looks great. More updates coming over this final weekend.

Thursday, April 21, 2011

SAT OpenCL Port Upgraded and Complete

I spent today refactoring the OpenCL summed area table implementation to output plenty of timing information and take the thread block size and the input matrix size as parameters. I will port those changes to the CUDA implementation soon, and then performance can be compared for the two runtimes. Since they both compile to PTX code in the end, it will be a fight between NVCC and the OpenCL compiler (which I hear from Wikipedia is based on the Clang and LLVM projects).

I ran into quite a few oddities when working with the program. For some reason, a block size of 384 crashed the display driver. That was rather disheartening, especially when running this program on a Tesla C2070. The other good chunk of problems was working with this code on an ATI Mobility Radeon HD 4650. I was basing my device setup on an AMD guide for revision 2.3 of their Stream SDK (now called AMD APP). I was annoyed to find that the vendor string changed from "Advanced Micro Devices, Inc." to "AMD APP...", which completely breaks the device checking. I would think that such systems would be a bit more robust. The 4650 also could not handle thread block sizes of 128 or even 64, which I find hard to believe. I hope, though, that the program runs a lot better on newer cards.
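
One way to make the vendor check less brittle is to match against a small set of known markers instead of one exact string. The Python sketch below only illustrates the idea; the marker list and the function name are assumptions of mine, and nothing here comes from the AMD guide.

```python
# Illustrative markers only; a real check might need more entries.
KNOWN_AMD_MARKERS = ("advanced micro devices", "amd")

def is_amd_vendor(vendor_string):
    """Return True if a reported vendor string looks like an AMD platform."""
    normalized = vendor_string.strip().lower()
    return any(marker in normalized for marker in KNOWN_AMD_MARKERS)
```

A substring match like this can still produce false positives, so it is a trade of strictness for tolerance to SDK renames.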

I next have to find a high arithmetic intensity benchmark to analyze. We press on.

Wednesday, April 20, 2011

SAT OpenCL Port

I was able to port yesterday's summed area table (SAT) program from CUDA to OpenCL today. The code is in my SVN repository under sat_cl/. The porting took a little longer than anticipated. OpenCL is more akin to the CUDA Driver API than to the CUDA Runtime API: the kernel and its parameters are configured through objects rather than launched as a function call. It reminded me a lot of the GLSL shaders we used at the beginning of CIS 565, where the GPU code is taken as a string and compiled at runtime. While the interface is a bit quirky compared to the can't-believe-it's-not-C CUDA Runtime API, both the OpenCL API and documentation are well done.

Here are a few places that tripped me up in the porting process, in case anyone else runs into the same issues:
  • API error codes are very nice. NVIDIA has an oclErrorString() function in their SDK that I highly recommend copying.
  • cl_float4 and CUDA float4 are very different. The former is implemented as an SSE vector from what I can tell, while the latter is a struct. In Visual Studio 2008, I had to set the components of the cl_float4 with .s[0], .s[1], etc. Apparently this might not be the case for other dev environments though.
  • In OpenCL, the size of the NDRange (aka CUDA grid) is specified in terms of the number of work_items/threads, whereas in CUDA it is specified in terms of the number of thread blocks.
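
The last point is easy to get wrong, so here is the arithmetic as a small Python sketch (the function names are mine). It assumes the global size must be a whole multiple of the work-group size, which is why the OpenCL value is padded up.

```python
def cuda_grid_size(num_elements, block_size):
    """CUDA-style launch size: number of thread blocks, rounded up."""
    return (num_elements + block_size - 1) // block_size

def opencl_global_size(num_elements, local_size):
    """OpenCL-style launch size: total work-items, padded to a multiple of local_size."""
    return cuda_grid_size(num_elements, local_size) * local_size
```

For 1000 elements and a block size of 256, the CUDA grid is 4 blocks while the OpenCL global size is 1024 work-items; in both cases the kernel still needs a bounds check for the 24 extra threads.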

I was unfortunately a bit too sick to travel to the SIG lab today, but hopefully Fermi and I will have some benchmarking numbers soon to at least compare CUDA and OpenCL performance. One thing I will have to do first is find the best timer to use. My CUDA code uses the high-res GPU timer, but my thought is that results from that will not be comparable when running code on AMD hardware too. My next step is to find a good CPU timer to use for both CUDA and OpenCL. I'd appreciate any suggestions.

Tuesday, April 19, 2011

SAT Testing

Hello and sorry for the long delay between posts. I only wanted to post next when I had something to share. I have a few things this time around.

First is my SVN repository. It is hosted on Google Code here: The code is under an MIT license, so feel free to use or modify it in any way you please.

First up in my SVN repository is my super flexible templated forwards and backwards scan. It is located in the scan/ folder. I wrote it after thinking I would need it for summed area table calculation (more on that later). Turns out I was grievously wrong, but the code is tested and may be useful to someone. The kernel can do both forwards and backwards scans for arbitrary sizes.

The more important program in there is my summed area table implementation in the sat/ folder. I based this implementation on the work of Harris, Sengupta, and Owens in GPU Gems 3. I implemented the naive scan algorithm instead of the work-efficient one (which is pretty complex for an inclusive scan). After a good night's stay in the SIG lab, I managed to squash the bugs in the implementation. The key things I learned are that naive scans just don't always work in place, and printf() sometimes fixes bugs all by itself ;)
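
To show why the in-place version misbehaves, here is a sequential Python sketch of the naive (Hillis-Steele style) inclusive scan using two buffers. It is not my CUDA kernel; the function name is made up, and the inner loop stands in for what the GPU threads do in parallel.

```python
def naive_inclusive_scan(values):
    """Naive inclusive prefix sum using two buffers (ping-pong)."""
    source = list(values)
    offset = 1
    while offset < len(source):
        # Write into a fresh buffer: updating `source` in place would let
        # later elements read partial sums already overwritten this pass.
        target = list(source)
        for i in range(offset, len(source)):
            target[i] = source[i] + source[i - offset]
        source = target
        offset *= 2
    return source
```

Each pass reads only from the previous pass's buffer, so the order in which elements are processed within a pass no longer matters.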

The code is broken up into a few kernels. One is a SAT scan kernel that scans all the rows of the 2D matrix in parallel. Instead of applying this kernel to the columns as well (which would have severe memory coalescing issues), the matrix is transposed and then the kernel is again applied to the rows. The result is transposed again to arrive at the final answer.
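
The scan, transpose, scan, transpose pipeline can be sketched on the CPU in a few lines of Python. This only shows the data flow, assuming a rectangular list-of-lists matrix; the helper names are mine and the row scan uses itertools.accumulate rather than a GPU kernel.

```python
from itertools import accumulate

def scan_rows(matrix):
    """Inclusive prefix sum along every row."""
    return [list(accumulate(row)) for row in matrix]

def transpose(matrix):
    """Swap rows and columns of a rectangular matrix."""
    return [list(column) for column in zip(*matrix)]

def summed_area_table(matrix):
    """Row scan, transpose, row scan again, then transpose back."""
    return transpose(scan_rows(transpose(scan_rows(matrix))))
```

After the first transpose, the original columns are laid out as rows, so the second pass reads memory in the same contiguous pattern as the first.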

I implemented SAT at the suggestion of my course instructor, Patrick Cozzi. Since scans and transposes are heavily bandwidth-limited, this program will give good insight into memory performance for both Fermi and Cayman GPUs.

Next on the to-do list is to port this program to OpenCL so it can run on AMD cards. Afterwards I can do some benchmarking and performance analysis, and then move on to picking a high-arithmetic-intensity application. One week to go.

Saturday, April 9, 2011


On Wednesday, I gave a talk on Intel's Larrabee project. After reading through the SIGGRAPH 08 paper and presentations given by Tom Forsyth, I became very impressed with the work Intel's engineers did in advancing manycore computing. It's a shame that a Larrabee GPU won't be in our desktops and laptops. I hope that Intel's Many Integrated Core (MIC) project does at least bring a commercial product to the HPC developers currently confined to CUDA and OpenCL.

Slides to my presentation are available here:

The slides are a bit sparse, but here's a video with the slides and voice-over from the presentation:

Feel free to post any questions and/or mourn the loss of the Larrabee GPU in the comments.

Wednesday, March 23, 2011

GPU Compute Benchmarking

For my CIS 565 final project, I plan on developing benchmarks to compare the compute performance of the AMD Radeon HD 69xx series and the NVIDIA Fermi series of GPUs. I think this will be useful to the community, especially since these are the latest and greatest cards on the market. The two GPU families have very different architectures, so it'll be interesting to see where each card shines. I have a hunch that the AMD card could win at high ILP and arithmetic intensity workloads, but with NVIDIA's huge push on GPU compute, their Fermi line has the potential to sweep the test cases.

Soon, I'll have to start picking out those test cases. If anyone has any suggestions or things they'd like to see, please post away in the comments. In the meanwhile, my full gory project proposal is linked here.

Starting it off

Hi all, I'm going to position this blog as a place where I post updates on the projects I'm working on (and want to post updates on). The project I'm starting now is for my course on GPU Programming and Architecture, CIS 565.

Before I kick off a stream of updates on that, though, I figured I should link to a blog on a different project of mine from a year ago. Back in the day we built a wireless accelerometer-driven car with plenty of other features. Check out the above link to see some YouTube videos of it in action. Moving from developing for 2MHz microcontrollers to monsters like the NVIDIA Fermi has definitely been interesting so far...