Monday, June 20, 2011

Intel MIC/Larrabee is Coming!

(Note: in this post, like many others, I unfortunately don't give much background info. If you would like to learn more about Larrabee/the Intel MIC, you are welcome to check out the slides and video lecture I gave on the subject here. Larrabee architect Tom Forsyth also links to a wealth of sources on Larrabee at his home page.)

Intel Vice President Kirk Skaugen presented the Intel Many Integrated Core (MIC) platform at the International Supercomputing Conference (ISC) today. He unveiled the platform at the same event last year, where he showcased the Knights Ferry development kit and announced the Knights Corner product. While I don't see it in the presentation slides, news sites are reporting that Knights Corner products will be available (or go into production?) next year. Kirk's slides at least confirm that Knights Corner will be a >50-core part built with Intel's new 22nm tri-gate process.

The presentation at ISC 2011 also showed a variety of benchmarks from different research centers using the Knights Ferry development board. The GFLOPS numbers for the year-old Knights Ferry don't look too impressive to me; you can get 2 TFLOPS of SGEMM performance out of an AMD Radeon HD 5870 (released in 2009), for instance. However, we must keep in mind that those 2 TFLOPS came from writing the routine entirely in AMD Intermediate Language (IL), which is essentially assembly for AMD's 5-wide VLIW architecture. Larrabee's true importance is its ease of use. Take a look at the second page of this PDF to see how simple it can be to write parallel code for Larrabee. One could even argue that it's nicer than Microsoft's unreleased C++ AMP.

2012 is not here yet, though, and time is of the essence in the extremely fast-moving GPU world. NVIDIA is preparing Fermi's successor, and AMD's Graphics Core Next is also around the corner (I hope to write a post on that soon). The semiconductor industry knows never to underestimate Intel, though, and with more than 50 Larrabee cores, the advantages of the tri-gate process, and the might of Intel's software development tools, Knights Corner has the potential to shake up the GPU compute industry.

Thursday, June 16, 2011

Microsoft's C++ AMP (Accelerated Massive Parallelism)

Microsoft has just announced a new parallel programming technology called C++ AMP (which stands for Accelerated Massive Parallelism). It was unveiled in a keynote by Herb Sutter at AMD's Fusion Developer Summit 11. Video and slides from the keynote are available on MSDN Channel 9 here (Herb begins talking about C++ AMP around a half hour into the keynote).

The purpose of C++ AMP is to tackle the problem of heterogeneous computing. Herb argues for a single programming platform that can account for the differences in processing ability and memory models of CPUs, GPUs, and Infrastructure-as-a-Service (IaaS) cloud platforms. By basing it on C++0x, such a platform can provide the abstractions necessary for productivity while still allowing the best performance and hand-tuning ability. Let's dive straight into the code with an example given during Herb's keynote:

void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                 int M, int N, int W )
{
	array_view<const float,2> a(M,W,A), b(W,N,B);
	array_view<writeonly<float>,2> c(M,N,C);
	parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
		float sum = 0;
		for(int i = 0; i < a.x; i++)
			sum += a(idx.y, i) * b(i, idx.x);
		c[idx] = sum;
	} );
}


This is a function that performs floating-point matrix multiplication. I'll go through it line by line to see what's new in C++ AMP. There is certainly nothing different from regular C++ in the function argument list (disclaimer: my knowledge of C++ is minimal; school has caused me to stick with C). The next few lines, though, introduce a class called array_view. Herb described it in the keynote as an iterable array abstraction. We need this abstraction because we have no idea about the underlying memory model of the system our code is executing on. For example, if we are developing for an x86-64 CPU, then we have one coherent 64-bit address space. But if we are using a discrete GPU, then that GPU may have its own completely separate address space(s). With IaaS platforms, we may be dealing with incoherent memory as well. The array_view performs any memory copies or synchronization actions for us, so that our code is cleaner and can run on multiple platforms.

Next up is the parallel_for_each loop. Surprisingly, this is not a language extension by Microsoft, but just a function. Microsoft's engineers determined that by using lambda functions (a new feature of C++0x) as the objects that define compute kernels, they could avoid extending C++ with all sorts of data-parallel for loops. In this case, the lambda calculates the dot product of a row of a and a column of b over the grid defined by the output array_view c. The lambda takes a 2D index (the index<2> argument) that identifies which output element each invocation computes.

There is one keyword that I didn't explain: restrict. Herb says in the keynote that this is the only extension they had to make to C++0x to realize C++ AMP. restrict provides a compile-time check to ensure that code can execute on platforms of different compute capability. For instance, restrict(direct3d) ensures that the defined function will not attempt to do anything a DirectX 11-class GPU could not execute (such as throwing an exception or using function pointers). With this keyword, C++ AMP can have one body of code that runs on multiple platforms despite varying processor designs.

The ideas presented in this example alone make me excited about this platform. We only have to write whatever data-parallel code we need, and the runtime takes care of the details for us. This was the promise of OpenCL, but C++ AMP takes the concept further. There is no new language subset to account for the threading and memory models of GPUs. There is no need to worry about which compute node's memory space the data lives in. It also seems from this example that there is no need to size our workload for different thread and block counts as in CUDA; the runtime will handle that too. Microsoft showed an impressive demo of an n-body collision simulation program that could run on one core of a CPU, the on-die GPU of a Fusion APU, discrete GPUs, or even a discrete GPU and the Fusion GPU at the same time, all using one executable. They simply chose the compute resource from a GUI dropdown list.

There are plenty of details left to be answered, though. While Herb said in the keynote that developers will be free to performance-tune, we don't know how much control we will have over execution resources such as thread blocks. We also don't know what else is available in the C++ AMP API. Additionally, while Microsoft promises that C++ AMP will be an open specification, the dependence on DirectCompute calls into question the prospect of quality implementations on non-Windows platforms. Hopefully the hands-on session given at the Fusion summit by Daniel Moth will be posted online soon, and we can see what details were revealed there.

The announcement by Soma Somasegar notes that C++ AMP is expected to be part of the next Visual C++ and Visual Studio release. Herb announced in the keynote that AMD will release a compiler supporting C++ AMP for both Windows and, interestingly, non-Windows platforms. NVIDIA also announced its support, while noting that CUDA and Thrust are still the way to go ;). With the support of the discrete GPU vendors (note: nothing from Intel yet...) and the most popular development environment, C++ AMP has the potential to bring heterogeneous computing to a much larger developer market than CUDA or OpenCL can in their current form. I won't underestimate the ability of CUDA or OpenCL to catch up in ease of use by the time C++ AMP is released, though. In any case, I look forward to simpler GPU computing times ahead.

Thursday, June 2, 2011

Hack: Texting Myself New Emails

A project idea came to me today because of a predicament. I am beginning an internship, and I was told that I could not access personal email while working. I unfortunately do not own a smartphone, so this really meant that I would be away from my precious Gmail for at least 8 hours a day. I do have an unlimited text messaging plan, though, so I thought it would be great if I received a text message every time a new email came to my inbox.

The service that immediately came to mind for sending SMS was Google Voice. I was lucky to find an excellent API for Google Voice written in Python. With this, sending an SMS is literally 7 lines of code.
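
To give an idea of just how little code that is, here is a minimal sketch using the PyGoogleVoice library (the credentials and destination number below are placeholders, not the values I actually used):

from googlevoice import Voice

# Log in with a Google account (placeholder credentials)
voice = Voice()
voice.login('me@gmail.com', 'password')

# Send the text message (placeholder destination number)
voice.send_sms('5551234567', 'Hello from Python!')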

I also needed a way to find any new unread messages. I could parse Gmail's Atom feed and keep track of the most recent unread message, but then I would need to deal with login credentials. I am no Python expert, and I was too lazy to look up how to do this.

I realized, though, that I was already using a program called gm-notify with Ubuntu 10.04 to notify me of new unread emails with the messaging icon. I also realized that this program was written in Python. Time to hack.

I edited the gm-notify.py file in /usr/bin and added a new send_sms() method to the CheckMail class. The method simply copies the code from the PyGoogleVoice SMS example to send a text message (one minor change: I use the Gmail credentials that gm-notify already retrieved from the GNOME keyring, which just meant reading the username and password from the self.creds tuple). gm-notify nicely concatenates the sender and subject of every email it hasn't already notified about into a string. I simply call send_sms() with that string every time the program launches a notification bubble.
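
For the curious, here is a rough sketch of what the added method looks like. The self.creds tuple comes from gm-notify itself; the destination number and exact wiring are illustrative placeholders rather than the exact code:

# Added to gm-notify's CheckMail class (sketch, not the exact code)
from googlevoice import Voice

MY_NUMBER = '5551234567'   # placeholder: my phone number

def send_sms(self, text):
    """Text myself the sender/subject summary that gm-notify built."""
    user, password = self.creds        # credentials gm-notify pulled from the GNOME keyring
    voice = Voice()
    voice.login(user, password)        # Google Voice uses the same Google account
    voice.send_sms(MY_NUMBER, text)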

In the end, I've written about 1 line of code and reused the rest. It has also taken far longer to write this blog post than to research and implement the whole hack. In my limited testing so far, though, it's worked well. The caveat is that I have to leave my laptop on and running Ubuntu, but hopefully I can get a server someday and avoid that. Or I could use that money and get a data plan. In any case, I'm pretty amazed with what can be done with mashing some open-source Python code together. Now I can happily avoid inbox withdrawal...

Wednesday, June 1, 2011

Ed: AMD & OpenCL vs CUDA

Edit (6/1/12): 2012 edition now available here

An interview with AMD Fusion marketing managers, conducted by bit-tech, was recently posted on Slashdot. The interviewees predicted the death of CUDA, discussed the importance of GPU acceleration for consumer applications, and had no comment on developing ARM-based Fusion products. I wasn't very impressed with many of the answers. My opinions on what was said about OpenCL and the demise of CUDA are after the break. I'd like to comment on the role of Fusion in another post.