Programming Massively Parallel Processors

The next step for a CUDA newbie after CUDA by Example would naturally be the book Programming Massively Parallel Processors by David Kirk and Wen-mei Hwu. They taught the first course on CUDA at UIUC for a few semesters and this book is based on its lecture notes. I had used their lecture notes and videos when I first learnt CUDA.

Writing a CUDA application that maps a problem well onto the CUDA architecture requires the programmer to think very differently than when programming for a CPU. Using a matrix multiplication example, the authors walk the student through many levels of improvement. They introduce the different facets of the architecture along the way and end up improving the performance of the solution by as much as two orders of magnitude.
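To give a flavour of what one of those levels of improvement looks like, here is a minimal sketch of a shared-memory tiled matrix multiplication. It is my own illustration and not the book's code: the names matmul_tiled_kernel and matmul_tiled, the TILE width of 16 and the restriction to square row-major matrices are all assumptions made to keep the example short.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // tile width; an illustrative choice, not a tuned value

// Computes C = A * B for square n x n row-major matrices.
// Each thread block produces one TILE x TILE tile of C.
__global__ void matmul_tiled_kernel(const float* A, const float* B, float* C, int n)
{
    // Tiles of A and B staged in fast on-chip shared memory.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;

        // Each thread loads one element of each tile; out-of-range reads become 0.
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // wait until both tiles are fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tiles are overwritten in the next pass
    }

    if (row < n && col < n)
        C[row * n + col] = acc;
}

// Hypothetical host wrapper: copies inputs to the device, launches the
// kernel and copies the result back. Returns the first CUDA error seen.
cudaError_t matmul_tiled(const float* hA, const float* hB, float* hC, int n)
{
    if (n <= 0) return cudaSuccess;  // nothing to do

    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        matmul_tiled_kernel<<<grid, block>>>(dA, dB, dC, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

The point of the tiling is that each element of A and B is read from global memory once per tile rather than once per multiply, which is the kind of restructuring a CPU programmer would not normally think about.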

All the concepts of the CUDA architecture are covered: the thread-block-grid hierarchy, the global, shared and local memories, and barrier synchronization. Details of warps and the warp scheduler are explained. Since most CUDA applications are scientific, there is an entire chapter on the floating-point format. This chapter gives a practitioner’s perspective that I found more useful than the popular but obscure What Every Computer Scientist Should Know About Floating-Point Arithmetic. There are two chapters on application case studies, which are mostly useless, since the reader cannot understand the applications intimately enough to draw any lessons from them.
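One reason floating-point details matter to a CUDA programmer is that a parallel reduction typically adds numbers in a different order than a serial loop does, and floating-point addition is not associative. The small sketch below is my own illustration, not an example from the book; sum_left and sum_right are hypothetical helper names.

```cuda
#include <cuda_runtime.h>

// Adds three floats grouping from the left: (a + b) + c.
__host__ __device__ inline float sum_left(float a, float b, float c)
{
    return (a + b) + c;
}

// Adds the same three floats grouping from the right: a + (b + c).
__host__ __device__ inline float sum_right(float a, float b, float c)
{
    return a + (b + c);
}
```

Calling sum_left(1e20f, -1e20f, 1.0f) gives 1.0f, because the two large values cancel first. Calling sum_right with the same arguments gives 0.0f, because 1.0f is far too small to change -1e20f and is lost before the cancellation happens.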

CUDA runs only on NVIDIA devices. OpenCL is its twin, designed to be used on all kinds of CPU and GPU processors. The authors have thrown in a chapter on OpenCL for folks who need to transition to it. OpenCL is very similar to CUDA, except that it does not have an equivalent of the CUDA Runtime API. So, the programmer ends up spending some time building the scaffolding required to run the kernels.

Programming Massively Parallel Processors is an easy book to study from. It should be accessible to any intermediate-to-expert programmer. Newbies can check out CUDA by Example before studying this book. I do wish this book covered cache configuration, launch bounds, profiling, compiler options and the other intimate details that one ends up using to squeeze out the last bit of performance. Currently, I need to fall back on the CUDA Programming Guide for such information. The book is also a wee bit outdated, since the Fermi architecture is not well covered and the new Kepler architecture has already been released.
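For readers wondering what two of those missing knobs look like, here is a minimal sketch using launch bounds and a cache configuration hint. The kernel scale_kernel and the wrapper scale_on_device are hypothetical names of my own, and the numbers 256 and 2 are placeholders rather than tuned values.

```cuda
#include <cuda_runtime.h>

// __launch_bounds__(256, 2) tells the compiler that this kernel is never
// launched with more than 256 threads per block, and asks it to aim for at
// least 2 resident blocks per multiprocessor when allocating registers.
__global__ void __launch_bounds__(256, 2) scale_kernel(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical host wrapper: scales n floats in place via the GPU.
cudaError_t scale_on_device(float* host_data, float factor, int n)
{
    if (n <= 0) return cudaSuccess;  // nothing to do

    size_t bytes = (size_t)n * sizeof(float);
    float* dev_data = nullptr;

    // This kernel uses no shared memory, so hint that the on-chip memory is
    // better spent on L1 cache. Devices that cannot honour the hint ignore it.
    cudaError_t err = cudaFuncSetCacheConfig(scale_kernel, cudaFuncCachePreferL1);

    if (err == cudaSuccess) err = cudaMalloc(&dev_data, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;  // must not exceed the launch bound declared above
        int blocks = (n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>>(dev_data, factor, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_data);
    return err;
}
```

Compiling with nvcc -Xptxas -v prints the register and shared memory usage of each kernel, which is a quick way to see whether a launch bound changed anything.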

CUDA by Example

With single-processor speeds having hit a wall, there is a lot of interest in heterogeneous computing today. One of the popular ways to speed up applications is to rewrite them as massively parallel applications that execute on the NVIDIA CUDA architecture. It is quite hard to think of parallel solutions to existing problems, and writing CUDA programs can be a minefield. These factors have made learning to swim in the choppy waters of CUDA difficult for beginners. Despite an abundance of CUDA information on the web, there has been no introductory material that is both simple and of good quality. The new book CUDA by Example: An Introduction to General-Purpose GPU Programming, written by Jason Sanders and Edward Kandrot (both NVIDIA employees), aims to be such an introductory book for CUDA programming.

The only prerequisite expected of this book’s reader is knowledge of C. Spread over 12 quick chapters, the book uses example CUDA C programs throughout to introduce concepts and explain their usage. Every example program is thoroughly broken down and the authors explain every stage of the process. It is quite heartening to see this detailed hand-holding extend all the way through to the complex concepts and the last chapters. Chapters 1-5 are essential reading, and the reader should be able to write simple CUDA programs after this point. The rest of the chapters introduce concepts that are useful for further optimizing a CUDA solution to take advantage of the problem domain, the CUDA architecture, or both.

The book is strictly introductory, thankfully, and does not explain the CUDA architecture and its inner workings. I cannot commend the authors enough for taking this hard line and making the jump into CUDA as simple and painless as they have done here. It would be natural to read the CUDA Programming Guide after this and keep it around as a reference for CUDA programming. This book is perfect for any inquisitive programmer wanting a taste of CUDA to see if it is worth the time. The avid reader can easily finish this book over a weekend, having worked the examples and understood the major concepts.

Example code and errata of the book can be found here.