Prof. Dennis Shasha (NYU CS Advisor, Reader)
Prof. Benjamin Goldberg(NYU CS, Reader)
Prof. Allan Gottlieb (NYU CS, Reader)
Prof. Zvi Kedem (NYU, Auditor)
Prof. Thomas Weis (NYU, Auditor)
Time: Thursday, February 7, 11:00am
Place: Warren Weaver Hall (Room 517)
Locality Optimization for Data Parallel Programs
Productivity languages such as NumPy and Matlab make it much easier to implement data-intensive numerical algorithms than it is to implement them in efficiency languages such as C++. This is important as many programmers (1) aren't expert programmers; or (2) don't have time to tune their software for performance, as their main job focus is not programming per se. The tradeoff is typically one of execution time versus programming time, as unless there are specialized library functions or precompiled primitives for your particular task a productivity language is likely to be orders of magnitude slower than an efficiency language.
In this thesis, we present Parakeet, an array-oriented language embedded within Python, a widely-used productivity language. The Parakeet just-in-time compiler dynamically translates whole user functions to high performance multi-threaded native code. This thesis focuses in particular on our use of data parallel operators as a basis for locality enhancing program optimizations. e transform Parakeet programs written with the classic data parallel operators (Map, Reduce, and Scan; in Parakeet these are called adverbs) to process small local pieces (called tiles) of data at a time. To express this locality we introduce three new adverbs: TiledMap, TiledReduce, and TiledScan. These tiled adverbs are not exposed to the programmer but rather are automatically generated by a tiling transformation.
We use this tiling algorithm to bring two classic locality optimizations to a data parallel setting: cache tiling, and register tiling. We set register tile sizes statically at compile time, but use an online autotuning search to find good cache tile sizes at runtime. We evaluate Parakeet and these optimizations on various benchmark programs, and exhibit excellent performance even compared to typical C implementations.