Matrix Multiply Assignment

Due: Tues, Sep 16 at 5:00 pm

[ Problem | Implementation | Submission | Resources ]


For this first assignment, you will optimize a routine to multiply square matrices. Matrix multiplication is a basic building block in many scientific codes; and since it is an O(n3) algorithm, these codes often spend a lot of their time in matrix multiplication. The most naive code to multiply matrices is short, sweet, and slow:

  for i = 1 to n
    for j = 1 to n
      for k = 1 to n
        C[i,j] = C[i,j] + A[i,k] * B[k,j]

We're providing implementations of the trivial unoptimized matrix multiply code in C and Fortran, and a very simple blocked implementation in C. We're also providing a wrapper for the BLAS (Basic Linear Algebra Subroutines) library. The BLAS is a standard interface for building blocks like matrix multiplication, and many vendors provide optimized versions of the BLAS for their platforms. You may want to compare your routines to codes that others have optimized; if you do, think about the relative difficulty of using someone else's library to writing and optimizing your own routine. Knowing when and how to make use of standard libraries is an important skill in building fast programs!


You need to write a dgemm.c that contains a function with the following C signature:

      square_dgemm(const unsigned M,
                   const double *A, const double *B, double *C)

Note that the matrices are stored in column-major order. Also, your program will actually be doing a multiply and add operation:

    C := C + A*B

Look at the code in basic_dgemm.c if you find this confusing.

The necessary files are in matmul.tar.gz. Included are the following:

a sample Makefile, with some basic rules,
the driver program,
a very simple square_dgemm implementation,
a slightly more complex square_dgemm implementation
another wrapper that lets the C driver program call the dgemm routine in BLAS implementations,
a sample gnuplot script to display the results,

We will be testing on the 2.33 GHz Intel Pentium III Xeon (Katmai) machines on the Dell cluster. These are quad-core nodes (but you will be using only one processor of these). See the page on using the class machines for more detailed information on how to use the Dell cluster. Also see the NYU HPC wiki for more information.

The y-axis in the gnuplot script plots is labeled MFlop/s, but really it should be "MFlop/s assuming you were using the standard algorithm." If you use something like Strassen's, we will still measure time/(2n3). However, the numerical error checking may haunt you.

You will get a lot of improvement from tiling, but you may want to play with other optimizations, too. Strassen-like algorithms, copy optimizations, recursive data layout, ... you can get some pretty elaborate optimization strategies. We'll cover some of the possibilities in class, but you might want to do some independent reading, too. The off-limit optimizations are those that weaken the floating point system.


Your group should submit your dgemm.c, Makefile (for compiler options) and a write-up. Your write-up should contain:

To show the results of your optimizations, include a graph comparing your dgemm.c with the included basic_dgemm.c. Your explanations should rely heavily on knowledge of the memory hierarchy. (Benchmark graphs help.)

Please tar up your group's dgemm.c, write-up, and associated files and mail us either a URL from which I can retrieve the tar file or an encoded tar file.