Scientific Computing: Coding Tips

I have added this page in the interest of keeping a running record of some of the advice I give over the course of the semester about computing and presenting computations.

Apr 20, 2009: MATLAB vectorization

MATLAB operates more efficiently if you program with vector operations than if you program with loops. In some cases, the difference is dramatic enough that it is worth vectorizing even if the vector version involves some extra work. For example, consider this code that contains two implementations of a rejection sampler that uniformly samples from the unit circle. The version using loops nominally does less arithmetic and less random number generation on average; it generates a million points in about 16 seconds on my desktop. The vectorized version generally generates about 10 percent more numbers than it needs, does more arithmetic, uses more storage -- and generates a million numbers in 0.25 seconds. On my laptop, running Octave, the difference is 47 seconds versus 1 second.

Mar 11, 2009: Dynamic arrays

The GNU compilers provide a few extensions to C/C++. One of these extensions is automatic dynamic arrays; for example

  void goofy_operation(int n)
  {
      int tmp[n];  // GNU extension
      for (int i = 0; i < n; ++i)
          tmp[i] = i;
      // ...
  }

A more portable alternative is to use a vector object from the standard library

  void goofy_operation(int n)
  {
      std::vector<int> tmp(n);  // requires #include <vector>
      for (int i = 0; i < n; ++i)
          tmp[i] = i;
      // ...
  }

If n is at all large, you probably would prefer the second version even if you planned to use the GNU compiler suite. The reason is that storage for vectors is allocated on the heap, which generally has lots of space. If we used a GNU-style automatic array, the allocation would be done on the stack, which has rather less space. Allocating a couple megabytes on the stack is a good way to crash your program on many systems.
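To make the size difference concrete, here is a sketch (the function and its name are made up for illustration): a vector of a million doubles -- roughly 8 MB -- is an unremarkable heap allocation, while the same storage as a GNU-style automatic array could easily blow past a default stack limit.

```cpp
#include <vector>

// Roughly 8 MB of storage, allocated on the heap by std::vector.
// Writing `double tmp[n]` here instead would request the same 8 MB
// on the stack, which commonly defaults to only 1-8 MB.
double fill_and_sum(int n)
{
    std::vector<double> tmp(n);
    for (int i = 0; i < n; ++i)
        tmp[i] = 1.0;
    double sum = 0;
    for (int i = 0; i < n; ++i)
        sum += tmp[i];
    return sum;
}
```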

Mar 11, 2009: Tools for catching bugs

There are a few tools that I use all the time when I'm writing C++ code:

Mar 11, 2009: Variable scoping

As a rule of thumb, it's a good idea to declare variables in C++ with the smallest possible scope. For example, rather than writing

  int i, ai;
  for (i = 0; i < n; ++i) {
      ai = a[i];
      // Something that uses ai
  }

I would tend to write

  for (int i = 0; i < n; ++i) {
      int ai = a[i];
      // Something that uses ai
  }

There are several advantages to giving variables the most local possible scope. First, when variable declarations, initializations, and updates are all within a few lines of each other, it's relatively easy to follow the flow of control and data involving those variables. Second, it makes sense to give a short name to a variable that's used in a very short piece of code; a variable that you have to remember for twenty or thirty lines probably deserves a longer, more descriptive name. Thus, keeping scopes local saves some typing and some line space. Third, if you define variables locally, it's harder to make the type of mistake where you recycle a variable without properly re-initializing it.

Mar 11, 2009: Indentation for C++

People have strong -- and wildly divergent -- opinions about code formatting conventions. I'm no exception. My own style tends to be close to the Linux kernel coding style (though I use 4-character indentation); this is mostly because the Linux kernel style is close to the style used by Stroustrup and by Kernighan and Ritchie.

There are some people who seem to write code with no formatting conventions, or with conventions that I consider severely misguided. Fortunately, there are code reformatting tools. I use a tool called Artistic Style to reformat code that strays too far from the family of formatting conventions with which I'm comfortable.

Feb 23, 2009: Documentation and style

Code is written for two audiences. The first audience is the computer: if your code can't make it past the compiler, it probably will not be of much use. Computers are generally good at pointedly complaining about certain types of problems, so programmers get used to writing code that is at least syntactically correct. Alas, many people are not as good at writing code to be appreciated by the second, human audience.

Part of writing code for humans to read is providing relevant comments. Novice programmers sometimes take this overboard. For example, consider the following:

  for i = 1:10        % Here we increment the counter variable i from the value one to the value ten in steps of one.
    result(i) = f(i); % In this line, we will evaluate the function f with the argument i.  The result of the function evaluation is stored in the ith element of an array called result, which will eventually have length 10.
  end                 % This is the end of the for loop with the variable i.

This code is obvious to anyone who knows the MATLAB programming language, and the comments provide an unwelcome distraction. I have had the misfortune of reading code that looked pretty much identical to this. After grumbling to myself for a bit, I wrote a script that stripped out all the comments from that code. This cheered me up, and made the code remarkably more legible.

If line-by-line comments are too much, what is an appropriate level at which to comment? I use the following rules of thumb:

An interesting alternative to writing comments around code is to write code as an integral part of an ordinary prose document. This is the basic idea of the literate programming style championed by Don Knuth.

Comments are not the only thing that make code legible. Some other things that make code easier to read are:

Feb 23, 2009: Defensive coding

As most people learn to program, they also learn to step through code, effectively simulating the computer in their heads. This is an excellent way to understand what code is doing, and -- if there are bugs -- to figure out how to fix them.

A twist on the same strategy is to step through a code and to ask on each line (or in each expression): what could possibly go wrong? What happens if this expression gets an infinity or NaN input? What happens if the file I'm trying to open isn't there? Could this argument go out of range? Could this linear system be singular? Once you've figured out whether there's a possible problem, you can look at the surrounding code and figure out whether it has been structured in a way that makes the possible problem structurally impossible.

What if you think a problem is structurally impossible, but aren't sure? A natural thing to do is to state a precondition that ensures that the problem will not occur, and enforce that precondition with an assert statement. If an assertion fails during the program execution, you at least know where the problem is.
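For example (the routine and its name are hypothetical, just to illustrate the pattern), a mean function might assert that it never receives an empty array:

```cpp
#include <cassert>
#include <cstddef>

// Precondition: n > 0.  An empty array has no mean; the assert makes
// the assumption explicit, so a violation fails right here rather
// than producing a silent division by zero somewhere downstream.
double mean(const double* x, std::size_t n)
{
    assert(n > 0 && "mean() requires a nonempty array");
    double sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum / n;
}
```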

What if you are sure that your code works for reasonable inputs, but it breaks on invalid input? In this case, the least you can do is document the assumptions you make on the input, so that programmers using your routine (including yourself) will know what to watch for. Sometimes this is the most you can do, too, if checking the validity of the input data is infeasible or is too expensive. Ideally, though, you'd like your routines to at least provide the option of checking the input data for validity.

What about pre-production code that you're sure nobody will see until it has been cleaned up? If you know in advance that you're only trying out ideas, it might make sense to omit some checks -- but it's easy to take this too far. Keep in mind that very few one-off codes actually get used only once, and that under deadline pressure, a deferred careful implementation often becomes a careful implementation that never gets written.

Feb 23, 2009: Passing context in MATLAB

One of the features of the MATLAB programming language is anonymous functions (sometimes called lambda functions). For example, if we wanted g(x) to be the function mapping x to x + 1, we could write

g = @(x) x + 1;

The computation done by these functions can depend not only on the function arguments, but also on variables defined in a surrounding environment. This language feature is called a closure, and it provides a very useful way to provide context information. For example, suppose we want to integrate some monomials using MATLAB's quad function. Then we might write

  for p = 1:pmax
    I(p) = quad(@(x) x.^p, 0, 1);
  end

Anonymous functions and closures are common in functional programming languages, including LISP, ML, Haskell, and so forth. They are not a native feature of C++, though packages like Boost attempt to provide the semblance of anonymous closures at the language level (see the documentation for the Boost Lambda library, for example).

Feb 23, 2009: Passing context in C++

In the integration routines we wrote as part of HW 3, we expressed the interface to the integrand in terms of function pointers. The disadvantage of this particular interface is that we were forced to add an extra argument to accommodate parameters to the function -- whether or not the integrand we were interested in was actually parameterized.

In C++, a function object is an object that can be treated like a function. Unlike a function, though, a function object provides a natural way to carry around context, such as a parameter or a count of the number of times the function has been called. Function objects are particularly useful in conjunction with generic algorithms written using function templates.

As an example, let's consider a (slightly simplified) version of our midpoint rule routine, written as a function template:

template <class F>
double midpoint_rule(F f, double a, double b, unsigned n)
{
    double I = 0;
    double h = (b-a)/n;
    double xmid = a+h/2;
    for (unsigned i = 0; i < n; ++i, xmid += h)
        I += h*f(xmid);
    return I;
}

The template parameter F can be matched to any type that can be called with a single double argument to return a double value. That includes ordinary functions from doubles to doubles, but it also includes function objects. The compiler uses the template as a recipe to generate a different version of midpoint_rule for each type F.
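For instance, a plain function can be passed directly (a sketch; the midpoint_rule template is repeated so the fragment stands alone, and the function square is made up for illustration):

```cpp
template <class F>
double midpoint_rule(F f, double a, double b, unsigned n)
{
    double I = 0;
    double h = (b-a)/n;
    double xmid = a+h/2;
    for (unsigned i = 0; i < n; ++i, xmid += h)
        I += h*f(xmid);
    return I;
}

// An ordinary function from double to double matches F just as well
// as a function object does.
double square(double x) { return x*x; }

// midpoint_rule(square, 0, 1, 1000) then approximates the integral
// of x^2 over [0,1], which is 1/3.
```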

How would we use this template in practice? Suppose we wanted to test our routine on monomials, and have an existing function monomial as follows:

double monomial(double x, int p)
{
    double result = 1;
    for (int i = 0; i < p; ++i)
        result *= x;
    return result;
}

We cannot pass monomial directly to midpoint_rule, because it takes two arguments. Therefore, we write a function object adapter that keeps the second argument as context information:

class MonomialFun {
public:
    MonomialFun(int p) : p(p) {}
    double operator()(double x) { return monomial(x, p); }
private:
    int p;
};

We can use MonomialFun objects with midpoint_rule:

void check_monomial1()
{
    for (int p = 0; p < 3; ++p) {
        double I = midpoint_rule(MonomialFun(p), 0, 1, 1);
        printf("Error for x^%d: %g\n", p, I-1.0/(p+1));
    }
}

Writing a function object to store the variable p is arguably more natural than writing an adapter function as we would with C-style function pointers. But this is still a little clunky. If we are willing to go beyond the C++ standard library, we can write things more concisely. Boost is a collection of useful C++ libraries that are not (yet) part of the standard. Among them is a library called Bind, which can be used to automatically generate function objects that partially bind the arguments to a function -- which is essentially what we did above. If we install Boost and include boost/bind.hpp, we can rewrite the above routine as

void check_monomial2()
{
    for (int p = 0; p < 3; ++p) {
        double I = 
          midpoint_rule(boost::bind(monomial, _1, p), 0, 1, 1);
        printf("Error for x^%d: %g\n", p, I-1.0/(p+1));
    }
}

The call boost::bind(monomial, _1, p) returns a function object that can be called with one argument x in order to make the call monomial(x, p). This is much more concise than our original implementation.

Feb 9, 2009: C and C++ I/O

Q: It appears to me that the C++ code you wrote for creating the .txt files containing the outputs reflects two different methods of writing the files. Specifically, I'm referring to question 8 in HW 1 and question 3 in HW 2. Is this true, and if so, can you comment a bit on the differences between the two methods?

A: Absolutely. C++ has two basic I/O libraries: the stdio library from C, and the iostream library introduced in C++. The C++ library has several advantages: you can add output methods for your own types, for example, and everything is type safe so that it's harder to make mistakes with C++ I/O than with C I/O. In contrast, the C stdio library allows you to write things like

 double x = 1.0;
 printf("%d\n", x);  // x should be an int for this to work!

A modern compiler will usually warn you about such problems, but it's not required.

The advantage of C-style I/O -- and it's a pretty big advantage if you spend your time printing floating point numbers -- is that the format strings used by printf and fprintf are remarkably powerful and concise. For example, if I wanted to print a floating point number in scientific notation to three figures (two after the decimal point), and I wanted a space before any positive numbers so that the digits in positive and negative numbers lined up in successive lines on a printout, I could write

 printf("% .2e\n", x);
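A short demonstration of the effect (a sketch; the function name is made up):

```cpp
#include <cstdio>

// The space flag in "% .2e" reserves a column for the sign, so
// positive and negative values line up in successive rows:
//  3.14e+01
// -2.72e+00
//  1.23e-04
void print_aligned()
{
    double vals[] = { 31.4159, -2.71828, 0.000123 };
    for (int i = 0; i < 3; ++i)
        printf("% .2e\n", vals[i]);
}
```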

The Boost Format library gives some of the best of both worlds. I hope that one day something like it will be part of the standard library.

Feb 1, 2009: Plotting

It generally makes sense to use a logarithmic scale for errors and for step sizes. On a linear scale, these plots tend to be extremely uninformative; for most of the x values, the y value is visually indistinguishable from zero.

There are some plotting packages (not scientific plotters like the ones in MATLAB) in which a semilogarithmic plot does not use a purely logarithmic y scale. Instead, the major tick marks at integer powers of ten are evenly spaced, but the scale between successive powers of ten appears to be linear (or at least non-logarithmic). On such a scale, a purely exponential curve turns into a wiggly line rather than a straight line. Please try to avoid this.

Feb 1, 2009: Recurrences

Just because a function is defined via a recurrence does not mean that it needs to be computed using a recursive procedure. In fact, a naively written recursive procedure can be extremely expensive. For example, if you compute the nth Fibonacci number using the obvious recursive routine, the computational cost grows exponentially with n; if you compute it using a loop, the cost grows linearly. It's possible to turn the obvious recursive routine into something more efficient by caching results, a process known as memoization; but for most of the recurrences we will work with, this is probably not the clearest way to write the code.
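A sketch of the two Fibonacci implementations makes the cost difference concrete:

```cpp
// Naive recursion: the number of calls grows exponentially in n,
// because each subproblem is recomputed many times over.
long fib_recursive(int n)
{
    if (n < 2) return n;
    return fib_recursive(n-1) + fib_recursive(n-2);
}

// Loop: n additions and constant storage.
long fib_loop(int n)
{
    long a = 0, b = 1;  // F(0) and F(1)
    for (int i = 0; i < n; ++i) {
        long t = a + b;
        a = b;
        b = t;
    }
    return a;
}
```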

When you want to run a short recurrence forward for several steps and you only care about the last value of the recurrence, it usually makes sense to throw away the old values you no longer need. A slick way to do this is to map the kth value of an m-term recurrence into element k modulo m of an array of length m. For example, the Legendre polynomials satisfy the three-term recurrence (n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x), with starting values P_0(x) = 1 and P_1(x) = x. I might code this as


  double legendre(double x, int n)
  {
      double P[3];
      P[0] = 1;
      P[1] = x;
      for (int i = 1; i < n; ++i)
          P[(i+1)%3] = ( (2*i+1)*x*P[i%3] 
                           - i*P[(i-1)%3] )/( i+1 );
      return P[n%3];
  }  

Feb 1, 2009: C++ Programming

Based on the first homework, I have the following comments: