I have added this page in the interest of keeping a running record of some of the advice I give over the course of the semester about computing and presenting computations.
Apr 20, 2009: MATLAB vectorization
MATLAB operates more efficiently if you program with vector operations than if you program with loops. In some cases, the difference is dramatic enough that it is worth vectorizing even if the vector version involves some extra work. For example, consider this code that contains two implementations of a rejection sampler that uniformly samples from the unit circle. The version using loops nominally does less arithmetic and less random number generation on average; it generates a million points in about 16 seconds on my desktop. The vectorized version generally generates about 10 percent more numbers than it needs, does more arithmetic, uses more storage -- and generates a million numbers in 0.25 seconds. On my laptop, running Octave, the difference is 47 seconds versus 1 second.
Mar 11, 2009: Dynamic arrays
The GNU compilers provide a few extensions to C/C++. One of these extensions is automatic dynamic arrays; for example
void goofy_operation(int n)
{
int tmp[n]; // GNU extension
for (int i = 0; i < n; ++i)
tmp[i] = i;
// ...
}
A more portable alternative is to use a vector object from the standard library
void goofy_operation(int n)
{
std::vector tmp(n);
for (int i = 0; i < n; ++i)
tmp[i] = i;
// ...
}
If n is at all large, you probably would prefer the second version even if you planned to use the GNU compiler suite. The reason is that storage for vectors is allocated on the heap, which generally has lots of space. If we used a GNU-style automatic array, the allocation would be done on the stack, which has rather less space. Allocating a couple megabytes on the stack is a good way to crash your program on many systems.
Mar 11, 2009: Tools for catching bugs
There are a few tools that I use all the time when I'm writing C++ code:
- First, there's my compiler. I can ask the compiler to warn me if it sees something that looks fishy, by compiling with all warnings on: g++ -Wall mycode.cc. Other compilers typically have a similar option. Second, there's valgrind. Valgrind will tell you whenever you're reading from memory you haven't initialized, writing to memory you don't own, deleting something you've already deleted, etc. It's incredibly valuable. Alas, it's basically only available under Linux, though that might change at some point. There are other packages that provide similar functionality, including Purify, but most of them cost a pretty penny.
- Third, there's the debugger. I tend to default to gdb; it's probably an old-fashioned choice, but it works for me. Visual Studio and Xcode have their own debuggers. It's worthwhile spending the time to learn to use them.
Mar 11, 2009: Variable scoping
As a rule of thumb, it's a good idea to declare variables in C++ with the smallest possible scope. For example, rather than writing
int i, ai;
for (i = 0; i < n; ++i) {
ai = a[i];
// Something that uses ai
}
I would tend to write
for (int i = 0; i < n; ++i) {
int ai = a[i];
// Something that uses ai
}
There are several advantages to making variable declarations have the most local possible scope. First, when variable declarations, initializations, and updates are all within a few lines of each other, it's relatively easy to follow the flow of control and data involving those variables. Second, it makes sense to give a short name to a variable that's used in a very short piece of code; a variable that you have to remember for twenty or thirty lines probably deserves a larger name. Thus, keeping scopes local saves some typing and some line space. Third, if you define variables locally, it's more difficult to make the type of mistake where you recycle a variable without properly re-initializing it.
Mar 11, 2009: Indentation for C++
People have strong -- and wildly divergent -- opinions about code formatting conventions. I'm no exception. My own style tends to be close to the Linux kernel coding style (though I use 4-character indentations); this is mostly because the Linux kernel style is close to the style used by Stroustroup and by Kernighan and Ritchie.
There are some people who seem to write code with no formatting conventions, or with conventions that I consider severely misguided. Fortunately, there are code reformatting tools. I use a tool called Artistic Style to reformat code that strays too far from the family of formatting conventions with which I'm comfortable.
Feb 23, 2009: Documentation and style
Code is written for two audiences. The first audience is the computer: if your code can't make it past the compiler, it probably will not be of much use. Computers are generally good at pointedly complaining about certain types of problems, so programmers get used to writing code that is at least syntactically correct. Alas, many people are not as good at writing code to be appreciated by the second, human audience.
Part of writing code for humans to read is providing relevant comments. Novice programmers sometimes take this overboard. For example, consider the following:
for i = 1:10 % Here we increment the counter variable i from the value one to the value ten in steps of one.
result(i) = f(i); % In this line, we will evaluate the function f with the argument i. The result of the function evaluation is stored in the ith element of an array called result, which will eventually have length 10.
end % This is the end of the for loop with the variable i.
This code is obvious to anyone who knows the MATLAB programming language, and the comments provide an unwelcome distraction. I have had the misfortune of reading code that looked pretty much identical to this. After grumbling to myself for a bit, I wrote a script that stripped out all the comments from that code. This cheered me up, and made the code remarkably more legible.
If line-by-line comments are too much, what is an appropriate level at which to comment? I use the following rules of thumb:
- If a computation is non-obvious and there is a way to rewrite it that makes it obvious, it's probably better to switch to the obvious version than to introduce a comment. On the other hand, sometimes a non-obvious computation is critical to either performance or accuracy. In this case, a comment is necessary.
-
If it's possible to choose variable names that are obvious enough so that
no comment is needed, that's ideal. The criterion for
obvious enough
should get more and more stringent as the scope of the variable gets broader and broader. Commenting on the counter variable i visible only over a short loop is unnecessary; a global variable called i should result in immediate criminal prosecution, and the possibility of life without parole if there is no comment. - Data structures and the relationships between them should be documented in comments.
-
Non-obvious preconditions and postconditions should be
documented. Obvious preconditions (e.g.
the output array should not be a NULL pointer
) may not need an explicit comment, though the code should certainly guard against them.
An interesting alternative to writing comments around code is to write code as an integral part of an ordinary prose document. This is the basic idea of the literate programming style championed by Don Knuth.
Comments are not the only thing that make code legible. Some other things that make code easier to read are:
- Consistent formatting: Choose a style for indentation, brace placement, etc. and stick to it. Tools like indent and astyle are helpful for enforcing style conventions.
- Consistent naming: If you're writing about analysis, you don't torture your readers by making N a tiny quantity and letting epsilon and delta go to infinity. The unspoken conventions say that epsilon and delta should be small, and N is likely an integer. If you're writing code, think similarly about naming conventions. What are the main concepts, and what are you going to call them? A concise, descriptive variable name is worth its weight in gold -- maybe more, since variable names don't really weigh anything.
- Short functions: A function that takes more than a page of text is worth reconsidering -- would it be simpler to read if it were split into two functions? Similarly, functions that use a lot of variables and functions that involve deep loop and conditional nests are often easier to read if they are split up.
- Single-function functions:
One reason functions get too long is because sometimes they are asked to
do too much. A function should do one thing well, though it should do
that one thing as generally as is feasible. Specifically, it is generally
a good idea to keep I/O separate from the
business
logic of your code. - Clean error handling: One of the difficult aspects of error handling is that it tends to clutter code. To the extent possible, keep the error handling code visually distinct from the business logic in your functions. Error handling can be subtle, and you want a reader to be able to see how things work in the common case before you force them to contemplate all the ways in which things could go horribl awry.
- Avoiding micro-optimization:
Optimization often results in another tradeoff: should the code be as
fast as possible, or should it be legible? Usually it's better to err
in favor of legibility. Legible code is easier to read and debug, and
is more likely to get the correct answer. Also, it's easy to
thoughtlessly
optimize
codes in a way that don't make much improvement in the overall performance, either because the compiler could handle that sort of optimization better or because the routine you optimized was responsible for only a small fraction of the overall run time.
Feb 23, 2009: Defensive coding
As most people learn to program, they also learn to step through code, effectively simulating the computer in their heads. This is an excellent way to understand what code is doing, and -- if there are bugs -- to figure out how to fix them.
A twist on the same strategy is to step through a code and to ask on
each line (or in each expression), what could possibly go
wrong?
What happens if this expression gets an infinity or NaN
input? What happens if the file I'm trying to open isn't there?
Could this argument go out of range? Could this linear system be
singular? Once you've figured out whether there's a possible
problem. you can look at the surrounding code and figure out
whether it has been structured in a way that makes the possible
problem structurally impossible.
What if you think a problem is structurally impossible, but aren't sure? A natural thing to do is to state a precondition that ensures that the problem will not occur, and enforce that precondition with an assert statement. If an assertion fails during the program execution, you at least know where the problem is.
What if you are sure that your code works for reasonable inputs, but it breaks on invalid input? In this case, the least you can do is document the assumptions you make on the input, so that programmers using your routine (including yourself) will know what to watch for. Sometimes this is the most you can do, too, if checking the validity of the input data is infeasible or is too expensive. Ideally, though, you'd like your routines to at least provide the option of checking the input data for validity.
What about pre-production code that you're sure nobody will see until it
has been cleaned up? If you know in advance that you're only trying out
ideas, it might make sense to omit some checks -- but it's easy to take this
too far. Keep in mind that very few one-off
codes actually get used
only once, and that deadline pressures often mean that defering a careful
implementation means a careful implementation will never be written.
Feb 23, 2009: Passing context in MATLAB
One of the features of the MATLAB programming language is anonymous functions (sometimes called lambda functions). For example, if we wanted g(x) to be the function mapping x to x + 1, we could write
g = @(x) x + 1;
The computation done by these functions can depend not only on the function arguments, but also on variables defined in a surrounding environment. This language feature is called a closure, and it provides a very useful way to provide context information. For example, suppose we want to integrate some monomials using MATLAB's quad function. Then we might write
for p = 1:pmax
I(p) = quad(@(x) x.^p, 0, 1);
end
Anonymous functions and closures are common in functional programming
languages, including LISP, ML, Haskell, and so forth. They are not
a native
feature of C++, though packages like Boost attempt to
provide the semblance of anonymous closures at the language level
(see the documentation for the
Boost
Lambda library, for example).
Feb 23, 2009: Passing context in C++
In the integration routines we wrote as part of HW 3, we expressed the interface to the integrand in terms of function pointers. The disadvantage of this particular interface is that we were forced to add an extra argument to accomodate parameters to the function -- whether or not the integrand we were interested in was actually parameterized.
In C++, a function object is an object that can be treated like a function. Unlike a function, though, a function object provides a natural way to carry around context, such as a parameter or a count of the number of times the function has been called. Function objects are particularly useful in conjunction with generic algorithms written using function templates.
As an example, let's consider a (slightly simplified) version of our midpoint rule routine, written as a function template:
template <class F>
double midpoint_rule(F f, double a, double b, unsigned n)
{
double I = 0;
double h = (b-a)/n;
double xmid = a+h/2;
for (unsigned i = 0; i < n; ++i, xmid += h)
I += h*f(xmid);
return I;
}
The type class F can be matched to any type that can be called with a single double argument to return a double value. That includes ordinary functions from doubles to doubles, but it can also include function objects. The compiler uses the template as a recipe to generate a different version of midpoint_rule for each type F
How would we use this template in practice? Suppose we wanted to test our routine on monomials, and have an existing function monomial as follows:
double monomial(double x, int p)
{
double result = 1;
for (unsigned i = 0; i < p; ++i)
result *= x;
return result;
}
We cannot pass monomial directly to midpoint_rule, because it takes two arguments. Therefore, we write a function object adapter that keeps the second argument as context information:
class MonomialFun {
public:
MonomialFun(int p) : p(p) {}
double operator()(double x) { return monomial(x, p); }
private:
int p;
};
We can use MonomialFun objects with midpoint_rule:
void check_monomial1()
{
for (unsigned p = 0; p < 3; ++p) {
double I = midpoint_rule(MonomialFun(p), 0, 1, 1);
printf("Error for x^%d: %g\n", p, I-1.0/(p+1));
}
}
Writing a function object to store the variable p is arguably more natural than writing an adapter function as we would do with C-style function pointers. But this is still a little clunky. If we are willing to go beyond the C++ standard library, we can write things more concisely. The Boost library is a library of useful C++ code that is not (yet) part of the standard library. Among other features of Boost is a library called bind, which can be used to automatically generate function objects that partially bind the arguments to a function -- which is what we've essentially done above. If we install Boost and include boost/bind.hpp, we can rewrite the above routine as
void check_monomial2()
{
for (unsigned p = 0; p < 3; ++p) {
double I =
midpoint_rule(boost::bind(monomial, _1, p), 0, 1, 1);
printf("Error for x^%d: %g\n", p, I-1.0/(p+1));
}
}
The call boost::bind(monomial, _1, p) returns a function object that can be called with one argument x in order to make the call monomial(x, p). This is much more concise than our original implementation.
Feb 9, 2009: C and C++ I/O
Q: It appears to me that the C++ code you wrote for creating the .txt files containing the outputs reflect two different methods to write the files. Specifically I'm referring to question 8 in hw1 and question 3 in hw2. Is this true, and if so, can you comment a bit on the differences between the two methods?
A: Absolutely. C++ has two basic I/O libraries: the stdio library from C, and the iostream library introduced in C++. The C++ library has several advantages: you can add output methods for your own types, for example, and everything is type safe so that it's harder to make mistakes with C++ I/O than with C I/O. In contrast, the C stdio library allows you to write things like
double x = 1.0;
printf("%d\n", x); // x should be double for this to work!
A modern compiler will usually warn you about such problems, but it's not required.
The advantage of C-style I/O -- and it's a pretty big advantage if you spend your time printing floating point numbers -- is that the format strings used by printf and fprintf are remarkably powerful and concise. For example, if I wanted to print a floating point number in scientific notation to three figures (two after the decimal point), and I wanted a space before any positive numbers so that the digits in positive and negative numbers lined up in successive lines on a printout, I could write
printf("% .2e\n", x);
There is a format library in the Boost library that gives some of the best of both worlds. I hope one day that something like this will be part of the standard library.
Feb 1, 2009: Plotting
It generally makes sense to use a logarithmic scale for errors and for step sizes. On a linear scale, these plots tend to be extremely uninformative; for most of the x values, the y value is visually indistinguishable from zero.
There are some plotting packages (not scientific plotters like the ones in MATLAB) in which a semilogarithmic plot does not use a purely logarithmic y scale. Instead, the markers between integer powers of ten are evenly spaced, but between powers of ten there seems to be a linear scale (or something non-logarithmic). This leads to purely exponential curves turning into wiggly lines rather than being straight lines. Please try to avoid this.
Feb 1, 2009: Recurrences
Just because a function is defined via a recurrence does not mean that it needs to be computed using a recursive procedure. In fact, a naively written recursive procedure can be extremely expensive. For example, if you compute the nth Fibonacci number using the obvious recursive routine, the computational cost grows exponentially with n. If you compute using a loop, the cost grows linearly. It's possible to turn the obvious recursive routine into something more efficient by caching results, a process known as memoization; but for most of the recurrences we will work with, this is probably not the clearest way to write the code.
When you want to run a short recurrence forward for several steps and you only care about the last value of the recurrence, it usually makes sense to throw away the old values you no longer need. A slick way to do this is to map the kth value of an m-term recurrence into element k modulo m of an array of length m. For example, the Legendre polynomials are computed using the three-term recurrence (n+1)Pn+1(x) = (2n+1)x Pn(s)- nPn-1(x), with starting values P0(x) = 1 and P1(x) = x. I might code this as
double legendre(double x, int n)
{
double P[3];
P[0] = 1;
P[1] = x;
for (int i = 1; i < n; ++i)
P[(i+1)%3] = ( (2*i+1)*x*P[i%3]
- n*P[(i-1)%3] )/( i+1 );
return P[n%3];
}
Feb 1, 2009: C++ Programming
Based on the first homework, I have the following comments:
-
Unlike Java, C++ is not a garbage-collected language. That means whenever you dynamically allocate an array using new, you should later delete it.
-
The C++ language provides a vector class, which you can use as effectively a drop-in replacement for an array. For example:
vector
v(10); // This is a vector with length 10 for (int i = 0; i < 10; ++i) v[i] = i; // Fill in the vector using array notation While push_back is allowed on a vector, it is not always recommended, particularly if there isn't space pre-allocated for all the vector entries.
-
Often it makes sense to return C++ arrays using an output argument. For example, if I wanted to write a function to fill an array with the first n sums of i, i going from i = 1 to n, I would write
void sums_of_i(unsigned* sums, unsigned n) { assert(sums != NULL && n > 0); sums[0] = 0; for (unsigned i = 1; i < n; ++i) sums[i] = sums[i-1]+i; }I could also do this using vectors, e.g.
void sums_of_i(vector
& sum, unsigned n) { assert(n > 0); sum.resize(n); for (unsigned i = 1; i < n; ++i) sums[i] = sums[i-1]+i; } -
If you want to use the C++ standard library to find the maximum element of a vector, the recommended way is to use max_element. Please don't use sort and then pull an element off the end, unless you have some other use for the sorted elements later.