In research on empirical NLP, software is a tool for scientific
discovery. As our science advances, so must our tools. Since our
science tends to advance rapidly, the purpose and functionality of our
software change frequently. Therefore, DON'T WRITE LARGE MONOLITHIC
PROGRAMS. Their design is guaranteed to become obsolete before you
finish your project. Instead, write small programs whose call
structure is a deep tree (or lattice). This kind of design is much
easier to modify rapidly.
Everyone who intends to work on a non-trivial software system (which
would include at least all of my PhD students) should study the book
_Design Patterns_ by Gamma et al. (1994), and keep it handy for
reference. This book contains some of the best wisdom of the software
development community. It is not just the latest trend. It has
withstood the test of time, and is considered basic knowledge by all
seasoned software developers. Understanding the contents of this book
can increase your programming productivity tenfold.
Your code should contain enough comments to be completely
understandable by somebody who knows what it's supposed to do, but has
no idea how it works inside. At the very least, every class and
method should have some prose description attached. As a rule of
thumb, there should be some explanation for at least every 5 lines of
code. (Some people say there should be more prose than code in every
program, but I'm not that extreme... yet.) It is much easier to
comment *while* you're coding than to add comments afterwards, so get
in the habit.
For maximum usefulness, your documentation should be compatible with
automatic documentation generators. For C++, use the Doxygen
conventions. It's easy. For one-line comments, precede the comment
with "//!" instead of just "//". E.g.:
//! This is a doxygen-compatible one-line comment.
For multi-line comments, start the comment with "/*!" instead of just "/*".
/*! This is a doxygen-compatible
    multi-line comment. */
You can also get much more sophisticated with Doxygen, and make your
documentation much more useful to yourself and others. See the online Doxygen
manual at http://www.stack.nl/~dimitri/doxygen/manual.html.
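As a minimal sketch of how these comment conventions look on a whole class, consider the following. The class WordCounter and its members are hypothetical, invented only for illustration; "\param", "\return", "\a", and the trailing "//!<" form are standard Doxygen markup.

```cpp
#include <cassert>
#include <map>
#include <string>

/*! Counts how many times each word has been seen.
 *  (Hypothetical class, for illustration only.)
 */
class WordCounter {
public:
    //! Records one more occurrence of \a word, which must be non-empty.
    void add(const std::string& word) {
        assert(!word.empty());
        ++m_counts[word];
    }

    /*! Looks up the number of occurrences recorded for a word.
     *  \param word the word to look up
     *  \return the number of times add() was called with \a word,
     *          or 0 if the word has never been added
     */
    int count(const std::string& word) const {
        std::map<std::string, int>::const_iterator it = m_counts.find(word);
        return it == m_counts.end() ? 0 : it->second;
    }

private:
    std::map<std::string, int> m_counts; //!< occurrences per word
};
```

Somebody who knows only what the class is supposed to do can read the generated documentation and use it without opening the source file.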
Here are some guidelines that are specific to C++:
- Variables whose scope is more than a few lines should have
- All constants should be described, declared, and initialized at
the top of source files, so that they're easy to find and change.
This includes string constants like file names.
- If you're going to add code to an existing code base (such as
GenPar), study that code base first, to take advantage of all
its functionality and to make sure that you don't recode functionality
that already exists. Use the Doxygen docs for this purpose. Don't
hesitate to ask me or your colleagues about how different pieces fit
together.
- Practice generic programming. If you find that you need a new
algorithm for a specific purpose, think about whether that algorithm
could be useful for other purposes. If so, then make it part of a
common library, rather than burying it deep in some specialized source
file. This practice might save somebody else (maybe even you!) the
effort of recoding it later.
- To avoid code bloat, delete dead code and obsolete source files.
You can always recover them later from CVS if necessary.
- To make your code easier to study, avoid unnecessary vertical whitespace.
- To facilitate future distribution, use Free/free software and
libraries whenever possible.
- Names of classes and namespaces should start with a capital letter, and follow the pattern VariableNameWithCaps.
- Names of objects and builtin types should start with a lowercase letter, and follow the pattern variableNameWithCaps.
- Names of constants should be in ALLCAPS with UNDERSCORES_IF_NECESSARY.
- Names of class member variables should begin with m_ .
- Names of pointer variables should begin with pX, X a capital letter.
- With recent compilers, there is almost never a good reason to use #define.
- Do not use C-style casts of the form (type)other-type.
They are hard to see and are otherwise bug-prone. Instead, use
C++-style casts, like static_cast or reinterpret_cast. Also, do not
use dynamic_cast where a static_cast would suffice.
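The naming, constant, and cast guidelines above can be sketched together as follows. Everything named here (CountTable, VOCAB_FILE_NAME, SMOOTHING_DELTA, wordLength) is hypothetical and chosen only for illustration; it is one way to follow the rules, not a required layout.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Constants are described, declared, and initialized at the top of the
// file, in ALLCAPS, as typed const objects rather than #define macros.
//! File from which the vocabulary would be read (hypothetical name).
const std::string VOCAB_FILE_NAME = "vocab.txt";
//! Smoothing constant added to every count (hypothetical value).
const double SMOOTHING_DELTA = 0.5;

//! Accumulates integer counts and reports smoothed relative frequencies.
class CountTable {
public:
    //! Creates a table with \a numEvents event types, all with count zero.
    explicit CountTable(std::size_t numEvents)
        : m_counts(numEvents, 0), m_total(0) {}

    //! Adds one observation of event number \a eventIndex.
    void observe(std::size_t eventIndex) {
        assert(eventIndex < m_counts.size());
        ++m_counts[eventIndex];
        ++m_total;
    }

    //! Returns the add-delta smoothed relative frequency of \a eventIndex.
    double smoothedFrequency(std::size_t eventIndex) const {
        assert(eventIndex < m_counts.size());
        // C++-style casts make each int-to-double conversion easy to spot.
        const double numerator =
            static_cast<double>(m_counts[eventIndex]) + SMOOTHING_DELTA;
        const double denominator = static_cast<double>(m_total)
            + SMOOTHING_DELTA * static_cast<double>(m_counts.size());
        return numerator / denominator;
    }

private:
    std::vector<int> m_counts; //!< one count per event type (m_ prefix)
    int m_total;               //!< sum of all counts
};

//! Returns the length of the string that \a pWord points to.
std::size_t wordLength(const std::string* pWord) {
    assert(pWord != 0);  // pointer names start with p plus a capital letter
    return pWord->size();
}
```

With a two-event table that has observed event 0 twice and event 1 once, smoothedFrequency(0) is (2 + 0.5) / (3 + 0.5 * 2) = 0.625.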
Using CVS, Subversion, or other version control software
- If you're coding in tandem with somebody else, agree on a
standard indentation style. Otherwise, it will be difficult to track
changes in CVS.
- Don't check your revisions into the repository until you're sure
they don't break the compilation of any other code that depends on
them. That includes code that other people may be working on. (We
share Makefiles for this purpose.)
- Make changes to the code in small atomic steps, such that after
you finish a step, the code compiles and passes all validation tests.
After every step, make sure you check your changes into CVS. This is
especially important if other people are working on the same code at
the same time, because otherwise you're much more likely to have
conflicts to resolve.