The Dan Melamed User Guide:
How to do research with Dan Melamed
One of the first things you should do when you start to work with me
is to sign up for the NLP-group and NYCNLP mailing lists. If you work
on the 7th floor of 715/719 Broadway, then you should also sign up for
the Broadway7 list.
Another thing you should do is create a web page for yourself, and add
a link to it from http://nlp.cs.nyu.edu/people/.
It need not be fancy, but it should exist. I put this in the
"Keeping Informed" section for a reason. If the world knows you
exist, knowledge and opportunities will come your way. Eventually,
you should link your publications, software, and other goodies from your
homepage, to encourage people to visit.
Take everything I say as opinion, not as gospel. If something doesn't
make sense to you, ask for clarification. Don't hesitate to challenge
me, or to suggest alternatives. I like it when people keep me on my
toes. I like to engage in intellectually stimulating debate. If I
end up losing a debate, so much the better, because that means I
learned something. Learning is ultimately what I'm here for. These
days I'm learning quite a lot, which means that I lose debates on a
regular basis :)
Time is my most valuable resource. Please don't waste it. Please let
me know as far in advance as possible if you need to cancel or
postpone a scheduled meeting. On the other hand, you don't necessarily
need to schedule a meeting in order to speak with me. I'm happy to
chat informally, as time permits. Feel free to interrupt me. The
worst thing that can happen is that I might tell you to come back later.
When we meet, you should take notes, so that you don't forget stuff.
(See "Writing" below.)
When you need help with hardware, software, connectivity, etc., you
should try the following, in order:
- Ask the person sitting next to you, if any.
- Ask your advisor, if he's around.
- Check our FAQ list.
- Send email to email@example.com, with a cc: to your
Here is a list of hardware available to people doing research with me.
If you need more, ask me.
Your home directory is cross-mounted, so that you have access to
it from all the boxes in the lab. We achieve this by actually storing
it on an off-site server called NFS.CIMS.NYU.EDU. Unfortunately, this
means that access to data in your home directory is slower than access
to local disks. Therefore, data that does not need to be backed up
should live in the /data partition of your desktop machine, rather
than in your home directory.
Here is a list of the software that is installed as standard on our
workstations.
If you will be doing intensive computation, you will want to learn
about our Condor job distribution system, starting here. If you're planning to
compile large C++ codebases (such as GenPar), then you'll want to
learn about distcc.
If you want your C/C++ programs to be able to use more than 2GB of RAM
(e.g. on s1), then you have to compile them with the -m64 switch for
If you are doing memory-intensive computing in Java, then you should
understand the -Xms and -Xmx switches.
We have lots of useful software on s1 in /s1/software/. The stuff
we're actively developing is in the CVS and/or Subversion repository.
Oh, this means you have to learn Subversion and maybe CVS. They're
pretty simple, though, and once you learn them you will love them. To
learn Subversion, start with Chapters 2 and 3 of this excellent and free
online book. A very short intro to CVS is here.
Most of the useful data that we have is on s1 under /s1/data/. Some of
it is read-protected due to license restrictions. If you need
something that you can't read or can't find, please ask. Please pay
attention to the license conditions, usually described in files like
README, COPYRIGHT, and LICENSE. Some of the data may not leave the
disks of s1. Much of it may not leave NYU. Breaking these rules can
severely compromise my ability to do research, and therefore also your
ability to do it with me.
If you need some other resource to help you work effectively, whether
it's common stuff or something very unusual, please don't hesitate to
ask. No matter how outrageous the request, the worst
possible outcome is that I'll say no. On the other hand, you might be
surprised how far I'm willing to go to make your life easier. And
even if I say no, I might keep your request in mind, and I might find
a way to say yes later. Typical "other" resources include books,
furniture, and computer hardware and software. Less typical resources
include travel allowances, broadband service to your home, and
exclusive use of computer hardware.
In research, one must be constantly vigilant against reinventing the
wheel. Unoriginal work is not only a waste of time but, possibly
worse, it might offend those whom you should have cited but didn't.
So read anything you find interesting, but read everything
that's directly relevant to your research. Learn to use the library,
both the online and traditional varieties. Make an effort to obtain
and read publications that are hard to find but likely to be relevant.
If you find interesting ideas, write them down together with their
source. If you develop original ideas while reading somebody else's,
write down the source of your inspiration --- a stronger connection
might crystallize later.
Some specific suggestions on how to read papers are here.
Let me know what you read. I might be able to offer relevant insights
that increase your rate of knowledge gained per unit time invested.
Knowing what you read also benefits me, because you will sometimes
find and read something that I should read too but didn't know about.
In that case, I will appreciate you alerting me to the source.
In addition to reading papers, you should attend presentations in your
field. This applies not only to presentations that are directly
relevant to your current research focus, but also to presentations
that are tangentially relevant, as well as all presentations given by
your colleagues and/or respected scientists. A good scholar should
have breadth as well as depth. For example, I recommend regular
attendance at the NYCNLP Forum. The topics that are chosen tend to be
of the kind that will be relevant to you eventually, even if they are
not relevant now.
Listening to presentations need not be a passive activity. To get the
most out of your listening experience, it is sometimes a good idea to
ask questions. Some people aren't sure when it's appropriate to ask
questions about a formal presentation. During? After? It is almost
never appropriate to ask questions in the middle of a research
presentation. First, it is impolite to impose on other members of the
audience with questions that they might not care about. Second, most
presentations have time constraints, and you want to give the speaker
time to say what s/he wants to say. Therefore, questions that are
appropriate in the middle of a presentation are only those that are
likely to help the speaker get their message across. To my knowledge,
the only kind of question that fits this description is a very
specific clarification question, like "What does the term X represent
in your slide?", which the speaker is likely to answer in just a few
words.
Many research presentations have a formal question period at the end,
i.e. a time when the speaker is expected to field questions from the
audience. What kinds of questions are appropriate to ask during this
time? Remember that you can usually chat with the speaker informally
later, either in person or by email. To avoid wasting other people's
time, the questions that you ask in a public forum should be the kind
that are interesting to the public. Even then, there are two types of
questions to avoid.
First, you should avoid asking any questions when the speaker
is one of your academic "allies," such as a research collaborator or a
fellow student. The reason is that answering questions from the
audience is one of the hardest parts of giving a presentation. You
can't anticipate what the questions will be. Consequently, every
"live" question runs the risk of embarrassing the speaker in a public
forum, which benefits nobody. This risk is usually outweighed by the
benefit of gaining a deeper understanding of the topic of the
presentation. However, the risk of embarrassing somebody that you
really care about outweighs any potential benefit to people not in
that category. Of course, there are ways to ask questions that make
your allies look good, but those are tricky, and rarely worth the risk.
Second, you should avoid public confrontations. The academic
establishment puts a premium on "collegiality." Bad science will
almost always be seen as such sooner or later. Don't publicly
criticize somebody just because they're doing bad work. Do that
privately and constructively. An exception might be justified if
their bad work adversely affects you in some way. For example, if
somebody misrepresents your work in an unflattering way, then you have
a right to point this out publicly. Even then, I would not recommend
public confrontations until you are an expert in the relevant field.
Then, if you see a serious flaw in the presentation, and you don't
mind embarrassing the speaker, aim for the head. If you misfire, you
will look like a fool, so don't try this until you've seen others do
it many times, and you are sure of your aim, and you can predict the
political consequences with confidence. Definitely don't do this if
you are emotionally upset. The academic community does not respect
You should code under the assumption that your code will be
distributed far and wide. This means that you should practice good
software engineering, to make sure your code is portable, modular, and
well-documented. Even if your code is never distributed, you yourself
will benefit from coding this way, because you will often reuse your
own code in other projects, sometimes years later, after you forget how
it all fits together.
There is a trade-off between coding for the short term and coding for
the long term. When you absolutely must finish some piece of code
before a deadline, you might cut a few corners. However, I would like
most of the code produced by my students and staff to be useful in the
long term, even to people who may not have been around when the code
was written. If you cut corners, I expect you to eventually return to
your quick-and-dirty code, and make it long-lasting.
There are several kinds of software engineering tools that you should
start using from day one:
- Version control. Your software prototypes will change often.
Sometimes you'll wish to undo the last couple of changes, but you
won't remember the relevant lines of code that got changed. Version
control systems eliminate this problem. Our local system of choice is
Subversion, which also enables several people to work on the same code
base simultaneously. Some of our legacy code is in CVS, an archaic
version control system we used before Subversion. To learn
Subversion, start with Chapters 2 and 3 of this excellent and free
online book. A very short intro to CVS is here. To make your
code usable by other people, you should always write informative
descriptions of the changes you make, even if it seems like nobody
else will ever want to use your code (you'll be surprised!).
- Profilers. Never try to optimize your code without
profiling it first. If you think you know why your program is
inefficient, you're probably mistaken. More often than not, the main
sources of inefficiency are ones that programmers don't expect, and
that includes very good programmers. The gprof, memprof, and valgrind
packages are particularly useful, and easy to use.
- Auto-documenters. You can create better documentation and save
lots of time by using automatic documentation generators, which
produce cross-referenced documentation from specially-formatted
comments in your code. For Java, there's JavaDoc.
For C++ there's Doxygen.
Learn them early and use them religiously. They are easy and very
helpful once you get the hang of them.
- IDE. For many modern programming languages, and for many
platforms, there are integrated development environments (IDEs), which
combine code editors, version control, optimizing compilers,
interactive graphical debuggers, and documentation generators. Today,
there is about as much justification for not using one as there is for
programming in assembly language.
This document is not the place for a treatise on software engineering.
However, whenever I find myself wishing that somebody's code were
better, I will add guidelines to this
The first law of data is that there are no laws for data. Anything
and everything can and does happen in real-world data. In
data-intensive research such as empirical NLP, you should never trust
data to follow any rules, even if its documentation says that it does.
E.g., you should not expect English text to be purely English, or even
purely text! This means that your data-processing software should be
robust to aberrations, and should fail gracefully if its input is
unexpected. Given that it's infeasible to write perfectly robust
software, you should often checkpoint and spot-check your processes,
to make sure that the output looks reasonable, before it's fed into
the next process. As you get more experience in NLP, you will develop
an intuition for what "reasonable" looks like. However, some problems
will be obvious even to novices.
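The advice above about data that breaks its own rules can be sketched in a few lines of Python. This is only an illustration, not code from our repository: the function names, the tab-separated two-field format, and the 5% rejection threshold are all assumptions made for the example. The idea is to set aside malformed lines instead of crashing on them, and then to stop with a clear message if too much of the input was rejected, before anything is fed into the next process.

```python
def parse_lines(raw_lines, expected_fields=2):
    """Split byte lines into well-formed records and rejected lines.

    The tab-separated format and the field count are assumptions
    made for this illustration.
    """
    good, bad = [], []
    for line_number, raw in enumerate(raw_lines, start=1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # "Text" files are not always text: record the line and move on.
            bad.append((line_number, "not valid UTF-8"))
            continue
        fields = text.rstrip("\r\n").split("\t")
        if len(fields) != expected_fields or not all(fields):
            bad.append((line_number, "unexpected field count or empty field"))
            continue
        good.append(tuple(fields))
    return good, bad


def check_bad_fraction(good, bad, max_bad_fraction=0.05):
    """Fail with a readable message if too many lines were rejected."""
    total = len(good) + len(bad)
    if total == 0:
        raise ValueError("input was empty")
    fraction = len(bad) / total
    if fraction > max_bad_fraction:
        raise ValueError(
            f"{len(bad)} of {total} lines rejected; "
            "inspect the input before continuing"
        )
    return fraction
```

Calling parse_lines on the raw lines of a file opened in binary mode, and then check_bad_fraction on the result, gives a natural point at which to print a few of the rejected lines and spot-check them by eye.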
In empirical NLP, we often deal with very large data sets and
expensive algorithms. It is fairly common for experiments to run for
weeks or even months. Unfortunately, computer systems are not as
reliable as we'd like them to be. Therefore, if you are writing
software to manage long-running experiments, you should write it with
the expectation that it will crash before it finishes. In particular,
you should save intermediate results to disk. You should also make it
easy to restart from various places in the middle of the process. The
latter is always easier if your code is written as a hierarchy of
little programs, rather than as one monolithic program.
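As a rough sketch of what "expect it to crash" can look like in practice, here is a minimal Python example. It is not taken from our codebase: run_experiment, the process callback, the JSON checkpoint file, and the save_every interval are all hypothetical names and choices made for illustration. The loop saves its progress to disk every so often, and a restarted run picks up from the last saved position instead of from the beginning.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "results": []}
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it over the old checkpoint,
    so that a crash in mid-write cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run_experiment(items, process, checkpoint_path, save_every=100):
    """Process items in order, resuming from the last checkpoint if any."""
    state = load_checkpoint(checkpoint_path)
    for index in range(state["next_index"], len(items)):
        state["results"].append(process(items[index]))
        state["next_index"] = index + 1
        if state["next_index"] % save_every == 0:
            save_checkpoint(checkpoint_path, state)
    save_checkpoint(checkpoint_path, state)
    return state["results"]
```

A smaller save_every loses less work after a crash but spends more time writing to disk. This sketch assumes the results are small enough to serialize as JSON; larger intermediate results would more likely be written as separate files, one per stage.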
When experimenting with machine learning on large datasets, remember
that some theories can be effectively tested with just a tiny subset
of all your data. Although you should usually use all of the
available data to generate final results, you can save much time by
doing exploratory work on small samples.
When you do exploratory data analysis (EDA), resist the temptation to
"collect butterflies." When people collect butterflies, they focus on
the most beautiful/exotic ones. But Zipf's law says that exotic
events are rare. Empirical NLP systems should work well as often as
possible, so it's more important to analyze the frequent cases than
the rare/unusual cases. All this is another way of saying that when
you need to pick some data to stare at, you should use random
sampling, not sampling by (real or apparent) importance.
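One simple way to pick data to stare at, or to cut a small working subset out of a large corpus, is reservoir sampling, which draws a uniform random sample in a single pass without knowing the size of the data in advance. The Python sketch below is an illustration under assumptions: the function name and the seed parameter are invented for the example, and it is not the only reasonable way to sample.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return k items chosen uniformly at random from an iterable.

    If the iterable has fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for count, item in enumerate(items, start=1):
        if count <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / count,
            # replacing a uniformly chosen earlier pick.
            slot = rng.randrange(count)
            if slot < k:
                sample[slot] = item
    return sample
```

Passing an open file as items samples whole lines, so something like reservoir_sample(open(path), 50) would give fifty random lines to read. Fixing the seed makes the sample reproducible, which helps when you want to come back to the same examples later.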
Ideas and Work in Progress
If your ideas and results are not
described in prose text, then they aren't worth sh*t. Keep a research
journal handy. Write down everything. Guard it with your life. It
doesn't have to be well thought out or well edited to be written down.
The point is to record useful stuff in a medium that doesn't disappear
as easily as memory (yours or RAM). If you're like most academics,
then you have too much valuable information to keep in your head. To
make it easy to write things down on the train or in the middle of the
night, don't rely on a computer -- just use a pen.
Ask to see my journal for more specific suggestions.
When your ideas are more fully developed, or when you have a coherent
set of results, it's time to start putting together a more organized
document. See here for specific advice on how to write up research,
including advice on content, format, and the software tools that you
should use. When you have written something that approximates a
complete and coherent document (of any length), it's time to circulate
it in the Paper Network. One of the best ways to improve your papers
(and the research that they are based on) is to get feedback from your
peers. If your peers can't understand what you wrote, then neither
will your reviewers or anybody else. In any case, you'll be amazed at
what a second pair of eyes can catch. Every paper should circulate
through the paper network at least once during its development. The
benefit is bidirectional, since reading your peers' papers keeps you
informed and stimulates the flow of ideas.
When you want my feedback on something you've written, please
give it to me single-sided, double-spaced, in a font that's at least
as large as the standard 11pt font in LaTeX. Please proofread it
yourself before giving it to me, to catch obvious errors and typos. In
particular, please use a spell-checker every time. These simple
measures make it easier for me to read your work and to give you
useful advice. I will usually mark up your document with suggestions.
Some of them might involve the editing notation listed here.
In research, the old saying about "publish or perish" is pretty close
to the truth. When you have some research results that others might
benefit from, it's time to think about publishing. If you don't have
a PhD, then you must arrange for me to read your article before
submitting it for publication. I recommend it even if you have a PhD,
but then it's not mandatory.
Everybody who directly and significantly contributes to a publication,
in terms of ideas, design, implementation, experimentation, writing,
or editing, should usually be listed as one of the authors. Work on
research infrastructure that benefits multiple projects does not count
as a direct contribution.
Authorship carries responsibilities. If your name is on a publication,
then the research community will hold you responsible for the work's
quality and originality. Therefore, every author has the right to
read the final version of a publication before it is submitted, and to
request modifications. An author who disagrees with the content of a
publication has the right to withdraw their name from the list of
authors. Never include somebody as an author without their informed
consent.
The default order of authors is alphabetical by last name.
Modifications to this default might be appropriate when some author(s)
contributed far more than others. People can be very sensitive on
this issue, so consult with me if you're not sure.
In fast-moving fields like empirical NLP, the most prestigious
publication venues are conferences, not journals. This fact of life
has the advantage that you can develop a reputation more quickly. It
has the disadvantage that the conference paper reviewing process is
rather imprecise. You should get used to the idea that even the very
best papers are sometimes rejected by short-sighted reviewers. E.g.,
when Mitch Marcus first tried to publish a paper about the Penn
Treebank, it was rejected with the question "What possible relevance
could this have to NLP?"
Before you submit the final version of a conference paper for
publication, be sure to review these
guidelines. Pay special attention to the required
Most academic conferences require the authors of accepted papers to
orally present their paper at the conference. As a rule of thumb, if
you are working in my group, and you are the first author of a
conference paper, then I will sponsor you to attend that conference to
present your work. This simple rule has several non-obvious
consequences:
- Like any other expense, the cost of attending a conference should
be justifiable. I will not sponsor trips to obscure conferences on
the other side of the world. When you're ready to publish something,
consult with me about appropriate venues.
- If you submit a paper to a conference, then you may have to
attend that conference. So don't submit papers to conferences that
you cannot or don't want to attend! Factors that may affect your
decision include your legal right to leave (and re-enter!) the U.S.,
your legal right to enter the host country of the conference, and
conflicting family obligations.
- See "Vacations" below.
- See "Expenses" below.
- See "Presentations" below.
Before you attend your first conference, read up to section 6 of Networking
on the Network: A Guide to Professional Skills for PhD Students by
Phil Agre. Then, consult with me about effective conferencing
The quality of oral presentations makes a huge difference to whether
people will pay attention to you and your work. The key ingredients
of a good presentation are planning and sufficient practice. You
should prepare conference presentations far enough in advance to
schedule a practice talk in front of our research group, and to make
revisions afterwards. When you prepare your first few talks, you will
be amazed at how long it can take to prepare good visual aids.
Allocate at least a week just for that. Practice talks are mandatory
for my students and staff. Before your practice talk, you should
practice by yourself, in an empty room, until your presentation sounds
fluent and natural. It may seem strange to talk out loud in an empty
room, but you'll get used to it. Ask me for assistance if you get
Also, please note that the purpose of an oral or poster presentation
at a conference is not to summarize the paper. The purpose
is to advertise the paper, to get people to read it. If you're not
sure about the difference, come talk to me about it.
Here is some specific advice on how to prepare and deliver oral presentations on technical material such as research results.
For better or for worse, the major NLP conferences take place between
May and August each year. This means that the paper submission
deadlines for these conferences are mostly between December and
February, inclusive. Since you will want to submit your latest and
greatest results, you will typically want to write papers during these
winter months. An advantage of doing this at the same time as
everybody else is that you can participate in the paper exchange
network (see "Writing" above). Therefore, it would behoove you to
schedule vacations during some other time of the year. When I was in
grad school, I usually attached vacations to conference trips, in
order to save money on airfare.
If you need reimbursement for an expense that I approved, please get a
reimbursement form from my administrative assistant, fill it out, and
give it to me with all your receipts. I will sign it, and send it to
the right place. Depending on the amount, you'll get your money in a
few days to a couple of weeks.
Most of the money that I use to cover travel expenses comes from
grants that impose certain restrictions on how that money can be
spent. In particular, most of these grants require that any airfare
be purchased from a U.S.-based airline. See here for more details.
Do not break these rules without consulting me first.
If you can think of things you've learned about working with me that
other people would benefit from knowing, please email them to me so
that I can add them here!
Dan Melamed (melamed at cs dot nyu dot edu)
Last modified: Tue Sep 12 15:57:30 EDT 2006