The principle has its origin in the philosophy of science. In judging between scientific theories, other things being equal, one prefers the simpler explanation. A powerful theory, such as the atomic theory or the theory of gravitation, is one that is compact but accounts for a large number of phenomena.

Many different forms of learning can be characterized in terms of the MDL principle:

** General rules:** The assertion of a general rule eliminates the need
to specify each instance. For instance, given a table of animals and their
features, the rule ``All crows are black'' means that the ``color'' field
can be omitted for each crow in the table.

** General rules with exceptions:** The assertion of a rule
with a small number of exceptions saves space in all cases where the rule
applies; the exceptions can be enumerated separately.
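To make the savings concrete, here is a small sketch in Python; the table, the rule, and the encoding scheme are invented for illustration. The rule ``All crows are black'' supplies the color of every crow, and the one exceptional crow is listed separately.

```python
# Illustrative sketch (invented toy data): encode the "color" column using
# the rule "all crows are black" plus an explicit list of exceptions.

animals = [("crow", "black"), ("crow", "black"), ("crow", "white"),
           ("crow", "black"), ("robin", "brown")]
names = [a for a, _ in animals]

rule = ("crow", "black")  # the general rule: default color for crows
encoded = {
    # exceptions to the rule, listed separately by row index
    "exceptions": [(i, c) for i, (a, c) in enumerate(animals)
                   if a == rule[0] and c != rule[1]],
    # rows the rule does not cover at all
    "others": [(i, c) for i, (a, c) in enumerate(animals) if a != rule[0]],
}

def decode(names, rule, encoded):
    # Reconstruct the color column: an exception overrides the rule;
    # otherwise the rule supplies the value for every crow.
    exceptions = dict(encoded["exceptions"])
    others = dict(encoded["others"])
    out = []
    for i, a in enumerate(names):
        if i in exceptions:
            out.append((a, exceptions[i]))
        elif a == rule[0]:
            out.append((a, rule[1]))
        else:
            out.append((a, others[i]))
    return out

assert decode(names, rule, encoded) == animals
```

The encoded form stores a color for only the one exceptional crow, rather than one color per crow.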

** Numerical rules:** If one numeric parameter is a function of another,
then the tabulation of the values of the two parameters can be replaced
by a tabulation of the values of the independent parameter together with the function.
For example, data consisting of the pairs "X=1.0, Y=3.0; X=1.2, Y=3.2;
X=2.7, Y=4.7; X=5.9, Y=7.9" can be replaced by the rule "Y = X+2.0" and
the values "X=1.0; X=1.2; X=2.7; X=5.9". Of course, one has to be sure
that the rule is shorter than the data. Any K points can be interpolated
by a polynomial of degree K-1, but since the polynomial requires K coefficients
there is no gain in compactness.
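The replacement is easy to check in code. This sketch uses the example data above; the reconstruction step simply applies the rule to each stored X value.

```python
# A minimal sketch of the numerical-rule example: the (X, Y) pairs are
# replaced by the rule Y = X + 2.0 and the list of X values alone.

pairs = [(1.0, 3.0), (1.2, 3.2), (2.7, 4.7), (5.9, 7.9)]

rule = lambda x: x + 2.0          # the discovered rule
xs = [x for x, _ in pairs]        # only the independent values are stored

# Reconstruction: apply the rule to each stored X.
reconstructed = [(x, rule(x)) for x in xs]
assert reconstructed == pairs
```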

** Approximate numerical rules:** If one numeric parameter is closely approximated
by a function of another, then the tabulation of the two parameters can
be replaced by giving the function, the values of the independent
parameter, and the value of the error at each point. Since the error is much smaller than
the value of the function, it can be represented more compactly, to any
specified degree of accuracy.
For example, data consisting of the pairs "X=1.0, Y=3.1; X=1.2, Y=3.2;
X=2.7, Y=4.8; X=5.9, Y=7.8" can be replaced by the rule "Y = X+2.0+E" and
the values "X=1.0, E=-0.1; X=1.2, E=0.0; X=2.7, E=0.1; X=5.9, E=-0.1".
In the initial data, we have to represent values of Y between 1 and 8 to an
accuracy of 0.1, which would require 7 bits each. By contrast, E takes
on only 3 values, and therefore requires only two bits per data point.
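The bit counts can be checked directly. This is a rough sketch of the counting argument, not an exact coding scheme: Y spans the range 1 to 8 at accuracy 0.1, about 70 distinct values, while E takes on only 3 distinct values.

```python
import math

# Rough check of the bit-counting argument above (a sketch, not an exact
# coding scheme).

y_values = round((8 - 1) / 0.1)                      # about 70 distinct Y values
bits_per_y = math.ceil(math.log2(y_values))          # 7 bits each

errors = [-0.1, 0.0, 0.1, -0.1]                      # the E column from the example
bits_per_e = math.ceil(math.log2(len(set(errors))))  # 2 bits each
```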

** Arbitrary exact classifier:** Any classification system can
be viewed in MDL terms as follows: Given a table of labelled instances, delete
the desired output and replace it with a description of the classifier.
The original table can then be reconstructed by applying the classifier to
each instance.
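A hypothetical sketch of this scheme, with a toy table and classifier invented for illustration: the label column is deleted, and applying the classifier to each instance reconstructs the original table.

```python
# Illustrative sketch (toy data): the label column is replaced by a
# description of an exact classifier.

table = [((0, 1), "a"), ((1, 1), "b"), ((0, 0), "a"), ((1, 0), "b")]

# An exact classifier for this data: the label depends only on the first feature.
def classify(instance):
    return "a" if instance[0] == 0 else "b"

instances = [x for x, _ in table]            # the table with labels deleted
reconstructed = [(x, classify(x)) for x in instances]
assert reconstructed == table
```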

** Arbitrary inexact classifier:** Suppose you have a
classification system that predicts the classification attribute better
than randomly guessing it according to its
distribution. Let C be the classification attribute and R the prediction.
Then the conditional entropy of C given R is less than the entropy of C:
Ent(C|R) < Ent(C). Therefore, using R you can find an encoding of C that
is shorter than the optimal encoding that does not use R. So if you encode
the C column as the description of the classifier plus the optimal
encoding of C given R, the result will be shorter than the optimal encoding of C
without R.
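The entropy claim can be checked numerically; the toy joint distribution below is invented for illustration, with the prediction R agreeing with the class C on three of four instances.

```python
import math
from collections import Counter

# Illustrative check: for a predictor R that beats chance, Ent(C|R) < Ent(C).

# (prediction R, true class C) pairs: R matches C 3 times out of 4.
samples = [("x", "x"), ("x", "x"), ("y", "y"), ("y", "x")]

def entropy(values):
    # Shannon entropy of the empirical distribution, in bits.
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    # Ent(C|R) = sum over r of P(R=r) * Ent(C | R=r)
    n = len(pairs)
    by_r = {}
    for r, c in pairs:
        by_r.setdefault(r, []).append(c)
    return sum(len(cs) / n * entropy(cs) for cs in by_r.values())

h_c = entropy([c for _, c in samples])
h_c_given_r = conditional_entropy(samples)
assert h_c_given_r < h_c
```

Here Ent(C) is about 0.81 bits per instance, while Ent(C|R) is 0.5 bits, so an encoding of C that uses R saves roughly 0.3 bits per instance.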

** Clustering:** Suppose that instances X1 ... Xn
are divided into clusters C1 ... Ck. Let M(Ci) be the
center of Ci (or a typical instance of Ci).
Then the data can be represented as follows:

- Record M(C1) ... M(Ck).
- For each instance Xi, record the difference between Xi and M(cluster(Xi)).
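A one-dimensional sketch of this encoding, with toy numbers assumed for illustration: each instance is stored as a small offset from its cluster center, and the original values are recovered by adding the center back.

```python
# Illustrative sketch (toy data): record cluster centers, then only the
# (small) offset of each instance from its center.

clusters = {"C1": [1.0, 1.1, 0.9], "C2": [10.0, 10.2, 9.8]}

# Centers M(Ci): here, the mean of each cluster.
centers = {name: sum(xs) / len(xs) for name, xs in clusters.items()}

# Encode each instance as (cluster name, difference from center).
encoded = [(name, x - centers[name]) for name, xs in clusters.items() for x in xs]

# Decode by adding the center back.
decoded = [centers[name] + d for name, d in encoded]
assert decoded == [x for xs in clusters.values() for x in xs]
```

The offsets are all of magnitude at most 0.2, so they can be written in fewer digits than the original values.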

** Learning from positive examples:** In practice, learning is often
done just from a corpus of positive examples, with no negative examples.
For instance, in language learning, a child is presented almost exclusively
with correct usages of the language; it does not have to rely on
incorrect usages being pointed out. In what sense, then, is a grammar
of the language better than the simple grammar S -> word*, which also
accounts for all the positive data?

MDL can provide an answer. If the language does obey some non-trivial grammar, then that grammar can be used to derive a more compact representation of a corpus of sentences. For instance, consider a context-free grammar in which every rule has the form NON-TERMINAL -> SYMBOL SYMBOL ... We can develop an encoding as follows: For each nonterminal, number the productions from that non-terminal. Then, doing a depth-first search of the parse tree, give the number of the production used at each node. The result is generally an encoding that is a little shorter than the original sentence, since the number of choices at each step is less than the number of words in the language.

For instance, suppose the grammar is as follows:

1) S -> NP VP

1) NP -> pronoun

2) NP -> det ADJ_LIST noun PP_LIST

1) VP -> verb PP_LIST

2) VP -> verb NP

1) ADJ_LIST -> null

2) ADJ_LIST -> adj ADJ_LIST

1) PP_LIST -> null

2) PP_LIST -> PP PP_LIST

1) PP -> prep NP

pronoun -> (1) I (2) you ...

noun -> (1) boys (2) fish ...

verb -> (1) fish (2) ate ...

det -> (1) a (2) the ...

adj -> (1) cold (2) sharp ...

prep -> (1) with (2) of ...

Now, the parse tree

    S(1) ---> NP(2) ---> det(2) ---> the
         |         |
         |         |-> ADJ_LIST(1) ---> null
         |         |
         |         |-> noun(1) ---> boys
         |
         |-> VP(2) ---> verb(2) ---> ate
                   |
                   |-> NP(2) ---> det(1) ---> a
                            |
                            |-> ADJ_LIST(2) ---> adj(1) ---> cold
                            |              |
                            |              |-> ADJ_LIST(1) ---> null
                            |
                            |-> noun(2) ---> fish
                            |
                            |-> PP_LIST(1) ---> null

can be represented as the sequence 1,2,2,1,1,2,2,2,1,2,1,1,2,1.
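The depth-first encoding can be sketched as follows; the nested-tuple representation of the parse tree is an assumption made for illustration.

```python
# Parse tree for "the boys ate a cold fish", written as nested tuples
# (symbol, production number, children) -- an assumed representation.
tree = ("S", 1, [
    ("NP", 2, [
        ("det", 2, []),              # the
        ("ADJ_LIST", 1, []),         # null
        ("noun", 1, []),             # boys
    ]),
    ("VP", 2, [
        ("verb", 2, []),             # ate
        ("NP", 2, [
            ("det", 1, []),          # a
            ("ADJ_LIST", 2, [
                ("adj", 1, []),      # cold
                ("ADJ_LIST", 1, []), # null
            ]),
            ("noun", 2, []),         # fish
            ("PP_LIST", 1, []),      # null
        ]),
    ]),
])

def encode(node):
    # Depth-first walk, emitting the production number used at each node.
    _, number, children = node
    seq = [number]
    for child in children:
        seq += encode(child)
    return seq

assert encode(tree) == [1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 1]
```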

** Explanation of overfitting:** The MDL theory gives an elegant
explanation of why overly rich representation schemes tend to overfit:
When the encoding of the classifier itself is as long as the original data,
or almost as long, then nothing is gained in terms of description length.
E.g. You can represent K numbers as the values of a polynomial of degree K-1,
but no description length is gained, since you now need K values for the
coefficients of the polynomial. You can exactly fit a decision tree to data,
if there is a separate leaf for each datum, but again there is no gain. You
can cluster N points tightly into N clusters, one per point, but again no gain.
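The polynomial case can be checked directly. This sketch (toy data, pure Python) fits K points exactly with a degree-(K-1) polynomial via Newton divided differences, and confirms that the fit needs exactly K coefficients, so nothing is saved.

```python
# Illustrative sketch: K points fitted exactly by a degree-(K-1) polynomial,
# at the cost of K coefficients -- no gain in description length.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]   # K = 4 arbitrary values

def newton_coeffs(xs, ys):
    # Newton divided-difference table, computed in place.
    coeffs = list(ys)
    for j in range(1, len(xs)):
        for i in range(len(xs) - 1, j - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - j])
    return coeffs

def evaluate(coeffs, xs, x):
    # Horner-style evaluation of the Newton form.
    result = coeffs[-1]
    for i in range(len(coeffs) - 2, -1, -1):
        result = result * (x - xs[i]) + coeffs[i]
    return result

coeffs = newton_coeffs(xs, ys)
assert len(coeffs) == len(ys)   # K coefficients for K points: no compression
assert all(abs(evaluate(coeffs, xs, x) - y) < 1e-9 for x, y in zip(xs, ys))
```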