DEPARTMENT OF COMPUTER SCIENCE

DOCTORAL DISSERTATION DEFENSE

Candidate: Bin Li

Advisor: Professor Dennis Shasha

**Free Parallel Data Mining**

10:30 a.m., Monday, May 11, 1998

12th floor conference room, 719 Broadway

Abstract

Data mining is the emerging field of applying
statistical and artificial intelligence techniques to the problem
of finding novel, useful, and non-trivial patterns in large databases.
This thesis presents a framework for easily and
efficiently parallelizing data mining algorithms.
We propose a directed acyclic graph structure,
the *exploration dag* (*E-dag*),
to characterize the computation model of data mining algorithms
in classification rule mining,
association rule mining, and combinatorial pattern discovery.
An E-dag can be constructively formed in parallel from the specification of
a data mining problem; a parallel traversal of the E-dag is then performed on
the fly to solve the problem efficiently.
The effectiveness of the E-dag framework
is demonstrated in biological pattern discovery applications.
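
The abstract does not spell out what the nodes and edges of an E-dag are, so the following is only a minimal sketch under stated assumptions, not the thesis's construction. It assumes an association-rule setting in which each node is an itemset, each edge links an itemset to a superset with one more item, and a node is evaluated only when all of its parents were found "good" (frequent). The names `support`, `traverse_edag`, and `min_support` are hypothetical, and a thread pool stands in for the parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def traverse_edag(transactions, min_support, workers=4):
    """Level-wise traversal of an itemset dag (assumed structure).

    Nodes are frozensets of items. The parents of a node are the subsets
    obtained by dropping one item. Nodes on the same level are independent,
    so they are evaluated concurrently. Returns {good itemset: support}.
    """
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    good = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while level:
            # Evaluate every node of the current level in parallel.
            counts = list(pool.map(lambda s: support(s, transactions), level))
            good_here = [s for s, c in zip(level, counts) if c >= min_support]
            good.update((s, c) for s, c in zip(level, counts) if c >= min_support)
            # Form the next level on the fly: join good nodes pairwise and
            # keep a child only if all of its parents are good.
            k = len(level[0]) + 1
            candidates = {a | b for a, b in combinations(good_here, 2)
                          if len(a | b) == k}
            level = sorted(
                (c for c in candidates
                 if all(frozenset(p) in good for p in combinations(c, k - 1))),
                key=sorted,
            )
    return good
```

In this sketch the dag is never materialized as a whole: each level is generated from the good nodes of the previous one, which is one possible reading of forming and traversing the structure on the fly.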

We also explore data parallelism in data mining applications.
The cross-validation and the windowing techniques used
in classification tree algorithms facilitate easy development
of efficient data partitioning programs.
In this spirit, we present a new classification tree algorithm called
*NyuMiner* that guarantees that every split in a classification tree
is optimal with respect to any given impurity function and any given maximum
number of branches allowed in a split.
NyuMiner can be easily parallelized using the data partitioning technique.
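
To make the optimality claim concrete, here is a small sketch of what "optimal with respect to an impurity function and a maximum number of branches" can mean for one numeric attribute. It is not NyuMiner's algorithm, which the abstract does not describe; it is a brute-force search over all ways of cutting the sorted values into at most `max_branches` intervals. The function names, the Gini default, and the midpoint thresholds are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels, max_branches, impurity=gini):
    """Exhaustively find the lowest-impurity split of a numeric attribute.

    Considers every split into at most `max_branches` contiguous intervals
    and returns (weighted impurity, thresholds). An empty threshold tuple
    means that no split beats leaving the data unsplit.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    # A cut is only meaningful between two distinct attribute values.
    cuts = [i for i in range(1, n) if pairs[i - 1][0] != pairs[i][0]]
    best = (impurity([lab for _, lab in pairs]), ())
    for k in range(1, max_branches):  # k cuts produce k + 1 branches
        for chosen in combinations(cuts, k):
            bounds = (0, *chosen, n)
            score = sum(
                (hi - lo) / n * impurity([lab for _, lab in pairs[lo:hi]])
                for lo, hi in zip(bounds, bounds[1:])
            )
            if score < best[0]:
                thresholds = tuple(
                    (pairs[i - 1][0] + pairs[i][0]) / 2 for i in chosen
                )
                best = (score, thresholds)
    return best
```

Any other impurity function with the same signature, such as entropy, can be passed as `impurity`. The exhaustive search is exponential in the number of branches, so it only illustrates the objective being optimized, not an efficient way to reach it.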

This thesis also presents a software architecture for running parallel data mining programs on networks of workstations (NOW) in a fault-tolerant manner. The software architecture is based on Persistent Linda (PLinda), a robust distributed parallel computing system that automatically utilizes idle cycles. Templates are provided for application programmers to develop parallel data mining programs in PLinda. Together, the parallelization frameworks and the software architecture form a synergy that makes free, efficient data mining realistic.