Information Extraction is the automatic extraction of facts from text, which includes detection of named entities, entity relations and events. Conventional approaches to Information Extraction try to find syntactic patterns based on deep processing of text, such as partial or full parsing. The problem these solutions have to face is that as deeper analysis is used, the accuracy of the result decreases, and one cannot recover from the induced errors. On the other hand, lower level processing is more accurate and it can also provide useful information. However, within the framework of conventional approaches, this kind of information can not be efficiently incorporated.
This thesis describes a novel supervised approach based on kernel methods to address these issues. In this approach customized kernels are used to match syntactic structures produced from different preprocessing phases. Using properties of a kernel, individual kernels are combined into composite kernels to integrate and extend all the information. The composite kernels can be used with various classifiers, such as Nearest Neighbor or Support Vector Machines (SVM). The main classifier we propose to use is SVM due to its ability to generalize in large dimensional feature spaces. We will show that each level of syntactic information can contribute to IE tasks, and low level information can help to recover from errors in deep processing.
The new approach has demonstrated state-of-the-art performance on two benchmark tasks. The first task is detecting slot fillers for management succession events (MUC-6). For this task two types of kernels were designed, a surface kernel based on word n-grams and a kernel built on sentence dependency trees; the second task is the ACE RDR evaluation, which is to recognize relations between entities in text from newswire and broadcast news transcript. For this task, five kernels were built to represent information from sentence tokenization, syntactic parsing and dependency parsing. Experimental results for the two tasks will be shown and discussed.