MINING, INDEXING, AND SEARCH APPROACHES TO ENTITY AND GRAPH INFORMATION RETRIEVAL FOR CHEMOINFORMATICS

Open Access
- Author:
- Sun, Bingjun
- Graduate Program:
- Computer Science and Engineering
- Degree:
- Doctor of Philosophy
- Document Type:
- Dissertation
- Date of Defense:
- September 24, 2008
- Committee Members:
- C Lee Giles, Committee Chair/Co-Chair
Prasenjit Mitra, Committee Chair/Co-Chair
Jesse Louis Barlow, Committee Member
Robert T Collins, Committee Member
Bing Li, Committee Member - Keywords:
- Information retrieval
information extraction
entity tagging
entity search
query model
graph mining
graph indexing
graph search
feature selection
Chemoinformatics - Abstract:
- Traditional generic search engines using textual keyword matching do not support domain specific searches. However, different domains may have different domain specific searches. For example, Chemical research is molecule centric rather than document centric. Usually a chemical molecule can be represented in multiple ways, e.g., textual chemical entities such as chemical names and formulae, and chemical structures such as 2D graphs and 3D graphs. Thus, in Chemoinformatics, chemical entity searches and chemical structure searches are more important than simple document searches using keyword matching. In this work, we show how to build a domain specific search engine that enables both entity and 2D graph searches for chemical molecules. First of all, documents are collected from the Web, and then preprocessed using document classification and segmentation. We apply Support Vector Machines for classification and propose a novel method of text segmentation. Then chemical entities in the documents are tagged and indexed to provide fast searches. Simultaneously, chemical structure information are collected, processed, and indexed for fast graph searches. Many issues exist to support textual chemical entity searches. Chemical names and formulae usually appear in chemical documents when corresponding molecules are mentioned, but a chemical molecule can have different textual representative ways. A simple keyword search would retrieve only the exact match and not the others. Additionally, ambiguous non-chemical terms such as "He" are retrieved. We show how chemical entity searches can improve the relevance of returned documents by avoiding those ambiguous terms. Our search engine first extracts chemical entities from text, performs novel indexing suitable for chemical names and formulae, and supports different query models that a scientist may require. We propose a model of hierarchical conditional random fields for entity tagging that considers long-term dependencies at the sentence level. Then to support efficient and effective entity searches, we propose two feature selection methods for entity index building. One is to select frequent and discriminative subsequences from all the candidate features for chemical formula indexing. The other is to first discover subterms of chemical names with corresponding probabilities using a proposed independent frequent subsequence mining algorithm, and then segment a chemical name hierarchically into a tree structure based on discovered independent frequent subsequences. A unsupervised hierarchical text segmentation (HTS) method is proposed for this. Then subterms on the HTS tree can be indexed. Finally, query models with corresponding ranking functions are introduced for chemical entity searches. Experiments show that our approaches to chemical entity tagging, indexing, and search perform well. In addition to text searches, massive amounts of structured data in Cheminformatics and Bioinformatics raise an issue of effective and efficient structure searches. Graphs are used to model structures for a long time. A typical type of graph search is a subgraph query that retrieves all graphs containing the query graph. To support efficient search, graph features such as subgraphs are extracted from graphs for graph indexing. Then, for a given subgraph query, graph candidates that may contain the subgraph are retrieved using the index, and subgraph isomorphism tests are performed to scan the candidates. However, since the space of all the possible subgraphs of the whole graph set is prohibitively large, feature selection for index pruning is required. Thus, one of the key issues is, given the set of all the possible subgraphs of the graph set, which subset of features is the optimal to achieve the highest precision for a given query set? To answer this question, first, we introduce a graph search process for subgraph queries. Then, we propose several novel feature selection criteria, Max-Precision, Max-Irredundant-Information, and Max-Information-Min-Redundancy, based on pointwise mutual information. We also propose a greedy feature search algorithm to find a near optimal feature set based on Max-Information-Min-Redundancy. Finally we show empirically that our proposed methods achieve a higher precision than previous methods. Besides subgraph queries, users often execute similarity graph queries to search for graphs similar to the query graph. Previous methods usually use the maximum common edge subgraph (MCEG) to measure the similarity of two graphs. However, MCEG isomorphism is NP-hard and prohibitively slow to scan all graphs in the data set, so that it is not feasible for online searches to use MCEG in the fly to measure the similarity of retrieved graphs. We propose a novel approach to this issue from a different view by ranking search results using a weighted linear graph kernel that can avoid real time MCEG isomorphism tests, but achieve a reasonably high quality of search results. First, subgraphs features are extracted from graphs and indexed. Second, for a given graph query, graphs containing its subgraph features are retrieved and similarity scores are computed based on the indexed subgraph features. Finally, graphs are sorted and returned based on similarity scores. Subgraph weights used by the graph kernel are learned offline from a training set generated using MCEG isomorphism. We show empirically that the proposed methods of learning to rank for similarity graph queries can achieve a reasonably high normalized discounted cumulative gain in comparison with the "gold standard" method based on MCEG isomorphism. Moreover, our method can be applied to learn other similarity metrics, such as explicit knowledge provided by domain experts or implicit knowledge from user logs.