On statistical schema matching with embedded value mappings

Open Access
Jaiswal, Anuj Rattan
Graduate Program:
Information Sciences and Technology
Doctor of Philosophy
Document Type:
Date of Defense:
June 07, 2012
Committee Members:
  • Prasenjit Mitra, Dissertation Advisor
  • David Jonathan Miller, Dissertation Advisor
  • James Z Wang, Committee Member
  • Wang Chien Lee, Special Member
Keywords:
  • Schema matching
  • embedded value mapping
Abstract:
Schema matching and value mapping across two heterogeneous information sources are critical tasks in applications involving data integration, data warehousing, and federation of databases. The complexity of the problem grows quickly with the number of attributes to be matched and with the multiple semantics that data values carry in the real world. Traditional research has tackled schema matching and value mapping independently and mainly for categorical (discrete-valued) attributes. This thesis discusses novel methods that leverage value mappings to enhance schema matching in the presence of opaque column names, for schemas consisting of both continuous and discrete-valued attributes. Additional sources of confounding are that (a) a discrete-valued attribute in one schema could in fact be a quantized version of a continuous-valued attribute in the other schema, and (b) a continuous-valued attribute in one schema could in fact be a transformed version of a continuous-valued attribute in the other schema. In this approach, the fitness objective for matching a pair of attributes from two schemas exploits the statistical distribution over values within the two attributes. Suitable fitness objectives are based on Euclidean distance and data log-likelihood, both of which are applied in experimental evaluations. A heuristic local descent optimization strategy that uses two-opt switching to optimize attribute matches, while simultaneously embedding value mappings, is applied for one-to-one or onto matching. A top-K schema matching strategy is developed in which a top-K set of schema matches is first generated and statistical hypothesis testing is then applied to identify confident matches. This strategy is further utilized to handle partial schema matchings.
Experimental results show that the proposed techniques achieve mixed continuous and discrete-valued schema matching with high accuracy and, thus, should be useful additions to a framework of (semi-)automated tools for data alignment.
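As a rough illustration of the ideas named in the abstract, the sketch below matches discrete-valued columns with opaque names using a Euclidean-distance fitness and two-opt switching. It is not the thesis's implementation. The function names (`sorted_freqs`, `distance`, `two_opt_match`) and the toy data are hypothetical, and the value mapping is approximated by a simplifying assumption: values are aligned by frequency rank, so the most frequent value in one column is mapped to the most frequent value in the other. Continuous attributes, log-likelihood objectives, and top-K hypothesis testing are not covered.

```python
from collections import Counter
from itertools import combinations
import math


def sorted_freqs(values):
    """Relative frequencies of the distinct values, largest first."""
    counts = Counter(values)
    total = len(values)
    return sorted((c / total for c in counts.values()), reverse=True)


def distance(col_a, col_b):
    """Euclidean distance between rank-aligned frequency vectors.

    Assumption: aligning values by frequency rank stands in for an
    embedded value mapping between the two columns.
    """
    fa, fb = sorted_freqs(col_a), sorted_freqs(col_b)
    width = max(len(fa), len(fb))
    fa += [0.0] * (width - len(fa))  # pad the shorter vector with zeros
    fb += [0.0] * (width - len(fb))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


def two_opt_match(schema_a, schema_b):
    """One-to-one matching by local descent with pairwise (two-opt) swaps."""
    names_a, names_b = list(schema_a), list(schema_b)
    if len(names_a) != len(names_b):
        raise ValueError("this sketch only handles one-to-one matching")
    cost = {(a, b): distance(schema_a[a], schema_b[b])
            for a in names_a for b in names_b}
    match = dict(zip(names_a, names_b))  # arbitrary initial assignment
    improved = True
    while improved:
        improved = False
        for a1, a2 in combinations(names_a, 2):
            current = cost[a1, match[a1]] + cost[a2, match[a2]]
            swapped = cost[a1, match[a2]] + cost[a2, match[a1]]
            if swapped < current - 1e-12:  # accept only strict improvements
                match[a1], match[a2] = match[a2], match[a1]
                improved = True
    return match


# Toy schemas with opaque column names (illustrative data only).
schema_a = {"x1": ["a", "a", "a", "b"],
            "x2": ["p", "q", "p", "q"],
            "x3": ["u", "v", "w", "u"]}
schema_b = {"c1": [0, 1, 0, 1],
            "c2": ["k", "l", "m", "m"],
            "c3": ["y", "y", "n", "y"]}
print(two_opt_match(schema_a, schema_b))
```

Because the search accepts only improving swaps, it can stop in a local minimum. Restarting from several random initial assignments is one common mitigation, though whether the thesis does so is not stated here.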