Author(s): Phuc Nguyen, Hideaki Takeda
Abstract: Semantic labeling for quantitative data is the process of matching numeric columns in tabular data to a schema or an ontology structure. It is beneficial for table search, table extension, and knowledge augmentation. Quantitative data matching poses several challenges, for example, varied data ranges and distributions and, especially, different units of measurement. Previous systems use several similarity metrics to measure the similarity between a column's numeric values and candidate semantic labels. However, ignoring units of measurement can lead to incorrect labeling. Moreover, the same attribute may be measured in different units in different tables. In this paper, we tackle the problem of semantic labeling across various measurement units and scales by using a Wikidata background knowledge base (WBKB). We apply hierarchical clustering to build the WBKB from numeric data taken from Wikidata. The structure of the WBKB follows the natural concept taxonomy of Wikidata, and it also contains rich information about units of measurement. We consider two transformation methods: z-score-tran, based on the standard normalization technique, and unit-tran, based on the restricted set of measurement units for each semantic label of the WBKB. We test the two transformation methods with six similarity metrics to find the most robust metric for Wikidata quantitative data. Our experimental results show that using unit-tran with the ks-test metric can effectively find the corresponding semantic labels even when numeric columns are expressed in different units.
Keywords: semantic labeling; quantity; unit of measurement; tabular data; LOD; Wikidata
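
Illustrative sketch (not taken from the paper): the short Python example below shows how the two transformations named in the abstract could behave on a toy column. It assumes that "ks-test" refers to the two-sample Kolmogorov-Smirnov statistic and that unit-tran converts a query column into the unit of a candidate label. The function names, the conversion table `TO_CM`, and the data are hypothetical and stand in for whatever the WBKB actually stores.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def z_score_tran(values):
    """Standardize values to zero mean and unit variance (assumed form of z-score-tran)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


# Hypothetical conversion factors into the unit of the candidate label (centimetres).
TO_CM = {"cm": 1.0, "m": 100.0}


def unit_tran(values, unit):
    """Convert a column into the label's unit (assumed form of unit-tran)."""
    return [v * TO_CM[unit] for v in values]


# Toy data: the label's reference values are in centimetres, the query column in metres.
reference_cm = [150.0, 175.0, 200.0, 225.0, 250.0]
query_m = [1.5, 1.75, 2.0, 2.25, 2.5]

raw_distance = ks_statistic(query_m, reference_cm)
unit_distance = ks_statistic(unit_tran(query_m, "m"), reference_cm)
z_distance = ks_statistic(z_score_tran(query_m), z_score_tran(reference_cm))

print(raw_distance, unit_distance, z_distance)
```

In this toy case the raw columns do not overlap at all, so the statistic is at its maximum, while either transformation brings the two distributions close together. Note that z-score normalization discards scale entirely, so it may also make unrelated attributes look alike; that is one plausible reason a unit-aware transformation could be preferable, though the sketch does not demonstrate it.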