Quality Checking and Matching Linked Dictionary Data

Kun Ji, Shanshan Wang, and Lauri Carlson
University of Helsinki, Department of Modern Languages
{kun.ji,shanshan.wang,lauri.carlson}@helsinki.fi

1 Introduction

The growth of web-accessible dictionary and term data has led to a proliferation of platforms distributing the same lexical resources in different combinations and packagings. Finding the right word or translation is like finding a needle in a haystack: the quantity of the data is undercut by the redundancy and doubtful quality of the resources.

In this paper, we develop ways to assess the quality of multilingual lexical web and linked data resources by internal consistency. Concretely, we deconstruct Princeton WordNet [1] into its component word senses or word labels, with the properties they have or inherit from their synsets, and see to what extent these properties allow reconstructing the synsets they came from. The methods developed should then be applicable to the aggregation of term data coming from different term sources: to find which entries coming from different sources could be similarly pooled together, to cut redundancy and improve coverage and reliability. The multilingual dictionary BabelNet [2] can be used for evaluation. We restrict our current research to dictionary data and to improving language models rather than introducing external sources.

2 Methodology

In [3] we canvassed our sample of large dictionary and term databases for the descriptive fields/properties of entries, to see what sources for matching entries they provide. The following types of properties susceptible to matching were found in different combinations:

1) languages and labels
2) translations / synonyms
3) thematic (subject field) classifications
4) hypernym (genus/superclass) lattice
5) other semantic relations (antonym, meronym, paronym)
6) textual definitions / examples / glosses
7) source indications
8) grammatical properties (part of speech, head word, etc.)
9) distributional properties (frequency, register, etc.)
10) instance data (text or data containing or labeled by terms)

These data sources normally vary in coverage, ambiguity, and information value. Labels are exact but polysemous; semantic properties are informative but scarce; distributional properties have a large information potential but are hard to make precise. Subject field classifications are potentially powerful, but alignments between them are often open as well.

Based on the above information sources, we have developed string-based distributional distance measures, many of them variations of Levenshtein edit distance [4]. Distributional measures are adaptable by machine learning, language-independent, and cheap, but they fall short with unconstrained natural language (witness MT). A reason to hope that purely distributional methods work better on dictionary data is that dictionaries are a constrained language and self-documenting. For example, the following glosses for "green card" (culled from web-based glossaries) seem at first sight lexically unrelated, but WordNet synsets and the hypernym lattice allow relating many label pairs in them: "permit" is a "document", "immigrant" is a "person", "US" is "United States".

GREEN CARD OR PERMANENT RESIDENT CARD: A green card is a document which demonstrates that a person is a lawful permanent resident, allowing a non-citizen to live and work in the United States indefinitely. A green card/lawful permanent residence can be revoked if a person does not maintain their permanent residence in the United States, travels outside the country for too long, or breaks certain laws.

GREEN CARD: A permit allowing an immigrant to live and work indefinitely in the US.

We have tested and trained measures for some of the properties listed above on WordNet data. Although they find definite correlations, taken singly they are too weak to predict the synsets. The task is therefore to develop a synset distance measure that combines the individual measures on the different criteria above into one similarity vector, and to train it on WordNet data for optimal aggregation of WordNet senses back into WordNet synsets.
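As an illustration of the component measures, here is a minimal Python sketch of two of the signals discussed above: a normalized Levenshtein label distance [4] and the hypernym-lattice test that relates "permit" to "document" and "immigrant" to "person". It assumes NLTK with the WordNet corpus downloaded; the function names and the normalization are illustrative choices, not the exact measures trained in our experiments.

# Assumes: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def levenshtein(a, b):
    # Classic dynamic-programming edit distance [4].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def label_distance(a, b):
    # Edit distance normalized to [0, 1] by the longer label.
    return levenshtein(a, b) / max(len(a), len(b), 1)

def hypernym_related(word, candidate):
    # True if some sense of word has some sense of candidate
    # in its hypernym closure.
    targets = set(wn.synsets(candidate))
    return any(targets & set(s.closure(lambda x: x.hypernyms()))
               for s in wn.synsets(word))

print(label_distance("permit", "document"))    # lexically distant
print(hypernym_related("permit", "document"))  # True via the lattice
print(hypernym_related("immigrant", "person")) # True via the lattice

Pairs like these, distant for the string measure but close in the lattice, are where the two signals complement each other.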
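The aggregation step can be sketched in the same spirit: each pair of senses is mapped to a similarity vector with one coordinate per criterion, and a classifier learns their weighting from WordNet sense pairs labeled by whether they share a synset. The three features, the toy entries, and the use of scikit-learn's logistic regression below are illustrative placeholders, not our trained measure.

import difflib
import numpy as np
from sklearn.linear_model import LogisticRegression

def similarity_vector(e1, e2):
    # One coordinate per criterion: 1) labels, 6) glosses,
    # 3) subject field classification.
    t1 = set(e1["gloss"].lower().split())
    t2 = set(e2["gloss"].lower().split())
    return [
        difflib.SequenceMatcher(None, e1["label"], e2["label"]).ratio(),
        len(t1 & t2) / len(t1 | t2),  # Jaccard overlap of gloss tokens
        float(e1["field"] == e2["field"]),
    ]

# Toy sense entries; y = 1 when the two senses share a synset.
a = {"label": "green card",
     "gloss": "permit allowing an immigrant to live and work in the US",
     "field": "law"}
b = {"label": "permanent resident card",
     "gloss": "document proving lawful permanent residence in the United States",
     "field": "law"}
c = {"label": "greenback",
     "gloss": "informal term for a US dollar bill",
     "field": "finance"}

X = np.array([similarity_vector(a, b),
              similarity_vector(a, c),
              similarity_vector(b, c)])
y = np.array([1, 0, 0])
model = LogisticRegression().fit(X, y)
# model.predict_proba(X)[:, 1] then serves as the combined synset
# similarity that drives re-clustering senses into synset-like groups.

With realistic training data, the learned weights can indicate directly which of the criteria 1)-10) carry the most synset signal.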
3 Conclusion

In this paper, we use only information available in the dictionary entries themselves, under criteria 1)-9) above, to see how far dictionary-internal information goes toward reconstructing synset structure. The use of external information (instance data, language-specific parsers, MT) will follow in subsequent papers.

References

1. Princeton University. "About WordNet." WordNet. Princeton University, 2010. https://wordnet.princeton.edu/
2. Navigli, R. and S. Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, Elsevier, 2012, pp. 217-250.
3. Wang, S. and L. Carlson. Linguistic Linked Open Data as a Source for Terminology - Quantity versus Quality. Proceedings of NordTerm 2015, 2016 (to appear).
4. Levenshtein, Vladimir I. "Binary codes capable of correcting deletions, insertions, and reversals." Soviet Physics Doklady 10 (8), February 1966, pp. 707-710.