Multilingual Dictionary Linking and Aggregation: Quality from Consistency

Kun Ji, Shanshan Wang, and Lauri Carlson
University of Helsinki, Department of Modern Languages
{kun.ji,shanshan.wang,lauri.carlson}@helsinki.fi

Abstract. The growth of Web-accessible dictionaries and term data has led to a proliferation of platforms distributing the same lexical resources in different combinations and packagings. Finding the right word or translation is like finding a needle in a haystack. The quantity of the data is undercut by the doubtful quality of the resources. Our aim is to cut down the quantity and raise the quality by matching and aggregating entries within and across dictionaries. In this exploratory paper, our goal is to see how far we can get by using information extracted from multiple dictionaries themselves. Our hypothesis is that the more limited quantity of data in dictionaries is compensated by their richer structure and more concentrated information content. We hope to take advantage of the structure of dictionaries by basing quality criteria and measures on linguistic and terminological considerations. The plan of campaign is to derive quality criteria that recognise well-constructed dictionary entries from a model dictionary, and then attempt to convert the criteria into language-independent frequency-based measures. As the model dictionary we use the Princeton WordNet. The measures derived from it are tested against data extracted from BabelNet.

Keywords: Information extraction, Quality checking, Aggregation, Merging, Linked data, Edit distance

1 Introduction

The interactive Web and crowdsourcing have produced easily accessible lexical resources of unprecedented size. Lexical resources such as WordNet or Wiktionary are complemented by encyclopaedic data collections such as Wikipedia and Wikidata. The quantity of the data brings along problems of quality, such as errors, duplication and unclear provenance. Automatic methods, including machine learning techniques, are called on to manage the wealth, but may also contribute to the disorder.

Typical dictionary data categories, or fields, differ in availability, unambiguity and information potential. These three aspects often vary inversely: word labels are abundant and simple, but polysemous; semantic relations are unambiguous and informative, but scarce; subject field classifications and glosses have great information potential that is hard to make precise. We must combine different properties and vary our methods according to type.

This position paper introduces our line of research (cf. [1]), which tries to develop language-independent, linguistically motivated distributional methods for quality checking and aggregating such linguistic linked data. We first illustrate our approach with a selection of the kind of quality criteria we have in mind. As an example of such a measure, we describe a simple distance measure which is a variant of Levenshtein edit distance [2]. The measure is tested against labels, subject fields, and glosses extracted from the multilingual dictionary BabelNet [3]. The experiments indicate that a flat edit distance measure is less suited to longer pieces of text. We are working on a more sophisticated language model that takes into account the linguistic structure of glosses.
The rest of the paper is structured as follows. Section 2 discusses related work. Section 3 investigates candidate indicators and properties for quality checking and aggregation of data from multiple dictionaries. Section 4 compares these properties and describes the frequency-based distance measure. Section 5 describes our implementation and evaluation so far. Section 6 discusses the results of our work. Section 7 presents our conclusions and plans for future work.

2 Related Work

Ide and Veronis [4] argued that dictionaries have too little information for extracting knowledge bases from them. That is not our task: basically, we just want dictionaries with fewer errors and duplicates. It remains true that dictionary checking may benefit from external information sources. The hard part is to keep such sources from introducing more noise than they help suppress.

Navigli and Ponzetto compile BabelNet [3] using word sense disambiguation and machine translation as external sources. WordNet and Wikipedia are linked by a mapping between WordNet senses and Wikipage titles. Missing translations are collected from Wikipedia inter-language links and by machine translating occurrences of the labels within sense-tagged corpora. They report 82 percent mapping accuracy. We want to locate and fix the remaining errors.

Eckard et al. [5] match a French dictionary against a machine-translated French WordNet, looking for hypernym relations and using manually prepared regex patterns to parse dictionary definitions.

Semantic relatedness (SR), more generally, measures how much two (strings of) words or concepts are related, counting all kinds of relations between them. Zhang et al. [6] present a hybrid SR method that generates a connection graph between labels using WordNet semantic relations and Wikipedia contexts and measures semantic relatedness by the density of the graph between two labels. Semantic similarity can be indirectly measured by semantic relatedness. It may thus bring useful evidence for our task, which is dictionary alignment and aggregation. Token-level edit distance is known as WER (word error rate) in speech recognition and machine translation research [7].

3 Quality Checking for Dictionary Merging

Dictionaries are a good case of both the availability of and the need for matching and aggregating Web-accessible data. There is a legion of mono- and multilingual dictionaries, glossaries, thesauri and other vocabulary collections on the Web, some but not all in RDF, some public and collectively maintained, many commercial but openly accessible for querying. This multiplicity is also an encumbrance. In language technology, one of the most common feature requests from human translators is for ways to simplify the search for equivalents in the host of available sources.

Besides explicit URLs, dictionaries abound in implicit internal and cross-dictionary links, created by shared labels (words, collocations), subject field classifications, glosses, grammar and other properties. By aggregating dictionary entries, we also implicitly address the problems of (i) identifying valid such links, (ii) discarding misleading and duplicated links, and (iii) making useful links explicit.

3.1 Terminology

To begin with, we define here some of the key concepts of our dictionary ontology. By a label we mean a language-identified base form (lemma), represented in RDF as "base"@lang. A monolingual or multilingual dictionary minimally generates a cover (a set of possibly overlapping subsets) of its labels.
The cover represents the neighbourhoods generated by a synonym or equivalence relation. The members of the cover are called synsets, or equivalent sets (eqsets) if the dictionary is multilingual. An eqset is a multilingual synset. Separated by language codes, eqsets form disjoint unions of synsets. Synsets/eqsets can be seen to represent concepts or meanings.

A sense is a pairing of a label and a synset that the label is a member of. The more synsets a label belongs to, the vaguer it is. A label is n-way polysemous if it belongs to n synsets. Dually, the more members a synset has, the wider its meaning is. Special language labels are less polysemous than general language labels. Ideally, terms should be monosemous (per subject field). Polysemy is expected to occur between subject fields rather than within them, which supplies another consistency test. To estimate whether a label is a term, one may check the size of its synset. To check whether a term has been translated by a general language label, compare the sizes of their synsets.

Subject field headings show which domain or subject field a specific term belongs to. When the same label appears in different meanings, subject field classifications are used to distinguish the meanings. A gloss is the definition or explanation associated with a label; it expresses the meaning of the concept directly and can be used to check whether two concepts are the same. Hypernyms and other semantic relations serve the same purpose, as do part of speech and other grammar categories.

Synsets and eqsets are the target units that we reconstruct in our dictionary alignment. Labels, subject fields, glosses and other indicators are the properties from which our distance measure is to be inferred. In this paper, we restrict our attention to labels, subject fields, and glosses.

Synsets/eqsets may overlap because labels may belong to many synsets, by way of vagueness (synonymy is not exact, synset boundaries are negotiable) or polysemy (a label may belong to different but semantically related synsets). A synset (eqset) is thus a syntactic representative of a meaning. The relation "synonymy" or "equivalence" means "in the same neighbourhood". It is an equivalence relation within a synset, but not transitive through shared members of overlapping synsets.

Hence overlapping synsets cannot be merged in general. Even if consistent, the conjoined synset may be narrower than the originals, and the inferred equivalences have less application than their premises. This is why WordNet translations cannot simply be merged into the synsets that they translate. We need to find criteria for when synsets are safe to merge and what the risk is.

Translation equivalences are more informative than monolingual synsets because of mismatches between languages. Ambiguous words in one language may have unambiguous equivalents in another. Synonyms and hypernyms that are lexical in one language might be phrasal in another. This is particularly true of direct translations of WordNet into another language, as lexical gaps in the other language are often filled by phrasal definitions or paraphrases.

3.2 Merging Dictionaries

To match two dictionaries, we may pool together the equivalent sets from the two dictionaries (with some markings to tell where they came from) and test whether the combined dictionary satisfies various quality criteria. In theory, we merge two dictionaries by merging best matched entries and applying the quality criteria to the result. In practice, merging and checking may happen interleaved.
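The following is a minimal sketch of this pooling step (our own illustration, not the system described in this paper): eqsets are represented as sets of (lemma, language) pairs, label identity supplies the merge candidates, and a toy criterion flags risky merges. The function names, data, and threshold are assumptions made for the example.

from itertools import product

def merge_candidates(dict_a, dict_b):
    """dict_a, dict_b: lists of eqsets, each a set of (lemma, lang) labels."""
    for eq_a, eq_b in product(dict_a, dict_b):
        shared = eq_a & eq_b
        if shared:                              # label identity as merge evidence
            yield eq_a, eq_b, shared

def risky(eq_a, eq_b, shared):
    """Toy quality criterion: merging two larger eqsets on a single shared
    label is more likely to conflate distinct meanings."""
    return len(shared) == 1 and min(len(eq_a), len(eq_b)) > 3

dict_a = [{("bank", "en"), ("pankki", "fi")}]
dict_b = [{("bank", "en"), ("bank", "sv")}]

pooled = []
for eq_a, eq_b, shared in merge_candidates(dict_a, dict_b):
    if not risky(eq_a, eq_b, shared):
        pooled.append(eq_a | eq_b)              # merged entry; provenance omitted
print(pooled)

In a real run, the quality criteria of the following paragraphs would replace the toy risky() check.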
A multilingual dictionary may list bigger or smaller translation equivalences (multilingual synsets) depending on the precision and completeness of the translations. An optimal translation equivalence is to be reconstructed from many such smaller translation equivalences in the different sources.

To find out whether two translation equivalences (say, from different sources) can be merged, we may try to unify them by merging them and assuming some equivalences (e.g. based on label identity). The merge creates many new equivalences. Some of the new equivalences may be explicitly present or attested in the data, some not.

We may consider a binary translation pair like "bank"@en - "pankki"@fi as a base case of an eqset. In general, translation is not symmetric. By a translation norm, a translation should not add information (change a true text into a false text), but the opposite is not required: a translation may lose information in the source to satisfy other desiderata of the translation brief. The relation 'x entails y' is a partial order (transitive and antisymmetric). Symmetry can be restored by narrowing the context (e.g. with a subject field heading), or by giving up transitivity: the weaker notion 'y may translate x' is a symmetric non-transitive similarity relation. For constructing larger eqsets, the narrowing solution is preferable, so we should prefer binary translation pairs whose symmetry is attested in the data.

If we deconstruct WordNet synsets/eqsets into a binary relation of pairwise synonymies/equivalences between word senses, they form an equivalence relation whose quotient sets are the synsets/eqsets. When such equivalence pairs are all attested, it is easy to reconstruct the synsets from them by forming the partition of the set into strong components (strongly connected graphs, cliques). Each such component is a synset/eqset.

The clique test is at the strict end of a scale of attestedness. It construes synsets out of strongly connected components of the binary equivalence relation. When the evidence for synsets is less complete, we may weaken the tests, with increased risk. Assuming transitivity and antisymmetry (the translation norm that translations are no narrower than the original), we may check for equivalence by looking for cycles ([8]). Say we have three dictionaries, en-fi, en-sv, fi-sv. We may merge en-fi and fi-sv and check the result against sv-en:

"bank"@en < "pankki"@fi, "pankki"@fi < "bank"@sv, "bank"@sv < "bank"@en

Given multiple sources, there are two dimensions of degree of attestedness: the number of distinct attested equivalences and the number of duplicate attestations from different sources. Using such counts, we may construct quantitative variants of the consistency tests.
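A sketch of the two tests just described, again our own illustration rather than the paper's implementation: attested symmetric translation pairs form a graph whose connected components are eqset candidates (the clique test relaxed to connectivity over an undirected graph, for brevity), and a separate check looks for the three-way cycle of the example above. Data and function names are invented.

from collections import defaultdict

pairs = {(("bank", "en"), ("pankki", "fi")),
         (("pankki", "fi"), ("bank", "sv")),
         (("bank", "sv"), ("bank", "en"))}

# Undirected graph: an attested pair counts as evidence in both directions.
graph = defaultdict(set)
for a, b in pairs:
    graph[a].add(b)
    graph[b].add(a)

def components(graph):
    """Connected components of the attested-equivalence graph = eqset candidates."""
    seen, comps = set(), []
    for node in graph:
        if node in seen:
            continue
        comp, stack = set(), [node]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def closes_cycle(a, b, c, pairs):
    """Weaker cycle test: a < b, b < c, c < a attested in three bilingual sources."""
    return {(a, b), (b, c), (c, a)} <= pairs

print(components(graph))
print(closes_cycle(("bank", "en"), ("pankki", "fi"), ("bank", "sv"), pairs))

A quantitative variant would weight each edge by the number of sources attesting it before forming components.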
3.3 WordNet as Gold Standard

We test our criteria and measures on WordNet by deconstructing WordNet synsets down to senses or labels and then seeing to what extent we are able to reconstruct the synsets from them. The deconstruction of WordNet synsets can be done at word sense level or down to label level. Assume we are given the following synsets:

[ synset 1; label 'man'@en, 'human'@en; property1 'Noun'; property2 'human being' ] .
[ synset 2; label 'man'@en; property1 'Noun'; property2 'adult male' ] .

1. Word senses. Sense deconstruction for the synsets produces three senses:

[ sense 1; label 'man'@en; property1 'Noun'; property2 'human being' ] .
[ sense 2; label 'human'@en; property1 'Noun'; property2 'human being' ] .
[ sense 3; label 'man'@en; property1 'Noun'; property2 'adult male' ] .

Having deconstructed the synsets into word senses that inherit properties from the synsets, can we reconstruct the synsets by merging the senses? The answer is trivially yes if we retain the synset id or other key properties of synsets (like the gloss). This exercise is more relevant when the word senses come from different dictionaries.

2. Word labels. Label deconstruction produces two labels:

[ label 'man'@en; property1 'Noun'; property2 'human being', 'adult male' ] .
[ label 'human'@en; property1 'Noun'; property2 'human being' ] .

Distributing properties inherited from the synset all the way down to labels, can we reconstruct senses and synsets by splitting the labels, without knowing how the properties were clustered in the senses?

In (1) the senses keep properties together, whereas in (2) we lose the information regarding which properties go with which sense. In (2) there will be many more items to merge. In the example above, the three senses cannot be reconstructed from the label deconstruction, since the ambiguity of 'man' has been lost. The three senses can be merged back into the two synsets from the sense deconstruction, because the properties of senses 1 and 2 agree. For other similar or more complicated cases, we try reconstruction at varied granularity (word sense, label sense, or combined) to find an optimal merging solution.

In the general situation of combining different dictionaries, what get merged are just such "senses", "terms" or "entries", which combine one or more labels plus some other properties. The risk is in merging labels and/or properties belonging to incorrect or repeated senses. The merger can lose information. Errors and duplicates may arise. The task is to aggregate the entries to obtain the most likely and meaningful synsets. Starting from a set of partial descriptions of shared meaning, we try to merge the descriptions into manageable clusters. We next look at some statistics on the different indicators in WordNet.

4 Comparing Properties

In the previous section, we presented criteria that may be used in aggregating synsets and eqsets. Our criteria depend on notions of identity or sufficient similarity among labels and other dictionary fields/properties, which is the topic of this section.

We have not yet touched the problem of matching similar but not identical properties. For properties other than labels, such as glosses, matching is not straightforward. In the general case, we want to deal with graded measures. The problem of matching two properties is not independent of matching the whole entries: matching property contents is an argument for matching the entries, and vice versa. We set aside this complication for now.

4.1 Sharing of Labels between Synsets

The English WordNet RDF has about 100K synsets and 200K senses. It includes translations in 21 languages. The most complete one is Finnish (300K), with Malay, Japanese, Indonesian, and French next (over 100K each).

To appreciate our chances in the reconstruction, we studied how far labels alone go in measuring the similarity of synsets. Less than 0.1 percent of hypernymous synset pairs in the English WordNet share one or more labels. About 4 percent of hypernymous eqset pairs share labels in at least one language. This is another indication that translating the English WordNet creates redundant distinctions for the target languages. For random synset pairs, the corresponding percentages are one to two orders of magnitude smaller. So label sharing is a good, though rare, indicator of synset similarity.
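The comparison behind these percentages can be stated in a few lines. The following toy version (invented data, our own function names) applies the same share-rate computation to hypernym-related synset pairs and to randomly chosen pairs:

import random

synsets = {"s1": {"human", "man"}, "s2": {"adult male", "man"}, "s3": {"hominid"}}
hypernym_pairs = [("s2", "s1"), ("s1", "s3")]   # toy hypernym relation

def share_rate(pairs):
    """Fraction of synset pairs that share at least one label."""
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if synsets[a] & synsets[b]) / len(pairs)

random_pairs = [tuple(random.sample(sorted(synsets), 2)) for _ in range(1000)]
print(share_rate(hypernym_pairs), share_rate(random_pairs))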
Listing 1 shows some of the closest WordNet synsets measured in shared labels and shared translations, respectively. The first column is the number of shared labels (or translations), the next two columns are the synset ids, followed by a sample shared label and the glosses of the two synsets.

Listing 1. Closest WordNet synsets measured in shared labels and translations

10  wn31:107741018-n wn31:112599160-n "mung bean"@eng  'mung seed' 'mung plant'
6   wn31:107137720-n wn31:107407761-n "scream"@eng     'cry' 'noise resembling cry'
6   wn31:104647089-n wn31:104717403-n "severity"@eng   'excessive sternness' 'hard to endure'
129 wn31:200825727-v wn31:200826456-v "admonish"@eng   'take to task' 'censure severely'
94  wn31:400046739-r wn31:400473918-r "extremely"@eng  'extreme degree' 'extraordinary degree'
81  wn31:200346415-v wn31:201654152-v "start"@eng      'take first step' 'get off the ground'

4.2 String Relationships in Hypernyms

A fraction of hyponymy relations are recognisable from their syntactic makeup as phrasal species terms, each composed of a hypernym denoting the genus and modifiers specifying the differentia, for example skilled workman < workman. In the English WordNet, about one quarter of hypernym relations have this form, mostly phrasal verbs and special field terms. Another fraction are suffixal (in English, typically compounds), like workman < man. When all of the above types are included, 22 percent of hypernym relations contain at least one English substring relationship. Apparently, substring relationships are a useful indicator, but not strong enough alone.

4.3 Distance Measure

To obtain a quantitative measure of the distance between similar labels and other dictionary fields/properties, we implemented a language-independent character frequency-based edit distance measure. The same measure is designed to be applicable to subject field labels and glosses, possibly with different additional information sources and parameter settings.

Our distance measure is a two-level frequency-weighted Levenshtein (edit) distance measure [2]. It is designed to be language-independent as far as feasible, using only information available in the dictionary itself. With this desideratum in mind, the measure derives edit costs (weights) from character and token frequencies extracted from the input data or imported from external sources.

4.3.1 Character-based Distance Measure for Comparing Labels

We first calculate Levenshtein edit distances between tokens, with edit costs weighted by character frequencies per string position. Specifically, character cost grows with the variety (number of different characters) per position and the information value (inverse frequency) of the character at the position. The per-position character counts below illustrate this; as expected for English, early positions show more variation, while mid vowels and dental consonants predominate at endings.

0: v=1 w=2 o=12 c=6 h=3 r=1 b=1 f=4 l=4 k=2 ?=1 s=6 g=1 i=5 a=20 e=10 d=2 p=4 n=1 t=12
1: s=5 g=1 i=8 p=1 t=1 n=16 e=8 a=10 y=1 c=1 o=6 w=1 b=1 f=2 x=4 h=13 u=1 r=9
2: l=2 f=1 r=4 c=1 o=4 y=1 m=2 v=6 t=9 n=10 p=1 d=3 e=3 a=8 i=6 g=1 s=7
3: r=1 a=2 e=9 l=3 t=10 n=2 d=2 f=1 g=2 w=1 i=11 s=4 c=5 o=1 m=4
4: g=3 i=6 s=2 a=1 e=4 n=5 t=9 p=2 o=3 -=1 m=1 u=1 r=4 h=2 l=2 b=1
5: g=4 i=2 s=1 t=1 n=3 d=2 p=1 e=5 a=2 y=6 o=2 c=3 w=1 v=1 l=2 b=2 f=1 r=2
6: e=5 a=1 r=2 p=1 n=3 t=3 l=3 v=1 i=2 c=1 o=2 y=1
7: t=1 n=3 l=1 d=3 e=3 a=1 c=3 y=1 i=1 s=2
8: e=3 g=2 d=1 t=1 n=1
9: e=1 a=1 g=1 n=1
10: t=1 i=1
11: n=1 l=1
12: e=1 y=1
13: d=1
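A minimal sketch of this first level follows. It is an illustration under simplifying assumptions, not the tuned implementation: character counts are collected per position, the cost of editing a character grows with the variety at its position and with the rarity of the character there, and a standard dynamic-programming Levenshtein then uses those costs. The exact weighting formula is our own choice.

import math
from collections import Counter, defaultdict

def positional_counts(labels):
    """Character counts per string position, as in the listing above."""
    counts = defaultdict(Counter)
    for label in labels:
        for pos, ch in enumerate(label):
            counts[pos][ch] += 1
    return counts

def char_cost(counts, pos, ch):
    """Cost rises with the variety at pos and the inverse frequency of ch there."""
    seen = counts.get(pos, Counter())
    variety = len(seen) or 1
    freq = seen.get(ch, 0) + 1                  # add-one smoothing
    total = sum(seen.values()) + variety
    return math.log(variety + 1) * (-math.log(freq / total))

def weighted_levenshtein(a, b, counts):
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + char_cost(counts, i - 1, a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + char_cost(counts, j - 1, b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else max(
                char_cost(counts, i - 1, a[i - 1]), char_cost(counts, j - 1, b[j - 1]))
            d[i][j] = min(d[i - 1][j] + char_cost(counts, i - 1, a[i - 1]),   # deletion
                          d[i][j - 1] + char_cost(counts, j - 1, b[j - 1]),   # insertion
                          d[i - 1][j - 1] + sub)                              # substitution
    return d[m][n]

labels = ["folding", "misfolding", "misfolded", "function", "functional"]
counts = positional_counts(labels)
print(weighted_levenshtein("folding", "misfolding", counts))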
The tokens are normalised to types, using token distances and token frequencies as guides. The assumption in this reduction is that the dictionary or lemma form is close in character distance to its inflections and derivations and no less frequent than they are. If a synonym dictionary is supplied, it is used in the tokenisation, preferring types that occur in the dictionary. Abbreviations and multiword phrases are also tokenised with the dictionary if it is supplied.

The token distances obtained on the first level are used as token costs in another Levenshtein round that compares multi-token strings (terms, glosses, definitions etc.). This round uses a similar logic to the previous one, using position-sensitive type frequencies to weight edit costs. (The built-in assumption is that key terms occur early in glosses.) Besides the usual edit operations (addition, deletion, substitution), gloss distance adds permutation, by lowering the cost of substitutions of low-frequency terms if they are offset by an opposite substitution elsewhere.

To give more weight to low-frequency terms, the character-based token edit distances are scaled by token frequencies, so that long edit distances to low-frequency, high-information tokens (terms) are stretched exponentially at the high-frequency end and short distances correspondingly shrunk at the opposite end. Under this metric, the short-end edit distances manage to single out inflectional and derivational relations between significant keywords:

0.025269 221.000000 reciprocating reciprocal
0.024468 214.000000 functional function
0.023897 209.000000 experiencing experience
0.022639 198.000000 features feature
0.016808 147.000000 characteristic characterized
0.015550 136.000000 organisms organism
0.013835 121.000000 interacting interaction
0.012920 113.000000 accomplishment accomplishing
0.008804 77.000000 substances substance
0.003544 31.000000 independently independent

4.3.2 Synonym-enhanced Distance Measure for Comparing Glosses

The character-based distance measure fails to capture similarities between glosses that use unrelated but synonymous words. To remove this limitation, we import semantic relatedness information from the dictionary itself. Here this was done by lifting WordNet synset and hypernym relations to a semantic relatedness relation between labels. This construction is lossy in three ways: (1) the further apart two synsets are in the hypernym hierarchy, the more loosely they are considered related; (2) the larger the synset, the vaguer its meaning (in general), since special language concepts tend to have fewer synonyms than vaguer or context-dependent general language meanings; (3) precision falls with sense count: a polysemous label is a less sure indication of meaning than a monosemous one. We generate a fuzzy set of 1.3M semantically related pairs of labels from the English WordNet, weighted by the above counts so that more precise synonymies have more bearing than fuzzier ones.
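An illustrative lifting of this kind is sketched below. The toy WordNet fragment, the decay constants and the function names are our assumptions; the only point is that weights shrink with synset size, hypernym distance, and polysemy, mirroring the three sources of loss above.

from collections import defaultdict
from itertools import combinations

synsets = {"s1": {"man", "human"}, "s2": {"man", "adult male"}}
hypernyms = {("s2", "s1")}                      # s2 is a kind of s1 (distance 1)

polysemy = defaultdict(int)                     # number of synsets per label
for members in synsets.values():
    for label in members:
        polysemy[label] += 1

related = defaultdict(float)                    # fuzzy relatedness between labels

def add_pair(a, b, weight):
    if a != b:
        key = tuple(sorted((a, b)))
        related[key] = max(related[key], weight)

for members in synsets.values():
    for a, b in combinations(sorted(members), 2):
        # larger synsets and more polysemous labels give weaker evidence
        add_pair(a, b, 1.0 / (len(members) * polysemy[a] * polysemy[b]))

for child, parent in hypernyms:
    for a in synsets[child]:
        for b in synsets[parent]:
            # hypernym distance 1 discounts the pair further (factor is illustrative)
            add_pair(a, b, 0.5 / (len(synsets[child]) * polysemy[a] * polysemy[b]))

print(dict(related))

In the gloss distance, such weights lower the substitution cost between related tokens, so that differently worded but synonymous glosses come out close.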
In sum, the synonym-enhanced gloss distance may correctly predict semantic distances between differently worded definitions of the same thing on the one hand, and definitions pertaining to different concepts on the other. Table 1 shows the gloss distances for the terms 'REMOVAL' and 'Removal':

Table 1. Gloss distances for REMOVAL-Removal

distance   gloss 1            gloss 2
0.075714   wn31:100021914-n   REMOVAL
0.064758   wn31:100021914-n   Removal
0.027266   REMOVAL            Removal
0.000000   wn31:100021914-n   wn31:100021914-n

The glosses of the terms in Table 1 are:

- wn31:100021914-n "any substance such as a chemical element or inorganic compound that can be taken in by a green plant and used in organic synthesis"
- REMOVAL "The formal expulsion or deportation of a non-citizen from the United States when the non-citizen has been found removable for violating the immigration laws A person can be removed for overstaying a visa or for breaking laws including immigration laws"
- Removal "The expulsion of an alien from the United States based on grounds of either inadmissibility or deportability"

Table 2 is a truncated Levenshtein distance matrix for the glosses REMOVAL-Removal. A star marks a substitution, a plus an addition, and a minus a deletion. The minimum edit path can be traced by following pluses down, minuses to the right, and stars diagonally down and to the right.

Table 2. Distance matrix for REMOVAL-Removal

            The    expulsion of    an    alien from  the   United States based on
The          0.00* -1.00  -1.58  -2.16  -2.90  -3.57  -4.20  -5.56  -6.29  -6.87
formal      +0.79* +1.42  +2.00  +2.58  +3.32  +3.99  +4.62  +5.98  +6.71  +7.29
expulsion   +1.79   0.79* -1.37  -1.96  -2.69  -3.37  -4.00  -5.35  -6.09  -6.67
or          +2.37  +1.37* +1.95  +2.53  +3.26  +3.94  +4.57  +5.93  +6.66  +7.24
deportation +3.53  +2.53*  3.11   2.88   3.40   4.07   4.70   6.06   6.79   7.37
of          +4.11  +3.11   2.53* -3.12* -3.85  -4.53  -5.16  -5.82  -6.55  -7.14
a           +4.64  +3.64  +3.06  +3.65* +4.38  +5.06  +5.69  +6.35  +7.08  +7.67
non-citizen +5.80  +4.80  +4.22   4.81   5.54*  6.22   6.85   7.02   7.75   8.33
from        +6.48  +5.48  +4.90   5.48   6.22   5.54* -6.17* -7.53* -8.26* -8.84*
the         +7.11  +6.11  +5.53   6.11   6.85  +6.17  +6.80  +8.16  +8.89  +9.47*

5 Evaluation

To evaluate our distance measure, we extracted the first 4130 synsets from BabelNet using the Java API for the BabelNet 3.6.1 Lucene index download.

5.1 Term Labels

We tested the character frequency-based distance measure on the first 1000 English-language BabelNet labels in our extract, listing for each label its nearest neighbour in the set according to the measure. The synonym dictionary was not used here. 177 pairs were identical. 58 percent of the near neighbours came from the same synset. This may be compared to the probability of a random pair of labels coming from the same synset in our data (0.007).

"protein folding" ~ "folding"
"Misfolded protein" ~ "Misfolded"
"Misfoldings" ~ "Misfolding"
"Incorrect protein folding" ~ "Incorrect folding"
"Singleblinding" ~ "Singleblind"
"Dryopithecini" ~ "Dryopithecidae"
"Double-entries" ~ "Double-entry"

The list deteriorates towards the end. This can be helped by adding a confidence index and a threshold to cut off the weakest cases. Another class of false positives are near ties like "fathering"/"feathering". They may be captured using a dictionary.
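The nearest-neighbour test itself is simple; a toy rerun is shown below, reusing the positional_counts and weighted_levenshtein sketches from Section 4.3.1. The labels and their synset assignments here are invented, so only the shape of the computation carries over.

labels = {"protein folding": "s1", "folding": "s1", "misfolding": "s2",
          "misfolded": "s2", "function": "s3", "functional": "s3"}
counts = positional_counts(labels)

hits = 0
for a in labels:
    nearest = min((b for b in labels if b != a),
                  key=lambda b: weighted_levenshtein(a, b, counts))
    hits += labels[a] == labels[nearest]
print(hits / len(labels))   # compare with the same-synset rate of random pairs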
5.2 Subject Field Labels

To test the distance measure on subject field labels, we extracted 9890 English BabelNet categories from our data and listed the nearest matching category label pairs in that set. In this run, we used the WordNet-based synonym dictionary. An excerpt from both ends of the listing:

"Philosophy" ~ "Epistemology"
"Polytheism" ~ "Religion"
"Behaviorism" ~ "Psychology"
...
"Knights Grand Cross of the Order of Merit of the Italian Republic" ~ "Grand Cross of the Order of Civil Merit"
"Central Committee of the Communist Party of the Soviet Union members" ~ "Heads of the Communist Party of the Soviet Union"
"People executed by the Bourbon dynasty of the Kingdom of France" ~ "Peers of France"

The result may be evaluated, again, by comparing the proportion of matches in our listing that classify the same synset (0.14) to the probability that a pair of category labels chosen at random classifies the same synset in our data (0.0004).

5.3 Glosses

The distance measure was tested on 1000 BabelNet glosses extracted from our data, again listing the nearest matching gloss pairs in that set. In this run, we used the WordNet-based synonym dictionary. A few examples of the pairs judged nearest in the listing:

"Jane Seymour Fonda is an Academy Award-winning American actress, model, writer, producer and political activist." ~ "American actress and activist"
"The down of birds is a layer of fine feathers found under the tougher exterior feathers." ~ "Soft, immature feathers."
"A dog sled is a sled pulled by one or more sled dogs used to travel over ice and through snow." ~ "A sled, pulled by dogs over ice and snow."
"Dreams are successions of images, ideas, emotions, and sensations that occur involuntarily in the mind during certain stages of sleep." ~ "A series of mental images and emotions occurring during sleep"
"Duty is a term loosely applied to any action which is regarded as morally incumbent, apart from personal likes and dislikes or any external compulsion." ~ "Nose."

The result may be evaluated, again, by comparing the proportion of matches in our listing that belong to the same synset (0.18) to the probability that a pair of glosses chosen at random belongs to the same synset in our data (0.002). There were just 14 identical pairs this time.

6 Discussion

As the last example above shows, there is room for improvement here. The measure is sensitive to length, while the lengths of glosses may vary considerably. The effect of unequal length may be damped by truncating glosses (say, to the length of the shorter one), or by dropping high-frequency tokens (articles, prepositions, auxiliaries). Quality aside, the Levenshtein measure is resource-intensive on long glosses.

Obviously, a flat edit distance is too unstructured for long glosses. We must improve the language model. We are currently working on a frequency-driven parser that compares glosses not as flat strings but as binary tree (dependency) structures, so as to cut down on pairwise comparisons of low-information tokens. Only the n-best edges from the parser are compared using the Levenshtein distance measure. Our parser is a frequency-weighted chart parser using binary (dependency) grammar rules extracted from dictionary data.

7 Conclusion
To summarise, this paper presents quality criteria based on WordNet for merging linked lexical resources and for detecting duplicates and errors in them. A distance measure to compare linguistic strings was described and tested on WordNet and BabelNet. The first-round tests suggested how to improve the measure for longer glosses. That done, we may proceed with the WordNet deconstruction/reconstruction exercise to test our approach.

References

1. Wang, S. and Carlson, L. 2016. Linguistic Linked Open Data as a Source for Terminology: Quantity versus Quality. Proceedings of NordTerm 2015 (to appear).
2. Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8): 707-710.
3. Navigli, R. and Ponzetto, S. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193: 217-250. Elsevier.
4. Ide, N. and Veronis, J. 1993. Extracting knowledge-bases from machine-readable dictionaries: Have we wasted our time? In Proceedings of the KB&KB'93 Workshop.
5. Eckard, E., Barque, L., Nasr, A. and Sagot, B. 2012. Dictionary-Ontology Cross-Enrichment: Using TLFi and WOLF to enrich one another. In COLING Workshop on Cognitive Aspects of the Lexicon.
6. Zhang, Z., Gentile, A. L. and Ciravegna, F. 2011. Harnessing different knowledge sources to measure semantic relatedness under a uniform model. In Proceedings of EMNLP 2011. http://www.aclweb.org/anthology/D11-1092.pdf
7. Marzal, A. and Vidal, E. 1993. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(9).
8. Navigli, R. 2009. Using cycles and quasi-cycles to disambiguate dictionary glosses. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 594-602. Association for Computational Linguistics.
9. Princeton University. 2010. "About WordNet." WordNet. Princeton University.