<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Quality Checking and Matching Linked Dictionary Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kun Ji</string-name>
          <email>kun.ji@helsinki.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shanshan Wang</string-name>
          <email>shanshan.wang@helsinki.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lauri Carlson</string-name>
          <email>lauri.carlson@helsinki.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Helsinki, Department of Modern Languages</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>The growth of web-accessible dictionary and term data has led to a proliferation of platforms distributing the same lexical resources in different combinations and packagings. Finding the right word or translation is like finding a needle in a haystack: the quantity of the data is undercut by the redundancy and doubtful quality of the resources. In this paper, we develop ways to assess the quality of multilingual lexical web and linked data resources by internal consistency. Concretely, we deconstruct Princeton WordNet [1] into its component word senses or word labels, with the properties they have or inherit from their synsets, and see to what extent these properties allow reconstructing the synsets they came from. The methods developed should then be applicable to the aggregation of term data coming from different sources - to find which entries from different sources could be similarly pooled together, to cut redundancy and improve coverage and reliability. The multilingual dictionary BabelNet [2] can be used for evaluation. We restrict our current research to dictionary data and to improving language models rather than introducing external sources.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In [<xref ref-type="bibr" rid="ref3">3</xref>] we canvassed our sample of large dictionary and term databases for the descriptive fields/properties of entries, to see what sources for matching entries they provide. The following types of properties susceptible to matching could be found in different combinations:
1) languages and labels
2) translations / synonyms
3) thematic (subject field) classifications
4) hypernym (genus/superclass) lattice
5) other semantic relations (antonym, meronym, paronym)
6) textual definitions / examples / glosses
7) source indications
8) grammatical properties (part of speech, head word, etc.)
9) distributional properties (frequency, register, etc.)
10) instance data (text or data containing or labeled by terms)
      </p>
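      <p>The properties above can be pictured as fields of a single entry record. The following is a minimal sketch; the field names and types are our own hypothetical choices, not taken from any particular source schema.</p>
      <preformat>
```python
# Hypothetical sketch of a dictionary entry carrying the matchable
# properties enumerated above (field names are illustrative only).
from dataclasses import dataclass, field

@dataclass
class Entry:
    label: str                                          # 1) word label
    language: str                                       # 1) language code
    translations: list = field(default_factory=list)    # 2) translations/synonyms
    subject_fields: list = field(default_factory=list)  # 3) thematic classification
    hypernyms: list = field(default_factory=list)       # 4) genus/superclass lattice
    relations: dict = field(default_factory=dict)       # 5) antonym, meronym, ...
    gloss: str = ""                                     # 6) definition/example text
    source: str = ""                                    # 7) source indication
    pos: str = ""                                       # 8) part of speech
    frequency: float = 0.0                              # 9) distributional data

e = Entry(label="green card", language="en", pos="noun",
          gloss="a permit allowing an immigrant to live and work in the US")
```
      </preformat>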
      <p>These data sources typically vary across dictionaries in coverage, ambiguity, and information value. Labels are exact but polysemous; semantic properties are informative but scarce; distributional properties have a large information potential but are hard to make precise. Subject field classifications are potentially powerful, but alignments between them often remain open as well.</p>
      <p>
        Based on the above information sources, we have developed string-based distributional distance measures, many of them variations of Levenshtein edit distance [<xref ref-type="bibr" rid="ref4">4</xref>]. Distributional measures are adaptable by machine learning, language independent, and cheap, but they fall short with unconstrained natural language (witness MT). A reason to hope that purely distributional methods work better on dictionary data is that dictionaries use a constrained language and are self-documenting. For example, the following glosses for “green card” (culled from web-based glossaries) seem at first sight lexically unrelated, but WordNet synsets and the hypernym lattice allow relating many label pairs in them: “permit” is a “document”, “immigrant” is a “person”, “US” is “United States”.
      </p>
      <p>GREEN CARD OR PERMANENT RESIDENT CARD: A green card is a document which demonstrates that a person is a lawful permanent resident, allowing a non-citizen to live and work in the United States indefinitely. A green card/lawful permanent residence can be revoked if a person does not maintain their permanent residence in the United States, travels outside the country for too long, or breaks certain laws.</p>
      <p>GREEN CARD: A permit allowing an immigrant to live and work indefinitely in the US.</p>
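      <p>As a point of reference for the measures discussed above, the base Levenshtein edit distance [4] can be sketched as a plain dynamic program; this is a textbook formulation, not our trained variant.</p>
      <preformat>
```python
# Levenshtein edit distance: minimum number of single-character
# insertions, deletions, and substitutions turning a into b.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution (or match)
        prev = cur
    return prev[-1]
```
      </preformat>
      <p>Our trained measures weight or normalize these edit operations; the plain distance, e.g. levenshtein("kitten", "sitting") = 3, serves as the baseline.</p>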
      <p>We have tested and trained measures for some of the properties listed above on WordNet data. Although they find definite correlations, these measures are too weak taken singly to predict the synsets. The task is to develop a synset distance measure that combines the individual measures on the different criteria above into one similarity vector, and to train it on WordNet data for optimal aggregation of WordNet senses back to WordNet synsets.</p>
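      <p>The combination step can be sketched as follows. The measures and weights here are hypothetical placeholders; in practice each component measure is one of those discussed above, and the weights are trained on WordNet data rather than fixed by hand.</p>
      <preformat>
```python
# Combine per-criterion distances into one similarity vector,
# then aggregate with (to-be-learned) weights.
def similarity_vector(e1, e2, measures):
    # one score per criterion (label, gloss, hypernyms, ...)
    return [m(e1, e2) for m in measures]

def synset_distance(vec, weights):
    # weighted linear aggregation of the individual measures
    return sum(w * v for w, v in zip(weights, vec))

# Two toy component measures (0.0 = identical, 1.0 = different).
label_m = lambda a, b: 0.0 if a["label"] == b["label"] else 1.0
pos_m   = lambda a, b: 0.0 if a["pos"] == b["pos"] else 1.0

e1 = {"label": "green card", "pos": "noun"}
e2 = {"label": "permanent resident card", "pos": "noun"}
vec = similarity_vector(e1, e2, [label_m, pos_m])   # [1.0, 0.0]
d = synset_distance(vec, [0.7, 0.3])                # placeholder weights
```
      </preformat>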
    </sec>
    <sec id="sec-2">
      <title>Conclusion</title>
      <p>In this paper we use only information available in the dictionary entries themselves, under criteria 1-9 above, to see how far dictionary-internal information goes toward reconstructing synset structure. The use of external information (instance data, language-specific parsers, MT) will follow in subsequent papers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Princeton University. "About WordNet." WordNet. Princeton University.
          <year>2010</year>
          . &lt;http://wordnet.princeton.edu&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          .
          <article-title>BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>193</volume>
          , Elsevier,
          <year>2012</year>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Carlson</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Linguistic Linked Open Data as a Source for Terminology - Quantity versus Quality</article-title>
          .
          <source>Proceedings of NordTerm</source>
          <year>2015</year>
          (to appear).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Levenshtein</surname>
            ,
            <given-names>Vladimir I.</given-names>
          </string-name>
          (
          <year>February 1966</year>
          ).
          <article-title>"Binary codes capable of correcting deletions, insertions, and reversals"</article-title>
          .
          <source>Soviet Physics Doklady</source>
          .
          <volume>10</volume>
          (
          <issue>8</issue>
          ):
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>