<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploration of Semantic Spaces Obtained from Czech Corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lubomír Krčmář</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miloslav Konopík</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karel Ježek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering</institution>
          ,
          <addr-line>University of West Bohemia, Plzeň, Czech Republic</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <abstract>
        <p>This paper focuses on semantic relations between Czech words. Knowledge of these relations is crucial in many research fields such as information retrieval, machine translation and document clustering. We obtained these relations from newspaper articles. With the help of the LSA1, HAL2 and COALS3 algorithms, many semantic spaces were generated. Experiments were conducted with various parameter settings and with different ways of preprocessing the corpus. The preprocessing included lemmatization and an attempt to use only "open class" words. The computed relations between words were evaluated using the Czech equivalent of the Rubenstein-Goodenough test. The results of our experiments can serve as a clue to whether the algorithms (LSA, HAL and COALS), originally developed for English, can also be used for Czech texts.</p>
      </abstract>
      <kwd-group>
        <kwd>Information retrieval</kwd>
        <kwd>Semantic space</kwd>
        <kwd>LSA</kwd>
        <kwd>HAL</kwd>
        <kwd>COALS</kwd>
        <kwd>Rubenstein-Goodenough test</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A great motivation for us was also the S-Space package [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The S-Space package
is a freely available collection of implemented algorithms dealing with text
corpora. LSA, HAL and COALS algorithms are included. Our paper evaluates the
applicability of these popular algorithms to Czech corpora.
      </p>
      <p>The rest of the paper is organized as follows. The following section deals with
related work. The next section describes the way we created semantic spaces
for the ČTK4 corpora. Our experiments and evaluations using the RG benchmark
are presented in section 4. In the last section we summarize our experiments and
present our future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        The principles of LSA can be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the HAL algorithm is described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
A great inspiration for us was a paper about the COALS algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where the
power of COALS, HAL and LSA is compared. The Rubenstein-Goodenough [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
        benchmark and other similar tests such as Miller-Charles [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or WordSim-353
are performed there. The famous TOEFL5 and ESL6 tests are also included in the evaluation.
      </p>
      <p>
        We also build on a paper by Paliwoda-Pękosz and Lula [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], where the
Rubenstein-Goodenough (RG) test translated into Polish was used. Alternative ways of
evaluating semantic spaces can be found in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] by Bullinaria and Levy.
      </p>
      <p>
        Different methods for judging how strongly words are related exploit lexical
databases such as WordNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In WordNet, nouns, verbs, adjectives and adverbs are
grouped into sets of synonyms called synsets. Each synset expresses
a distinct concept, and the concepts are interlinked by relations including
hypernymy, hyponymy, holonymy and meronymy. Although lexical-based methods
are popular and still under active study, we have decided to follow fully automatic
methods.
      </p>
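      <p>As a toy illustration of the synset and relation structure described above, the sketch below models a miniature hypernymy hierarchy as a plain Python dictionary. The concepts and links are invented for the example; this is not the real WordNet API.</p>
      <p>
```python
# Toy illustration (not the real WordNet API): each concept maps to its
# hypernym, i.e. the more general concept it belongs to.
hypernym = {
    "car": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "vehicle": "artifact",
    "boat": "vehicle",
}

def hypernym_chain(concept):
    """Follow hypernym links up to the most general known concept."""
    chain = [concept]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

def lowest_common_hypernym(a, b):
    """First ancestor shared by two concepts, or None if there is none."""
    ancestors_a = set(hypernym_chain(a))
    for concept in hypernym_chain(b):
        if concept in ancestors_a:
            return concept
    return None

print(hypernym_chain("car"))  # ['car', 'motor_vehicle', 'vehicle', 'artifact']
print(lowest_common_hypernym("car", "boat"))  # vehicle
```
</p>
      <p>Lexical-based relatedness measures typically exploit exactly such shared-ancestor paths.</p>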
    </sec>
    <sec id="sec-3">
      <title>Generation of Semantic Spaces</title>
      <p>
        The final form of a semantic space is determined firstly by the quality of the corpus
used [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and secondly by the choice of algorithm. The following subsection describes
the features of our corpus and the ways we preprocessed it. The
next subsection focuses on the parameter settings of LSA, HAL and COALS.
      </p>
      <p>
        3.1 Corpus and corpus preprocessing.
The ČTK 1999 corpus, which consists of newspaper articles, was used for our
experiments. The ČTK corpus is one of the largest Czech corpora we work with
in our department. For lemmatization, Hajič's tagger for the Czech language
was used [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>4 Česká Tisková Kancelář (Czech News Agency)</title>
      </sec>
      <sec id="sec-3-2">
        <title>5 Test of English as a Foreign Language</title>
      </sec>
      <sec id="sec-3-3">
        <title>6 English as a Second Language</title>
        <p>No further preprocessing of the input texts was performed. Finally, 4
different input files7 for the S-Space package were used. The first input file
contained the plain texts of the ČTK corpus. The second one contained the plain texts
without stopwords. Pronouns, prepositions, conjunctions, particles, interjections
and punctuation8 were considered stopwords in our experiments. This means
that removing stopwords from the text is, in this paper, the same as keeping
only open class words in the text. The third file contained the lemmatized texts of
the ČTK corpus, and the last file contained the lemmatized ČTK corpus without
stopwords. Statistics on the texts of the corpus are given in Table 1, and statistics
on the texts without stopwords in Table 2.
The LSA principle differs essentially from HAL and COALS. While HAL and
COALS are window-based, LSA deals with passages of text. In our case, a passage
is the whole text of an article of the ČTK corpus.</p>
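        <p>A minimal sketch of the window-based counting that HAL-style methods perform: each word accumulates weighted counts of the words preceding it, with closer neighbours weighted more heavily (weight = window size - distance + 1). The toy sentence and window size are illustrative only, and a real implementation such as the S-Space package stores a full word-by-word matrix.</p>
        <p>
```python
from collections import defaultdict

def hal_cooccurrence(tokens, window=4):
    """Sketch of HAL-style co-occurrence counting: each word receives
    ramped-weight counts from the preceding `window` words."""
    counts = defaultdict(float)  # (row_word, context_word) -> weight
    for i, word in enumerate(tokens):
        for dist in range(1, window + 1):
            j = i - dist
            if j < 0:
                break
            counts[(word, tokens[j])] += window - dist + 1
    return counts

tokens = "the horse raced past the barn".split()
counts = hal_cooccurrence(tokens, window=2)
# "raced" is 1 word after "horse" -> weight 2; "past" is 2 after -> weight 1
print(counts[("raced", "horse")], counts[("past", "horse")])
```
</p>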
      </sec>
      <sec id="sec-3-4">
        <title>7 Each input file contained every document of the corpus; one file line corresponds to one</title>
        <p>distinct document.</p>
      </sec>
      <sec id="sec-3-5">
        <title>8 Punctuation is a token rather than a word. It was removed because it is not important</title>
        <p>for the LSA algorithm.</p>
        <p>(Residue of Tables 1 and 2, whose row labels were: documents' count; tokens' count; different tokens' count; tokens' count occurring more than once; different tokens' count occurring more than once.)</p>
        <sec id="sec-3-5-5">
          <title>Parameter settings</title>
          <p>
            Both LSA and COALS exploit a non-trivial mathematical operation, SVD9, while
HAL does not. COALS simply combines some HAL and LSA principles [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ].
          </p>
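          <p>The SVD step can be sketched as follows: a term-by-document count matrix is factored and only the top k latent dimensions are kept, which is the core of the LSA reduction. The matrix and the value of k below are invented for illustration.</p>
          <p>
```python
import numpy as np

# Tiny invented term-by-document count matrix (4 terms x 4 documents).
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Full SVD; singular values come back sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of retained latent dimensions (illustrative)
term_vectors = U[:, :k] * s[:k]  # each row: one term in the reduced space

# The rank-k product approximates the original matrix.
A_k = term_vectors @ Vt[:k, :]

print(np.allclose(A, U @ np.diag(s) @ Vt))  # full SVD reconstructs A
```
</p>
          <p>Word relatedness is then measured between rows of the reduced term matrix rather than rows of the raw counts.</p>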
          <p>The S-Space package provides default settings for its algorithms, based on
previous research. The default parameter settings are shown
in Table 3. We tried changing some parameter values because our texts are
in Czech. The Czech language differs from English especially in the
number of forms of one word and in word order, which is not as strictly fixed
as in English. Therefore, there are more different terms10 in Czech language
texts. Since the algorithms are sensitive to term occurrence, this is one of
the reasons11 we tried to remove low-occurring words.</p>
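          <p>The removal of low-occurring words can be sketched as a simple frequency filter; the minimum-count threshold and the toy token list below are illustrative only.</p>
          <p>
```python
from collections import Counter

def drop_rare_tokens(tokens, min_count=2):
    """Keep only tokens occurring at least `min_count` times in the
    corpus -- the low-occurrence filtering step described above."""
    freq = Counter(tokens)
    return [t for t in tokens if freq[t] >= min_count]

tokens = ["pes", "kočka", "pes", "strom", "pes", "kočka"]
print(drop_rare_tokens(tokens))  # ['pes', 'kočka', 'pes', 'pes', 'kočka']
```
</p>
          <p>Besides shrinking the term vocabulary, this also reduces the size of the co-occurrence matrices the algorithms must process.</p>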
          <p>Another parameter we examined is HAL's window size. Since Czech yields
more distinct terms, we expected that a smaller window size would be
more appropriate.</p>
          <p>
            The last parameters we changed from their defaults were the numbers of
retained columns in HAL and COALS. We reduced the dimensionality of the spaces by
setting the reduction property to the values adopted from [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. As a consequence,
the columns with high entropy were retained. For the
COALS algorithm, the impact of SVD-based dimensionality reduction was also tested.
Several approaches to evaluating semantic spaces exist, as noted in section 2.
Unfortunately, most of the standard benchmarks are suitable only for English.
To the best of our knowledge, there is no benchmark similar to the
Rubenstein-Goodenough (RG) test or the Miller-Charles test for the Czech language.
Therefore we decided to translate the RG test into Czech.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>9 Singular Value Decomposition 10 One word in two forms means two terms in this context. 11 Another reason is to decrease the computation costs.</title>
        <p>The following subsection describes the origin of the Czech equivalent of the
RG test. The subsequent subsection presents our results on this test for the many
semantic spaces generated.</p>
        <p>4.1 Rubenstein-Goodenough test.
The RG test comprises pairs of nouns with corresponding values from 0 to 4
indicating how strongly the words in each pair are related. The strengths of the
relations were judged by 51 human subjects in 1965. There were 65 word pairs in
the original English RG test.</p>
        <p>
          The translation of the original English RG test into Czech was performed by
a Czech native speaker. The article by O'Shea et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], describing the original meanings
of the RG test's words, was exploited. The resulting translation of the test was
then corrected by 2 Czech native speakers who are involved in information retrieval.
        </p>
        <p>After our translation of the RG test into Czech, 62 pairs were left. We had
to remove the "midday-noon", "cock-rooster" and "grin-smile" pairs because we
could not find appropriate and distinct translations for both words of these
pairs in Czech. Our Czech RG test12 was evaluated by 24 Czech native speakers
of differing education, age and sex. Pearson's correlation between the Czech and
English evaluators is 0.94.</p>
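        <p>Pearson's correlation coefficient, the measure used throughout this evaluation, can be sketched in a few lines; the two rating lists below are invented mini-examples of judges scoring word pairs on the 0-4 scale.</p>
        <p>
```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length
    lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented mini-example: two judges rating four word pairs on the 0-4 scale.
print(round(pearson([4.0, 3.1, 1.2, 0.3], [3.8, 3.0, 1.5, 0.1]), 3))
```
</p>
        <p>The same coefficient is used below to compare algorithm scores with the averaged human judgments.</p>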
        <p>A particular word we removed from our test before comparing it with the
semantic spaces is "crane". The Czech translation of this word has 3 different
meanings, and only one of them was commonly known to
the people who participated in our test. Therefore, another 3 pairs disappeared:
"bird-crane", "crane-implement" and "crane-rooster". A similarly ambiguous word
is the Czech translation of "mound", which was also used in a different
meaning in the corpus. We removed it together with these 4 pairs: "hill-mound",
"cemetery-mound", "mound-shore" and "mound-stove". In the end, 55 word pairs
were left in our test.</p>
        <p>Another issue we had to face was the low occurrence of the RG test's words in
our corpus. Therefore, we tried removing the least frequent words of the RG test
one by one, and consequently the pairs in which they appear. In the end, it was
especially this step that showed us that the relations obtained from the
S-Space algorithms correlate with human judgments quite well. To evaluate which
of the semantic spaces best fits the human judgments, the standard Pearson's
correlation coefficient was used.</p>
        <p>4.2 Experiments and results.
We created many semantic spaces with the LSA, HAL and COALS algorithms.
Cosine similarity was used to evaluate how strongly two words are related in a
semantic space. Other similarity metrics did not work well.</p>
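        <p>Cosine similarity between two word vectors can be sketched as below; the toy vectors are invented stand-ins for rows of a semantic space.</p>
        <p>
```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two word vectors: 1.0 means the same
    direction, 0.0 means orthogonal (unrelated in the space)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented toy vectors for two words from a semantic space.
print(round(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]), 3))
```
</p>
        <p>Because it depends only on vector direction, cosine similarity is insensitive to differences in raw word frequency, which may explain why it outperformed other metrics here.</p>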
        <p>The obtained results for the different semantic spaces are depicted in Table
4 for the plain texts of the ČTK corpora and in Table 5 for the lemmatized texts.
The best 2 scores in Table 4 and the best 3 scores in Table 5 are highlighted for
each tested set of pairs in our RG test.
12 Available at http://home.zcu.cz/~lkrcmar/RG/RG-ENxCZ.pdf</p>
        <p>(Flattened table rows; each row names one semantic-space configuration followed by its correlation scores for the tested sets of pairs:
LSA m2: 0.26, 0.25, 0.27, 0.33, 0.35, 0.36, 0.24;
N LSA: 0.28, 0.28, 0.29, 0.33, 0.33, 0.33, 0.16;
N LSA m2: 0.27, 0.26, 0.29, 0.33, 0.30, 0.32, 0.11;
HAL m2: 0.20, 0.19, 0.24, 0.28, 0.25, 0.24, 0.14;
HAL m2 s1: 0.12, 0.11, 0.18, 0.19, 0.14, 0.06, 0.04;
HAL m2 s2: 0.17, 0.18, 0.25, 0.30, 0.25, 0.18, 0.15;
N HAL m2: 0.36, 0.38, 0.39, 0.43, 0.43, 0.44, 0.44;
N HAL m2 s1: 0.39, 0.41, 0.43, 0.47, 0.46, 0.48, 0.53;
N HAL m2 s2: 0.40, 0.42, 0.44, 0.48, 0.48, 0.49, 0.53;
COALS m2: 0.43, 0.45, 0.48, 0.52, 0.54, 0.57, 0.62;
COALS m2 d2: 0.28, 0.30, 0.30, 0.35, 0.38, 0.39, 0.42;
COALS m2 d4: 0.17, 0.18, 0.18, 0.19, 0.21, 0.27, 0.32;
N COALS m2: 0.42, 0.43, 0.46, 0.50, 0.53, 0.54, 0.59;
N COALS m2 d2: 0.31, 0.27, 0.25, 0.35, 0.31, 0.23, 0.34;
N COALS m2 d4: 0.43, 0.44, 0.45, 0.50, 0.51, 0.51, 0.57.)</p>
        <p>It turned out that we do not have to take into account words which occur only
once in our corpora. Omitting them saves computing time without a negative
impact on the results. This is why most of our semantic spaces are computed
with words occurring only once omitted.</p>
        <p>
          The effect of omitting stopwords is very small for the LSA and COALS
algorithms. However, the scores of the HAL algorithm are affected considerably (compare HAL
and N HAL in Tables 4 and 5). This difference can be explained by the
fact that LSA does not use any window and works with whole texts. The COALS
algorithm may profit from using the correlation principle [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] that helps it to deal
with stopwords.
        </p>
        <p>The scores in our tables show that especially the COALS method is very
successful. The best scores are achieved by COALS for the plain texts, and
COALS scores for the lemmatized texts are also among the best ones (compare
Table 4 and 5).</p>
        <p>The HAL method is also very successful. Furthermore, the best score of 0.72
is obtained using the HAL method on lemmatized data without stopwords (see
Table 5). It turns out that HAL even outperforms COALS when only pairs
containing very common words are left. On the other hand, this shows the
strength of COALS when low-occurring words in our corpora are also considered.</p>
        <p>
          It turned out that the LSA algorithm is not as effective as the other
algorithms in our experiments. Our hypothesis is that the scores of LSA would be better
when experimenting with larger corpora such as the one used by Rohde et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. However, the LSA
scores also improve when only common words are considered. Figure 1 shows
the performance of the 3 tested algorithms for the best settings found.
        </p>
        <p>
          Our results differ from the scores of tests evaluated on English corpora
by Rohde et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Their scores for HAL are much lower than ours. On the other
hand, their scores for LSA are higher. Therefore, we believe that the performance
of the algorithms is language dependent.
        </p>
        <p>The last figure in our paper, Figure 2, compares human and HAL judgments of
the relatedness of 14 pairs containing the most common words from the RG word
list in the ČTK corpora. The English equivalents of the Czech word pairs are
listed in Table 6. In the graph we can see the pairs which spoil the scores of the
tested algorithms. The graph also shows the differences between human and
machine judgments. The pair "automobile-car" is rated as less related than
"food-fruit" by the algorithms, but not by humans. On the other hand, the words
of the pair "coast-shore" are more related for our algorithms than for humans.</p>
        <p>(Figure residue; x-axis, count of omitted pairs: 10, 14, 19, 24, 27, 29, 32, 35, 37, 44, 51.)</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>Our experiments showed that the HAL and COALS algorithms performed well,
and better than LSA, on the Czech corpora. Our hypothesis, based on our results, is
that COALS semantic spaces are more accurate for low-occurring words, while
semantic spaces generated by HAL are more accurate for pairs of words with
higher occurrence. Our experiments show that lemmatization of the corpora is
an appropriate way to improve the scores of the algorithms. Furthermore, the
best correlation scores were achieved when only the "open class" words were
used.</p>
      <p>(Figure residue: Figure 1 plotted correlation, 0.00 to 0.80, for N_COALS_m2, N_HAL_m2_s1 and LSA; Figure 2 plotted relatedness, 0 to 3.5, for HAL and Human judgments.)</p>
      <p>(Table 6: pairs containing the most common words in the ČTK corpora, with English equivalents:
ústav - ovoce (asylum - fruit); bratr - chlapec (brother - lad);
ovoce - pec (fruit - furnace); jízda - plavba (journey - voyage);
pobřeží - les (coast - forest); jídlo - ovoce (food - fruit);
úsměv - chlapec (grin - lad); auto - jízda (car - journey);
pobřeží - kopec (coast - hill); pobřeží - břeh (coast - shore);
ústav - hřbitov (asylum - cemetery); kluk - chlapec (boy - lad);
břeh - plavba (shore - voyage); automobil - auto (automobile - car).)</p>
      <p>It turned out that the translation of the original English RG test was not
entirely appropriate for our Czech corpora, because it contains words which are
not common in the corpora. However, we believe that removing the pairs
containing low-occurring words improved the applicability of the test. The
evidence for this is the observed dependency of the scores of the tested
algorithms on the omission of pairs with low-occurring words.</p>
      <p>We believe that semantic spaces are applicable to the query expansion task,
which we will focus on in our future work. Apart from this, we are attempting
to obtain some larger Czech corpora for our experiments. We also plan to
continue testing the HAL and COALS algorithms, which performed well during our
experiments.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>The work reported in this paper was supported by the Advanced Computer and
Information Systems project no. SGS-2010-028. The access to the MetaCentrum
supercomputing facilities provided under the research intent MSM6383917201 is
also highly appreciated. Finally, we would like to thank the Czech News Agency
for providing text corpora.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          (
          <year>1954</year>
          ).
          <article-title>Distributional structure</article-title>
          . (J. Katz, Ed.)
          <source>Word Journal Of The International Linguistic Association</source>
          ,
          <volume>10</volume>
          (
          <issue>23</issue>
          ),
          <fpage>146</fpage>
          -
          <lpage>162</lpage>
          . Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Jurgens</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>The S-Space Package: An Open Source Package for Word Space Models</article-title>
          .
          <source>In Proceedings of the ACL 2010 System Demonstrations.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foltz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Laham</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>An introduction to latent semantic analysis</article-title>
          .
          <source>Discourse Processes</source>
          ,
          <volume>25</volume>
          (
          <issue>2</issue>
          ),
          <fpage>259</fpage>
          -
          <lpage>284</lpage>
          . Routledge.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lund</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Burgess</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>Producing high-dimensional semantic spaces from lexical co-occurrence</article-title>
          .
          <source>Behav Res Methods Instrum Comput</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ),
          <fpage>203</fpage>
          -
          <lpage>208</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Rohde</surname>
            ,
            <given-names>D. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonnerman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Plaut</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>An improved method for deriving word meaning from lexical co-occurrence</article-title>
          .
          <source>Cognitive Science.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rubenstein</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Goodenough</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>1965</year>
          ).
          <article-title>Contextual correlates of synonymy</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>8</volume>
          (
          <issue>10</issue>
          ),
          <fpage>627</fpage>
          -
          <lpage>633</lpage>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Charles</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>1991</year>
          ).
          <article-title>Contextual Correlates of Semantic Similarity</article-title>
          .
          <source>Language &amp; Cognitive Processes</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          . Psychology Press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Paliwoda-Pękosz</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lula</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Measures of Semantic Relatedness Based on Wordnet</article-title>
          . In:
          <source>International workshop for PhD students</source>
          ,
          <source>Brno, 2009. ISBN 978-80-214-3980-1</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Bullinaria</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Extracting semantic representations from word co-occurrence statistics: a computational study</article-title>
          .
          <source>Behavior Research Methods</source>
          ,
          <volume>39</volume>
          (
          <issue>3</issue>
          ),
          <fpage>510</fpage>
          -
          <lpage>526</lpage>
          . Psychonomic Society Publications.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G. A.</given-names>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>WordNet: a lexical database for English</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>38</volume>
          (
          <issue>11</issue>
          ),
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hajič</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , A. Böhmová, E. Hajičová, B. Vidová Hladká:
          <article-title>The Prague Dependency Treebank: A Three-Level Annotation Scenario</article-title>
          . In A. Abeillé (ed.):
          <source>Treebanks: Building and Using Parsed Corpora</source>
          . pp.
          <fpage>103</fpage>
          -
          <lpage>127</lpage>
          . Amsterdam, The Netherlands: Kluwer,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>O'Shea</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandar</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crockett</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>McLean</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Pilot Short Text Semantic Similarity Benchmark Data Set: Full Listing and Description</article-title>
          . Computing.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>