=Paper= {{Paper |id=None |storemode=property |title=Giving a Sense: A Pilot Study in Concept Annotation from Multiple Resources |pdfUrl=https://ceur-ws.org/Vol-1422/88.pdf |volume=Vol-1422 |dblpUrl=https://dblp.org/rec/conf/itat/SudarikovB15 }} ==Giving a Sense: A Pilot Study in Concept Annotation from Multiple Resources== https://ceur-ws.org/Vol-1422/88.pdf
J. Yaghob (Ed.): ITAT 2015 pp. 88–94
Charles University in Prague, Prague, 2015



Giving a Sense: A Pilot Study in Concept Annotation from Multiple Resources

                                                Roman Sudarikov and Ondřej Bojar

                                                      Charles University in Prague
                                                  Faculty of Mathematics and Physics
                                              Institute of Formal and Applied Linguistics
                                        Malostranské náměstí 25, 11800 Praha 1, Czech Republic
                                                  http://ufal.mff.cuni.cz/
                                          {sudarikov,bojar}@ufal.mff.cuni.cz

Abstract: We present a pilot study in web-based annotation of words with senses coming from several knowledge bases and sense inventories. The study is the first step in a planned larger annotation of “grounding” and should allow us to select a subset of these “dictionaries” that seem to cover any given text reasonably well and show an acceptable level of inter-annotator agreement.

Keywords: word-sense disambiguation, entity linking, linked data


1 Introduction

Annotated resources are very important for training, tuning or evaluating many NLP tasks. Equipped with experience in treebanking, we now move to resources for word sense disambiguation (WSD) and entity linking (EL). By EL, we mean the task of attaching a unique ID from some database to occurrences of (named) entities in text [1]. Both entity linking and word-sense disambiguation have been extensively studied, see for example [2-4]. Although only a few studies consider several knowledge bases and sense inventories at once [1, 5], the convergence between these two tasks is apparent; for example, the 2015 SemEval Task 13 promoted research in the direction of joint word sense and named entity disambiguation [6].

We understand the terms ontology, knowledge base and sense inventory in the following way:

   • Ontology is a formal representation of a domain of knowledge. It is an abstract entity: it defines the vocabulary for a domain and the relations between concepts, but an ontology says nothing about how that knowledge is stored (as a physical file, in a database, or in some other form), or indeed how the knowledge can be accessed.

   • Knowledge base is a database, a repository of information that can be accessed and manipulated in some predefined fashion. Knowledge is stored in a knowledge base according to an ontology.

   • Sense inventory is a database, often built from a corpus, that provides clustered senses for the words or expressions in the corpus.

However, we recognize the blending of knowledge bases and sense inventories, so we will use the very generic terms dictionary or resource interchangeably for either of them.

In this pilot study, we examine several such dictionaries in terms of their coverage and annotator agreement. Unlike other works on “grounding”, which try to link only the most important words in the sentence [7, 8], we aim at complete coverage of a given text, i.e. all content words or multi-word expressions regardless of their part of speech or role in the sentence. Some of the examined resources have a clear bias towards some parts of speech; for example, valency dictionaries cover only verbs. We nevertheless ask our annotators to annotate even across parts of speech if the matching POS is not included in the resource. For instance, verbs can get nominal entries in Wikipedia and nouns get verb frames. (The conversion of nouns to predicates whenever possible is explicitly demanded in some frameworks, e.g. in Abstract Meaning Representation, AMR [9].)

In Section 2, we describe the sense inventories included in our experiment. Section 3 provides a unifying view on these sources and introduces our annotation interface. We conducted two experiments with English and Czech texts using the interface, slightly adapting the interface for the second run. Details are in Sections 4 and 5.


2 Resources Included

Sense inventories and knowledge bases are plentiful and they differ in many aspects, including domain coverage, level of detail, frequency of updates, integration of other resources and ways of accessing them. Some of them implement the Resource Description Framework, the metadata model designed by the W3C for better data representation in the Semantic Web, while others are simply collections of links on the web.

We selected the following subset of general resources for our experiment:

BabelNet [10] is a multilingual knowledge base which combines several knowledge resources including Wikipedia, WordNet, OmegaWiki and Wiktionary. The sources are automatically merged and accessible via an offline Java API or an online REST API. An added benefit is the multilinguality of BabelNet: the same resource can be used for genuine (as opposed to cross-lingual) annotation in both languages of our interest, English and Czech.




Figure 1: Annotation interface, annotating the words “DELETE key” in the sentence “Move the mouse cursor. . . ” with
Google Search “senses”.


The main limitation is that BabelNet is not updated continuously, so we also added live Wikipedia and Wiktionary as separate sources. BabelNet provides information about nouns, verbs, adjectives and adverbs, but as stated above, we are also interested in cross-POS annotation.

Wikipedia (http://wikipedia.org) is currently the biggest online encyclopedia, with live updates from (hundreds of) thousands of contributors, so it can cover new concepts very quickly. Wikipedia tries to nest all possible concepts as nouns. For example, en.wikipedia.org/wiki/funny redirects to the page “Humour”.

Wiktionary (http://wiktionary.org) is a companion to Wikipedia that covers all parts of speech. It includes a multilingual thesaurus, phrase books and language statistics. Each word in Wiktionary can have etymology, pronunciation, sample quotations, synonyms, antonyms and translations, for a better understanding of the word.

PDT-VALLEX and EngVallex (valency lexicons for Czech and English): Valency or subcategorization lexicons formally capture verb valency frames, i.e. their syntactic neighborhood in the sentence [11, 12]. We use the valency lexicons for Czech and English in their offline XML form as distributed with the tree editor TrEd 2.0 (http://ufal.mff.cuni.cz/tred/).

Google Search (GS, http://google.com): From our preliminary experiments, we had the impression that no resource covers all expressions seen in our data, but searching the web provides some explanation almost always. We thus include the top ten results returned by Google Search as a special kind of dictionary, where the “concept” is a query string and each result is considered to be its “sense”.

Aside from coverage and frequency of updates, another reason to include GS is that it provides “senses” at a very different level of granularity than the other resources. For instance, a whole Wiktionary page can appear as one of the options among GS “senses”. It will also often be a very sensible choice, even though it actually covers several different meanings of the word.

We find the task of matching senses coming from different ontologies and providing different angles of view or granularity very interesting. The current experiments serve as a basis for its further investigation.


3 Annotation Interface

To provide a unified view on the various resources, we use the terms query, selection list and selection. Given an expression in a text, which can be a word or a phrase, even a non-continuous one, and a resource which should be used to annotate it, the system constructs a query. Querying the resource, we get a selection list, i.e. a list of possible senses. The process of extracting the selection list depends on the resource: it is straightforward for Google Search (each result becomes an option) and complicated for Wiktionary, see Section 3.1 below.
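To make the terminology concrete, the following minimal Python sketch (our own illustration, not the code of the annotation tool; the class names and example values are made up) models a query, the selection list it yields, and how the two relate:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Query:
        resource: str   # e.g. "Wikipedia", "BabelNet", "Google Search"
        text: str       # lemma(s) of the selected word(s); the annotator may edit this

    @dataclass
    class SelectionList:
        query: Query
        options: List[str]   # candidate senses extracted from the resource

    # One annotation step: a query against one resource yields a selection list
    # from which the annotator later picks zero or more senses.
    q = Query(resource="Wikipedia", text="mouse cursor")
    senses = SelectionList(query=q, options=["Cursor (user interface)", "Mouse (computing)"])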

                                                                                    One or more senses selected
                      Source          Total   Whole page   Bad List       None   1       2      3      4 or more
                      Babelnet        28          -           1             3    23       1      0          0
                      Google Search   71          -           1             9    36      15      5          5
                      CS Vallex       38          -           0             2    29       6      1          0
                      EN Vallex       19          -           1             0    18       0      0          0
                      CS Wikipedia    38          -           9            12    15       1      0          1
                      EN Wikipedia    114         -          26            16    63       3      0          6
                      CS Wiktionary   21          -           1             3     7       4      5          1
                      EN Wiktionary   21          -           0             0    18       2      1          0
                      Babelnet        98         24           0            10    54       6      2          2
                      Google Search   93          0           0            26    19      16     11         21
                      EN Vallex       15          4           0             3     6       2      0          0
                      EN Wikipedia    103        23           7            36    35       2      0          0
                      EN Wiktionary   98         17          23             4    40       9      2          3


            Table 1: Selection statistics, the first (upper part) and second (lower part) annotation experiments


In principle, and to allow including any conceivable resource, even field-specific or ad hoc ones, the annotator should be free to select the selection list prior to the annotation. Our annotation interface allows the annotator to overwrite the query for cases where the automatic construction does not lead to a satisfactory selection list.

Finally, the annotator is presented with the selection list to make his or her choice (or multiple choices). Overall, the annotator picks one of these options:

Whole Page means that the current URL is already a good description of the sense and no selection list is available on the page. The annotators were asked to change the query and rather obtain a selection list (e.g. a disambiguation page in Wikipedia) whenever possible.

Bad List means that the extraction of the selection list failed to provide correct senses. The annotators were supposed to try changing the query to obtain a usable list and resort to the “Bad List” option only if inevitable.

None indicates that the selection list is correct but that it lacks the relevant sense.

One or more senses selected is the desired annotation: the list, for the particular pair of selected word(s) and selected resource, was correct and the annotator was able to find the relevant sense(s) in the list.

Our annotation interface (Figure 1) shows the input sentence, tabs for the individual sense inventories, the selection list from the current resource and also the complete page where the selection list comes from. The procedure is straightforward: (1) select one or more words in the sentence using checkboxes, (2) select a resource (we asked our annotators to use them all, one by one), (3) check if the selection list is OK and modify the query if needed, (4) make the annotation choice by marking one or more of the checkboxes in the selection list, and (5) save the annotation.

3.1 Queries and Selection Lists for Individual Resources

This is how we construct queries and extract selection lists for each of our dictionaries, given one or more words from the annotated sentence (see the sketch after this list for the Wikipedia case):

BabelNet: We search BabelNet for the lemma of the selected word (or the phrase of lemmas if more words are selected). The selection list is the list of all obtained BabelNet IDs.

Google Search: We search for the lemmas of the selected words and return the snippets of the top ten results. The selection list is the list of the snippets’ titles.

Wikipedia: We search for the disambiguation page for the selected words and, if not found, we search for the page with the title matching the lemmas of the selected words. The selection list for disambiguation pages is constructed by fetching hyperlinks appearing within listings nested in particular HTML blocks. For other pages we fetch links from the Table of Contents and the first hyperlink from each listing item.

Wiktionary: We search for the page with the title equal to the lemmas of the selected words. The selection list is created using the same heuristics as for Wikipedia.

Vallex: We scan the XML file and return all the frames belonging to the verb with the lemma matching the selected word’s lemma.
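The Wikipedia heuristic above can be sketched roughly as follows (a simplified illustration, not the production code of our interface; it assumes the current HTML layout of English Wikipedia pages and uses the requests and BeautifulSoup libraries):

    import requests
    from bs4 import BeautifulSoup

    def wikipedia_selection_list(lemmas):
        """Return candidate senses for the given lemmas from English Wikipedia.

        Prefer the disambiguation page; otherwise fall back to the page whose
        title matches the lemmas, harvesting the first link of each list item.
        """
        base = "https://en.wikipedia.org/wiki/"
        for title in (lemmas + " (disambiguation)", lemmas):
            resp = requests.get(base + title.replace(" ", "_"))
            if resp.status_code != 200:
                continue
            soup = BeautifulSoup(resp.text, "html.parser")
            options = []
            # Collect the first hyperlink of every list item in the article body.
            for li in soup.select("div.mw-parser-output li"):
                link = li.find("a")
                if link is not None and link.get("title"):
                    options.append(link["title"])
            if options:
                return options
        return []

In the interface, such candidates are offered as checkboxes next to the full page, so the annotator can inspect the context of each option.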
4 First Experiment

The first experiment was held in March 2014. The 7 participating annotators (none of whom had any experience in annotation tasks) were asked to annotate sentences from PCEDT 2.0 (http://ufal.mff.cuni.cz/pcedt2.0/en/index.html) with Czech and English sources: Wikipedia and Wiktionary for both languages, BabelNet, Google Search, and the Czech and English Vallexes.




Figure 2: Annotations from a given dictionary in the first experiment broken down by part of speech of the annotated
words.


Each annotator was given a set of sentences in English or Czech, and they were asked to annotate as many words or phrases in each sentence as possible, with as many reasonable meanings as they could. We required the annotators to annotate across parts of speech where possible (for instance to annotate the noun “teacher” with the corresponding verb “to teach”). This requirement appeared because we wanted to evaluate the possibility of using more abstract senses as used, for instance, in works on AMR.

4.1 Gathered Annotations

In total, we collected 507 annotations for 158 units; 75 of these units had more than one annotation.

The upper part of Table 1 provides details on how often each of the annotation options was picked for a given source in the first annotation experiment. Note that in the first experiment, we did not offer the “Whole Page” option.

We see that the sources exhibit slightly different patterns of use. Wikipedia has a lot of “Bad List” options selected due to the issue described in Section 4.2. GS is the most ambiguous resource: the annotators picked two or more senses in about one half of the GS annotations. The highest number of “Bad Lists” was received by the English Wikipedia (18 out of 40).

Figure 2 shows the distribution of parts of speech per source. Google Search seems to be the most versatile resource, covering all parts of speech well. The relatively low use of BabelNet was due to the web API usage limit. The Vallexes work well for verbs, but cross-POS annotation is only an exception. Wikipedia and Wiktionary are indeed somewhat complementary in the covered POSes.

4.2 Bad List vs. None Issue

The “Bad List” annotation should be used in two cases: (1) when the system fails to extract the selection list from a good page, and (2) when the whole page is wrong, for example when the system shows the Wikipedia page “South Africa” for the word “south”. “None” was meant for correct selection lists (matching domain, reasonable options) where the right option is missing. The guidelines for the first experiment were not very clear on this, so some annotators marked problems with the selection list as “Bad List” and some used the label “None”.

Manual revision revealed that only 10 out of 40 “Bad List” annotations were indeed “Bad List” in one of the two meanings described above. The right-hand part of Table 2 shows the IAA after changing the wrongly annotated “Bad Lists” into “None”.

                      Source          Annotations   2-IAA   Annotations   2-IAA
                      Babelnet            29         0.69        29        0.69
                      GS                 120         0.24       120        0.24
                      CS Vallex           46         0.58        46        0.58
                      EN Vallex           19         1.00        19        1.00
                      CS Wikipedia        47         0.32        43        0.35
                      EN Wikipedia       183         0.05       181        0.10
                      CS Wiktionary       38         0.29        38        0.35
                      EN Wiktionary       25         0           25        0
                      Total:             507         0.21       501        0.24

Table 2: Inter-annotator agreement in the first experiment, before (left) and after (right) the “Bad List” fix.

4.3 Inter-Annotator Agreement

Inter-annotator agreement is a measure of how well two annotators make the same annotation decision for a certain item. In our case it is measured as the percentage of cases in which a pair of annotators (2-IAA) agree on the (set of) senses for a given annotation unit. The measurement was made pairwise over all the annotations which had more than one annotator. The results are presented in Table 2, before and after fixing the “Bad List” issue.
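A minimal sketch of this pairwise computation, under our reading of the description above (the data layout and names are our own):

    from itertools import combinations

    def pairwise_iaa(units):
        """units: a list with one dict per annotation unit, mapping each
        annotator to the frozenset of senses he or she selected.

        Returns the fraction of annotator pairs, over all multiply-annotated
        units, that selected exactly the same set of senses."""
        agree = total = 0
        for per_annotator in units:
            if len(per_annotator) < 2:
                continue  # units annotated by a single person are skipped
            for a, b in combinations(per_annotator.values(), 2):
                total += 1
                agree += (a == b)
        return agree / total if total else 0.0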




Figure 3: Annotations from a given dictionary in the second experiment broken down by part of speech of the annotated
words.


In general, the IAA estimates should be treated with caution. Many units were assigned to only a single annotator, so they were not taken into account while computing the IAA.

The extremely low IAA for the English Wikipedia was caused by the following issue. For several units, one annotator tried to select all the senses to show that the whole page can be used, while the others picked one or only a few senses. We resolved the issue by introducing the new option “Whole page” in the second experiment.

Interestingly, we see a negative correlation (Pearson correlation coefficient of -0.37) between the number of units annotated for a given source and the 2-IAA.

We also report Cohen's kappa [13], which reflects the agreement when disregarding agreement by chance. In our setting, we estimate the agreement by chance as one over the length of the selection list plus two (for “None” and “Bad List”). This is a conservative estimate; in principle, the annotators were allowed to select any subset of the selection list. We compute kappa as K = (P_a - P_e) / (1 - P_e), where P_a was the total 2-IAA and P_e was the arithmetical average of the agreements by chance over all annotations. Kappa for the first experiment was 0.13.
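For concreteness, the same kappa in code, with the chance agreement estimated as stated above (a sketch; the selection-list lengths in the comment are invented):

    def cohen_kappa(p_a, selection_list_lengths):
        """Kappa with per-annotation chance agreement 1/(list length + 2),
        averaged over all annotations; p_a is the observed 2-IAA."""
        p_e = sum(1.0 / (n + 2) for n in selection_list_lengths) / len(selection_list_lengths)
        return (p_a - p_e) / (1.0 - p_e)

    # e.g. cohen_kappa(0.21, [10, 5, 8, 12]) is roughly 0.12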
To assess the level of uncertainty of these estimates, we use bootstrap resampling with 1000 resamples, which gives us an IAA of 0.25 ± 0.1 and a kappa of 0.135 ± 0.115 for 95% of the samples.
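The interval can be obtained along these lines (a sketch; pairwise_iaa is the helper sketched in Section 4.3 above and the unit list has the same assumed structure):

    import random

    def bootstrap_iaa(units, resamples=1000, seed=0):
        """95% central bootstrap interval of the pairwise IAA over units."""
        rng = random.Random(seed)
        scores = []
        for _ in range(resamples):
            sample = rng.choices(units, k=len(units))  # resample units with replacement
            scores.append(pairwise_iaa(sample))
        scores.sort()
        return scores[int(0.025 * resamples)], scores[int(0.975 * resamples)]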
5 Second Experiment

The second experiment was held in March 2015 with another group of 6 annotators. One of the annotators had experience in annotation tasks, while the others had no such experience. The setting of the experiment was slightly different. The annotators were asked to annotate only English sentences from the QTLeap project (http://qtleap.eu/) using BabelNet, Google Search, English Wikipedia, English Wiktionary and EngVallex. The guidelines were refined, asking the annotators to mark the largest possible span for each concept in the sentence, e.g. to annotate “mouse cursor” jointly as one concept and not separately as “computer pointing device” for the word “mouse” and “graphic representation of the computer mouse on the screen” for the word “cursor”. The option “Whole page” was newly introduced to help users indicate that the whole page can be used as a sense.

5.1 Gathered Annotations

We collected 570 annotations for 35 words, 32 of which had annotations from more than one annotator. The number of units here is lower than in the first experiment because all our annotators used the same sentences. Also, for the second experiment we required the annotators to use all the resources for each unit, so we have more results per unit.

During the second experiment, the system processed 147 unique queries (in terms of selected word(s) and selected resource). All the resources received a nearly equal number of queries (about 30), except for Vallex, which received only 10. The annotators changed the queries 59 times, but this also includes cases when Wikipedia used its own inner redirects, which our system did not distinguish from users’ changes. The BabelNet query was changed 9 times, Google Search 2 times, Vallex 8 times, Wikipedia 21 times and Wiktionary 19 times. Based on these numbers, GS may seem more reliable, but this is not necessarily true. One reason is that some of the changes for Wikipedia were made automatically by Wikipedia itself. The other argument is that users could limit their effort: after examining the first 10 GS results for a query, they could simply pick the “Bad List” option and move on, without trying to change the query.

The POS-per-source distribution (see Figure 3) in the second experiment is similar to the first one, except for BabelNet, which did not reach any technical limit this time and was therefore used more often across all POSes.

                                                   Content words                       Annotators
                         Source                Attempted Labeled        A1       A2    A3     A4     A5     A6
                         Babelnet                100%        91%       53%      20%   67%    66%    79%    40%
                         GS                      100%         85%      50%      13%   53%    46%    76%    20%
                         Vallex                   32%         26%      7%       6%    10%    40%     0%     0%
                         Wikipedia               100%         58%      39%      20%   35% 53%       50%    26%
                         Wiktionary              100%         88%      53%      20%   32%    40%    76%    26%
                         Total content words       34          34       28       15    28     15     34     15


Table 3: Coverage per content word (second experiment). The left part reports the union across annotators, the right part
reports the percentage of content words receiving a valid label (Labeled) for each annotator separately.

5.2 Coverage

In Table 3, we show the coverage of content words in the second experiment. By content words we mean all the words in the sentence except for auxiliary verbs, punctuation, articles and prepositions. The instructions asked the annotators to annotate all content words. Each annotator completed a different number of sentences, so the number of annotated words differs. The column “Attempted” shows the total number of content words with some annotation at all, while “Labeled” counts words which received some sense, not just “None” or “Bad List”. Both numbers are taken from the union over all annotators. BabelNet gets the best coverage in terms of Labeled annotations. The right-hand side of the table shows how many words each annotator labeled. Since the union is considerably higher than the most productive annotator, we need to ask an important question: how many annotators do we need to have a perfect coverage of the sentence?
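The union and per-annotator numbers in Table 3 can be computed along these lines (a sketch; the data layout is our own assumption):

    def coverage(content_words, labeled_per_annotator):
        """content_words: set of content words in the annotated sentences.
        labeled_per_annotator: dict mapping each annotator to the set of
        content words that received a real sense (not "None"/"Bad List")."""
        union = set().union(*labeled_per_annotator.values())
        per_annotator = {a: len(words & content_words) / len(content_words)
                         for a, words in labeled_per_annotator.items()}
        return len(union & content_words) / len(content_words), per_annotator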
5.3 Inter-Annotator Agreement

The results presented in Table 4 are overall better than in the first experiment. The kappa was computed as in Section 4.3 with only one difference: we added 3 instead of 2 options when estimating the local probability of agreement by chance (for the new “Whole Page” option). Kappa for the second experiment was 0.40. Bootstrapping showed an IAA of 0.39 ± 0.055 and a kappa of 0.32 ± 0.06 for the central 95% of the resamples. Again, the 2-IAA is negatively correlated with the number of units annotated (Pearson correlation coefficient -0.22).

            Source        Annotations   2-IAA
            Babelnet          114        0.49
            GS                217        0.45
            Vallex             17        0.60
            Wikipedia         105        0.61
            Wiktionary        117        0.28
            Total:            570        0.46

Table 4: Inter-annotator agreement, second experiment


6 Discussion

Comparing the first and second experiments, one can see that we managed to improve the IAA by expanding the set of available options and refining the instructions, but the IAA is still not satisfactory.

For the resources where IAA reaches 60% (Vallex and Wikipedia), the coverage is rather low, 26% and 58%. BabelNet gives the best coverage but suffers in IAA. Google Search seems an interesting option for its versatility across parts of speech, on par with established knowledge bases like BabelNet in terms of inter-annotator agreement but with much more ambiguous “senses”. The cross-POS annotation does not seem very effective in practice, but a more thorough analysis is desirable.


7 Comparison with Other Annotation Tools

Several automatic systems for sense annotation are available. Our dataset could be used to compare them empirically on the annotations from the respective repository used by each of the tools. For now, we provide only an illustrative comparison of three systems: TAGME (http://tagme.di.unipi.it/), DBpedia Spotlight (http://dbpedia-spotlight.github.io/demo/) and Babelfy (http://babelfy.org/).

Figure 4 provides an example of our manually collected annotations for the sentence “Move the mouse cursor to the beginning of the blank page and press the DELETE key as often as needed until the text is in the desired spot.”

For this sentence, the TAGME system with default settings returned three entities (“mouse cursor”, “DELETE key” and “text”). DBpedia Spotlight with default settings (confidence level = 0.5) returned one entity (“mouse”). Babelfy showed the best result among these systems in terms of coverage, failing to recognize only the verb “move” and the adverbs “often” and “until”, but it also provided several false meanings for the found entities.


8 Conclusion

In this paper, we examined how different dictionaries can be used for entity linking and word sense disambiguation.

             BabelNet                     Wikipedia                               TAGME               Spotlight           Babelfy
 Move        bn:00087012v,bn:00090948v    Motion_(physics)                        -                   -                   -
             bn:00056033n, bn:00056155n
 mouse       bn:00021487n,bn:00090942v    mouse_(disambiguation), mouse_cursor    Mouse_(computing)   Mouse_(computing)   bn:00024529n,bn:00021487n
             bn:00024529n
 cursor      bn:00024529n                 mouse_cursor, cursor_(disambiguation)   mouse_cursor        -                   bn:00024529n
 beginning   bn:00009632n,bn:00009633n    beginning, beginning_(disambiguation)   -                   -                   bn:00083340v
             bn:00009634n,bn:00009635n
 blank       bn:00098524a                 blank_page_(disambiguation)             -                   -                   bn:01161190n,bn:00098524a
 page        bn:00060158n                 blank_page_(disambiguation)             -                   -                   bn:01161190n,bn:00060158n
 press       bn:00091988v,bn:00091986v    press_(disambiguation)                  -                   -                   bn:00046094n
 DELETE      bn:01208543n                 Delete_key, DELETE                      Delete_key          -                   bn:01208543n, bn:00045088n
 key         bn:01208543n, bn:00048996n   Delete_key, key_(disambiguation)        Delete_key          -                   bn:01208543n, bn:00048985n
 often       bn:00114048r, bn:00115452r   often                                   -                   -                   -
             bn:00116418r
 needed      bn:00107194a                 Need_(disambiguation)                   -                   -                   bn:00082822v
 until       -                            until                                   -                   -                   -
 text        bn:00076732n                 text_(disambiguation)                   Plain_text          -                   bn:00076732n
 desired     bn:00100580a, bn:00026550n   Desire_(disambiguation), desired        -                   -                   bn:00086682v
             bn:00100607a
 spot        bn:00062699n                 spot_(disambiguation)                   -                   -                   bn:00062699n



Figure 4: Our BabelNet and Wikipedia manual annotations and outputs of three automatic sense taggers for the sentence
“Move the mouse cursor to the beginning of the blank page and press the DELETE key as often as needed until the text is
in the desired spot.” Overlap indicated by italics (BabelNet and Babelfy) and bold (Wikipedia and TAGME).


In our unifying view, based on finding the best “selection list” and selecting one or more senses from it, we tested standard inventories like BabelNet or Wikipedia, but also Google Search.

We proposed and refined annotation guidelines in two consecutive experiments, reaching an average inter-annotator agreement of about 46%, with Wikipedia and Vallex up to 60%. Higher agreement seems to go together with lower coverage, but further investigation is needed for confirmation and to find the best balance of granularity, coverage and versatility among the existing sources.


Acknowledgements

This research was supported by the grant FP7-ICT-2013-10-610516 (QTLeap). This research was partially supported by SVV project number 260 224. This work has been using language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).


References

 [1] Demartini, G., et al.: ZenCrowd: Leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: Proceedings of the 21st International Conference on World Wide Web, ACM (2012) 469–478
 [2] Bennett, P.N., et al.: Report on the sixth workshop on exploiting semantic annotations in information retrieval (ESAIR’13). In: ACM SIGIR Forum. Volume 48., ACM (2014) 13–20
 [3] Ratinov, L., et al.: Local and global algorithms for disambiguation to Wikipedia. In: Proc. of ACL/HLT, Volume 1. (2011) 1375–1384
 [4] Navigli, R.: Word sense disambiguation: A survey. ACM Comput. Surv. 41(2) (February 2009) 10:1–10:69
 [5] Pereira, B.: Entity linking with multiple knowledge bases: An ontology modularization approach. In: The Semantic Web – ISWC 2014. Springer (2014) 513–520
 [6] Moro, A., Navigli, R.: SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking. In: Proc. of SemEval-2015. (2015) In press.
 [7] Ferragina, P., Scaiella, U.: TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In: Proc. of CIKM, ACM (2010) 1625–1628
 [8] Zhang, L., Rettinger, A., Färber, M., Tadić, M.: A comparative evaluation of cross-lingual text annotation techniques. In: Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Springer (2013) 124–135
 [9] Banarescu, L., et al.: Abstract Meaning Representation for Sembanking (2013)
[10] Navigli, R., Ponzetto, S.P.: BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193 (2012) 217–250
[11] Žabokrtský, Z., Lopatková, M.: Valency information in VALLEX 2.0: Logical structure of the lexicon. The Prague Bulletin of Mathematical Linguistics (87) (2007) 41–60
[12] Lopatková, M., Žabokrtský, Z., Kettnerová, V.: Valenční slovník českých sloves [Valency Dictionary of Czech Verbs]. (2008)
[13] Cohen, J.: A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20(1) (1960)