Making Sense of Microposts (#MSM2013) Concept Extraction Challenge

Amparo Elizabeth Cano Basave (a.cano_basave@aston.ac.uk), Andrea Varga (a.varga@dcs.shef.ac.uk), Matthew Rowe (m.rowe@lancaster.ac.uk), Milan Stankovic, Aba-Sah Dadzie (a.dadzie@cs.bham.ac.uk)

1 KMi, The Open University, Milton Keynes, UK
2 The OAK Group, Dept. of Computer Science, The University of Sheffield, UK
3 School of Computing and Communications, Lancaster University, UK
4 Sépage, Paris, France

A.E. Cano Basave has since changed affiliation, to: Engineering and Applied Science, Aston University, Birmingham, UK (e-mail as above).
A.-S. Dadzie has since changed affiliation, to: School of Computer Science, University of Birmingham, Edgbaston, Birmingham, UK (e-mail as above).

Abstract. Microposts are small fragments of social media content that have been published using a lightweight paradigm (e.g. Tweets, Facebook likes, foursquare check-ins). Microposts have been used for a variety of applications (e.g., sentiment analysis, opinion mining, trend analysis), by gleaning useful information, often using third-party concept extraction tools. There has been very large uptake of such tools in the last few years, along with the creation and adoption of new methods for concept extraction. However, the evaluation of such efforts has been largely consigned to document corpora (e.g. news articles), questioning the suitability of concept extraction tools and methods for Micropost data. This report describes the Making Sense of Microposts Workshop (#MSM2013) Concept Extraction Challenge, hosted in conjunction with the 2013 World Wide Web conference (WWW'13). The Challenge dataset comprised a manually annotated training corpus of Microposts and an unlabelled test corpus. Participants were set the task of engineering a concept extraction system for a defined set of concepts. Out of a total of 22 complete submissions 13 were accepted for presentation at the workshop; the submissions covered methods ranging from sequence mining algorithms for attribute extraction to part-of-speech tagging for Micropost cleaning and rule-based and discriminative models for token classification. In this report we describe the evaluation process and explain the performance of different approaches in different contexts.

1 Introduction

Since the first Making Sense of Microposts (#MSM) workshop at the Extended Semantic Web Conference in 2011 through to the most recent workshop in 2013 we have received over 60 submissions covering a wide range of topics related to interpreting Microposts and (re)using the knowledge content of Microposts.

One central theme that has run through such work has been the need to understand and learn from Microposts (social network-based posts that are small in size and published using minimal effort from a variety of applications and on different devices), so that such information, given its public availability and ease of retrieval, can be reused in different applications and contexts (e.g. music recommendation, social bots, news feeds). Such usage often requires identifying entities or concepts in Microposts, and extracting them accordingly. However this can be hindered by:

(i) the noisy lexical nature of Microposts, where terminology differs between users when referring to the same thing and abbreviations are commonplace;

(ii) the limited length of Microposts, which restricts the contextual information and cues that are available in normal document corpora.

The exponential increase in the rate of publication and availability of Microposts (Tweets, FourSquare check-ins, Facebook status updates, etc.), and of the applications used to generate them, has led to an increase in the use of third-party entity extraction APIs and tools. These function by taking as input a given text, identifying entities within it, and extracting entity type-value tuples. Rizzo & Troncy [12] evaluated the performance of entity extraction APIs over news corpora, assessing the performance of extraction and entity disambiguation. This work has been invaluable in providing a reference point for judging the performance of extraction APIs over well-structured news data. However, an assessment of the performance of extraction APIs over Microposts has yet to be performed.

This prompted the Concept Extraction Challenge held as part of the Making Sense of Microposts Workshop (#MSM2013) at the 2013 World Wide Web Conference (WWW'13). The rationale behind this was that such a challenge, in an open and competitive environment, would encourage and advance novel, improved approaches to extracting concepts from Microposts. This report describes the #MSM2013 Concept Extraction Challenge, the collaborative annotation of the corpus of Microposts and our evaluation of the performance of each submission. We also describe the approaches taken in the systems entered, which used both established and developing alternative approaches to concept extraction, how well they performed, and how system performance differed across concepts.

The resulting body of work has implications for researchers interested in the task of extracting information from social data, and for application designers and engineers who wish to harvest information from Microposts for their own applications.

The Task and Goal

The challenge required participants to build semi-automated systems to identify concepts within Microposts and extract matching entity types for each concept identified, where concepts are defined as abstract notions of things. In order to focus the challenge we restricted the classification to four entity types:

(i) Person PER, e.g. Obama;

(ii) Organisation ORG, e.g. NASA;

(iii) Location LOC, e.g. New York;

(iv) Miscellaneous MISC, consisting of the following: film/movie, entertainment award event, political event, programming language, sporting event and TV show.

Submissions were required to recognise these entity types within each Micropost, and extract the corresponding entity type-value tuples from the Micropost.

Consider an example taken from our annotated corpus: the Micropost reproduced at the end of this report, which begins "870,000 people in canada depend on #foodbanks". The fourth token in this Micropost refers to the location Canada; an entry to the challenge would be required to spot this token and extract it as an annotation, as: LOC/canada;
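The expected output format can be illustrated with a minimal sketch. It assumes a toy, single-token gazetteer lookup (the dictionary `TOY_GAZETTEER` and both function names are hypothetical, introduced only for illustration) and is not the method of any submitted system; it only shows how a spotted token becomes an entity type-value tuple and is serialised in the `TYPE/value;` form used above.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer (an assumption for illustration only):
# lowercase surface form -> entity type.
TOY_GAZETTEER: Dict[str, str] = {
    "canada": "LOC",
    "nasa": "ORG",
    "obama": "PER",
}


def extract_tuples(micropost: str) -> List[Tuple[str, str]]:
    """Return (entity type, entity value) tuples for single-token gazetteer hits."""
    tuples = []
    for token in micropost.split():
        # Strip trailing/leading punctuation and lowercase before lookup.
        value = token.strip(".,;:!?").lower()
        if value in TOY_GAZETTEER:
            tuples.append((TOY_GAZETTEER[value], value))
    return tuples


def format_annotations(tuples: List[Tuple[str, str]]) -> str:
    """Serialise tuples in the TYPE/value; form shown in the example."""
    return "".join(f"{etype}/{value};" for etype, value in tuples)


post = "870,000 people in canada depend on #foodbanks"
annotations = extract_tuples(post)
print(format_annotations(annotations))  # prints: LOC/canada;
```

A whitespace tokeniser and exact lookup are deliberately naive here; multi-token values such as "New York" would need phrase matching, which this sketch does not attempt.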

The complete description of concept types and their scope, and additional examples can be found on the challenge website 5, and also in the appendices in the challenge proceedings.

To encourage competitiveness we solicited sponsorship for the winning submission. This was provided by the online auctioning web site eBay 6, who offered a $1500 prize for the winning entry. This generous sponsorship is testimony to the growing industry interest in issues related to automatic understanding of short, predominantly textual posts (Microposts): challenges faced by major Social Web and other web sites, and increasingly, marketing and consumer analysts and customer support across industry, government, state and not-for-profit organisations around the world.

Data Collection and Annotation

The dataset consists of the message fields of each of 4341 manually annotated Microposts, on a variety of topics, including comments on the news and politics.

To assess the performance of the submissions we used an underlying ground truth, or gold standard. In the first instance, the dataset was annotated by two of the authors of this report. Subsequent to this we logged corrections to the annotations in the training data submitted by participants, following which we released an updated dataset. After this, based on a recommendation, we set up a GitHub repository to simplify collaborative annotation of the dataset. Four of the authors of this report then annotated a quarter of the dataset each, and then checked the annotations that the other three had performed to verify correctness.

For those entries for which consensus was not reached, discussion between all four annotators was used to come to a final conclusion. This process resulted in better quality and higher consensus in the annotations. A very small number of errors was reported subsequent to this; a final submission version with these corrections was used by participants for their last set of experiments and to submit their final results.

Challenge Submissions

Twenty-two complete submissions were received for the challenge, each of which consisted of a short paper explaining the system's approach, and up to three different test set annotations generated by running the system with different settings. After peer review, thirteen submissions were accepted; for each, the submission run with the best overall performance was taken as the result of the system, and used in the rankings. The accepted submissions are listed in Table 1, with the run taken as the result set for each.

Table 1. Automated approaches for #MSM2013 Concept Extraction. Columns correspond to the strategies employed by the participants (Strategy), the id of the systems (System), the data used to train the concept extractors (Train), state of the art features [6] (Token, Case, Morphology (Morph.), Part-of-speech (POS), Function, Local context, List lookup, Context window size (Context)), classifiers used for both entity extraction (Extraction) and classification, additional linguistic knowledge used for concept extraction (Linguistic Knowledge), preprocessing steps performed on the data (Prep.), additional features used for the extractors (Other Features), list of off-the-shelf systems employed (External Systems).

... for the final concept extractors, balancing in this way the contribution of existing extractors.

Among the rule-based approaches, the winning strategy was also similar.

3 Evaluation of Challenge Submissions

Evaluation Measures

The evaluation involved assessing the correctness of a system (S), in terms of the performance of the system's entity type classifiers when extracting entities from the test set (TS). For each instance in TS, a system must provide a set of tuples of the form: (entity type, entity value). The evaluation compared these output tuples against those in the gold standard (GS). The metrics used to evaluate these tuples were the standard precision (P), recall (R) and f-measure (F1), calculated for each entity type. The final result for each system was the average performance across the four defined entity types.

To assess the correctness of the tuples of an entity type t provided by a system S, we performed a strict match between the tuples submitted and those in the GS. We consider a strict match as one in which there is an exact match, with conversion to lowercase, between a system value and the GS value for a given entity type t. Let S_t denote the set of tuples (x, y) extracted for entity type t by system S, and GS_t the set of tuples (x, y) for entity type t in the gold standard. We define the sets of True Positives (TP), False Positives (FP) and False Negatives (FN) for a given system as:

\[ TP_t = \{(x, y) \mid (x, y) \in (S_t \cap GS_t)\} \qquad (1) \]

\[ FP_t = \{(x, y) \mid (x, y) \in S_t \wedge (x, y) \notin GS_t\} \qquad (2) \]

\[ FN_t = \{(x, y) \mid (x, y) \in GS_t \wedge (x, y) \notin S_t\} \qquad (3) \]

Therefore TP_t defines the set of true positives considering the entity type and value of tuples; FP_t is the set of false positives considering the unexpected results for an entity type t; FN_t is the set of false negatives denoting the entities that were missed by the extraction system, yet appear within the gold standard.
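The three sets map directly onto set operations. The following is a small sketch under stated assumptions: tuples are pooled into Python sets, the toy `gold` and `system` data are invented for illustration, and the helper names `normalise` and `confusion_sets` are hypothetical rather than part of the challenge's evaluation scripts.

```python
def normalise(tuples):
    """Lowercase entity values, mirroring the strict match with case conversion."""
    return {(etype, value.lower()) for etype, value in tuples}


def confusion_sets(system, gold, entity_type):
    """Return (TP_t, FP_t, FN_t) for one entity type t."""
    s_t = {t for t in normalise(system) if t[0] == entity_type}
    gs_t = {t for t in normalise(gold) if t[0] == entity_type}
    tp = s_t & gs_t   # in both the system output and the gold standard
    fp = s_t - gs_t   # returned by the system but absent from the gold standard
    fn = gs_t - s_t   # in the gold standard but missed by the system
    return tp, fp, fn


# Invented toy data: the system truncates "New York" and mistypes NASA as PER.
gold = [("LOC", "Canada"), ("LOC", "New York"), ("ORG", "NASA")]
system = [("LOC", "canada"), ("LOC", "York"), ("PER", "NASA")]

tp, fp, fn = confusion_sets(system, gold, "LOC")
print(tp)  # {('LOC', 'canada')}
print(fp)  # {('LOC', 'york')}
print(fn)  # {('LOC', 'new york')}
```

Note that the partial value "York" counts as both a false positive and, through the missed "New York", a false negative, and that the mistyped NASA tuple is a false positive for PER and a false negative for ORG: strict matching gives no credit for either near miss.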

17 The ACE Program: http://projects.ldc.upenn.edu/ace

As we require matching of the tuples (x, y), we are looking for strict extraction matches: a system must both detect the correct entity type (x) and extract the correct matching entity value (y) from a Micropost. From this set of definitions we define precision (P_t) and recall (R_t) for a given entity type t as follows:

\[ P_t = \frac{|TP_t|}{|TP_t \cup FP_t|} \qquad (4) \]

\[ R_t = \frac{|TP_t|}{|TP_t \cup FN_t|} \qquad (5) \]

As we compute the precision and recall on a per-entity-type basis, we define the average precision and recall of a given system S, and the harmonic mean, F1, between these measures:

\[ P = \frac{P_{PER} + P_{ORG} + P_{LOC} + P_{MISC}}{4} \qquad (6) \]

\[ R = \frac{R_{PER} + R_{ORG} + R_{LOC} + R_{MISC}}{4} \qquad (7) \]

\[ F_1 = 2 \times \frac{P \times R}{P + R} \qquad (8) \]
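Equations (4) to (8) can be sketched in a few lines. The code below is an illustration under stated assumptions, not the challenge's scoring script: tuples are pooled into sets rather than kept per Micropost, a score of 0.0 is returned when a denominator is empty (a convention the text does not specify), and the function names and toy data are hypothetical. It relies on the identities |TP_t ∪ FP_t| = |S_t| and |TP_t ∪ FN_t| = |GS_t|.

```python
ENTITY_TYPES = ("PER", "ORG", "LOC", "MISC")


def per_type_scores(system, gold, entity_type):
    """Precision and recall for one entity type under strict, lowercased matching."""
    s_t = {(t, v.lower()) for t, v in system if t == entity_type}
    gs_t = {(t, v.lower()) for t, v in gold if t == entity_type}
    tp = len(s_t & gs_t)
    # Assumed convention: an empty denominator yields 0.0.
    precision = tp / len(s_t) if s_t else 0.0   # Eq. (4)
    recall = tp / len(gs_t) if gs_t else 0.0    # Eq. (5)
    return precision, recall


def macro_scores(system, gold):
    """Average P and R over the four entity types, then combine them into F1."""
    scores = [per_type_scores(system, gold, t) for t in ENTITY_TYPES]
    p = sum(s[0] for s in scores) / len(ENTITY_TYPES)  # Eq. (6)
    r = sum(s[1] for s in scores) / len(ENTITY_TYPES)  # Eq. (7)
    f1 = 2 * p * r / (p + r) if p + r else 0.0         # Eq. (8)
    return p, r, f1


# Invented toy data for illustration.
gold = [("PER", "Obama"), ("ORG", "NASA"), ("LOC", "Canada"),
        ("LOC", "New York"), ("MISC", "Oscars")]
system = [("PER", "obama"), ("ORG", "nasa"), ("LOC", "canada"),
          ("MISC", "Grammys")]

p, r, f1 = macro_scores(system, gold)
print(p, r, round(f1, 3))  # 0.75 0.625 0.682
```

Because F1 is computed from the averaged P and R, as in Equation (8), it can differ from the mean of the per-type F1 scores; a single poorly handled type such as MISC lowers both averages regardless of how many tuples that type contributes.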

Evaluation Results and Discussion

We report the differences in performance between participants' systems, with a focus on the differences in performance by entity type. The following subsections report results of the evaluated systems in terms of precision, recall and F-measure, following the metrics defined in subsection 3.1.

Precision. We begin by discussing the performance of the submissions in terms of precision. Precision measures the accuracy, or 'purity', of the detected entities in terms of the proportion of false positives within the returned set: high precision equates to a low false positive rate. Table 2 shows that hybrid systems are the top 4 ranked systems (in descending order, 14, 21, 30, 15), suggesting that a combination of rules and data-driven approaches yields increased precision.

Studying the features of the top-performing systems, we note that maintaining capitalisation is correlated with high precision. There is, however, clear variance in other techniques used (classifiers, extraction methods, etc.) between the systems.

Fine-grained insight into the disparities between precision performance was obtained by inspecting the performance of the submissions across the different concept types (person, organisation, location, miscellaneous). Figure 3a presents the distribution of precision values across these four concept types and the macro average of these values. We find that systems do well (above the median of average precision values) for person and location concepts, and perform worse than the median for organisations and miscellaneous. For the entity type 'miscellaneous', this is not surprising as it features a fairly nuanced definition, including films and movies, entertainment award events, political events, programming languages, sporting events and TV shows. We also note that several submissions used gazetteers in their systems, many of which were for locations; this could have contributed to the higher precision values for location concepts.

Recall. Although precision affords insight into the accuracy of the entities identified across different concept types, it does not allow for inspecting the detection rate over all possible entities. To facilitate this we also report the recall scores of each submission, providing an assessment of the entity coverage of each approach. Table 3 presents the overall recall values for each system, both for each concept type and across all concept types. Once again, as with precision, we note that hybrid systems (21, 15, 14) appear at the top of the rankings, with a rule-based approach (20) and a data-driven approach (3) coming fourth and fifth respectively.

Looking at the distribution of recall scores across the submissions in Figure 3c we see a similar picture as before when inspecting the precision plots. For instance, for the person and location concepts we note that the submissions exceed the median of all concepts (when the macro-average of the recall scores is taken), while for organisation and miscellaneous lower values than the median are observed. This again comes back to the nuanced definition of the miscellaneous category, although the recall scores are higher on average than the precision scores. The availability of person name and place name gazetteers also benefits identification of the corresponding concept types. This suggests that additional effort is needed to improve the organisation concept extraction and to provide information to seed the detection process, for instance through the provision of organisation name gazetteers. Interestingly, when we look at the best performing system in terms of recall over the organisation concept we find that submission 14 uses a variety of third party lookup lists (Yago, Microsoft n-grams and WordNet), suggesting that this approach leads to increased coverage and accuracy when extracting organisation names.

Fig. 3. Distributions of performance scores for all submissions; dashed line is the mean. Panels: (a) concept type Precision; (b) probability densities of concept type Precision; (c) concept type Recall; (d) probability densities of concept type Recall; (e) concept type F1; (f) probability densities of concept type F1. Density panels are shown per concept type (PER, ORG, LOC, MISC) and overall (ALL), with the score on the x-axis (0.0 to 1.0) and p(x) on the y-axis.

F-Measure (F1). By combining the precision and recall scores together for the individual systems using the f-measure (F1) score we are provided with an overall assessment of concept extraction performance. Table 4 presents the f-measure (F1) score for each submission and performance across the four concept types. We note that, as previously, hybrid systems do best overall (top-3 places), indicating that a combination of rules and data-driven approaches yields the best results.

Submission 14 records the highest overall F1 score, and also the highest scores for the person and organisation concept types; submission 15 records the highest F1 score for the location concept type; while submission 21 yields the highest F1 score for the miscellaneous concept type. Submission 15 uses Google Gazetteers together with part-of-speech tagging of noun and verb phrases, suggesting that this combination yields promising results for our nuanced miscellaneous concept type.

Figure 3e shows the distribution of F1 scores across the concept types for each submission. We find, as before, that the systems do well for person and location and poorly for organisation and miscellaneous. The reasons behind the reduced performance for these latter two concept types are, as mentioned, attributable to the availability of organisation information in third party lookup lists.

Conclusions

The aim of the MSM Concept Extraction Challenge was to foster an open initiative for extracting concepts from Microposts. Our motivation for hosting the challenge was born of the increased availability of third party extraction tools, and their widespread uptake, but the lack of an agreed formal evaluation of their accuracy when applied over Microposts, together with limited understanding of how performance differs between concept types. The challenge's task involved the identification of entity types and value tuples from a collection of Microposts. To our knowledge the entity annotation set of Microposts generated as a result of the challenge, and thanks to the collaboration of all the participants, is the largest annotation set of its type openly available online. We hope that this will provide the basis for future efforts in this field and lead to a standardised evaluation effort for concept extraction from Microposts.

The results from the challenge indicate that the systems that performed well: (i) used a hybrid approach, consisting of data-driven and rule-based techniques; and (ii) exploited available lookup lists, such as place name and person name gazetteers, and linked data resources.

Our future efforts in the area of concept extraction from Microposts will feature additional hosted challenges, with more complex tasks, aiming to identify the differences in performance between disparate systems and their approaches, and inform users of extraction tools on the suitability of different applications for different tasks and contexts.

870,000 people in canada depend on #foodbanks - 25% increase in the last 2 years - please give generously

collected from the end of 2010 to the beginning of 2011, with a 60% / 40% split between training and test data. The annotation of each Micropost in the training dataset gave all participants a common base from which to learn extraction patterns. The test dataset contained no annotations; the challenge task was for participants to provide these. The complete dataset, including a list of changes and the gold standard, is available on the #MSM2013 challenge web pages 7, accessible under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Figure 1 presents the entity type distributions over the training set, test set and over the entire corpus.

Fig. 1. Distributions of entity types in the dataset

Fig. 2.

The success of these models appears to rely on the application of off-the-shelf systems (e.g. AIDA [15], ANNIE [1], OpenNLP 8, Illinois NET [9], Illinois Wikifier [10], LingPipe 9, OpenCalais 10, StanfordNER [2], WikiMiner 11, NERD [12], TWNer [11], Alchemy 12, DBpedia Spotlight [5] 13, Zemanta 14) for either entity extraction (identifying the boundaries of an entity) or classification (assigning a semantic type to an entity). For the best performing system (14), the complete concept classification component was executed by the (existing) concept disambiguation tool AIDA. Other systems (21, 15, 25), on the other hand, made use of the output of multiple off-the-shelf systems, resulting in additional features (such as the confidence scores of each individual NER extractor, ConfScores)

13. S. Sarawagi. Information extraction. Foundations and Trends in Databases, 1:261-377, 2008.
14. E. F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 142-147. Association for Computational Linguistics, 2003.
15. M. A. Yosef, J. Hoffart, I. Bordino, M. Spaniol, and G. Weikum. Aida: An online tool for accurate disambiguation of named entities in text and tables. PVLDB, 2011.

Table 1. Submissions accepted, in order of submission, with authors and number of runs for each

Submission No.   Authors                      No. of runs
submission03     van Den Bosch, M. et al.     3
submission14     Habib, M. et al.             1
submission15     Van Erp, M. et al.           3
submission20     Cortis, K.                   1
submission21     Dlugolinský, Š. et al.       3
submission25     Godin, F. et al.             1
submission28     Genc, Y. et al.              1
submission29     Muñoz-García, O. et al.      1
submission32     Hossein, A.                  1
submission30     Mendes, P. et al.            3
submission33     Das, A. et al.               3
submission34     de Oliveira, D. et al.       1
submission35     Sachidanandan, S. et al.     1

2.4 System Descriptions

Participants approached the concept extraction task with rule-based, machine learning and hybrid methods. A summary of each approach can be found in Figure 2, with detail in the author descriptions that follow this report. We compared these approaches according to various dimensions: state of the art (SoA) named entity recognition (NER) features employed (columns 4-11) ([13,6]), classifiers used for both extraction and classification of entities (columns 12-13), additional linguistic knowledge sources used (column 14), special pre-processing steps performed (column 15), other non-SoA NER features used (column 16), and finally, the list of off-the-shelf systems incorporated (column 17).

From the results and participants' experiments we make a number of observations. With regard to the strategy employed, the best performing systems (from the top, 14, 21, 15, 25), based on overall F1 score (see Section 3), were hybrid.
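To make the pre-processing dimension more concrete, the following is a minimal sketch of the kind of Micropost cleaning that the labels in the pre-processing column of Figure 2 (RT, @, URL, #, LowerCase) suggest. It is not any participant's implementation: the regular expressions, the function name `clean_micropost`, the choice to delete rather than normalise each marker, and the sample input are all assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # URL: web links
RT_RE = re.compile(r"\bRT\b:?\s*")     # RT: retweet marker
MENTION_RE = re.compile(r"@\w+:?")     # @: user mentions
HASHTAG_RE = re.compile(r"#(\w+)")     # #: hashtags (the word itself is kept)


def clean_micropost(text: str, lowercase: bool = False) -> str:
    """Strip retweet markers, mentions and URLs; unwrap hashtags."""
    text = URL_RE.sub(" ", text)
    text = RT_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)
    if lowercase:
        text = text.lower()
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


# Made-up input for illustration only.
cleaned = clean_micropost("RT @user: Floods hit #Brisbane today http://example.org/x")
```

Lowercasing is left optional in this sketch because capitalisation is itself listed among the features used by several systems (IsCap, AllCap), so discarding it up front would remove that signal.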

State-of-the-art features; Classifier used

External Systems: ANNIE [1] | Freeling [8] | NLTK [4] | BabelNet API [7] | AIDA [15] | ANNIE [1], OpenNLP, Illinois NET [9], Illinois Wikifier [10], LingPipe, OpenCalais, StanfordNER [2], WikiMiner | StanfordNER [2], NERD [12], TWNer [11] | Alchemy, DBpedia Spotlight, OpenCalais, Zemanta | DBpedia Spotlight
Other Features: AIDA Scores | ConfScores
Prep.: RT, @, #, Slang, MissSpell | RT, @, URL, #, Punct, MissSpell, LowerCase | Punct, Capitalise | LowerCase | URL, #, @, Punct | #, Slang, MissSpell | RT, #, @, URL
Linguistic Knowledge: DBpedia | DBpedia | Wiki | DBpedia, BabelNet | Samsad & NICTA dictionary | Yago, Microsoft N-grams, WordNet | DBpedia | Yago, Wiki, WordNet
Extraction / Classification: ANNIE | Rules Rules | Rules Rules | BabelNet Rules WSD | 2 IGTree memory-based taggers | CRF | CRF | CRF+ AIDA SVM RBF | C4.5 decision tree | SVM SMO | Random Forest | DBpedia CRF Spotlight | Pagerank CRF
Context: size of TW 2
List lookup: DBpedia Gazetteer, ANNIE Gazetteer | Wiki Gazetteer, IsStopWord | Wiki Gazetteer | Geonames.org Gazetteer; JRC names corpus | Country names Gazetteer, City names Gazetteer, IsOOV | Wiki Gazetteer, Freebase Gazetteer | Yago, Microsoft N-grams, WordNet, TW | Google Gazetteer | DBpedia Gazetteer, BALIE Gazetteer | Yago, Wiki, WordNet
Local syntax: FollowsFW | First Word, Last Word
Function: TokenLength | TokenLength
POS: ANNIE Pos | PosFreeling2012, isNP | NLTKPos | PosTreebank | TwPos2011 | TwPos2011 | isNP, isVP | TwPos2011
Morp.: Prefix, Suffix | Prefix, Suffix
Case: IsCap | IsCap | IsCap | IsCap, AllCap | IsCap, AllCap, LowerCase | IsCap, AllCap | IsCap, AllCap, LowerCase
Token: Ngram | Ngram | Stem | Ngram | Ngram
Train: TW | TW | TW | TW | TW, CoNLL03, ACE04, ACE05 | TW | TW | TW | TW | TW | TW | TW | TW
System: 20 | 29 | 28 | 32 | 3 | 33 | 34 | 14 | 21 | 15 | 25 | 30 | 35
Strategy: Rule-based | Data-driven | Hybrid
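Several of the state-of-the-art feature names listed above (IsCap, AllCap, LowerCase, Prefix, Suffix, TokenLength, First Word, Last Word) are simple per-token properties. The sketch below shows one plausible way to compute them for a tokenised Micropost. The exact definitions used by each participant are not given in this report, so the function name, the affix length of 3 and the reading of each feature name are assumptions.

```python
def token_features(tokens, index, affix_len=3):
    """Return a dictionary of surface features for tokens[index]."""
    token = tokens[index]
    return {
        "IsCap": token[:1].isupper(),           # first character is upper case
        "AllCap": token.isupper(),              # every cased character is upper case
        "LowerCase": token.islower(),           # every cased character is lower case
        "Prefix": token[:affix_len],            # leading characters (assumed length 3)
        "Suffix": token[-affix_len:],           # trailing characters (assumed length 3)
        "TokenLength": len(token),
        "FirstWord": index == 0,                # token opens the Micropost
        "LastWord": index == len(tokens) - 1,   # token closes the Micropost
    }


# Made-up tokens for illustration only.
tokens = ["Obama", "visits", "NYC"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Dictionaries of this shape can be passed to most sequence labelling or classification toolkits after vectorisation, which is one reason such features are cheap to combine with gazetteer lookups and the output of off-the-shelf extractors.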

Table 2. Precision scores for each submission over the different concept types

Rank  Entry  PER    ORG    LOC    MISC   ALL

Table 3. Recall scores for each submission over the different concept types

Rank  Entry  PER    ORG    LOC    MISC   ALL
1     21-3   0.938  0.614  0.613  0.287  0.613
2     15-3   0.952  0.485  0.739  0.269  0.611
3     14-1   0.908  0.611  0.620  0.277  0.604
4     20-1   0.859  0.587  0.517  0.418  0.595
5     03-3   0.926  0.463  0.682  0.122  0.548
6     25-1   0.887  0.405  0.685  0.205  0.546
7     28-1   0.864  0.290  0.692  0.155  0.500
8     29-1   0.736  0.489  0.444  0.263  0.483
9     32-1   0.741  0.289  0.506  0.391  0.482
10    35-1   0.920  0.346  0.506  0.102  0.468
11    33-3   0.877  0.248  0.518  0.077  0.430
12    34-1   0.787  0.283  0.439  0.098  0.402
13    30-1   0.615  0.268  0.444  0.204  0.383

Table 4. F1 scores achieved by each submission for each and across all concept types

Rank  Entry  PER    ORG    LOC    MISC   ALL
1     14-1   0.920  0.640  0.738  0.383  0.670
2     21-3   0.910  0.609  0.721  0.410  0.662
3     15-3   0.918  0.568  0.790  0.356  0.658
4     20-1   0.833  0.611  0.618  0.377  0.610
5     25-1   0.828  0.486  0.744  0.298  0.589
6     03-3   0.870  0.556  0.738  0.191  0.589
7     29-1   0.762  0.537  0.587  0.356  0.561
8     28-1   0.815  0.405  0.705  0.236  0.540
9     32-1   0.727  0.347  0.587  0.410  0.518
10    30-1   0.708  0.379  0.578  0.313  0.494
11    33-3   0.846  0.367  0.616  0.137  0.491
12    35-1   0.823  0.419  0.597  0.117  0.489
13    34-1   0.542  0.372  0.525  0.155  0.399

http://oak.dcs.shef.ac.uk/msm2013/challenge.html
http://www.ebay.com
http://opennlp.apache.org
http://alias-i.com/lingpipe
http://www.opencalais.com
http://wikipedia-miner.cms.waikato.ac.nz
http://www.alchemyapi.com
http://dbpedia.org/spotlight
http://www.zemanta.com
Another off-the-shelf entity extractor employed was BabelNet API [7], in submission 32.
http://www.noslang.com/dictionary/full
http://www.chatslang.com/terms/twitter
http://www.chatslang.com/terms/facebook

Acknowledgments

We thank the participants who helped us improve the gold standard used for the challenge. We also thank eBay for supporting the challenge by sponsoring the prize for the winning submission.

A.E. Cano is funded by the EPSRC project violenceDet (grant no. EP/J020427/1). A.-S. Dadzie was funded by the MRC project Time to Change (grant no. 129941).

References