<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Entity Extraction and Consolidation for Social Web Content Preservation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Stefan</forename><surname>Dietze</surname></persName>
							<email>dietze@l3s.de</email>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<orgName type="institution">Leibniz University</orgName>
								<address>
									<settlement>Hannover</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Diana</forename><surname>Maynard</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<settlement>Sheffield</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Elena</forename><surname>Demidova</surname></persName>
							<email>demidova@l3s.de</email>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<orgName type="institution">Leibniz University</orgName>
								<address>
									<settlement>Hannover</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Risse</surname></persName>
							<email>risse@l3s.de</email>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<orgName type="institution">Leibniz University</orgName>
								<address>
									<settlement>Hannover</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wim</forename><surname>Peters</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<settlement>Sheffield</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Katerina</forename><surname>Doka</surname></persName>
							<email>katerina@cslab.ece.ntua.gr</email>
							<affiliation key="aff2">
								<orgName type="department">IMIS</orgName>
								<orgName type="institution">RC ATHENA</orgName>
								<address>
									<addrLine>Artemidos 6</addrLine>
									<postCode>15125</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yannis</forename><surname>Stavrakas</surname></persName>
							<email>yannis@imis.athenainnovation.gr</email>
							<affiliation key="aff2">
								<orgName type="department">IMIS</orgName>
								<orgName type="institution">RC ATHENA</orgName>
								<address>
									<addrLine>Artemidos 6</addrLine>
									<postCode>15125</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Entity Extraction and Consolidation for Social Web Content Preservation</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">3000E2C07B93EE2D99A8E458B6DD457B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T01:06+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Knowledge Extraction</term>
					<term>Linked Data</term>
					<term>Data Consolidation</term>
					<term>Data Enrichment</term>
					<term>Web Archiving</term>
					<term>Entity Recognition</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With the rapidly increasing pace at which Web content is evolving, particularly social media, preserving the Web and its evolution over time becomes an important challenge. Meaningful analysis of Web content lends itself to an entity-centric view that organises Web resources according to the information objects related to them. The crucial challenge, therefore, is to extract, detect and correlate entities from a vast number of heterogeneous Web resources where the nature and quality of the content may vary heavily. While a wealth of information extraction tools aid this process, we believe that the consolidation of automatically extracted data has to be treated as an equally important step in order to ensure the high quality and non-ambiguity of generated data. In this paper we present an approach based on an iterative cycle exploiting Web data for (1) targeted archiving/crawling of Web objects, (2) entity extraction and detection, and (3) entity correlation. The long-term goal is to preserve Web content over time and allow its navigation and analysis based on well-formed structured RDF data about entities.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Given the ever-increasing pace at which Web content is evolving, adequate Web archiving and preservation have become a cultural necessity. Along with "common" challenges of digital preservation, such as media decay, technological obsolescence, and authenticity and integrity issues, Web preservation has to deal with the sheer size and ever-increasing growth rate of Web content. This applies in particular to user-generated content and social media, which are characterised by a high degree of diversity, heavily varying quality and heterogeneity. Instead of following a collect-all strategy, archival organisations are striving to build focused archives that revolve around a particular topic and reflect the diversity of information people are interested in. Focused archives thus largely revolve around the entities which define a topic or area of interest, such as persons, organisations and locations. Hence, the extraction of entities from archived Web content, in particular social media, is a crucial challenge in enabling semantic search and navigation in Web archives and the relevance assessment of a given set of Web objects for a particular focused crawl. However, while tools are available for information extraction from more formal text, social media poses particular challenges to knowledge acquisition; these are detailed in Section 3. This calls for a range of specific strategies and techniques to consolidate, enrich, disambiguate and interlink extracted data. Such strategies benefit in particular from existing knowledge, such as Linked Open Data <ref type="bibr" target="#b0">[1]</ref>, which can be used to disambiguate and remedy degraded information. 
While data consolidation techniques have traditionally existed independently of named entity recognition (NER) technologies, their coherent integration into unified workflows is of crucial importance for improving the wealth of automatically extracted data on the Web. This becomes even more pressing with the emergence of an increasing variety of publicly available and end-user-friendly knowledge extraction and NER tools such as DBpedia Spotlight<ref type="foot" target="#foot_1">1</ref>, GATE<ref type="foot" target="#foot_2">2</ref>, Open Calais<ref type="foot" target="#foot_3">3</ref> and Zemanta<ref type="foot" target="#foot_4">4</ref>.</p><p>In this paper, we introduce an integrated approach to extracting and consolidating structured knowledge about entities from archived Web content. This knowledge will in the future be used to facilitate semantic search of Web archives and to further guide the crawl. This work was developed in the EC-funded Integrating Project ARCOMEM<ref type="foot" target="#foot_5">5</ref>. Note that while temporal aspects related to term and knowledge evolution are essential to Web preservation, they are currently under investigation <ref type="bibr" target="#b23">[24]</ref> and out of scope for this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Entity recognition is one of the major tasks within information extraction and may encompass both NER and term extraction. Entity recognition may involve rule-based systems <ref type="bibr" target="#b12">[13]</ref> or machine learning techniques <ref type="bibr" target="#b13">[14]</ref>. Term extraction involves the identification and filtering of term candidates for the purpose of identifying domain-relevant terms or entities. The main aim in automatic term recognition is to determine whether a word or a sequence of words is a term that characterises the target domain.</p><p>Most term extraction methods use a combination of linguistic filtering (e.g. possible sequences of part-of-speech tags) and statistical measures (e.g. tf.idf) <ref type="bibr" target="#b14">[15]</ref> <ref type="bibr" target="#b15">[16]</ref> to determine the salience of each term candidate for each document in the corpus <ref type="bibr" target="#b22">[23]</ref>. Data consolidation has to cover a variety of areas such as enrichment, entity/identity resolution for disambiguation, as well as clustering and correlation to consolidate disparate data. In addition, link prediction and discovery is of crucial importance to enable clustering and correlation of enriched data sources. A variety of methods for entity resolution have been proposed, using relationships among entities <ref type="bibr" target="#b6">[7]</ref>, string similarity metrics <ref type="bibr" target="#b5">[6]</ref>, as well as transformations <ref type="bibr" target="#b8">[9]</ref>. An overview of the most important works in this area can be found in <ref type="bibr" target="#b7">[8]</ref>. As opposed to the entity correlation techniques exploited in this paper, text clustering of documents exploits feature vectors to represent documents according to the terms they contain <ref type="bibr" target="#b9">[10]</ref>[11] <ref type="bibr" target="#b11">[12]</ref>. 
Clustering algorithms measure the similarity across the documents and assign the documents to the appropriate clusters based on this similarity. Similarly, vector-based approaches have been used to map distinct ontologies and datasets <ref type="bibr" target="#b1">[2]</ref> <ref type="bibr" target="#b2">[3]</ref>. As opposed to text clustering, entity correlation and clustering take advantage of background knowledge from related datasets to correlate previously extracted entities. Link discovery is therefore another crucial area to be considered; graph summarisation, for instance, predicts links in annotated RDF graphs. Detailed surveys of link prediction techniques in complex networks and in social networks are presented by <ref type="bibr" target="#b3">[4]</ref> and <ref type="bibr" target="#b4">[5]</ref>, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Challenges and overall approach</head><p>ARCOMEM follows a use case-driven approach based on scenarios aimed at creating focused Web archives. We deploy a document repository of crawled Web content and a structured RDF knowledge base containing metadata about entities detected in the archived content. Archivists can specify or modify crawl specifications (fundamentally consisting of selected sets of relevant entities and topics). The intelligent crawler will be able to learn about crawl intentions and to refine a crawling strategy on-the-fly. This is especially important for long-running crawls with broader topics, such as the financial crisis or elections, where entities change more frequently and hence require regular adaptation of the crawl specification. End-user applications allow users to search and browse the archives by exploiting automatically extracted metadata about entities and topics. Fundamental to both crawl strategy refinement and Web archive navigation is the efficient extraction of entities from archived Web content. In particular, social media poses a number of challenges for language analysis tools due to the degraded nature of the text, especially where tweets are concerned. In one study, the Stanford NER tagger dropped from 90.8% F1 to 45.88% when applied to a corpus of tweets <ref type="bibr" target="#b16">[17]</ref>. <ref type="bibr" target="#b18">[19]</ref> also demonstrate some of the difficulties in applying traditional POS tagging, chunking and NER techniques to tweets, while language identification tools typically also do not work well on short sentences. Problems are caused by incorrect spelling and grammar, made-up words, hashtags, @ signs and emoticons, and unorthodox capitalisation and spelling (e.g. duplication of letters in words for emphasis, text speak). 
Since tokenisation, POS tagging and matching against pre-defined gazetteer lists are key to NER, it is important to resolve these problems: we adopt methods such as adapting tokenisers, applying techniques from SMS normalisation, retraining language identifiers, matching case-insensitively where appropriate, using shallow techniques rather than full parsing, and allowing more flexible forms of matching. Entity extraction and enrichment is covered by a set of dedicated components which have been incorporated into a single processing chain (Figure <ref type="figure">1</ref>) that handles NER and consolidation (enrichment, clustering, disambiguation) as part of one coherent workflow.</p></div>
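The normalisation strategies above can be sketched as follows. The rules and gazetteer here are illustrative simplifications (collapsing letter repetitions, stripping hashtag/@ markers, case-insensitive lookup), not the actual GATE pipeline:

```python
import re

# Hypothetical normalisation rules in the spirit of SMS normalisation;
# not the actual ARCOMEM/GATE implementation.
def normalise_token(token: str) -> str:
    token = token.lstrip("#@")                    # drop hashtag/@ markers
    token = re.sub(r"(.)\1{2,}", r"\1\1", token)  # "sooooo" -> "soo"
    return token.lower()

def gazetteer_match(token: str, gazetteer: set) -> bool:
    # case-insensitive matching against a pre-normalised gazetteer
    return normalise_token(token) in gazetteer

gazetteer = {"ireland", "dublin", "ecb"}
matched = gazetteer_match("#Ireland", gazetteer)  # True
```

Case-insensitive matching of this kind trades some precision (e.g. "apple" vs. "Apple") for recall on unorthodoxly capitalised social media text, which is why it is applied only in certain cases.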
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fig. 1. Entity extraction and consolidation processing chain</head><p>The ARCOMEM storage, composed of the object store and the knowledge base, handles (a) binary data, in the form of Web objects, which represent the original content collected by the crawler; and (b) semi-structured data, in the form of RDF<ref type="foot" target="#foot_6">6</ref> triples (Web object annotations). Storage is based on a distributed solution that combines the MapReduce <ref type="bibr" target="#b8">[9]</ref> paradigm and NoSQL databases and is realised on top of HBase<ref type="foot" target="#foot_7">7</ref> (see also <ref type="bibr" target="#b24">[25]</ref>). The ARCOMEM data model<ref type="foot" target="#foot_8">8</ref> provides an RDF schema to reflect the informational needs for knowledge capturing, crawling, and preservation (see <ref type="bibr" target="#b19">[20]</ref> for details).</p><p>Within the ARCOMEM model, "entity" encompasses both traditional Named Entities and single- and multi-word terms: the recognition of both is done using GATE tools. GATE has been chosen over other NLP tools primarily for its coverage, extensibility and flexibility: it has a wide range of NLP components, which are easily modifiable for the demands of the project, unlike tools such as OpenCalais and DBpedia Spotlight, which are more limited in scope. While extracted data is already classified and labelled as a result of the extraction process, it is nevertheless (i) heterogeneous, i.e. not well interlinked, (ii) ambiguous, and (iii) provides only very limited information. This is because data is extracted by different components and during independent processing cycles, since the tools in GATE cannot perform co-reference resolution on entities generated asynchronously across multiple documents. 
For instance, during one particular cycle, the text analysis component might detect an entity from the term "Ireland", while during later cycles, entities based on the term "Republic of Ireland" or the German term "Irland" might be extracted, together with the entity "Dublin". These would all be classified as entities of type Location and correctly stored in the data store as disparate entities described according to the data model. Thus, Enrichment and Consolidation (Fig. <ref type="figure">1</ref>) follows three aims: (a) enriching existing entities with related publicly available knowledge; (b) disambiguating them; and (c) identifying data correlations such as the ones illustrated above. This is achieved by mapping isolated entities to concepts (nodes) within reference datasets (enrichment) and exploiting the corresponding graphs to discover correlations. To this end, we exploit publicly available data from the Linked Open Data cloud, which offers a vast amount of data of both domain-specific and domain-independent nature (the current release consists of 31 billion distinct triples, i.e. RDF statements<ref type="foot" target="#foot_9">9</ref>).</p></div>
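The consolidation of such variant labels can be illustrated with a minimal sketch, where a hand-coded lookup table stands in for the actual enrichment services (in practice, the URIs would come from DBpedia/Freebase lookups):

```python
from collections import defaultdict

# Toy lookup standing in for an enrichment service; illustrative only.
ENRICHMENT = {
    "Ireland": "http://dbpedia.org/resource/Republic_of_Ireland",
    "Republic of Ireland": "http://dbpedia.org/resource/Republic_of_Ireland",
    "Irland": "http://dbpedia.org/resource/Republic_of_Ireland",  # German label
    "Dublin": "http://dbpedia.org/resource/Dublin",
}

def correlate(labels):
    """Group entity labels that resolve to the same enrichment URI."""
    clusters = defaultdict(list)
    for label in labels:
        uri = ENRICHMENT.get(label)
        if uri:
            clusters[uri].append(label)
    return dict(clusters)

clusters = correlate(["Ireland", "Republic of Ireland", "Irland", "Dublin"])
```

Here the three variant labels for Ireland end up in one cluster keyed by their shared reference concept, while "Dublin" remains a separate but related entity.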
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Implementation</head><p>For entity recognition, we use a modified version of ANNIE <ref type="bibr" target="#b17">[18]</ref> to find mentions of Person, Location, Organization, Date, Time, Money and Percent. We included extra subtypes of Organization such as Band and Political Party, and have made various modifications to deal with the problems specific to social media such as incorrect English (see <ref type="bibr" target="#b20">[21]</ref> for more details). The entity extraction framework can be divided into the following components (GATE component in Fig. <ref type="figure">1</ref>), which are executed sequentially over a corpus of documents: Document Pre-processing (document format analysis, content detection); Linguistic Pre-processing (language detection, tokenisation, POS tagging, etc.); and Named Entity Extraction, comprising Term Extraction (generation of a ranked list of terms and thresholding) &amp; NER (gazetteers, rule-based grammars and co-reference).</p><p>For term extraction, we use an adapted version of TermRaider <ref type="foot" target="#foot_10">10</ref>. This considers noun phrases (NPs) as candidate terms (as determined by linguistic pre-processing), and ranks them in order of termhood according to three different scoring functions: (1) basic tf.idf; (2) an augmented tf.idf, which also takes into account the tf.idf score of any hyponyms of a candidate term; and (3) the Kyoto score based on <ref type="bibr" target="#b21">[22]</ref>, which takes into account the number of hyponyms of a candidate term occurring in the document. All scores are normalised to a value between 0 and 100. To avoid duplication, a candidate term is not considered an entity if it matches or is contained within an existing Named Entity. Also, we have set a threshold score above which we consider a candidate term to be valid. 
This threshold is a parameter which can be manually changed at any time; currently it is set to an augmented score of 45, i.e. only terms with a score of 45 or greater will be used by later processes.</p><p>The entity extraction generates RDF data describing NEs and terms according to the ARCOMEM data model, which is pushed to our knowledge base and directly processed by our Enrichment &amp; Consolidation component (Fig. <ref type="figure">1</ref>). The latter exploits (a) the entity label and (b) the entity type to expand, disambiguate and correlate extracted data. Note that an entity/event label might correspond directly to a label of one unique node in a structured dataset (as is likely for an entity of type Person labelled "Angela Merkel"), but might also correspond to more than one node/concept, as is the case for most of the events in our dataset. For instance, the event labelled "Jean Claude Trichet gives keynote at ECB summit" will most likely be enriched with links to concepts representing the ECB as well as Jean Claude Trichet. Our approach is based on the following steps (reflected in Fig. <ref type="figure">1</ref>):</p></div>
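The basic tf.idf scoring with 0-100 normalisation and thresholding can be sketched as follows; TermRaider's exact formulas (and the hyponym-based augmented and Kyoto variants) are more involved, so this is illustrative only:

```python
import math

# Minimal sketch of tf.idf term scoring normalised to [0, 100];
# not TermRaider's actual formula.
def tfidf_scores(docs):
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    raw = {}
    for doc in docs:
        for term in set(doc):
            score = doc.count(term) * math.log(n / df[term] + 1)
            raw[term] = max(raw.get(term, 0.0), score)  # best per-document score
    top = max(raw.values())
    return {term: 100 * s / top for term, s in raw.items()}  # normalise to 0..100

docs = [["debt", "crisis", "debt"], ["crisis", "summit"], ["keynote"]]
scores = tfidf_scores(docs)
valid = {t for t, s in scores.items() if s >= 45}  # threshold, as in the text
```

Terms appearing in many documents ("crisis" above) score low and fall below the threshold, while rarer, more document-specific terms survive as candidates.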
<div xmlns="http://www.tei-c.org/ns/1.0"><head>S1. Entity enrichment</head><p>S1.a. Translation: we determine the language of the entity label and, if necessary, translate it into English using an online translation service. S1.b. Enrichment: co-referencing with related entities in reference datasets.</p></div>
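The enrichment lookup in S1.b can be sketched as a request against the DBpedia Spotlight annotation service. The endpoint URL below is the public demo service and may differ for a local deployment; the request is built but not sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public Spotlight demo endpoint (assumption; a deployment may differ).
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def spotlight_request(text: str, confidence: float = 0.6) -> Request:
    """Build (but do not send) a Spotlight annotation request for a label."""
    params = urlencode({"text": text, "confidence": confidence})
    return Request(SPOTLIGHT_URL + "?" + params,
                   headers={"Accept": "application/json"})

req = spotlight_request("Jean Claude Trichet gives keynote at ECB summit")
```

Sending the request (e.g. with `urllib.request.urlopen`) returns JSON annotations linking text spans to DBpedia resources, filtered by the confidence parameter.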
<div xmlns="http://www.tei-c.org/ns/1.0"><head>S2. Entity correlation and clustering</head><p>In order to obtain enrichments for these entities we perform queries on external knowledge bases. Our current enrichment approach uses DBpedia<ref type="foot" target="#foot_11">11</ref> and Freebase<ref type="foot" target="#foot_12">12</ref> as reference datasets, though we envisage expanding this approach with additional and more domain-specific datasets, e.g. event-specific ones. DBpedia and Freebase are particularly well suited due to their vast size, the availability of disambiguation techniques which can utilise the variety of multilingual labels available in both datasets for individual data items, and the level of inter-connectedness of both datasets, allowing the retrieval of a wealth of related information for particular items. In the case of DBpedia, we make use of the DBpedia Spotlight service, which enables approximate string matching with an adjustable confidence level in the interval [0,1]. As part of our evaluation (Section 5), we experimentally selected a confidence level of 0.6, which provided the best balance of precision and recall. Note that Spotlight offers NER capabilities complementary to GATE. However, these were utilised only in cases where entities/events were not in atomic form, as is often the case for events, which mostly consist of free-text descriptions such as the one mentioned above.</p><p>Freebase contains about 22 million entities and more than 350 million facts in about 100 domains. Keyword queries over Freebase are particularly ambiguous due to the size and the structure of the dataset. In order to reduce query ambiguity, we used the Freebase API and restricted the types of the entities to be matched using a manually defined type mapping from ARCOMEM to Freebase entity types. 
For example, we mapped the ARCOMEM type "person" to the Freebase type "people/person", and the ARCOMEM type "location" to the Freebase types "location/continent", "location/location" and "location/country". For instance, an ARCOMEM entity of type "Person" with the label "Angela Merkel" is mapped to the Freebase MQL query that retrieves one unique Freebase entity with the mid "/m/0jl0g". With respect to data correlation, we distinguish between direct and indirect correlations. Note that a correlation does not describe any notion of equivalence (e.g. similar to owl:sameAs) but merely a meaningful level of relatedness.</p><p>Fig. <ref type="figure" target="#fig_0">2</ref> depicts both direct and indirect correlations. Direct correlations are identified by means of equivalent and shared enrichments, i.e., any entities/events sharing the same enrichments are assumed to be correlated and are hence clustered. A direct correlation is visible between the entity of type Person labelled "Jean Claude Trichet" and the event "Trichet warns of systemic debt crisis". In addition, the retrieved enrichments associate the ARCOMEM entities and associated Web objects with the knowledge, i.e., the data graph, available in the associated reference datasets. For instance, the DBpedia resource of the European Central Bank (http://DBpedia.org/resource/ECB) provides additional facts (e.g., a classification as organisation, its members, or previous presidents) in a structured, and therefore machine-processable, form. Exploiting the graphs of the underlying reference datasets allows us to identify additional, indirect correlations. While linguistic/syntactic approaches would fail to detect a relationship between the two enrichments above (Trichet, ECB) and hence their corresponding entities and Web objects, by analysing the DBpedia graph we are able to uncover a close relationship between the two (Trichet being the former ECB president). 
Hence, computing the relatedness of enrichments would allow us to detect indirect correlations and to create a relationship (dashed line) between highly related entities/events, beyond mere equivalence.</p><p>Our current implementation is limited to detecting direct correlations, while ongoing experiments based on graph analysis mechanisms aim to automatically measure the semantic relatedness of entities in reference datasets in order to detect indirect relations. While in a large graph all nodes are connected with each other in some way, a key research challenge is the investigation of appropriate graph navigation and analysis techniques to uncover indirect but semantically meaningful relationships between resources within reference datasets, and hence between ARCOMEM entities and Web objects.</p></div>
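One simple notion of relatedness for indirect correlation is the hop distance between two enrichments in the reference graph. The graph below is a hand-coded toy standing in for the DBpedia graph, and breadth-first search stands in for more elaborate graph analysis:

```python
from collections import deque

# Toy RDF-style adjacency graph; illustrative only, not actual DBpedia data.
GRAPH = {
    "dbr:Jean-Claude_Trichet": {"dbr:European_Central_Bank"},
    "dbr:European_Central_Bank": {"dbr:Jean-Claude_Trichet", "dbr:Frankfurt"},
    "dbr:Frankfurt": {"dbr:European_Central_Bank"},
}

def distance(src, dst):
    """BFS hop count between two nodes; None if unconnected."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in GRAPH.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

# Trichet and the ECB are one hop apart: a candidate indirect correlation.
hops = distance("dbr:Jean-Claude_Trichet", "dbr:European_Central_Bank")
```

In a real reference graph almost everything is eventually connected, so a plain hop count must be combined with edge semantics and weighting, which is exactly the open research challenge described above.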
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results &amp; evaluation</head><p>For our experiments, we used a dataset composed of English and German archived Web objects constituting a sample of crawls relating to the financial crisis<ref type="foot" target="#foot_13">13</ref>. The English content covered 32 Facebook posts, 41,000 tweets and 800 user comments from greekcrisis.net. The German content consisted of archived data from the Austrian Parliament<ref type="foot" target="#foot_14">14</ref>, comprising 326 documents (mostly PDF, some HTML). Our extraction and enrichment experiments resulted in an evaluation dataset<ref type="foot" target="#foot_15">15</ref> of 99,569 unique entities of the types Event, Location, Money, Organization, Person and Time. Using the procedure described above, we obtained enrichments for 1,358 of the entities in our dataset using DBpedia (484 entities) and Freebase (975 entities). In total, we obtained 5,291 Freebase enrichments and 491 DBpedia enrichments. These enrichments formed 5,801 entity-enrichment pairs, 5,039 with Freebase and 492 with DBpedia.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fig. 3. Generated ARCOMEM graph and clusters</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Entity extraction evaluation</head><p>We have performed initial evaluations on the various text analysis components. We manually annotated a small corpus of 20 Facebook posts (in English) from the dataset described above with named entities to form a gold standard corpus. This contained 93 instances of Named Entities. For evaluating TermRaider, we took a larger set of 80 documents from the financial crisis dataset, from which TermRaider produced 1,003 term candidates (merged from the results of the three different scoring systems). Three human annotators selected valid terms from that list, and we produced a gold standard of 315 terms, comprising each term candidate selected by at least two annotators (221 terms selected by exactly two annotators and 94 selected by all three). While inter-annotator agreement was thus quite low, this is normal for a term extraction task, which is extremely subjective; in future we will tighten the annotation guidelines and provide further training to the annotators with the aim of reaching a better consensus.</p><p>For the NE recognition evaluation, we compared the system annotations with the gold standard. The system achieved a Precision of 80% and a Recall of 68% on the task of NE detection (i.e. detecting whether an entity was present or not, regardless of its type). On the task of type determination (i.e. assigning the correct type to the entity: Person, Organization, Location, etc.), the system performed with 98.8% Precision and 98.5% Recall. Overall (for the two tasks combined), this gives NE recognition scores of 79% Precision and 67% Recall. However, the results are slightly low because this also includes sentence detection. 
Normally, sentence detection is 100% accurate (or near enough), but in this case it is subject to the language detection issue, because we only perform entity detection on sentences deemed to be relevant (in the language of the task and corresponding to the relevant part of the document, in this case the actual text of the postings by the users). 26 of the missing system annotations in the document were outside the span of the annotated sentences, so could not have been annotated. Excluding these increased Recall from 68% to 83.9% for NE detection (shown in the table as "NE detection (adjusted)"), and from 67% to 73.5% for the complete NE recognition task (shown in the table as "Full NE recognition (adjusted)").</p><p>Table <ref type="table">1</ref> shows the NER evaluation results. For term recognition, we compared the TermRaider output for each scoring system with the gold standard set of terms, at different levels of the ranked list, as shown in Figure <ref type="figure">4</ref>. For the terms above the threshold, we achieved Precision scores of 31% and Recall of 90% for tf.idf, 73% Precision and 50% Recall for augmented tf.idf, and 63% Precision and 17% Recall for the Kyoto score. For any further processing, we use only the terms scored above the threshold by the augmented tf.idf.</p></div>
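The recall adjustment described above simply removes the out-of-span misses from the false negatives before recomputing. The counts below are illustrative placeholders, not the actual confusion matrix behind the reported figures:

```python
# Standard precision/recall; the adjusted recall excludes misses that fall
# outside the annotated sentence spans. Counts are illustrative only.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

out_of_span = 26  # misses outside annotated sentences, as reported above
p, r = precision_recall(tp=63, fp=16, fn=30)
p_adj, r_adj = precision_recall(tp=63, fp=16, fn=30 - out_of_span)
```

Since only false negatives are removed, precision is unchanged by the adjustment while recall rises, matching the pattern of the reported adjusted scores.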
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Enrichment and correlation evaluation</head><p>For this evaluation we randomly selected a set of entity-enrichment pairs. Our evaluation was performed manually by 6 judges, including graduate computer science students and researchers. The judges were asked to assign scores to each entity-enrichment pair, with "0" for incorrect and "1" for correct. We judge an enrichment as correct if it partially defines a specific dimension of the entity/event, that is, an enrichment does not need to completely match an entity. For instance, enrichments referring to http://dbpedia.org/resource/Doctor_(title) and http://dbpedia.org/page/Angela_Merkel and enriching an entity of type Person labelled "Dr Angela Merkel" were both equally ranked as correct. This is because entities and events may be related to multiple enrichments, each enriching a particular facet of the source entity/event. Each entity-enrichment pair was shown to at least 3 judges and their scores were averaged to alleviate bias. In case an entity label did not make sense to a judge, we assumed that there had been an error in the extraction phase. In this case we asked the judges to mark the corresponding entity as invalid and excluded it from the evaluation.</p><p>We computed the average scores of entity-enrichment pairs across judges and averaged the scores obtained for each entity type. Our initial clustering approach simply correlated entities/events which share equivalent enrichments. In total we generated 1,013 clusters with 2.85 entities on average, with a minimum of 2 and a maximum of 112 entities. Ambiguous enrichments led to redundant clusters and require additional disambiguation. 
For instance, a location entity labelled "Berlin" might be (correctly) enriched with http://rdf.freebase.com/ns/m/0xfhc and http://rdf.freebase.com/ns/m/047ckrl (each referring to a different location named "Berlin"), requiring additional disambiguation to clean up the clusters. To this end, we exploit graph analysis methods to detect the closeness of enrichments originating from the same object. For instance, measuring the relatedness of the entities "Berlin" and "Angela Merkel" used to annotate the same Web object will allow us to disambiguate the "Berlin" enrichments.</p></div>
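Identifying the entities that need such disambiguation amounts to flagging those whose enrichments point at more than one reference item, as in this sketch (toy identifiers with shortened Freebase mids):

```python
from collections import defaultdict

# Toy (entity, enrichment URI) pairs; real pairs come from the
# DBpedia/Freebase enrichment step.
pairs = [
    ("Berlin#1", "fb:m.0xfhc"),    # Berlin, Germany
    ("Berlin#1", "fb:m.047ckrl"),  # another location named Berlin
    ("Angela Merkel#1", "fb:m.0jl0g"),
]

def ambiguous_entities(pairs):
    """Entities whose enrichments point at more than one reference item."""
    by_entity = defaultdict(set)
    for entity, uri in pairs:
        by_entity[entity].add(uri)
    return {e for e, uris in by_entity.items() if len(uris) > 1}

flagged = ambiguous_entities(pairs)  # candidates for graph-based disambiguation
```

The flagged entities are then the input to the graph-based relatedness analysis described above, which would choose among the competing enrichments.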
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion and future work</head><p>In this paper we have presented our current strategy for entity extraction and enrichment as realised within the ARCOMEM project, aimed at creating a large knowledge base of structured information about archived heterogeneous Web content. Based on an integrated processing chain, we tackle entity consolidation and enrichment as an implicit activity in the information extraction workflow. The entity extraction results show respectable scores for this kind of social media data, on which NLP techniques typically struggle. However, current work is focusing on better handling of degraded English (tokenisation, language recognition, etc.) and especially of tweets, which should improve the entity extraction further. The enrichment results indicate a comparatively good quality of generated enrichments. The results obtained from DBpedia Spotlight provided lower recall but introduced fewer ambiguous enrichments, due to Spotlight's inherent disambiguation feature. On the other hand, partially matched keywords reduce precision. As future work, we foresee several directions for improving the quality of the enrichment results. For example, one possibility is to use structured DBpedia queries to restrict entity types, similar to the approach used for Freebase. We also consider the introduction of subtypes of entities to further increase the granularity of the types to be matched.</p><p>In addition, since the preservation of Web content over time has to consider temporal aspects, the evolution of entities and terms, as well as time-dependent disambiguation, are important research areas currently under investigation <ref type="bibr" target="#b23">[24]</ref>. While our current data consolidation approach only detects direct relationships between entities sharing the same enrichments, our main efforts are dedicated to investigating graph analysis mechanisms. 
Thus, we aim to further take advantage of knowledge encoded in large reference graphs to automatically identify semantically meaningful relationships between disparate entities extracted during different processing cycles. Given the increasing use of both automated NER tools and reference datasets such as DBpedia, WordNet or Freebase, there is a growing need for consolidating automatically extracted information on the Web, which we aim to facilitate with our work.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 2.</head><label>2</label><figDesc>Fig. 2. Enrichment and correlation example: ARCOMEM Web objects, entities/events, associated DBpedia enrichments and identified correlations</figDesc></figure>
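The future-work idea of structured DBpedia queries that restrict entity types could look like the sketch below. The query shape and the use of the dbo:Person class are our assumptions for illustration, not the paper's implementation; in practice the query would be sent to a SPARQL endpoint such as DBpedia's.

```python
def build_typed_lookup(label, dbo_type):
    """Build a SPARQL query that looks up DBpedia resources by English
    label, restricted to a given DBpedia ontology class (e.g. dbo:Person).
    Restricting the type filters out candidates of the wrong kind before
    they become ambiguous enrichments.
    """
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
SELECT DISTINCT ?resource WHERE {{
  ?resource rdfs:label "{label}"@en ;
            a dbo:{dbo_type} .
}} LIMIT 10
"""
```

For example, `build_typed_lookup("Angela Merkel", "Person")` yields a query that only matches resources typed as dbo:Person, mirroring the type-restricted lookups already used for Freebase.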
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2.</head><label>2</label><figDesc>Enrichment evaluation results: average scores of the enrichment-entity pairs obtained using DBpedia and Freebase for different ARCOMEM entity types.</figDesc><table><row><cell>Entity Type</cell><cell>Avg. Score DBpedia</cell><cell>Avg. Score Freebase</cell><cell>Avg. Score Total</cell></row><row><cell>Location</cell><cell>0.94</cell><cell>0.94</cell><cell>0.94</cell></row><row><cell>Money</cell><cell>0.63</cell><cell>-</cell><cell>0.63</cell></row><row><cell>Organization</cell><cell>0.93</cell><cell>1</cell><cell>0.97</cell></row><row><cell>Person</cell><cell>0.72</cell><cell>0.89</cell><cell>0.8</cell></row><row><cell>Time</cell><cell>1</cell><cell>-</cell><cell>1</cell></row><row><cell>Total</cell><cell>0.84</cell><cell>0.94</cell><cell>0.89</cell></row></table></figure>
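The row totals in the evaluation table are macro-averages over the available per-source averages (e.g. Person: (0.72 + 0.89) / 2 ≈ 0.8). A small sketch of that computation, with the triple format and function name being our own hypothetical choices:

```python
from collections import defaultdict


def average_scores(pair_scores):
    """Macro-average judge scores per entity type: first average the pairs
    within each (entity type, source) cell, then average the per-source
    cell averages to obtain the row total.

    pair_scores: iterable of (entity_type, source, score) triples.
    Returns a dict mapping entity type -> row-total average.
    """
    cells = defaultdict(list)
    for etype, source, score in pair_scores:
        cells[(etype, source)].append(score)
    cell_avg = {key: sum(v) / len(v) for key, v in cells.items()}

    totals = {}
    for etype in {t for t, _ in cell_avg}:
        row = [v for (t, _), v in cell_avg.items() if t == etype]
        totals[etype] = sum(row) / len(row)
    return totals
```

Types enriched by only one source (Money, Time) simply keep that source's average as their total, matching the "-" cells in the table.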
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">Proceedings of the 2nd International Workshop on Semantic Digital Archives(SDA 2012)   </note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1">http://spotlight.dbpedia.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2">http://gate.ac.uk/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3">http://www.opencalais.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_4">http://www.zemanta.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_5">http://www.arcomem.eu</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_6">http://www.w3.org/RDF/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_7">Apache Foundation; The Apache HBase Project: http://hbase.apache.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_8">http://www.gate.ac.uk/ns/ontologies/arcomem-datamodel.rdf</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_9">http://lod-cloud.net/state</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_10">http://gate.ac.uk/projects/arcomem/TermRaider.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_11">http://dbpedia.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_12">http://www.freebase.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_13">Parts of the archived crawls are available at http://collections.europarchive.org/arcomem/. &lt;Enrichment&gt;http://dbpedia.org/resource/Jean-Claude_Trichet&lt;/Enrichment&gt; &lt;Enrichment&gt;http://dbpedia.org/resource/ECB&lt;/Enrichment&gt; &lt;Event&gt;Trichet warns of systemic debt crisis&lt;/Event&gt; &lt;Person&gt;Jean Claude Trichet&lt;/Person&gt; &lt;Organisation&gt;ECB&lt;/Organisation&gt;</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_14">http://www.parliament.gv.at/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_15">The SPARQL endpoint of our dataset (extracted entities and enrichments) is available at http://arcomem.l3s.uni-hannover.de:9988/openrdf-sesame/repositories/arcomem-rdf?query.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work is partly funded by the European Union under FP7 grant agreement n° 270239 (ARCOMEM).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Linked Data - The Story So Far</title>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Heath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Berners-Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal on Semantic Web and Information Systems</title>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Exploiting Conceptual Spaces for Ontology Integration</title>
		<author>
			<persName><forename type="first">S</forename><surname>Dietze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Domingue</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop: Data Integration through Semantic Technology (DIST2008) Workshop at 3rd Asian Semantic Web Conference (ASWC) 2008</title>
				<meeting><address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Exploiting Metrics for Similarity-based Semantic Web Service Discovery</title>
		<author>
			<persName><forename type="first">S</forename><surname>Dietze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gugliotta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Domingue</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 7th International Conference on Web Services (ICWS 2009)</title>
				<meeting><address><addrLine>Los Angeles, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Link prediction in complex networks: a survey</title>
		<author>
			<persName><forename type="first">L</forename><surname>Lü</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Physica A</title>
		<imprint>
			<biblScope unit="volume">390</biblScope>
			<biblScope unit="page" from="1150" to="1170" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A survey of link prediction in social networks</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Zaki</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Social Network Data Analytics</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Aggarwal</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<biblScope unit="page" from="243" to="276" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A comparison of string distance metrics for name-matching tasks</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">D</forename><surname>Ravikumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Fienberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IIWeb</title>
				<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Reference reconciliation in complex information spaces</title>
		<author>
			<persName><forename type="first">X</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Halevy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Madhavan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2005">2005</date>
			<publisher>SIGMOD</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Duplicate record detection: A survey</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Elmagarmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Ipeirotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">S</forename><surname>Verykios</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">TKDE</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Learning domain-independent string transformation weights for high accuracy object identification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Tejada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Knoblock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Minton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">KDD</title>
				<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Principal Direction Divisive Partitioning</title>
		<author>
			<persName><forename type="first">D</forename><surname>Boley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Mining and Knowledge Discovery</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Syntactic Clustering of the Web</title>
		<author>
			<persName><forename type="first">A</forename><surname>Broder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Glassman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Manasse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zweig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th International World Wide Web Conference</title>
				<meeting>the 6th International World Wide Web Conference</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Ontology-based Text Clustering</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hotho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maedche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Staab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IJCAI Workshop on Text Learning: Beyond Supervision</title>
				<meeting>the IJCAI Workshop on Text Learning: Beyond Supervision</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Named Entity Recognition from Diverse Text Types</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tablan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ursu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cunningham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wilks</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Recent Advances in Natural Language Processing 2001 Conference</title>
				<meeting><address><addrLine>Tzigov Chark, Bulgaria</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Adapting SVM for Data Sparseness and Imbalance: A Case Study on Information Extraction</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bontcheva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cunningham</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Natural Language Engineering</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">02</biblScope>
			<biblScope unit="page" from="241" to="271" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Term-weighting approaches in automatic text retrieval</title>
		<author>
			<persName><forename type="first">C</forename><surname>Buckley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Salton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing and Management</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="513" to="523" />
			<date type="published" when="1988">1988</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">NLP techniques for term extraction and ontology population</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peters</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Ontology Learning and Population: Bridging the Gap between Text and Knowledge</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Buitelaar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</editor>
		<meeting><address><addrLine>Amsterdam</addrLine></address></meeting>
		<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="171" to="199" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Cross-domain feature selection for language identification</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Baldwin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 5th International Joint Conference on Natural Language Processing</title>
				<meeting>5th International Joint Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2011-11">November 2011</date>
			<biblScope unit="page" from="553" to="561" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications</title>
		<author>
			<persName><forename type="first">H</forename><surname>Cunningham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bontcheva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tablan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL&apos;02)</title>
				<meeting>the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL&apos;02)</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Named entity recognition in tweets: An experimental study</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ritter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><surname>Mausam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of Empirical Methods for Natural Language Processing (EMNLP)</title>
				<meeting>Empirical Methods in Natural Language Processing (EMNLP)<address><addrLine>Edinburgh, UK</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Exploiting the Social and Semantic Web for guided Web Archiving</title>
		<author>
			<persName><forename type="first">T</forename><surname>Risse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dietze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Doka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Stavrakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Senellart</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The International Conference on Theory and Practice of Digital Libraries</title>
				<meeting><address><addrLine>Cyprus</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012-09">September 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Challenges in developing opinion mining tools for social media</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bontcheva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rout</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of @NLP can u tag #usergeneratedcontent?! Workshop at LREC 2012</title>
				<meeting>@NLP can u tag #usergeneratedcontent?! Workshop at LREC 2012<address><addrLine>Istanbul, Turkey</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012-05">May 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Bootstrapping languageneutral term extraction</title>
		<author>
			<persName><forename type="first">W</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vossen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">7th Language Resources and Evaluation Conference (LREC)</title>
				<meeting><address><addrLine>Valletta, Malta</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">A nonparametric method for extraction of candidate phrasal terms</title>
		<author>
			<persName><forename type="first">P</forename><surname>Deane</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics</title>
				<meeting>the 43rd Annual Meeting on Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Towards Automatic Language Evolution Tracking: A Study on Word Sense Tracking</title>
		<author>
			<persName><forename type="first">N</forename><surname>Tahmasebi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Risse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dietze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn2011)</title>
				<meeting><address><addrLine>Bonn, Germany</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
	<note>the 10th International Semantic Web Conference (ISWC2011)</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Hexastore: sextuple indexing for semantic web data management</title>
		<author>
			<persName><forename type="first">C</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Karras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bernstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the VLDB Endowment</title>
				<meeting>the VLDB Endowment</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1008" to="1019" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
