<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Using Events for Content Appraisal and Selection in Web Archives ⋆</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Risse</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<address>
									<settlement>Hanover</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stefan</forename><surname>Dietze</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<address>
									<settlement>Hanover</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Diana</forename><surname>Maynard</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<settlement>Sheffield</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nina</forename><surname>Tahmasebi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<address>
									<settlement>Hanover</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wim</forename><surname>Peters</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<settlement>Sheffield</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Using Events for Content Appraisal and Selection in Web Archives ⋆</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">6EE83EABB97AE8B5440136BE98A25B87</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Event Detection</term>
					<term>Crawler Guidance</term>
					<term>Web Archiving</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With the rapidly growing volume of resources on the Web, Web archiving becomes an important challenge. In addition, the notion of community memories extends traditional Web archives with related data from a variety of sources on the Social Web. Community memories take an entity-centric view to organise Web content according to the events and the entities related to them, such as persons, organisations and locations. To this end, the main challenge is to extract, detect and correlate events and related information from a vast number of heterogeneous Web resources where the nature and quality of the content may vary heavily. In this paper we present the approach of the ARCOMEM project which is based on an iterative cycle consisting of (1) targeted archiving/crawling of Web objects, (2) entity and event extraction and detection, and (3) refinement of crawling strategy.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Given the ever-increasing importance of the World Wide Web as a source of information, adequate Web archiving and preservation has become a cultural necessity in preserving knowledge. However, in addition to the "common" challenges of digital preservation, such as media decay, technological obsolescence, authenticity and integrity issues, Web preservation has to deal with the sheer size and ever-increasing growth rate of Web data. Hence, selection of content sources becomes a crucial task for archival organisations. Instead of following a "collect-all" strategy, archival organisations are trying to build community memories that reflect the diversity of information people are interested in. Community memories largely revolve around events and the entities related to them such as persons, organisations and locations. These may be unique events such as the first landing on the moon or a natural disaster, or regularly occurring events such as elections or TV serials.</p><p>In this work, we refer to an event as a situation within the domain (states, actions, processes, properties) expressed by one or more relations. Events can be expressed by text elements such as:</p><p>-verbal predicates and their arguments ("The committee dismissed the proposal"); -noun phrases headed by nominalizations ("economic growth"); -adjective-noun combinations ("governmental measure"; "public money"); -event-referring nouns ("crisis", "cash injection").</p><p>Events can denote different levels of semantic granularity, i.e. general events can contain more specific sub-events. 
For instance, the performances of various bands form sub-events of a wider music event, while a general event like "Turkey's EU accession" has sub-events such as the European Parliament approving Turkey's Progress Report.</p><p>In this paper, we provide an overview of the approach we follow in the ARCOMEM<ref type="foot" target="#foot_0">3</ref> project. The overall aim is to create incrementally enriched Web archives which allow access to all sorts of Web content in a structured and semantically meaningful way. In addition to topic-centred preservation approaches, we are exploring event- and entity-centred processes for content appraisal and acquisition as well as rich preservation. By considering a wide range of content, a more diverse archive is created, taking into account a variety of dimensions including perspectives taken, sentiments, images used, and information sources.</p><p>To build a community archive from Web content, a Web crawler needs to be guided in an intelligent way based on the events and entities derived from previous crawl campaigns so that pages are crawled and archived if they relate to a specified event or entity. While at the beginning of any crawl campaign the amount of information is very limited, the crawler needs to learn about the event incrementally, while at the same time it has to decide about following links. Therefore, our approach is based on an iterative cycle consisting of the following steps:</p><p>1. Targeted archiving/crawling of Web objects; 2. Entity and event extraction and detection; 3. Refinement of crawling strategy.</p><p>To this end, the main challenges are related to the extraction, detection and correlation of entities, events and related information in a vast number of heterogeneous Web resources. 
While extraction covers the identification and structured representation of knowledge about events and entities from previously unstructured material from scratch, detection refers to recognising previously extracted events and entities in newly encountered content. Therefore, in contrast to the extraction step, detection takes advantage of existing structured data about events and entities. Both processes face issues arising from the diversity of the nature and quality of Web content, in particular when considering social media and user-generated content, where further issues are posed by informal use of language.</p><p>In the following section, we give an overview of related work, and introduce the ARCOMEM approach and architecture in Section 3. Section 4 provides an overview of the event detection mechanisms deployed by ARCOMEM, while we discuss some key challenges in Section 5.</p></div>
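The three-step iterative cycle above can be sketched in outline as follows. All function names and the toy crawl/extraction logic here are illustrative assumptions for exposition, not ARCOMEM components:

```python
# Hypothetical sketch of the iterative cycle: (1) targeted crawling,
# (2) entity/event extraction, (3) refinement of the crawl specification.

def crawl(seed_urls, spec):
    """Stand-in crawl: return the pages reachable from the seeds."""
    return [{"url": u, "text": "Page about Coldplay at " + u} for u in seed_urls]

def extract_entities_events(pages):
    """Stand-in extraction: naive capitalisation-based entity spotting."""
    found = set()
    for page in pages:
        for token in page["text"].split():
            if token.istitle():  # crude placeholder for real NER
                found.add(token)
    return found

def refine_spec(spec, extracted):
    """Merge newly extracted entities/events into the crawl specification."""
    return {"terms": spec["terms"] | extracted}

def iterative_cycle(seed_urls, spec, rounds=3):
    """Run the crawl / extract / refine loop for a fixed number of rounds."""
    for _ in range(rounds):
        pages = crawl(seed_urls, spec)
        extracted = extract_entities_events(pages)
        spec = refine_spec(spec, extracted)
    return spec
```

The point of the sketch is only the control flow: each round's extraction output feeds the next round's specification.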
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Since 1996, several projects have pursued Web archiving (e.g. <ref type="bibr" target="#b0">[AL98]</ref>). The Heritrix crawler <ref type="bibr" target="#b15">[MKSR04]</ref>, jointly developed by several Scandinavian national libraries and the Internet Archive through the International Internet Preservation Consortium (IIPC)<ref type="foot" target="#foot_1">4</ref>, is a mature and efficient tool for large-scale, archival-quality crawling.</p><p>The method of choice for memory institutions is client-side archiving based on crawling. This method is derived from search engine crawling, and has been evolved by the archiving community to achieve better completeness of capture and to improve the temporal coherence of crawls. These two requirements come from the fact that, for web archiving, crawlers are used to build collections and not only to index <ref type="bibr" target="#b10">[Mas06]</ref>. These issues were addressed in the European project LiWA (Living Web Archives)<ref type="foot" target="#foot_2">5</ref>.</p><p>The task of crawl prioritisation and focusing is the step in the crawl processing chain which combines the different analysis results and the crawl specification for filtering and ranking the URLs of a seed list. The filtering of URLs is necessary to avoid unrelated content in the archive. For content that is partly relevant, URLs need to be prioritised to focus the crawler so that content is fetched in order of relevance. A number of strategies, and corresponding URL ordering metrics, exist for this, such as breadth-first, back link count and PageRank. PageRank and breadth-first are good strategies to crawl "important" content on the web <ref type="bibr" target="#b6">[CGMP98]</ref><ref type="bibr" target="#b4">[BYCMR05]</ref>, but since these generic approaches do not cover specific information needs, focused or topical crawls have been developed <ref type="bibr" target="#b5">[CBD99]</ref> [MPS04]. 
However, these approaches have only a vague notion of topicality and do not address event-based crawling.</p><p>Entity and event recognition are two of the major tasks within Information Extraction, and have been successfully applied in research areas such as ontology generation, bioinformatics, news aggregation, business intelligence and text classification. Recognising events in these fields is generally carried out by means of pre-defined sets of relations, possibly structured into an ontology, which makes such tasks domain dependent, but feasible. Entity extraction in this case comprises both named entity recognition <ref type="bibr" target="#b9">[CMBT02]</ref> and term recognition <ref type="bibr" target="#b3">[BS09,</ref><ref type="bibr" target="#b16">MLP08]</ref>.</p><p>The identification of relations between entities in text is generally performed by means of heuristic, rule-based applications using background knowledge from instantiated ontologies and lexico-syntactic patterns to establish links between textual entities and their ontological provenance <ref type="bibr" target="#b13">[MFP09a]</ref>, or a combination of statistical and linguistic techniques <ref type="bibr" target="#b17">[MPB08]</ref>. Tools such as Espresso <ref type="bibr" target="#b20">[PP06]</ref> and Text2Onto [CLS05] make use of predefined or automatically extracted text patterns in order to structure the domain in terms of classes and relations. Furthermore, shallow parsing techniques such as semantic role labelling [Gil02] characterise the relationship between predicates (relations) and their arguments (entities) on a semantic level by means of roles such as agent and patient. 
On the other hand, unsupervised machine learning techniques such as TextRunner <ref type="bibr" target="#b2">[BE08]</ref> and Powerset<ref type="foot" target="#foot_3">6</ref> scale to the extraction of facts from hundreds of millions of web pages, but they use only very shallow linguistic analysis and may not be so accurate. Powerset, for example, uses advanced parsing and some NLP techniques, but it does not understand word and phrase meanings in context. In this work, we position our event extraction approach somewhere between the very constrained template-filling approach used in MUC, and the open domain approach of finding new relations over the whole web, used by systems such as TextRunner and Powerset.</p><p>In addition, for representation of events and entities we consider Semantic Web and Linked Data-based approaches, as one of our fundamental aims is to expose the generated knowledge in an interoperable and reusable way. We consider in particular Linked Open Descriptions of Events, LODE <ref type="bibr" target="#b21">[STH09]</ref>, Event-Model-F <ref type="bibr" target="#b1">[ASS09]</ref> and the Event Ontology<ref type="foot" target="#foot_4">7</ref>. While LODE and the Event Ontology follow a similar approach and provide rather lightweight RDF schemas for event description, the Event-Model-F is a more formal OWL ontology which applies the DOLCE Descriptions and Situations pattern by using DOLCE+DnS Ultralight (DUL)<ref type="foot" target="#foot_5">8</ref> as an upper level ontology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Approach and Architecture</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Overall Approach</head><p>The goal for the ARCOMEM system is to develop methods and tools for transforming digital archives into community memories based on novel socially-aware and socially-driven preservation models. This will be done by leveraging the Wisdom of the Crowds reflected in the rich context and reflective information in the Social Web for driving innovative, concise and socially-aware content appraisal and selection processes for preservation, taking events, entities and topics as seeds, and by encapsulating this functionality into an adaptive decision support tool for the archivist.</p><p>Archivists will be able to trigger interactive and intelligent content appraisal and selection processes in two ways: either by example or by a high-level description of relevant entities, topics and events. Intelligent and adaptive decision support for this will be based on combining and reasoning about the extracted information and inferring semantic knowledge, combining logic reasoning with adaptive content selection strategies and heuristics.</p><p>The system is built around two loops: content selection and content enrichment. The content selection loop aims at content filtering based on community reflection and appraisal. Social Web content will be analysed regarding the interlinking, context and popularity of web content, regarding events, topics and entities. These results are used for building the seed lists to be used by existing Web crawlers. Within the content enrichment loop, newly crawled pages will be analysed for topics, entities, events, perspectives, Social Web context and evolutionary aspects in order to link them together by means of the events and entities.</p><p>In the following we will focus on the content selection loop.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Architecture</head><p>The main tasks of a Web crawler are to download a Web page and to extract links from that page to find more pages to crawl. An intelligent filtering and ranking of links enables focusing of the crawls. We will combine a breadth-first strategy with a semantic ranking that takes into account events, topics, opinions and entities (ETOEs). The extracted links are weighted according to the relevance of the page to the semantically rich crawl specification. The general architecture is depicted in Figure <ref type="figure" target="#fig_0">1</ref>. The whole process is divided into an online and offline phase. The online phase focuses on the crawl task itself and the guiding of the crawler, while the offline phase is used to analyse the crawl results and the crawl specification to set up a knowledge base for the online decision making.</p><p>Offline Phase To bootstrap a new crawl campaign, the archivist specifies a crawl by giving an initial seed list complemented with some information about events, entities and topics, e.g. [Event: "Rock am Ring"], [Band: "Coldplay"], [Location: "Nürburgring"]. The idea behind the following process is that the archivist is not able to give a full crawl specification as they cannot be fully aware of how the events, topics, etc. they are interested in are represented on the web. Therefore, the crawler needs to help the archivist to improve the specification.</p><p>The initial seed list is used by the URL Fetcher to initiate a reference crawl. This reference crawl will be analysed by the offline analysis component to extract ETOEs, which are used to derive an extended crawl specification. In this step the archivists need to assess the relevance of the extracted information to the envisioned crawl. They have the possibility to weight the information and also to explicitly exclude some of it from the crawl. 
The resulting extended crawl specification is handed over to the online phase.</p><p>In addition to the extended crawl specification, a knowledge base will be built, in order to provide additional information such as more detailed descriptions of events or entities, different lexical forms or other disambiguation information. The offline phase will be called regularly from the online phase to further improve the crawl specification and the knowledge base.</p><p>Online Phase The online analysis component receives newly crawled pages from the crawler and the extended crawl specification from the offline phase. Due to the necessary high crawl frequency, the processing time and decision making for a single page should take no longer than 2-3 seconds. Therefore, complex analyses such as extracting new ETOEs are not possible. Instead, the analysis component will rely on the information in the knowledge base to detect the degree of relevance of a page to the crawl specification, to rank the extracted links and to update the priority queue of the crawler accordingly. The crawler processes the priority queue and hands over new pages to the online analysis.</p></div>
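The link ranking of the online phase can be sketched as a priority queue keyed on relevance to the extended crawl specification. The scoring function and class names below are simplifying assumptions, not the actual ARCOMEM implementation:

```python
# Sketch: score extracted links against the crawl specification and keep
# the crawler's frontier as a priority queue, highest relevance first.
import heapq

def relevance(link_text, spec_terms):
    """Fraction of specification terms that appear in the link's anchor text."""
    words = set(link_text.lower().split())
    if not spec_terms:
        return 0.0
    hits = sum(1 for term in spec_terms if term.lower() in words)
    return hits / len(spec_terms)

class Frontier:
    """Priority queue of URLs ordered by descending relevance score."""
    def __init__(self):
        self._heap = []

    def push(self, url, score):
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        score, url = heapq.heappop(self._heap)
        return url, -score

# Usage: the page matching the specification outranks the unrelated one.
frontier = Frontier()
spec = {"coldplay", "festival"}
for url, anchor in [("a.html", "Coldplay live at the festival"),
                    ("b.html", "site imprint")]:
    frontier.push(url, relevance(anchor, spec))
```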
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Event Extraction</head><p>The event extraction method we adopt involves the recognition of entities and the relations between them in order to find domain-specific events and situations. As discussed in Section 2, in a (semi-)closed domain, this approach is preferable to an open IE-based approach which holds no preconceptions about the kinds of entities and relations possible. Building on the work of <ref type="bibr" target="#b19">[MYKK05]</ref>, we combine a number of different techniques, using two parallel strategies for event detection. The top-down approach, similar to a template-based IE approach as used in the Message Understanding Conferences <ref type="bibr" target="#b7">[CHL93]</ref>, consists of identifying a number of important events, based on analysis of the user needs and manual inspection of the corpora. Here, the slots are known in advance and the values are entities extracted from the text. In our Rock am Ring use case, the following example depicts a band performance event:</p><p>Band: Coldplay; Relation: performed; Date: 3 June 2011</p><p>The technique consists of pre-defining a set of templates for the various relations, and then using a rule-based approach based on GATE <ref type="bibr" target="#b9">[CMBT02]</ref> to identify the relevant slot values. First, we perform linguistic pre-processing (tokenisation, sentence splitting, POS tagging, morphological analysis, and verb and noun phrase chunking), followed by entity extraction, which includes both named entities and terms: for this we make use of slightly modified versions of ANNIE <ref type="bibr" target="#b9">[CMBT02]</ref> and TermRaider respectively. The third stage involves a semantic approach to finding the verbal expressions which represent the relations. 
We automatically create sets of verbs representing each relation, using information from WordNet and VerbNet to group verbs into semantic categories: for example, the relation "perform" might be represented by any morphosyntactic variant of the verbs "perform", "play", "sing", "appear" etc. We then develop hand-crafted rules to match sentences containing the relevant entities and verbs: for example, a rule to match the "performance" event described above should contain an entity representing a band name as the subject of a "perform" verb, and optionally a date and/or time within the sentence.</p><p>This kind of rule-based approach tends to be very accurate, achieving relatively high levels of precision (depending on how specific the rules are), but can suffer from low recall. On the other hand, a bottom-up technique involving open-domain IE can find previously unknown events and does not limit us to a fixed set of relations. This can be vital for discovering new information. By combining the high precision of the top-down method with the high recall of the bottom-up method, we can get the best of both worlds if done correctly.</p><p>The bottom-up approach we adopt is rather different from the machine learning approach adopted by e.g. <ref type="bibr" target="#b2">[BE08]</ref>, in that we still specify hand-coded rules. However, these rules are flexible and under-specified, making use of linguistic structure and semantic relations from WordNet <ref type="bibr" target="#b11">[ME90]</ref> rather than prespecifying exact relations. We use the Noun Phrase and Verb Phrase chunker from GATE to identify certain linguistic patterns contextualising verb phrases, and then cluster these verbs into semantically related categories to find new relations. The participants in the relations can also be semantically clustered around similar relation types, such that an iterative development cycle can be produced. 
We also combine rules for ontology learning developed in SPRAT <ref type="bibr" target="#b14">[MFP09b]</ref> which can be used to find patterns denoting relations between entities, such as hyponyms and properties. Preliminary experiments with news texts in English have found a number of such relations. We do not restrict ourselves only to verbal relations, but also look for nominalisations. For example, "the arrest of Mr Daoudi in Leicester" is semantically equivalent to "Mr Daoudi was arrested in Leicester". The work on event detection is still very much in progress, and it is clear that there are many difficult issues to solve. We do not use full parsing because it is very slow and because it does not work so well on social media where English is often not written correctly in full sentences. Related work on opinion mining from tweets <ref type="bibr" target="#b12">[MF11]</ref> has shown that shallow linguistic techniques are, however, promising for extracting knowledge from this kind of noisy data, using backoff strategies and fuzzy matching where necessary.</p></div>
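The top-down, template-filling strategy described above can be illustrated with a minimal sketch. The verb cluster, band gazetteer and date pattern below are illustrative assumptions; the real system uses GATE, ANNIE, TermRaider and WordNet/VerbNet verb groupings rather than these toy lists:

```python
# Toy sketch of template filling: match sentences where a known band name
# co-occurs with a verb from a "perform" cluster, with an optional date.
import re

PERFORM_VERBS = {"perform", "performed", "play", "played", "sing", "sang",
                 "appear", "appeared"}
BANDS = {"Coldplay", "Rammstein"}  # stand-in for a gazetteer
DATE_RE = re.compile(r"\d{1,2} (January|February|March|April|May|June|July|"
                     r"August|September|October|November|December) \d{4}")

def extract_performance(sentence):
    """Fill the [Band, Relation, Date] template from one sentence, if possible."""
    tokens = sentence.split()
    band = next((t for t in tokens if t in BANDS), None)
    verb = next((t.strip(".,").lower() for t in tokens
                 if t.strip(".,").lower() in PERFORM_VERBS), None)
    if band is None or verb is None:
        return None  # required slots missing: no event
    date = DATE_RE.search(sentence)
    return {"Band": band, "Relation": "performed",
            "Date": date.group(0) if date else None}
```

As in the rule-based approach, precision comes from the required slots (band and verb), while optional slots such as the date simply stay empty when absent.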
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Challenges</head><p>For the long-term availability and usage of Web content, it is important to preserve not only the content itself but also its context and interactions from relevant Web destinations. These include those that the content providers own (the main portal, channel portals or programme portals), those that they partner with (e.g. joint broadcaster portals), social media services or platforms, and both professional and user blogs/websites. This type of content is varied and comprises general content, commenting, rating, ranking and forwarding, while containing both structured data and unstructured free text.</p><p>To this end, it is a challenge to manage and correlate content from these information sources, differing in quality, form (e.g. both audiovisual and textual material) and structure. In order to achieve a focused crawl, it is necessary to identify semantically related objects, e.g. ones which discuss the same events or entities. However, the preservation and identification of correlations within such a diverse variety of Web sources poses a number of key challenges:</p><p>1. extraction of events and entities from heterogeneous and unstructured content; 2. detection of events and entities in heterogeneous and unstructured content; 3. targeted Web crawling.</p><p>Entity and event extraction from unstructured and heterogeneous Web data is one of the key challenges. This involves the use of natural language processing (NLP) techniques to extract events and entities from unstructured and heterogeneous text (as described in Section 4), and video analysis techniques to deal with audiovisual material. Although extraction is performed in the offline phase (see Fig. <ref type="figure" target="#fig_0">1</ref>), there are still time requirements. Because the newly extracted entities and events are used in the online phase to focus the crawl, the extraction must be reasonably fast. 
To keep the crawl from becoming too diffuse, the results of the extraction must also be highly accurate, which provides an additional challenge.</p><p>In contrast to the extraction, the detection of events and entities needs to exploit the data captured in the knowledge base in order to automatically detect events and entities. Both NLP and video processing techniques need to be exploited here too, but with much less time for analysis: this means that the processing will be more shallow. Because the detection occurs in the online phase (see Fig. <ref type="figure" target="#fig_0">1</ref>) and is in close interaction with the crawler, a key challenge is to perform the detection in a very short time frame and with limited time for deep, linguistic analysis.</p><p>Finally, the results of both processing phases in Section 3.2 are used for targeted Web crawling. This allows the crawling strategy to be gradually refined, based on the outcomes of the previous crawling, extraction and detection activities. It is a challenge to make appropriate use of these outcomes to create focused archives.</p></div>
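The contrast between offline extraction and online detection can be made concrete with a small sketch: detection only looks up entities and events already stored in the knowledge base, so it fits the short per-page time budget of the online phase. The knowledge-base contents and lookup strategy here are illustrative assumptions:

```python
# Sketch of shallow online detection: instead of running full extraction,
# match page text against entity/event names already in the knowledge base.

KNOWLEDGE_BASE = {"coldplay", "rock am ring", "nürburgring"}  # assumed contents

def detect_known(page_text, kb=KNOWLEDGE_BASE):
    """Fast, shallow detection: substring lookup of known names."""
    text = page_text.lower()
    return {name for name in kb if name in text}
```

A real system would use normalised forms and efficient multi-pattern matching rather than per-name substring scans, but the division of labour is the same: the expensive linguistic analysis stays offline.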
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>In this paper we have presented the approach we follow in the ARCOMEM project to build Web archives as community memories that revolve around events and the entities related to them. The need to make decisions during the crawl process with only a limited amount of information raises a number of issues. The division of online and offline processing allows us to separate the initial complex extraction of events and entities from the faster but shallower detection of them at crawl time. Furthermore, it allows learning more about the particular events and topics the archivist is interested in. However, the typically limited set of reference pages and the limited time to detect events during crawling are open issues to be addressed in the future. Moreover, the whole approach needs to be evaluated in real world scenarios: namely, crawling pages related to the election and to the upcoming Olympic Games.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Architecture for the Content Selection</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">ARCOMEM -From Collect-All Archives to Community Memories, http://www.arcomem.eu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">http://netpreserve.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">http://wiki.liwa-project.eu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">http://www.powerset.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">http://motools.sourceforge.net/event/event.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5">http://www.loa-cnr.it/ontologies/DUL.owl</note>
		</body>
		<back>

			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>⋆ This work is partly funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 270239 (ARCOMEM).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The Kulturarw Project -The Swedish Royal Web Archive</title>
		<author>
			<persName><forename type="first">Allan</forename><surname>Arvidson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frans</forename><surname>Lettenström</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electronic library</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">F-a model of events based on the foundational ontology dolce+dns ultralight</title>
		<author>
			<persName><forename type="first">C</forename><surname>Saathoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Scherp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Franz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Staab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Knowledge Capturing (K-CAP)</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The tradeoffs between open and traditional relation extraction</title>
		<author>
			<persName><forename type="first">M</forename><surname>Banko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACL-08</title>
				<meeting>ACL-08</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Term-weighting approaches in automatic text retrieval</title>
		<author>
			<persName><forename type="first">C</forename><surname>Buckley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Salton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing and Management</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="513" to="523" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Crawling a country: better strategies than breadth-first for web page ordering</title>
		<author>
			<persName><forename type="first">R</forename><surname>Baeza-Yates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Castillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Marin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Special interest tracks and posters of the 14th international conference on World Wide Web, WWW &apos;05</title>
				<meeting><address><addrLine>New York</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="864" to="872" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Focused crawling: a new approach to topic-specific web resource discovery</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Den Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Networks</title>
				<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="1623" to="1640" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Efficient crawling through url ordering</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Page</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the seventh international conference on World Wide Web 7, WWW7</title>
				<meeting>the seventh international conference on World Wide Web 7, WWW7<address><addrLine>Amsterdam, The Netherlands, The Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>Elsevier Science Publishers B. V</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="161" to="172" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Evaluating Message Understanding Systems: An Analysis of the Third Message Understanding Conference (MUC-3)</title>
		<author>
			<persName><forename type="first">N</forename><surname>Chinchor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hirschman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="409" to="449" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Gimme&apos; The Context: Context-driven automatic semantic annotation with C-PANKOW</title>
		<author>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ladwig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Staab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th World Wide Web Conference</title>
				<meeting>the 14th World Wide Web Conference</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications</title>
		<author>
			<persName><forename type="first">H</forename><surname>Cunningham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bontcheva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tablan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL&apos;02)</title>
				<meeting>of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL&apos;02)</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Web archiving</title>
		<author>
			<persName><forename type="first">Julien</forename><surname>Masanès</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">WordNet: An on-line lexical database</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Miller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Lexicography</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="235" to="312" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Automatic detection of political opinions in tweets</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Funk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of MSM 2011: Making Sense of Microposts Workshop at 8th Extended Semantic Web Conference</title>
				<meeting>of MSM 2011: Making Sense of Microposts Workshop at 8th Extended Semantic Web Conference<address><addrLine>Heraklion, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2011-05">May 2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">NLP-based support for ontology lifecycle development</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Funk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peters</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CK 2009 - ISWC Workshop on Collaborative Construction, Management and Linking of Structured Knowledge</title>
				<meeting><address><addrLine>Washington, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009-10">October 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">SPRAT: a tool for automatic semantic pattern-based ontology population</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Funk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peters</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of Int. Conf. for Digital Libraries and the Semantic Web</title>
				<meeting>of Int. Conf. for Digital Libraries and the Semantic Web<address><addrLine>Trento, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009-09">September 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Introduction to Heritrix, an archival quality web crawler</title>
		<author>
			<persName><forename type="first">G</forename><surname>Mohr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kimpton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ranitovic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">4th International Web Archiving Workshop (IWAW04)</title>
				<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">NLP Techniques for Term Extraction and Ontology Population</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peters</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Bridging the Gap between Text and Knowledge - Selected Contributions to Ontology Learning and Population from Text</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Buitelaar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</editor>
		<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Tree kernels for semantic role labeling</title>
		<author>
			<persName><forename type="first">A</forename><surname>Moschitti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Pighin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Basili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="193" to="224" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Topical web crawlers: Evaluating adaptive algorithms</title>
		<author>
			<persName><forename type="first">F</forename><surname>Menczer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Srinivasan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Internet Technol</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="378" to="419" />
			<date type="published" when="2004-11">Nov. 2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Ontology-based information extraction for market monitoring and technology watch</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maynard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yankova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kourakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kokossis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ESWC Workshop &quot;End User Aspects of the Semantic Web&quot;</title>
				<meeting><address><addrLine>Heraklion, Crete</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Espresso: Leveraging generic patterns for automatically harvesting semantic relations</title>
		<author>
			<persName><forename type="first">P</forename><surname>Pantel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pennacchiotti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Conference on Computational Linguistics / Association for Computational Linguistics (COLING/ACL-06)</title>
				<meeting>Conference on Computational Linguistics / Association for Computational Linguistics (COLING/ACL-06)<address><addrLine>Sydney, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="113" to="120" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">LODE: Linking Open Descriptions of Events</title>
		<author>
			<persName><forename type="first">R</forename><surname>Shaw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Troncy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hardman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web, 4th Asian Semantic Web Conference (ASWC 2009)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">Asunción</forename><surname>Gómez-Pérez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Yong</forename><surname>Yu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Ying</forename><surname>Ding</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">5926</biblScope>
			<biblScope unit="page" from="153" to="167" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
