<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">User driven Information Extraction with LODIE</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Anna</forename><forename type="middle">Lisa</forename><surname>Gentile</surname></persName>
							<email>a.gentile@sheffield.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Suvodeep</forename><surname>Mazumdar</surname></persName>
							<email>s.mazumdar@sheffield.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">User driven Information Extraction with LODIE</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">1E4C355F2C9EE5C859F2701B6F8EB012</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Information Extraction (IE) is the technique for transforming unstructured or semi-structured data into a structured representation that can be understood by machines. In this paper we use a user-driven Information Extraction technique to wrap entity-centric Web pages. The user can select concepts and properties of interest from available Linked Data. Given a number of websites containing pages about the concepts of interest, the method exploits (i) recurrent structures in the Web pages and (ii) available knowledge in Linked Data to extract the information of interest from the Web pages.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Information Extraction transforms unstructured or semi-structured text into structured data that can be understood by machines. It is a crucial technique towards realizing the vision of the Semantic Web. Wrapper Induction (WI) is the task of automatically learning wrappers (or extraction patterns) for a set of homogeneous Web pages, i.e. pages from the same website, generated using consistent templates<ref type="foot" target="#foot_0">1</ref>. WI methods <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> learn a set of rules enabling the systematic extraction of specific data records from the homogeneous Web pages. In this paper we adopt a user-driven paradigm for IE and perform on-demand extraction on entity-centric Web pages. We adopt our WI method <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref> developed within the LODIE (Linked Open Data for Information Extraction) framework <ref type="bibr" target="#b3">[4]</ref>. The main advantage of our method is that it does not require manually annotated pages: the training examples for the WI method are automatically generated by exploiting Linked Data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">State of the Art</head><p>Using WI to extract information from structured Web pages has been studied extensively. Early studies focused on the DOM-tree representation of Web pages and learned templates that wrap data records in HTML tags <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>. Supervised methods require manual annotation of example pages to learn wrappers for similar pages <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>. The number of required annotations can be drastically reduced by annotating pages from a specific website and then adapting the learnt rules to previously unseen websites of the same domain <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>. Completely unsupervised methods (e.g. RoadRunner <ref type="bibr" target="#b10">[11]</ref> and EXALG <ref type="bibr" target="#b11">[12]</ref>) require neither training data nor an initial extraction template (indicating which concepts and attributes to extract); they only assume the homogeneity of the considered pages. The drawback of unsupervised methods is that assigning semantics to the produced results is left to the user as a post-processing step. Hybrid methods <ref type="bibr" target="#b1">[2]</ref> seek a trade-off between these two limitations: they follow a supervised strategy, but generate the training data automatically by exploiting Linked Data. In this work we perform IE using the method proposed in <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref> and follow the general IE paradigm from <ref type="bibr" target="#b3">[4]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">User-driven Information Extraction</head><p>In LODIE we adopt a user-driven paradigm for IE. As a first step, the user must define her/his information need. This is done via a visual exploration of linked data (Figure <ref type="figure" target="#fig_0">1</ref>). The user can explore the underlying linked data using the Affective Graphs visualization tool <ref type="bibr" target="#b12">[13]</ref> and select the concepts and properties she/he is interested in (a screenshot is shown in Figure <ref type="figure" target="#fig_0">1</ref>). These concepts and properties are added to the side panel. Once the selection is finished, she/he can start the IE process. The IE starts with a dictionary generation phase. A dictionary d_{i,k} consists of the values of the attribute a_{i,k} for instances of the concept c_i. Noisy entries in the dictionaries are removed using a cleaning procedure detailed in <ref type="bibr" target="#b2">[3]</ref>. As a running example we will assume the user wants to extract title and author for the concept Book. We retrieve from the Web k websites containing entity pages of the concept types selected by the user, and save their pages as W_{c_i,k}. Following the Book example, the Barnes&amp;Noble<ref type="foot" target="#foot_1">2</ref> or AbeBooks<ref type="foot" target="#foot_2">3</ref> websites can be used, with their pages collected in W_{book,barnesandnoble} and W_{book,abebooks}.</p><p>For each W_{c_i,k} we generate a set of extraction patterns for every attribute. In our example we produce four sets of patterns, one for each combination of website and attribute. 
To produce the patterns we (i) use our dictionaries to generate brute-force annotations on the pages in W_{c_i,k} and then (ii) use statistical (occurrence frequency) and structural (position of the annotations in the webpage) clues to choose the final extraction patterns.</p><p>Briefly, a page is transformed into a simplified page representation P_{c_i}: a collection of pairs 〈xpath<ref type="foot" target="#foot_3">4</ref>, text value〉. Candidates are generated by matching the dictionaries d_{i,k} against the possible text values in P_{c_i} (Figure <ref type="figure" target="#fig_1">2</ref>).</p><formula xml:id="formula_0">/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[1]/H2[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/DIV[1]/H2[1]/EM[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[10]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[1]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[2]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[3]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[6]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[2]/DIV[4]/TABLE[8]/TBODY[1]/TR[1]/TD[3]/B[1]/A[1]/text()[1] breaking dawn
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[3]/DIV[3]/UL[1]/LI[2]/A[1]/text()[1] the host
/HTML[1]/BODY[1]/DIV[2]/DIV[2]/DIV[3]/DIV[3]/UL[1]/LI[5]/A[1]/text()[1] new moon</formula><p>Final patterns are chosen amongst the candidates by exploiting frequency information and other heuristics. Details of the method can be found in <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3</ref>]. In the running example, higher-scoring patterns for extracting the book title from the AbeBooks website are shown in Figure <ref type="figure" target="#fig_2">3</ref>. All extraction patterns are then used to extract target values from all W_{c_i,k}. Results are produced as linked data, using the concepts and properties initially selected by the user for representation, and made accessible to the user via an exploration interface (Figure <ref type="figure" target="#fig_3">4</ref>), implemented using Simile Widgets<ref type="foot" target="#foot_4">5</ref>.</p><p>A video showing the proposed system used with the running Book example can be found at http://staffwww.dcs.shef.ac.uk/people/A.L.Gentile/demo/iswc2014.html.</p></div>
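<div xmlns="http://www.tei-c.org/ns/1.0"><p>The candidate generation and pattern selection steps described in Section 3 can be illustrated with a minimal sketch. It is not the LODIE implementation: the page data, the shortened xpaths, the dictionary and the function names (normalise, candidate_xpaths, rank_patterns, extract) are hypothetical, and the scoring uses only page-level occurrence frequency, whereas the actual method in [2,3] combines frequency with further heuristics.</p>
<p>
```python
from collections import Counter
import re

# Hypothetical toy data: each page is a list of (xpath, text value) pairs,
# mirroring the simplified page representation described in the text.
PAGES = [
    [("/HTML[1]/BODY[1]/DIV[2]/H2[1]/text()[1]", "Breaking Dawn"),
     ("/HTML[1]/BODY[1]/DIV[3]/UL[1]/LI[2]/A[1]/text()[1]", "The Host")],
    [("/HTML[1]/BODY[1]/DIV[2]/H2[1]/text()[1]", "New  Moon"),
     ("/HTML[1]/BODY[1]/DIV[3]/UL[1]/LI[2]/A[1]/text()[1]", "Breaking Dawn")],
    [("/HTML[1]/BODY[1]/DIV[2]/H2[1]/text()[1]", "The Host"),
     ("/HTML[1]/BODY[1]/DIV[3]/UL[1]/LI[5]/A[1]/text()[1]", "Unknown Book")],
]

# Hypothetical dictionary of known book titles (assumed already cleaned).
TITLE_DICTIONARY = {"breaking dawn", "the host", "new moon"}


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def candidate_xpaths(page, dictionary):
    """Return the xpaths on one page whose text value is a dictionary entry."""
    return {xpath for xpath, value in page if normalise(value) in dictionary}


def rank_patterns(pages, dictionary):
    """Rank xpaths by the number of pages on which they yield a dictionary match.

    Assumption: frequency is the only clue used here; ties are broken
    alphabetically so that the ranking is deterministic.
    """
    counts = Counter()
    for page in pages:
        counts.update(candidate_xpaths(page, dictionary))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def extract(page, xpath):
    """Apply a selected pattern: return every text value found at that xpath."""
    return [value for path, value in page if path == xpath]


ranked = rank_patterns(PAGES, TITLE_DICTIONARY)
best_xpath = ranked[0][0]
for xpath, page_count in ranked:
    print(page_count, xpath)
```
</p>
<p>In this toy setting the H2 xpath matches a dictionary title on every page and is ranked first, while the list-item xpaths, which point at titles of other books, match on fewer pages. Calling extract(page, best_xpath) then returns the title found at that position on any page of the same website, including pages whose title is not in the dictionary.</p></div>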
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusions and future work</head><p>In this paper we describe the LODIE approach to performing IE on user-defined extraction tasks. The user is presented with a visual tool to explore available linked data and choose the concepts for which she/he wants to mine additional material from the Web. We learn extraction patterns to wrap relevant websites and return structured results to the user.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Exploring linked data to define the user need, by selecting concepts and attributes to extract. Here the user selected the concept Book and the attributes title and author. As author is a datatype attribute, of type Person, the attribute name is chosen.</figDesc><graphic coords="2,169.35,324.01,276.66,103.54" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: Example of candidates for book title for a Web page on the book "Breaking Dawn", from the website AbeBooks.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 :</head><label>3</label><figDesc>Fig. 3: Extraction patterns for book titles from the AbeBooks website.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 4 :</head><label>4</label><figDesc>Fig. 4: Exploration of results produced by the IE method</figDesc><graphic coords="4,167.81,116.83,276.65,129.67" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">For example, a yellow page website will use the same template to display information (e.g., name, address, cuisine) of different restaurants.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.barnesandnoble.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://www.abebooks.co.uk</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">http://www.w3.org/TR/xpath/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">http://www.simile-widgets.org/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Wrapper Induction for Information Extraction</title>
		<author>
			<persName><forename type="first">N</forename><surname>Kushmerick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IJCAI97</title>
		<imprint>
			<biblScope unit="page" from="729" to="735" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Unsupervised wrapper induction using linked data</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Gentile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Augenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ciravegna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the seventh international conference on Knowledge capture. K-CAP &apos;13</title>
				<meeting>of the seventh international conference on Knowledge capture. K-CAP &apos;13<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="41" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Self training wrapper induction with linked data</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Gentile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ciravegna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th International Conference on Text, Speech and Dialogue (TSD</title>
				<meeting>the 17th International Conference on Text, Speech and Dialogue (TSD</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="295" to="302" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Lodie: Linked open data for web-scale information extraction</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ciravegna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Gentile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SWAIE</title>
		<imprint>
			<biblScope unit="page" from="11" to="22" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Hierarchical wrapper induction for semistructured information sources</title>
		<author>
			<persName><forename type="first">I</forename><surname>Muslea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Minton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Knoblock</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Autonomous Agents and Multi-Agent Systems</title>
		<imprint>
			<biblScope unit="page" from="1" to="28" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Learning information extraction rules for semi-structured and free text</title>
		<author>
			<persName><forename type="first">S</forename><surname>Soderland</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mach. Learn</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">1-3</biblScope>
			<biblScope unit="page" from="233" to="272" />
			<date type="published" when="1999-02">February 1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Active Learning with Strong and Weak Views: A Case Study on Wrapper Induction</title>
		<author>
			<persName><forename type="first">I</forename><surname>Muslea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Minton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Knoblock</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI&apos;03 8th international joint conference on Artificial intelligence</title>
				<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="415" to="420" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Automatic wrappers for large scale web extraction</title>
		<author>
			<persName><forename type="first">N</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Soliman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proc. of the VLDB Endowment</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="219" to="230" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Learning to adapt web information extraction knowledge and discovering new attributes via a Bayesian approach</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lam</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="523" to="536" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
	<note>IEEE</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">From One Tree to a Forest : a Unified Solution for Structured Web Data Extraction</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Hao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR 2011</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="775" to="784" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Automatic information extraction from large websites</title>
		<author>
			<persName><forename type="first">V</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mecca</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the ACM</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="731" to="779" />
			<date type="published" when="2004-09">September 2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Extracting structured data from web pages</title>
		<author>
			<persName><forename type="first">A</forename><surname>Arasu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 2003 ACM SIGMOD international conference on Management of data</title>
				<meeting>of the 2003 ACM SIGMOD international conference on Management of data</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="337" to="348" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Affective graphs: The visual appeal of linked data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mazumdar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Petrelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Elbedweihy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Lanfranchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ciravegna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Semantic Web-Interoperability, Usability, Applicability</title>
				<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2013">2014. 2013</date>
		</imprint>
	</monogr>
	<note>to appear</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
