<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">The Use of Ontologies in Wrapper Induction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Marek</forename><surname>Nekvasil</surname></persName>
							<email>nekvasim@vse.cz</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Information and Knowledge Engineering</orgName>
								<orgName type="institution">University of Economics</orgName>
								<address>
									<addrLine>Winston Churchill Sq. 4, 130 67</addrLine>
									<settlement>Prague 3</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">The Use of Ontologies in Wrapper Induction</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">15951547D03A391A88A69ED31DEB683C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>ontology</term>
					<term>automatic annotation</term>
					<term>information extraction</term>
					<term>wrapper</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper introduces an extension of ontologies that allows them to be used for automated information extraction from web documents. The major part of it is dedicated to proposing and deriving an inference model for evaluating pattern matches and their combinations. We then propose a simple, naive method of wrapper induction that is able to use the results of the first part.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>One of the simplest alternatives to the manual handling of web documents is the use of wrappers, i.e. sets of rules that identify the desired values in a document. These rules can be created either by hand or automatically; the latter case is called wrapper induction <ref type="bibr" target="#b2">[3]</ref>. Automatic wrapper induction in principle requires examples of the real data to be extracted along with their respective context, which together basically form an annotated document. The purpose of this paper is to propose a concept by which documents can be annotated automatically with the use of ontologies. We will call any ontology designed for this use in information extraction an extraction ontology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Extending an OWL ontology</head><p>An ontology written in plain OWL has very limited means of describing the possible values of datatype properties (properties whose value is a literal) and therefore does not contain enough information to identify these values inside a document. To remove this insufficiency we introduce an extension that enables us to attach a pattern of typical values to each datatype property. We call a pattern any general rule for which it is possible to determine, on any continuous part of a document (in terms of a string of characters or words), to what extent the rule is satisfied. An example of a pattern is a rule that evaluates whether a given part of a document is a number from a given range or a string from a given list. From now on we will distinguish between two kinds of patterns. First, there are simple, or rather atomic, patterns, each formed by a single rule that can be directly evaluated on a part of a document. The aforementioned patterns that match a number or a string are examples of atomic patterns. Second, there are composite patterns, which we define as combinations of rules that can be evaluated on a given continuous part of a document as a whole. A composite pattern is thus always a combination of other patterns, both atomic and composite. Composite patterns can be combined hierarchically, which can matter significantly in some specific situations. As it makes sense to assign only one pattern to each datatype property, multiple patterns have to be joined by including them in a composite pattern. Every part of a document that matches a given pattern, i.e. on which the pattern rule evaluates positively, will be considered a suspected partial candidate for the occurrence of the value of the datatype property the pattern is assigned to.
If the pattern is the one assigned directly to the datatype property, every matching part of the document will be considered a suspected candidate for the occurrence of the value.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Atomic Patterns</head><p>When evaluating the match of an atomic pattern we encounter the problem of deriving the degree of certainty with which a candidate is marked. We regard a pattern as an algorithm that can tell, for every place in the document, to what extent its rule is satisfied depending on its parameters. We distinguish two terms here: the degree of pattern match, which represents the certainty with which the pattern's algorithm marked the given place in the document, and the degree of certainty of marking the partial candidate for the value of a certain datatype property, which represents the certainty that the given place in the document really is an occurrence of the value, given solely the observation of this single pattern and independently of any other patterns. We will call the marking of a partial candidate the pattern evidence, so the second term is equivalent to the degree of pattern evidence. The degree of pattern match and the degree of pattern evidence should intuitively correspond. If we denote the pattern match as A and the pattern evidence as E, we can write down this inference rule:</p><formula xml:id="formula_0">A → E (1)</formula><p>For our purposes we have chosen a fuzzy logic inference model <ref type="bibr" target="#b1">[2]</ref>, but it should be possible to use any other. In fuzzy logic we can define A and E as the propositions</p><p>• A - "The pattern has marked the given place in the document."</p><p>• E - "The marked place really is a pattern evidence."</p><p>and the corresponding degrees as their truth values (i.e. the degree of pattern match a = val(A) and the degree of pattern evidence e = val(E)).
We also introduce two universal parameters for every pattern, namely precision and cover, and define them as:</p><formula xml:id="formula_1">p = val(A → E), pattern precision (2); c = val(E → A), pattern cover (3)</formula><p>Using these parameters we can derive the following form of the inference rule in Łukasiewicz fuzzy logic (the detailed derivation is available in <ref type="bibr" target="#b5">[6]</ref>):</p><formula xml:id="formula_2">((A &amp; (A → E)) ∨ ¬(E → A)) ⇒ E (4)</formula><p>If we substitute the definitions of p and c and do not overestimate the degree of pattern evidence e, we get: e = max(a + p - 1, 1 - c) (5). The parameter p lets us set an upper limit on the degree of a given pattern evidence. In this context the pattern precision denotes the certainty with which a high degree of pattern match leads to a high degree of pattern evidence. The parameter c sets the lowest possible value of the degree of pattern evidence, namely 1 - c.</p></div>
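<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration, equation (5) can be evaluated directly. The following is only a minimal sketch; the function names pattern_evidence and check_degree and the range check are our assumptions for the sake of the example and are not part of the proposed notation.</p>
<p>
```python
def check_degree(name: str, value: float) -> None:
    # A truth value must lie in the unit interval [0, 1];
    # clamping an in-range value leaves it unchanged.
    if min(1.0, max(0.0, value)) != value:
        raise ValueError(f"{name} must lie in [0, 1], got {value}")


def pattern_evidence(a: float, p: float, c: float) -> float:
    """Degree of pattern evidence e = max(a + p - 1, 1 - c), equation (5).

    a: degree of pattern match, p: pattern precision, c: pattern cover.
    """
    for name, value in (("a", a), ("p", p), ("c", c)):
        check_degree(name, value)
    return max(a + p - 1.0, 1.0 - c)


# Hypothetical parameter values, chosen only for illustration.
print(pattern_evidence(1.0, 0.75, 0.5))  # full match: e is capped by p
print(pattern_evidence(0.0, 0.75, 0.5))  # no match: e falls back to 1 - c
```
</p>
<p>With a = 1 the result equals p, and with a = 0 it equals 1 - c, which matches the roles of the two parameters as the upper and the lower limit of the degree of pattern evidence.</p></div>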
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Composite patterns</head><p>If we want to combine the evidence of multiple patterns, the task takes the form of a set operation. Only two set operations come into play for this purpose, namely union and intersection, which in combination with the complement operation can form any other set operation. Determining the degree of composite pattern evidence is then straightforward. The degree of composite pattern match is determined by simply combining the degrees of evidence of the partial patterns with the appropriate logical operation; thus there will be conjoint patterns and disjoint patterns (and possibly negating patterns). From the pattern match defined in this way we get the composite pattern evidence in the same way as we did for atomic patterns. It is interesting to discuss the meaning of the precision and cover parameters of the partial patterns with respect to the different kinds of composite patterns. In a disjoint composite pattern a high value of p implies that the partial pattern is a sufficient condition, and in a conjoint composite pattern a high value of c means that the partial pattern forms a necessary condition.</p></div>
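<div xmlns="http://www.tei-c.org/ns/1.0"><p>The combination of partial patterns can be sketched as follows. The text does not fix the concrete logical operations, so this sketch assumes the minimum for conjoint patterns, the maximum for disjoint patterns and 1 - x for negating patterns; these choices, as well as all function names and parameter values, are assumptions made for illustration only.</p>
<p>
```python
def evidence(a: float, p: float, c: float) -> float:
    # Degree of pattern evidence, e = max(a + p - 1, 1 - c).
    return max(a + p - 1.0, 1.0 - c)


def conjoint(partials: list, p: float, c: float) -> float:
    # Assumed conjunction: the composite match is the minimum
    # of the degrees of evidence of the partial patterns.
    return evidence(min(partials), p, c)


def disjoint(partials: list, p: float, c: float) -> float:
    # Assumed disjunction: the composite match is the maximum
    # of the degrees of evidence of the partial patterns.
    return evidence(max(partials), p, c)


def negating(partial: float, p: float, c: float) -> float:
    # Assumed negation: the composite match is 1 - x.
    return evidence(1.0 - partial, p, c)


# Two hypothetical atomic patterns evaluated on the same part of a document.
number_e = evidence(1.0, 0.75, 0.5)    # 0.75
keyword_e = evidence(0.5, 1.0, 0.75)   # 0.5

both = conjoint([number_e, keyword_e], 1.0, 1.0)    # 0.5
either = disjoint([number_e, keyword_e], 1.0, 1.0)  # 0.75
# Composite patterns can be nested hierarchically.
nested = disjoint([both, negating(number_e, 1.0, 1.0)], 1.0, 1.0)  # 0.5
print(both, either, nested)
```
</p>
<p>Setting p = 1 and c = 1 on a composite pattern passes the combined match through unchanged, so any other setting expresses additional uncertainty about the combination itself.</p></div>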
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Designing the patterns</head><p>We will denote the patterns in the extraction ontology as XML elements from a special namespace, nested in the elements of the datatype properties. The patterns can range from ones distinguishing time values or named entities to ones that evaluate the context or format of the document. A few basic patterns are proposed in <ref type="bibr" target="#b5">[6]</ref>; however, many others are possible. When designing a new pattern it is necessary to keep in mind the way it is evaluated and to think carefully about the possibilities of combining it with other existing patterns.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">A simple wrapper induction method</head><p>By applying the rules of patterns to the content of a document we get, for every datatype property in the extraction ontology that has a pattern assigned, a set of evidences along with their certainty degrees. If we rely on a tabular structure of the data, we can try to separate the evidences into a few segments according to the resemblance of their XPaths. We can purge the sets of evidences if we realize that the precision parameter specifies the mean ratio of evidences that are marked correctly by the pattern. Therefore up to 1 - p of the evidences supplied by this pattern can be false, and hence we can remove that share of evidences, starting with the worst segments. If the data are stored in a tabular structure, the relevant parts of text are generally contained in the same structure of elements, which does not change throughout the segment. On the level of XPath expressions this shows up as a single changing index in the absolute path; by omitting it we get a set of elements that would ideally all contain the value of the respective property. The cover parameter is the mean ratio of the evidences that the pattern identifies to the total real number of occurrences of the respective property. Since the generalized XPath expression should identify all possible occurrences of the property, we can calculate the proportion of evidences of the pattern to this "complete" set; its difference from the parameter c represents the error caused by generalizing the paths. Based on the number of evidences in the segments and the respective absolute XPaths we can match the corresponding segments of different properties to each other and form the instances of the extracted class. To sum up, this approach is just a simple method and has many limitations.
Besides the fact that it can extract only properties with cardinality 1 (the tabular structure), it is also limited in its tolerance to irregularities in the structure of the document; on the other hand, it is rather resistant to irregularities in the extracted values.</p></div>
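<div xmlns="http://www.tei-c.org/ns/1.0"><p>The generalization of XPaths described above can be sketched as follows. This is only one possible reading of the method: the function names, the example paths and the rule that a segment needs at least two evidences are our assumptions, and the number of elements selected by a generalized path is supplied by hand here instead of being computed from a real document.</p>
<p>
```python
import re
from collections import defaultdict

# A positional index of an XPath step, e.g. "[3]" in "tr[3]".
STEP_INDEX = re.compile(r"\[\d+\]")


def generalizations(xpath: str):
    """Yield every variant of xpath with exactly one index omitted."""
    for match in STEP_INDEX.finditer(xpath):
        yield xpath[:match.start()] + xpath[match.end():]


def segment_by_xpath(xpaths: list) -> dict:
    """Group absolute XPaths that differ in a single changing index."""
    groups = defaultdict(set)
    for xpath in xpaths:
        for general in generalizations(xpath):
            groups[general].add(xpath)
    # Assumption: a segment consists of at least two evidences.
    return {g: sorted(m) for g, m in groups.items() if len(m) >= 2}


def generalization_error(n_evidences: int, n_selected: int, c: float) -> float:
    """Difference between the observed proportion of evidences and cover c."""
    return n_evidences / n_selected - c


# Hypothetical absolute XPaths of the evidences of one pattern.
evidence_paths = [
    "/html/body/table/tr[1]/td[2]",
    "/html/body/table/tr[2]/td[2]",
    "/html/body/table/tr[3]/td[2]",
    "/html/body/div[1]/p[4]",
]
segments = segment_by_xpath(evidence_paths)
print(segments)
# Assume the generalized path selects 4 cells in the document and c = 0.75.
print(generalization_error(3, 4, 0.75))
```
</p>
<p>In this example the three table cells form one segment under the generalized path /html/body/table/tr/td[2], while the isolated paragraph does not belong to any segment and would be a candidate for removal.</p></div>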
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion and future work</head><p>The proposed method of pattern notation allows hierarchical combination of partial patterns and is open to the design of additional patterns according to one's needs. A similar approach is taken by <ref type="bibr" target="#b3">[4]</ref> and <ref type="bibr" target="#b4">[5]</ref>; however, unlike them we do not design proprietary ontology formats but try to start from the OWL standard. The limitation of the proposed wrapper induction method is that it relies on the tabular structure of the extracted data; on the other hand, the extraction is completely automatic and, with a proper setting of the parameters, allows the extraction error to be estimated. Proposing a way of automatically learning the patterns, or at least their parameters, could be an interesting subject of future work.</p></div>		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>The research leading to this paper was supported by the European Commission under contract FP6-027026, Knowledge Space of semantic inference for automatic annotation and retrieval of multimedia content, K-Space.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">XPath-Wrapper Induction by generalizing tree traversal patterns</title>
		<author>
			<persName><forename type="first">T</forename><surname>Anton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">A Semantic Web Primer</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Antoniou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Van Harmelen</surname></persName>
		</editor>
		<meeting><address><addrLine>Cambridge MA</addrLine></address></meeting>
		<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Metamathematics of fuzzy logic</title>
		<author>
			<persName><forename type="first">P</forename><surname>Hájek</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>Kluwer</publisher>
			<pubPlace>Dordrecht</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Wrapper induction for information extraction</title>
		<author>
			<persName><forename type="first">N</forename><surname>Kushmerick</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
		<respStmt>
			<orgName>University of Washington</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">On the Design and Exploitation of Presentation Ontologies for Information Extraction</title>
		<author>
			<persName><forename type="first">M</forename><surname>Labský</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Svátek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ESWC&apos;06 Workshop on Mastering the Gap: From Information Extraction to Semantic Representation</title>
				<meeting><address><addrLine>Budva, Montenegro</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A Hierarchical Approach to Wrapper Induction</title>
		<author>
			<persName><forename type="first">I</forename><surname>Muslea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Minton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Knoblock</surname></persName>
		</author>
		<ptr target="http://www.isi.edu/~muslea/papers.html" />
	</analytic>
	<monogr>
		<title level="m">3rd Conference on Autonomous Agents</title>
				<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Využití ontologií při indukci wrapperů, diplomová práce, VŠE</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nekvasil</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<pubPlace>Praha</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
