<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IRIS: A Protege Plug-in to Extract and Serialize Product Attribute Name-Value Pairs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tugba O zacar</string-name>
          <email>tugba.ozacar@cbu.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Engineering, Celal Bayar University Muradiye</institution>
          ,
          <addr-line>45140, Manisa</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article introduces the IRIS wrapper, developed as a Protege plug-in, to solve an increasingly important problem: extracting information from the product descriptions provided by online sources and structuring this information so that it is sharable among business entities, software agents and search engines. Extracted product information is presented in a GoodRelations-compliant ontology. IRIS also automatically marks up products using RDFa or Microdata. Creating GoodRelations snippets in RDFa or Microdata from product information extracted from the Web adds business value, especially since most popular search engines recommend these standards for providing rich site data to their index.</p>
      </abstract>
      <kwd-group>
        <kwd>product</kwd>
        <kwd>GoodRelations</kwd>
        <kwd>Protege</kwd>
        <kwd>RDFa</kwd>
        <kwd>Microdata</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Web contains a huge number of online sources which provide
excellent resources for product information, including specifications and
descriptions of products. If we present this product information in a
structured way, it will significantly improve the effectiveness of many
applications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This paper introduces the IRIS wrapper to solve an increasingly
important problem: extracting information from the product
descriptions provided by online sources and structuring this information so that
it is sharable among business entities, software agents and search engines.
Information extraction systems can be divided into three categories
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: (a) Procedural Wrapper: This approach is based on writing customized
wrappers for accessing the required data from a given set of information
sources. The extraction rules are coded into the program. Such
wrappers are easier to create and can directly output the domain data model of the
application, but each wrapper works only for an individual page. (b) Declarative
Wrapper: These systems consist of a general execution engine and
declarative extraction rules developed for specific data sources. The wrapper
takes an input specification that declaratively states where the data of
interest is located in the HTML document, and how the data should be
wrapped into a new data model. (c) Automatic Wrapper: The automatic
extraction approach uses machine learning techniques to learn extraction
rules from examples. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] information extraction systems are classified
into two groups: solutions treating Web pages as a tree, and solutions
treating Web pages as a data stream. Systems are also divided, with respect to
the level of automation of wrapper creation, into manual, semi-automatic
and automatic. IRIS is a declarative and manual tree wrapper<sup>1</sup>, which
has a general rule engine that executes the rules specified in a template
file using XML Path Language (XPath). Manual approaches are known
to be tedious and time-consuming, and to require some level of expertise
concerning the wrapper language [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, manual and semi-automatic
approaches are currently better suited for creating robust wrappers than
the automatic approach. Writing an IRIS template is considerably easier
than writing a wrapper for most of the existing manual systems. Besides, it can be expected
that users of the IRIS engine will share templates on the Web, which would
improve reusability and efficiency.
      </p>
      <p>
        Two existing works directly address the problem studied in this paper. The first [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
uses a template-independent approach to extract product attribute name
and value pairs from the Web. This approach relies on hypotheses to identify
the specification block, but since some product detail pages violate
these hypotheses, the pairs in those pages cannot be extracted properly.
The second work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] needs two predefined ontologies to extract product
attribute name and value pairs from a Web page. One of these ontologies
is built according to the contents of the page, and it is not an easy task
to rebuild that ontology from scratch for every change in the page content.
The system presented in this paper differs from the above works in several
ways.
      </p>
      <p>
        First of all, the system transforms the extracted information into an
ontology in order to share and reuse a common understanding of the structure of
information among users or software agents. To my knowledge [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], IRIS is the
first Protege plug-in that is used to extract product information from
Web pages. Designed as a plug-in for the open source ontology editor
Protege, IRIS exploits the advantages of the ontology as a formal model
for the domain knowledge and profits from the benefits of a large user
community (currently 230,914 registered users).
      </p>
      <p>
        Another feature is support for building an ontology that is compatible
with GoodRelations Vocabulary [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which is the most powerful
vocabulary for publishing all of the details of your products and services in
a way friendly to search engines, mobile applications, and browser
extensions. The goal is to have extremely deep information on millions of
products, providing a resource that can be plugged into any e-commerce
system without limitation. If you have GoodRelations in your markup,
Google, Bing, Yahoo, and Yandex improve, or plan to improve, the rendering
of your page directly in the search results. Besides, you provide
information to the search engines so that they can rank your page higher for queries
to which your offer is a particularly relevant match. Finally, as an open
source Java application, IRIS can be further extended, fixed or modified
according to the needs of individual users.
      </p>
      <p>The following section (with three subsections) presents the system's
features and a scenario-based quick-start guide. Section 3 concludes the
paper with a brief discussion of possible future work.</p>
      <p><sup>1</sup> Download link: https://github.com/tugbaozacar/iris</p>
    </sec>
    <sec id="sec-2">
      <title>Scenario-based System Specification</title>
      <p>The IRIS system gathers semi-structured product information from an HTML
page, applies the extraction rules specified in the template file, and presents
the extracted product data in an ontology that is compatible with the
GoodRelations Vocabulary. The HTML page is first parsed into a DOM tree using
HtmlUnit, which is a Web Driver that supports walking the DOM model
of the HTML document using XPath queries. To get product
information from a Web page, the template file includes a tree that specifies
the paths of the HTML tags around the product attribute names and
product attribute values. Figure 1 briefly shows the architecture of the system.
The user builds a template for the pages containing the product information.</p>
      <p>
        Then the HtmlUnit library parses the Web pages. The system evaluates the
nodes in the template and queries HtmlUnit for the required product
properties. At the end of this process, the system returns a list of product
objects. To define a GoodRelations-compliant ontology, the user maps the
product properties to the properties of the "gr:Individual" class, saves
the ontology and serializes it into one of a series of structured data
markup standards. The system performs the serialization via the RDF Translator
API [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Each step is described in the following subsections.
      </p>
      <p>
        The information collected is mapped to the attributes of the Product
object, including title, description, brand, id, image, features, property
names, property values and components. A template has two parts. The
first part contains the tree that specifies the paths of the HTML tags around
the product attribute names and values. The second part specifies how
the HTML documents should be acquired. The product information is
extracted using the tree. The tree is created manually and its nodes
are converted to XPath expressions. HtmlUnit evaluates the specified
XPath expressions and returns the matching elements. Figure 2 shows
the example HTML code which contains the information about the first
product in the "amazon.com" pages that list laptops.
Figure 3 shows the tree which is built for extracting product information
from the page in Figure 2.
      </p>
      <p>The leaf nodes of the tree (Figure 3) contain the HTML tag around a
product attribute name or a product attribute value, and the internal
nodes of the tree contain the HTML tags in which the HTML tag in the
leaf node is nested. Therefore the hierarchy of the tree also represents the
hierarchy of the HTML tags. c1 contains the value of the title attribute,
c2 contains the image link of the product, and c3 is one of the internal
nodes that specify the path to its leaf nodes. c3 specifies that all of its
children contain HTML tags which are nested within the h3 heading
tag having the class name "newaps". Its child node (c4) specifies the HTML
link element which leads to another Web page that contains detailed
information about the product. The starting Web page is referred to as the
root page, and the pages navigated to from the root page are child pages. After
jumping to the page address specified by c4, product properties and their
values are collected from this Web page, which is shown in Figure 4.</p>
      <p>The properties and their corresponding values are stored in an HTML
table, which is nested in an HTML division identified by the "prodDetails"
id. Therefore c5 specifies this HTML division, and its child nodes c6 and
c7 specify the HTML cells containing product properties and their
values. After determining the HTML elements which contain the product
information, the user defines these elements in the template.
Each node in the tree is a combination of the following fields:</p>
      <p>SELECT-ATTR-VALUE These three fields are used to build the
XPath query that specifies the HTML element in the page.</p>
      <p>ORDER is used when more than one HTML element matches
the expression. The numeric value of the ORDER element specifies
which element will be selected.</p>
      <p>GETMETHOD is used to collect the proper values from the selected
HTML element e. If you want to get the textual representation of the
element e, in other words what would be visible if the page were shown in
a Web browser, you set the value of the GETMETHOD field to "asText".
Otherwise you get the value of an attribute of e by specifying the
name of that attribute as the value of the GETMETHOD field.</p>
      <p>AS is only used with leaf nodes. The value collected from a leaf node
using the GETMETHOD field is mapped to the Product attribute specified
in the AS field.</p>
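      <p>The way these fields combine into a query can be illustrated with a minimal sketch. It is not the IRIS implementation: it uses Python's standard xml.etree.ElementTree (which supports only a subset of XPath) in place of HtmlUnit, and the TemplateNode class, the to_xpath and extract helpers, the sample page, and the assumption that ORDER is 1-based are all hypothetical choices made for illustration.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemplateNode:
    """Hypothetical stand-in for one tree node of a template."""
    select: str                      # SELECT: tag name
    attr: Optional[str] = None       # ATTR: attribute name
    value: Optional[str] = None      # VALUE: required attribute value
    order: int = 1                   # ORDER: assumed to be 1-based
    getmethod: Optional[str] = None  # GETMETHOD: "asText" or an attribute name
    as_: Optional[str] = None        # AS: target Product attribute (leaf nodes)


def to_xpath(node):
    """Build an XPath expression from SELECT, ATTR and VALUE."""
    if node.attr is None:
        return f".//{node.select}"
    if node.value is None:
        return f".//{node.select}[@{node.attr}]"
    return f".//{node.select}[@{node.attr}='{node.value}']"


def extract(context, node):
    """Pick the ORDER-th match and read it as GETMETHOD prescribes."""
    matches = context.findall(to_xpath(node))
    if len(matches) < node.order:
        return None
    element = matches[node.order - 1]
    if node.getmethod == "asText":
        # Visible text of the element, including nested elements.
        return "".join(element.itertext()).strip()
    return element.get(node.getmethod)


# Made-up, well-formed sample page (not the markup of any real site).
PAGE = """
<div id="result">
  <span class="lrg bold">Example Laptop 15</span>
  <img src="http://example.com/laptop.jpg"/>
  <span class="lrg bold">Another Laptop 13</span>
</div>
"""

root = ET.fromstring(PAGE)
title = extract(root, TemplateNode("span", "class", "lrg bold",
                                   getmethod="asText", as_="product.title"))
image = extract(root, TemplateNode("img", "src",
                                   getmethod="src", as_="product.imgLink"))
second = extract(root, TemplateNode("span", "class", "lrg bold",
                                    order=2, getmethod="asText"))
print(title, image, second)
```
]]></preformat>
      <p>In this sketch a node without a VALUE field only requires the attribute to be present, which is one plausible reading of a node such as the image node that gives SELECT and ATTR but no VALUE.</p>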
      <p>Appendix A gives the template (amazon.txt), which contains the code
of the tree in Figure 3. The second part of a template file contains the
information on how the HTML documents should be acquired. This part
has the following fields:</p>
      <p>NEXT PAGE stores the link to the next page. The information about laptops
on "amazon.com" is spread across 400 pages.</p>
      <p>PAGE RANGE specifies the number of the page or the range of pages
from which you want to collect information. In my example, I want to
collect the products on pages 1 to 3.</p>
      <p>BASE URI represents the base URI of the site. In my example, the
value of this field is http://www.amazon.com.</p>
      <p>PAGE URI is the URI of the first page from which you want to collect
information. In my example, this is the URI of page 1.</p>
      <p>CLASS contains the name of the class that represents the products to
be collected. In my example, the "Laptop" class is used.</p>
      <p><bold>2.2 Create an Ontology that is Compatible with GoodRelations Vocabulary</bold></p>
      <p>First, the user opens an empty ontology ("myOwl.owl") in the Protege
Ontology Editor and displays the IRIS tab, which is listed on the
TabWidgets panel. Then the user selects the template file using the "Open template"
button in Figure 5 (for this example: amazon.txt). The tool then imports
all laptops from the "amazon.com" pages specified in the PAGE RANGE
field. The imported individuals are listed in the "Individuals Window"
(Figure 5). The "Properties Window" lists all properties of the
individuals in the "Individuals Window".</p>
      <p>
        In this section, I follow the descriptions and examples introduced in the
GoodRelations Primer [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. First of all, the system defines the class in
your template (the "Laptop" class in the example) as a subclass of the "gr:Individual"
class of the GoodRelations vocabulary. Then the properties of the
"Laptop" class, which are collected from the Web page, should be mapped to
the properties of "gr:Individual", which can be classified as follows:
      </p>
      <p>
        First category: "gr:category", "gr:color", "gr:condition", etc. (see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
for the full list). If a property px is semantically equivalent to a property
py from the first category, then the user simply maps px to py.
      </p>
      <p>Second category: Properties that specify quantitative characteristics,
for which an interval is at least theoretically an appropriate value, should
be defined as subproperties of "gr:quantitativeProductOrServiceProperty".
A quantitative value is to be interpreted in combination with the
respective unit of measurement, and most quantitative values are intervals.</p>
      <p>Third category: All properties for which value instances are specified
are subproperties of "gr:qualitativeProductOrServiceProperty".</p>
      <p>Fourth category: Only properties that are not quantitative
properties and that have no predefined value instances are defined as
subproperties of "gr:datatypeProductOrServiceProperty".</p>
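      <p>The four categories amount to a small decision procedure, sketched below. This is only an illustration of the rules as stated in the text, not code from IRIS: the function name gr_mapping, its parameters, and the returned tuples are hypothetical, and GR_DIRECT contains only the three first-category properties named above rather than the full list.</p>
      <preformat><![CDATA[
```python
# First-category properties named in the text; the full list is in the primer.
GR_DIRECT = {"gr:category", "gr:color", "gr:condition"}

QUANTITATIVE = "gr:quantitativeProductOrServiceProperty"
QUALITATIVE = "gr:qualitativeProductOrServiceProperty"
DATATYPE = "gr:datatypeProductOrServiceProperty"


def gr_mapping(equivalent_gr_property=None,
               is_quantitative=False,
               has_value_instances=False):
    """Decide how an extracted property relates to GoodRelations.

    Returns ("map-to", property) for the first category and
    ("subPropertyOf", super_property) for the other three.
    """
    if equivalent_gr_property in GR_DIRECT:
        return ("map-to", equivalent_gr_property)       # first category
    if is_quantitative:
        return ("subPropertyOf", QUANTITATIVE)          # second category
    if has_value_instances:
        return ("subPropertyOf", QUALITATIVE)           # third category
    return ("subPropertyOf", DATATYPE)                  # fourth category


# Hypothetical laptop properties, classified by hand for the example.
print(gr_mapping(equivalent_gr_property="gr:color"))
print(gr_mapping(is_quantitative=True))       # e.g. a screen size with a unit
print(gr_mapping(has_value_instances=True))   # e.g. an enumerated value set
print(gr_mapping())                           # e.g. a free-text model name
```
]]></preformat>
      <p>The order of the checks mirrors the order of the categories: the fourth category is defined by exclusion, so it is the fall-through case.</p>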
      <p>To create a GoodRelations-compliant ontology, the user selects the
individuals and properties that will reside in the ontology. Then she clicks the
"Use GoodRelations Vocabulary" button (Figure 5) and the "Use
GoodRelations Vocabulary" wizard appears. She selects the corresponding
GoodRelations property type and the respective unit of measurement.
The user saves the ontology in an OWL file and clicks the "Export to a
serialization format" button (Figure 5) to view the ontology in one of the
structured data markup standards.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>
        This work introduces a Protege plug-in called IRIS that collects product
information from the Web and transforms this information into
GoodRelations snippets in RDFa or Microdata. The system attempts to solve
an increasingly important problem: extracting useful information from
the product descriptions provided by sellers and structuring this
information into a common format that is sharable among business entities,
software agents and search engines. I plan to improve the IRIS plug-in
with an extension that takes user queries and sends them to the Semantics3
API [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which is a direct replacement for Google's Shopping API and
gives developers comprehensive access to data across millions of products
and prices. Another potential direction is an environment
for semi-automatic template construction. An environment that
automatically constructs the tree nodes from selected HTML parts would
significantly reduce the time needed to build a template file. A further
direction is to diversify the supported input formats (PDF, Excel, CSV, etc.).
      </p>
      <p>Appendix A: the template file (amazon.txt)</p>
      <preformat>SELECT=(div), ATTR=(id), VALUE=(result) [
  SELECT=(span), ATTR=(class), VALUE=(lrg bold), GETMETHOD=(asText), AS=(product.title);
  SELECT=(img), ATTR=(src), GETMETHOD=(Src), AS=(product.imgLink);
  SELECT=(h3), ATTR=(class), VALUE=(newaps) [
    SELECT=(a), ATTR=(href), GETMETHOD=(href) [
      SELECT=(div), ATTR=(id), VALUE=(prodDetails) [
        SELECT=(td), ATTR=(class), VALUE=(label), GETMETHOD=(asText), AS=(product.propertyName);
        SELECT=(td), ATTR=(class), VALUE=(value), GETMETHOD=(asText), AS=(product.propertyValue) ] ] ] ]
NEXT PAGE: {SELECT=(a), ATTR=(id), VALUE=(pagnNextLink), GETMETHOD=(href)}
PAGE RANGE: {1 3}
BASE URI: {http://www.amazon.com}
PAGE URI: {http://www.amazon.com/s/ref=sr_nr_n_1?rh=n%3A565108%2Ck%3Alaptop&amp;keywords=laptop&amp;ie=UTF8&amp;qid=1374832151&amp;rnid=2941120011}
CLASS: {Laptop}</preformat>
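      <p>To make the structure of the template in Appendix A concrete, the following sketch parses its two parts. It is an illustration under stated assumptions, not the parser used by IRIS: the grammar (fields written as NAME=(value), children enclosed in square brackets, siblings separated by semicolons, acquisition fields written as NAME: {value}) is inferred from the example template, GETMETHOD=(asText) is written with its closing parenthesis, and the names tokenize, parse_nodes, leaf_targets and parse_settings are hypothetical.</p>
      <preformat><![CDATA[
```python
import re

# A field such as SELECT=(div), or one of the structural characters [ ] ;
TOKEN = re.compile(r"(?P<field>[A-Z]+)\s*=\s*\((?P<value>[^)]*)\)|(?P<punct>[\[\];])")


def tokenize(text):
    """Turn template text into (name, value) tuples and punctuation strings."""
    tokens = []
    for m in TOKEN.finditer(text):
        if m.group("punct"):
            tokens.append(m.group("punct"))
        else:
            tokens.append((m.group("field"), m.group("value").strip()))
    return tokens


def parse_nodes(tokens, pos=0):
    """Parse sibling nodes until a closing bracket or the end of input."""
    nodes = []
    while pos < len(tokens) and tokens[pos] != "]":
        node = {"children": []}
        while pos < len(tokens) and isinstance(tokens[pos], tuple):
            name, value = tokens[pos]
            node[name] = value
            pos += 1
        if pos < len(tokens) and tokens[pos] == "[":
            node["children"], pos = parse_nodes(tokens, pos + 1)
            if pos >= len(tokens) or tokens[pos] != "]":
                raise ValueError("unbalanced '[' in template")
            pos += 1  # consume the closing bracket
        if pos < len(tokens) and tokens[pos] == ";":
            pos += 1  # consume the sibling separator
        nodes.append(node)
    return nodes, pos


def leaf_targets(nodes):
    """Collect the AS values of all nodes in document order."""
    found = []
    for node in nodes:
        if "AS" in node:
            found.append(node["AS"])
        found.extend(leaf_targets(node["children"]))
    return found


def parse_settings(text):
    """Read acquisition fields of the form NAME: {value}."""
    pairs = re.findall(r"([A-Z]+(?: [A-Z]+)*):\s*\{([^}]*)\}", text)
    return {name: value.strip() for name, value in pairs}


TREE = """
SELECT=(div), ATTR=(id), VALUE=(result) [
  SELECT=(span), ATTR=(class), VALUE=(lrg bold), GETMETHOD=(asText), AS=(product.title);
  SELECT=(img), ATTR=(src), GETMETHOD=(Src), AS=(product.imgLink);
  SELECT=(h3), ATTR=(class), VALUE=(newaps) [
    SELECT=(a), ATTR=(href), GETMETHOD=(href) [
      SELECT=(div), ATTR=(id), VALUE=(prodDetails) [
        SELECT=(td), ATTR=(class), VALUE=(label), GETMETHOD=(asText), AS=(product.propertyName);
        SELECT=(td), ATTR=(class), VALUE=(value), GETMETHOD=(asText), AS=(product.propertyValue) ] ] ] ]
"""

SETTINGS = """
PAGE RANGE: {1 3}
BASE URI: {http://www.amazon.com}
CLASS: {Laptop}
"""

roots, _ = parse_nodes(tokenize(TREE))
settings = parse_settings(SETTINGS)
first, last = map(int, settings["PAGE RANGE"].split())
pages = list(range(first, last + 1))  # "pages from 1 to 3" in the text

print([child["SELECT"] for child in roots[0]["children"]])
print(leaf_targets(roots))
print(settings["CLASS"], pages)
```
]]></preformat>
      <p>Each parsed node is a plain dictionary keyed by field name, so the nesting of the dictionaries mirrors the nesting of the HTML tags described in Section 2.</p>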
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>Y.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Q.M.</given-names>
          </string-name>
          :
          <article-title>Simultaneous product attribute name and value extraction with adaptively learnt templates</article-title>
          .
          <source>In: Proceedings of CSSS '12</source>
          . (
          <year>2012</year>
          )
          <fpage>2021</fpage>
          –
          <lpage>2025</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Design of Web Semantic Integration System</article-title>
          .
          <source>PhD thesis</source>
          , Tennessee State University. (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Firat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Information Integration Using Contextual Knowledge and Ontology Merging</article-title>
          .
          <source>PhD thesis</source>
          , MIT, Sloan School of Management (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Muslea</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A hierarchical approach to wrapper induction</article-title>
          , ACM Press (
          <year>1999</year>
          )
          <fpage>190</fpage>
          –
          <lpage>197</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Simultaneous product attribute name and value extraction from web pages</article-title>
          .
          <source>In: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference</source>
          , IEEE Computer Society (
          <year>2009</year>
          )
          <fpage>295</fpage>
          –
          <lpage>298</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Holzinger</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kruepl</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herzog</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Using ontologies for extracting product features from web pages</article-title>
          .
          <source>In: Proceedings of the ISWC'06</source>
          , Springer-Verlag (
          <year>2006</year>
          )
          <fpage>286</fpage>
          –
          <lpage>299</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <source>Protege plug-in library</source>
          . Last accessed:
          <year>2013</year>
          -09-24.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hepp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>GoodRelations: An ontology for describing products and services offers on the web</article-title>
          .
          <source>EKAW '08</source>
          (
          <year>2008</year>
          )
          <fpage>329</fpage>
          –
          <lpage>346</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Stolz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hepp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Rdf translator: A restful multiformat data converter for the semantic web</article-title>
          .
          <source>Technical report</source>
          , E-Business and Web Science Research Group (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hepp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>GoodRelations: An ontology for describing web offers – primer and user's guide</article-title>
          .
          <source>Technical report</source>
          , E-Business + Web Science Research Group (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          Semantics3 Inc.:
          <article-title>Semantics3 - APIs for products and prices</article-title>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>