<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Repeatable Web Data Extraction and Interlinking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>M. Kopecky</string-name>
          <email>kopecky@ksi.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Vomlelova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>P. Vojtas</string-name>
          <email>vojtas@ksi.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Physics, Charles University, Malostranske namesti 25</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Introduction</institution>
          ,
          <addr-line>Motivation, Recent Work</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1885</volume>
      <fpage>228</fpage>
      <lpage>234</lpage>
      <abstract>
        <p>We would like to make all web content usable in the same way as 5-star Linked (Open) Data. We face several challenges: either there is no Linked Open Data in the domain of interest, or the data project is no longer maintained, or something is broken (links, SPARQL endpoint, etc.). We propose a dynamic logic extension of the semantic model, in which data can also carry information about the process of their creation. We illustrate this on several movie datasets. In this work in progress we provide some preference learning experiments over extracted and integrated data.</p>
      </abstract>
      <kwd-group>
        <kwd>Repeatable experiments</kwd>
        <kwd>web data extraction</kwd>
        <kwd>annotation</kwd>
        <kwd>linking</kwd>
        <kwd>dynamic logic</kwd>
        <kwd>preference learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction, Motivation, Recent Work</title>
      <p>For our decisions we often need automated processing
of integrated web data. Linked (open) data are one
possibility to achieve this vision. Still, there are some
challenges.</p>
      <p>Production URLs are sometimes subject to
change ([A]). Data migrate, data projects run out of their
contracted sustainability period, and so on.</p>
      <p>SPARQL endpoints are expensive for the server, and not
always available for all datasets. Downloadable dumps are
expensive for clients, and do not allow live querying on the
Web ([V+]).</p>
      <p>In some areas there are no corresponding Linked Data
projects available at all. Imagine, e.g., a customer looking
for a car who would like to aggregate all relevant web data.
Our idea is to remember previous successful extractions in
a given domain and reuse them in the current situation.
Social networks can also help in evaluating previous
extractions. We first presented this idea in [PLEDVF],
where we concentrated on one specific purpose: extracting object
attributes and using these data in recommender systems. In
that research we tried to increase the
degree of automation of web content processing, and we
presented several methods for mining web information and
for assisted annotation.</p>
    </sec>
    <sec id="sec-2">
      <title>Semantic Annotator for (X)HTML</title>
      <p>A tool for assisted annotation is available in [F2].
Semantic Annotator allows both manual and assisted
annotation of web pages directly in Google Chrome. It
requires no complicated installation and is available on all
platforms and devices where Google Chrome can be
installed. Semantic annotation is thus available to all
Internet users, not only to the authors of a site. The browser
extension Semantic Annotator began as a prototype
implementation in the thesis [F1].</p>
      <p>The Google Chrome extension Semantic Annotator is used
for manual semantic annotation of web sites. The goal of
semantic annotation is to assign meaning to each part of the
page. The meaning of real-world objects, their
properties and their relationships is described in dictionaries,
either self-created or imported. The annotation process then
consists of selecting parts of the page and assigning them
a meaning from the dictionary.</p>
      <p>In Figure 1 we show an example of annotation of
product pages in an e-shop. The user selects the product name
on a web page and, from the dictionary describing products,
assigns it the meaning "product name". Then on the same
page the user selects the product price and gives it the
meaning "price". Because the pages are similar and
generated from templates, annotating a few pages is enough to train an
annotator for the whole site. Consequently, search over the
annotated website is much more accurate.</p>
    </sec>
    <sec id="sec-3">
      <title>Semantic Annotations for Texts</title>
      <p>In the previous section we described a tool for
annotation of (X)HTML pages. Such annotations are useful for
extracting attributes of products from pages created by the
same template, even if the texts differ slightly. Even if
there are no templates, texts in a fixed domain, such as reports on
traffic accidents, still contain repetitions.</p>
      <p>A tool for annotation of text was developed in our group
in the PhD thesis [D1]. The tool is available under [D2]. It is
built on GATE [BTMC], using UFAL dependency parsing
[UFAL] and an Inductive Logic Programming extension. It
is a tool for information extraction and subsequent
annotation.</p>
      <p>The thesis [D1] presents four relatively separate
topics, each representing one particular aspect of the
information extraction discipline. The first two topics
focus on new information extraction methods based on
deep language parsing: the first combines deep parsing with manually
designed extraction rules, and the second deals with
a method for automated induction of extraction rules using
inductive logic programming. The third topic of the thesis
combines information extraction with rule-based reasoning.
The core extraction method was experimentally
reimplemented using semantic web technologies, which
allows saving the extraction rules in so-called shareable
extraction ontologies that do not depend on the original
extraction tool. See Fig. 2 for an example on traffic accident data.</p>
      <p>The last topic of the thesis deals with document
classification and fuzzy logic. It examines the possibility of using
information obtained by information extraction techniques
for document classification. For more details see also
[PLEDVF].</p>
      <p>In this research proposal we concentrate on the synergy
of annotation and integration of data for user
preference learning, and consequently for recommendation.</p>
      <p>It turns out that similarity and dynamic aspects of web
data play a role here as well. We propose an appropriate
dynamic logic model of web data dynamics and provide
some experiments. We hope these can serve as preliminary
experiments for a more extended research proposal.</p>
      <sec id="sec-3-1">
        <title>Extraction Experiments</title>
        <p>Our main goal is to remember information about
the context of data creation in order to enable repetition of
the extraction. In general, we can remember algorithms,
training data and metrics of success.</p>
        <p>In this chapter we describe some extraction
algorithms and the process of data integration in the movie
domain. These data will be used in preference learning, which
illustrates our main goal.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Integration</title>
      <p>We use Flix data (enriched Netflix competition data),
RecSys 2014 challenge data [T] and RuleML Challenge
data [K].</p>
      <p>The datasets are quite different, but they still have a few
things in common. Movies have a title and usually also
the year of their production. Ratings carry a
timestamp, which allows us to order the ratings of individual
users chronologically.</p>
      <p>One of the problems was that different datasets use different
MOVIEIDs, so the movies cannot be straightforwardly
mapped across datasets. To make such a mapping possible, we wanted to
enhance every movie with the corresponding IMDb
identifier.</p>
      <p>We observed that the Twitter datasets use the numeric part of the IMDb
identifier as their internal MOVIEID. So the movie “Midnight Cowboy” with Twitter
MOVIEID = 64665 corresponds to the IMDb record with
ID equal to ’tt0064665’. Therefore, we constructed the
algorithm
<disp-formula>τ1: Twitter-MOVIEID → IMDbId</disp-formula>
which simply concatenates the prefix ’tt’ with the MOVIEID
left-padded by zeroes to seven positions. The
success rate of this algorithm is shown in Table 1.</p>
      <p>For the other datasets the IMDb identifier is looked up by a composition of algorithms σj(ηi()),
where the algorithm ηi transforms the movie title into the query
string needed for IMDb search, while the algorithm σj then
looks for the correct record in the returned table. The simplest
implementation of these algorithms can be denoted as follows:
<disp-formula>η1: TITLE → TABLE</disp-formula>
<disp-formula>σ1: TABLE × TITLE × YEAR → TT</disp-formula>
where the η1 algorithm concatenates all words of the title with
the plus sign and the σ1 algorithm returns TT in case the
resulting table contains exactly one record. The results of
this combination of algorithms are shown in Table 2 (ratio
of correct answers).</p>
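      <p>The three simplest algorithms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the representation of the result table as a list of dictionaries with a "tt" key, and the fact that eta1 returns the query string without fetching the table are our assumptions, not the implementation used in the experiments.</p>
      <preformat><![CDATA[
```python
from typing import Optional, Sequence


def tau1(movie_id: int) -> str:
    """Twitter MOVIEID -> IMDb identifier: 'tt' + id left-padded to 7 digits."""
    return "tt" + str(movie_id).zfill(7)


def eta1(title: str) -> str:
    """Join all words of the title with the plus sign (search query string)."""
    return "+".join(title.split())


def sigma1(table: Sequence[dict], title: str, year: Optional[int]) -> Optional[str]:
    """Return the identifier only if the result table has exactly one record.

    `title` and `year` are part of the signature in the text but are not
    needed by this simplest version. The record layout is hypothetical.
    """
    if len(table) == 1:
        return table[0]["tt"]
    return None


print(tau1(64665))              # the Midnight Cowboy example from the text
print(eta1("Midnight Cowboy"))
```
]]></preformat>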
      <sec id="sec-4-1">
        <title>Algorithm</title>
        <p>σ1(η1())</p>
        <p>To illustrate different algorithms for the same extraction
task we describe another version. Here the algorithm is not
learned but hand-crafted. One of the reasons for the relatively
low effectiveness of the σ1(η1()) algorithm was the sub-optimal
query string used for IMDb search, caused by the quite different
naming conventions of movie titles in different datasets. To
improve the results we enhanced the movie title
transformation incrementally and produced new
versions of it. Every new version added a new step to the
transformation of the movie title:</p>
        <list list-type="simple">
          <list-item><p>η2: Convert all letters in the movie title to lower case.</p></list-item>
          <list-item><p>η3: If the movie title contains the year of production in brackets at its
end, remove it.</p></list-item>
          <list-item><p>η4: If the movie title still contains text in brackets at its
end, remove it. This text usually contains the name of the
movie in its original language.</p></list-item>
          <list-item><p>η5: Move the word “the”, or “a”/“an”, from the
end of the title to the beginning.</p></list-item>
          <list-item><p>η6: Translate the characters ”_”, ”.”, ”?” and ”,” to spaces.</p></list-item>
          <list-item><p>η7: Translate ”&amp;” and ”&amp;amp;” in titles to the word ”and”.</p></list-item>
        </list>
        <p>For example, the η7 transformation changes the title
“Official Story, The (La Historia Oficial) (1985)” to its
canonical form “the official story”.</p>
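        <p>The cumulative transformation up to η7 might look roughly as follows. The regular expressions and the function name eta7_title are our own assumptions, chosen so that the example from the text is reproduced; the original implementation may treat corner cases differently.</p>
        <preformat><![CDATA[
```python
import re


def eta7_title(title: str) -> str:
    """Apply steps eta2..eta7 in order and return the canonical title."""
    t = title.lower()                                   # eta2: lower case
    t = re.sub(r"\s*\(\d{4}\)\s*$", "", t)              # eta3: trailing "(year)"
    t = re.sub(r"\s*\([^()]*\)\s*$", "", t)             # eta4: trailing "(text)"
    m = re.match(r"^(.*),\s*(the|a|an)$", t)            # eta5: trailing article
    if m:
        t = m.group(2) + " " + m.group(1)
    t = re.sub(r"[_.?,]", " ", t)                       # eta6: punctuation to spaces
    t = t.replace("&amp;", " and ").replace("&", " and ")  # eta7: ampersands
    return " ".join(t.split())                          # collapse repeated spaces


print(eta7_title("Official Story, The (La Historia Oficial) (1985)"))
```
]]></preformat>
        <p>Collapsing repeated spaces at the end is a design choice of this sketch; it keeps the later query string free of empty words.</p>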
        <p>This version of the transformation then constructs the IMDb
query in the form
http://www.imdb.com/find?q=the+official+story&amp;s=tt&amp;…
and then searches the resulting table to find the identifier
”tt0089276”.</p>
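        <p>As a hedged illustration, the visible part of such a query can be built with the Python standard library. Only the parameters q and s shown in the text are used; the further parameters elided in the URL above are not known to us and are left out. The function name imdb_query_url is hypothetical.</p>
        <preformat><![CDATA[
```python
from urllib.parse import urlencode

IMDB_FIND = "http://www.imdb.com/find"


def imdb_query_url(canonical_title: str) -> str:
    """Build the search URL; urlencode turns spaces into plus signs."""
    return IMDB_FIND + "?" + urlencode({"q": canonical_title, "s": "tt"})


print(imdb_query_url("the official story"))
```
]]></preformat>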
        <p>The results of this combination of algorithms were:</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Algorithm</th><th>MovieLens</th><th>Flix</th><th>Twitter</th></tr>
            </thead>
            <tbody>
              <tr><td>σ1(η7())</td><td>45.4%</td><td>70.9%</td><td>Not needed</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-7">
      <p>In the optimal case, the table returned by the IMDb
search contains exactly one row with the requested record.
In this situation the algorithm σ1 behaves well and is able
to retrieve the correct IMDb identifier. In many other cases
the result contains more rows, and the correct one, or the
best possible one, has to be identified. For this purpose we
constructed more versions of the σj algorithm as well:</p>
      <list list-type="simple">
        <list-item><p>σ2: The correct record should be from the requested
year, so search only among records from this year and ignore
other records.</p></list-item>
        <list-item><p>σ3: The IMDb search provides more levels of tolerance
in title matching. Try to use three of them, from the most
exact one to the most general. If a matching record from the
requested year cannot be found using a stricter search, the
next search level is used.</p></list-item>
      </list>
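      <p>A minimal sketch of σ2 and σ3 follows. The record layout, the decision to accept a year match only when it is unique, and the modelling of the search levels as a list of callables ordered from strictest to most general are assumptions made for illustration; the text does not specify these details.</p>
      <preformat><![CDATA[
```python
from typing import Callable, Optional, Sequence

Record = dict  # assumed shape: {"tt": str, "year": int}


def sigma2(table: Sequence[Record], year: int) -> Optional[str]:
    """Ignore records from other years; accept only a unique remaining record."""
    same_year = [r for r in table if r["year"] == year]
    return same_year[0]["tt"] if len(same_year) == 1 else None


def sigma3(searches: Sequence[Callable[[str], Sequence[Record]]],
           query: str, year: int) -> Optional[str]:
    """Try each search level in order and stop at the first successful one."""
    for search in searches:
        tt = sigma2(search(query), year)
        if tt is not None:
            return tt
    return None
```
]]></preformat>
      <p>In use, each callable would wrap one tolerance level of the search; in a test they can be replaced by functions returning fixed tables.</p>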
      <p>Currently, we have 13 081 out of all 17 770 Flix movies
mapped onto the IMDb database, and all 27 278 movies
from the MovieLens set are mapped to the equivalent IMDb
record. So the current results provided by the combination
of the most advanced versions of the algorithms are:</p>
      <p>The diagram in Figure 3 shows the number of
movies in the individual datasets and the number of movies
assigned to their corresponding IMDb record. The number of
movies associated with an IMDb record differs between the
intersections of the datasets after integration. For example,
the MovieLens dataset contains 27 278 movies in total.
Of these, 13 134 are associated movies unique to MovieLens, and
3 759 are associated movies shared with the Flix
dataset but not present in the Twitter dataset. The number of
movies common to all three datasets is 4 075. By
summing 3 759 and 4 075 we get the total number of
7 654 associated movies belonging to both the MovieLens and
Flix datasets, etc.</p>
    </sec>
    <sec id="sec-8">
      <title>Extraction of attributes</title>
      <p>For each movie registered in the IMDb database we then
retrieved XML data from the URL
http://www.omdbapi.com/?i=ttNNNNNNN&amp;plot=full&amp;r=xml
and from the XML data extracted the following movie
attributes:</p>
      <sec id="sec-8-1">
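        <list list-type="simple">
          <list-item><p>IMDb title</p></list-item>
          <list-item><p>IMDb rating</p></list-item>
          <list-item><p>IMDb rated</p></list-item>
          <list-item><p>IMDb awards</p></list-item>
          <list-item><p>IMDb metascore</p></list-item>
          <list-item><p>IMDb year</p></list-item>
          <list-item><p>IMDb country</p></list-item>
          <list-item><p>IMDb language</p></list-item>
          <list-item><p>IMDb genres</p></list-item>
          <list-item><p>IMDb director</p></list-item>
          <list-item><p>IMDb actors</p></list-item>
        </list>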
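        <p>The attribute retrieval could be sketched as below. The URL parameters are those given in the text. The shape of the XML response (a root element with a movie child carrying the values as XML attributes) and the sample document are assumptions for illustration only, and the service may nowadays require additional parameters such as an access key.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def omdb_url(imdb_id: str) -> str:
    """URL of the XML record for one IMDb identifier."""
    params = {"i": imdb_id, "plot": "full", "r": "xml"}
    return "http://www.omdbapi.com/?" + urlencode(params)


def movie_attributes(xml_text: str, wanted=("title", "year")) -> dict:
    """Read selected attributes of the (assumed) <movie> element."""
    movie = ET.fromstring(xml_text).find("movie")
    if movie is None:
        return {}
    return {name: movie.get(name) for name in wanted}


# Hypothetical sample response, not real service output.
SAMPLE = '<root response="True"><movie title="The Official Story" year="1985"/></root>'

print(omdb_url("tt0089276"))
print(movie_attributes(SAMPLE))
```
]]></preformat>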
        <p>In a similar way to how movies from the datasets are mapped
onto IMDb movies, we implemented the mapping
technique described in [K] and assigned DBpedia<sup>1</sup>
identifiers and semantic data to IMDb movies.</p>
        <p>The DBpedia identifier of a movie is a string, for example
”The_Official_Story” or ”The_Seventh_Seal”. This
identifier can then be used to access the DBpedia
graph database directly or to retrieve data in an XML format through
a URL of the form
http://dbpedia.org/page/DbPediaIdentifier.
From the data available on the DBpedia page, movie attributes can be
extracted directly or indirectly: GENRE,
GENRE1, ACTION, ADVENTURE, ANIMATION,
CHILDRENS, COMEDY, CRIME, DOCUMENTARY,
DRAMA, FANTASY, FILM_NOIR, HORROR,
MYSTERY, MUSICAL, ROMANCE, SCI_FI,
THRILLER, WAR, WESTERN, or the attributes CALIF, LA,
NY, CAMERON, VISUAL, SEDIT, NOVELS, SMIX,
SPIELBERG, MAKEUP, WILLIAMS and many others.</p>
        <sec id="sec-8-1-2">
          <title>A Dynamic Logic Model for Web Annotation</title>
          <p>To make effective use of changing and/or growing
information we have to evolve the tools (e.g. inductive
methods) used for the creation of a specific web service (here
recommendation of movies). Our goal is to extend the
semantic web foundations to enable describing the creation,
dynamics and similarities of data. To describe the
reliability of extraction algorithms we propose a
"half-way" extension of dynamic logic.</p>
          <p>Our reference for dynamic logic is the book of D. Harel,
D. Kozen, J. Tiuryn [HKT].</p>
          <p>Dynamic logic has two types of symbols:
propositions/formulas ϕ, ψ ∈ Φ and programs α, β ∈ Π.
One can construct a program from a formula by the test ϕ?,
and formulas from programs by the generalized modality operations
[α], &lt;α&gt;. The expression &lt;α&gt;ϕ says that it is possible to
execute α and halt in a state satisfying ϕ; the expression
[α]ϕ says that whenever α halts, it does so in a state
satisfying ϕ.</p>
          <p>The main goal of dynamic logic is reasoning about programs,
e.g. in program verification. In our case programs will be
extractors/annotators and can be kept propositional, as for
now we are not interested in the procedural details of
extractors. Formulas will be more expressive in order to be
able to describe the context of extraction.</p>
        </sec>
      </sec>
      <sec id="sec-8-2">
        <title>1 http://wiki.dbpedia.org/</title>
      </sec>
      <sec id="sec-8-3">
        <p>Using the example above, let
ϕ be the statement that a Twitter data entry has the title
“Midnight Cowboy” and MOVIEID = 64665;
α be the algorithm concatenating the prefix ’tt’ with the
MOVIEID left-padded by zeroes to seven positions; and
ψ be the statement that the movie “Midnight Cowboy” corresponds to the
IMDb record with ID equal to ’tt0064665’.</p>
        <p>The corresponding dynamic logic expression is
<disp-formula>∀x(ϕ(x) → [α]ψ(x))</disp-formula>
saying that whenever α starts in a state satisfying ϕ(m1),
then whenever α halts, it does so in a state satisfying ψ(m1);
see the illustration in Fig. 4.</p>
        <p>Programs (extractors) remain propositional; states
correspond to different representations of content on the
web. On each state the respective semantics is defined
using an appropriate query language.</p>
        <p>Our logic has expressions of two sorts, and each sort
can be typed:</p>
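        <p>Statements about web data can be
either atomic, e.g. Φ<sub>0</sub><sup>RDF</sup>, Φ<sub>0</sub><sup>FOL</sup>, Φ<sub>0</sub><sup>RDB</sup>, Φ<sub>0</sub><sup>XML</sup>, Φ<sub>0</sub><sup>DOM</sup>,
Φ<sub>0</sub><sup>BoW</sup>, Φ<sub>0</sub><sup>PoS</sup>, Φ<sub>0</sub><sup>DepTree</sup>, etc., or more complex, e.g. ϕ<sup>RDF</sup>,
ψ<sup>FOL</sup>, etc., with semantics based on the corresponding data model and query
language. All can be subject to
uncertainty or probability extensions.</p>
        <p>Programs (propositional) can be atomic, e.g. Π<sub>0</sub><sup>σ</sup> for subject
extraction, Π<sub>0</sub><sup>π</sup> for property extraction or Π<sub>0</sub><sup>ω</sup> for object
value extraction in the case of HTML, XHTML or XML data, and
Π<sub>0</sub><sup>ner</sup> for named entity extraction in the case of text data, etc.,
or more complex, e.g. α<sup>σπω</sup>, β<sup>σπω</sup>, γ<sup>σπω</sup>, etc. In this logic we do
not prove any statements about programs that depend on their
code, so program names point to code one would reuse.</p>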
        <p>Statements are typically accompanied by information
about program creation, such as the data mining tool, training data,
metrics (e.g. precision, recall), etc. There is also a lot of
reification describing the training and testing data and the
metrics of learning. Our model is based on dynamic logic,
calculates similarity of states and describes the
uncertain/stochastic character of our knowledge.</p>
        <p>Hence we are able to express our extraction experience
in statements like
<disp-formula>ϕ → [α]<sub>x</sub> ψ</disp-formula>
where ϕ is a statement about data D1 before extraction
(preconditions), ψ is a statement about data/knowledge D2,
K2 after extraction (postconditions), and α is the program used
for extraction. The modality [α]<sub>x</sub> can be weighted, describing
uncertainty aspects of learning.</p>
        <p>A lot of additional reification about learning can be
helpful.</p>
        <p>The main idea of this paper is that if there are some data
D1’ similar to D1 and ϕ is true on them to some degree (e.g.
because both resources were created using the same template),
then after using α we can conclude with high
certainty/probability that the statement ψ will be true on
data D2’ (knowledge K2’).</p>
        <p>For instance, this is expressed by the formula
“MyData are similar to IRI3” → [σ3η7]<sub>0.736</sub> “IMDBId is
correct”.</p>
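        <p>A weighted statement of this kind can be stored as a small record. In the sketch below the class and field names are hypothetical. Reading the weight as the share of successfully mapped movies is our interpretation: the value 0.736 coincides with the 13 081 of 17 770 Flix movies reported as mapped in the integration experiments.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedStatement:
    """precondition -> [program]_weight postcondition"""
    precondition: str
    program: str
    postcondition: str
    weight: float


def success_ratio(mapped: int, total: int) -> float:
    """Share of items for which the program produced an accepted result."""
    return mapped / total


flix = WeightedStatement(
    precondition="MyData are similar to IRI3",
    program="σ3η7",
    postcondition="IMDBId is correct",
    weight=round(success_ratio(13081, 17770), 3),
)
print(flix.weight)
```
]]></preformat>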
        <p>Experiments with extraction and integration of movie
data can serve as an example of this. In the next chapter we
would like to illustrate how this influences recommendation.</p>
        <sec id="sec-8-3-1">
          <title>Preference Learning Experiments</title>
          <p>To show the usability of extracted and annotated data, we
provide experiments in the area of recommender systems.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Data Preprocessing</title>
      <p>We selected all Twitter users with ratings less than or equal
to 10, 3000 random MovieLens users and 3000 random Flix
users.</p>
      <p>For these, we split the RATING data by assigning the last
5 records (according to timestamp) of each user as test
data; the remaining data were used as training data.</p>
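      <p>The split can be sketched as follows. The tuple layout (user, movie, rating, timestamp) and the function name are assumptions; note also that in this sketch a user with five or fewer ratings contributes only to the test data, a case the text does not discuss.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def split_last_n(ratings, n=5):
    """Put the n most recent ratings of every user into the test set (n >= 1)."""
    by_user = defaultdict(list)
    for row in ratings:            # row = (user, movie, rating, timestamp)
        by_user[row[0]].append(row)
    train, test = [], []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[3])   # chronological order per user
        train.extend(rows[:-n])
        test.extend(rows[-n:])
    return train, test
```
]]></preformat>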
      <p>Based on the training data, we calculated aggregated variables:</p>
      <p>In Table 10 we show preliminary results on testing
repeatability. We trained the model on the dataset in the
row and tested it on the test data in the column. Unsurprisingly,
in each column the best result is on the diagonal.</p>
      <p>Similarly to Linked Data quality assessment, we can
imagine a similar assessment for reusability.</p>
      <p>The main idea is that this will be less sensitive to URL
change, migration and the “end-of-project phenomenon”. One
can imagine that this information is published at
https://sourceforge.net/ or a similar service. What follows is
a vision we would like to discuss.</p>
      <p>As the zero model we use the movie rating average MAVG.</p>
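      <p>A minimal sketch of such a zero model is given below. It assumes ratings as (user, movie, rating, timestamp) tuples; falling back to the global mean for movies that do not occur in the training data is our assumption, since the text does not say how such movies are handled.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def mavg_model(train):
    """Return ({movie: average rating}, global average) from training ratings."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _user, movie, rating, _ts in train:
        sums[movie] += rating
        counts[movie] += 1
    total = sum(counts.values())
    global_mean = sum(sums.values()) / total if total else 0.0
    return {m: sums[m] / counts[m] for m in sums}, global_mean


def predict(model, global_mean, movie):
    """Predict the movie average; use the global mean for unseen movies."""
    return model.get(movie, global_mean)
```
]]></preformat>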
      <p>In this research proposal we do not evaluate the role of
similarity; we just illustrate the similarity of our datasets.</p>
      <p>In Figure 5 we show the MAVG function on a sample of
movies (with IMDb IDs). Table 11 shows MAVG
distances below the diagonal. So far it is not clear to us which
metric to use to compute similarity: Euclidean or cosine.
Further experiments are required, as this can depend on the
domain.</p>
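      <p>The two candidate metrics can behave quite differently, as the following sketch shows. It assumes that each dataset is represented by a vector of MAVG values over the same list of movies; this representation and the toy vectors are our assumptions.</p>
      <preformat><![CDATA[
```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; undefined for zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors: the second uses a rating scale twice as wide as the first.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
print(euclidean(a, b), cosine_distance(a, b))
```
]]></preformat>
      <p>For these toy vectors the cosine distance is (numerically) zero while the Euclidean distance is not, which suggests that the choice matters when datasets use different rating scales.</p>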
      <p>Maybe the right way to calculate similarity is content
based. We illustrate this in Fig. 6 with the behavior of statistics
on genres. Table 11 shows genre-based distances in the cells
above the diagonal.</p>
      <sec id="sec-9-1">
        <title>Proposal, Conclusions, Future Work</title>
        <p>We have provided preliminary experiments with the
reusability of our algorithms. The results are promising, but
we still need more extensive testing.</p>
        <list list-type="simple">
          <list-item><label>8</label><p>Reusability extraction describes a 7 one in a more
extensive way, with several different data examples and
similarities. This can increase the chance that for a given
domain you find a solution which fits your data.</p></list-item>
          <list-item><label>9</label><p>Reusability extraction assumes an 8 one in
a server/cloud implementation. You do not have to solve
the problem that the extractor does not run properly in your
environment.</p></list-item>
          <list-item><label>10</label><p>Reusability extraction assumes a 9 one in a more
user-friendly way: you just upload your data (or their URL),
the system finds a solution, and you can download the result.
It is also possible to imagine this enhanced with some
social network interaction.</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Conclusions</title>
      <p>We have presented a research proposal for improving the
degree of automation of web information extraction and
annotation. We propose a formal dynamic logic model for
automated web annotation with similarity and reliability.</p>
      <p>We illustrated our approach by an initial prototype and
experiments on recommendation on (annotated
and integrated) movie data.</p>
    </sec>
    <sec id="sec-11">
      <title>Future work</title>
      <sec id="sec-11-1">
        <p>The challenge is twofold:</p>
        <list list-type="bullet">
          <list-item><p>extend this to other domains;</p></list-item>
          <list-item><p>provide a deeper analysis of data mining and possible similarities.</p></list-item>
        </list>
        <p>We can consider some more complex algorithms for
preference learning, e.g., based on the spreading activation
[GG].</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Acknowledgement</title>
      <p>Research was supported by Czech project Progres Q48.</p>
      <p>[A] [IBM] John Arwe. Coping with Un-Cool URIs in the
Web of Linked Data. LEDP-Linked Enterprise Data
Patterns workshop, Data-driven Applications on the Web,
6-7 December 2011, Cambridge, MA. Hosted by W3C/MIT.</p>
      <p>[F1] D. Fiser. Semantic annotation of domain dependent
data (in Czech). Master thesis, Charles University, Prague,
2011.</p>
      <p>[F2] D. Fiser. Semantic annotator.
https://chrome.google.com/webstore/detail/semanticannotator/gbphidobolopkilhnpkbagodhalimojj
http://www.doser.cz/projects/semantic-annotator/
(User Guide and Installation Guide (server part) in Czech).</p>
      <p>[GG] L. Grad-Gyenge. Recommendations on a knowledge
graph, 2015.</p>
      <p>[HKT] D. Harel, D. Kozen, J. Tiuryn. Dynamic Logic
(Foundations of Computing). The MIT Press, 2000.</p>
      <sec id="sec-12-1">
        <title>Published by CEUR Challenge, Berlin, Germany.</title>
        <p>[PLEDVF] L. Peska, I. Lasek, A. Eckhardt, J. Dedek,
P. Vojtas, D. Fiser. Towards web semantization and user
understanding. In EJC 2012, Y. Kiyoki et al. (eds.), Frontiers
in Artificial Intelligence and Applications 251, IOS Press,
2013, pp. 63-81.</p>
        <p>[T] Twitter data from RecSys 2014 challenge,
http://2014.recsyschallenge.com/</p>
        <p>[V+] R. Verborgh et al. Querying Datasets on the
Web with High Availability. In: P. Mika et al. (eds.) The
Semantic Web – ISWC 2014. Lecture Notes
in Computer Science, vol. 8796. Springer, Cham, 2014.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>