<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Evaluating Data Sources for Crawling Events from the Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Balázs Horváth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Horváth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ELTE - Eötvös Loránd University, Faculty of Informatics, Department of Data Science and Engineering, Budapest, Hungary</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1885</volume>
      <fpage>218</fpage>
      <lpage>222</lpage>
      <abstract>
        <p>The bottleneck of event recommender systems is the availability of actual, up-to-date information on events. Usually, there is no single data feed, thus information on events must be crawled from numerous sources. Ranking these sources helps the system decide which sources to crawl and how often. In this paper, a model for event source evaluation and ranking is proposed, based on well-known centrality measures from social network analysis. Experiments made on real data, crawled from Budapest event sources, show interesting results for further research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Tourist event recommender systems need a large amount of
data, preferably all events around a particular location. To
get that data, we would need the organizers to upload
every new event they organize/create/host to the
application which handles the data. This can be the recommender
application itself or just a backend application where the
organizers want to be visible. If this solution is not
acceptable, or the organizers do not put enough effort into
it, then the recommender system lacks information
and cannot work as well as expected.</p>
      <p>The other solution would be to find a feed which
contains the upcoming events for each location.
Unfortunately, no such feeds exist; feeds about one
particular topic or one location’s events can be found and
crawled, but they do not satisfy the needs of a tourist event
recommender system. There are reasonably good sources
for one or two big cities in the USA, but that does not scale
if the system expects every city or town to have its
own feed like those.</p>
      <p>The only solution for the current situation is to collect
the information about the events semi-automatically from
numerous sources through a data crawler engine. These
event sources (denoted “sources” in the rest of the paper)
can differ in usefulness: some of them can
be completely redundant for the system, because the same
information about their events is already crawled. Others
upload information or new events very rarely, so it is not
worth checking them often. Quality differences can be
discovered by observing the different sources.
In order to save computational resources, or when a system
reaches its limit, the import method has to rank the sources
in the queue, but how should it decide which one to rank
higher? What happens if it ranks a source which plays a
very important role in the system very low? These sources
have to be evaluated and indexed according to their
importance for our purposes.</p>
      <p>
        As it is mentioned in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], WIEN [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], XWRAP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
STALKER [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], NoDoSe [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and BYU [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] are a
selection of the well-known, often-quoted solutions for Web
Data Extraction (WDE). In the past few years, new
approaches were published, like FiVaTech [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], FiVaTech2 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
NEXIR [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], AutoRM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and OXPath [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The last
one is a wrapper language with an optimized
syntax that makes describing a WDE task easier.
It also supports JavaScript and CSS3 transitions and most of
the modern Document Object Model (DOM)
modification triggers, and it can recognize drag-and-drop features
as well. Pagination is a problem with dynamic web
pages and requires link extraction; OXPath and
other solutions can already handle that problem.
Unfortunately, writing OXPath expressions and maintaining them
is costly and involves much effort, so it does not scale
well. DIADEM [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] utilizes OXPath to provide wrapper
generators, which is a step closer to the right solution, but
it does not provide deep insight into the process. Another
wrapper language, NEXIR, has been created to cover
the whole WDE process, including pagination, data extraction
and integration. The problem of scaling is not solved by
wrapper languages either. FiVaTech and its improved
version FiVaTech2 provide a page-level extraction approach
which utilizes different DOM-based information to build
up a wrapper. FiVaTech utilizes tree matching,
tree alignment and mining techniques to identify a
template from a set of pages. FiVaTech2 improves the node
recognition by including node-specific features, such as
visual information, DOM tree information, HTML tag
contents, ids and classes. It is clearly visible that a ranking
system is needed to be able to differentiate between
solutions. ARIEX [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is a framework for ranking data
and information extractors; it solves a specific problem
by comparing different approaches. A still missing
approach is to rank the data sources rather than the
extraction approaches. Regarding scalability, until a
general solution for the problem exists, we can limit
the scaling effort by ranking the sources and
leaving out the unnecessary ones. There is no such
publication or solution publicly available, so we take this
approach in this research. To reach the results, a
bipartite graph can be used with social network analysis methods
applied on it. The importance of centrality measures and social
network analysis methods is discussed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>An approach to event data source evaluation and
ranking, using network centrality measures, is presented in the
following section, followed by the description of a small
proof-of-concept preliminary experiment.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 The Proposed Model</title>
      <p>For the evaluation of sources, we consider the following
attributes:
• Uniqueness of events contained in the source
• Number of events the source contains
• The importance of the source w.r.t. the other sources
• Freshness of events in the source
• Location of events contained in the source</p>
      <p>
        We decided to represent sources and events in a
bipartite graph, where both events and sources are
vertices and their connections are represented by edges. Thus,
well-known centrality measures from social network
analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] can be utilized to compute the above mentioned
attributes of sources.
      </p>
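As a concrete illustration, the bipartite source-event graph can be sketched with plain dictionaries. This is an illustrative sketch, not the system's actual code; all names and the sample data are assumptions.

```python
from collections import defaultdict

def build_bipartite(source_events):
    """source_events: dict mapping a source name to an iterable of event ids.
    Returns (source -> set of events, event -> set of sources);
    an edge of the bipartite graph is a (source, event) membership."""
    src = {s: set(evts) for s, evts in source_events.items()}
    evt = defaultdict(set)
    for s, evts in src.items():
        for e in evts:
            evt[e].add(s)
    return src, dict(evt)

src, evt = build_bipartite({
    "bjc.hu": {"e1", "e2"},
    "copyofbjc": {"e1", "e2"},
    "clubA": {"e2", "e3"},
})
# the cardinality of an event = number of sources containing it
print({e: len(ss) for e, ss in sorted(evt.items())})  # → {'e1': 2, 'e2': 3, 'e3': 1}
```

The event-to-sources map gives the event cardinalities used throughout the model; the degree of a source is simply `len(src[s])`.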
      <sec id="sec-2-1">
        <title>2.1 Uniqueness</title>
        <p>To get an indicator like uniqueness, different approaches
could be considered. The first step is to find those
sources which have at least one unique event. If a source
has a unique event, this is important information for the
model, because it means that if we lose that source, we
cannot get those unique events from other sources. To
find those sources, the algorithm should
go through and check the cardinality of each event and
source as well.</p>
        <p>For the unique event calculation, the cardinality of
sources is less important than the cardinality of events.
If an event can be found in just one source, that
source is irreplaceable. Of course, we cannot ignore the
fact that the system should probably be able to
differentiate between sources which do not have unique events:
if one of the sources with many
events, both unique and not, becomes unreachable or stops
working, it is predictable that this will cause uniqueness
changes in the graph.</p>
        <p>The way of computing the uniqueness, illustrated in
Algorithm 1, works as follows:</p>
        <p>It creates a copy of the whole graph and looks for the
lowest-cardinality event (if there are several, it picks a
random one). It chooses one of that event’s sources and increases
that source’s uniqueness indicator. Then it goes through
all of the events of that source and deletes them one by
one. When this step is finished, the source, left with no
cardinality, is deleted as well. These steps, starting from picking
the lowest-cardinality event, are repeated until all event
vertices disappear from the copied graph. The whole
loop is then repeated 100 times to smooth the result and
let the indicators converge to the correct value (this step is
necessary because of the random pick). In the end, to get
the indicators between 0 and 1, we divide them
by one hundred. The number of repetitions can be increased or
decreased to make the result even smoother or to make the
algorithm run faster.</p>
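The procedure above can be sketched as follows. This is a Monte Carlo sketch under the assumption of a dict-of-sets graph representation; the distinguisher step of Algorithm 1 is omitted here, and all names are illustrative.

```python
import random
from collections import defaultdict

def uniqueness(source_events, repeats=100, rng=random):
    """Monte Carlo uniqueness indicators in [0, 1], distinguisher step omitted."""
    scores = defaultdict(float)
    for _ in range(repeats):
        # work on a disposable copy of the bipartite graph
        src = {s: set(evts) for s, evts in source_events.items()}
        while any(src.values()):
            # rebuild the event -> sources map (event cardinality) on the copy
            evt = defaultdict(set)
            for s, evts in src.items():
                for e in evts:
                    evt[e].add(s)
            # pick a lowest-cardinality event, breaking ties randomly
            e_min = min(evt, key=lambda e: (len(evt[e]), rng.random()))
            chosen = rng.choice(sorted(evt[e_min]))
            scores[chosen] += 1.0
            # delete every event of the chosen source, then the source itself;
            # sources emptied along the way simply never get picked again
            for e in list(src[chosen]):
                for events in src.values():
                    events.discard(e)
            del src[chosen]
    # divide by the number of repetitions so indicators land in [0, 1]
    return {s: scores[s] / repeats for s in source_events}
```

With two sources holding disjoint events, both indicators converge to 1.0; with two exact copies of each other, the two indicators sum to 1.0 and split roughly evenly.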
        <p>With this approach there will be differences between the
sources which do not have any unique event, so this issue
is solved by this solution.</p>
        <p>Another issue is that the resources are sometimes not
sufficient to frequently download all the sources holding
unique events; that is why we need to distinguish between
sources which have unique events, to be able to choose the
most valuable of them.</p>
        <p>The other reason why a difference between sources
holding unique events is needed is that if the sources
knew the algorithm, they could try to avoid being
left out or ranked low, and they would trick the system
with fake unique events. This happened with Google
indexing, in so-called black-hat search engine optimization, where
fake backlinks and meta keywords were embedded in sites
to increase their position in the search results.</p>
        <p>An approach for handling these issues is to add an
additional variable, the distinguisher, to the
previously calculated indicators, defined as
distinguisher(s) = (u / u_all) · 1 / (1 + e^(−(u − ū)))
(1)
where s denotes the source, u_all is the sum of all unique
events in the graph, u is the sum of all unique events of
the source and ū is the average number of unique events per source
in the whole graph.</p>
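Equation (1) translates directly into code. This is a sketch in which u, u_all and ū are assumed to be precomputed from the bipartite graph; the parameter names are illustrative.

```python
import math

def distinguisher(u, u_all, u_avg):
    """Equation (1): u = unique events of the source, u_all = unique events
    in the whole graph, u_avg = average unique events per source (ū)."""
    if u_all == 0:
        return 0.0  # no unique events anywhere: nothing to distinguish
    # the sigmoid factor ranks unique-event holders without large gaps
    return (u / u_all) * (1.0 / (1.0 + math.exp(-(u - u_avg))))
```

A source with no unique events gets 0; a source sitting exactly at the average gets half of its share u/u_all, since the sigmoid evaluates to 0.5 there.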
        <p>The sigmoid function in the second part of the
equation handles outliers: this step only has to
distinguish between unique-event holders and produce a
ranking, without introducing big differences. Using the
sigmoid, the differences between unique-event holders are
smoothed out while their order is kept.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Degree and Betweenness</title>
        <p>To compute the number of events contained in a source,
a simple centrality measure, the degree of the source (as
a vertex in the graph), is used. Basically, the degree of the
source is the number of events it contains.</p>
        <p>
          Another important property of the source is its
betweenness. It is a measure which shows how important the
position of that particular vertex (source) is in the whole
network, and is computed as
betweenness(v) = ∑_{u ≠ v ≠ t} nspv(u, t) / nsp(u, t)
(2)
where u and t are vertices not equal to v, nsp(u, t)
is the number of shortest paths from u to t, and
nspv(u, t) is the number of shortest paths between the
nodes which go through the vertex v [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. In our case it
is used to show how important a source is for events
and to find events which have high betweenness. Such
an event connects sources, and we can observe whether
it is the only event which makes the source less unique,
or whether there are more of these high-betweenness events in its
list of events. If there is more than one, we
should observe whether the events connect only
the same sources or are distributed: in the latter case the
source can be the connection between several sources and
can itself act as a feed which provides important
information even if it does not have any unique events.
        </p>
        <p>Algorithm 1 Uniqueness
1: procedure UNIQUENESS(copy of graph)
2: while size of events &gt; 0 do
3: e ← minCardinalityEvent(events)
4: s ← randomPick(sources containing e)
5: increaseIndicator(originalDatasource(s))
6: for all a ∈ getevents(s) do
7: delete(a)
8: end for
9: delete(s)
10: end while
11: indicators = {indicator(s1), indicator(s2), . . . , indicator(sn)}
12: for all si ∈ sources do
13: indicator(si) = indicator(si) + distinguisher(si)
14: end for
15: return indicators
16: end procedure</p>
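For a small unweighted graph, equation (2) can be computed by BFS-based shortest-path counting. This is an illustrative sketch, not the paper's implementation; the adjacency-set representation and all names are assumptions.

```python
from collections import deque

def _sp_counts(adj, u):
    """BFS from u: returns distance and number-of-shortest-paths maps."""
    dist, nsp = {u: 0}, {u: 1}
    q = deque([u])
    while q:
        x = q.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                nsp[y] = 0
                q.append(y)
            if dist[y] == dist[x] + 1:
                nsp[y] += nsp[x]
    return dist, nsp

def betweenness(adj, v):
    """Equation (2) for vertex v in an undirected, unweighted graph."""
    total = 0.0
    nodes = [n for n in adj if n != v]
    dv, nv = _sp_counts(adj, v)
    for u in nodes:
        du, nu = _sp_counts(adj, u)
        for t in nodes:
            if t == u or t not in du:
                continue
            # v lies on a shortest u-t path iff the distance splits at v;
            # the number of such paths is nsp(u,v) * nsp(v,t)
            if v in du and t in dv and du[v] + dv[t] == du[t]:
                total += (nu[v] * nv[t]) / nu[t]
    return total / 2.0  # each unordered pair was counted twice
```

On the path graph a-b-c, the middle vertex b scores 1 (it lies on the only shortest a-c path) and the endpoints score 0.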
        <p>It is important for us to know which nodes have high
betweenness centrality, because they tend to be concert halls,
clubs or concert venues, forums, etc., collecting events of
different artists. If a source is such an event collector, it can
lead us to the decision that, even if it does not have unique
events, it is very important for the model, because it can post
new events from a new artist whose website we have not
crawled yet.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Location</title>
        <p>From the previous properties, we can already make good
measurements and propose a ranking, but there is other
relevant information which can be important in some
cases, such as keeping the data up to date, focusing on
different areas, or performance optimization. This measure
does not focus on exact locations; it just tries
to decide what distance is worth traveling for the tourists.</p>
        <p>
          In the preliminary experiment, using the Budapest
events dataset, a big part of the events are inside the smaller
ring road (tram line 4–6), as illustrated in Figure 1. For
this measure we need to observe whether a source has its
events at the same location most of the time, or usually at
different locations. If the events are at the same
location, the task is easy: find the relevance
borders for the recommender, divide the area into circles,
and assign points accordingly. The other case is when
most of the events have different locations; then the
algorithm should calculate the center of the locations (carefully
with the outliers) and assign the score according to that.
        </p>
        <p>
          Freshness is a binary function [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] that measures whether
the downloaded local copy is accurate with respect to the live
page. The freshness of a page p in the repository at time
t is defined as Fp(t) = 1 if p is equal to the local copy at
time t, and Fp(t) = 0 otherwise.
        </p>
        <p>Age is a measure which indicates how outdated the
downloaded copy is. The age of a page p in the
repository at time t is defined as Ap(t) = 0 if p is not modified
at time t, and Ap(t) = t − mt(p) otherwise, where mt(p)
is the last modification time of p.</p>
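The two definitions can be sketched as follows. Modeling a page by the list of its modification times and the time it was crawled is an assumption made here for illustration.

```python
def freshness(mod_times, crawl_time, t):
    """F_p(t): 1 if no modification occurred after the crawl up to time t,
    i.e. the local copy still matches the live page; 0 otherwise."""
    return 1 if not any(crawl_time < m <= t for m in mod_times) else 0

def age(mod_times, crawl_time, t):
    """A_p(t): 0 while the copy is fresh, otherwise t minus the last
    modification time mt(p) that invalidated the copy."""
    stale = [m for m in mod_times if crawl_time < m <= t]
    return 0 if not stale else t - max(stale)
```

For a page modified at times 5 and 12 and crawled at time 10, the copy is still fresh at t = 11 but stale with age 3 at t = 15.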
        <p>With the help of these functions, the scheduler can
calculate how often a page usually updates its content, or,
in other words, how often the downloaded copy gets
outdated. For different sources carrying the same event, this
frequency information can tell us which one posted
it earlier or which one posts more frequently. That
information can influence the importance result. As an
example, it can be important to know if an event is canceled
or has changed its information, such as the location or the
starting time. For applications where being up to date with
event information is crucial, the freshness property can be
weighted more.</p>
        <p>The final score of a source combines the attributes as
rank(s) = w1 · U(s) + w2 · D(s) + w3 · B(s) + w4 · A(s) + w5 · L(s)
(3)
where w = {w1, w2, w3, w4, w5} are the weights, which
change according to the application’s needs, s is the
current source the algorithm is evaluating/ranking,
and U, D, B, A and L refer to the uniqueness, degree,
betweenness, actuality and location of the source,
respectively. The weighting is important, because some
applications have the goal of getting all the events, or as
many as possible, while others focus on performance, to be
able to offer trustworthy, fast-running applications on the
crawled data, where it does no harm if this costs some
percentage of the events.</p>
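The weighted combination of the five attributes described above can be sketched as follows; the weight values and attribute scores below are made-up placeholders, chosen per application.

```python
def rank(source_scores, w):
    """source_scores: dict with keys U, D, B, A, L (uniqueness, degree,
    betweenness, actuality, location); w: the matching weights w1..w5."""
    keys = ("U", "D", "B", "A", "L")
    return sum(wi * source_scores[k] for wi, k in zip(w, keys))

# an up-to-dateness-focused application might weight A (actuality) higher
score = rank({"U": 0.5, "D": 0.2, "B": 0.1, "A": 1.0, "L": 0.7},
             w=(0.4, 0.1, 0.1, 0.2, 0.2))
```

Normalizing each attribute to [0, 1] before combining keeps the weights comparable across attributes.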
      </sec>
      <sec id="sec-2-4">
        <title>3 Preliminary Experiments</title>
        <p>251 event sources were crawled from the Web and
Facebook event pages (using the Facebook Graph API),
resulting in more than 1500 events (after the unification of
duplicate events). All the crawled events were from
Budapest, including concerts, museums, galleries, etc. The
events were located mainly in the city center, as can be
seen in Figure 1.</p>
        <p>The final experiments on the uniqueness part of the
model were made on a dataset where data were crawled
from Facebook pages’ events and from clubs’ and museums’
websites. We had to consider all possible future cases,
so we also made test sources, like a complete copy of a
website’s data, partial copies, copies which are more
important than some Facebook pages and vice versa, etc. The
distinguisher is not rounded, because it should still be able
to differentiate between sources even if the difference
is small. In contrast, when we calculate the uniqueness
function on the sources, it is better to round
that number, because then we do not have to run too
many loops to smooth it.</p>
        <p>A part of the result on uniqueness is illustrated in
Figure 2. As can be seen, “bjc.hu” and its copy
“copyofbjc” do not have a distinguisher number, because they
do not hold any unique event, obviously because they are
copies of each other. So the highest number for non-unique
event holders is 0.5; if their events can be found in more
than one other source, the number decreases.</p>
        <p>Figure 3 shows an example from our experiments,
with seven sources and their events. It is obvious that
the events’ distances from their sources are very different.
Although these distances are not connected to the similarity of
the source and the event, they represent how similar the
events are. As we see in the middle, a couple of the events of
the big source in the middle are very far away from the
others, but they are also connected to the other two sources.
Those events are jazz lessons with a famous artist, while all
the other events are jazz concerts with different artists.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Conclusions</title>
      <p>A work-in-progress research was introduced in the paper,
focusing on the ranking and evaluation of event data sources.
The approach utilizes well-known centrality measures
from social network analysis, which is, to the best
of the authors’ knowledge, the first attempt at event source
evaluation.</p>
      <p>The proposed model is quite general and can be easily
modified to specific use-cases and domains. Experiments
on real-world data crawled from Budapest event websites
as well as Facebook pages show interesting results and
promising future research directions.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>The authors would like to thank T-Labs for the support and
environment provided for this research. The research was
conducted within the industrial project “Telekom Open
City Services” supported by Magyar Telekom Nyrt.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Adelberg</surname>
          </string-name>
          .
          <article-title>NoDoSE: a tool for semi-automatically extracting structured and semistructured data from text documents</article-title>
          .
          <source>ACM Sigmod</source>
          Record vol.
          <volume>27</volume>
          , no.
          <issue>2</issue>
          ., pages
          <fpage>283</fpage>
          -
          <lpage>294</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kayed</surname>
          </string-name>
          .
          <article-title>FiVaTech: Page-level web data extraction from template pages</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          , vol.
          <volume>22</volume>
          , no.
          <issue>2</issue>
          , pages
          <fpage>249</fpage>
          -
          <lpage>263</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Castillo</surname>
          </string-name>
          .
          <article-title>Effective Web Crawling</article-title>
          .
          <source>PhD thesis</source>
          , University of Chile, 11
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Jiang S. W. Liddle D. W. Lonsdale Y.-K. Ng D. W. Embley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Campbell</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>Conceptual-model-based data extraction from multiplerecord web pages</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          , vol.
          <volume>31</volume>
          , no.
          <issue>3</issue>
          , pages
          <fpage>227</fpage>
          -
          <lpage>251</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ion</given-names>
            <surname>Muslea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Steve</given-names>
            <surname>Minton</surname>
          </string-name>
          and
          <string-name>
            <given-names>Craig</given-names>
            <surname>Knoblock</surname>
          </string-name>
          .
          <article-title>Stalker: Learning extraction rules for semistructured, web-based information sources</article-title>
          .
          <source>Proceedings of AAAI-98 Workshop on AI and Information Integration</source>
          , pages
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kayed</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Fivatech2: A supervised approach to role differentiation for web data extraction from template pages</article-title>
          .
          <source>Proceedings of the 26th annual conference of the Japanese Society for Artifical Intelligence, Special Session on Web Intelligence &amp; Data Mining</source>
          , vol.
          <volume>26</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kushmerick</surname>
          </string-name>
          .
          <article-title>Wrapper induction for information extraction</article-title>
          .
          <source>PhD thesis</source>
          , University of Washington,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Frederic</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bruce</given-names>
            <surname>Cronin</surname>
          </string-name>
          .
          <source>Handbook of Research Methods and Applications in Heterodox Economics. Edward Elgar Publishing</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pu</surname>
          </string-name>
          , and W. Han.
          <article-title>Xwrap: an xml-enabled wrapper construction system for web information sources</article-title>
          .
          <source>In Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073)</source>
          .
          <source>IEEE Comput. Soc.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. Corchuelo P.</given-names>
            <surname>Jimenez</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Sleiman</surname>
          </string-name>
          .
          <source>Ariex: Automated ranking of information extractors. Knowledge-Based Systems</source>
          , vol.
          <volume>93</volume>
          , pages
          <fpage>84</fpage>
          -
          <lpage>108</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Boldi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastiano</given-names>
            <surname>Vigna</surname>
          </string-name>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu H. Wang L. Luo C. Yuan S. Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wei</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang. Nexir</surname>
          </string-name>
          :
          <article-title>A novel web extraction rule language toward a three-stage web data extraction model</article-title>
          .
          <source>Web Information Systems Engineering-WISE 2013</source>
          . Springer , pages
          <fpage>29</fpage>
          -
          <lpage>42</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen C. Yuan S. Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang. Autorm</surname>
          </string-name>
          :
          <article-title>An effective approach for automatic web data record mining</article-title>
          .
          <source>Knowledge-Based Systems</source>
          , vol.
          <volume>89</volume>
          , pages
          <fpage>314</fpage>
          -
          <lpage>331</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Schulz</surname>
          </string-name>
          , Jorg Lassig, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Gaedke</surname>
          </string-name>
          .
          <article-title>Practical web data extraction: Are we there yet? - a short survey</article-title>
          .
          <source>In 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI)</source>
          . IEEE, oct
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Furche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Grasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schallhart</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Sellers</surname>
          </string-name>
          .
          <article-title>Oxpath: A language for scalable, memory-efficient data extraction from web applications</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>11</issue>
          , pages
          <fpage>1016</fpage>
          -
          <lpage>1027</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Grasso O. Gunes X. Guo A. Kravchenko G. Orsi C. Schallhart-A. Sellers</surname>
          </string-name>
          <string-name>
            <given-names>T.</given-names>
            <surname>Furche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Diadem: domain-centric, intelligent, automated data extraction methodology</article-title>
          .
          <source>Proceedings of the 21st international conference companion on World Wide Web. ACM</source>
          , pages
          <fpage>267</fpage>
          -
          <lpage>270</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Pang-Ning</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Steinbach</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vipin</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <article-title>Introduction to Data Mining</article-title>
          . Addison Wesley, us ed edition, May
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>