    Towards capturing and preserving changes on
                 the Web of Data

               Jürgen Umbrich, Nina Mrzelj, and Axel Polleres
            Vienna University of Economics and Business, Vienna, Austria



        Abstract. Existing Web archives aim to capture and preserve the changes
        of documents on the Web and provide data corpora of high value which
        are used in various areas (e.g. to optimise algorithms or to study the Zeit-
        geist of a generation). So far, the Web archives concentrate their efforts
        to capture the large Web of documents with periodic snapshot crawls.
        Little focus is drawn to preserve the continuously growing Web of Data
        and actually keeping track of the real frequency of changes. In this work
        we present our efforts to capture and archive the changes on the Web of
        Data. We describe our infrastructure and focus on evaluating strategies
        to accurately capture the changes of data and to also estimate the crawl
        time for a given set of URLs, with the aim of optimally scheduling the
        revisiting of URLs with limited resources.


1     Motivation

Existing Web archives capture the history of Web documents and pre-
serve the Zeitgeist for different eras [12, 10]. The archived versions of the
documents are of high value and used in various areas. For instance, the
historical data can be used to study the evolution of the Web and al-
gorithms can be optimised to deal with changing and growing data. The
preserved data is also used by news reporters, politicians and sociologists.
So far, Web archives have concentrated their efforts on capturing the large
Web of documents, and little focus has been put on archiving the continuously
growing Web of Open Data. While established projects, such as the Inter-
net Archive1, capture some parts of the Web of Open Data sources, they
do not provide a fine-grained view of the history of these resources. We
found only around 20 preserved versions of the content of the DBPedia
URI for the city of Galway, while the Wikipedia article shows hundreds
of changes in its revision history.2
    There exist first, specialised activities to provide some archives
for the Web of Data, such as the Dynamic Linked Data observatory
project [7], which captures 90K Linked Data URIs once a week, or the
1
    https://archive.org/index.php
2
    https://web.archive.org/web/*/http://dbpedia.org/resource/Galway
DBPedia project, which converts Wikipedia articles to Linked Data and
releases infrequent dumps of its data [8]. Similarly, the FP-7 project Di-
achron aims to address in part the challenges of preserving evolving
data, with a strong focus on modelling context information and addressing
quality issues. To the best of our knowledge, there exists no specific effort
for an infrastructure to archive the diverse Web of Data. One challenge
is to deal with heterogeneous data formats (e.g., JSON, CSV or RDF)
and access mechanisms (e.g. HTTP with content negotiation or accessing
SPARQL endpoints).
    In this work, we propose and envision a distributed infrastructure to
capture and archive the evolving Web of Data. While we provide some
initial starting points to this infrastructure, which we have already imple-
mented and started to deploy, we argue that this infrastructure should be
a joint effort due to the underlying complexity and the resource demands,
which can be addressed by distributing the crawling and storage tasks. In the
remainder of this work we highlight the general envisioned infrastructure
in Section 2 and present our efforts for a flexible and extensible archiver
component in Section 3. We introduce various strategies and discuss our
results to accurately capture the changes of documents in Section 4 and
we introduce and evaluate a heuristic to estimate the crawl time for a
given set of URLs in Section 5. Eventually, we conclude and present fu-
ture directions in Section 6.

2   Infrastructure Overview

The necessary infrastructure to capture and preserve the data on the
Web of Data should be similar to a distributed Web crawler infrastruc-
ture [10]. The components of this infrastructure should be independent
and connected via a P2P network. We refer to the single unit of a web
crawler as an archiver, which consists of the typical components of a web
crawler and has the task to politely download and store the content of a
set of URLs. Each archiver maintains a set of URIs of data sources and
needs efficient methods to capture and store the changes in the content
of the data sources. This requires an additional scheduling component
which is able to adapt the re-crawling frequency of documents based on
their change history. In addition, we envision that the archiver units are
connected in a P2P network and that each single unit is able to correctly
compute the number of URLs it can handle with the available resources,
taking into account per-domain crawl delays, the number of URLs per
domain and the crawl frequency of each data source. Figure 1 depicts
such an infrastructure, where the connections between the archiver units
should indicate that each unit can exchange basic information, such as its
current capacity or which URLs are currently in the system. This informa-
tion is then used to decide to which unit new URLs should be added and
where to retrieve the versions of a particular URL.

          Fig. 1. Distributed archiving infrastructure: archiver units, each con-
          sisting of a scheduler, a crawler, a meta data store and an archive,
          connected in a P2P network.

We identify the following requirements for our envisioned infrastructure:
 – Distributed. The infrastructure needs to be distributed since no sin-
   gle party can provide the necessary resources to capture the changes
   of the Web of Data. Clearly, a company such as Google or maybe even
   the Internet Archive has the necessary infrastructure, but we do not
   see in the near future any efforts to capture the changes on the Web
   of Data. As such, preserving the Web of Data can be achieved if several
   parties join efforts, each archiving a manageable subset of documents
   based on the available resources. One further advantage of the dis-
   tributed architecture is that the URLs can be geographically distributed,
   which can improve the throughput [2].
 – URL partitioning. There is a need for a smart partitioning of the
   documents across the participating parties to guarantee an optimal
   use of the available resources. In addition, the infrastructure should
   ideally assign the same URL to two different parties to ensure that a
   document is crawled even if one party experiences problems or
   downtime.
 – Crawling of different formats using different access methods.
   The single archiver units should be able to crawl different formats
   using different access methods due to the diversity of the Web of
   Data. For instance, crawling RDF documents very often requires
   content negotiation to correctly access the RDF representation of a
   data source, and in the case of a SPARQL server the crawler needs
   to use the SPARQL protocol and SPARQL queries to download the
   content. One effort in this direction, which we build on, is the LD-
   Spider framework [6], which is also used in the Dynamic Linked Data
   observatory and provides mechanisms to request RDF representations
   of data sources (a minimal sketch of such content negotiation is given
   after this list).
 – Adaptive (re-)scheduling component. A core requirement is an
   adaptive (re-)scheduling component which is able to efficiently sched-
   ule the download frequency for a data source to capture and preserve
   (in an ideal case) all content changes. The efficiency and quality of
   such an infrastructure can be measured by the accuracy with which
   changes are captured. Ideally, the archive preserves a snapshot of a
   document whenever the content changes. There exists a plethora of
   work on rescheduling algorithms for the Web of documents, such
   as the influential work of Cho et al. [4, 5] or more recent work [15,
   13]. However, common to all those works is the assumption that doc-
   uments change according to a certain distribution such as the Poisson
   distribution, and to the best of our knowledge such distributions could
   not be verified for the changes on the Web of Data [7]. As such, we
   have to assume that the data sources on the Web of Data do not follow
   any particular change distribution.
 – Resource planning: We assume that each party has only a certain
   amount of resources available, which should be used in an optimal
   way. That is, the archiver needs to correctly estimate the number
   of URLs that can be processed given the specifics of the available
   resources. In addition, the resource planning component needs to also
   consider the change frequency of the data sources and the current re-
   crawling frequency. These challenges are closely related to the research
   area of index maintenance and freshness [11, 14], where the aim
   is to use the available resources to keep the content of a given set of
   documents as current as possible. However, one important difference
   in our assumption is that we want to estimate how many documents
   we can process to capture all changes. To do so, we first need to be
   able to compute the accurate scheduling frequency and the crawl time
   for a set of URLs. We will investigate the related efforts in future work
   but do not yet address this issue in this work.
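
To make the third requirement concrete, the following is a minimal sketch
(in Python, not the Java-based LDSpider used in our implementation) of how
an archiver unit might request the RDF representation of a Linked Data URI
via HTTP content negotiation; the Accept header values and the function
name are illustrative assumptions.

    # Illustrative sketch of content negotiation for RDF sources; not the
    # LDSpider-based implementation used by the archiver.
    import requests

    # Media types offered to the server; the exact list is a deployment choice.
    RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9, application/ld+json;q=0.8"

    def fetch_rdf(uri, timeout=30):
        """Request an RDF serialisation of `uri`, following redirects."""
        response = requests.get(uri, headers={"Accept": RDF_ACCEPT},
                                timeout=timeout, allow_redirects=True)
        return (response.status_code,
                response.headers.get("Content-Type", ""),
                response.content)

    # e.g. status, ctype, body = fetch_rdf("http://dbpedia.org/resource/Galway")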

3   Archiver, the datamonitor architecture
We started to develop the archiver component of our envisioned infras-
tructure and discuss the general architecture depicted in Figure 2. Our
archiver, called datamonitor, consists of the following loosely coupled com-
ponents:

          Fig. 2. Archiving infrastructure: cron-triggered scheduler and down-
          loader components built around a shared meta data store; the adaptive
          scheduler checks whether a URL was crawled, compares its content with
          previous crawl(s) and adapts the crawl schedule; downloaded data is
          archived under a YYYY/MM/DD/HH/domain layout.

 – The meta data store currently uses a Postgres database and stores
   all vital information such as crawl logs, scheduling information and
   also aggregated statistics. For instance, one crawl log entry contains
   information about the HTTP response header, the download time and
   a change state indicating whether the document changed compared to
   the last download, based on a content checksum. Another table in the
   meta data store contains information about the next crawl time and
   the current crawl frequency for each URI (an illustrative sketch of such
   records is given after this list).
 – The scheduler component periodically accesses the meta data store
   and retrieves the URLs due for the next crawl, which are then processed
   by the downloader component.
 – The downloader component retrieves the content of a list of sup-
   plied URLs and archives the downloaded data and collected meta
   data. The framework is a multi-threaded Web crawler which respects
   the robots.txt protocol3 and crawl delays and uses a politeness queue
   to avoid overloading the servers with too many requests [7]. Our im-
   plementation provides the flexibility to use various download handlers
   to deal with different access mechanisms or to use particular parsers
   for specific content formats.
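
To illustrate the information kept in the meta data store, the following
sketch shows the kind of records described above as Python dataclasses; the
field names are our assumptions and do not reproduce the actual Postgres
schema of datamonitor.

    # Illustrative records of the meta data store; field names are assumptions,
    # not the actual datamonitor schema.
    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class CrawlLogEntry:
        uri: str
        crawled_at: datetime
        http_status: int
        response_headers: dict     # selected HTTP response header fields
        download_time_ms: int
        content_checksum: str      # checksum over the downloaded payload
        changed: bool              # checksum differs from the previous download

    @dataclass
    class ScheduleEntry:
        uri: str
        crawl_interval: timedelta  # current re-crawl frequency
        next_crawl: datetime       # when the scheduler should emit the URI again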

    We decided that the downloader and scheduler components should run
at regular intervals to make the framework more robust, rather than building
an online system in which the downloader component listens for new URIs
to be crawled. Such an online design would require solutions to periodically
persist the current state of the system in order to recover it in case of
unexpected system crashes, e.g., due to memory exceptions or short-term
shutdowns. In our current design we
3
    http://www.robotstxt.org
can ensure that the different components run independently and at fixed
intervals and that a problem in one crawl does not affect the next crawls.
We decided to start our scheduler and crawler every hour. As such, we
should ideally schedule the crawl time of the URLs in such a way that
each crawl can be completed in less than one hour, to avoid an overlap of
crawls which could cause resource problems.
    Another feature of our architecture is that the components are flexible
enough to adapt to various scenarios. With the current architecture it is
easy to extend our checksum-based content change detection and to integrate,
for example, file-format-tailored algorithms or a graph isomorphism check
to detect changes in RDF files more accurately (see the sketch below).
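
As a sketch of this extension point, the default checksum comparison and a
possible RDF-specific comparator could look as follows; the use of rdflib's
graph isomorphism test is one possible realisation and not part of the current
implementation.

    # Sketch of pluggable change detection: checksum comparison as the default,
    # and a possible RDF-aware comparator; not part of the current datamonitor.
    import hashlib

    def checksum_changed(old_payload, new_payload):
        """Default detector: compare content checksums of two snapshots."""
        return (hashlib.sha1(old_payload).hexdigest()
                != hashlib.sha1(new_payload).hexdigest())

    def rdf_changed(old_payload, new_payload, fmt="turtle"):
        """Format-aware detector: parse both snapshots and test graph isomorphism."""
        from rdflib import Graph
        from rdflib.compare import isomorphic
        old_graph = Graph().parse(data=old_payload, format=fmt)
        new_graph = Graph().parse(data=new_payload, format=fmt)
        return not isomorphic(old_graph, new_graph)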

4     Rescheduling strategies

One of the core algorithms in this framework is the adaptive scheduler
which determines the next crawl time for URLs based on the content
change information and current download frequencies (see Figure 2). The
literature lists several approaches to reschedule crawls for HTML data
(e.g., [4, 3]), which mainly assume that the change rate of documents follows
a Poisson process with an average change rate of λ. However, there are
no confirmed studies that the assumption of a Poisson process also holds
for the sources on the Web of Data. In fact, researchers showed that the
change rates of Wikipedia articles do not follow a Poisson process [1]. As
such, we aim at a strategy which makes no assumptions about the change
frequency distribution.

4.1    Strategies
Next, we introduce various strategies which we evaluate in terms of how
accurately they capture the changes of documents. Figure 3 depicts dif-
ferent strategies and how they capture or miss changes of a document.
The top row shows the actual time points for the changes of a document
over 20 time units with an average change rate of λ = 0.5 changes per
time unit (10 changes in 20 time units). Below are two different strate-
gies; the middle one accesses the document with a change frequency of λ
and the bottom one simulates a simple adaptive strategy which increases
the download frequency if we observe more changes and decreases the
frequency if we observe fewer changes. On the side we show the numbers of
changes captured and downloads performed. We can see that the middle
strategy is overall better since the bottom strategy performs three more
downloads to capture the same number of changes.
          Fig. 3. Visualisation of various change strategies over 20 time units
          (real: 10 changes; avg: 8 changes captured with 10 lookups; adaptive:
          8 changes captured with 13 lookups).


    Next, we present our strategies to decide whether to increase or decrease
the crawl frequency for a document based on the history of observed changes.
The crawl frequency for all strategies is bounded by a minimum and a
maximum value.
fix and dyn: Dynamic crawl frequency adaptation based on a
   fixed number of snapshots. This strategy determines whether the crawl
   frequency should be changed based on a fixed number of the last observed
   snapshots. First we determine how many observed snapshots with
   the same crawl frequency will be considered. We use two different
   methods:
   fix uses the last two snapshots
   dyn determines the number of snapshots based on the current crawl
       frequency. We use one snapshot if the frequency is higher than 2
       months, 2 snapshots if it is higher than 1 month, 3 snapshots if
       higher than 1 week and 4 snapshots if higher than 1 day.
   Next, we increase the frequency if the document changed in all snap-
   shots and decrease the crawl frequency if the document did not change,
   otherwise we do not adapt the frequency. The actual frequency change
   depends again on the current frequency: We increase the frequency by
   a factor of 1.5 in case the old crawl frequency is higher than 1 month
   and use a factor of 2 otherwise. Similarly, we decrease the frequency
   by a factor of 1.5 in case the current crawl frequency is lower than 1
   month, otherwise we decrease by a factor of 2.
window: Dynamic crawl frequency adaptation based on a window
   of snapshots: This strategy is similar to the previous one. In contrast,
   however, we compute the ratio of detected changes divided by the number
   of snapshots. We consider either the last ten snapshots or half of the
   available snapshots. We use the following formula to compute the new
   frequency based on the computed ratio: If the ratio r is higher than
   0.9 we increase the frequency by dividing the current frequency by a
   factor of 3, if r > 0.75 by a factor of 2 and if r > 0.6 b a factor of
   1.5. Similarly, we decrease the frequency by multiplying the current
   frequency by a factor of 3 if r < 0.1, by a factor of 2 if r < 0.25 and
   by a factor of 1.5 if r < 0.4. The general idea behind this is that the
   higher the ratio the more we increase the frequency and the lower the
   ratio the more we decrease the frequency.
state-1 and state-2: Dynamic crawl frequency adaptation based
   on state change probabilities. In this strategy we apply a Markov
   chain approach to determine the likelihood of observing a change in
   the next download, considering the past observed changes. We sepa-
   rately maintain the knowledge about the sequence of observed states
   for each crawl frequency. We use this information to compute the prob-
   ability that the document changes in the next download based on this
   state-change knowledge base. Currently, we have two variations of the
   Markov models:
   state-1 considers only the last state of a document to compute the
       likelihood of the next state. Table 1 shows an example knowledge

               Table 1. Example of occurrence of different states

                                      n−1
                                    n 0 1
                                    0 10 50
                                    1 40 10



      base in which we store how often two change states appeared ('0'
      indicates no change and '1' indicates a document change). For in-
      stance, the value (n = 1, n − 1 = 1) indicates that we observed 10
      times that the document changed twice in a row, and the value
      (n = 1, n − 1 = 0) indicates that we observed 40 times a change after
      a non-change. We use this information to determine how likely it is
      that the document changes in the next download; for instance, con-
      sidering our example table: P(1|0) = 40/50 = 0.8.
  state-2 considers the last two change states of a document to com-
     pute the likelihood of the next change state. In contrast to state-1,
     we store the information about the occurrences of the last two
     change states.
   Eventually, we compute the new crawl frequency based on the com-
   puted likelihood of observing a change in the next crawl, using the
   same frequency increase/decrease factors as in the window strategy
   (a minimal sketch of the state-1 variant is given after this list).
gold Gold standard. We also use a reference strategy which downloads
   the documents according to their overall change frequency, which we
   assume is known in advance. To do so, we calculated the average change
   frequency for each page based on the actual Wikipedia revision his-
   tory.
week In addition, we simulate a crawl with a fixed frequency of 1 week.
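
The listing below gives a minimal sketch of the state-1 variant, using the
increase/decrease factors of the window strategy and the 1 day/6 months
bounds used in our evaluation; for brevity it keeps a single transition table
instead of one table per crawl frequency as described above.

    # Minimal sketch of the state-1 strategy: count observed state transitions
    # (0 = unchanged, 1 = changed), estimate P(change | last state) and adapt
    # the crawl interval (a smaller interval means a higher crawl frequency).
    from collections import defaultdict

    MIN_INTERVAL, MAX_INTERVAL = 1.0, 180.0   # days: 1 day to roughly 6 months

    class StateOneScheduler:
        def __init__(self):
            self.counts = defaultdict(int)    # counts[(previous_state, next_state)]
            self.last_state = None

        def observe(self, changed):
            state = int(changed)
            if self.last_state is not None:
                self.counts[(self.last_state, state)] += 1
            self.last_state = state

        def change_probability(self):
            """P(change in the next download | last observed state)."""
            prev = self.last_state
            total = self.counts[(prev, 0)] + self.counts[(prev, 1)]
            return self.counts[(prev, 1)] / total if total else 0.5

        def next_interval(self, interval):
            """Adapt the crawl interval using the window-strategy factors."""
            p = self.change_probability()
            if p > 0.9:
                interval /= 3
            elif p > 0.75:
                interval /= 2
            elif p > 0.6:
                interval /= 1.5
            elif p < 0.1:
                interval *= 3
            elif p < 0.25:
                interval *= 2
            elif p < 0.4:
                interval *= 1.5
            return min(max(interval, MIN_INTERVAL), MAX_INTERVAL)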

4.2     Evaluation
We evaluate the performance of our strategies with two measures from the
Information Retrieval area, namely recall and precision. In an ideal case
we would capture every change of a document shortly after it happens;
we would then capture all changes and require only the same number of
downloads as changes. As such, we compute the recall of a strategy by
counting the changes we observed and dividing it by the number of changes
that actually happened. The precision is defined as the number of observed
changes divided by the number of downloads.
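
Restating these definitions compactly (notation ours):

    \mathrm{recall} = \frac{\#\,\text{captured changes}}{\#\,\text{actual changes}},
    \qquad
    \mathrm{precision} = \frac{\#\,\text{captured changes}}{\#\,\text{downloads}}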

Data We use the revision history of Wikipedia changes as our corpus to
test various strategies for adapting the re-crawl frequency for documents
to capture and preserve their changes. The reason for this is the already
mentioned observation that the change rates of Wikipedia articles do not
follow any particular distribution and that the revision histories of the
articles span several years, providing us with a large enough time span to
test our strategies.
    To collect our evaluation corpus, we performed the following steps:
first, we randomly selected Wikipedia articles and gathered their revision
history using the provided API.4 Next, we filtered out articles for which
the revision history is shorter than 3 years and discarded revisions which
are declared as minor revisions. We then grouped the articles based on
their average change rate, which is computed as the number of revisions
divided by the history time. Eventually, we use the revision histories of
2660 articles with varying average change frequencies, each providing a
history of more than 3 years. Table 2 provides a detailed overview of the
distribution of documents according to their change frequencies. We
aimed to select 300 documents for each of the subsets.
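
The selection can be summarised by the following sketch, which assumes each
revision history has already been fetched from the Wikipedia API as a list of
(timestamp, is_minor) pairs; the helper name and the exact rounding are our
own.

    # Sketch of the corpus selection: discard minor revisions, require more than
    # three years of history and compute the average change interval (the inverse
    # of the average change rate used in the text).
    from datetime import datetime, timedelta

    MIN_HISTORY = timedelta(days=3 * 365)

    def average_change_interval(revisions):
        """revisions: list of (timestamp: datetime, is_minor: bool) pairs."""
        major = sorted(ts for ts, is_minor in revisions if not is_minor)
        if len(major) < 2 or major[-1] - major[0] < MIN_HISTORY:
            return None                      # history too short for the corpus
        return (major[-1] - major[0]) / len(major)

Articles are then grouped into the buckets of Table 2 according to this value.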

4.3     Results
We simulated a crawl for each document over 3 years with the various
strategies and present the aggregated recall and precision values for the
4
    http://en.wikipedia.org/w/api.php
                 Table 2. Overview of the dynamicity of our data corpus

                      Avg. change frequency   Number of documents
                             7d > λ > 2d                      188
                            14d > λ > 7d                      295
                            1m > λ > 14d                      300
                             2m > λ > 1m                      300
                             4m > λ > 2m                      300
                             6m > λ > 4m                      300
                                  λ > 6m                      286
                                   total                     2660



different sets of documents (see Table 2) and for all 2660 documents. We
fixed the minimum crawl time to 1 day and the maximum to 6 months.
    Figure 4 shows how the various strategies perform across all docu-
ments. In general, we observe that all strategies trade recall for precision
or vice versa. We also see that the strategies perform overall very simi-
larly and that only the week and window strategies show two extreme
behaviours.

                     gold                   dyn         state-2                                   gold               dyn          state-2
                     week               window                                                    week           window
                       fix               state-1                                                    fix           state-1
          0.95                                                                          0.9
           0.9                                                                          0.8

          0.85                                                                          0.7
                                                                                        0.6
                                                                            precision




           0.8
 recall




                                                                                        0.5
          0.75
                                                                                        0.4
           0.7                                                                          0.3
          0.65                                                                          0.2
           0.6                                                                          0.1
                 0           5     10       15    20    25        30   35                     0       5     10      15      20   25         30   35
                                             month                                                                    month

                                   (a) Recall                                                              (b) Precision

                                         Fig. 4. Recall and precision for all documents.



in terms of precision. This is to be expected since 92% of our documents
have an average change frequency of more than 1 week, so it is very likely
that we capture most of the changes; however, the strategy also requires
a significant number of unnecessary lookups. Next, we can see that the
probabilistic strategies state-1 and state-2 are the second best in terms of
recall and second worst in terms of precision. The fixed snapshot strate-
gies fix and dyn closely follow the strategies with the Markov models.
The window strategy, in contrast, trades recall for a high precision. An
interesting observation is also that our strategies seem to get better over
time in terms of precision, which is one of the desired effects and indicates
that the adaptive rescheduling can improve the overall performance.
    Next, we discuss the results for selected sets of documents. Figure 5
shows the precision and recall for documents with an average change
frequency between 2 days and 1 week. We see that only the fix and
window strategies show extreme behaviours.

          Fig. 5. Recall and precision for documents with 7d > λ > 2d: (a) recall,
          (b) precision, over 36 months.

The probabilistic strategies
(state-*) are close to the performance of the actual change frequency
(gold), indicating that with our current parameters we are able to predict
the actual change frequency of documents with good accuracy.
    Figure 6 shows the results for the documents with λ between 4 and
6 months. Surprisingly, the window strategy performs very similarly to
the actual change frequency, while the other strategies achieve a high
recall but with low precision. Again, we can observe that the probabilistic
heuristics perform better in terms of recall and worse in terms of precision
than the fixed snapshot strategies.
    We observe similar patterns for the other sets of documents but
unfortunately have to omit the plots due to space limitations.

4.4             Discussion
Overall, we can conclude that so far the window strategy should not be
considered due to its very low recall values. The main difference between
the fixed snapshot and probabilistic strategies is that the former have a
smaller trade-off between recall and precision. However, the strategies
using the Markov model perform very similarly for all subsets, while the
other strategies can have different trade-offs for different subsets. As such,
we will in future work focus on studying rescheduling algorithms based on
Markov models in more detail and will, for now, also use them in our
architecture.

          Fig. 6. Recall and precision for documents with 6m > λ > 4m: (a) recall,
          (b) precision, over 36 months.


5         Crawl resource estimation

The second experiment evaluates a heuristic to estimate the crawl time
given a set of URLs and the information gathered from previous crawls.
The problem we address is to compute the crawl time for a given number
of URLs using a maximum number of threads. We first group the set of
URLs by their domain and estimate the crawl time and number of threads
we can use to crawl each domain based on the crawl delay. We use the
average download time for a document for a given domain and the general
crawl-delay per domain. This allows us to estimate crawl times for URLs
which are not yet in our system; however, we also need to assign default
times and delays, which are currently set to 1 second for the domain crawl
delay and to 2 seconds for the average crawl time.
    We compute the maximum number of threads to crawl the URLs
of one domain by dividing the average crawl time c by the domain-
delay d (threads = c/d). Figure 7 shows an example of a crawl process
for 6 documents with a given domain-delay (on the top axis) and an
average document crawl time which is more than three times the delay.
The resulting number of threads in this scenario is 3 and the process
would be as follows: Thread 1 (t1) starts to crawl doc1 and after the
delay time another thread can crawl the next document (t2 and doc2),
and so on.

          Fig. 7. Crawling URLs of one domain with multiple threads: thread t1
          downloads doc1 and doc4, t2 downloads doc2 and doc5, and t3 downloads
          doc3 and doc6, with each new request starting one crawl delay after the
          previous one.

    Once we have the number of threads, we compute the total crawl time as
crawltime = (docs × c)/threads + d × (threads − 1), where docs is the number
of URLs for the domain. We map
the estimation of the total crawl size in this multi-threaded scenario to
a bin packing problem and apply a first-fit bin-packing algorithm [9].
The bin size corresponds to the crawl time and we distribute the URLs
per domain over the bins. We can derive the total number of required
threads and the total crawl time once the algorithm terminates. Next, we
briefly describe the algorithm: initially, we set the size of the bins to
the maximum crawl time over all domains. Next, we process the domain
groups in descending order by the number of threads and assign each
domain to the bin which minimises the bin size, or create a new bin in
case the domain does not fit in any bin (that is, the current size of
the bin plus the crawl time for the domain would exceed the total bin
size). After we have processed all domains, we sum up the maximum number
of domain threads per bin to compute the total number of threads needed
to crawl all URLs. If the total number of threads exceeds our maximum
number of threads5 we rerun the whole process and decrease the maximum
number of threads that can be assigned to one domain. As such, in each
round we slightly increase the overall crawl time since we have to assign
fewer threads to each domain. Once we find a packing for which the total
number of threads is less than or equal to our maximum number, we take
the bin size (the longest crawl time for a domain) as the estimated crawl
time. A sketch of this procedure is given below.
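
The following sketch outlines this estimation procedure; where the
description above leaves details open (rounding of the thread count, the
choice of bin when several fit, the step size for reducing the per-domain
thread cap) the sketch makes simple assumptions.

    # Sketch of the crawl time estimation: per-domain thread and time estimates,
    # followed by first-fit packing of domains into bins until the total number
    # of threads fits the available maximum.
    import math

    DEFAULT_DELAY = 1.0        # default per-domain crawl delay (seconds)
    DEFAULT_AVG_TIME = 2.0     # default average download time per document (seconds)

    def domain_estimate(n_docs, avg_time=DEFAULT_AVG_TIME, delay=DEFAULT_DELAY, cap=None):
        """Threads and crawl time for one domain (rounding is our assumption)."""
        threads = max(1, math.floor(avg_time / delay))
        if cap is not None:
            threads = min(threads, cap)
        crawl_time = n_docs * avg_time / threads + delay * (threads - 1)
        return threads, crawl_time

    def estimate_crawl(domains, max_threads):
        """domains: list of (n_docs, avg_time, delay) per domain.
        Returns (total threads, estimated crawl time)."""
        cap = None
        while True:
            estimates = sorted((domain_estimate(n, t, d, cap) for n, t, d in domains),
                               key=lambda e: e[0], reverse=True)  # by thread count
            bin_size = max(ct for _, ct in estimates)             # longest domain crawl
            bins = []                                             # [used time, max threads]
            for threads, crawl_time in estimates:
                target = next((b for b in bins if b[0] + crawl_time <= bin_size), None)
                if target is None:
                    bins.append([crawl_time, threads])
                else:
                    target[0] += crawl_time
                    target[1] = max(target[1], threads)
            total_threads = sum(b[1] for b in bins)
            if total_threads <= max_threads or cap == 1:
                return total_threads, bin_size
            # too many threads: cap the per-domain thread count and retry
            cap = max(1, max(t for t, _ in estimates) - 1)

Reducing the per-domain thread cap in each retry reproduces the described
behaviour that every round slightly increases the estimated crawl time.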

5.1     Evaluation

We performed first experiments with our crawl time estimation algo-
rithm based on 1388 crawls, with crawl times ranging from less than
30 minutes to over 2 hours. We evaluate by how much we over- or un-
derestimated the actual crawl time. Table 3 shows the average under- or
overestimation fraction for the various crawls, grouped by their actual
crawl time. We can see that our current heuristic tends to overestimate
the crawls by a large factor. In contrast, the underestimation is very small
considering the overall crawl times. For instance, we underestimate the
crawls which
5
    The maximum number of threads depends on the available hardware.
take around 60 minutes by around 10 minutes. Overall, the results are not
entirely satisfying and we will investigate improved algorithms in future
work. However, the results are already encouraging and actually usable for
the overall scheduling of URLs based on their re-crawl frequency, with the
goal to schedule the URLs in such a way that each crawl takes around one
hour. The results show that we tend to underestimate such crawls, resulting
in overlaps of an average of 10 minutes for two consecutive crawls. Over-
estimating the crawl time, in contrast, does not cause trouble for the
scheduling; however, due to the high overestimation we could add more
URLs to the crawls and currently do not use the available resources
optimally.

                    Table 3. Results of crawl time estimation

    Actual crawl time Overestimate (crawls) Underestimate (crawls) total crawls
           t < 30min       78% (202)              -16% (154)           356
  30min < t < 60min        134% (25)               -18% (46)           71
 60min < t < 120min        638% (19)               -11% (51)           70
120min < t < 180min        1.3% (15)               -1% (215)           230
180min < t < 240min           - (0)                 -36% (4)            4
         t > 240min           - (0)                -44% (28)           28




6   Conclusion
In this initial work we presented our vision of an infrastructure to pre-
serve and archive the changes on the Web of Data and outlined some
core requirements for such an infrastructure. In addition, we presented
our efforts in developing and researching the core component of this in-
frastructure and conducted two initial experiments. The first experiment
evaluated various heuristics to reschedule the crawl frequency of URLs to
accurately capture the changes of the data sources. Our findings indicate
that a strategy based on state-change transition probabilities provides
promising results and an indication to concentrate more efforts on this in
the future. The second experiment evaluated a simple first-fit bin packing
algorithm to estimate the crawl time for a set of URLs based on their
average domain download and crawl-delay times. Our results show that
the average underestimation is very low; however, the approach tends to
overestimate the crawl time by a large margin.
    Our future work will concentrate on improving both heuristics with
the goal of developing algorithms to optimally schedule and reschedule
the crawl time of URLs so that the system uses the available resources
in an optimal way and is able to capture the changes of the sources on
the Web of Data as timely and accurately as possible.
    Acknowledgements This work was partially funded by the "Jubiläumsfond der
Stadt Wien 2014".


References
 1. Rodrigo Almeida, Barzan Mozafari, and Junghoo Cho. On the evolution of
    wikipedia. In Proceedings of the First International Conference on Weblogs and
    Social Media, ICWSM 2007, Boulder, Colorado, USA, March 26-28, 2007, 2007.
 2. B. Barla Cambazoglu, Vassilis Plachouras, Flavio Junqueira, and Luca Telloli.
    On the feasibility of geographically distributed web crawling. In Proceedings of
    the 3rd International Conference on Scalable Information Systems, InfoScale ’08,
    pages 31:1–31:10, Brussels, Belgium, 2008. ICST.
 3. Carlos Castillo, Mauricio Marín, M. Andrea Rodríguez, and Ricardo A. Baeza-
    Yates. Scheduling algorithms for web crawling. In (WebMedia & LA-Web 2004),
    12-15 October 2004, Ribeirao Preto-SP, Brazil, pages 10–17, 2004.
 4. Junghoo Cho and Hector Garcia-Molina. The evolution of the web and implications
    for an incremental crawler. In VLDB 2000, September 10-14, 2000, Cairo, Egypt,
    pages 200–209, 2000.
 5. Junghoo Cho and Hector Garcia-Molina. Estimating frequency of change. ACM
    Trans. Internet Techn., 3(3):256–290, 2003.
 6. Robert Isele, Jürgen Umbrich, Christian Bizer, and Andreas Harth. LDspider: An
    open-source crawling framework for the Web of Linked Data. In Posters&Demos
    at International Semantic Web Conference (ISWC). 2010.
 7. Tobias Käfer, Ahmed Abdelrahman, Jürgen Umbrich, Patrick O’Byrne, and Aidan
    Hogan. Observing Linked Data dynamics. In Extended Semantic Web Conference
    (ESWC), pages 213–227. Springer, 2013.
 8. Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas,
    Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören
    Auer, and Christian Bizer. Dbpedia - A large-scale, multilingual knowledge base
    extracted from wikipedia. Semantic Web, 6(2):167–195, 2015.
 9. Rhyd Lewis. A general-purpose hill-climbing method for order independent mini-
    mum grouping problems: A case study in graph colouring and bin packing. Com-
    puters & OR, 36(7):2295–2310, 2009.
10. Julien Masanès. Web archiving. Springer, 2006.
11. Yadu Nagar and Niraj Singhal. A users search history based approach to manage
    revisit frequency of an incremental crawler. International Journal of Computer
    Applications, 63(3):18–22, February 2013.
12. Jinfang Niu. An overview of web archiving. D-Lib Magazine, 18(3/4), 2012.
13. Kira Radinsky and Paul N. Bennett. Predicting content change on the web. In
    Sixth ACM International Conference on Web Search and Data Mining, WSDM
    2013, Rome, Italy, February 4-8, 2013, pages 415–424, 2013.
14. Marc Spaniol, Dimitar Denev, Arturas Mazeika, Gerhard Weikum, and Pierre
    Senellart. Data quality in web archiving. In Proceedings of the 3rd ACM Workshop
    on Information Credibility on the Web, WICOW 2008, Madrid, Spain, April 20,
    2009, pages 19–26, 2009.
15. Qingzhao Tan and Prasenjit Mitra. Clustering-based incremental web crawling.
    ACM Trans. Inf. Syst., 28(4):17, 2010.