About Crawling Scheduling Problems

Andrey A. Pechnikov (1), Denis I. Chernobrovkin (2), Anthony M. Nwohiri (3)

(1) Institute of Applied Mathematical Research of Karelian Research Centre of the Russian Academy of Sciences, Petrozavodsk, Republic of Karelia, Russia
(2) DataArt Inc, St. Petersburg, Russia
(3) Department of Computer Science, Faculty of Science, University of Lagos, Nigeria
anthony.nwohiri@outlook.com

Abstract

This paper investigates the task of scheduling jobs across several servers in a software system similar to an Enterprise Desktop Grid. A distinguishing feature of this system is its specific purpose: it collects outbound hyperlinks for a given set of websites. The target set is scanned continuously (and regularly) at certain time intervals. Data obtained from previous scans are used to construct the next scanning task so as to enhance efficiency (here, to shorten the scanning time). A mathematical model for minimizing the scanning time in a batch mode is constructed, and an approximate algorithm for solving the model is proposed. A series of experiments is carried out in a real software system. The results of the experiments enabled us to compare the proposed batch mode with the well-known round-robin mode, revealing the advantages and disadvantages of the batch mode.

Keywords: Enterprise Desktop Grid, Web Crawling, Round-Robin Scheduling, Batch Job Scheduling.

1 Introduction

Construction and study of web fragments fall under the web science field known as link analysis [The04]. Web crawlers (sometimes called web spiders, internet bots or search robots) play an essential role here. These are specialized programs that traverse web pages in a systematic manner to retrieve required information. In general, crawling can be considered a multi-purpose search problem with some constraints.
The task becomes very challenging due to the large variety of objective functions, together with the lack of relevant knowledge about where to search [PSM04].

The BeeBot crawler is described in [PC14]. This robot traverses the web and collects information about outbound (external) hyperlinks from a set of websites defined by direct enumeration. The crawler is implemented in C# for .NET Framework 4.0, uses Microsoft SQL Server 2008 R2, and runs on a standard personal computer under Windows XP or higher.

The capabilities of the BeeBot crawler can be enhanced by making it both parallel and distributed in nature [KK12]. This can be achieved using an approach similar to that used in Enterprise Desktop Grid systems. Round-robin is one of the best-known algorithms employed by process and network schedulers for such tasks. In this algorithm, load is assigned to a computing system by enumerating its elements and ordering them in a circular fashion [Kle64].

In our problem statement, we proceed from the following assumptions:

• the set of scanned sites is large enough and is constantly (and regularly) scanned at certain intervals;
• there are several computers integrated into an Enterprise Desktop Grid;
• the cost of scanning each site on each computer is known (or can be calculated).

Hence, the task involves forming batch jobs (one per server) that satisfy a certain optimality criterion for the system as a whole. This paper proposes an algorithm that creates batch jobs so as to minimize the running time of the system. The algorithm is based on solving a minimax optimization problem.

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: E. Ivashko, A. Rumyantsev (eds.): Proceedings of the Third International Conference BOINC:FAST 2017, Petrozavodsk, Russia, August 28 - September 01, 2017, published at http://ceur-ws.org
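For reference, circular-order assignment can be sketched as follows. This is a minimal illustration, not the BeeBot-EDG implementation; the site names are invented. Sites are dealt to clients in turn, ignoring site sizes and client speeds:

```python
def round_robin(sites, num_clients):
    """Assign each site to a client in circular order."""
    batches = [[] for _ in range(num_clients)]
    for i, site in enumerate(sites):
        batches[i % num_clients].append(site)
    return batches

batches = round_robin(["a.ru", "b.ru", "c.ru", "d.ru", "e.ru"], 2)
# batches -> [["a.ru", "c.ru", "e.ru"], ["b.ru", "d.ru"]]
```

The simplicity of this scheme is also its weakness: a client that happens to receive several large sites finishes much later than the others, which is exactly what the batch-job formulation below tries to avoid.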
A series of experiments showed that the proposed algorithm is faster than the round-robin algorithm.

2 Parallel distributed crawler: BeeBot-EDG

The name BeeBot-EDG combines the crawler name BeeBot with the abbreviation EDG (Enterprise Desktop Grid). The parallel distributed crawler BeeBot-EDG was developed based on the principles behind the well-known and popular grid computing platform BOINC [a6]. Accordingly, BeeBot-EDG and BOINC have similar management and operation principles.

BeeBot-EDG has a client-server architecture. The two components (server and client) will be discussed in more detail. Briefly, the general mode of operation can be represented as follows: client crawlers are launched by users on the working machines of an Enterprise Desktop Grid and connect to the master server. The server then completely manages the scanning process: clients execute only the batch jobs sent to them by the server and send the data they receive back to the server.

Microsoft's .NET Framework component stack (.NET Framework 4.6.1 and the C# 6.0 programming language) was used as the software framework. The WPF (Windows Presentation Foundation) graphical subsystem, the MVVM (Model-View-ViewModel) architectural pattern and the MVVM Light Toolkit library were used to render the graphical user interface (GUI). Microsoft SQL Server Express was used to store the data obtained, and the Entity Framework object-relational mapper provided easy access to the database. A WCF (Windows Communication Foundation) duplex service over the TCP/IP protocol (a lower-level alternative to the currently popular data transfer via the high-level HTTP protocol) was used for client-server data exchange. The HTMLAgilityPack library provided convenient syntactic parsing of the information contained in downloaded web pages.
3 Measuring client performance and site size: challenges

Constructing a mathematical model of the computing process in BeeBot-EDG requires identifying the controlled and uncontrolled parameters of the model [Ack78]. In this section, we discuss the uncontrolled parameters and the challenges inherent in their measurement. Running the BeeBot program on a client computer requires various system resources. The uncontrolled system performance parameters are the following:

• response time of the BeeBot-EDG server,
• time taken to load the batch job from the server to the client,
• time taken to transfer the results file from the client to the server,
• DNS server response time,
• TCP connection time,
• web server response time,
• web page download time,
• client computer speed.

Because of these parameters, the client's performance is difficult to evaluate by any single integral parameter. Computing performance, network speed, and even the geographical location of the computer are all important. It also matters which applications (and how many) are being executed on the client's computer when BeeBot is launched. Obviously, measuring the performance of BeeBot-EDG's client computers for each parameter would require additional system resources, which would be better channeled into the scanning process. In fact, what we face here is a performance testing problem for a software and hardware complex [KFNH99], which is clearly outside the scope of this study. One possible solution is an approach we call the use of a reference site.
The reference site must meet a number of requirements, namely:

• the website should be hosted on a dedicated server that hosts only this website and receives no other load from third-party applications while the reference site is being scanned;
• the site should be geographically located in the same region as the set of sites being investigated;
• the site should have a constant number of pages over a certain scanning period;
• the site structure should be similar to those of the investigated sites;
• the number of web pages should be close to the average value for the investigated set;
• the content of the site should be similar in nature to the contents of the investigated sites.

Next, the reference site is scanned with the client crawlers and the scanning time is recorded. It is the relative (not absolute) values obtained that are relevant here: they can be used in algorithms for solving optimization problems.

During the crawler's scanning process, the following characteristics and parameters of the scanned site are of great importance: domain name, site size (number of pages) and the quantity of data (number of external hyperlinks) collected via scanning. Obviously, the size of a site and the number of external hyperlinks collected from it can be found only by scanning that site. We proceed from the fact that the investigated set of sites is constantly (and regularly) scanned at certain time intervals. Therefore, by scanning a given set of sites once in round-robin mode and recording the data obtained (number of webpages and external hyperlink count for each site), the recorded values can be used in the next scanning. Note that, because of the obvious dynamics of the web, the current values at the next scanning can differ significantly from the previous ones.
In this case, the results obtained from multiple consecutive scans of a site can be treated as a dynamic (time) series for such parameters as site size and external hyperlink count [Ken76].

4 Batch job creation

In this study, a job is a request to one of the clients to scan a given website, indicating the main scanning parameters. Let Z = {1, 2, ..., n} be the set of all jobs. The uncontrolled parameters of job i ∈ Z are:

1. dname_i – domain name of the website (a character string without indication of the protocol and ports);
2. P_i – site size;
3. h_i – size of the file containing the outbound hyperlinks pointing out of site i; this file is obtained after scanning the site and is returned to the server by the client.

The kilobyte (KB) is the unit of measurement for the second and third parameters.

Let a job subset U ⊆ Z be a batch job (or simply a batch). The size of all the websites in a batch equals the sum of the sizes of the websites of the jobs included in the batch, that is, ∑_{z∈U} P_z, where z is a job included in batch U.

Let C = {1, 2, ..., m} be the set of clients. We introduce the following performance parameters for client j ∈ C:

tos_j – response time of the BeeBot-EDG server for the j-th client,
vsc_j – speed at which a batch job is transferred between the BeeBot-EDG server and the j-th client,
tdns_j – response time of the DNS server for the j-th client,
ttcp_zj – time to establish a TCP connection between the j-th client and the web server hosting the site contained in job z,
tweb_zj – response time, for the j-th client, of the web server hosting the site contained in job z,
vweb_zj – speed at which the content of the site contained in job z is downloaded from the web server to the j-th client,
v_j – computer speed of the j-th client.

We take KB/s (kilobytes per second) as the unit of measurement for speeds, and s (seconds) as the unit of measurement for times and response times.
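The job records above can be sketched as plain data structures. This is a Python illustration only; the field names follow the paper's notation, and the sample values are invented:

```python
from dataclasses import dataclass

@dataclass
class Job:
    dname: str  # domain name, without protocol or ports
    P: float    # site size, KB
    h: float    # size of the outbound-hyperlinks file, KB

def batch_size(batch):
    """Size of a batch: the sum of the site sizes P_z of its jobs."""
    return sum(job.P for job in batch)

batch = [Job("a.example", P=1200.0, h=35.0), Job("b.example", P=800.0, h=20.0)]
# batch_size(batch) -> 2000.0
```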
Let U_j be the batch sent by the server to the j-th client for execution. The time spent by the j-th client to execute batch U_j consists of the following three components:

• time taken to receive the batch job:

  tin_j = tos_j + q_j / vsc_j,

  where q_j is the size (in KB) of the file containing the list of domain names of the sites that make up batch U_j;

• time taken to process the batch job:

  twork_j = ∑_{z∈U_j} (tdns_j + ttcp_zj + tweb_zj + P_z / vweb_zj + P_z / v_j);

• time taken to send the results of processing batch U_j to the server:

  tout_j = (∑_{z∈U_j} h_z) / vsc_j.

So, the total time spent by client j to execute batch U_j is

  T_j = tin_j + twork_j + tout_j.

To generalize the problem statement, we assume that if the server does not send a batch to client j, the client receives the empty batch U_j = ∅. In the general case, we are dealing with a partitioning of set Z into a set of disjoint subsets U = {U_1, U_2, ..., U_m} such that

  ∀i, j ∈ C, i ≠ j: U_i ∩ U_j = ∅,   (1)

  U_1 ∪ U_2 ∪ ... ∪ U_m = Z.   (2)

Let U^m be the set of all families U = {U_1, U_2, ..., U_m} satisfying (1) and (2). The essence of the problem is to choose the partition that minimizes the execution time of the client with the maximum execution time. This can be formally expressed as follows:

  max{T_1, T_2, ..., T_m} → min over U ∈ U^m.   (3)

Optimization problem (1)–(3) belongs to the large class of set partitioning problems. As a rule, an optimal solution in such cases can be obtained only by complete enumeration. In our case, the complexity of a complete enumeration algorithm is of the order of O(m^n).

5 Reduced problem and heuristic algorithm for its solution

We will further apply a combination of both approaches. We assume that all client computers are approximately on an equal footing with respect to the external servers and communication channels used.
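The O(m^n) complete-enumeration bound can be illustrated with a small brute-force sketch. The per-job costs below are invented; in the model above, cost[z][j] would stand for the summand of twork_j contributed by job z on client j:

```python
from itertools import product

def best_partition(cost, m):
    """Try all m**n assignments of n jobs to m clients and return the
    minimum achievable makespan (the minimax objective of problem (3)).

    cost[z][j] is the time for client j to execute job z."""
    n = len(cost)
    best = float("inf")
    best_assign = None
    for assign in product(range(m), repeat=n):  # m**n candidate partitions
        loads = [0.0] * m
        for z, j in enumerate(assign):
            loads[j] += cost[z][j]
        if max(loads) < best:
            best, best_assign = max(loads), assign
    return best, best_assign

cost = [[3.0, 5.0], [2.0, 4.0], [4.0, 6.0]]  # 3 jobs, 2 clients
best, assign = best_partition(cost, 2)
# best -> 6.0: jobs 0 and 1 go to client 0, job 2 to client 1
```

Even for this toy instance, the loop visits 2^3 = 8 assignments; for realistic n (here, 99 sites) enumeration is hopeless, which motivates the reduced problem and the heuristic below.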
Assuming also that the file sizes q_j are approximately equal, we may take tin_j to be equal for all j ∈ C. For the same reasons, we also assume that tout_j are equal for all j ∈ C. Removing the first and last terms from all of T_1, T_2, ..., T_m, we obtain the expression

  ∑_{z∈U_j} (tdns_j + ttcp_zj + tweb_zj + P_z / vweb_zj + P_z / v_j).

From the same consideration that all client computers are approximately on an equal footing with respect to the external servers and communication channels used, we assume that tdns_j = tdns for all j ∈ C, and that ttcp_zj = ttcp and tweb_zj = tweb for all z ∈ U_j. Besides, for all z ∈ U_j, vweb_zj = vweb_j. After some transformations, we get

  ∑_{z∈U_j} (tdns + ttcp + tweb + P_z / vweb_j + P_z / v_j) = |U_j| × (tdns + ttcp + tweb) + (1 / vweb_j + 1 / v_j) ∑_{z∈U_j} P_z,

where |U_j| is the cardinality of subset U_j.

Measurements conducted on a set of research and educational sites show that the response time tdns + ttcp + tweb is 0.2–0.3 s. Measurements of client computer performance on a site containing 1000 pages put the time taken to process one site in the 200–300 s interval. Of the two terms on the right-hand side, the contribution of the second is three orders of magnitude larger than that of the first, so for the reduced problem we can approximately keep only the second term:

  T_j^R = (1 / vweb_j + 1 / v_j) ∑_{z∈U_j} P_z.

The objective function of the reduced optimization problem is

  max{T_1^R, T_2^R, ..., T_m^R} → min over U ∈ U^m,   (4)

with constraints (1)–(2). The heuristic algorithm for solving problem (1), (2), (4) is based on a well-known property of thematic web spaces (university, research, government and other web spaces): the distribution of site sizes is very close to a discrete power-law distribution [CSN09]. A brief description of the algorithm is as follows.

Step 1. The sizes of the websites of the scanned set and the performances of the client computers are arranged in descending order.
Step 2. The total size of the scanned sites is calculated.
Step 3.
The share of the total site size for each computer is calculated in proportion to the computer's performance.
Step 4. For each client computer (in descending order of performance), a batch job is generated (in descending order of site sizes) by retrieving the next site from the list (set) of scanned sites and adding it to the batch until the total size of the sites in the batch exceeds the value calculated in Step 3; the last site of the batch is then returned to the list of scanned sites.
Step 5. All the generated batch jobs are supplemented (in reverse order) with the sites remaining in the scanned set after Step 4, from the smallest to the largest.

Scanning the target-set sites in accordance with the batch jobs distributed among several clients is what we call batch-job mode.

6 Natural experiments: description and results

As the founding fathers of cybernetics put it, "... the best material model for a cat is another, or preferably the same cat. In other words, should a material model thoroughly realize its purpose, the original situation could be grasped in its entirety and a model would be unnecessary" [RW45, p. 320]. In our case, we have a good example of a situation where experiments can be conducted on the real object: the BeeBot-EDG system. The general description of the experiment is as follows:

1. A set of client computers is created;
2. The performance of the client computers is measured on a reference site;
3. A target set of scanned sites is created;
4. The target set is scanned in round-robin mode; the time spent on scanning and the site sizes are recorded;
5. The proposed heuristic algorithm is used to solve the batch job creation problem, taking the performance measurements (step 2) and site sizes (step 4) as the initial data;
6. The target set is scanned in batch-job mode; the scanning time is measured.
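Under one possible reading of Steps 1–5, the heuristic can be sketched as follows. This is our illustration, not the authors' code; in particular, the text leaves open how exactly the Step 5 leftovers are distributed, and here we cycle them over the clients in reverse order of performance:

```python
def heuristic_batches(site_sizes, client_speeds):
    """Partition sites into one batch per client (Steps 1-5 of the heuristic).

    site_sizes: dict site -> size in KB; client_speeds: relative client
    performances, e.g. measured on the reference site."""
    # Step 1: sites by size and clients by performance, both descending.
    sites = sorted(site_sizes, key=site_sizes.get, reverse=True)
    order = sorted(range(len(client_speeds)),
                   key=lambda j: client_speeds[j], reverse=True)
    # Step 2: total size of the scanned set.
    total = sum(site_sizes.values())
    # Step 3: each client's share of the total, proportional to its speed.
    share = {j: total * client_speeds[j] / sum(client_speeds) for j in order}
    # Step 4: fill each batch with the largest remaining sites until the
    # share is exceeded; the overshooting site goes back to the list.
    batches = {j: [] for j in order}
    remaining = sites
    for j in order:
        load = 0.0
        while remaining:
            s = remaining.pop(0)
            batches[j].append(s)
            load += site_sizes[s]
            if load > share[j]:
                batches[j].pop()
                remaining.insert(0, s)
                break
    # Step 5: leftovers, smallest first, cycled over clients in reverse order.
    rev = list(reversed(order))
    for i, s in enumerate(reversed(remaining)):
        batches[rev[i % len(rev)]].append(s)
    return batches

sizes = {"a": 50, "b": 30, "c": 20, "d": 10, "e": 5}  # sizes in KB (invented)
batches = heuristic_batches(sizes, [2.0, 1.0])
# batches -> {0: ["a", "d"], 1: ["b", "e", "c"]}
```

Note that each batch is seeded with large sites in Step 4 and topped up with small ones in Step 5, which is why the power-law size distribution matters: a few heavy sites dominate the load and must be spread first.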
Out of the large series of experiments carried out, we describe, as an example, the outcome of eight of them, conducted on one target set using three and five client computers. The numbers of client computers were chosen based on the resources available to the authors. The results obtained clearly illustrate the conclusions drawn below.

  Experiment            Scanning time of BeeBot-EDG (hour:min:sec)
  Round robin1 99x3     3:30:01
  Round robin2 99x3     6:25:03
  Batch job1 99x3       3:42:39
  Batch job2 99x3       3:23:59
  Round robin1 99x5     2:48:10
  Round robin2 99x5     4:18:44
  Batch job1 99x5       3:25:51
  Batch job2 99x5       2:47:51

  Table 1: Empirical data

The times taken by the five client computers to scan a "reference" website of 800 web pages are 95.0, 84.6, 105.7, 110.4 and 85.8 seconds, respectively. For the three client computers, the scanning times are 105.7, 110.4 and 85.8 seconds. The widest difference in speed between the client computers reaches 25%.

The target set is a sample of 99 websites obtained from a list of over 400 websites of official institutions of the Russian Academy of Sciences (Webometric Ranking of Russian Scientific Institutions, http://webometrics-net.ru). The websites were scanned up to the third level. The results of the experiments are presented in Table 1.

The Round robin1 99x3 experiment was conducted in round-robin mode for 99 websites on three client computers. The Round robin2 99x3 experiment used the same mode, 99 websites and three client computers but, unlike Round robin1, the websites were scanned in the reverse order. The Batch job1 99x3 experiment (99 websites, three client computers) used the batch-job mode based on partitioning into batches as described in the "Reduced problem and heuristic algorithm for its solution" section. The Batch job2 99x3 experiment also used 99 sites and three clients.
It used the batch-job mode based on partitioning into batches as described in the "Reduced problem and heuristic algorithm for its solution" section; however, instead of site sizes, the average scanning time obtained in previous experiments was taken as the measure of complexity. The Round robin1 99x5, Round robin2 99x5, Batch job1 99x5 and Batch job2 99x5 experiments were carried out similarly, but with five client computers.

The main conclusion from the experiments is this: if the scanning data obtained from a full scan of the target set in any mode are taken into account, we can build a new scanning process that is more efficient than the previous ones. Let us confirm this with examples from the table. If we assume that the "previous" scan was performed as in Round robin2, then with the site sizes (numbers of pages) obtained we can create a list for the "subsequent" scan that places "large" sites at the beginning of the list and "small" sites at the end. Here, Round robin1 becomes the more efficient variant.

The Batch job1 variant, which uses the number of pages as the measure of complexity, turned out to be more efficient than Round robin2 but less efficient than Round robin1. This result confirms the main conclusion that taking data from previous scans into account lets us construct a job that is more efficient than the worst-case scenario. It also shows that evaluating scanning complexity by the number of webpages on a site is a rough and inaccurate estimate. Using the average scanning time instead of the number of webpages, the Batch job2 variant yields the fastest time in the BeeBot-EDG system when scanning the 99 websites on three client computers. The scanning results for the five client computers largely confirm the conclusions drawn for the three client computers.

7 Conclusion

This paper investigated the scheduling of jobs across several servers in the software system BeeBot-EDG.
The purpose was to collect outbound hyperlinks for a given set of websites. Since the target set is scanned repeatedly after a certain time interval, data obtained from previous scans can be used to construct a job for the next scan in order to improve efficiency. Two known job scheduling modes, round robin and batch job, were compared. A series of experiments conducted on a real software system allowed us to compare these modes and revealed certain advantages of the batch-job mode over the round-robin mode.

At the same time, the experiments showed the need for further research in this direction in order to develop more efficient job scheduling algorithms.

References

[a6] Open-source software for volunteer computing. https://boinc.berkeley.edu (date accessed: 17 February 2017).
[Ack78] R. Ackoff. The Art of Problem Solving. John Wiley and Sons, Ltd., 1978.
[CSN09] Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
[Ken76] M. Kendall. Time-Series. Charles Griffin & Co. Ltd., 2nd edition, 1976.
[KFNH99] C. Kaner, J. Falk, and H. Q. Nguyen. Testing Computer Software. John Wiley and Sons, Ltd., 1999.
[KK12] D. Khurana and S. Kumar. Web crawler: A review. International Journal of Computer Science & Management Studies, 12(1):401–405, 2012.
[Kle64] Leonard Kleinrock. Analysis of a time-shared processor. Naval Research Logistics Quarterly, 11(1):59–73, 1964.
[PC14] A. A. Pechnikov and D. I. Chernobrovkin. Adaptive crawler for external hyperlinks search and acquisition. Automation and Remote Control, 75(3):587–593, March 2014.
[PSM04] Gautam Pant, Padmini Srinivasan, and Filippo Menczer. Crawling the Web, pages 153–177. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
[RW45] Arturo Rosenblueth and Norbert Wiener. The role of models in science. Philosophy of Science, 12(4):316–321, 1945.
[The04] M. Thelwall. Link Analysis: An Information Science Approach.
Elsevier Academic Press, Amsterdam, 2004.