<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A review of web crawling approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elda Xhumari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Izaura Xhumari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Tirana, Department of Informatics</institution>
          ,
          <addr-line>Boulevard “Zogu I”, Tirana, 1001</addr-line>
          ,
          <country country="AL">Albania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Websites are becoming ever richer in information of different formats. The data such sites hold today runs into millions of terabytes, but not all of the information on the net is useful. One methodology for enabling more efficient internet browsing for the user is the web crawler. This study presents the web crawler methodology, the first steps of its development, how it works, the different types of web crawlers, the benefits of their use, and a comparison of their operating methods, together with the advantages and disadvantages of the algorithms they use.</p>
      </abstract>
      <kwd-group>
        <kwd>Web crawler</kwd>
        <kwd>Algorithms</kwd>
        <kwd>Types of web crawlers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The World Wide Web is a large collection of data, and that data continues to grow day by day. Nowadays it has become an important part of human life to use the internet to gain access to information on the World Wide Web. Due to bandwidth, storage capacity, limited computer resources, and the rapid growth of the World Wide Web, unforeseen scaling challenges have arisen for search engines. The two most important features of the web, the large volume of data and the speed at which it changes, make web crawling difficult, as a large number of pages are added, changed, and deleted every day. Although search engine technology has scaled up dramatically to keep pace with this growth, crawling the web efficiently remains a challenge.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. How a web crawler works</title>
      <p>The crawler maintains a list of unvisited URLs called the frontier. The list is first initialized with URLs provided by a user or another program. Each crawl cycle involves selecting a URL from the list, retrieving the corresponding page via HTTP, analyzing it to extract URLs and specific information, and finally adding the extracted unvisited URLs to the frontier. Before being added to the list, a URL may be assigned a score that reflects the estimated benefit of visiting its page. The crawl process may end once a certain number of crawled pages has been reached. If the crawler is ready to visit another page and the frontier is empty, the situation signals a dead end: the crawler has no new pages to visit, so it stops.</p>
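      <p>As a minimal illustration of this crawl cycle (a sketch, not the implementation of any work reviewed here), the following Python code maintains a frontier list, retrieves pages over HTTP, and adds newly discovered, unvisited URLs back to the frontier; the seed URL and page limit are illustrative assumptions.</p>
      <preformat>
# Minimal sketch of the crawl cycle described above. Requires the
# third-party packages "requests" and "beautifulsoup4".
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    frontier = list(seed_urls)        # list of unvisited URLs
    visited = set()
    while frontier and len(visited) &lt; max_pages:
        url = frontier.pop(0)         # select a URL from the frontier
        if url in visited:
            continue
        try:
            page = requests.get(url, timeout=5)  # retrieve page via HTTP
        except requests.RequestException:
            continue
        visited.add(url)
        # analyze the page to extract links; queue the unvisited ones
        soup = BeautifulSoup(page.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in visited:
                frontier.append(link)
    return visited

# The crawl ends when max_pages is reached or the frontier is empty.
crawl(["https://example.com"], max_pages=10)
      </preformat>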
    </sec>
    <sec id="sec-2">
      <title>3. Types of web crawler</title>
      <p>Different types of web crawlers are available, depending on how web pages are crawled and how future web pages are retrieved and accessed. Some of them are described below.</p>
      <sec id="sec-2-1">
        <title>A. Incremental Crawler</title>
        <p>An incremental web crawler is one of the traditional crawlers: it continually updates an existing set of downloaded pages instead of restarting the crawling process from scratch each time. This requires some way to determine whether a page has changed since it was last downloaded. Pages can appear multiple times in the crawl order, and crawling is an ongoing process that conceptually never ends. To keep the downloaded web pages up to date, an incremental web crawler interleaves revisiting previously downloaded pages with first visits to new pages [<xref ref-type="bibr" rid="ref2">2</xref>]. The goal is to achieve freshness and coverage at the same time. The advantage of an incremental web crawler is that only valuable data is provided to the user; network bandwidth is saved and data enrichment is achieved.</p>
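        <p>A simple way to determine whether a page has changed since it was last downloaded, as an incremental crawler requires, is to compare content hashes between visits. The sketch below is an illustrative assumption, not a method from the cited works: it stores one digest per URL and reports a change when the digest differs on a revisit.</p>
        <preformat>
# Illustrative change detection for an incremental crawler: a stored
# content hash per URL tells whether the page changed since last visit.
import hashlib
import requests

page_digests = {}   # maps each URL to its previous SHA-256 digest

def has_changed(url):
    try:
        body = requests.get(url, timeout=5).content
    except requests.RequestException:
        return False
    digest = hashlib.sha256(body).hexdigest()
    changed = page_digests.get(url) != digest
    page_digests[url] = digest   # remember for the next revisit
    return changed
        </preformat>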
      </sec>
      <sec id="sec-2-2">
        <title>B. Form Focused Crawler</title>
        <p>The Form Focused Crawler deals with the sparse distribution of forms on the Web. The Form Crawler avoids crawling through unproductive links by restricting its search to a specific topic, by learning the characteristics of links and pages that lead to pages containing searchable forms, and by using appropriate stopping criteria. The crawler uses two classifiers, for pages and for links, to guide its search; a third classifier, the form classifier, is then used to filter out useless forms.</p>
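        <p>The classifiers in a form-focused crawler are normally learned from training data; the following deliberately simplified stand-in (the keywords and heuristics are hypothetical) conveys the idea of scoring links by topical cues and keeping only pages that actually contain a searchable form.</p>
        <preformat>
# Simplified stand-ins for the form-focused crawler's classifiers: the
# link "classifier" scores anchor text by topical keywords, and the form
# "classifier" keeps only pages that contain a searchable form.
from bs4 import BeautifulSoup

TOPIC_KEYWORDS = {"flight", "airfare", "booking"}   # hypothetical topic

def score_link(anchor_text):
    words = set(anchor_text.lower().split())
    return len(words &amp; TOPIC_KEYWORDS)    # crude topical link score

def has_searchable_form(html):
    soup = BeautifulSoup(html, "html.parser")
    for form in soup.find_all("form"):
        # a text input field suggests a searchable (query) form
        if form.find("input", attrs={"type": "text"}):
            return True
    return False
        </preformat>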
      </sec>
      <sec id="sec-2-3">
        <title>C. Focused Crawler</title>
        <p>A focused crawler collects documents that are specific and related to a given topic. This crawler is sometimes also known as a topic crawler, after the way it works. A focused crawler is a web crawler that tends to download pages that are related to each other, determining whether a given page is relevant to the specific topic. One of the advantages of the focused crawler is economy in hardware and network resources: it reduces the amount of network traffic, logging, and downloads [<xref ref-type="bibr" rid="ref3">3</xref>]. A focused crawler searches, acquires, indexes, and maintains pages for specific groups of topics that represent a relatively narrow segment of the web. The crawler is driven by a classifier that learns to recognize relevance from examples embedded in a topic taxonomy, and a distiller that identifies pages that are good access points to many relevant pages.</p>
      </sec>
      <sec id="sec-2-4">
        <title>D. Parallel Crawler</title>
        <p>As the size of the internet grows, it becomes difficult to retrieve the entire web, or even a major portion of it, with a single process. Search engines therefore typically run multiple crawling processes in parallel, so that the download rate is maximized; this kind of crawler is known as a parallel crawler [<xref ref-type="bibr" rid="ref5">5</xref>]. A parallel crawler consists of multiple crawling processes, referred to as C-procs, which can run on a network of workstations [<xref ref-type="bibr" rid="ref6">6</xref>]. Parallel crawlers are evaluated on page freshness and page selection. A parallel crawler may run on a local network or be distributed across geographically different locations. Parallelization of the crawling system is extremely important for downloading documents in a reasonable amount of time.</p>
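        <p>One common way to realize multiple crawling processes, in the spirit of the C-procs above, is a pool of workers sharing a thread-safe frontier. This thread-based Python sketch is only illustrative; real parallel crawlers also coordinate URL partitioning and politeness between the processes.</p>
        <preformat>
# Illustrative parallel crawler: several worker threads (standing in
# for C-procs) fetch URLs from a shared, thread-safe frontier.
import queue
import threading
import requests

frontier = queue.Queue()
results = {}

def c_proc():
    while True:
        try:
            url = frontier.get(timeout=2)  # stop when frontier runs dry
        except queue.Empty:
            return
        try:
            results[url] = requests.get(url, timeout=5).status_code
        except requests.RequestException:
            results[url] = None
        frontier.task_done()

for seed in ["https://example.com", "https://example.org"]:
    frontier.put(seed)
workers = [threading.Thread(target=c_proc) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
        </preformat>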
      </sec>
      <sec id="sec-2-4">
        <title>E. Distributed Crawler</title>
        <p>A distributed web crawler applies distributed computing techniques: many crawlers operate in a distributed fashion within the web crawling process in order to achieve as much web coverage as possible [<xref ref-type="bibr" rid="ref7">7</xref>]. A central server manages the communication and synchronization of the nodes, which are geographically distributed. It mainly uses the PageRank algorithm to increase its efficiency and search quality. The advantage of a distributed web crawler is that it is robust against system crashes and other events, and it can be adapted to many crawling applications. To design an efficient distributed web crawler, the crawling task must be divided between multiple machines in a synchronized process. Large websites should be partitioned individually across the network, with fair and rational synchronous access; such synchronous distribution also saves network bandwidth resources [<xref ref-type="bibr" rid="ref4">4</xref>].</p>
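        <p>One simple scheme for dividing the crawling task between machines, consistent with the partitioning described above, is to assign each URL to a node by hashing its host name, so that every page of a site goes to the same node. The node count and function name below are assumptions for illustration.</p>
        <preformat>
# Illustrative URL partitioning for a distributed crawler: hashing the
# host name sends every URL of a site to the same crawling node, so
# large sites are split from each other rather than across nodes.
import hashlib
from urllib.parse import urlparse

NUM_NODES = 4   # hypothetical number of crawling nodes

def node_for(url):
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# Both URLs of the same site map to the same node.
assert node_for("https://example.com/a") == node_for("https://example.com/b")
        </preformat>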
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Web crawling algorithms</title>
      <sec id="sec-3-1">
        <title>i. Breadth First Search</title>
        <p>Breadth-first search starts with a small set of pages and then explores other pages by following their links in breadth-first order. In practice, websites are not traversed in strict breadth-first order; a variety of policies can be used, for example crawling the most important pages first. This method is used by many search engines, and such a crawler balances the load between servers. The breadth-first algorithm works level by level: it starts at the root URL and searches all the neighbor URLs at the same level. If the desired URL is found, the search terminates; if not, the search proceeds down to the next level, and the process repeats until the goal is reached. If all URLs are scanned but the objective is not found, a failure is reported. The breadth-first search algorithm is generally used where the objective lies in the shallow parts of a deep tree [<xref ref-type="bibr" rid="ref8">8</xref>].</p>
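        <p>The level-by-level behavior corresponds to treating the frontier as a FIFO queue. The sketch below (illustrative only) searches a small link graph breadth-first for a goal URL; replacing the queue with a stack yields the depth-first variant described next.</p>
        <preformat>
# Breadth-first search over a link graph: the frontier is a FIFO queue,
# so all URLs at one level are examined before descending a level.
from collections import deque

def bfs(graph, root, goal):
    frontier = deque([root])
    visited = {root}
    while frontier:
        url = frontier.popleft()   # FIFO: oldest discovered URL first
        if url == goal:
            return True            # desired URL found
        for link in graph.get(url, []):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return False                   # all URLs scanned, goal not found

links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(bfs(links, "a", "d"))   # True
        </preformat>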
      </sec>
      <sec id="sec-3-2">
        <title>ii. Depth First Search</title>
        <p>Depth-first search is an algorithm for traversing or searching tree or graph data structures. It systematically examines the search space starting from the root node and penetrating deeper through the child nodes. If there is more than one child, priority is given to the leftmost child, and the search descends until no more children are available; it then backtracks to the next unexplored node and proceeds in a similar manner. The algorithm ensures that each edge is visited only once. It is suitable for in-depth search problems, but when the branching structure is large, the algorithm can end up in an endless loop.</p>
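        <p>For contrast, the following sketch explores the leftmost child first and backtracks. The explicit depth limit is an added safeguard, not part of the basic algorithm, to avoid the endless-loop problem mentioned above.</p>
        <preformat>
# Depth-first search with an explicit depth limit to guard against the
# endless descent that large branching structures can cause.
def dfs(graph, node, goal, max_depth=10, visited=None):
    visited = visited if visited is not None else set()
    if node == goal:
        return True
    if max_depth == 0 or node in visited:
        return False
    visited.add(node)
    # children are tried left to right, each explored as deep as possible
    return any(dfs(graph, child, goal, max_depth - 1, visited)
               for child in graph.get(node, []))

links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(dfs(links, "a", "d"))   # True
        </preformat>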
      </sec>
      <sec id="sec-3-2">
        <title>Best First Search</title>
        <p>Best-first algorithms are often used to find search paths. Best-first search is a search algorithm that explores a graph starting from the most promising node, selected according to a specified rule. The basic idea is that, given a frontier of URLs, the best URL is chosen according to some evaluation criterion, such as precision, recall, or F-score. In this algorithm, the URL selection process is driven by the lexical similarity between the topic keywords and the URL's source page: the similarity between a page and the topic keywords is used to score all of the outbound links of that page.</p>
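        <p>URL selection can be sketched with a priority queue ordered by lexical similarity between the topic keywords and the text around each link. The keyword-overlap similarity used here is a deliberately simple stand-in for vector space measures.</p>
        <preformat>
# Best-first frontier: URLs are popped in order of a lexical similarity
# score between topic keywords and the link's surrounding text.
import heapq

TOPIC = {"web", "crawler", "search"}   # hypothetical topic keywords

def similarity(text):
    words = set(text.lower().split())
    return len(words &amp; TOPIC) / len(TOPIC)

frontier = []   # max-heap emulated with negated scores

def add_url(url, context_text):
    heapq.heappush(frontier, (-similarity(context_text), url))

def next_url():
    score, url = heapq.heappop(frontier)   # most promising node first
    return url

add_url("https://example.com/crawler", "web crawler tutorial")
add_url("https://example.com/cooking", "pasta recipes")
print(next_url())   # the crawler-related URL comes out first
        </preformat>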
      </sec>
      <sec id="sec-3-4">
        <title>iv. Fish Search Algorithm</title>
        <p>The main principle of the algorithm is as follows: it takes as input a seed URL and a search query, and dynamically builds a priority list (initialized with the seed URL) of the next URLs (referred to as nodes) to be explored. At each step the first node is removed from the list and processed. As the text of each document becomes available, it is analyzed by a scoring component that assesses whether it is relevant or irrelevant to the search query (value 1 or 0) and, based on that result, a heuristic decides whether to pursue the search in that direction. Whenever a document source is retrieved, it is scanned for links, and the nodes reached by these links are assigned a depth value. If the parent is relevant, the depth of its children is set to a predetermined value; otherwise, the depth of the children is set to one less than the depth of the parent. When the depth reaches zero, that direction is abandoned and none of its children are added to the list [<xref ref-type="bibr" rid="ref9">9</xref>].</p>
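        <p>The binary relevance and depth-inheritance rules translate directly into code. In this illustrative sketch, INITIAL_DEPTH stands for the predetermined value given to children of relevant pages; the function names are assumptions.</p>
        <preformat>
# Fish-Search depth rule: children of a relevant parent get a fresh
# depth budget; children of an irrelevant parent inherit depth - 1,
# and a direction is dropped once its depth reaches zero.
INITIAL_DEPTH = 3   # the "predetermined value" from the text

def child_depth(parent_relevant, parent_depth):
    if parent_relevant:            # relevance is binary: 1 or 0
        return INITIAL_DEPTH
    return parent_depth - 1

def expand(node_depth, parent_relevant, child_urls, frontier):
    depth = child_depth(parent_relevant, node_depth)
    if depth == 0:
        return                     # direction abandoned
    for url in child_urls:
        frontier.append((depth, url))
        </preformat>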
      </sec>
      <sec id="sec-3-5">
        <title>v. Shark-Search Algorithm</title>
        <p>The Fish-Search algorithm's main flaw is that its relevance computation is too simple: it yields only 0 and 1, for relevant and irrelevant respectively. In addition, each node's potential score has low precision, taking only three values (0, 0.5, and 1). To address these disadvantages, Michael Hersovici [<xref ref-type="bibr" rid="ref10">10</xref>] brought forward the improved Shark-Search algorithm, which mainly refines the page-query relevance computation and the method for computing potential scores. In detail, it proceeds as follows:
- A vector space model is used to compute the relevance between a page and the user's query.
- The information given by the anchor text near each hyperlink is taken into account, and its relevance to the user's query is computed.
- Both of the above factors enter the formula for computing a child node's potential score.
        </p>
        <p>Through these improvements, the Shark-Search algorithm's efficiency is much better than Fish-Search's [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
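        <p>The refined potential score can be sketched as a continuous value that blends the score inherited from the parent with anchor-text evidence. The structure below follows the published Shark-Search formulation only loosely; the weights GAMMA and BETA and the input similarities are assumptions for illustration.</p>
        <preformat>
# Sketch of a Shark-Search-style potential score: relevance is
# continuous rather than 0/1, and a child's score blends the inherited
# parent relevance with anchor-text and anchor-context evidence.
GAMMA, BETA = 0.5, 0.8   # assumed tunable weights

def potential_score(parent_relevance, anchor_sim, context_sim):
    # neighborhood evidence: anchor text, backed up by its context
    neighborhood = BETA * anchor_sim + (1 - BETA) * context_sim
    # blend inherited relevance with the neighborhood evidence
    return GAMMA * parent_relevance + (1 - GAMMA) * neighborhood

print(potential_score(0.7, 0.4, 0.2))   # a continuous score in [0, 1]
        </preformat>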
      </sec>
      <sec id="sec-3-3">
        <title>Page Rank Algorithm</title>
        <p>In the PageRank algorithm, the web crawler decides on the importance of web pages through the total number of links or citations to each page. PageRank is computed from the relatedness between web pages as follows:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)),   (1)
where PR(A) is the PageRank of page A, d is the damping factor, T1, ..., Tn are the pages that link to A, and C(Ti) is the number of outbound links on page Ti.</p>
        <p>To find the PageRank of a page A, called PR(A), one must first find all the pages that link to A and the outbound links of each. If a page T1 links to A, then C(T1) gives the number of outbound links on page T1. The same procedure is applied to pages T2, T3, and all other pages that link to page A, and the sum of their contributions gives the PageRank of the page [<xref ref-type="bibr" rid="ref11">11</xref>].</p>
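        <p>Equation (1) can be computed iteratively. The sketch below applies it to a toy link graph until the ranks stabilize, using d = 0.85, a commonly used damping factor; the graph itself is an illustrative assumption.</p>
        <preformat>
# Iterative computation of equation (1): each page's rank is
# (1 - d) plus d times the sum of PR(T)/C(T) over its in-linking pages T.
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = (pr[t] / len(links[t])
                        for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * sum(incoming)
        pr = new_pr
    return pr

# Toy web: A and B link to each other, C links to A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))
        </preformat>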
        <table-wrap id="table-1">
          <label>Table 1</label>
          <caption>
            <p>Advantages and limitations of web crawling algorithms</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Algorithm</th>
                <th>Advantages</th>
                <th>Limitations</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Breadth First Search</td>
                <td>Suitable for situations where the solution is located near the top of a deep tree.</td>
                <td>If a solution is far away, it consumes time. Consumes a large amount of memory.</td>
              </tr>
              <tr>
                <td>Depth First Search</td>
                <td>Suitable for in-depth search problems. Consumes very little memory.</td>
                <td>If the branching structure is large, the algorithm can end up in an endless cycle.</td>
              </tr>
              <tr>
                <td>Fish Search</td>
                <td>The algorithm is helpful in forming the priority table.</td>
                <td>Network resource usage is high. Fish search crawlers significantly load not only the network but also web servers.</td>
              </tr>
              <tr>
                <td>Shark Search</td>
                <td>Mainly improves the page-query relevance computation and the method for computing potential scores.</td>
                <td>Network resource usage is high.</td>
              </tr>
              <tr>
                <td>Page Rank</td>
                <td>The most important pages are returned in a short time, as rank is calculated on the basis of a page's popularity.</td>
                <td>Favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing web site.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Approach</title>
      <p>A traditional crawler worked simply by extracting static data from HTML code, and until recently most websites would undergo the same crawling process. The crawling process is no longer as simple as it was a few years ago, due to the increasing use of JavaScript frameworks such as Angular, React, and Meteor. Many websites are JavaScript-heavy and generate content through asynchronous JavaScript calls after the page is loaded. These frameworks make developers' lives simpler and provide many benefits for creating dynamic sites, but to crawl this type of website, web crawlers use tools such as Selenium.</p>
      <p>Selenium is a web browser automation tool originally designed to automate web applications for testing purposes. It is now used for many other applications, such as automating web-based administration tasks and interacting with platforms that do not provide an API, as well as for web crawling.</p>
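      <p>A minimal example of using Selenium for crawling a JavaScript-heavy page: the browser renders the page and executes its scripts, and the crawler then extracts links from the resulting DOM. This sketch assumes Selenium 4 with an installed Chrome driver; the URL is a placeholder.</p>
      <preformat>
# Rendering a JavaScript-heavy page with Selenium before extracting
# links: the browser executes the page's scripts, so dynamically
# generated content is present in the DOM the crawler reads.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()             # assumes chromedriver is installed
try:
    driver.get("https://example.com")   # placeholder URL
    # the rendered DOM now includes content added by asynchronous JS calls
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.TAG_NAME, "a")]
    print(links)
finally:
    driver.quit()
      </preformat>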
      <p>Building a focused web crawler using the Selenium tool is a good way to collect useful information. Focused crawling is an approach to making internet search more accurate and specialized. An ideal focused crawler would download only the related pages, ignoring the others, and would anticipate the probability that a link leads to a page on the specific topic before downloading it.</p>
      <p>One use case of a focused web crawler is extracting financial data. The financial market is a place of risks and instability. It is hard to predict how the curve will go, and sometimes, for investors, one decision can be a make-or-break move. That is why experienced practitioners never lose track of financial data. Financial data, when extracted and analyzed in real time, can provide a wealth of information for investments and trading, and people in different positions scrape financial data for varied purposes.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>The web crawler is an essential tool for information retrieval: it roams the web and downloads web documents that suit the user's needs. Web crawlers are used by search engines and other users to regularly ensure that their databases are up to date. This article has presented a review of the different types of crawling technologies and algorithms, and of why "focused crawling" technology is being used. The crawling algorithm is the most important part of any search engine, and the searching algorithm is the heart of the search engine system. Focused crawlers use more complex systems and techniques to identify information of high relevance and quality, and the choice of algorithm has a significant impact on the operation and effectiveness of the focused crawler and the search engine. In conclusion, the focused crawler, compared to other crawlers, is intended for advanced web users: it focuses on a specific topic and does not waste resources on irrelevant material.</p>
    </sec>
    <sec id="sec-6">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Mahmud</surname>
            ,
            <given-names>Hasan</given-names>
          </string-name>
          &amp; Soulemane, Moumie &amp; Rafiuzzaman,
          <string-name>
            <surname>Mohammad.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>A framework for dynamic indexing from hidden web</article-title>
          .
          <source>International Journal of Computer Science Issues. 8.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Su</given-names>
            <surname>Guiyang</surname>
          </string-name>
          , Li Jianhua, Ma Yinghua, Li Shenghong, Song Juping Department of Electronic Engineering, Slanghai Jiaotong University, Shanghai 200030,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>China</surname>
          </string-name>
          (
          <issue>Received April 10</issue>
          ,
          <year>2004</year>
          ) New Focused Crawling Algorithm
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Olston</surname>
          </string-name>
          , Marc Najork, Web Crawling
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Gautam</given-names>
            <surname>Pant</surname>
          </string-name>
          , Padmini Srinivasan, Filippo Menczer,
          <source>Department of Management Sciences School of Library and Information Science</source>
          , The University of Iowa,
          <year>2004</year>
          <article-title>Crawling the Web (4-6</article-title>
          ), Web Dynamics
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Dhiraj</given-names>
            <surname>Khurana</surname>
          </string-name>
          , Satish Kumar, “
          <article-title>Web Crawler: A Review”</article-title>
          ,
          <source>IJCSMS International Journal of Computer Science &amp; Management Studies</source>
          , Vol.
          <volume>12</volume>
          ,
          <string-name>
            <surname>Issue</surname>
            <given-names>01</given-names>
          </string-name>
          ,
          <year>January 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Trupti</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Udapure</surname>
            ,
            <given-names>Ravindra D.</given-names>
          </string-name>
          <string-name>
            <surname>Kale</surname>
          </string-name>
          , Rajesh C. Dharmik, “
          <article-title>Study of Web Crawler and its Different Types”</article-title>
          ,
          <source>IOSR Journal of Computer Engineering (IOSR-JCE)</source>
          , Volume
          <volume>16</volume>
          ,
          <string-name>
            <surname>Issue</surname>
            <given-names>1</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ver.</surname>
          </string-name>
          <article-title>VI (Feb</article-title>
          .
          <year>2014</year>
          ), PP
          <fpage>01</fpage>
          -
          <lpage>05</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Yugandhara</given-names>
            <surname>Patil</surname>
          </string-name>
          , Sonal Patil,
          <article-title>Janar 2016 Review of Web Crawlers with Specification and</article-title>
          Working Vol.
          <volume>5</volume>
          ,
          <string-name>
            <surname>Issue</surname>
            <given-names>1</given-names>
          </string-name>
          ,
          <string-name>
            <surname>January</surname>
            <given-names>2016</given-names>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Junghoo</given-names>
            <surname>Cho</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hector</given-names>
            <surname>Garcia-Molina ―</surname>
          </string-name>
          Effective
          <source>Page Refresh Policies for Web Crawlersǁ ACM Transactions on Database Systems</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Andas</given-names>
            <surname>Amrin</surname>
          </string-name>
          *,
          <string-name>
            <surname>C. X.</surname>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Focused Web Crawling Algorithms</article-title>
          . Shanghai, China.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Hersovici</surname>
          </string-name>
          , Michal Jaoov, Maarek Yoelle S, et al.
          <article-title>The Shark-Search algorithm. An applicatication: Taibred Web Site Mapping</article-title>
          ,
          <source>Computer Networks and ISDN Systems 30</source>
          ,
          <year>1998</year>
          .
          <fpage>317</fpage>
          -
          <lpage>326</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>TIAN</given-names>
            <surname>Chong</surname>
          </string-name>
          “
          <article-title>A Kind of Algorithm For Page Ranking Based on Classified Tree In Search Engine”</article-title>
          <source>Proc International Conference on Computer Application and System Modeling (ICCASM</source>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>