<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Harvesting All Matching Information To A Given Query From a Deep Website</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammadreza Khelghati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Djoerd Hiemstra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurice van Keulen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Databases Group, University of Twente, Enschede</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, the goal is to harvest all documents matching a given (entity) query from a deep web source. The objective is to retrieve all information about, for instance, "Denzel Washington", "Iran Nuclear Deal", or "FC Barcelona" from data hidden behind web forms. Policies of web search engines usually do not allow access to all of the search results matching a given query: they limit both the number of returned documents and the number of user requests. In this work, we propose a new approach that automatically collects the information related to a given query from a search engine, given the search engine's limitations. The approach minimizes the number of queries that need to be sent by applying information from a large external corpus. The new approach outperforms existing approaches when tested on Google, measured by the total number of unique documents found per query.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The goal of this research is to harvest all documents matching a given (entity)
query from a deep web source. For instance, we aim at retrieving all information
about "Denzel Washington", "Iran Nuclear Deal", or "FC Barcelona" from data
hidden behind web forms. However, policies of search engines usually do not
allow access to all of the search results matching a given query. They
limit the number of returned documents (#ResultsLimited) and the number of
user requests (#RequestsLimited).</p>
      <p>Given these search engine limitations, we propose a new approach that
automatically collects the information related to a given query from a search engine.
To do so, we rely on search refinement techniques to uncover results beyond what
a search engine allows a user to access directly due to the #ResultsLimited and
#RequestsLimited limitations. These techniques are typically based on adding
extra terms to the initial query to obtain refined search results. We propose
an approach that refines search results for the purpose of achieving full data
coverage.</p>
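      <p>As a minimal illustration of this refinement step (the seed query and extra terms here are our own examples, not drawn from the paper's experiments):
```python
def reformulate(seed_query: str, extra_term: str) -> str:
    """Refine a seed query by appending one extra term."""
    return f"{seed_query} {extra_term}"

# Each refined query targets a different slice of the matching documents.
refined = [reformulate("FC Barcelona", t) for t in ["match", "transfer", "stadium"]]
```
      </p>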
      <p>
        In this approach, reformulating queries should be carried out with the aim of
obtaining as many new results as possible per query. Maximizing the number
of new results means submitting queries that return as many documents as
the #ResultsLimited limitation allows while minimizing the number of duplicates.
Minimizing duplicates is complicated by the presence of ranking bias
and query bias [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A search engine's ranking algorithm (e.g. Google PageRank)
and the choice of the initial query favour some documents over others to be
returned by the search engine.
      </p>
      <p>
        To meet this challenge, techniques from Deep Web harvesting [
        <xref ref-type="bibr" rid="ref1 ref10 ref12 ref2 ref4">1, 12, 10, 4, 2</xref>
        ],
Query-Based Sampling [
        <xref ref-type="bibr" rid="ref3 ref5">5, 3</xref>
        ], Topical Crawling, [
        <xref ref-type="bibr" rid="ref14 ref16">14, 16</xref>
        ], and Query Expansion [
        <xref ref-type="bibr" rid="ref6 ref13 ref7 ref8 ref9">6, 13, 7-9</xref>
        ] are studied. Based on these studies, several approaches are suggested,
implemented, and compared in this paper. We test our approaches on Google,
which claims to search 100 PB of Web data (60 trillion URLs), according to the
official Google Blog (http://googleblog.blogspot.nl/2008/07/we-knew-web-wasbig.html). Google imposes
both the #ResultsLimited and #RequestsLimited limitations, as well as ranking
bias through its PageRank algorithm.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Suggested Approach</title>
      <p>To reach full data coverage, we send automatically generated queries to a search
engine's API with the goal of retrieving all documents that contain a given entity
with a minimum number of query submissions. We compare the approaches by
their capability to deal with #ResultsLimited and #RequestsLimited. The
comparison is based on the average number of queries submitted to retrieve all
documents for a given query.</p>
      <p>We distinguish two kinds of approaches. Section 2.1 describes ideal approaches,
for which we estimate the number of queries needed in ideal (simulated)
conditions. Section 2.2 describes approaches in which queries are reformulated by
using an external corpus.</p>
      <sec id="sec-2-1">
        <title>2.1 Ideal Approaches</title>
        <p>The approach described in this section is ideal but not easily
realized in practice. It is investigated with the sole purpose of providing a baseline
for comparing the introduced approaches.</p>
        <p>Oracle Perfect Approach. To achieve full data coverage on a given
entity in a search system with the #ResultsLimited and #RequestsLimited
limitations, the perfect approach is the one that returns not only the
maximum possible number of documents but also only unique ones for each request.
To have complete coverage in this situation, it is sufficient to send only
|CollectionSizeForQuery| / allowedDocsToBeVisited requests. In reality, this is not easily
achievable. To do so, one needs to know the exact mechanism of the search engine's
ranking algorithm. Only then might one be able to divide the collection into exactly
|CollectionSizeForQuery| / allowedDocsToBeVisited sub-collections. In addition to knowledge of the
ranking algorithm, further information might be needed. For instance, if a ranking
algorithm is based on term frequencies, one needs to know all the term frequencies
beforehand. This kind of information is only accessible with full index
access.</p>
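        <p>The minimal number of requests under the Oracle Perfect assumption can be sketched as follows (a toy illustration; the example numbers are assumptions, not the paper's measurements):
```python
import math

def oracle_requests(collection_size_for_query: int, allowed_docs_per_request: int) -> int:
    """Minimum number of requests if every request returned only new, unique documents."""
    return math.ceil(collection_size_for_query / allowed_docs_per_request)

# e.g. 400,000 matching documents, 100 results visible per request
n = oracle_requests(400_000, 100)  # 4000 requests
```
        </p>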
      </sec>
      <sec id="sec-2-2">
        <title>2.2 List-Based Query Generation Approaches</title>
        <p>
          In these approaches, the terms to be added to the seed query are selected from a
list of words. This list is generated from an external corpus and includes the term
frequencies in that corpus. In this paper, the list is extracted from the ClueWeb09
dataset, which is a web crawl containing nearly 500 million English pages [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Selecting terms from this list and their corresponding document
frequencies can be performed with different methods. In the following, these methods
are explained and studied.
        </p>
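        <p>Building such a term list with document frequencies from an external corpus can be sketched as follows (the three-document corpus is a toy stand-in for ClueWeb09):
```python
from collections import Counter

def build_document_frequencies(documents):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    return df

corpus = ["deep web harvesting", "web search engines", "deep web sources"]
df = build_document_frequencies(corpus)  # df["web"] == 3, df["deep"] == 2
```
        </p>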
      </sec>
      <sec id="sec-2-3">
        <title>List-Based Most/Least Frequent Approach</title>
        <p>Although primitive, choosing the most or least frequent words from a list is a possible option for selecting
terms. As the ClueWeb dataset is not a topic-specific corpus, the most frequent
words from this corpus are highly likely to also be general in other
non-topic-specific corpora.</p>
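        <p>Picking the most or least frequent candidate terms from such a list can be sketched as (toy frequencies, for illustration only):
```python
def most_frequent(df, k):
    """Top-k terms by document frequency."""
    return [t for t, f in sorted(df.items(), key=lambda item: -item[1])[:k]]

def least_frequent(df, k):
    """Bottom-k terms by document frequency."""
    return [t for t, f in sorted(df.items(), key=lambda item: item[1])[:k]]

freqs = {"web": 3, "deep": 2, "harvesting": 1}
most_frequent(freqs, 1)   # ['web']
least_frequent(freqs, 1)  # ['harvesting']
```
        </p>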
        <p>Pre-determined Frequency Based Approach. While submitting the most
frequent terms increases the chance of reaching the maximum number of returned
results, and the least frequent ones increase the probability of generating fewer
duplicates, it is of great interest to investigate the likelihood of finding a term
frequency that creates a trade-off between these two. To do so, statistical
formulas are applicable. If events A and B are independent, then the probability
of both occurring is the product of the probabilities of each occurring
(P(A&amp;B) = P(A) P(B)). With samples smaller than 10 percent of the
collection, we can treat two query-submission processes as statistically independent
events ("the 10% condition"). Then, the probability of an overlap
between two queries equals the product of the probabilities of each query.
This is shown in Formula 1.</p>
        <p>|MatchingDocs ∩ ReturnedDocs| / |SearchEngine| = (|MatchingDocs| / |SearchEngine|) × (|ReturnedDocs| / |SearchEngine|)

|ReturnedDocs| = (l × |SearchEngine|) / |MatchingDocs|, given |MatchingDocs ∩ ReturnedDocs| = l. (1)</p>
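        <p>Formula 1, solved for |ReturnedDocs|, can be written directly (the example numbers are illustrative):
```python
def required_returned_docs(overlap, search_engine_size, matching_docs):
    """Solve Formula 1 for |ReturnedDocs|, given a desired overlap l with the seed query."""
    return overlap * search_engine_size / matching_docs

# l = 100 visible documents, |SearchEngine| = 10^9, |MatchingDocs| = 4 * 10^5
required_returned_docs(100, 10**9, 4 * 10**5)  # 250000.0
```
        </p>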
        <p>
          With knowledge of the targeted search engine's index size and the
number of documents matching the seed query, one can determine through Formula 1
the frequency of another query for which the overlap of this query and the seed
query equals the number of documents allowed to be visited. This means that, with
information on the seed query, the returned results, and the search engine size, a term
can be found to formulate a new query returning at least the same number
of results as are allowed to be visited. This avoids the permanent
presence of the same highly ranked documents among the results and creates
a higher chance of collecting more new documents in each trial. If the size of the
search engine is unknown, as discussed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], the size can be estimated using only
a small number of samples generated from the search engine.
        </p>
        <p>As pointed out, applying this formula to our case requires information on
terms' document frequencies. To access this information from the targeted search
system, we would have to download all of its content and count all the terms'
document frequencies. If this were possible, there would be no need for these
approaches. Instead, we can use pre-computed document frequencies from an external
corpus. In this paper, as we test our approaches on Google, we use the ClueWeb
dataset. However, the size difference between ClueWeb and Google must be
considered in order to apply the formula. The easiest solution is to include the
different sizes in the calculations. For example, assuming |SearchEngine| = 10^9,
the number of English documents in ClueWeb to be 5 × 10^8, limitedResults = 100,
and |MatchingDocuments| for a given query to be 4 × 10^5, the following
calculation provides a term document frequency with a higher chance
of resulting in samples of the desired size: 100/10^9 = (4 × 10^5/10^9) × (x/(5 × 10^8)), which gives x = 125000. In
this paper, we refer to this approach as the LB-FixedFreq. approach.</p>
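        <p>The size correction between Google and ClueWeb in the example above can be verified numerically (same assumed sizes as in the text):
```python
size_search_engine = 10**9      # assumed search engine index size
clueweb_english = 5 * 10**8     # English documents in ClueWeb
limited_results = 100           # documents allowed to be visited per query
matching_docs = 4 * 10**5       # documents matching the seed query

# 100 / 10^9 = (4*10^5 / 10^9) * (x / (5*10^8))  =>  solve for x
x = (limited_results / size_search_engine) / (matching_docs / size_search_engine) * clueweb_english
# x ≈ 125000, the target ClueWeb document frequency for LB-FixedFreq.
```
        </p>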
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <sec id="sec-3-1">
        <title>Experiments Settings</title>
        <p>Test Search Engine. In these experiments, Google, the biggest web search
engine with one of the most complicated ranking algorithms, is used as our
test search engine. As the only feature necessary for applying any of the
suggested approaches is support for a keyword-search interface, targeting Google
does not limit our findings to Google alone.</p>
        <p>Entities Test Set. In our experiments, we used four different entities ("Vitol",
"Ed Brinksma", "PhD Comics", and "Fireworks Disaster") to test and
compare the suggested approaches. We tried to include entities representing different
types of entities: Company, Person, Topic, and Event. In addition to the difference
in type, we tried to cover queries with different estimated result-set sizes.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>In this section, the results of applying the approaches introduced in Section
2 to the test entities (Section 3.1) are presented. Figure 1 compares the
performances of all the approaches for one of the test entities in the test set.
This is a straightforward task, as it only requires comparing the number
of documents retrieved by each approach. However, to compare the approaches'
performances on all the test cases, we calculate their average distances from the
Oracle Perfect approach. In Figure 3, the performances of all the approaches for
all the entities in the test set are compared with the Oracle Perfect approach.</p>
        <p>[Figure 1: number of retrieved documents versus the number of submitted queries for each approach.]</p>
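        <p>The "average distance from the Oracle Perfect approach" can be read as the mean shortfall in retrieved documents over all test entities; a sketch under that assumption (the per-entity counts are invented for illustration):
```python
def avg_distance_from_oracle(retrieved_per_entity, oracle_per_entity):
    """Mean shortfall of an approach's retrieved-document counts versus the oracle."""
    entities = oracle_per_entity.keys()
    return sum(oracle_per_entity[e] - retrieved_per_entity[e] for e in entities) / len(oracle_per_entity)

oracle = {"Vitol": 1000, "PhD Comics": 800}
lb_fixed = {"Vitol": 900, "PhD Comics": 750}
avg_distance_from_oracle(lb_fixed, oracle)  # 75.0
```
        </p>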
        <p>[Figures 2 and 3: sample size and number of duplicates versus the number of submitted queries for each approach.]</p>
        <p>As shown in Figure 3, the LB-FixedFreq. approach performs better than
the Most-Freq. and Least-Freq. approaches. It submits queries that
result in fewer duplicates than the LB-MostFreq. approach while having bigger sample
sizes than the Least-Freq. approach. This is observable in Figure 2.
The right image in this figure shows the number of duplicates resulting from
submitting all the queries formulated by adding a term to the initial query (given
entity). The left image shows the corresponding sample size for each of these
queries. Comparing these two images, we can conclude that a trade-off
between big sample sizes and the number of duplicates is the key to the
LB-FixedFreq. approach's better performance. In this approach, finding a specific
frequency leads to a trade-off between sample sizes and the number of duplicates.</p>
        <p>In this work, we assessed different query generation mechanisms for harvesting
a web data source, moving towards achieving full data coverage on
a given (entity) query. From the experiments, we found that the key to success
in these approaches is to send queries that result in the maximum possible
number of results with the minimum possible number of documents already downloaded
in previous query submissions. To realize this success factor, we suggested three
different approaches based on different frequencies. Among these approaches,
LB-FixedFreq. performed better than the others.</p>
        <p>Future Work. In addition to the frequency of terms extracted from an external
corpus, we can include terms present in the previously retrieved documents to
select the best next query to submit. The frequency of these terms could also be
applied for a more efficient query expansion technique.</p>
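        <p>The future-work idea above, choosing the next term from documents already retrieved, could look like this sketch (the document snippets and helper name are our own):
```python
from collections import Counter

def next_term_from_results(retrieved_docs, used_terms):
    """Return the most frequent term in previously retrieved documents
    that has not yet been used in a query."""
    counts = Counter(t for doc in retrieved_docs for t in doc.lower().split())
    for term, _ in counts.most_common():
        if term not in used_terms:
            return term
    return None

docs = ["iran nuclear deal talks", "nuclear deal sanctions"]
next_term_from_results(docs, {"iran", "nuclear", "deal"})  # a fresh term, e.g. 'talks'
```
        </p>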
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Manuel Álvarez, Juan Raposo, Alberto Pan, Fidel Cacheda, Fernando Bellas, and Víctor Carneiro. Deepbot: a focused crawler for accessing hidden web content. In Proceedings of the 3rd International Workshop on Data Engineering Issues in E-commerce and Services (DEECS '07), pages 18-25, New York, NY, USA, 2007. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Luciano Barbosa and Juliana Freire. Siphoning hidden-web data through keyword-based interfaces. In SBBD, pages 309-321, 2004.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Krishna Bharat and Andrei Broder. A technique for measuring the relative size and overlap of public web search engines. Comput. Netw. ISDN Syst., 30:379-388, April 1998.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Michael Cafarella. Extracting and Querying a Comprehensive Web Database. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2009.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. James P. Callan and Margaret E. Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97-130, 2001.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Guihong Cao, Jian-Yun Nie, Jianfeng Gao, and Stephen Robertson. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 243-250, New York, NY, USA, 2008. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Claudio Carpineto and Giovanni Romano. A survey of automatic query expansion in information retrieval. ACM Comput. Surv., 44(1):1:1-1:50, January 2012.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Kevyn Collins-Thompson and Jamie Callan. Estimation and use of uncertainty in pseudo-relevance feedback. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07, pages 303-310, New York, NY, USA, 2007. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Ben He and Iadh Ounis. Combining fields for query expansion and adaptive query expansion. Inf. Process. Manage., 43(5):1294-1307, September 2007.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Yeye He, Dong Xin, Venkatesh Ganti, Sriram Rajaraman, and Nirav Shah. Crawling deep web entity pages. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 355-364, New York, NY, USA, 2013. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen. Size estimation of non-cooperative data collections. In IIWAS '12, pages 239-246, New York, NY, USA, 2012. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy. Google's Deep Web crawl. Proc. VLDB Endow., 1(2):1241-1252, August 2008.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Filippo Menczer, Gautam Pant, and Padmini Srinivasan. Topical web crawlers: Evaluating adaptive algorithms. ACM Transactions on Internet Technology, 4:http://dollar.biz.ui, 2004.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <article-title>The Lemur Project. A dataset to support research on information retrieval and related human language technologies</article-title>
          . http://lemurproject.org/clueweb09.php,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sergej</surname>
            <given-names>Sizov</given-names>
          </string-name>
          , Martin Theobald, Stefan Siersdorfer, Gerhard Weikum, Jens Graupmann,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Biwer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Zimmer</surname>
          </string-name>
          .
          <article-title>The bingo! system for information portal generation and expert web search</article-title>
          .
          <source>In CIDR</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>