<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How Much Data Resides in a Web Collection: How to Estimate Size of a Web Collection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammadreza Khelghati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Djoerd Hiemstra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurice Van Keulen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>2. THE SUGGESTED APPROACH tiple Capture Recapture (MCR), MCR Regression</institution>
          ,
          <addr-line>Capture History (CH), CH Regression, Generalized Capture Recap-</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>ture (G-MCR) and Bar-Yossef et al. approaches from the literature are implemented. Having studied these approaches, a number of ideas are suggested to improve their accuracy. In the approaches like MCR and CH which are based on creating samples and the number of duplicates among them, the idea of considering only the di erent samples is applied. This can test if di erent samples can provide more information on the collection size. The similarity of samples is considered as the basic modi cation idea for MCR and CH. Di erent nature of Bar-Yossef et al. needs a di erent improvement idea. Bar-Yossef et al. is based on a prede ned query pool. The number of queries in this pool which cover the collection data is estimated and this number directly a ects the collection size estimation. In our experiments over Bar Yossef et al., it was noticed that de ning the query pool can highly a ect the estimation process. Based on this observation, a di erent query pool selection method is suggested. In this suggested approach, queries are divided into di erent query pools based on their frequencies. These pools are indexed and easily accessible by the approach. By sending queries and investigating their results, it is decided if the pool is appropriate or not for the collection. This helps choosing the most appropriate query pool for the collection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        With increasing amount of data in deep web sources
(hidden from general search engines behind web forms),
accessing this data has gained more attention. In the algorithms
applied for this purpose, it is the knowledge of a data source
size that enables the algorithms to make accurate decisions
in stopping crawling or sampling processes which can be so
costly in some cases [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The tendency to know the sizes
of data sources is increased by the competition among
businesses on the Web in which the data coverage is critical.
In the context of quality assessment of search engines [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
search engine selection in the federated search engines, and
in the resource/collection selection in the distributed search
eld [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], this information is also helpful. In addition, it can
give an insight over some useful statistics for public sectors
like governments. In any of these mentioned scenarios, in
case of facing a non-cooperative collection which does not
publish its information, the size has to be estimated [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
In this paper, the approaches in literature are categorized
and reviewed. The most recent approaches are implemented
and compared in a real environment. Finally, four methods
based on the modi cation of the available techniques are
introduced and evaluated. In one of the modi cations, the
estimations from other approaches could be improved
ranging from 35 to 65 percent.
      </p>
      <p>
        Contributions. As the rst contribution, an experimental
comparison among a number of size estimation approaches
is performed. Having applied these techniques on a
number of real search engines, it is shown which technique can
provide more promising results. As the second contribution,
a number of modi cations to the available approaches are
suggested (Table 1 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. RESULTS</title>
      <p>
        Having applied the Mhr, MCR, MCR-Regression, CH,
CHRegression and G-MCR approaches on the test set, the
results are illustrated in the Figure 1 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These websites are
chosen in a way to cover di erent subjects and have di erent
sizes. In this gure, to be able to compare the performance
of the approaches on di erent data collections of di erent
sizes, the results are normalized by using the Relative Bias
metric. If an approach could estimate half of the actual
size of a data collection, the corresponding relative bias for
that approach is 0:5 which is related to 50 percent in the
gure.
      </p>
      <p>
        However, it is important to mention that the Bar-Yossef et
al. approach implemented in this work was so costly in most
of the cases that caused stopping the estimation process.
This problem is introduced by the choices of the query pools
made during the implementation phase of this approach.
Among two pools suggested by Bar-Yossef et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the one
aimed at real cases and not designed for training purposes
is implemented. Therefore, the results for Bar-Yossef et al.
approach are missing in this part.
40
20
0 en.wikipedia.org
ge
tnea -20
irvcsneaePB-40
ii
lt
eaR
      </p>
    </sec>
    <sec id="sec-3">
      <title>4. CONCLUSION</title>
      <p>Having studied the state-of-the-art in size estimation of
noncooperative websites, the most recent approaches introduced
in the literature are implemented in this work. Hence, the
MCR, CH, G-MCR, Bar Yossef et al. and regression-based
approaches are selected to be studied and compared. To
provide an appropriate comparison setting, two issues were
regarded highly important. First, the test collection is de nrd
as a set of websites on the Web from di erent domains (such
as job vacancies, wikis, articles, and personal websites) with
di erent sizes. The second issue was the information
available for each approach. The number of sampling events
and the samples sizes were set to be the same for all the
approaches. Although this test environment could be
improved by adding more real dPagee e1p websites, it is believed
that it could provide an appropriate basis for comparing the
available size estimation approaches.</p>
      <p>Among all the studied approaches, the modi ed version of
Bar-Yossef et al. could provide 35 to 65 percent better
estimations on size of the tested deep websites. However, the
M-Bar-Yossef et al. approach could not be implemented for
the websites which do not provide the access to the
content of the search results. In the case of facing such
websites, the Mhr approach, both modi ed versions of the CH
approach (M-CH-1 and M-CH-2) and their regressions
(MCH-1-Regression and M-CH-2-Regression) could be among
the options to be applied. These approaches had close
estimations considering the average performances on all the
tested websites.</p>
      <p>As future work, we aim at research on the most appropriate
time to stop the sampling in estimation process. The
alternative approaches could be continuing as far as the
limitaMhr tions or to study questions like what is the adequate number
MCR of samples and the most appropriate sample size to provide
MCR-Reg the most accurate estimation. As another future work, the
CH potential further improvements could be mentioned. As an
CH-Reg example, in the selection of pools in the M-Bar-Yossef et
G-MCR al. approach, the selection procedure could be based on the
queries from di erent domains. This classi cation might
lead to higher accuracy of the size estimations.</p>
    </sec>
    <sec id="sec-4">
      <title>5. ACKNOWLEDGEMENT</title>
      <p>This publication was supported by the Dutch national
program COMMIT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bar-Yossef</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gurevich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>E cient search engine measurements</article-title>
          .
          <source>ACM Trans. Web</source>
          <volume>5</volume>
          ,
          <issue>4</issue>
          (Oct.
          <year>2011</year>
          ),
          <volume>18</volume>
          :1{
          <fpage>18</fpage>
          :
          <fpage>48</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Broder</surname>
            ,
            <given-names>A. Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fontoura</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Josifovski</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motwani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nabar</surname>
            ,
            <given-names>S. U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Panigrahy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomkins</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <article-title>Estimating corpus size via queries</article-title>
          .
          <source>In CIKM</source>
          (
          <year>2006</year>
          ), pp.
          <volume>594</volume>
          {
          <fpage>603</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Khelghati</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hiemstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and van Keulen,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Size estimation of non-cooperative data collections</article-title>
          .
          <source>In Proceedings of the 14th International Conference on Information Integration and Web-based Applications &amp;#</source>
          <volume>38</volume>
          ;
          <string-name>
            <surname>Services</surname>
          </string-name>
          (New York, NY, USA,
          <year>2012</year>
          ),
          <source>IIWAS '12</source>
          , ACM, pp.
          <volume>239</volume>
          {
          <fpage>246</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Ranking bias in deep web size estimation using capture recapture method</article-title>
          .
          <source>Data Knowl. Eng</source>
          .
          <volume>69</volume>
          ,
          <issue>8</issue>
          (Aug.
          <year>2010</year>
          ),
          <volume>866</volume>
          {
          <fpage>879</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Shokouhi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zobel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scholer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Tahaghoghi</surname>
            ,
            <given-names>S. M. M.</given-names>
          </string-name>
          <article-title>Capturing collection size for distributed non-cooperative retrieval</article-title>
          .
          <source>In SIGIR</source>
          (
          <year>2006</year>
          ), pp.
          <volume>316</volume>
          {
          <fpage>323</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <article-title>Estimating collection size with logistic regression</article-title>
          .
          <source>In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          (New York, NY, USA,
          <year>2007</year>
          ),
          <source>SIGIR '07</source>
          , ACM, pp.
          <volume>789</volume>
          {
          <fpage>790</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>