=Paper=
{{Paper
|id=None
|storemode=property
|title=Size Estimation of Non-Cooperative Data Collections
|pdfUrl=https://ceur-ws.org/Vol-986/paper_24.pdf
|volume=Vol-986
|dblpUrl=https://dblp.org/rec/conf/dir/KhelghatiHK13
}}
==Size Estimation of Non-Cooperative Data Collections==
<pdf width="1500px">https://ceur-ws.org/Vol-986/paper_24.pdf</pdf>
<pre>
        How Much Data Resides in a Web Collection: How to
               Estimate Size of a Web Collection

                       Mohammadreza Khelghati, Djoerd Hiemstra, Maurice Van Keulen


1.    INTRODUCTION                                                 ture (G-MCR) and Bar-Yossef et al. approaches from the lit-
                                                                   erature are implemented. Having studied these approaches,
With increasing amount of data in deep web sources (hid-           a number of ideas are suggested to improve their accuracy.
den from general search engines behind web forms), access-
ing this data has gained more attention. In the algorithms         In the approaches like MCR and CH which are based on
applied for this purpose, it is the knowledge of a data source     creating samples and the number of duplicates among them,
size that enables the algorithms to make accurate decisions        the idea of considering only the different samples is applied.
in stopping crawling or sampling processes which can be so         This can test if different samples can provide more infor-
costly in some cases [4]. The tendency to know the sizes           mation on the collection size. The similarity of samples is
of data sources is increased by the competition among busi-        considered as the basic modification idea for MCR and CH.
nesses on the Web in which the data coverage is critical.
In the context of quality assessment of search engines [2],        Different nature of Bar-Yossef et al. needs a different im-
search engine selection in the federated search engines, and       provement idea. Bar-Yossef et al. is based on a predefined
in the resource/collection selection in the distributed search     query pool. The number of queries in this pool which cover
field [6], this information is also helpful. In addition, it can   the collection data is estimated and this number directly
give an insight over some useful statistics for public sectors     affects the collection size estimation. In our experiments
like governments. In any of these mentioned scenarios, in          over Bar Yossef et al., it was noticed that defining the query
case of facing a non-cooperative collection which does not         pool can highly affect the estimation process. Based on this
publish its information, the size has to be estimated [5].         observation, a different query pool selection method is sug-
In this paper, the approaches in literature are categorized        gested. In this suggested approach, queries are divided into
and reviewed. The most recent approaches are implemented           different query pools based on their frequencies. These pools
and compared in a real environment. Finally, four methods          are indexed and easily accessible by the approach. By send-
based on the modification of the available techniques are in-      ing queries and investigating their results, it is decided if
troduced and evaluated. In one of the modifications, the           the pool is appropriate or not for the collection. This helps
estimations from other approaches could be improved rang-          choosing the most appropriate query pool for the collection.
ing from 35 to 65 percent.
Contributions. As the first contribution, an experimental          3.   RESULTS
comparison among a number of size estimation approaches            Having applied the Mhr, MCR, MCR-Regression, CH, CH-
is performed. Having applied these techniques on a num-            Regression and G-MCR approaches on the test set, the re-
ber of real search engines, it is shown which technique can        sults are illustrated in the Figure 1 [3]. These websites are
provide more promising results. As the second contribution,        chosen in a way to cover different subjects and have different
a number of modifications to the available approaches are          sizes. In this figure, to be able to compare the performance
suggested (Table 1 [3]).                                           of the approaches on different data collections of different
                                                                   sizes, the results are normalized by using the Relative Bias
2.    THE SUGGESTED APPROACH                                       metric. If an approach could estimate half of the actual
In this work, Heterogeneous and Ranked Model (Mhr), Mul-           size of a data collection, the corresponding relative bias for
tiple Capture Recapture (MCR), MCR Regression, Capture             that approach is −0.5 which is related to −50 percent in the
History (CH), CH Regression, Generalized Capture Recap-            figure.

                                                                   However, it is important to mention that the Bar-Yossef et
                                                                   al. approach implemented in this work was so costly in most
                                                                   of the cases that caused stopping the estimation process.
                                                                   This problem is introduced by the choices of the query pools
                                                                   made during the implementation phase of this approach.
                                                                   Among two pools suggested by Bar-Yossef et al. [1], the one
                                                                   aimed at real cases and not designed for training purposes
DIR 2013, April 26, 2013, Delft, The Netherlands.                  is implemented. Therefore, the results for Bar-Yossef et al.
                                                                   approach are missing in this part.
                               Table 1: Improvements Resulting From Modifications
                                           Mhr MCR MCR-Reg               CH     CH-Reg G-MCR
                           M-Bar-Yossef 36.25 63.67            67.36    44.74     54.70      62.77
                             M-MCR         -19.1     8.27      11.96    -10.6      -0.7       7.37
                           M-MCR-Reg -24.1           3.25       6.94    -15.6      -5.7       2.34
                             M-CH-1         1.35    28.77      32.46     9.84     19.79      27.86
                           M-CH-1-Reg       2.50    29.92      33.60    10.98     20.94      29.01
                             M-CH-2         0.81    28.23      31.92     9.30     19.26      27.33
                           M-CH-2-Reg       2.77    30.19      33.87    11.25     21.21      29.28
Note: This table provides the percentage of improvements that the modified approaches could result regarding the previously
available approaches; considering the average of all the performances on all the tested real data collections on the Web.
                                                                                         Sheet6


                               40
                                                                                                                                                                           the options to be applied. These approaches had close es-
                                                                                                                                                                           timations considering the average performances on all the
                               20
                                                                                                                                                                           tested websites.

                                                                                                                                           e
                                                                   ch


                                                                                                                                        ub
                                                                                                                 te
                                                   rg


                                                                                                              si
                                                                 ar


                                                                                                                                     ut


                                                                                                                                                             k
                                                 .o


                                                                                                            eb
                                                               Se


                                                                                                                                Yo


                                                                                                                                                          u
                                               ia


                                                                                                                                                       o.
                                                                                                           W
                                             ed


                                                                                                                                 n
                                                             ty


                                                                                                                                                        .c
                                                                                                                               io
                                                                                                            al
                                                                             ed


                                                                                                                                                      er
                                           ip


                                                              si


                                                                                                          on


                                                                                                                            at
                                                            er
                                         ik


                                                                                                                                                   st
                                                                           bm


                                                                                                                         uc
                                       .w


                                                          iv


                                                                                                       rs


                                                                                                                                                on
                                                        Un


                                                                        Pu


                                                                                                    Pe


                                                                                                                      Ed
                                     en


                                                                                                                                               M
                                0
                                                                                                                                                                           As future work, we aim at research on the most appropriate
                                                                                                                                                                           time to stop the sampling in estimation process. The alter-
Relative Bias in Percentage


                               -20
                                                                                                                                                                           native approaches could be continuing as far as the limita-
                                                                                                                                                                 Mhr       tions or to study questions like what is the adequate number
                               -40                                                                                                                               MCR
                                                                                                                                                                           of samples and the most appropriate sample size to provide
                                                                                                                                                                 MCR-Reg
                                                                                                                                                                           the most accurate estimation. As another future work, the
                                                                                                                                                                 CH
                               -60                                                                                                                                         potential further improvements could be mentioned. As an
                                                                                                                                                                 CH-Reg
                                                                                                                                                                           example, in the selection of pools in the M-Bar-Yossef et
                                                                                                                                                                 G-MCR
                               -80                                                                                                                                         al. approach, the selection procedure could be based on the
                                                                                                                                                                           queries from different domains. This classification might
                              -100
                                                                            Data Collection from the Web
                                                                                                                                                                           lead to higher accuracy of the size estimations.

                                                                                                                                                                           5.   ACKNOWLEDGEMENT
Figure 1: The Performance of the Approaches on                                                                                                                             This publication was supported by the Dutch national pro-
the Real Data Collections from the Web                                                                                                                                     gram COMMIT.
Note: The lines are added only to provide more readability
of the graph.
                                                                                                                                                                           6.   REFERENCES
                                                                                                                                                                           [1] Bar-Yossef, Z., and Gurevich, M. Efficient search
4.                                          CONCLUSION                                                                                                                         engine measurements. ACM Trans. Web 5, 4 (Oct.
Having studied the state-of-the-art in size estimation of non-                                                                                                                 2011), 18:1–18:48.
cooperative websites, the most recent approaches introduced                                                                                                                [2] Broder, A. Z., Fontoura, M., Josifovski, V.,
in the literature are implemented in this work. Hence, the                                                                                                                     Kumar, R., Motwani, R., Nabar, S. U.,
MCR, CH, G-MCR, Bar Yossef et al. and regression-based                                                                                                                         Panigrahy, R., Tomkins, A., and Xu, Y. Estimating
approaches are selected to be studied and compared. To pro-                                                                                                                    corpus size via queries. In CIKM (2006), pp. 594–603.
vide an appropriate comparison setting, two issues were re-                                                                                                                [3] Khelghati, M., Hiemstra, D., and van Keulen, M.
garded highly important. First, the test collection is definrd                                                                                                                 Size estimation of non-cooperative data collections. In
as a set of websites on the Web from different domains (such                                                                                                                   Proceedings of the 14th International Conference on
as job vacancies, wikis, articles, and personal websites) with                                                                                                                 Information Integration and Web-based Applications
different sizes. The second issue was the information avail-                                                                                                                   &#38; Services (New York, NY, USA, 2012), IIWAS
able for each approach. The number of sampling events                                                                                                                          ’12, ACM, pp. 239–246.
and the samples sizes were set to be the same for all the                                                                                                                  [4] Lu, J. Ranking bias in deep web size estimation using
approaches. Although this test environment could be im-                                                                                                                        capture recapture method. Data Knowl. Eng. 69, 8
proved by adding more real deep  Page 1 websites, it is believed                                                                                                               (Aug. 2010), 866–879.
that it could provide an appropriate basis for comparing the                                                                                                               [5] Shokouhi, M., Zobel, J., Scholer, F., and
available size estimation approaches.                                                                                                                                          Tahaghoghi, S. M. M. Capturing collection size for
                                                                                                                                                                               distributed non-cooperative retrieval. In SIGIR (2006),
Among all the studied approaches, the modified version of                                                                                                                      pp. 316–323.
Bar-Yossef et al. could provide 35 to 65 percent better es-
                                                                                                                                                                           [6] Xu, J., Wu, S., and Li, X. Estimating collection size
timations on size of the tested deep websites. However, the
                                                                                                                                                                               with logistic regression. In Proceedings of the 30th
M-Bar-Yossef et al. approach could not be implemented for
                                                                                                                                                                               annual international ACM SIGIR conference on
the websites which do not provide the access to the con-
                                                                                                                                                                               Research and development in information retrieval
tent of the search results. In the case of facing such web-
                                                                                                                                                                               (New York, NY, USA, 2007), SIGIR ’07, ACM,
sites, the Mhr approach, both modified versions of the CH
                                                                                                                                                                               pp. 789–790.
approach (M-CH-1 and M-CH-2) and their regressions (M-
CH-1-Regression and M-CH-2-Regression) could be among

</pre>