1. INTRODUCTION

How Much Data Resides in a Web Collection: How to Estimate Size of a Web Collection

Mohammadreza Khelghati

Djoerd Hiemstra

Maurice Van Keulen

0 0 2. THE SUGGESTED APPROACH tiple Capture Recapture (MCR), MCR Regression , Capture History (CH), CH Regression, Generalized Capture Recap-

ture (G-MCR) and Bar-Yossef et al. approaches from the literature are implemented. Having studied these approaches, a number of ideas are suggested to improve their accuracy. In the approaches like MCR and CH which are based on creating samples and the number of duplicates among them, the idea of considering only the di erent samples is applied. This can test if di erent samples can provide more information on the collection size. The similarity of samples is considered as the basic modi cation idea for MCR and CH. Di erent nature of Bar-Yossef et al. needs a di erent improvement idea. Bar-Yossef et al. is based on a prede ned query pool. The number of queries in this pool which cover the collection data is estimated and this number directly a ects the collection size estimation. In our experiments over Bar Yossef et al., it was noticed that de ning the query pool can highly a ect the estimation process. Based on this observation, a di erent query pool selection method is suggested. In this suggested approach, queries are divided into di erent query pools based on their frequencies. These pools are indexed and easily accessible by the approach. By sending queries and investigating their results, it is decided if the pool is appropriate or not for the collection. This helps choosing the most appropriate query pool for the collection.

1. INTRODUCTION

With increasing amount of data in deep web sources (hidden from general search engines behind web forms), accessing this data has gained more attention. In the algorithms applied for this purpose, it is the knowledge of a data source size that enables the algorithms to make accurate decisions in stopping crawling or sampling processes which can be so costly in some cases [ 4 ]. The tendency to know the sizes of data sources is increased by the competition among businesses on the Web in which the data coverage is critical. In the context of quality assessment of search engines [ 2 ], search engine selection in the federated search engines, and in the resource/collection selection in the distributed search eld [ 6 ], this information is also helpful. In addition, it can give an insight over some useful statistics for public sectors like governments. In any of these mentioned scenarios, in case of facing a non-cooperative collection which does not publish its information, the size has to be estimated [ 5 ]. In this paper, the approaches in literature are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on the modi cation of the available techniques are introduced and evaluated. In one of the modi cations, the estimations from other approaches could be improved ranging from 35 to 65 percent.

Contributions. As the rst contribution, an experimental comparison among a number of size estimation approaches is performed. Having applied these techniques on a number of real search engines, it is shown which technique can provide more promising results. As the second contribution, a number of modi cations to the available approaches are suggested (Table 1 [ 3 ]).

3. RESULTS

Having applied the Mhr, MCR, MCR-Regression, CH, CHRegression and G-MCR approaches on the test set, the results are illustrated in the Figure 1 [ 3 ]. These websites are chosen in a way to cover di erent subjects and have di erent sizes. In this gure, to be able to compare the performance of the approaches on di erent data collections of di erent sizes, the results are normalized by using the Relative Bias metric. If an approach could estimate half of the actual size of a data collection, the corresponding relative bias for that approach is 0:5 which is related to 50 percent in the gure.

However, it is important to mention that the Bar-Yossef et al. approach implemented in this work was so costly in most of the cases that caused stopping the estimation process. This problem is introduced by the choices of the query pools made during the implementation phase of this approach. Among two pools suggested by Bar-Yossef et al. [ 1 ], the one aimed at real cases and not designed for training purposes is implemented. Therefore, the results for Bar-Yossef et al. approach are missing in this part. 40 20 0 en.wikipedia.org ge tnea -20 irvcsneaePB-40 ii lt eaR

4. CONCLUSION

Having studied the state-of-the-art in size estimation of noncooperative websites, the most recent approaches introduced in the literature are implemented in this work. Hence, the MCR, CH, G-MCR, Bar Yossef et al. and regression-based approaches are selected to be studied and compared. To provide an appropriate comparison setting, two issues were regarded highly important. First, the test collection is de nrd as a set of websites on the Web from di erent domains (such as job vacancies, wikis, articles, and personal websites) with di erent sizes. The second issue was the information available for each approach. The number of sampling events and the samples sizes were set to be the same for all the approaches. Although this test environment could be improved by adding more real dPagee e1p websites, it is believed that it could provide an appropriate basis for comparing the available size estimation approaches.

Among all the studied approaches, the modi ed version of Bar-Yossef et al. could provide 35 to 65 percent better estimations on size of the tested deep websites. However, the M-Bar-Yossef et al. approach could not be implemented for the websites which do not provide the access to the content of the search results. In the case of facing such websites, the Mhr approach, both modi ed versions of the CH approach (M-CH-1 and M-CH-2) and their regressions (MCH-1-Regression and M-CH-2-Regression) could be among the options to be applied. These approaches had close estimations considering the average performances on all the tested websites.

As future work, we aim at research on the most appropriate time to stop the sampling in estimation process. The alternative approaches could be continuing as far as the limitaMhr tions or to study questions like what is the adequate number MCR of samples and the most appropriate sample size to provide MCR-Reg the most accurate estimation. As another future work, the CH potential further improvements could be mentioned. As an CH-Reg example, in the selection of pools in the M-Bar-Yossef et G-MCR al. approach, the selection procedure could be based on the queries from di erent domains. This classi cation might lead to higher accuracy of the size estimations.

5. ACKNOWLEDGEMENT

This publication was supported by the Dutch national program COMMIT.

[1] Bar-Yossef , Z. , and Gurevich , M. E cient search engine measurements . ACM Trans. Web 5 , 4 (Oct. 2011 ), 18 :1{ 18 : 48 .

[2] Broder , A. Z. , Fontoura , M. , Josifovski , V. , Kumar , R. , Motwani , R. , Nabar , S. U. , Panigrahy , R. , Tomkins , A. , and Xu , Y. Estimating corpus size via queries . In CIKM ( 2006 ), pp. 594 { 603 .

[3] Khelghati , M. , Hiemstra , D. , and van Keulen, M. Size estimation of non-cooperative data collections . In Proceedings of the 14th International Conference on Information Integration and Web-based Applications &# 38 ; Services (New York, NY, USA, 2012 ), IIWAS '12 , ACM, pp. 239 { 246 .

[4] Lu , J. Ranking bias in deep web size estimation using capture recapture method . Data Knowl. Eng . 69 , 8 (Aug. 2010 ), 866 { 879 .

[5] Shokouhi , M. , Zobel , J. , Scholer , F. , and Tahaghoghi , S. M. M. Capturing collection size for distributed non-cooperative retrieval . In SIGIR ( 2006 ), pp. 316 { 323 .

[6] Xu , J. , Wu , S. , and Li , X. Estimating collection size with logistic regression . In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2007 ), SIGIR '07 , ACM, pp. 789 { 790 .