=Paper=
{{Paper
|id=None
|storemode=property
|title=Size Estimation of Non-Cooperative Data Collections
|pdfUrl=https://ceur-ws.org/Vol-986/paper_24.pdf
|volume=Vol-986
|dblpUrl=https://dblp.org/rec/conf/dir/KhelghatiHK13
}}
==Size Estimation of Non-Cooperative Data Collections==
How Much Data Resides in a Web Collection: How to
Estimate Size of a Web Collection
Mohammadreza Khelghati, Djoerd Hiemstra, Maurice Van Keulen
1. INTRODUCTION ture (G-MCR) and Bar-Yossef et al. approaches from the lit-
erature are implemented. Having studied these approaches,
With increasing amount of data in deep web sources (hid- a number of ideas are suggested to improve their accuracy.
den from general search engines behind web forms), access-
ing this data has gained more attention. In the algorithms In the approaches like MCR and CH which are based on
applied for this purpose, it is the knowledge of a data source creating samples and the number of duplicates among them,
size that enables the algorithms to make accurate decisions the idea of considering only the different samples is applied.
in stopping crawling or sampling processes which can be so This can test if different samples can provide more infor-
costly in some cases [4]. The tendency to know the sizes mation on the collection size. The similarity of samples is
of data sources is increased by the competition among busi- considered as the basic modification idea for MCR and CH.
nesses on the Web in which the data coverage is critical.
In the context of quality assessment of search engines [2], Different nature of Bar-Yossef et al. needs a different im-
search engine selection in the federated search engines, and provement idea. Bar-Yossef et al. is based on a predefined
in the resource/collection selection in the distributed search query pool. The number of queries in this pool which cover
field [6], this information is also helpful. In addition, it can the collection data is estimated and this number directly
give an insight over some useful statistics for public sectors affects the collection size estimation. In our experiments
like governments. In any of these mentioned scenarios, in over Bar Yossef et al., it was noticed that defining the query
case of facing a non-cooperative collection which does not pool can highly affect the estimation process. Based on this
publish its information, the size has to be estimated [5]. observation, a different query pool selection method is sug-
In this paper, the approaches in literature are categorized gested. In this suggested approach, queries are divided into
and reviewed. The most recent approaches are implemented different query pools based on their frequencies. These pools
and compared in a real environment. Finally, four methods are indexed and easily accessible by the approach. By send-
based on the modification of the available techniques are in- ing queries and investigating their results, it is decided if
troduced and evaluated. In one of the modifications, the the pool is appropriate or not for the collection. This helps
estimations from other approaches could be improved rang- choosing the most appropriate query pool for the collection.
ing from 35 to 65 percent.
Contributions. As the first contribution, an experimental 3. RESULTS
comparison among a number of size estimation approaches Having applied the Mhr, MCR, MCR-Regression, CH, CH-
is performed. Having applied these techniques on a num- Regression and G-MCR approaches on the test set, the re-
ber of real search engines, it is shown which technique can sults are illustrated in the Figure 1 [3]. These websites are
provide more promising results. As the second contribution, chosen in a way to cover different subjects and have different
a number of modifications to the available approaches are sizes. In this figure, to be able to compare the performance
suggested (Table 1 [3]). of the approaches on different data collections of different
sizes, the results are normalized by using the Relative Bias
2. THE SUGGESTED APPROACH metric. If an approach could estimate half of the actual
In this work, Heterogeneous and Ranked Model (Mhr), Mul- size of a data collection, the corresponding relative bias for
tiple Capture Recapture (MCR), MCR Regression, Capture that approach is −0.5 which is related to −50 percent in the
History (CH), CH Regression, Generalized Capture Recap- figure.
However, it is important to mention that the Bar-Yossef et
al. approach implemented in this work was so costly in most
of the cases that caused stopping the estimation process.
This problem is introduced by the choices of the query pools
made during the implementation phase of this approach.
Among two pools suggested by Bar-Yossef et al. [1], the one
aimed at real cases and not designed for training purposes
DIR 2013, April 26, 2013, Delft, The Netherlands. is implemented. Therefore, the results for Bar-Yossef et al.
approach are missing in this part.
Table 1: Improvements Resulting From Modifications
Mhr MCR MCR-Reg CH CH-Reg G-MCR
M-Bar-Yossef 36.25 63.67 67.36 44.74 54.70 62.77
M-MCR -19.1 8.27 11.96 -10.6 -0.7 7.37
M-MCR-Reg -24.1 3.25 6.94 -15.6 -5.7 2.34
M-CH-1 1.35 28.77 32.46 9.84 19.79 27.86
M-CH-1-Reg 2.50 29.92 33.60 10.98 20.94 29.01
M-CH-2 0.81 28.23 31.92 9.30 19.26 27.33
M-CH-2-Reg 2.77 30.19 33.87 11.25 21.21 29.28
Note: This table provides the percentage of improvements that the modified approaches could result regarding the previously
available approaches; considering the average of all the performances on all the tested real data collections on the Web.
Sheet6
40
the options to be applied. These approaches had close es-
timations considering the average performances on all the
20
tested websites.
e
ch
ub
te
rg
si
ar
ut
k
.o
eb
Se
Yo
u
ia
o.
W
ed
n
ty
.c
io
al
ed
er
ip
si
on
at
er
ik
st
bm
uc
.w
iv
rs
on
Un
Pu
Pe
Ed
en
M
0
As future work, we aim at research on the most appropriate
time to stop the sampling in estimation process. The alter-
Relative Bias in Percentage
-20
native approaches could be continuing as far as the limita-
Mhr tions or to study questions like what is the adequate number
-40 MCR
of samples and the most appropriate sample size to provide
MCR-Reg
the most accurate estimation. As another future work, the
CH
-60 potential further improvements could be mentioned. As an
CH-Reg
example, in the selection of pools in the M-Bar-Yossef et
G-MCR
-80 al. approach, the selection procedure could be based on the
queries from different domains. This classification might
-100
Data Collection from the Web
lead to higher accuracy of the size estimations.
5. ACKNOWLEDGEMENT
Figure 1: The Performance of the Approaches on This publication was supported by the Dutch national pro-
the Real Data Collections from the Web gram COMMIT.
Note: The lines are added only to provide more readability
of the graph.
6. REFERENCES
[1] Bar-Yossef, Z., and Gurevich, M. Efficient search
4. CONCLUSION engine measurements. ACM Trans. Web 5, 4 (Oct.
Having studied the state-of-the-art in size estimation of non- 2011), 18:1–18:48.
cooperative websites, the most recent approaches introduced [2] Broder, A. Z., Fontoura, M., Josifovski, V.,
in the literature are implemented in this work. Hence, the Kumar, R., Motwani, R., Nabar, S. U.,
MCR, CH, G-MCR, Bar Yossef et al. and regression-based Panigrahy, R., Tomkins, A., and Xu, Y. Estimating
approaches are selected to be studied and compared. To pro- corpus size via queries. In CIKM (2006), pp. 594–603.
vide an appropriate comparison setting, two issues were re- [3] Khelghati, M., Hiemstra, D., and van Keulen, M.
garded highly important. First, the test collection is definrd Size estimation of non-cooperative data collections. In
as a set of websites on the Web from different domains (such Proceedings of the 14th International Conference on
as job vacancies, wikis, articles, and personal websites) with Information Integration and Web-based Applications
different sizes. The second issue was the information avail- & Services (New York, NY, USA, 2012), IIWAS
able for each approach. The number of sampling events ’12, ACM, pp. 239–246.
and the samples sizes were set to be the same for all the [4] Lu, J. Ranking bias in deep web size estimation using
approaches. Although this test environment could be im- capture recapture method. Data Knowl. Eng. 69, 8
proved by adding more real deep Page 1 websites, it is believed (Aug. 2010), 866–879.
that it could provide an appropriate basis for comparing the [5] Shokouhi, M., Zobel, J., Scholer, F., and
available size estimation approaches. Tahaghoghi, S. M. M. Capturing collection size for
distributed non-cooperative retrieval. In SIGIR (2006),
Among all the studied approaches, the modified version of pp. 316–323.
Bar-Yossef et al. could provide 35 to 65 percent better es-
[6] Xu, J., Wu, S., and Li, X. Estimating collection size
timations on size of the tested deep websites. However, the
with logistic regression. In Proceedings of the 30th
M-Bar-Yossef et al. approach could not be implemented for
annual international ACM SIGIR conference on
the websites which do not provide the access to the con-
Research and development in information retrieval
tent of the search results. In the case of facing such web-
(New York, NY, USA, 2007), SIGIR ’07, ACM,
sites, the Mhr approach, both modified versions of the CH
pp. 789–790.
approach (M-CH-1 and M-CH-2) and their regressions (M-
CH-1-Regression and M-CH-2-Regression) could be among