<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Big Data or Right Data?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ricardo Baeza-Yates</string-name>
          <email>rbaeza@acm.org</email>
          <aff>Yahoo! Labs Barcelona and Web Research Group, Univ. Pompeu Fabra, Barcelona, Spain</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Big data nowadays is a fashionable topic, independently of what people mean when they use this term. The challenges include how to capture, transfer, store, clean, analyze, filter, search, share, and visualize such data. But being big is just a matter of volume, although there is no clear agreement on the size threshold where big starts. Indeed, it is easy to capture large amounts of data using a brute-force approach. So the real goal should not be big data but to ask ourselves, for a given problem, what is the right data and how much of it is needed.<sup>1</sup> For some problems this implies big data, but for the majority of problems much less data is necessary. In this position paper we explore the trade-offs involved and the main problems that come with big data: scalability, redundancy, bias, noise, spam, and privacy.</p>
      </abstract>
      <kwd-group>
        <kwd>scalability</kwd>
        <kwd>redundancy</kwd>
        <kwd>bias</kwd>
        <kwd>sparsity</kwd>
        <kwd>noise</kwd>
        <kwd>spam</kwd>
        <kwd>privacy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>According to Wikipedia, "big data is a collection of data sets so large and complex
that it becomes difficult to process using on-hand database management tools or
traditional data processing applications". However, what do on-hand database
management tools or traditional data processing applications really mean? Are we
talking about terabytes or petabytes? In fact, a definition of a volume threshold
based on current storage and processing capacities might be more reasonable.
This definition may then depend on the device. For example, big in the mobile
world will be smaller than big in the desktop world.</p>
      <p>
        Big data can be used in many applications. In the context of the Web, it can
be used for Web search, information extraction, and many other data mining
problems. Clearly, for Web search big data is needed, as we need to search over
the whole content of the Web. Hence, in the sequel, we will focus on data mining
problems, using the Web as the main example. When the data comes from people,
this is now called the wisdom of crowds [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The crucial difference between Web search
and Web data mining is that in the first case we know what we are looking for,
while in the second we try to find something unusual that will be the answer to
a (yet) unknown question.
<sup>1</sup> We use data as singular, although using it as a plural is more formal.
      </p>
      <p>Recently we see a lot of data mining for the sake of it. This has been triggered
by the availability of big data. Sometimes it is valid to ask ourselves what is really
interesting in a new data set. However, when people hammer a data set over
and over again just because it is available, the newer results usually become less
meaningful. In some cases the results also belong to other disciplines (e.g. social
insights) and there is no contribution to computer science (CS), but still people
try to publish them in CS venues.</p>
      <p>Good data mining is usually problem driven. For this we need to answer
questions such as: What data do we need? How much data do we need? How can
the data be collected? Today collecting data might be cheap, and hence big data can
be just an artifact of this step. After we have the data, we need to worry about
transferring and storing it. In fact, transferring one petabyte even over a
fast Internet link (say 100 Mbps) takes more than two years, too much time for
most applications. On the other hand, today several companies store hundreds
of petabytes and process dozens of terabytes per day.</p>
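      <p>The petabyte transfer figure above is easy to verify with a quick back-of-the-envelope computation (a sketch assuming 1 PB = 10<sup>15</sup> bytes and a fully utilized link):</p>

```python
# Sanity check: time to transfer one petabyte over a 100 Mbps link.
PETABYTE_BITS = 8 * 10**15      # 1 PB = 10^15 bytes = 8 * 10^15 bits
LINK_BPS = 100 * 10**6          # 100 Mbps, assumed fully utilized

seconds = PETABYTE_BITS / LINK_BPS
years = seconds / (365 * 24 * 3600)
print(round(years, 1))          # about 2.5 years, i.e. more than two years
```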
      <p>When we have the data in place, another set of questions appears: Is all the
data unique, or do we need to filter duplicates? Is the data trustworthy, or is there
spam? How much noise is there? Is the data distribution valid, or is there a
hidden bias that needs to be corrected? Which privacy issues must be taken care
of? Do we need to anonymize the data?</p>
      <p>After answering these questions we can focus on the specific mining task: Can
we process all our data? How well does our algorithm scale? The ultimate question
will be related to the results and their usefulness. This last step is clearly
application dependent.</p>
      <p>Another subtle issue is that most of the time when we need to use big data,
the problem is to find the right data inside our large data set. Many times this
golden subset is hard to determine, as we need to discard huge amounts of data,
dealing again with bias, noise, or spam. Hence, another relevant
question is: How do we process and filter our data to obtain the right data?</p>
      <p>Hence, handling large amounts of data poses several challenges related to
the questions and issues above. The first obvious one is scalability, our last
step. Privacy is also very relevant, as it deals with legal and ethical restrictions.
Other challenges come with the data content and its intrinsic quality, such as
redundancy, bias, sparsity, noise, or spam. We briefly cover all these issues in
the following sections.</p>
      <p>There are many other aspects of big data that we do not cover, such as the
complexity and heterogeneity of data, as they are outside the scope of this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Scalability</title>
      <p>We can always collect and use more data, expecting that more data will give
improved results. In many cases that is true, but then transferring, storing, and
processing larger amounts of data may not be feasible, as we challenge the
bandwidth of the communication channels, the disk space of the computing
infrastructure, and the performance of the algorithms used. As Internet bandwidth and
storage have become cheaper, scaling the communication and hardware does not
imply a proportional increase in cost. More data can also imply more noise, as
we discuss later.</p>
      <p>On the other hand, the algorithms used to analyze the data may not scale.
If the algorithm is linear, doubling the data, without modifying the system
architecture, implies doubling the time. This might still be feasible, but for super-linear
algorithms it most surely will not be. In this case, typical solutions are to
parallelize and/or distribute the processing. As all big data solutions already
run on distributed platforms, increasing the amount of data requires increasing
the number of machines, which is clearly costly and proportional to the increase
needed.</p>
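      <p>The effect of super-linear algorithms can be illustrated with a toy cost model (an illustrative sketch, with constant factors ignored):</p>

```python
# Relative running time when the input size doubles, for a linear
# and a quadratic algorithm (constant factors ignored).
def linear_cost(n):
    return n

def quadratic_cost(n):
    return n * n

n = 10**6
print(linear_cost(2 * n) / linear_cost(n))        # 2.0: doubling data doubles time
print(quadratic_cost(2 * n) / quadratic_cost(n))  # 4.0: cost grows much faster
```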
      <p>How else can we deal with more data? Another solution consists of
developing faster (sometimes approximate) algorithms, at the cost of possibly
decreasing the quality of the solution. This becomes a clear option when the loss
in quality is smaller than the improvement obtained with more data. That is, the
time performance improvements should be larger than the loss in solution
quality. This opens an interesting new trade-off challenge in algorithm design and
analysis for data mining problems.</p>
      <p>
        One interesting example of the trade-off mentioned above is lexical
tagging, that is, recognizing all named entities in a text. The best algorithms are
super-linear, but in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Ciaramita and Altun present the first of a family of
linear-time taggers that have high quality (comparable to the state of the art).<sup>2</sup>
To understand the trade-off we can do a back-of-the-envelope analysis. Let us
assume that we can achieve a higher quality result with an algorithm that runs
in time O(n log n), where n is the size of the text. Let us define q as the
extra quality and Q as the quality of the linear algorithm. Without doubt, the
number of correctly detected entities per unit of time should be larger for the
linear algorithm. In fact, if we consider the case when both algorithms use the
same running time, we can show that for enough text, that is n = O(α<sup>q/Q</sup>),
where α &gt; 1 is a constant, the number of correct entities found by the faster
algorithm will be larger. For some cases this will imply big data, but for many
other cases it will not (for example, if the better quality algorithm has quadratic
time performance).
      </p>
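      <p>The back-of-the-envelope argument above can be checked numerically. The following sketch uses hypothetical qualities Q = 0.90 and q = 0.05 (illustrative values, not measurements from the cited work) and gives both taggers the same time budget T, with the O(n log n) tagger solving n log n = T for the amount of text it can process:</p>

```python
import math

Q, q = 0.90, 0.05   # hypothetical qualities: linear tagger Q, slower tagger Q + q

def entities_linear(T):
    # a linear tagger processes n = T tokens within the time budget (unit constants)
    return Q * T

def entities_superlinear(T):
    # solve n * log(n) = T for n by bisection, then count correct entities
    lo, hi = 1.0, max(T, 10.0)
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * math.log(mid) > T:
            hi = mid
        else:
            lo = mid
    return (Q + q) * lo

# With very little data the higher-quality tagger wins; with enough data
# the faster linear tagger finds more correct entities overall.
print(entities_linear(2) > entities_superlinear(2))          # False
print(entities_linear(10**6) > entities_superlinear(10**6))  # True
```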
      <p>
        Another important aspect of scalability is the processing paradigm that we
can use to speed up our algorithms. This is application dependent, as the
degree of parallelization depends on the problem being solved. For example, not
all problems are suitable for the popular map-reduce paradigm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Hence, more
research is needed to devise more powerful paradigms, in particular for the
analysis of large graphs. In some cases we need to consider the dynamic aspect of
big data, as we may then need to do online data processing, which makes
scalability even more difficult. Map-reduce is also not suitable for this case, and
one ongoing initiative for scalable stream data processing is SAMOA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2 http://sourceforge.net/projects/supersensetag/.</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Redundancy and Bias</title>
      <p>
        Data can be redundant. Worse, it usually is. For example, in any sensor network
that tracks mobile objects, there will be redundant data from all sensors that are
nearby. In the case of the Web this is even worse, as we have lexical redundancy
(plagiarism), estimated at 25% [
        <xref ref-type="bibr" rid="ref16 ref2">2, 16</xref>
        ] and semantic redundancy (e.g. the same
meaning written with different words or in different languages), which makes up
an even larger percentage of the content.
      </p>
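      <p>Lexical redundancy of the kind estimated above is typically detected by comparing documents through word shingles. A minimal sketch (real systems hash the shingles, e.g. with MinHash, to scale to the Web):</p>

```python
# Near-duplicate detection with word shingles and Jaccard similarity.
def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    union = a.union(b)
    if not union:
        return 0.0
    return len(a.intersection(b)) / len(union)

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"   # one word changed
d3 = "completely unrelated text about search engines today"
print(jaccard(shingles(d1), shingles(d2)))  # 0.4: substantial lexical overlap
print(jaccard(shingles(d1), shingles(d3)))  # 0.0: no shared shingles
```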
      <p>
        In many cases when we use data samples, the sample can have some speci c
bias. Sometimes this bias can be very hard to notice or to correct. One of the
most well-known examples is click data in search engines, where the data is
biased by the ranking and the user interface (see for example [
        <xref ref-type="bibr" rid="ref10 ref7">10, 7</xref>
        ]). In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] we
show evidence that Web publishers actually perform queries in order to nd some
content and republish. Thus, the conclusion is that part of the Web content is
biased by the ranking function of search engines. Hence, this a ects the quality
of search engines.
      </p>
      <p>Another interesting example of algorithm bias is tag recommendation.
Imagine that we can recommend tags to new objects contributed by people (e.g.
images). If we do so, in the long run, the recommendation algorithm will
generate most tags, not the people. Hence, the resulting tag space is not really a
folksonomy; it is a combination of a folksonomy and a machine-sonomy. This
is a problem not only because we do not have a folksonomy anymore, but also
because the algorithm will not be able to learn if there is not enough new data
coming from users.
</p>
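      <p>The feedback loop described above can be simulated with a toy model (purely illustrative assumptions: each object gets one tag, and users accept a recommended tag with a fixed probability):</p>

```python
import random

# Toy model: with probability p_accept the user adopts the recommended
# (machine) tag, otherwise writes a genuinely human tag.
random.seed(0)
p_accept = 0.7
human, machine = 0, 0
for _ in range(10_000):
    if human + machine == 0 or random.random() > p_accept:
        human += 1     # user contributes a fresh human tag
    else:
        machine += 1   # the recommender's tag is adopted
print(machine / (human + machine))  # close to p_accept: machine tags dominate
```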
    </sec>
    <sec id="sec-4">
      <title>Sparsity, Noise and Spam</title>
      <p>
        Many measures in the Web and other types of data follow a power law, so mining
big data works very well for the head of the power-law distribution without
needing much data. This stops being true when the long tail is considered, because
the data is sparser. Serving long-tail needs is critical for all users, as all users have
their share of head and long-tail needs, as demonstrated in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Yet, it often
happens that not enough data covering the long tail is available when aggregated
at the user level. Also, there will always be cases where the main part of the data
distribution buries the tail (for example, a secondary meaning of a query in
Web search). We explored the sparsity trade-offs regarding personalization and
privacy in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
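      <p>The head/tail asymmetry can be illustrated with a Zipf-like distribution, where item i has frequency proportional to 1/i (an idealized sketch, not actual Web data):</p>

```python
# Zipf-like frequencies: the top items (the head) concentrate a large share
# of the mass, yet the long tail still holds the majority of it.
N = 100_000
freq = [1.0 / i for i in range(1, N + 1)]
total = sum(freq)
head_share = sum(freq[:100]) / total   # top 0.1% of items
tail_share = 1.0 - head_share
print(round(head_share, 2))  # roughly 0.43: little data covers the head well
print(round(tail_share, 2))  # roughly 0.57: most mass sits in sparse tail items
```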
      <p>We can always try to improve results by adding more data, if available. Doing
so, however, might not always be beneficial. For example, if the added data
increases the noise level, results get worse. We could also reach a saturation
point without seeing any improvement, in which case more data is worthless.</p>
      <p>
        Worse results can also be due to Web spam, that is, actions by users,
in the form of content, links, or usage, that are targeted at manipulating some
measurement on the Web. The main example nowadays is Web spam aimed at improving
the ranking of a given website in a Web search engine [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and there are a myriad
of techniques to deal with it [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. However, this manipulation can happen at all
levels, from hotel ratings to even Google Scholar citation counts [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Filtering
spam is a non-trivial problem and is one of the possible bias sources of any data.<sup>3</sup>
      </p>
    </sec>
    <sec id="sec-5">
      <title>Privacy</title>
      <p>Currently, most institutions that deal with personal data guarantee that this
data is not shared with third parties. They also employ communication and
storage that is as secure as possible to assure their clients or users that personal
information cannot be stolen. In some cases, such as in Web search engines,
they have devised data retention policies to reassure regulators, the media and,
naturally, their users, that they comply with all legal regulations. For example,
they anonymize usage logs after six months and de-identify them after 18
months (that is, queries can no longer be mapped even to anonymized users).
One of the problematic twists of data, even more so for big data, is that in many
cases a specific user would prefer past facts to be forgotten, especially in the context of
the Web. This privacy concern keeps growing, notably with the advent of social
networks.</p>
      <p>
        Companies that use any kind of data are accountable to regulators such as the
Federal Trade Commission (FTC) in the United States or should comply with the
Data Protection Directive enacted in 1995 by the European Parliament.
Indeed, the FTC has defined several frameworks for protecting consumer privacy,
especially online [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Recently, the FTC commissioner even threatened to go to
Congress if privacy policies do not "address the collection of data itself, not just
how the data is used" [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. For similar reasons, the European Union is currently
working on a new data protection directive that will supersede the old one.
      </p>
      <p>
        Numerous research efforts have been dedicated to data anonymization. A
favored approach for large data sets is k-anonymity, introduced by Sweeney [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which
proposes to suppress or generalize attributes until each entry in the repository
is identical to at least k - 1 other entries. To motivate this concept, Sweeney
shows that a few attributes are sufficient to identify many characteristics of most
people. For example, a triple such as {ZIP code, date of birth, gender} makes it possible
to identify 87% of citizens in the USA by using publicly available databases. Today,
in most problems where we need to derive insights from big data, k-anonymity
is a de facto standard protection technique.
      </p>
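      <p>A minimal sketch of the k-anonymity check (hypothetical generalized records; real anonymizers search over generalization hierarchies until this condition holds):</p>

```python
from collections import Counter

# A table is k-anonymous if every combination of quasi-identifiers
# (here: generalized ZIP, birth decade, gender) appears at least k times.
def is_k_anonymous(records, k):
    counts = Counter(records)
    return all(c >= k for c in counts.values())

# Hypothetical records after suppressing the last ZIP digits and
# bucketing birth dates into decades.
records = [
    ("080**", "1980s", "F"),
    ("080**", "1980s", "F"),
    ("081**", "1970s", "M"),
    ("081**", "1970s", "M"),
]
print(is_k_anonymous(records, 2))  # True: each tuple appears at least twice
print(is_k_anonymous(records, 3))  # False: further generalization would be needed
```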
      <p>
        However, sometimes anonymizing data is not enough. One of the main
examples appears in the context of Web search engines. In this case, users are
concerned that their queries expose certain facets of their life, interests,
personality, etc. that they might not want to share. This includes sexual preferences,
health issues or even some seemingly minor details such as hobbies or taste in
movies that they might not be comfortable sharing with everybody.
(<sup>3</sup> We distinguish noise that comes from the data itself, e.g., due to the measurement mechanism, from spam, which can be considered artificial noise added by humans.)
Queries and clicks on specific pages indeed provide so much information that the entire
business of computational advertising relies on them. Query logs and click data reveal
so much about users that most search engines stopped releasing logs to the
research community after the infamous AOL incident. Indeed, a privacy breach
in query logs was exposed by two New York Times journalists who managed
to identify one specific user in the anonymized log [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They spotted several
queries, originating from the same user, referring to the same last name or specific
locations. Eventually, they could link these queries to a senior woman who
confirmed having issued not only these queries, but also more private ones. While
not all users could necessarily be as easily identified, this revealed what many
researchers had realized a while back, namely that simply replacing a user name with
a number is not a guarantee of anonymity. Moreover, this incident exposed how
difficult the problem is, as it is very hard to prevent privacy disclosure
when private data can be crossed with a large number of public data sets.
Subsequent research has shown that several attributes, such as gender or age,
can be predicted quite accurately [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Epilogue</title>
      <p>
        Today big data is certainly a trendy keyword. For this reason we have explored
many fundamental questions that we need to address when handling large
volumes of data. For the same reason, this year the first Big Data conferences are
being held, in particular the first IEEE Big Data conference.<sup>4</sup> What is not clear
is the real impact that these conferences will have, and which researchers will be
attracted to them. As [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] states, it could be a matter of size, efficiency, community,
or logistics. Time will tell.
      </p>
      <p>Acknowledgements
We thank Paolo Boldi for his helpful comments. This research was partially
supported by MICINN (Spain) through Grant TIN2009-14560-C03-01.</p>
      <sec id="sec-6-1">
        <title>4 www.ischool.drexel.edu/bigdata/bigdata2013.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          .
          <article-title>Modern Information Retrieval: The Concepts and Technology behind Search</article-title>
          .
          <source>Second Edition</source>
          . Addison-Wesley,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Pereira</given-names>
            <surname>Jr.</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Ziviani</surname>
          </string-name>
          .
          <article-title>Genealogical trees on the Web: a search engine user perspective</article-title>
          .
          <source>In WWW</source>
          <year>2008</year>
          , Beijing, China,
          <year>Apr 2008</year>
          ,
          <fpage>367</fpage>
          -
          <lpage>376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          .
          <article-title>Usage Data in Web Search: Benefits and Limitations</article-title>
          . In Scientific and Statistical Database Management: 24th SSDBM, A. Ailamaki and S. Bowers (eds).
          <source>LNCS 7338</source>
          , Springer, Chania, Crete,
          <fpage>495</fpage>
          -
          <lpage>506</lpage>
          ,
          <year>June 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Barbaro</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Zeller Jr.</surname>
          </string-name>
          .
          <article-title>A Face Is Exposed for AOL Searcher No. 4417749</article-title>
          . The New York Times, Aug 9
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          . SAMOA:
          <article-title>Scalable Advanced Massive Online Analysis</article-title>
          .
          <source>http://samoa-project.net/</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Ciaramita</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Altun</surname>
          </string-name>
          .
          <article-title>Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger</article-title>
          . In Proceedings of Empirical Methods in Natural Language Processing (EMNLP),
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          and
          <string-name>
            <surname>Y. Zhang.</surname>
          </string-name>
          <article-title>A dynamic bayesian network click model for web search ranking</article-title>
          <source>In Proceedings of the 18th international conference on World wide web (WWW'09)</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          .
          <source>MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of 6th Symposium on Operating Systems Design and Implementation (OSDI'04)</source>
          . pp.
          <fpage>137</fpage>
          -
          <lpage>149</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>E.</given-names>
            <surname>Delgado Lopez-Cozar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Robinson-Garcia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Torres-Salinas</surname>
          </string-name>
          .
          <article-title>Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting</article-title>
          . Arxiv: http://arxiv.org/abs/1212.0638,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>G.</given-names>
            <surname>Dupret</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          .
          <article-title>A user browsing model to predict search engine click data from past observations</article-title>
          .
          <source>In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          . pp.
          <fpage>331</fpage>
          -
          <lpage>338</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Federal Trade Commission.
          <article-title>Protecting Consumer Privacy in an Era of Rapid Change. A Proposed Framework for Business and Policymakers</article-title>
          .
          <source>Preliminary FTC Staff Report</source>
          ,
          <year>December 2010</year>
          (http://www.ftc.gov/os/2010/12/101201privacyreport.pdf).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>S.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Broder</surname>
          </string-name>
          , E. Gabrilovich, and
          <string-name>
            <given-names>B.</given-names>
            <surname>Pang</surname>
          </string-name>
          .
          <article-title>Anatomy of the long tail: ordinary people with extraordinary tastes</article-title>
          .
          <source>In Proceedings of the third ACM international conference on Web search and data mining</source>
          ,
          <source>WSDM '10</source>
          , pages
          <fpage>201</fpage>
          -
          <lpage>210</lpage>
          , New York, NY, USA,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomkins</surname>
          </string-name>
          .
          <article-title>\i know what you did last summer": query logs and user privacy</article-title>
          .
          <source>In CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management</source>
          , pages
          <fpage>909</fpage>
          -
          <lpage>914</lpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          .
          <article-title>Big Data Conferences, Here We Come!</article-title>
          .
          <source>IEEE Internet Computing</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>3</issue>
          , May/June 2013, pp.
          <fpage>3</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>J.</given-names>
            <surname>Mullin</surname>
          </string-name>
          .
          <article-title>FTC commissioner: If companies don't protect privacy, we'll go to congress</article-title>
          . paidContent.org,
          <source>the Economics of Digital Content</source>
          ,
          <year>Feb 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          .
          <article-title>Detecting duplicate web documents using click-through data</article-title>
          .
          <source>In Proceedings of the fourth ACM international conference on Web search and data mining</source>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>156</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>N.</given-names>
            <surname>Spirin</surname>
          </string-name>
          and
          <string-name>
            <surname>J. Han.</surname>
          </string-name>
          <article-title>Survey on web spam detection: principles and algorithms</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter archive. Volume 13 Issue 2</source>
          , pp.
          <fpage>50</fpage>
          -
          <lpage>64</lpage>
          ,
          Dec
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>J.</given-names>
            <surname>Surowiecki</surname>
          </string-name>
          .
          <article-title>The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations</article-title>
          .
          <source>Random House</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney.</surname>
          </string-name>
          k-anonymity:
          <article-title>a model for protecting privacy</article-title>
          .
          <source>International Journal on Uncertainty, Fuzziness and Knowledge-based Systems</source>
          ,
          <volume>10</volume>
          (
          <issue>5</issue>
          ):
          <fpage>557</fpage>
          -
          <lpage>570</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>