<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hourly Traffic Prediction of News Stories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luís Marujo</string-name>
          <aff>LTI/CMU, USA; INESC-ID/IST, Portugal</aff>
          <email>Luis.Marujo@inesc-id.pt</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Bugalho</string-name>
          <aff>INESC-ID/IST, Portugal</aff>
          <email>Miguel.Bugalho@l2f.inesc-id.pt</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>João P. Neto</string-name>
          <aff>INESC-ID/IST, Portugal</aff>
          <email>Joao.Neto@inesc-id.pt</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anatole Gershman</string-name>
          <aff>LTI/CMU, USA</aff>
          <email>anatoleg@cs.cmu.edu</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaime Carbonell</string-name>
          <aff>LTI/CMU, USA</aff>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <abstract>
        <p>The process of predicting news stories popularity from several news sources has become a challenge of great importance for both news producers and readers. In this paper, we investigate methods for automatically predicting the number of clicks on a news story during one hour. Our approach is a combination of additive regression and bagging applied over a M5P regression tree using a logarithmic scale (log10). The features included are social-based (social network metadata from Facebook), content-based (automatically extracted keyphrases, and stylometric statistics from news titles), and time-based. In 1st Sapo Data Challenge we obtained 11.99% as mean relative error value which put us in the 4th place out of 26 participants. 1 http://labs.sapo.pt/blog/2011/03/10/1st-sapo-data-challenge/ Predicting the popularity of a news story is a difficult task [9]. Popularity of a story can be measured in terms of the number of views, votes or clicks it receives in a period of time. Clickthrough rate (CTR) is the most popular way of measuring success [7, 10]. It is defined as the ratio of the number of times the user clicked on a page link and the total number of times the link was presented. Popularity of a news story is influenced by many factors, including the item's quality, social influence and novelty. The item's quality is mixture of fluency, rhetoric devices, vocabulary usage, readability level, and the ideas expressed which makes quality hard to measure [15]. The social influence consists on knowing about other people's choices and opinions [9]. Salganik et al. [15] show that item's quality is a weaker predictor of popularity than social influence. This partially explains the difficulty of predicting article popularity based solely on its content and novelty. Most popular portals such as Digg and Slashdot allow users to submit and rate news stories by voting on them. 
This information is often used by a collaborative filtering algorithm to predict popularity, select and order news items [8, 9, 16]. Typically, these models perform linear regression [18] on a logarithmically transformed data. In this paper, we present an approach for predicting the click rate based on a combination of content-based, social network-based, and time-based features. The main novelty of the approach is the type of content-based features we extract (e.g.: number of keyphrases), the inclusion of time-based features, and the prediction process that combines several regression methods to produce and estimate of a number of clicks per hour. This paper is organized as follows: Section 2 presents the dataset used to train and test our predictions system; the description of the proposed prediction methodology is presented in Section 3; the results are described in Section 4, and Section 5 contains conclusions and suggestions for future work.</p>
      </abstract>
      <kwd-group>
        <kwd>Prediction</kwd>
        <kwd>News</kwd>
        <kwd>Clicks</kwd>
        <kwd>Sapo Challenge</kwd>
        <kwd>Traffic</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>“Can we predict the number of clicks that a news story link
receives during one hour?” This was the main research question
proposed in the 1st Sapo Data Challenge – Traffic Prediction
launched by PT Comunicações. Sapo is the largest Portuguese
web portal. Its home page (http://www.sapo.pt) receives about 13
million daily page views and 2.5 million daily visits. The home
page has several sections that link to different types of content,
such as news, videos, opinion articles, blog previews, etc.
Currently, the selection and ordering of news stories is done
mostly manually by the site editors. Obviously, manual solutions
do not scale and we need to find a method for automatically
predicting popularity of news stories.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>The dataset contains 13140 link-hour entries from 1217 news
story links gathered from several news sources over 15
consecutive days. Each link-hour entry records the number of
clicks on a particular link shown in Sapo Portal (Figure 1) during
a particular hour. Each entry contains 8 fields:</p>
      <p>1. Line Number: a number identifying the entry.</p>
      <p>2. Date + Time information: the date and time at which the hits took place, as a string.</p>
      <p>3. Channel ID: a number identifying Sapo’s source (channel) that produced the content. There is content from 18 different sources.</p>
      <p>4. Section (topic): there are 5 possible sections: general, sport, economy, technology, and life.</p>
      <p>5. Subsection: each placeholder is further divided into five subsections: “manchete”, “headlines”, “related”, “footer”, and “null”. This is an important parameter because each subsection is rendered visually smaller than the previous one (Figure 1).</p>
      <p>6. News ID: an integer identifying the content.</p>
      <p>7. Number of hits/clicks: the number of hits that the linked content received during one hour (see field 2 above).</p>
      <p>8. Title: the title of the news story.</p>
      <p>Example: [13116] [2011-03-08 23:00:00] [2] [geral] [manchete] [1214] [401] [Barcelona segue para os quartos-de-final]</p>
      <p>At first, only a training set (95% of all entries) was available; later, the test set was released.</p>
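      <p>The entry format above can be parsed with a small helper. The function and field names below are our own illustrative choices, not part of the challenge toolkit:</p>
      <preformat>
```python
import re
from datetime import datetime

def parse_entry(line):
    """Parse a bracket-delimited link-hour entry into its 8 fields."""
    fields = re.findall(r"\[([^\]]*)\]", line)
    line_no, ts, channel, section, subsection, news_id, clicks, title = fields
    return {
        "line_number": int(line_no),
        "datetime": datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        "channel_id": int(channel),
        "section": section,
        "subsection": subsection,
        "news_id": int(news_id),
        "clicks": int(clicks),
        "title": title,
    }

entry = parse_entry(
    "[13116] [2011-03-08 23:00:00] [2] [geral] [manchete] "
    "[1214] [401] [Barcelona segue para os quartos-de-final]"
)
```
      </preformat>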
    </sec>
    <sec id="sec-3">
      <title>3. Prediction Methodology</title>
      <p>
        To address the click-prediction challenge, we adopted a
supervised learning framework based on WEKA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It consists
of two steps: feature extraction and regression.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Feature Extraction</title>
      <p>Figure 6 provides an overview of the feature extraction process,
which starts with the initial (base) features taken from the dataset
entries and produces an enriched set of 3 types of features:
content-based, social network-based, and time-based. The
content-based features include: the number of web pages
containing the same title (F1), the number of occurrences of
certain key phrases in news articles (F3), and the stylometric
features of the title (F4).</p>
      <sec id="sec-4-1">
        <title>Content-based features (F1, F3, and F4)</title>
        <p>We use the news title as a query in the Sapo RSS search interface to
obtain the total number of stories with the same title (F1).</p>
        <p>[Figure 6 – overview of the feature extraction pipeline: URL extraction,
Facebook statistics extraction, text extraction, AKE, and title feature
extraction combine the initial features into the full feature set.]</p>
        <p>
          The main textual content is extracted using the boilerpipe library
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Our supervised Automatic Key-phrase Extraction (AKE) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
method is applied over the extracted news text from the training
set documents to create a list of key phrases. The AKE system,
developed for European Portuguese Broadcast News [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], is an
extended version of the Maui-indexer toolkit [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which in turn is an
improved version of KEA [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The system can be easily adapted
to support other languages such as English.
        </p>
        <p>We discarded key phrases extracted by the AKE system with a
confidence level lower than 50%. The number of occurrences of the
remaining key phrases is used as article features (F3). The final list
contains 34 key phrases, e.g.: Portugal, United States, market.
These key phrases are used to compare the content of news
stories.</p>
        <p>The stylometric features of the title (F4) include: the number of
words, maximum word length, minimum word length, the number
of quotes, the number of capital letters, and the number of named
entities identified by the MorphAdorner Name Recognizer
(http://morphadorner.northwestern.edu).</p>
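        <p>As an illustration, the surface statistics among these features can be computed in a few lines (a sketch; the named-entity count, obtained from MorphAdorner in our system, is omitted here):</p>
        <preformat>
```python
def title_style_features(title):
    """Compute simple stylometric statistics for a news title."""
    words = title.split()
    lengths = [len(w) for w in words]
    return {
        "num_words": len(words),
        "max_word_len": max(lengths) if lengths else 0,
        "min_word_len": min(lengths) if lengths else 0,
        # count straight and curly double quotes
        "num_quotes": title.count('"') + title.count("\u201c") + title.count("\u201d"),
        "num_capitals": sum(c.isupper() for c in title),
    }

feats = title_style_features("Barcelona segue para os quartos-de-final")
```
        </preformat>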
      </sec>
      <sec id="sec-4-2">
        <title>Social network-based features (F2)</title>
        <p>The social network-based features are metadata information
retrieved from Facebook (F2). The social-based features are
extracted by calling the Facebook API and retrieving Facebook
Metadata or statistics containing the URL of the article: the
number of shares, the number of likes, the number of comments,
and the total number of occurrences in Facebook data. We have
also extracted Twitter metadata, i.e., the number of tweets, but it
was excluded because Portuguese news tweets containing URLs
are rare.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Time-based features (F5)</title>
        <p>There are 3 time-based features: day, time, and the number of
hours elapsed since the initial publication of the article (F5).
The number of hours elapsed since the initial publication of an
article is used as an initial approximation of its novelty.</p>
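        <p>For example, assuming each entry carries its own timestamp and the time the story first appeared, F5 can be derived as:</p>
        <preformat>
```python
from datetime import datetime

def hours_since_publication(entry_time, first_seen):
    """Hours elapsed between an article's first appearance and the
    current link-hour entry (feature F5, a proxy for novelty)."""
    delta = entry_time - first_seen
    return int(delta.total_seconds() // 3600)

age = hours_since_publication(
    datetime(2011, 3, 8, 23, 0, 0),  # current entry
    datetime(2011, 3, 8, 9, 0, 0),   # first time the story appeared
)
```
        </preformat>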
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.2 Regression</title>
      <p>The goal of regression methods is to build a function f(x) that
maps a set of independent variables or features (X1, X2,..., Xn)
into a dependent variable or label Y. In our case, we aim to build
regression models using a training dataset to predict the number of
clicks in the test set.</p>
      <p>In this work we explored a combination of regression algorithms:
linear regression and two regression trees, REPTree and
M5P. To further improve the results, we combined the best
performing regression algorithm (M5P) with two
meta-algorithms: Additive Regression and Bagging.</p>
      <sec id="sec-5-1">
        <title>REPTree – regression-based tree</title>
        <p>
          The REPTree algorithm is a fast regression tree learner based on
C4.5 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. It builds a regression tree using information variance and
reduced-error pruning (with back-fitting), sorting
numeric attributes only once.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>M5P – regression-based tree</title>
        <p>
          The M5P algorithm is used for building regression-based
trees [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ][
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. M5P is a reconstruction of Quinlan's M5 algorithm
for inducing trees of regression models. M5P combines a
conventional decision tree with the possibility of linear regression
functions at the nodes. First, a decision-tree induction algorithm is
used to build a tree; then, instead of maximizing the
information gain at each inner node, a splitting criterion is used
that minimizes the intra-subset variation in the class values down
each branch. The splitting procedure in M5P stops if the class
values of all instances that reach a node vary only slightly, or if only
a few instances remain.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>Bagging</title>
        <p>
          Bagging, also known as Bootstrap aggregating, is a machine
learning meta-algorithm proposed by Breiman [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which is used
with many classification and regression techniques, such as
decision tree models, to reduce the variance associated with the
predictions, thereby improving the results.
        </p>
        <p>The idea consists of using multiple versions of a training set;
each version is created by randomly selecting samples of the
training dataset, with replacement. For each subset, a
regression model is built by applying a previously selected
learning algorithm. The learning algorithm must remain the
same in all iterations. The final prediction is given by
averaging the predictions of all the individual classifiers.</p>
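        <p>The procedure above can be sketched as follows. This is a minimal illustration, assuming a generic fit_base function that trains one base regressor; the 1-nearest-neighbour learner below is only a toy stand-in for the regression trees we actually used:</p>
        <preformat>
```python
import random

def bagging_fit(train, fit_base, n_models=10, seed=0):
    """Train n_models base regressors, each on a bootstrap resample
    (sampling with replacement) of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [train[rng.randrange(len(train))] for _ in train]
        models.append(fit_base(sample))
    return models

def bagging_predict(models, x):
    """Final prediction: average of the individual predictions."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

# Toy base learner: 1-nearest-neighbour regression on a single feature.
def fit_1nn(sample):
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

train = [(1, 10.0), (2, 12.0), (3, 30.0), (4, 33.0)]
models = bagging_fit(train, fit_1nn)
y_hat = bagging_predict(models, 2.5)
```
        </preformat>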
      </sec>
      <sec id="sec-5-4">
        <title>Additive Regression (AR)</title>
        <p>
          Additive regression, or Stochastic Gradient boosting [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], is a
machine learning ensemble meta-algorithm. It
enhances the performance of another regression learner (base or
weak learner) hk(x), such as a regression tree.
        </p>
        <p>This method is an iterative algorithm which constructs additive
models (a sum of weak learners – Equation 2) by fitting a base
learner to the current residue at each iteration. The residue is the
gradient of the loss function L(ti, si), where ti is the true value of
the sample and si = Hk-1(xi). The default loss function is least
squares. At each iteration, a subsample of the data is drawn
uniformly at random, without replacement, from the full
training set. This random subsample is used to train the base
learner hk(x) to produce a model (Equation 2) for the current
iteration.</p>
        <p>Hk(x) = Σi=1..k γi hi(x)                                (2)
where γi denotes the learning rate (expansion coefficients), k is
the number of iterations, and x is the feature vector.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Combining Bagging and Additive Regression</title>
        <p>
          Because bagging is more robust than additive regression on noisy
datasets, while additive regression performs better on small or
noise-free data, Kotsiantis [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] proposed the combination of bagging
and additive regression and showed improvements for regression
trees. Our combination approach consists of using the bagging
meta-classifiers as the base classifiers of the additive regression.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Experimental Evaluation</title>
      <p>In this section, we describe the evaluation procedures used in
this work. We divided the evaluation into 2 stages: at first, the test
set was not available and we ran the experiments on the training
set using 10-fold cross-validation. When the test set
became available, we trained on the whole training set and
evaluated on the test set. We evaluated our results using Mean
Absolute Error and Mean Relative Error. Assume that pi is the
predicted number of clicks and ti is the true value, then:</p>
      <p>Absolute Error: AE(i) = |pi − ti|                                (3)</p>
      <p>Relative Error: RE(i) = |pi − ti| / ti                                (4)</p>
      <p>Cumulative Absolute Error: CAE = Σi AE(i)                                (5)</p>
      <p>Cumulative Relative Error: CRE = Σi RE(i)                                (6)</p>
      <p>Mean Absolute Error: MAE = (1/N) Σi AE(i)                                (7)</p>
      <p>Mean Relative Error: MRE = (1/N) Σi RE(i)                                (8)</p>
      <p>The challenge results were presented as Cumulative Absolute
Error and Cumulative Relative Error. In this paper we opted to
report their means to make the results comparable (cumulative
results depend on the number of examples). It is important to note
that the relative absolute error displayed in the WEKA interface
differs from the equations above, and as a result it was not used in
this work. In the WEKA interface this measure divides the mean
absolute error by the corresponding error of the ZeroR classifier
on the data (i.e., the classifier predicting the prior probabilities of
the classes or values observed in the data).</p>
      <p>Table 1 shows the results of several supervised machine-learning
techniques obtained in the first evaluation stage. This
evaluation was performed on the training set using 10-fold
cross-validation. We tried both REPTree and linear regression, because
they have faster training times. In addition, linear regression is
the method most frequently selected for popularity estimation.
Nevertheless, both were outperformed by M5P.</p>
      <p>M5P regression generates outlier predictions, i.e.: a negative
number of clicks or a very large number of clicks. We
considered two possibilities to solve this problem: either setting
all outliers to 1; or changing the negative outlier values to 1 and
the positive outliers to the maximum number of clicks. The first
solution gave better results. The lack of information explains the
occurrence of these outliers; we therefore used the most frequent
value (1), given the distribution of the number of clicks (Figure 2).
The conversion of the number of clicks to a logarithmic scale
helped to approximate a linear distribution, to which our methods
are better suited. This is easy to see by
comparing the distributions of clicks in Figures 2 and 3. In addition,
it also helped to eliminate outliers.</p>
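      <p>A sketch of this transformation and its inverse (clipping predictions below one click mirrors the outlier handling described above):</p>
      <preformat>
```python
import math

def to_log_clicks(clicks):
    """Transform a click count to a log10 scale for regression."""
    return math.log10(clicks)

def from_log_clicks(log_clicks):
    """Invert the transformation, rounding to a whole number of
    clicks and clipping predictions below one click to 1."""
    return max(1, round(10 ** log_clicks))

y = to_log_clicks(401)
clicks = from_log_clicks(y)
```
      </preformat>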
      <p>Table 1 – Results obtained in the training set using all features
and 10-fold cross-validation (p-value ≈ 0.0). Configurations compared
by MAE: Linear Regression; REPTree; M5P; AR+Bag.+M5P;
AR+Bag.+M5P with outlier values set to 1; AR+Bag.+M5P+log10;
AR+Bag.+M5P+ln.</p>
      <p>The best performing features were the keyphrase-based features.
They capture semantic information at a more detailed level than
topic information does. They include information about locations, such as
Lisbon and Oporto (the capital of Portugal and the second most
important city in Portugal); sports clubs, e.g.: Benfica, (Futebol
Clube do) Porto; politics, e.g.: European Union, Greece, Europe;
economics, e.g.: fee, market; and technology, e.g.: computers.
The stylometric features extracted from the title (F4) were also
useful in reducing both MAE and MRE. The number of hours from
the initial publication (F5) conveyed a very small improvement in
the results. We observed improvement over the base features
when we included bagging and additive regression.</p>
      <p>Without applying the logarithmic transformation, our system falls
to 7th place.</p>
      <p>We noticed the MRE for the test set was higher. Close
examination of the test set revealed that some of its characteristics
were different from the training set. For example, the average
number of clicks and the variance in the number of clicks were
significantly higher in the training set, which caused the algorithm
to overestimate the number of clicks in its predictions for the test set.
That may explain the discrepancy in the results.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Conclusions</title>
      <p>In this paper we described the problem of predicting the number
of clicks that a news story link receives during one hour. Real data
from the largest Portuguese web portal, Sapo, was used to train
and test our proposed prediction methodology. The data was made
available to all participants of the 1st Sapo Data Challenge –
Traffic Prediction.</p>
      <p>Despite the fact that predicting an item’s popularity per hour is a
very difficult task, our approach obtained results that are close to
the real number of clicks (12% MRE and 152 MAE). These
results yielded us 4th place (in all categories) in the challenge, out
of 26 participants.</p>
      <p>
        Results have shown that the social-based and time-based features
had little correlation with the number of clicks in the portal. In
contrast, the content-based features had a very large impact. This
contradicts the results obtained by Lerman [
        <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
        ], which, perhaps,
can be explained by their use of Digg, a social media news site.
Regarding the method, the use of a logarithmic scale on the
number of clicks had the greatest impact on the final result,
especially on MRE (almost 20% improvement on the training set
using 10-fold cross-validation). However, both the refinement of
the regression methods and the constant setting of the
outliers yielded visible improvements. In fact, for the MAE
results, improvements of 16 and 12 percentage points were
obtained for the regression refinement and outlier treatment
respectively, which, in total, is more than double the improvement
obtained with the logarithmic scale transformation (a
gain of 13 percentage points).
      </p>
      <p>In future work, we will investigate ways to increase the use of
automatic key-phrase extraction, e.g.: including a larger set of
concepts that better capture the document contents and topics. We
will also explore if the inclusion of sentiment analysis features can
improve the accuracy of our predictions.</p>
    </sec>
    <sec id="sec-8">
      <title>6. ACKNOWLEDGMENTS</title>
      <p>The authors would like to thank PT Comunicações for providing
the dataset and creating the 1st Sapo Data Challenge – Traffic
Prediction. We want to thank Professors Mário Figueiredo and
Isabel Trancoso for fruitful comments. This research was supported
by FCT through the Carnegie Mellon Portugal Program and under
FCT grant SFRH/BD/33769/2009. This work was also partially
funded by the European Commission under contract
FP7-SME262428 EuTV, QREN SI IDT 2525, and SI IDT 5108. It was
also supported by FCT (INESC-ID multiannual funding) through the
PIDDAC Program funds.</p>
    </sec>
    <sec id="sec-9">
      <title>7. REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>The Long Tail: Why the Future of Business is Selling Less of More</article-title>
          . Hyperion Books.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>1996</year>
          .
          <article-title>Bagging predictors</article-title>
          .
          <source>Machine Learning</source>
          .
          <volume>24</volume>
          ,
          <issue>2</issue>
          ,
          <fpage>123</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Stochastic Gradient Boosting</article-title>
          .
          <source>Computational Statistics &amp; Data Analysis</source>
          .
          <volume>38</volume>
          ,
          <issue>4</issue>
          (
          <year>2002</year>
          ),
          <fpage>367</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>The WEKA data mining software: an update</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          .
          <volume>11</volume>
          ,
          <issue>1</issue>
          (
          <year>2009</year>
          ),
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kohlschutter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fankhauser</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Nejdl</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Boilerplate detection using shallow text features</article-title>
          .
          <source>Proceedings of the third ACM international conference on Web search and data mining</source>
          ,
          <fpage>441</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Kotsiantis</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Combining Bagging and Additive Regression</article-title>
          .
          <source>Sciences-New York</source>
          ,
          <fpage>61</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>König</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gamon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Clickthrough prediction for news queries</article-title>
          .
          <source>Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (N.Y.,USA)</source>
          ,
          <fpage>347</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lerman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Social Information Processing in Social News Aggregation</article-title>
          .
          <source>IEEE Internet Computing: special issue on Social Search</source>
          .
          <volume>11</volume>
          ,
          <issue>6</issue>
          ,
          <fpage>16</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Lerman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hogg</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Using a model of social dynamics to predict popularity of news</article-title>
          .
          <source>Proceedings of the 19th international conference on World wide web (</source>
          <year>2010</year>
          ),
          <fpage>621</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dolan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>E.R.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Personalized News Recommendation Based on Click Behavior</article-title>
          .
          <source>In Proceedings of the 14th International Conference on Intelligent User Interfaces (Hong Kong, China)</source>
          ,
          <fpage>31</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Marujo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viveiros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Neto</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Keyphrase Cloud Generation of Broadcast News</article-title>
          .
          <source>Interspeech</source>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Medelyan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrone</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Subject metadata support powered by Maui</article-title>
          .
          <source>Proceedings of JCDL '10</source>
          (N.Y., USA),
          <fpage>407</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1993</year>
          .
          <article-title>C4.5: programs for machine learning</article-title>
          .
          <source>Machine Learning</source>
          .
          <volume>240</volume>
          , (
          <year>1993</year>
          ),
          <fpage>235</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          <year>1992</year>
          .
          <article-title>Learning with continuous classes</article-title>
          .
          <source>5th Australian joint conference on artificial intelligence</source>
          ,
          <fpage>343</fpage>
          -
          <lpage>348</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Salganik</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodds</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Watts</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Experimental study of inequality and unpredictability in an artificial cultural market</article-title>
          .
          <source>Science</source>
          .
          <volume>311</volume>
          ,
          <issue>5762</issue>
          (Feb.
          <year>2006</year>
          ),
          <fpage>854</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Szabó</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Huberman</surname>
            ,
            <given-names>B.A.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Predicting the popularity of online content</article-title>
          .
          <source>Communications of the ACM</source>
          .
          <volume>53</volume>
          ,
          <issue>8</issue>
          ,
          <fpage>80</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          <year>1996</year>
          .
          <article-title>Induction of Model Trees for Predicting Continuous Classes</article-title>
          .
          <source>Poster papers of the 9th European Conference on Machine Learning</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Data Mining: Practical machine learning tools and techniques</article-title>
          . Morgan Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paynter</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutwin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Nevill-Manning</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>KEA: Practical automatic keyphrase extraction</article-title>
          .
          <source>Proceedings of the 4th ACM conference on Digital libraries</source>
          ,
          <fpage>254</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>