<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>News Article Position Recommendation Based on The Analysis of Article's Content - Time Matters</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Parisa Lak</string-name>
          <email>parisa.lak@ryerson.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ceni Babaoglu</string-name>
          <email>cenibabaoglu@ryerson.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ayse Basar Bener</string-name>
          <email>ayse.bener@ryerson.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pawel Pralat</string-name>
          <email>pralat@ryerson.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Laboratory, Ryerson University</institution>
          ,
          <addr-line>Toronto</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>16</volume>
      <issue>2016</issue>
      <fpage>2</fpage>
      <lpage>5</lpage>
      <abstract>
        <p>As more people prefer to read news online, newspapers are focusing on personalized news presentation. In this study, we investigate the prediction of an article's position based on the analysis of the article's content using different text analytics methods. The evaluation is performed in 4 main scenarios using articles from different time frames. The result of the analysis shows that an article's freshness plays an important role in the prediction of a new article's position. The results from this work also provide insight into how to find an optimised solution for automating the process of assigning a new article the right position. We believe that these insights may further be used in developing content-based news recommender algorithms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Information systems → Content ranking; Recommender systems</p>
      <p>
        Since the 1990s, the Internet has transformed our personal and
business lives, and one example of such a transformation is
the creation of virtual communities [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, there are
challenges in the production, distribution and consumption
of this media content [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Nicholas Negroponte has contended
that moving towards being digital will affect the economic
model for news selection and that users' interests will play a bigger
role in news selection [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Therefore, users actively
participate in online personalized communities, and they expect
online news agencies to provide as many personalized services
as possible. This demand, in turn, puts pressure
on news agencies to employ the most recent technology
to satisfy their users.
      </p>
      <p>Our research partner, the news agency, is moving towards
providing a more personalized service to its subscribed
users. Currently, the editors decide which section each
article is placed in and to whom the article
should be offered (i.e. the subscription type). This
decision is based purely on their experience. Similarly,
the position of the news within the first page of the
section is decided by the editors. The company would like
first to automate the decision on article position and,
in a second step, to provide personalized recommendations
to its users. It would like to position the news on each
page based on the historical behavior of each user, available
through the analysis of user interaction logs.</p>
      <p>In this work, we investigate different solutions to
optimize and automate the process of positioning new
articles. The results of this study may further be used towards
building personalized news recommendation algorithms for
subscribed users at different tiers. The high-level research
question that we address in this study is:</p>
      <p>RQ: How can an article's position on a news website be predicted?
To address this question, we evaluate three key factors.
First, we compare three text analytics techniques to find
the best strategy for analyzing the content of the available
news articles. Second, we evaluate different classification
techniques to find the best performing algorithm for article
position prediction. Third, we investigate the impact of the
time variable on the prediction accuracy. The main
contribution of this work is to provide insights to researchers and
practitioners on how to tackle a similar problem by
providing the results from a large-scale, real-life data analysis.</p>
      <p>The rest of this manuscript is organized as follows.
Section 2 provides a summary of prior work in this area. Section
3 describes the data and specifies the details of the
analysis performed in this work. The results of the analysis are
provided in Section 4, followed by the discussion and
future direction in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>BACKGROUND</title>
      <p>
        Researchers have proposed different solutions to automate the
process of assigning the right position to a news article.
In most previous studies, a new article's content is
analyzed using text analytics solutions. The result of the
analysis is then compared with the analysis of previously
published articles. The popularity of the current article is
predicted based on its similarity to the
previously published articles. Popularity is measured
in different ways throughout the literature. For example, Tatar
et al. predicted the popularity of articles based on the
analysis of the comments provided by users [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Another study evaluated an article's popularity based on the
amount of attention it received, measured by counting the number of
visits [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Another article popularity measure, used
in a recent work by Bansal et al., is based on the analysis
of comment-worthy articles. Comment-worthiness is
measured by the number of comments on a similar article [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>In the current work, we considered the popularity measure
to be a combination of some of the aforementioned measures.
Specifically, we used measures such as an article's number of
visits, the duration of visits, the number of comments and the
influence of the article's author to evaluate the popularity of
previous articles. The popularity measure is then used towards
the prediction of an article's position on the news website.</p>
      <p>
        To evaluate the content of an article and find the relevant
article topics, several text analytics techniques have been used
by different scholars [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Among these, we selected three
commonly used approaches for this study: Keyword popularity,
TF-IDF and Word2Vec, which are
explained in Section 3.
      </p>
    </sec>
    <sec id="sec-3">
      <title>METHODOLOGY</title>
      <p>In this section we specify the details of our data and we
outline the details of the methodology used to perform our
analysis. The general methodology framework that was used
in this study is illustrated in Figure 1.</p>
    </sec>
    <sec id="sec-4">
      <title>Data</title>
      <p>One year of historical data was collected from the news agency's
archive. Information regarding the articles published from
May 2014 to May 2015 was extracted from the agency's
cloud space. One dataset, containing the
content of each article as well as its author and its
publication date and time, was extracted through this process.
This dataset is then used to generate the keyword vector.</p>
      <p>As illustrated in Figure 1, another dataset was also
extracted from the news agency's data warehouse. Information
regarding the popularity of the article, such as the
author's reputation, the article's freshness and the article type, was
included in this dataset. The dataset also contained the
news URL as article-related information. The URL
provides the details regarding the article's section
and the article's subscription type. The current position of the
article is also available in the second dataset. The popularity
of the article is then calculated based on the available features
and the position of the article on the website. This
information, along with the information from the keyword vectors, is
then used as input to the machine learning algorithms.</p>
    </sec>
    <sec id="sec-5">
      <title>Analysis</title>
      <p>We first analysed the content of each article available in
the first dataset using three text analytics techniques:
Keyword Popularity, TF-IDF and word2vec.</p>
      <p>For the Keyword Popularity technique, we extracted the
embedded keywords in the article's content and generated
keyword weights based on the combination of two factors:
the number of visits for a particular keyword and the
duration of the keyword on the website. For instance, if the
article had a keyword such as "Canada", we evaluated the
popularity of "Canada" based on the number of times it
occurred in the selected section and the number of times an
article with the keyword "Canada" was visited previously.</p>
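      <p>A minimal sketch of such a keyword weighting is given below. The exact way the two factors are combined is not specified above, so the product of the occurrence count and the visit count is an assumption made only for illustration, as are the field names <monospace>keywords</monospace> and <monospace>visits</monospace> and the toy data.</p>
      <preformat>
```python
from collections import Counter


def keyword_popularity(articles):
    """Score each keyword found in previously published articles.

    `articles` is a list of dicts with the hypothetical fields
    "keywords" (list of str) and "visits" (int).
    Assumed weight: number of articles carrying the keyword times
    the total visits those articles received.
    """
    occurrences = Counter()
    visits = Counter()
    for article in articles:
        # Count each keyword once per article.
        for keyword in set(article["keywords"]):
            occurrences[keyword] += 1
            visits[keyword] += article["visits"]
    return {kw: occurrences[kw] * visits[kw] for kw in occurrences}


# Toy history, not taken from the study's data.
history = [
    {"keywords": ["Canada", "election"], "visits": 120},
    {"keywords": ["Canada", "hockey"], "visits": 80},
    {"keywords": ["weather"], "visits": 10},
]
weights = keyword_popularity(history)
```
      </preformat>
      <p>In this toy history, "Canada" receives the largest weight because it appears in two articles that were both visited often.</p>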
      <p>
        In the TF-IDF technique, TF (term frequency) measures how often a
keyword occurs in a document, and IDF (inverse document frequency)
measures the importance of that keyword across the document collection. The output of
this technique is a document-term matrix with the list of
the most important words, along with their respective weights,
that describe the content of a document [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We used the NLTK
package in Python to perform this analysis over the content
of each article.
      </p>
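      <p>The weighting can be sketched in a few lines of plain Python. This is not the NLTK-based implementation used in the study; it assumes one common variant (term count divided by document length, multiplied by the logarithm of the number of documents over the document frequency), and the three toy documents are invented for illustration.</p>
      <preformat>
```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    TF is the term count divided by the document length and
    IDF is log(N / document frequency). Other variants exist;
    this unsmoothed form is an assumption.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus of three tokenized "articles".
docs = [
    "canada election results".split(),
    "canada hockey final".split(),
    "weather warning".split(),
]
matrix = tf_idf(docs)
```
      </preformat>
      <p>A term such as "canada", which occurs in two of the three documents, receives a lower weight than "election", which occurs in only one.</p>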
      <p>
        The last text analytics technique used in this study is
word2vec. This technique was published by Google in 2013.
It is a two-layer neural network that processes text [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This
tool takes a text document as input and produces a
keyword vector representation as output. The system
constructs a vocabulary from the training text as well as
numerical representations of words. It then measures the cosine
similarity of words and groups similar words together. In
other words, this model provides a simple way to find
words with similar contextual meanings [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
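      <p>The similarity step can be illustrated as follows. The sketch does not train a word2vec model; the three-dimensional vectors are made-up stand-ins for learned word vectors, and the helper name <monospace>most_similar</monospace> is hypothetical.</p>
      <preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Made-up vectors; real word2vec vectors are learned and much longer.
vectors = {
    "vote": [0.9, 0.1, 0.0],
    "election": [0.8, 0.2, 0.1],
    "hockey": [0.0, 0.1, 0.9],
}


def most_similar(word):
    """Return the other word whose vector is closest by cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


nearest = most_similar("vote")
```
      </preformat>
      <p>With these toy vectors, the word closest to "vote" is "election", which is the kind of grouping of contextually similar words described above.</p>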
      <p>
        A set of exploratory data analyses was performed on the
second dataset to find the most relevant features to define an
article's popularity. Based on the result of these
analyses, we removed the highly correlated features. The
popularity measure, along with the position and the keyword
vector of each article, is then used in 4 main classification
algorithms: support vector machine (SVM), Random Forest,
k-nearest neighbors (KNN) and logistic regression [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The
results of the analysis are only reported for the first two
algorithms (i.e. SVM and Random Forest), as they were the
best performing algorithms among the four for our dataset.
      </p>
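      <p>To make the classification setting concrete, the sketch below implements the simplest of the four algorithms, k-nearest neighbors, in plain Python. It is not one of the two reported models, and the two-dimensional feature vectors (for example a popularity score and a keyword weight) and the position labels are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter


def knn_predict(train_features, train_positions, query, k=3):
    """Predict a position label by majority vote of the k nearest articles."""
    distances = sorted(
        (math.dist(features, query), position)
        for features, position in zip(train_features, train_positions)
    )
    votes = Counter(position for _, position in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors: [popularity score, keyword weight].
train_features = [
    [0.9, 0.8], [0.8, 0.9], [0.7, 0.7],
    [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
]
# Hypothetical labels: 1 = top of the page, 2 = lower on the page.
train_positions = [1, 1, 1, 2, 2, 2]

top_guess = knn_predict(train_features, train_positions, [0.85, 0.75])
low_guess = knn_predict(train_features, train_positions, [0.15, 0.20])
```
      </preformat>
      <p>An article whose features resemble those of previously top-placed articles is assigned position 1, and one that resembles lower-placed articles is assigned position 2.</p>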
      <p>The steps to perform the prediction analysis are also
illustrated in Figure 1. As shown, the analysis is mainly
performed in two phases, denoted as the "Learning phase" and the
"Prediction phase". In the learning phase, the training dataset
is cleaned and preprocessed, and the features to be used for
the evaluation of popularity are selected based on the
exploratory analysis. All observations (i.e. articles) in this
dataset are also labeled with their current positions. In the
prediction phase, the article content is analyzed and the
keyword vectors are created based on the three text analytics
techniques. Then, the popularity of the article is calculated
based on the available features. The test dataset is then passed
through the classifier, which predicts the position of the
article. The accuracy of prediction is evaluated as the ratio of the
number of correctly classified instances to the total number
of observations and can be computed with Equation 1.</p>
      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100\%</tex-math>
        </disp-formula>
      </p>
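      <p>For position labels, Equation 1 amounts to the share of articles whose predicted position matches the actual one. A minimal sketch, with invented labels, is:</p>
      <preformat>
```python
def accuracy_percent(true_positions, predicted_positions):
    """Percentage of observations whose predicted label is correct."""
    if len(true_positions) != len(predicted_positions):
        raise ValueError("inputs must have the same length")
    if not true_positions:
        raise ValueError("inputs must not be empty")
    correct = sum(t == p for t, p in zip(true_positions, predicted_positions))
    return 100.0 * correct / len(true_positions)


# Invented labels: 3 of the 5 predictions match.
actual = [1, 2, 3, 1, 2]
predicted = [1, 2, 1, 1, 3]
score = accuracy_percent(actual, predicted)
```
      </preformat>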
      <p>The result of the analysis is reported in the following
section.</p>
    </sec>
    <sec id="sec-6">
      <title>RESULTS</title>
      <p>The set of graphs in Figure 2 illustrates the trend in
prediction accuracy (in percent) for articles' positions in 4 different
scenarios using the two classification algorithms. The blue
graph shows the accuracy trend for the Random Forest
classification algorithm, while the green graph reports the accuracy
for the SVM. The 4 scenarios are based on the training data
used in the machine learning algorithms. The first point
from the left shows the accuracy for the scenario in which the
training set contains the articles from the 2 months prior to the
publication of the test article. Similarly, the second point
from the left shows the scenario in which the training set
contains articles from the 4 months prior to the publication of
the test article, and so on for the 8-month and 12-month
scenarios.</p>
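      <p>The construction of the four training sets can be sketched as a date filter. The field name <monospace>published</monospace>, the 30-day approximation of a month and the sample articles are assumptions made for this illustration.</p>
      <preformat>
```python
from datetime import date, timedelta


def training_window(articles, test_date, months):
    """Keep articles published in the `months` before `test_date`.

    A month is approximated as 30 days, which is an assumption.
    """
    start = test_date - timedelta(days=30 * months)
    return [
        article for article in articles
        if article["published"] >= start and test_date > article["published"]
    ]


# Invented publication dates within the May 2014 to May 2015 range.
articles = [
    {"id": 1, "published": date(2015, 4, 20)},
    {"id": 2, "published": date(2015, 2, 1)},
    {"id": 3, "published": date(2014, 10, 15)},
    {"id": 4, "published": date(2014, 6, 1)},
]
test_date = date(2015, 5, 1)

# One training set per scenario: 2, 4, 8 and 12 months.
sizes = {m: len(training_window(articles, test_date, m)) for m in (2, 4, 8, 12)}
```
      </preformat>
      <p>Each longer window contains all the articles of the shorter ones plus older material, so the scenarios differ only in how far back the training data reaches.</p>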
      <p>Figure 2(a) shows the accuracy results for the articles
when their content (for both the training and test datasets) is
analyzed using the Keyword Popularity technique. In this
graph we observe that the accuracy of the prediction
algorithm is related to the time frame used to build the
training set. More specifically, both algorithms perform best
when the most recent articles are used in the training set. The graph
clearly shows that the performance of both SVM and
Random Forest is dependent on the time frame that is used to
define the training set.</p>
      <p>Figure 2(b) provides the prediction accuracy
in the case when the articles are analyzed by the
TF-IDF technique. The result for this content
analysis technique further confirms that the accuracy of
prediction is dependent on the time frame selected to define the
training set. For this type of article content analysis, SVM
always outperforms Random Forest in terms of
accuracy.</p>
      <p>Figure 2(c) shows the result of the prediction for the
articles that are evaluated by the Word2Vec technique. The result
from this graph is different from the previous two graphs.
The accuracy for the most recent articles using SVM is
lower than in the other scenarios; however, the difference
between the accuracies of the other time-dependent scenarios
is not large. Although SVM shows a different
accuracy trend for this text analytics technique, the
accuracy results for the Random Forest algorithm are
consistent with the results from the prior analysis. Specifically,
when Word2Vec is used with the Random Forest algorithm, the
best performance is gained through the use of the most
recent articles in the training set. In contrast, with
this text analytics technique SVM
works best when the older articles are used.
Nevertheless, SVM is not the best performing algorithm
for this text analytics technique.</p>
      <p>Figure 2: (a) Keyword popularity; (b) TF-IDF; (c) Word2Vec.</p>
      <p>Figure 3 is provided to better illustrate the performance of each text analytics
technique across the time-dependent scenarios.</p>
      <p>Figure 3 shows the result from the best performing
algorithm for each of the three content analytics techniques within the 4
time-dependent scenarios. The blue graph shows the
performance of SVM for the TF-IDF technique, and the green graph
and the red graph show the accuracy results for Random
Forest for Keyword popularity and Word2Vec, respectively.
This figure shows that for all three content analysis
techniques, the best prediction performance is achieved when
fresh articles are used for training. The accuracy
always drops as older articles are added to the training
set in the 4-month scenario. With the Word2Vec technique, the
accuracy increases when articles from the prior 8 months are used
for training. However, the best performance is still attained
when using more recent documents.</p>
      <p>Another observation from this analysis is that the TF-IDF
technique provides the best text evaluation, which in turn
yields higher prediction accuracy for an article's position.</p>
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION &amp; FUTURE DIRECTION</title>
      <p>Personalized news recommendation is a recently emerged
topic of study that followed the introduction of interactive
online news media. The decision on news presentation
is made based on the assigned position of the article within
the news website. The position of the article can be assigned
based on the popularity of the article. The popularity of the
article can be predicted based on the analysis of its content
and the similarity of the article's content to previously
published articles. The popularity of previous articles is measured
with different popularity measures. In this study, we
used a combination of an article's popularity attributes
as well as the attributes from the analysis of the articles'
content to predict the position of a new article.</p>
      <p>We evaluated the impact of the three key factors on the
prediction of a new article's position. The results from the
analyses provide evidence that all three factors under
investigation in this study play a role in the accuracy of
prediction. One of the important findings from this work is that
the result of the analysis of a new article's content should only
be compared with recent articles. The analysis shows
that as older articles are used as input to the
prediction algorithm, the accuracy of the system drops in almost all
cases. Also, the best performing prediction algorithm is shown
to depend on the text analytics technique used in the
analysis of the article's content. Regardless of the prediction
algorithm, the best text analytics technique for the current
dataset is shown to be TF-IDF.</p>
      <p>The results from this study can cautiously be extended
to other datasets. To avoid the impact of sampling biases,
we used the 10-fold cross-validation technique for our
prediction models. Also, the analysis of large-scale, real-life
data minimizes this threat to the validity of the results of
this study. In our future work, we will use the results
from this study as well as the features detected through the
exploratory analysis to design a personalized news
recommendation system.</p>
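      <p>The splitting step of 10-fold cross-validation can be sketched as follows. The sketch only produces index splits; shuffling and model fitting are omitted, and the sample size of 25 is arbitrary.</p>
      <preformat>
```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if k not in range(2, n_samples + 1):
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    start = 0
    for fold in range(k):
        # The first (n_samples % k) folds receive one extra observation.
        size = n_samples // k + (1 if n_samples % k > fold else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
```
      </preformat>
      <p>Each observation appears in exactly one test fold, so every article is used for evaluation once and for training in the remaining nine rounds.</p>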
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors would like to thank Bora Caglayan, Zeinab
Noorian, Fatemeh Firouzi and Sami Rodrigue, who worked
at different stages of this project. This research is supported
in part by Ontario Centres of Excellence (OCE) TalentEdge
Fellowship Project (TFP)-22085.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          .
          <article-title>Content driven user profiling for comment-worthy recommendations of news and blog articles</article-title>
          .
          <source>In Proceedings of the 9th ACM Conference on Recommender Systems</source>
          , pages
          <fpage>195</fpage>
          –
          <fpage>202</fpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Boczkowski</surname>
          </string-name>
          .
          <article-title>Digitizing the news: Innovation in online newspapers</article-title>
          . MIT Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          .
          <article-title>The elements of statistical learning: data mining, inference and prediction</article-title>
          .
          <source>The Mathematical Intelligencer</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ):
          <volume>83</volume>
          –
          <fpage>85</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.-j.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>News keyword extraction for topic tracking</article-title>
          .
          <source>In Networked Computing and Advanced Information Management</source>
          ,
          <year>2008</year>
          . NCM'
          <volume>08</volume>
          . Fourth International Conference on, volume
          <volume>2</volume>
          , pages
          <fpage>554</fpage>
          –
          <fpage>559</fpage>
          . IEEE,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Personalized news recommendation: a review and an experimental investigation</article-title>
          .
          <source>Journal of Computer Science and Technology</source>
          ,
          <volume>26</volume>
          (
          <issue>5</issue>
          ):
          <volume>754</volume>
          –
          <fpage>766</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <volume>3111</volume>
          –
          <fpage>3119</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Negroponte</surname>
          </string-name>
          .
          <article-title>Being digital</article-title>
          .
          <source>Vintage</source>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Pavlik</surname>
          </string-name>
          .
          <article-title>Journalism and new media</article-title>
          . Columbia University Press,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pentreath</surname>
          </string-name>
          .
          <article-title>Machine Learning with Spark</article-title>
          .
          <source>Packt Publishing Ltd</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tatar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leguay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Antoniadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Limbourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>de Amorim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Fdida</surname>
          </string-name>
          .
          <article-title>Predicting the popularity of online articles based on user comments</article-title>
          .
          <source>In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, page 67. ACM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>