      News Article Position Recommendation Based on The
          Analysis of Article’s Content - Time Matters

                 Parisa Lak, Ceni Babaoglu, Ayse Basar Bener, Pawel Pralat
                 Data Science Laboratory, Ryerson University, Toronto, Canada
   parisa.lak@ryerson.ca, cenibabaoglu@ryerson.ca, ayse.bener@ryerson.ca, pralat@ryerson.ca


ABSTRACT
As more people prefer to read news on-line, newspapers are focusing on personalized news presentation. In this study, we investigate the prediction of an article's position based on the analysis of the article's content using different text analytics methods. The evaluation is performed in 4 main scenarios using articles from different time frames. The results of the analysis show that an article's freshness plays an important role in the prediction of a new article's position. The results of this work also provide insight on how to find an optimized solution to automate the process of assigning a new article the right position. We believe that these insights may further be used in developing content-based news recommender algorithms.

CCS Concepts
•Information systems → Content ranking; Recommender systems;

Keywords
Content-based Recommender System; Text Analytics; Ranking models; Time-based Analysis

1.   INTRODUCTION
   Since the 1990s, the Internet has transformed our personal and business lives; one example of such a transformation is the creation of virtual communities [2]. However, there are challenges in the production, distribution and consumption of this media content [8]. Nicholas Negroponte has contended that moving towards being digital will affect the economic model for news selection and that users' interests will play a bigger role in news selection [7]. Users therefore actively participate in online personalized communities and expect online news agencies to provide services that are as personalized as possible. Such demand, in turn, puts pressure on news agencies to employ the most recent technology to satisfy their users.
   Our research partner, the news agency, is moving towards providing a more personalized service to its subscribed users. Currently, the editors decide which article is placed in which section and to whom the article should be offered (i.e., the subscription type). This decision is made purely on the basis of their experience. Similarly, the position of the news within the first page of the section is decided by the editors. The company would like to first automate the decision on article position and, in a second step, provide personalized recommendations to its users. It would like to position the news on each page based on the historical behavior of each user, available through the analysis of user interaction logs.
   In this work, we investigate different solutions to optimize and automate the process of positioning new articles. The results of this study may further be used towards building personalized news recommendation algorithms for subscribed users at different tiers. The high level research question that we address in this study is:

   RQ- How to predict an article's position in a news website?

   To address this question, we evaluate three key factors. First, we compare three text analytics techniques to find the best strategy to analyze the content of the available news articles. Second, we evaluate different classification techniques to find the best performing algorithm for article position prediction. Third, we investigate the impact of the time variable on the prediction accuracy. The main contribution of this work is to provide insights to researchers and practitioners on how to tackle a similar problem by providing the results from a large scale real life data analysis.
   The rest of this manuscript is organized as follows: Section 2 provides a summary of prior work in this area. Section 3 describes the data and specifies the details of the analysis performed in this work. The results of the analysis are provided in Section 4, followed by the discussion and future direction in Section 5.

CBRecSys 2016, September 16, 2016, Boston, MA, USA.
Copyright remains with the authors and/or original copyright holders.

2.   BACKGROUND
   To automate the process of assigning the right position to a news article, researchers have proposed different solutions. In most of the previous studies, a new article's content is analyzed using text analytics solutions. The result of the analysis is then compared with the analysis of previously published articles, and the popularity of the current article is predicted based on its similarity to those articles. Popularity is measured in different ways throughout the literature. For example, Tatar et al. predicted the popularity of articles based on the analysis of the comments provided by users [10]. Another study evaluated an article's popularity based on the amount of attention it received, measured by counting the number of visits [5]. Another popularity measure, used in a recent work by Bansal et al., is based on the analysis of comment-worthy articles; comment-worthiness is measured by the number of comments on a similar article [1].
   In the current work, we consider the popularity measure to be a combination of some of the aforementioned measures. Specifically, we used measures such as the article's number of visits, duration of visits, number of comments, and the influence of the article's author to evaluate previous articles' popularity. The popularity measure is then used towards the prediction of an article's position on the news website.
   To evaluate the content of an article and find the relevant article topics, several text analytics techniques have been used by different scholars [4]. Among these, we selected three commonly used approaches for this study: Keyword Popularity, TF-IDF and Word2Vec, which are explained in Section 3.

3.   METHODOLOGY
   In this section we describe our data and outline the details of the methodology used to perform our analysis. The general methodology framework used in this study is illustrated in Figure 1.

Figure 1: Step by step Prediction Methodology

3.1   Data
   One year of historical data was collected from the news agency's archive. Information regarding the articles published from May 2014 to May 2015 was extracted from the agency's cloud space. One dataset, containing the content of the articles as well as their authors and publication dates and times, was extracted through this process. This dataset was then used to generate the keyword vectors.
   As illustrated in Figure 1, another dataset was extracted from the news agency's data warehouse. Information regarding the popularity of the articles, such as author's reputation, article's freshness and article type, was included in this dataset. The dataset also contained the news URL as article-related information; this piece of information provides the details regarding the article's section and subscription type. The current position of each article is also available in this second dataset. The popularity of the article is then calculated based on the available features and the position of the article on the website. This information, along with the keyword vectors, is then used as input to the machine learning algorithms.

3.2   Analysis
   We first analysed the content of each article available in the first dataset using three text analytics techniques: Keyword Popularity, TF-IDF and word2vec.
   For the Keyword Popularity technique, we extracted the keywords embedded in each article's content and generated keyword weights based on the combination of two factors: the number of visits for a particular keyword and the duration of the keyword on the website. For instance, if an article had a keyword such as "Canada", we evaluated the popularity of "Canada" based on the number of times it occurred in the selected section and the number of times an article with the keyword "Canada" was visited previously.
   In the TF-IDF technique, TF measures the frequency of a keyword's occurrence in a document, and IDF measures the importance of that keyword across documents. The output of this technique is a document-term matrix listing the most important words, along with their respective weights, that describe the content of a document [9]. We used the nltk package in Python to perform this analysis on the content of each article.
   The last text analytics technique used in this study is word2vec, published by Google in 2013. It is a two-layer neural network that processes text [6]. This tool takes a text document as input and produces a keyword vector representation as output. The system constructs a vocabulary from the training text as well as numerical representations of the words. It then measures the cosine similarity between words and groups similar words together. In other words, this model provides a simple way to find words with similar contextual meanings [9].
   A set of exploratory data analyses was performed on the second dataset to find the most relevant features to define an article's popularity. Based on the results of these analyses, we removed the highly correlated features. The popularity measure, along with the position and the keyword vector of each article, is then used in 4 main classification algorithms: support vector machine (SVM), Random Forest, k-nearest neighbors (KNN) and logistic regression [3]. The results of the analysis are only reported for the first two algorithms (i.e. SVM and Random Forest), as they were the best performing among the four for our dataset.
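The TF-IDF weighting described above can be sketched in plain Python. This is a minimal illustration of the document-term matrix idea, not the nltk pipeline used in the study, and the toy corpus is made up:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Build a document-term matrix of TF-IDF weights.

    docs: list of token lists. TF is the term's relative frequency
    in the document; IDF is log(N / document frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        matrix.append({term: (count / total) * math.log(n / df[term])
                       for term, count in counts.items()})
    return matrix

# Hypothetical mini-corpus of tokenized article snippets.
docs = [["canada", "election", "news"],
        ["canada", "hockey", "news"],
        ["weather", "forecast", "canada"]]
weights = tf_idf(docs)
# "canada" appears in every document, so its IDF (and weight) is 0;
# rarer terms such as "election" receive a positive weight.
```

Terms that occur everywhere are thus down-weighted, which is exactly why TF-IDF can separate the distinctive content of one article from boilerplate vocabulary shared by all articles.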
   The steps to perform the prediction analysis are also illustrated in Figure 1. As shown, the analysis is mainly performed in two phases, denoted as "Learning phase" and "Prediction phase". In the learning phase, the training dataset is cleaned and preprocessed, and the features to be used for the evaluation of popularity are selected based on the exploratory analysis. All observations (i.e. articles) in this dataset are also labeled with their current positions. In the prediction phase, the article content is analyzed and the keyword vectors are created based on the three text analytics techniques. Then, the popularity of the article is calculated based on the available features. The test dataset is then passed through the classifier, which predicts the position of the article. The prediction accuracy is evaluated as the ratio of correctly classified instances to the total number of observations, and can be computed with Equation 1.

        Accuracy = (TP + TN) / (TP + FP + TN + FN) × 100%              (1)
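Equation 1 can be computed directly from the four confusion-matrix counts. A small sketch follows; the counts below are invented purely for illustration:

```python
def accuracy(tp, fp, tn, fn):
    """Percentage of correctly classified instances (Equation 1)."""
    return (tp + tn) / (tp + fp + tn + fn) * 100

# Hypothetical confusion-matrix counts for one position class:
# 70 true positives, 10 false positives, 15 true negatives,
# 5 false negatives -> (70 + 15) / 100 * 100 = 85.0
print(accuracy(tp=70, fp=10, tn=15, fn=5))  # prints 85.0
```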
   The results of the analysis are reported in the following section.
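The word2vec step described in Section 3.2 groups words by the cosine similarity of their vectors. A minimal sketch of that measure, using made-up 3-dimensional embeddings (real word2vec vectors typically have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: "canada" and "toronto" point in similar
# directions, so their cosine similarity is higher than that of
# "canada" and "hockey".
vectors = {
    "canada":  [0.9, 0.1, 0.3],
    "toronto": [0.8, 0.2, 0.4],
    "hockey":  [0.1, 0.9, 0.2],
}
```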

4.   RESULTS
   The set of graphs in Figure 2 illustrates the prediction accuracy trend for articles' positions in 4 different scenarios using the two classification algorithms. The blue graph shows the accuracy trend for the Random Forest classification algorithm, while the green graph reports the accuracy for SVM. The 4 scenarios are based on the training data used in the machine learning algorithms. The first point from the left shows the accuracy for the scenario in which the training set contains articles from 2 months prior to the publication of the test article. Similarly, the second point from the left shows the scenario in which the training set contains articles from 4 months prior to the publication of the test article, and so on for the 8-month and 12-month scenarios.

Figure 2: Prediction accuracy for SVM and Random Forest in 4 time frame scenarios (2, 4, 8 and 12 months) using different article content analysis techniques: (a) Keyword popularity, (b) TF-IDF, (c) Word2Vec

   Figure 2(a) shows the accuracy results for the articles when their content (for both the training and test datasets) is analyzed with the Keyword Popularity technique. In this graph we observe that the prediction accuracy is related to the time frame used to build the training set. More specifically, both algorithms perform best when the most recent articles are used in the training set. This clearly shows that the performance of both SVM and Random Forest depends on the time frame used to define the training set.
   Figure 2(b) provides the prediction accuracy when the articles are analyzed with the TF-IDF technique. The results for this content analysis technique further confirm that the prediction accuracy depends on the time frame selected to define the training set. For this type of article content analysis, SVM is always superior to Random Forest in terms of accuracy.
   Figure 2(c) shows the prediction results for the articles evaluated with the Word2Vec technique. The result in this graph differs from the previous two. The accuracy for the most recent articles using SVM is lower than in the other scenarios; however, the differences in accuracy between the other time dependent scenarios are not large. Although SVM shows a different accuracy trend for this text analytics technique, the accuracy results for the Random Forest algorithm are consistent with the results from the prior analyses. Specifically, when using Word2Vec with the Random Forest algorithm, the best performance is obtained by using the most recent articles in the training set. On the contrary, with this text analytics technique the SVM algorithm works best when using the older articles. Nevertheless, SVM is not the best performing algorithm for this text analytics technique.
   To better illustrate the performance of each text analytics technique in the time dependent scenarios, Figure 3 is provided.

Figure 3: Prediction accuracy based on the three content analysis techniques for the 4 time frame scenarios

   Figure 3 shows the results from the best performing algorithm for each of the three content analytics techniques within the 4 time dependent scenarios. The blue graph shows the performance of SVM for the TF-IDF technique, and the green and red graphs show the accuracy of Random Forest for Keyword Popularity and Word2Vec, respectively. This figure shows that for all three content analysis techniques, the best prediction performance is achieved when fresh articles are used for training. The accuracy always drops when older articles are added to the training set in the 4-month scenario. With the Word2Vec technique, the accuracy increases when articles from 8 months prior are used for training; however, the best performance is still attained when using more recent documents.
   Another observation from this analysis is that the TF-IDF technique provides the best text evaluation, which in turn yields higher prediction accuracy for an article's position.

5.   DISCUSSION & FUTURE DIRECTION
   Personalized news recommendation is a recently emerged topic of study, driven by the introduction of interactive online news media. The decision on news presentation is made based on the assigned position of the article within the news website. The position of an article can be assigned based on its popularity, which can be predicted from the analysis of its content and the similarity of its content to previously published articles. Previous articles' popularity is measured with different popularity measures. In this study, we used a combination of an article's popularity attributes as well as attributes from the analysis of the article's content to predict the position of a new article.
   We evaluated the impact of three key factors on the prediction of a new article's position. The results of the analyses provide evidence that all three factors under investigation play a role in the prediction accuracy. One of the important findings of this work is that the result of the analysis of a new article's content should only be compared with recent articles. The analysis shows that as older articles are used as input to the prediction algorithm, the accuracy of the system drops in almost all cases. Also, the best performing prediction algorithm depends on the text analytics technique used in the analysis of the article's content. Regardless of the prediction algorithm, the best text analytics technique for the current dataset is shown to be TF-IDF.
   The results of this study can cautiously be extended to other datasets. To avoid the impact of sampling biases, we used the 10-fold cross validation technique for our prediction models. Also, the analysis of large scale real life data minimizes this threat to the validity of the results of this study. In our future work, we will use the results of this study, as well as the features detected through the exploratory analysis, to design a personalized news recommendation system.

6.   ACKNOWLEDGMENTS
   The authors would like to thank Bora Caglayan, Zeinab Noorian, Fatemeh Firouzi and Sami Rodrigue, who worked at different stages of this project. This research is supported in part by Ontario Centres of Excellence (OCE) TalentEdge Fellowship Project (TFP)-22085.

7.   REFERENCES
 [1] T. Bansal, M. Das, and C. Bhattacharyya. Content driven user profiling for comment-worthy recommendations of news and blog articles. In Proceedings of the 9th ACM Conference on Recommender Systems, pages 195–202. ACM, 2015.
 [2] P. J. Boczkowski. Digitizing the news: Innovation in online newspapers. MIT Press, 2005.
 [3] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.
 [4] S. Lee and H.-j. Kim. News keyword extraction for topic tracking. In Networked Computing and Advanced Information Management, 2008. NCM'08. Fourth International Conference on, volume 2, pages 554–559. IEEE, 2008.
 [5] L. Li, D.-D. Wang, S.-Z. Zhu, and T. Li. Personalized news recommendation: a review and an experimental investigation. Journal of Computer Science and Technology, 26(5):754–766, 2011.
 [6] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
 [7] N. Negroponte. Being digital. Vintage, 1996.
 [8] J. V. Pavlik. Journalism and new media. Columbia University Press, 2001.
 [9] N. Pentreath. Machine Learning with Spark. Packt Publishing Ltd, 2015.
[10] A. Tatar, J. Leguay, P. Antoniadis, A. Limbourg, M. D. de Amorim, and S. Fdida. Predicting the popularity of online articles based on user comments. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, page 67. ACM, 2011.