=Paper=
{{Paper
|id=Vol-1673/paper2
|storemode=property
|title=News Article Position Recommendation Based on the Analysis of Article's Content - Time Matters
|pdfUrl=https://ceur-ws.org/Vol-1673/paper2.pdf
|volume=Vol-1673
|authors=Parisa Lak,Ceni Babaoglu,Ayse Basar Bener,Pawel Pralat
|dblpUrl=https://dblp.org/rec/conf/recsys/LakBBP16
}}
==News Article Position Recommendation Based on the Analysis of Article's Content - Time Matters==
Parisa Lak, Ceni Babaoglu, Ayse Basar Bener, Pawel Pralat
Data Science Laboratory, Ryerson University, Toronto, Canada
parisa.lak@ryerson.ca, cenibabaoglu@ryerson.ca, ayse.bener@ryerson.ca, pralat@ryerson.ca
ABSTRACT
As more people prefer to read news online, newspapers are focusing on personalized news presentation. In this study, we investigate the prediction of an article's position based on the analysis of the article's content using different text analytics methods. The evaluation is performed in 4 main scenarios using articles from different time frames. The results of the analysis show that an article's freshness plays an important role in the prediction of a new article's position. The results of this work also provide insight into how to find an optimized solution for automating the process of assigning a new article the right position. We believe that these insights may further be used in developing content-based news recommender algorithms.

CCS Concepts
• Information systems → Content ranking; Recommender systems;

Keywords
Content-based Recommender System; Text Analytics; Ranking Models; Time-based Analysis

CBRecSys 2016, September 16, 2016, Boston, MA, USA. Copyright remains with the authors and/or original copyright holders.

1. INTRODUCTION
Since the 1990s the Internet has transformed our personal and business lives; one example of such a transformation is the creation of virtual communities [2]. However, there are challenges in the production, distribution and consumption of this media content [8]. Nicholas Negroponte contended that moving towards being digital would affect the economic model for news selection and that users' interests would play a bigger role in news selection [7]. Users therefore actively participate in online personalized communities and expect online news agencies to provide services that are as personalized as possible. Such demand, in turn, puts pressure on news agencies to employ the most recent technology to satisfy their users.

Our research partner, the news agency, is moving towards providing a more personalized service to its subscribed users. Currently, the editors decide which article is placed in which section and to whom the article should be offered (i.e. the subscription type). This decision is made purely based on their experience. Similarly, the position of the news within the first page of the section is decided by the editors. The company would like to first automate the article positioning decision and, in a second step, provide personalized recommendations to its users. It would like to position the news on each page based on the historical behavior of each user, available through the analysis of user interaction logs.

In this work, we investigate different solutions to optimize and automate the process of positioning new articles. The results of this study may further be used towards building personalized news recommendation algorithms for subscribed users at different tiers. The high-level research question that we address in this study is:

RQ: How to predict an article's position in a news website?

To address this question, we evaluate three key factors. First, we compare three text analytics techniques to find the best strategy to analyze the content of the available news articles. Second, we evaluate different classification techniques to find the best performing algorithm for article position prediction. Third, we investigate the impact of the time variable on prediction accuracy. The main contribution of this work is to provide insights to researchers and practitioners on how to tackle a similar problem by providing the results of a large-scale, real-life data analysis.

The rest of this manuscript is organized as follows: Section 2 provides a summary of prior work in this area. Section 3 describes the data and specifies the details of the analysis performed in this work. The results of the analysis are provided in Section 4, followed by the discussion and future directions in Section 5.
2. BACKGROUND
To automate the process of assigning the right position to a news article, researchers have proposed different solutions. In most of the previous studies, a new article's content is analyzed using text analytics solutions. The result of the analysis is then compared with the analysis of previously published articles, and the popularity of the current article is predicted based on its similarity to those previously published articles. Popularity is measured in different ways throughout the literature. For example, Tatar et al. predicted the popularity of articles based on the analysis of the comments provided by users [10]. Another study evaluated an article's popularity based on the amount of attention it received, measured by counting the number of visits [5]. Another article popularity measure, used in recent work by Bansal et al., is based on the analysis of comment-worthy articles, where comment-worthiness is measured by the number of comments on a similar article [1].

In the current work, we consider the popularity measure to be a combination of some of the aforementioned measures. Specifically, we used measures such as an article's number of visits, the duration of visits, the number of comments and the influence of the article's author to evaluate a previous article's popularity. The popularity measure is then used towards the prediction of the article's position on the news website.

To evaluate the content of an article and find the relevant article topics, several text analytics techniques have been used by different scholars [4]. Among these, we selected three commonly used approaches for this study: Keyword Popularity, TF-IDF and Word2Vec, which are explained in Section 3.

3. METHODOLOGY
In this section we specify the details of our data and outline the methodology used to perform our analysis. The general methodology framework used in this study is illustrated in Figure 1.

[Figure 1: Step by step Prediction Methodology]

3.1 Data
One year of historical data was collected from the news agency's archive. Information regarding the articles published from May 2014 to May 2015 was extracted from the agency's cloud space. One dataset, containing the content of each article as well as its author and its publication date and time, was extracted through this process. This dataset is then used to generate the keyword vectors.

As illustrated in Figure 1, another dataset was extracted from the news agency's data warehouse. Information regarding the popularity of the article, such as the author's reputation, the article's freshness and the article type, was included in this dataset. The dataset also contained the news URL as article-related information; this provides the details regarding the article's section and subscription type. The current position of the article is also available in this second dataset. The popularity of the article is then calculated based on the available features and the position of the article on the website. This information, along with the information from the keyword vectors, is then used as input to the machine learning algorithms.
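As a rough illustration of how these two sources might be combined, the following sketch joins the content dataset with the popularity dataset on an article identifier. The file names and column names ("article_id", "published_at", and so on) are assumptions for illustration; the paper does not publish its schema.

```python
# Minimal sketch of assembling the two datasets described above.
import pandas as pd

# Dataset 1: article content, author and publication date/time.
articles = pd.read_csv("articles_2014_2015.csv",
                       parse_dates=["published_at"])

# Dataset 2: popularity features (author reputation, freshness, type),
# the news URL and the article's current position.
popularity = pd.read_csv("article_popularity.csv")

# Join on the article identifier so that each row carries the content,
# the popularity features and the position label together.
data = articles.merge(popularity, on="article_id", how="inner")
```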
3.2 Analysis
We first analysed the content of each article in the first dataset using three text analytics techniques: Keyword Popularity, TF-IDF and word2vec.

For the Keyword Popularity technique, we extracted the keywords embedded in the article's content and generated keyword weights based on the combination of two factors: the number of visits for a particular keyword and the duration of the keyword's presence on the website. For instance, if the article had a keyword such as "Canada", we evaluated the popularity of "Canada" based on the number of times it occurred in the selected section and the number of times an article with the keyword "Canada" was visited previously.
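The paper does not give the exact weighting formula, so the sketch below assumes a simple product of the two factors (keyword visits and time on site) together with the keyword's frequency in the section:

```python
# Hedged sketch of Keyword Popularity weighting; the combination
# function is an assumption, not the paper's exact formula.
from collections import Counter

def keyword_popularity(keywords, visit_counts, days_on_site):
    """keywords: keywords extracted from articles in the selected section.
    visit_counts: keyword -> visits to articles carrying that keyword.
    days_on_site: keyword -> days the keyword stayed on the website."""
    occurrences = Counter(keywords)
    return {
        kw: occ * visit_counts.get(kw, 0) * days_on_site.get(kw, 0)
        for kw, occ in occurrences.items()
    }

# Toy example with the "Canada" keyword from the text:
print(keyword_popularity(
    ["canada", "economy", "canada"],
    {"canada": 12403, "economy": 801},
    {"canada": 30, "economy": 12}))
```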
In the TF-IDF technique, TF measures the frequency of a keyword's occurrence in a document, and IDF measures the importance of that keyword across documents. The output of this technique is a document-term matrix listing the most important words, along with their respective weights, that describe the content of a document [9]. We used the nltk package in Python to perform this analysis over the content of each article.
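A compact way to reproduce such a document-term matrix is shown below, using scikit-learn's TfidfVectorizer rather than the nltk-based code the authors used; treat it as an equivalent sketch, not their implementation. The corpus is a toy stand-in for the news articles.

```python
# TF-IDF sketch: build a document-term matrix and list the
# top-weighted terms of one document.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "canada raises interest rates amid economic growth",
    "election results in canada spark nationwide debate",
    "local team wins championship after dramatic final",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)       # document-term matrix

terms = np.array(vectorizer.get_feature_names_out())
weights = tfidf[0].toarray().ravel()
print(terms[weights.argsort()[::-1][:3]])    # top 3 terms of doc 0
```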
The last text analytics technique used in this study is word2vec, published by Google in 2013. It is a two-layer neural network that processes text [6]. The tool takes a text document as input and produces a keyword vector representation as output. The system constructs a vocabulary from the training text as well as numerical representations of words. It then measures the cosine similarity of words and groups similar words together. In other words, this model provides a simple way to find words with similar contextual meanings [9].
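The paper does not name a word2vec implementation; a minimal sketch with gensim, which trains word vectors and queries cosine-similar words as described above, could look like this (vector_size, window and epochs are illustrative values):

```python
# word2vec sketch with gensim on a toy tokenized corpus.
from gensim.models import Word2Vec

sentences = [
    ["canada", "raises", "interest", "rates"],
    ["canada", "cuts", "interest", "rates"],
    ["team", "wins", "championship", "final"],
    ["team", "loses", "championship", "final"],
]

model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, epochs=200, seed=1)

# Words grouped with "wins" by cosine similarity in this toy corpus.
print(model.wv.most_similar("wins", topn=2))
```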
A set of exploratory data analyses was performed on the second dataset to find the most relevant features for defining an article's popularity. Based on the results of this analysis, we removed the highly correlated features.
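A pairwise-correlation screen of this kind can be sketched as follows; the 0.9 cutoff is an assumption, as the paper does not report its threshold:

```python
# Drop one feature from every highly correlated pair; assumes the
# popularity features sit in a numeric pandas DataFrame.
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9):
    corr = df.corr().abs()
    cols = corr.columns
    to_drop = {
        cols[j]
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if corr.iloc[i, j] > threshold   # keep the first of each pair
    }
    return df.drop(columns=sorted(to_drop))
```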
The popularity measure, along with the position and the keyword vector of each article, is then used in 4 main classification algorithms: support vector machine (SVM), random forest, k-nearest neighbors (KNN) and logistic regression [3]. The results of the analysis are reported only for the first two algorithms (i.e. SVM and random forest), as they were the best performing algorithms among the four on our dataset.
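The four candidate classifiers can be set up with scikit-learn as in the sketch below; default hyperparameters are assumed, since the paper does not report its settings. X_train holds the keyword vectors plus popularity features, y_train the position labels.

```python
# The four classifiers compared in the study.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

classifiers = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

def fit_all(X_train, y_train):
    # Train every candidate model on the same training data.
    return {name: clf.fit(X_train, y_train)
            for name, clf in classifiers.items()}
```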
The steps to perform the prediction analysis are also illustrated in Figure 1. As shown, the analysis is performed in two phases, denoted the "Learning phase" and the "Prediction phase". In the learning phase, the training dataset is cleaned and preprocessed, and the features to be used for the evaluation of popularity are selected based on the exploratory analysis. All observations (i.e. articles) in this dataset are also labeled with their current positions. In the prediction phase, the article content is analyzed and the keyword vectors are created based on the three text analytics techniques. Then, the popularity of the article is calculated based on the available features. The test dataset is then passed through the classifier, which predicts the position of the article. The accuracy of the prediction is evaluated as the ratio of correctly classified instances to the total number of observations, computed with Equation 1:

Accuracy = (TP + TN) / (TP + FP + TN + FN) × 100%    (1)
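The two phases and Equation 1 can be put together in a few lines; the sketch below assumes prepared feature matrices and uses random forest as the fitted classifier:

```python
# Learning phase: fit on labeled training articles.
# Prediction phase: predict test positions and score with Equation 1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def evaluate(X_train, y_train, X_test, y_test):
    clf = RandomForestClassifier().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # accuracy_score is the fraction of correctly classified instances,
    # i.e. (TP + TN) / (TP + FP + TN + FN) in the two-class notation of
    # Equation 1; multiplied by 100 to match the paper's percentage.
    return 100.0 * accuracy_score(y_test, y_pred)
```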
The results of the analysis are reported in the following section.
4. RESULTS
The set of graphs in Figure 2 illustrates the prediction accuracy trend (in percent) for articles' positions in 4 different scenarios using the two classification algorithms. The blue graph shows the accuracy trend for the random forest classification algorithm, while the green graph reports the accuracy for the SVM. The 4 scenarios are based on the training data used in the machine learning algorithms. The first point from the left shows the accuracy for the scenario in which the training set contains the articles from 2 months prior to the publication of the test article. Similarly, the second point from the left shows the scenario in which the training set contains articles from 4 months prior to the publication of the test article, and so on for the 8-month and 12-month scenarios.
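Constructing these time-frame training sets can be sketched as below, assuming the merged dataset with a "published_at" column from the earlier data-assembly sketch:

```python
# Restrict the training set to articles published within the 2, 4, 8
# or 12 months preceding the test article's publication date.
import pandas as pd

def time_window_training_set(data, test_date, months_back):
    start = test_date - pd.DateOffset(months=months_back)
    in_window = (data["published_at"] >= start) & \
                (data["published_at"] < test_date)
    return data[in_window]

# e.g. the 2-month scenario for a test article published on 2015-03-01:
# train = time_window_training_set(data, pd.Timestamp("2015-03-01"), 2)
```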
[Figure 2: Prediction accuracy for SVM and Random forest in 4 time frame scenarios (2, 4, 8 and 12 months) using different article content analysis techniques. Panels: (a) Keyword popularity, (b) TF-IDF, (c) Word2Vec.]
Figure 2(a) shows the accuracy results for the articles when their content (for both the training and test datasets) is analyzed with the Keyword Popularity technique. In this graph we observe that the accuracy of the prediction algorithm is related to the time frame used to build the training set. More specifically, both algorithms perform best when the most recent articles are used in the training set. This clearly shows that the performance of both SVM and random forest depends on the time frame used to define the training set.

Figure 2(b) provides the prediction accuracy when the articles are analyzed with the TF-IDF technique. The result for this content analysis technique further confirms that the prediction accuracy depends on the time frame selected to define the training set. For this type of article content analysis, SVM is always superior to random forest in terms of accuracy.

Figure 2(c) shows the prediction results for the articles evaluated with the Word2Vec technique. The result in this graph differs from the previous two. The accuracy for the most recent articles using SVM is lower than in the other scenarios; however, the differences between the accuracies of the other time-dependent scenarios are not large. Although SVM shows a different accuracy trend for this text analytics technique, the accuracy results for the random forest algorithm are consistent with the results of the prior analyses. Specifically, when using Word2Vec with the random forest algorithm, the best performance is obtained by using the most recent articles in the training set. In contrast, for this text analytics technique the SVM algorithm works best with the older articles. Nevertheless, SVM is not the best performing algorithm for this text analytics technique.

To better illustrate the performance of each text analytics technique in the time-dependent scenarios, Figure 3 is provided.

[Figure 3: Prediction accuracy based on the three content analysis techniques for the 4 time frame scenarios.]

Figure 3 shows the results of the best performing algorithm for each of the three content analytics techniques within the 4 time-dependent scenarios. The blue graph shows the performance of SVM for the TF-IDF technique, and the green and red graphs show the accuracy results of random forest for Keyword Popularity and Word2Vec, respectively. This figure shows that, for all three content analysis techniques, the best prediction performance is achieved when fresh articles are used for training. The accuracy always drops as older articles are added to the training set in the 4-month scenario. With the Word2vec technique, the accuracy increases when articles from 8 months prior are used for training; however, the best performance is still attained when using more recent documents.

Another observation from this analysis is that the TF-IDF technique provides the best text representation, which in turn yields higher prediction accuracy for an article's position.
5. DISCUSSION & FUTURE DIRECTION
Personalized news recommendation is a recently emerged topic of study, driven by the introduction of interactive online news media. The decision on news presentation is made based on the assigned position of the article within the news website. The position of the article can be assigned based on the popularity of the article, and the popularity of the article can be predicted based on the analysis of its content and the similarity of the article's content to previously published articles. Previous articles' popularity is measured with different popularity measures. In this study, we used a combination of an article's popularity attributes as well as attributes from the analysis of the articles' content to predict the position of a new article.

We evaluated the impact of three key factors on the prediction of a new article's position. The results of the analyses provide evidence that all three factors under investigation play a role in the prediction accuracy. One of the important findings of this work is that the results of the analysis of a new article's content should only be compared with recent articles. The analysis shows that as older articles are used as input to the prediction algorithm, the accuracy of the system drops in almost all cases. Also, the best performing prediction algorithm depends on the text analytics technique used in the analysis of the article's content. Regardless of the prediction algorithm, the best text analytics technique for the current dataset is shown to be TF-IDF.

The results of this study can cautiously be extended to other datasets. To avoid the impact of sampling biases, we used the 10-fold cross-validation technique for our prediction models. Also, the analysis of large-scale, real-life data minimizes this threat to the validity of the results of this study. In our future work, we will use the results of this study, as well as the features detected through the exploratory analysis, to design a personalized news recommendation system.
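The 10-fold cross-validation mentioned above corresponds to a standard scikit-learn routine; a sketch, assuming the prepared feature matrix X and position labels y:

```python
# Mean accuracy over 10 folds, guarding against sampling bias.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

def ten_fold_accuracy(X, y):
    scores = cross_val_score(RandomForestClassifier(), X, y,
                             cv=10, scoring="accuracy")
    return 100.0 * scores.mean()
```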
6. ACKNOWLEDGMENTS
The authors would like to thank Bora Caglayan, Zeinab Noorian, Fatemeh Firouzi and Sami Rodrigue, who worked on different stages of this project. This research is supported in part by Ontario Centres of Excellence (OCE) TalentEdge Fellowship Project (TFP)-22085.

7. REFERENCES
[1] T. Bansal, M. Das, and C. Bhattacharyya. Content driven user profiling for comment-worthy recommendations of news and blog articles. In Proceedings of the 9th ACM Conference on Recommender Systems, pages 195-202. ACM, 2015.
[2] P. J. Boczkowski. Digitizing the news: Innovation in online newspapers. MIT Press, 2005.
[3] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83-85, 2005.
[4] S. Lee and H.-j. Kim. News keyword extraction for topic tracking. In Networked Computing and Advanced Information Management, 2008. NCM'08. Fourth International Conference on, volume 2, pages 554-559. IEEE, 2008.
[5] L. Li, D.-D. Wang, S.-Z. Zhu, and T. Li. Personalized news recommendation: a review and an experimental investigation. Journal of Computer Science and Technology, 26(5):754-766, 2011.
[6] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[7] N. Negroponte. Being digital. Vintage, 1996.
[8] J. V. Pavlik. Journalism and new media. Columbia University Press, 2001.
[9] N. Pentreath. Machine Learning with Spark. Packt Publishing Ltd, 2015.
[10] A. Tatar, J. Leguay, P. Antoniadis, A. Limbourg, M. D. de Amorim, and S. Fdida. Predicting the popularity of online articles based on user comments. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, page 67. ACM, 2011.