
September

News Article Position Recommendation Based on The Analysis of Article's Content - Time Matters

Parisa Lak

parisa.lak@ryerson.ca

Ceni Babaoglu

cenibabaoglu@ryerson.ca

Ayse Basar Bener

ayse.bener@ryerson.ca

Pawel Pralat

pralat@ryerson.ca

Data Science Laboratory, Ryerson University, Toronto, Canada

2016


As more people prefer to read news online, newspapers are focusing on personalized news presentation. In this study, we investigate the prediction of an article's position based on the analysis of the article's content using different text analytics methods. The evaluation is performed in 4 main scenarios using articles from different time frames. The results of the analysis show that an article's freshness plays an important role in the prediction of a new article's position. The results also provide insight into how to find an optimized solution to automate the process of assigning a new article the right position. We believe that these insights may further be used in developing content-based news recommender algorithms.

Information systems → Content ranking; Recommender systems

Since the 1990s, the Internet has transformed our personal and business lives, and one example of such a transformation is the creation of virtual communities [2]. However, there are challenges in the production, distribution and consumption of this media content [8]. Nicholas Negroponte has contended that moving towards being digital will affect the economic model for news selection and that users' interests will play a bigger role in news selection [7]. Users therefore actively participate in online personalized communities, and they expect the online news agency to provide services that are as personalized as possible. Such demand, in turn, puts pressure on the news agency to employ the most recent technology to satisfy its users.

Our research partner, the news agency, is moving towards providing a more personalized service to its subscribed users. Currently, the editors decide which article is placed in which section and to whom the article should be offered (i.e. the subscription type). This decision is based purely on their experience. Similarly, the position of the news within the first page of the section is decided by the editors. The company would like first to automate the decision on article position and, in a second step, to provide personalized recommendations to its users. It would like to position the news on each page based on the historical behavior of each user, available through the analysis of user interaction logs.

In this work, we investigate different solutions to optimize and automate the process of positioning new articles. The results of this study may further be used towards building personalized news recommendation algorithms for subscribed users at different tiers. The high-level research question that we address in this study is:

RQ: How to predict an article's position in a news website?

To address this question, we evaluate three key factors. First, we compare three text analytics techniques to find the best strategy to analyze the content of the available news articles. Second, we evaluate different classification techniques to find the best performing algorithm for article position prediction. Third, we investigate the impact of the time variable on the prediction accuracy. The main contribution of this work is to provide insights to researchers and practitioners on how to tackle a similar problem, by providing the results from a large-scale real-life data analysis.

The rest of this manuscript is organized as follows: Section 2 provides a summary of prior work in this area. Section 3 describes the data and specifies the details of the analysis performed in this work. The results of the analysis are provided in Section 4, followed by the discussion and future direction in Section 5.

2. BACKGROUND

To automate the process of assigning the right position to a news article, researchers have proposed different solutions. In most previous studies, a new article's content is analyzed using text analytics solutions. The result of the analysis is then compared with the analysis of previously published articles, and the popularity of the current article is predicted based on its similarity to those articles. Popularity is measured in different ways throughout the literature. For example, Tatar et al. predicted the popularity of articles based on the analysis of the comments provided by users [10]. Another study evaluated an article's popularity based on the amount of attention received, measured by counting the number of visits [5]. Another popularity measure, used in a recent work by Bansal et al., is based on the analysis of comment-worthy articles. Comment-worthiness is measured by the number of comments on a similar article [1].

In the current work, we considered the popularity measure to be a combination of some of the aforementioned measures. Specifically, we used measures such as an article's number of visits, the duration of visits, the number of comments and the influence of the article's author to evaluate the popularity of previous articles. The popularity measure is then used towards the prediction of an article's position on the news website.

To evaluate the content of an article and find the relevant article topics, several text analytics techniques have been used by different scholars [4]. Among these, we selected three commonly used approaches for this study: Keyword Popularity, TF-IDF and Word2Vec, which are explained in Section 3.

3. METHODOLOGY

In this section we specify the details of our data and outline the methodology used to perform our analysis. The general methodology framework used in this study is illustrated in Figure 1.

3.1 Data

One year of historical data was collected from the news agency's archive. Information regarding the articles published from May 2014 to May 2015 was extracted from the agency's cloud space. One dataset, with the information regarding the content of the articles as well as their authors and their publication dates and times, was extracted through this process. This dataset is then used to generate the keyword vector.

As illustrated in Figure 1, another dataset was also extracted from the news agency's data warehouse. The information regarding the popularity of the article, such as Author's reputation, Article's freshness and Article type, was included in this dataset. The dataset also contained the news URL as article-related information. This piece of information provides the details regarding the article's section and subscription type. The current position of the article is also available in the second dataset. The popularity of the article is then calculated based on the available features and the position of the article in the website. This information, along with the information from the keyword vectors, is then used as input to the machine learning algorithms.

3.2 Analysis

We first analysed the content of each article available in the first dataset using three text analytics techniques. Keyword Popularity, TF-IDF and word2vec were used to perform this set of analyses.

For the Keyword Popularity technique, we extracted the embedded keywords in the article's content and generated keyword weights based on the combination of two factors: the number of visits for a particular keyword and the duration of the keyword on the website. For instance, if the article had a keyword such as "Canada", we evaluated the popularity of "Canada" based on the number of times it occurred in the selected section and the number of times an article with the keyword "Canada" was visited previously.
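The keyword weighting described above can be sketched roughly as follows. This is only an illustration: the source does not state how the two factors are combined, so the product of the normalized visit total and the normalized time on site is an assumption, and the field names and numbers are hypothetical.

```python
from collections import defaultdict

def keyword_popularity(articles):
    """Return a weight per keyword from visits and time on site.

    `articles` is a list of dicts with hypothetical fields:
    'keywords' (list of str), 'visits' (int), 'days_online' (float).
    The combination rule (product of the two normalized totals) is an
    assumption, not the rule used in the study.
    """
    visits = defaultdict(int)
    days = defaultdict(float)
    for art in articles:
        for kw in set(art["keywords"]):
            visits[kw] += art["visits"]
            days[kw] += art["days_online"]
    # Guard against division by zero when there is no data.
    max_visits = max(visits.values(), default=0) or 1
    max_days = max(days.values(), default=0.0) or 1.0
    return {kw: (visits[kw] / max_visits) * (days[kw] / max_days) for kw in visits}

# Invented sample data for illustration.
articles = [
    {"keywords": ["Canada", "economy"], "visits": 120, "days_online": 2.0},
    {"keywords": ["Canada"], "visits": 80, "days_online": 1.0},
    {"keywords": ["sports"], "visits": 50, "days_online": 0.5},
]
weights = keyword_popularity(articles)
```

In this toy data, "Canada" receives the highest weight because it accumulates both the most visits and the longest time on site.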

In the TF-IDF technique, TF (term frequency) measures how often a keyword occurs in a document, and IDF (inverse document frequency) measures the importance of that keyword by down-weighting keywords that occur in many documents. The output of this technique is a document-term matrix listing the most important words, along with their respective weights, that describe the content of a document [9]. We used the nltk package in Python to perform this analysis over the content of each article.
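To make the TF-IDF weighting concrete, the following plain-Python sketch computes the textbook form, with TF as the relative term count and IDF as log(N / document frequency). It is not the nltk-based code used in the study; the toy documents are invented, and smoothed or sublinear variants would give different numbers.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Returns one {term: weight} dict per document, i.e. a sparse
    document-term matrix.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Invented toy corpus for illustration.
docs = [
    "canada election vote canada".split(),
    "canada hockey game".split(),
    "election debate vote".split(),
]
matrix = tf_idf(docs)
```

In the second document, "hockey" outweighs "canada" because "canada" also appears in another document and is therefore less distinctive.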

The last text analytics technique used in this study is word2vec, which was published by Google in 2013. It is a two-layer neural network that processes text [6]. The tool takes a text document as input and produces keyword vector representations as output. The system constructs a vocabulary from the training text as well as numerical representations of the words. It then measures the cosine similarity of words and groups similar words together. In other words, this model provides a simple way to find words with similar contextual meanings [9].
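The similarity step can be illustrated with a small sketch of cosine similarity over word vectors. Training the word2vec network itself is not shown; the three-dimensional vectors below are invented for illustration and are far smaller than real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)

def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word`."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )

# Toy vectors, invented for illustration only.
vectors = {
    "election": [0.9, 0.1, 0.0],
    "vote": [0.8, 0.2, 0.1],
    "hockey": [0.0, 0.1, 0.9],
}
nearest = most_similar("election", vectors)
```

With these toy vectors, "vote" is the nearest neighbour of "election", which mirrors how words with similar contexts end up grouped together.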

A set of exploratory data analyses was performed on the second dataset to find the most relevant features to define an article's popularity. Based on the results of these analyses, we removed the highly correlated features. The popularity measure, along with the position and the keyword vector of each article, is then used in 4 main classification algorithms: support vector machine (SVM), Random Forest, k-nearest neighbors (KNN) and Logistic regression [3]. The results of the analysis are only reported for the first two algorithms (i.e. SVM and Random Forest), as they were the best performing of the four for our dataset.
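One common way to remove highly correlated features is a greedy filter on pairwise Pearson correlation, sketched below. The source does not say which procedure or cut-off was used, so the 0.9 threshold, the greedy order and the feature values are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns, one value per article.
features = {
    "visits": [100, 200, 300, 400],
    "page_views": [110, 210, 290, 410],  # nearly duplicates "visits"
    "comments": [5, 1, 4, 2],
}
selected = drop_correlated(features)
```

Here "page_views" is dropped because it is almost perfectly correlated with "visits", while "comments" is kept.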

The steps of the prediction analysis are also illustrated in Figure 1. As shown, the analysis is performed in two main phases, denoted as the "Learning phase" and the "Prediction phase". In the learning phase, the training dataset is cleaned and preprocessed, and the features to be used for the evaluation of popularity are selected based on the exploratory analysis. All observations (i.e. articles) in this dataset are also labeled with their current positions. In the prediction phase, the article content is analyzed and the keyword vectors are created based on the three text analytics techniques. Then, the popularity of the article is calculated based on the available features. The test dataset is then passed through the classifier, which predicts the position of the article. The accuracy of prediction is the ratio of the number of correctly classified instances to the total number of observations and can be computed with Equation 1.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100\% \tag{1}$$
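Since article positions form more than two classes, the ratio in Equation 1 reduces to the number of correctly classified articles over the total number of articles. A minimal sketch, with invented position labels:

```python
def accuracy(predicted, actual):
    """Percentage of articles whose predicted position matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Invented labels: three of the five predictions are correct.
actual = [1, 2, 3, 1, 2]
predicted = [1, 2, 1, 1, 3]
score = accuracy(predicted, actual)
```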

The result of the analysis is reported in the following section.

4. RESULTS

The set of graphs in Figure 2 illustrates the trend in prediction accuracy (in percent) for articles' positions in 4 different scenarios using the two classification algorithms. The blue graph shows the accuracy trend for the Random Forest classification algorithm, while the green graph reports the accuracy for SVM. The 4 scenarios are based on the training data used in the machine learning algorithms. The first point from the left shows the accuracy for the scenario in which the training set contains the articles from the 2 months prior to the publication of the test article. Similarly, the second point from the left shows the scenario in which the training set contains articles from the 4 months prior to the publication of the test article, and so on for the 8-month and 12-month scenarios.
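The four training scenarios can be sketched as date filters relative to the test article's publication date. The source does not specify how the windows were delimited, so approximating a month as 30 days is an assumption, and the article records are hypothetical.

```python
from datetime import date, timedelta

def training_window(articles, test_date, months):
    """Select articles published within `months` before `test_date`.

    A month is approximated as 30 days (an assumption).
    """
    start = test_date - timedelta(days=30 * months)
    return [a for a in articles if start <= a["published"] < test_date]

# Hypothetical articles with publication dates.
articles = [
    {"id": 1, "published": date(2014, 6, 1)},
    {"id": 2, "published": date(2014, 11, 10)},
    {"id": 3, "published": date(2015, 2, 10)},
    {"id": 4, "published": date(2015, 4, 25)},
]
test_date = date(2015, 5, 1)
# One training set per scenario: 2, 4, 8 and 12 months.
windows = {m: training_window(articles, test_date, m) for m in (2, 4, 8, 12)}
```

Each larger window is a superset of the smaller ones, so moving from the 2-month to the 12-month scenario only adds older articles to the training set.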

Figure 2(a) shows the accuracy results when the content of the articles (for both the training and test datasets) is analyzed with the Keyword Popularity technique. In this graph we observe that the accuracy of the prediction algorithm is related to the time frame used to build the training set. More specifically, both algorithms perform best when the most recent articles are used in the training set. The performance of both SVM and Random Forest is clearly dependent on the time frame used to define the training set.

Figure 2(b) shows the prediction accuracy when the articles are analyzed with the TF-IDF technique. The result for this content analysis technique further confirms that the accuracy of prediction is dependent on the time frame selected to define the training set. For this type of content analysis, SVM always outperforms Random Forest in terms of accuracy.

Figure 2(c) shows the prediction results for the articles that are evaluated with the Word2Vec technique. The result in this graph differs from the previous two graphs. With SVM, the accuracy for the most recent articles is lower than in the other scenarios, although the differences among the other time-dependent scenarios are not large. Although SVM shows a different accuracy trend for this text analytics technique, the accuracy results for the Random Forest algorithm are consistent with the results from the prior analyses. Specifically, with Word2Vec and the Random Forest algorithm, the best performance is achieved when the most recent articles are used in the training set. In contrast, SVM with this text analytics technique works best when older articles are used. Nevertheless, SVM is not the best performing algorithm for this text analytics technique.

[Figure 2 panels: (a) Keyword popularity, (b) TF-IDF, (c) Word2Vec]

Figure 3 is provided to better illustrate the performance of each text analytics technique across the time-dependent scenarios.

Figure 3 shows the results of the best performing algorithm for each of the three content analytics techniques within the 4 time-dependent scenarios. The blue graph shows the performance of SVM for the TF-IDF technique, and the green and red graphs show the accuracy of Random Forest for Keyword Popularity and Word2Vec, respectively. This figure shows that for all three content analysis techniques, the best prediction performance is achieved when fresh articles are used for training. The accuracy always drops when older articles are added to the training set in the 4-month scenario. For the Word2Vec technique, the accuracy increases when articles from the prior 8 months are used for training. However, the best performance is still attained with the more recent documents.

Another observation from this analysis is that the TF-IDF technique provides the best text evaluation, which in turn yields the highest prediction accuracy for an article's position.

5. DISCUSSION & FUTURE DIRECTION

Personalized news recommendation is a recently emerged topic of study that follows the introduction of interactive online news media. The decision on news presentation is made based on the assigned position of the article within the news website. The position of an article can be assigned based on its popularity, which in turn can be predicted from the analysis of its content and the similarity of that content to previously published articles. The popularity of previous articles is measured with different popularity measures. In this study, we used a combination of popularity measure attributes and attributes from the analysis of the articles' content to predict the position of a new article.

We evaluated the impact of the three key factors on the prediction of a new article's position. The results from the analyses provide evidence that all three factors under investigation in this study play a role in the accuracy of prediction. One of the important findings from this work is that the result of the analysis of a new article's content should only be compared with recent articles. The analysis shows that as older articles are used as input to the prediction algorithm, the accuracy of the system drops in almost all cases. Also, the best performing prediction algorithm depends on the text analytics technique used in the analysis of the article's content. Regardless of the prediction algorithm, the best text analytics technique for the current dataset is TF-IDF.

The results from this study can cautiously be extended to other datasets. To avoid the impact of sampling biases, we used the 10-fold cross-validation technique for our prediction models. Also, the analysis of large-scale real-life data minimizes this threat to the validity of the results of this study. In our future work, we will use the results from this study, as well as the features detected through the exploratory analysis, to design a personalized news recommendation system.
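As a reminder of how 10-fold cross-validation partitions the data, the sketch below yields train and test index lists for k contiguous folds. It is a generic illustration, not the study's code; shuffling the indices before splitting, which is common in practice, is omitted for brevity.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Example with 25 observations: every index is used for testing exactly once.
folds = list(k_fold_indices(25, k=10))
```

Each observation appears in exactly one test fold, and the reported accuracy would be the average over the k folds.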

6. ACKNOWLEDGMENTS

The authors would like to thank Bora Caglayan, Zeinab Noorian, Fatemeh Firouzi and Sami Rodrigue, who worked at different stages of this project. This research is supported in part by Ontario Centres of Excellence (OCE) TalentEdge Fellowship Project (TFP)-22085.

7. REFERENCES

[1] T. Bansal, M. Das, and C. Bhattacharyya. Content driven user profiling for comment-worthy recommendations of news and blog articles. In Proceedings of the 9th ACM Conference on Recommender Systems, pages 195–202. ACM, 2015.

[2] P. J. Boczkowski. Digitizing the news: Innovation in online newspapers. MIT Press, 2005.

[3] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.

[4] S. Lee and H.-j. Kim. News keyword extraction for topic tracking. In Networked Computing and Advanced Information Management, 2008. NCM'08. Fourth International Conference on, volume 2, pages 554–559. IEEE, 2008.

[5] L. Li, D.-D. Wang, S.-Z. Zhu, and T. Li. Personalized news recommendation: a review and an experimental investigation. Journal of Computer Science and Technology, 26(5):754–766, 2011.

[6] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[7] N. Negroponte. Being digital. Vintage, 1996.

[8] J. V. Pavlik. Journalism and new media. Columbia University Press, 2001.

[9] N. Pentreath. Machine Learning with Spark. Packt Publishing Ltd, 2015.

[10] A. Tatar, J. Leguay, P. Antoniadis, A. Limbourg, M. D. de Amorim, and S. Fdida. Predicting the popularity of online articles based on user comments. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, page 67. ACM, 2011.