<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Media Monitoring using News Recommenders</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Alberto Lavelli Fondazione Bruno Kessler</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bernardo Magnini Fondazione Bruno Kessler</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Francesco Barile Free University of Bozen-Bolzano</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Francesco Ricci Free University of Bozen-Bolzano</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Marko Tkalcic Free University of Bozen-Bolzano</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Media Monitoring</institution>
          ,
          <addr-line>News Recommender Systems, Content Analysis</addr-line>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Roberto Zanoli Fondazione Bruno Kessler</institution>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Media monitoring (MM) services, also referred to as clipping
services, provide their customers with a daily selection of media
content that is of interest to them. Such content, here referred to as
documents, can be obtained from any kind of media, such as
newspapers and other print media, video and audio services, and web
and social media [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This monitoring service is used by companies
to analyse special topics of interest, in order to determine the impact
on the market and the value of their brands, but also to monitor
competitors and to protect their reputation and plan the company’s
policies [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ]. There are several MM service providers, but very
few papers in the literature describe the technological solutions
they adopt. Meltwater is a monitoring service that tracks keywords and
phrases on more than 300,000 online sources and offers a
personalized dashboard that allows customers to perform different analyses
on the retrieved documents. Cision and News Exposure provide
monitoring of different kinds of media: online monitoring of the
internet, broadcast monitoring of TV and radio transmissions, print
monitoring of newspapers, and social monitoring that analyzes
social networks; they also provide analytic tools. Mention is a social
media monitoring service; hence it focuses on web and social media
content, providing tools to monitor customers’ mentions in real time
and also allowing the tracking of competitors.
      </p>
      <p>Even though these tools use data mining techniques to analyze
the documents and support the customers with reports and
statistics, the selection of the documents related to a customer is
mostly based on keyword matching, using keywords
that reflect the customer’s name, products and competitors.
However, keyword-based techniques are not precise, and it is necessary
to manually inspect the keyword-filtered documents to remove
false positives, which is a time-consuming process. Therefore, the workflow
in the customer company typically involves a human editor who
inspects, on a daily basis, each document provided by the MM service
and decides whether it is really relevant or not. In order to minimize
the daily work done by human editors, we have developed a
recommender system (RS) that operates after the keyword-filtering step
and before the final check by the human editors. Our RS uses automatic
classification techniques to label the keyword-filtered
documents as either relevant or non-relevant, easing the work of
the editors.</p>
      <p>
        The work presented here was done in conjunction with the
Italian company Euregio. The company provides the Infojuice system
(IJ), an MM service offering a tool for the qualitative and
quantitative analysis of how well its customers are represented in the
media. The product is able to collect documents in Italian, German
and English from a range of different sources. More details on this
work can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>THE PROPOSED SOLUTION</title>
      <p>The proposed solution is a recommender system (RS) that
identifies document recommendations by using one classifier for each
customer. The RS is used by the IJ system to support the editors
during their daily activity. Several factors were taken into
consideration in the design of the RS. In particular, since the number
of customers is small and their interests are quite diverse,
collaborative filtering was not applicable. Furthermore, the feedback
that the system collects from the editors is restricted to the actions
of removing documents from the lists produced by the keyword-based
filtering queries; hence, only negative feedback is at our disposal.
Therefore, we decided to implement a solution optimized separately for
each customer by using automatic classification techniques that
leverage data generated by the editors’ actions in order to
distinguish relevant from non-relevant documents.</p>
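      <p>As a minimal sketch of how removal actions could be turned into per-customer training data, the following Python fragment labels each keyword-filtered document. The input format, the customer and document identifiers, and the function name are hypothetical; in particular, treating every document that was not removed as relevant is an assumption made here for illustration.</p>
      <preformat>
```python
from collections import defaultdict


def build_training_sets(candidates, removals):
    """Group keyword-filtered documents by customer and label them.

    candidates: iterable of (customer_id, doc_id) pairs returned by the
                keyword-based queries (hypothetical input format).
    removals:   set of (customer_id, doc_id) pairs removed by the editors.

    Label 0 marks a removed (non-relevant) document. Label 1 marks a
    document that was kept, which is ASSUMED here to be relevant.
    """
    per_customer = defaultdict(list)
    for customer_id, doc_id in candidates:
        label = 0 if (customer_id, doc_id) in removals else 1
        per_customer[customer_id].append((doc_id, label))
    return dict(per_customer)


# Toy data, invented for illustration only.
candidates = [("acme", "d1"), ("acme", "d2"), ("acme", "d3"), ("zeta", "d1")]
removals = {("acme", "d2")}

# One labelled list per customer; each list would feed that customer's classifier.
training_sets = build_training_sets(candidates, removals)
print(training_sets["acme"])
```
      </preformat>
      <p>Keeping one labelled list per customer mirrors the one-classifier-per-customer design: the same document may be relevant for one customer and removed for another.</p>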
      <p>
        We decided to perform the classification tasks with two classical
approaches: Multinomial Naïve Bayes (MNB) and Support Vector
Machines (SVM) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In particular, regarding the SVM classifier,
we decided to use a linear kernel (LIN SVM) and an exponential
one, the radial basis function kernel (RBF SVM) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The
features used by the classifiers are extracted from the documents and
represent information derived from their Title, Text and Source.
The document’s Source (for instance, a magazine)
is represented as a binary vector, with one binary feature associated
with each of the available sources. For the textual components
of the documents, i.e., the Title and the Text, we use a Bag of Words
model (tf-idf) and a Word Embeddings model (fastText [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]).
      </p>
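      <p>The feature construction described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the implementation used in the system: the whitespace tokenizer, the smoothed idf formula (one common variant), the concatenation of Title and Text into a single bag of words, and all names and toy documents are choices made here. The word-embedding features are not shown.</p>
      <preformat>
```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; a real system would do more).
    return text.lower().split()


def fit_idf(token_lists):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf, vocabulary):
    # Raw term count times idf, one entry per vocabulary term.
    counts = Counter(tokens)
    return [counts[term] * idf[term] for term in vocabulary]


def source_vector(source, sources):
    # Binary vector with one feature per known source.
    return [1.0 if source == known else 0.0 for known in sources]


# Toy documents, invented for illustration only.
docs = [
    {"title": "Bank opens branch", "text": "the bank opens a new branch",
     "source": "daily"},
    {"title": "River bank floods", "text": "the river floods the valley",
     "source": "weekly"},
]

token_lists = [tokenize(d["title"] + " " + d["text"]) for d in docs]
idf = fit_idf(token_lists)
vocabulary = sorted(idf)
sources = sorted({d["source"] for d in docs})

# Final feature vector: tf-idf part followed by the binary Source part.
features = [tfidf_vector(tokens, idf, vocabulary) + source_vector(d["source"], sources)
            for tokens, d in zip(token_lists, docs)]
print(len(vocabulary), len(features[0]))
```
      </preformat>
      <p>Terms that occur in every document, such as "bank" in this toy corpus, receive the lowest idf weight, while the Source block contributes exactly one non-zero entry per document.</p>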
    </sec>
    <sec id="sec-3">
      <title>SYSTEM EVALUATION</title>
      <p>
        We evaluated the proposed RS in a set of offline experiments using
a dataset containing 365,430 Italian documents and the editors’
actions performed over a time window of six months. We
considered five reference customers, whose editors performed removal
actions in all six months considered. We evaluated several
configurations, each specified by the classifier, the hyper-parameter
optimization criterion (Precision, P, or F1) and the combination of feature sets used.
The evaluation of a configuration requires two steps: (1) the
selection of the hyper-parameters for each classifier and optimization
criterion, and (2) the evaluation of the configuration by using the
best hyper-parameters determined in the previous step. Since for
many customers the numbers of relevant and non-relevant documents
were substantially different in each test month, we used resampling
strategies to balance these numbers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We found the optimal
values of the classifiers’ hyper-parameters by using grid search. Our
data is time-stamped, so random cross-validation is not
appropriate: validation data must be posterior to training data. We defined
two time-consistent train-validation splits and searched for the
hyper-parameter values that maximise the score (P
or F1) averaged over these two splits. In the first split, the first two months
were used for training and the third month for validation; in the second, the
first three months were used for training and the fourth month for
validation. Once we had identified the best hyper-parameter
combination for a given configuration, we evaluated the performance
of that configuration using the fifth and the sixth months as test sets,
referred to as test month 5 and test month 6 respectively. In the
first case, we trained the classifier using the first four months as the
training set, while in the second case we used the first five months
as the training set.
      </p>
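      <p>The time-consistent protocol described above can be sketched as follows. The month-based splits follow the text; everything else is a hypothetical stand-in: the document records, the one-parameter grid, and the toy scoring function, which replaces training a real classifier and computing P or F1. Resampling is not shown.</p>
      <preformat>
```python
from itertools import product

# (training months, held-out month), as described in the text.
VALIDATION_SPLITS = [([1, 2], 3), ([1, 2, 3], 4)]
TEST_SPLITS = [([1, 2, 3, 4], 5), ([1, 2, 3, 4, 5], 6)]


def split_by_month(documents, train_months, held_out_month):
    # Held-out data is always posterior to the training data.
    train = [d for d in documents if d["month"] in train_months]
    held_out = [d for d in documents if d["month"] == held_out_month]
    return train, held_out


def select_hyper_parameters(documents, grid, fit_and_score):
    """Grid search: keep the point with the best mean validation score."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train_months, validation_month in VALIDATION_SPLITS:
            train, validation = split_by_month(
                documents, train_months, validation_month)
            scores.append(fit_and_score(params, train, validation))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_fit_and_score(params, train, validation):
    # Hypothetical stand-in for "train a classifier, then compute P or F1":
    # it ignores `train` and reports the accuracy of a fixed threshold rule.
    hits = sum((d["score"] >= params["threshold"]) == d["relevant"]
               for d in validation)
    return hits / len(validation)


# Toy documents for six months, invented for illustration only.
documents = [{"month": m, "score": s, "relevant": s >= 0.5}
             for m in range(1, 7) for s in (0.2, 0.4, 0.6, 0.8)]
grid = {"threshold": [0.3, 0.5, 0.7]}

best_params, best_score = select_hyper_parameters(
    documents, grid, toy_fit_and_score)

# Final evaluation on test month 5 uses months 1 to 4 for training.
train_5, test_5 = split_by_month(documents, *TEST_SPLITS[0])
print(best_params, best_score, len(train_5), len(test_5))
```
      </preformat>
      <p>Because the hyper-parameters are chosen using months 1 to 4 only, test months 5 and 6 remain unseen during selection.</p>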
    </sec>
    <sec id="sec-4">
      <title>RESULTS ANALYSIS</title>
      <p>As shown in Table 1, all the proposed algorithms improve on
the precision of the current Keyword-Based System (KBS),
which is the ratio of the number of documents the editor considered
relevant to the total number of documents selected by the KBS.
Moreover, high Precision (equal or close to 1.000) can be obtained if
Precision is used as the optimization criterion in the hyper-parameter selection step. This means that the
system can find a set of relevant documents with few false positives
(i.e., documents identified as relevant are mostly truly relevant). We
also measured the computational time needed to train the classifiers
and to produce the recommendations on the real data of customers
and documents managed by the IJ system. We observed that
the proposed approach can be executed on off-the-shelf hardware in
reasonable time and does not require altering the current workflow.
</p>
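      <p>To make the comparison with the KBS concrete, the sketch below computes Precision, Recall and F1 from editor labels. The labels and predictions are invented toy values, not results from the evaluation; the KBS is modelled as predicting "relevant" for every document it selects, so its precision is the share of selected documents that the editor kept.</p>
      <preformat>
```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (relevant = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy editor decisions: 1 = kept (relevant), 0 = removed (non-relevant).
editor_labels = [1, 0, 1, 1, 0, 0, 1, 0]

# The KBS selects every keyword-matched document, i.e. predicts 1 everywhere.
kbs_predictions = [1] * len(editor_labels)

# Hypothetical output of a precision-oriented classifier.
classifier_predictions = [1, 0, 1, 0, 0, 0, 1, 0]

kbs_precision, _, _ = precision_recall_f1(editor_labels, kbs_predictions)
clf_precision, clf_recall, clf_f1 = precision_recall_f1(
    editor_labels, classifier_predictions)
print(kbs_precision, clf_precision, clf_recall, clf_f1)
```
      </preformat>
      <p>In this toy case the classifier reaches perfect precision at the cost of missing one relevant document, which illustrates the trade-off between optimizing for P and optimizing for F1.</p>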
    </sec>
    <sec id="sec-5">
      <title>CONCLUSIONS</title>
      <p>We have presented a system that reduces the daily work of an
editor for generating press release selections from a large data set
of documents managed by a media monitoring (MM) system. In
the future we plan to improve the developed system. One line of
research will study techniques to identify the documents that are
clearly non-relevant, hence shrinking even more the grey area of
documents that the editor needs to inspect. Additionally, we plan
to evaluate the system’s performance in a multilingual scenario
by using documents in different languages (Italian, German, and
English) and language-independent features. Furthermore, we plan
to collect additional feedback from the editors and also from end
users of the system in order to generate personalised press releases.
Finally, an online evaluation will be performed to validate the
offline results in a real scenario.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGEMENT</title>
      <p>This work was supported by the Autonomous Province of
Bolzano-Bozen (Alto Adige-Südtirol) under the EUCLIP_RES project
(EUregio CrossLinguistic REcommender System).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Barile</surname>
          </string-name>
          , Francesco Ricci, Marko Tkalcic, Bernardo Magnini, Roberto Zanoli, Alberto Lavelli, and
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Speranza</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A News Recommender System for Media Monitoring</article-title>
          .
          In
          <source>IEEE/WIC/ACM International Conference on Web Intelligence (WI '19)</source>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>5</volume>
          (
          <year>2017</year>
          ),
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Aurélien</given-names>
            <surname>Géron</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <source>Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems</source>
          .
          <publisher-name>O'Reilly Media, Inc.</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Prabhakar Raghavan, and
          <string-name>
            <given-names>Hinrich</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Text classification and Naïve Bayes</article-title>
          .
          <source>Introduction to information retrieval 1</source>
          ,
          <issue>6</issue>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Stephen D.</given-names>
            <surname>Rappaport</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Listening solutions: A marketer's guide to software and services</article-title>
          .
          <source>Journal of Advertising Research</source>
          <volume>50</volume>
          ,
          <issue>2</issue>
          (
          <year>2010</year>
          ),
          <fpage>197</fpage>
          -
          <lpage>213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Stavrakantonakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andreea-Elena</given-names>
            <surname>Gagiu</surname>
          </string-name>
          , Harriet Kasper, Ioan Toma, and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Thalhammer</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>An approach for evaluation of social media monitoring tools</article-title>
          .
          <source>Common Value Management</source>
          <volume>52</volume>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ),
          <fpage>52</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Aixin</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ee-Peng</given-names>
            <surname>Lim</surname>
          </string-name>
          , and Ying Liu.
          <year>2009</year>
          .
          <article-title>On strategies for imbalanced text classification using SVM: A comparative study</article-title>
          .
          <source>Decision Support Systems</source>
          <volume>48</volume>
          ,
          <issue>1</issue>
          (
          <year>2009</year>
          ),
          <fpage>191</fpage>
          -
          <lpage>201</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Boyang</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marita</given-names>
            <surname>Vos</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Social media monitoring: aims, methods, and challenges for international companies</article-title>
          .
          <source>Corporate Communications: An International Journal</source>
          <volume>19</volume>
          ,
          <issue>4</issue>
          (
          <year>2014</year>
          ),
          <fpage>371</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>