<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A simple and efficient approach for the semi-automated curation for media reviews</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marc Rössler</string-name>
          <email>Marc.Roessler@Unicepta.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Hilgenhöner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Unicepta GmbH</institution>
          ,
          <addr-line>Salierring 47-53, 50677 Köln</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we discuss a robust and efficient approach to automatically curate news articles into daily media monitoring reports by using document classification. We start with a motivation and a description of the task and work out the characteristics and requirements specific for this use case. Also, we report on initial experiments with simple Naïve Bayes classifiers trained on manually labelled data and discuss them in terms of applicability to the use case. Furthermore, we present the next steps to improve the automated curation without sacrificing efficiency and robustness.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Text Categorization</kwd>
        <kwd>Media Review Curation</kwd>
        <kwd>Multi-Label and Multi-Category Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In the business of media monitoring, clients expect to be continuously provided with
so-called media reviews for them to stay up to date with all topics relevant for their
business. A media review is basically a report consisting of a set of curated media
articles, sent out weekly, daily or even multiple times a day. Curation in that context
includes both the enrichment of the articles with various meta data relevant for that use
case and the organization of the content into rubrics or categories. In this paper, we
solely focus on the organization of the articles and completely ignore the enrichment
steps.</p>
      <p>The rubrics to organize the articles are usually hierarchical trees, with a depth of one,
two or rarely three levels. Examples of first level rubrics are “corporate news”, “news
about corporate products”, and “general news within the industry”, while a second-level
rubric can be, e.g., a specific product.</p>
      <p>Unicepta GmbH currently produces up to 500 media reviews on a daily basis. They
are curated by human decisions in combination with a powerful, client specific filtering
based on Boolean searches. The incoming data stream consists of up to 3 million hits
per day that are filtered down, based on search terms, to roughly 50,000 documents. The
set of filtered documents is the data pool that is used to populate the approximately
7,000 rubrics used to organize the set of media reviews.</p>
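      <p>As an illustrative sketch only (not the production filtering engine), such client-specific Boolean filtering can be thought of as evaluating a small query tree against each incoming article; all names and the query shown are hypothetical:</p>
      <preformat><![CDATA[
```python
def matches(text, query):
    """Evaluate a tiny Boolean query tree against an article's text.

    A query is a tuple: ("term", word), ("and", q1, q2), ("or", q1, q2)
    or ("not", q). Production Boolean search supports far richer syntax;
    this is an illustration of the filtering idea only."""
    op = query[0]
    if op == "term":
        return query[1].casefold() in text.casefold()
    if op == "and":
        return matches(text, query[1]) and matches(text, query[2])
    if op == "or":
        return matches(text, query[1]) or matches(text, query[2])
    if op == "not":
        return not matches(text, query[1])
    raise ValueError("unknown operator: " + op)

# Hypothetical example: keep articles mentioning the client, excluding sports.
query = ("and", ("term", "Unicepta"), ("not", ("term", "sports")))
```
]]></preformat>
      <p>In practice, a stack of such per-client queries reduces the incoming stream before any human or ML-based curation takes place.</p>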
      <p>This setup obviously has the potential to become more efficient by delegating some of
the human decisions to a Machine Learning (ML) based algorithm. As the resulting media
review involves further human qualification and enrichment (e.g. requesting
topic-oriented abstracts for certain articles and topics), the algorithm does not replace
the human decisions but aims at increasing the efficiency of the production process.</p>
    </sec>
    <sec id="sec-1b">
      <title>Characteristics of the task, setup and requirements</title>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>Assigning documents into a set of predefined categories is a long-known problem in
ML with many successful approaches to solve it. Most approaches are usually based on
supervised ML i.e. they require a set of annotated data to train a classifier in order to
predict the categories of a document.</p>
      <p>Among the ML approaches used are Naïve Bayes, Logistics Regression, Decision
Trees, SVMs and most recently also Neural Networks and/or word embeddings created
with Neural Networks.</p>
      <p>
        Linear SVMs hold a prominent place: they demonstrated superior performance at
reasonable computational cost on this task early on [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and have continued to do so for a long period of time.
      </p>
      <p>
        Under the current paradigm of pre-trained models, methods like BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and XLNet [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
outperform or achieve the state of the art in a variety of tasks, including question
answering, named entity recognition, and natural language inference. Applying this
paradigm to text categorization is very interesting [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], despite computational costs that
are significantly higher than those of any other approach.
      </p>
      <p>
        Four editions of a challenge on large-scale text classification were conducted
from 2010 to 2014 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The challenge, named LSHTC, aimed at assessing the performance
of classification systems on large-scale multi-label and hierarchical classification over a
large number of categories.
      </p>
      <p>
        Our task also shares certain characteristics of “extreme multi-label classification”
(XMC - see e.g. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]), though on a smaller scale. XMC refers to the task of learning a classifier
which can assign a small subset of relevant labels to an instance from a very large set
of target labels. An important statistical characteristic of XMC datasets is that a
large fraction of the labels are tail labels, i.e. labels that have very few training instances.
      </p>
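      <p>The notion of tail labels is easy to quantify; a minimal sketch follows, where the threshold of five instances is an arbitrary illustrative choice:</p>
      <preformat><![CDATA[
```python
from collections import Counter

def tail_label_fraction(labels_per_doc, max_instances=5):
    """Fraction of distinct labels occurring in at most max_instances documents.

    labels_per_doc is one set of labels per training document."""
    counts = Counter(label for doc in labels_per_doc for label in doc)
    tail = sum(1 for c in counts.values() if c <= max_instances)
    return tail / len(counts)
```
]]></preformat>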
    </sec>
    <sec id="sec-3">
      <title>Our approach</title>
      <p>The input to process consists of news articles, both online content and print content that
was OCRed. This input is preprocessed and transformed into a feature vector in the
following way. All content (headline, sub-headline and body) is combined into a single
string that is used for language detection. After markup removal, reduction to letters
and digits, case folding, and umlaut normalization, the string is tokenized. All stop words
are filtered out and stemming is applied to the resulting tokens.</p>
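      <p>A minimal sketch of this preprocessing chain in standard-library Python; the stop-word list and the toy suffix-stripping stemmer are illustrative stand-ins for the full per-language resources used in production:</p>
      <preformat><![CDATA[
```python
import re

# Illustrative stand-ins; production uses full per-language resources.
STOP_WORDS = {"der", "die", "das", "und", "ist", "ein", "eine", "in", "mit"}
UMLAUT_MAP = str.maketrans({"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"})

def naive_stem(token):
    """Toy suffix stripper standing in for a real stemmer (e.g. Snowball)."""
    for suffix in ("ungen", "ung", "en", "er", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(headline, sub_headline, body):
    """Combine all content, strip markup, normalize, tokenize, filter, stem."""
    text = " ".join((headline, sub_headline, body))
    text = re.sub(r"<[^>]+>", " ", text)            # remove markup
    text = text.casefold().translate(UMLAUT_MAP)    # case folding + umlauts
    text = re.sub(r"[^a-z0-9]+", " ", text)         # keep letters and digits
    return [naive_stem(t) for t in text.split() if t not in STOP_WORDS]
```
]]></preformat>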
      <p>
        To better account for duplicates and near duplicates within the training data, articles
with identical or very similar content are grouped. The similarity is computed based on
keywords extraction as described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For each group only one article is kept and all
labels from the group are assigned to it.
      </p>
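      <p>The grouping step can be sketched as follows. Note that [7] extracts keywords via word co-occurrence statistics; the frequency-based extraction below is a deliberately simplified stand-in, and the similarity threshold is a hypothetical parameter:</p>
      <preformat><![CDATA[
```python
from collections import Counter

def keywords(tokens, k=10):
    # Simplified stand-in for the co-occurrence-based extraction of [7]:
    # take the k most frequent tokens as the article's keyword set.
    return frozenset(t for t, _ in Counter(tokens).most_common(k))

def jaccard(a, b):
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0

def group_duplicates(articles, labels, threshold=0.8):
    """Greedy grouping of (near-)duplicates: each article joins the first
    kept article whose keyword set is similar enough; one article per group
    is kept and it inherits the union of all labels in the group."""
    kept = []           # (article id, keyword set) of group representatives
    merged_labels = {}  # representative id -> union of labels
    for art_id, tokens in articles.items():
        kw = keywords(tokens)
        for rep_id, rep_kw in kept:
            if jaccard(kw, rep_kw) >= threshold:
                merged_labels[rep_id].update(labels[art_id])
                break
        else:
            kept.append((art_id, kw))
            merged_labels[art_id] = set(labels[art_id])
    return merged_labels
```
]]></preformat>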
      <p>Feature extraction is based on the 5,000 most frequent words per language, and
TF-IDF is computed as the weight per token. Additional features, especially article
metadata, are currently not reflected in the feature set.</p>
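      <p>A compact sketch of this feature extraction, equivalent in spirit to, e.g., scikit-learn's TfidfVectorizer with max_features=5000; the exact TF-IDF variant used in production may differ:</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter

def build_vocab(docs, max_features=5000):
    """Vocabulary = the max_features most frequent tokens in the corpus."""
    freq = Counter(t for doc in docs for t in doc)
    return [t for t, _ in freq.most_common(max_features)]

def tfidf_vectors(docs, vocab):
    """One dense TF*IDF weight vector per tokenized document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                     # document frequency per token
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vectors
```
]]></preformat>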
      <p>We apply a multinomial Naïve Bayes classifier in a binary relevance setup and
train one classifier per language and label. The training involves random subspace
sampling, i.e. multiple classifiers are trained on subsets of documents and features, and
AdaBoost is used to further improve performance. This approach is chosen for its low
computational costs compared to SVMs or even neural networks.</p>
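      <p>A minimal standard-library sketch of the binary relevance setup with a multinomial Naïve Bayes classifier using Laplace smoothing; the random subspace sampling and AdaBoost stages (available, e.g., in scikit-learn) are omitted here for brevity:</p>
      <preformat><![CDATA[
```python
import math

class MultinomialNB:
    """Minimal multinomial Naïve Bayes over token-count vectors, with
    Laplace smoothing. Assumes every class occurs at least once in y."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        counts = {c: [1.0] * n_features for c in self.classes}  # Laplace
        n_docs = {c: 0 for c in self.classes}
        for x, label in zip(X, y):
            n_docs[label] += 1
            for j, v in enumerate(x):
                counts[label][j] += v
        self.log_prior = {c: math.log(n_docs[c] / len(y)) for c in self.classes}
        self.log_lik = {c: [math.log(v / sum(counts[c])) for v in counts[c]]
                        for c in self.classes}
        return self

    def predict_one(self, x):
        def score(c):
            return self.log_prior[c] + sum(
                v * lp for v, lp in zip(x, self.log_lik[c]) if v)
        return max(self.classes, key=score)

def train_binary_relevance(X, labels_per_doc, all_labels):
    """One binary classifier per label: documents carrying the label are
    positive instances, all other documents are negative instances."""
    models = {}
    for label in all_labels:
        y = [1 if label in doc_labels else 0 for doc_labels in labels_per_doc]
        models[label] = MultinomialNB().fit(X, y)
    return models
```
]]></preformat>
      <p>Because each per-label model is independent, training parallelizes trivially across labels and languages, which is what keeps the computational cost low.</p>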
      <p>As evaluation metric, we decided to use average precision, which basically
represents the area under the curve in a recall/precision diagram. This metric is not
indicative for setups where only the top-n documents ought to be selected, as it takes into
account all predictions of the classifier and not just the top n.</p>
      <p>As overall metric, we combine the average precision of all classifiers, weighted by
the number of predictions per classifier.</p>
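      <p>For concreteness, average precision and its prediction-weighted combination can be computed as follows (a sketch; the function and variable names are ours):</p>
      <preformat><![CDATA[
```python
def average_precision(scores, relevant):
    """Average precision for one binary classifier: the mean of precision@k
    over the ranks k at which a relevant document appears. This summarizes
    the whole recall/precision curve, not just the top-n predictions."""
    ranked = sorted(zip(scores, relevant), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for k, (_, rel) in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def weighted_average_precision(per_classifier):
    """Combine per-classifier AP values, weighted by the number of
    predictions each classifier made: a list of (ap, n_predictions)."""
    total = sum(n for _, n in per_classifier)
    return sum(ap * n for ap, n in per_classifier) / total
```
]]></preformat>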
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>For our experiments, we focused on one prototypical media review. We used
approximately nine months of data, which corresponds to 14,000 articles assigned to 31
categories. We did not exclude any category, even though some only had a handful of
positive training instances. We held out 20% of the data as test data for our experiments.
For the training of the binary classifiers, all articles assigned to another category were
marked as negative instances.</p>
      <p>In our experiments, we found that the training time of a Naïve Bayes setup, even
with many classes and many more features, is still short, and we expect this to enable us
to retrain all classifiers multiple times a day on cloud hardware at very reasonable cost.
We also looked at the impact of the amount of training data and ran a set of experiments
in which we compared performance on training sets ranging from 2,000 up to 14,000
documents, in steps of 2,000 documents.
</p>
      <fig id="fig1">
        <caption><p>Fig. 1. Average precision (y-axis, 0.45 to 0.65) as a function of the number of training documents (x-axis, 0 to 14,000).</p></caption>
      </fig>
      <p>
We have shown that a simple and efficiently trainable approach to text categorization
yields reasonable performance that seems at least good enough to support human
decisions in curating media reviews. We also see the expected behavior in Fig. 1:
more data leads to better results. A consequence is that new or weakly populated
rubrics will always suffer from poor performance in terms of classification accuracy.
We also studied the results of the individual classifiers to understand the differences
in performance. However, besides the observation that more training data leads to
better results, we did not identify an obvious pattern that explains the differences
in performance.</p>
      <p>
        The results are an encouraging starting point that offers many ways to significantly
improve the performance and to increase the efficiency of the production process.
When it comes to the feature engineering, it seems attractive to integrate word
embeddings as in BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or Word2Vec [8]. They can be used to extend the feature vector
and should especially support classes with very few training instances. Also, we will
compare Naïve Bayes with Logistic Regression as the computational costs are
comparable. To further address the imbalance of the classes, sampling methods will be
evaluated further. Finally, we are also keen to get feedback from our internal production
teams. This will help us to understand the variance in performance for the different
categorization tasks. In addition, carefully observing the way the teams work will likely
bring up information on additional useful features such as phrases, named entities and
other meta data and will also help us to better understand how to best integrate
automated curation into the production process.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Joachims</surname>
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text categorization with Support Vector Machines: Learning with many relevant features</article-title>
          . In: Nédellec C.,
          <string-name>
            <surname>Rouveirol</surname>
            <given-names>C</given-names>
          </string-name>
          .
          (eds) Machine Learning:
          <source>ECML-98. ECML 1998. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence)</source>
          , vol
          <volume>1398</volume>
          . Springer, Berlin, Heidelberg. (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Devlin</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            <given-names>K</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Toutanova</surname>
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Yang</surname> <given-names>Z.</given-names></string-name>,
          <string-name><surname>Dai</surname> <given-names>Z.</given-names></string-name>,
          <string-name><surname>Yang</surname> <given-names>Y.</given-names></string-name>,
          <string-name><surname>Carbonell</surname> <given-names>J.G.</given-names></string-name>,
          <string-name><surname>Salakhutdinov</surname> <given-names>R.</given-names></string-name>,
          <string-name><surname>Le</surname> <given-names>Q.V.</given-names></string-name>:
          <article-title>XLNet: generalized autoregressive pretraining for language understanding</article-title>
          . https://arxiv.org/abs/1906.08237. (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Adhikari</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ram</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>J.:</given-names>
          </string-name>
          <article-title>DocBERT: BERT for Document Classification</article-title>
          . https://arxiv.org/abs/1904.08398. (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Partalas</surname>
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosmopoulos</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baskiotis</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artieres</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliouras</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaussier</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amini</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galinari</surname>
            <given-names>P.:</given-names>
          </string-name>
          <article-title>LSHTC: A Benchmark for Large-Scale Text Classification</article-title>
          . https://arxiv.org/abs/1503.08581 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Babbar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schölkopf</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Data scarcity, robustness and extreme multi-label classification</article-title>
          .
          <source>Machine Learning</source>
          , Volume
          <volume>108</volume>
          , Issue
          <issue>8-9</issue>
          , pp
          <fpage>1329</fpage>
          -
          <lpage>1351</lpage>
          . https://doi.org/10.1007/s10994-019-05791-5. (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Matsuo</surname> <given-names>Y.</given-names></string-name>,
          <string-name><surname>Ishizuka</surname> <given-names>M.</given-names></string-name>:
          <article-title>Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information</article-title>
          . In:
          <source>International Journal on Artificial Intelligence Tools</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ), pp
          <fpage>157</fpage>
          -
          <lpage>169</lpage>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Mikolov</surname> <given-names>T.</given-names></string-name>,
          <string-name><surname>Chen</surname> <given-names>K.</given-names></string-name>,
          <string-name><surname>Corrado</surname> <given-names>G.</given-names></string-name>,
          <string-name><surname>Dean</surname> <given-names>J.</given-names></string-name>:
          <article-title>Efficient estimation of word representations in vector space</article-title>
          . https://arxiv.org/abs/1301.3781. (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>