<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AI supported Topic Modeling using KNIME-Workflows</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jamal Al Qundus</string-name>
          <email>jamal.al.qundus@fokus.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvio Peikert</string-name>
          <email>silvio.peikert@fokus.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Paschke</string-name>
          <email>adrian.paschke@fokus.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer-Institut FOKUS</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Topic modeling algorithms traditionally model topics as lists of weighted terms. These topic models can be used effectively to classify texts or to support text mining tasks such as text summarization or fact extraction. The general procedure relies on statistical analysis of term frequencies. The focus of this work is on the implementation of knowledge-based topic modeling services in a KNIME workflow. A brief description and evaluation of the DBpedia-based enrichment approach and a comparative evaluation of enriched topic models are outlined based on our previous work. DBpedia Spotlight is used to identify entities in the input text, and information from DBpedia is used to extend these entities. We provide a workflow developed in KNIME implementing this approach and compare the results of topic modeling supported by knowledge base information to traditional LDA. This topic modeling approach allows semantic interpretation both by algorithms and by humans.</p>
      </abstract>
      <kwd-group>
        <kwd>Topic Modeling</kwd>
        <kwd>Workflow</kwd>
        <kwd>Text Enrichment</kwd>
        <kwd>Knowledge Base</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recent developments related to the Semantic Web have made knowledge from the web
available as machine-readable ontologies. Links and vocabulary mappings between public
ontologies enable algorithms to make use of knowledge from the web available as
linked open data. One of the most popular public knowledge repositories is DBpedia.
The DBpedia project extracts structured data from Wikipedia and makes it accessible
as a knowledge base via a SPARQL interface [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Topic modeling performs analysis on texts to identify topics. These topic models are
used to classify documents and to support further algorithms that perform
context-adaptive feature, fact and relation extraction.</p>
      <p>This work has been partially supported by the "Wachstumskern Qurator – Corporate Smart
Insights" project (03WKDA1F) funded by the German Federal Ministry of Education and
Research (BMBF). KNIME: https://www.knime.com/; DBpedia: https://wiki.dbpedia.org/;
DBpedia Spotlight: https://www.dbpedia-spotlight.org/</p>
      <p>Copyright © 2020 for this paper by its authors.</p>
      <p>Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        While Latent Dirichlet Allocation (LDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Pachinko Allocation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], or
Probabilistic Latent Semantic Analysis (PLSA) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] traditionally perform topic modeling by
statistical analysis of co-occurring words, the approaches in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] integrate
semantics into LDA.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] propose methods to improve word-based topic modeling approaches
by introducing semantics from knowledge bases. This reduces perplexity issues arising
from ambiguous terms and produces topic models that directly link to the knowledge
base. Topic models created using a knowledge base are easier to understand by humans
than topic models created exclusively by means of statistics.
      </p>
      <p>
        This proof of concept work applies the method from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to perform knowledge base
supported topic modeling using DBpedia. The presented approach to topic modeling is
based on the semantics of entities identified in the document. The basic idea of LDA to
perform analysis based on term frequency is maintained. The extension of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is to
enrich the input using a knowledge base so that LDA operates on semantics. To this end,
the DBpedia Spotlight API is used to recognize entities, and additional information about
these entities is retrieved via the DBpedia API endpoint. During a preprocessing stage the
text is tagged with semantic annotations from the knowledge base, and the tagged text
is used as input to the LDA algorithm. This results in improved topic models due to
more context and fewer ambiguities in the input.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Architecture</title>
      <p>The text to be examined is transferred to the DBpedia Spotlight API. Spotlight returns a
JSON object containing all entities recognized in the text. Additional information about
these entities is retrieved using the DBpedia API. The response for each entity is a set of
properties, e.g. tags, URI, type and hypernym (see Section 4 for details). A tagger
combines these sets with the corresponding entities in the text. The result is processed by
LDA, which performs topic modeling and provides the result in two formats: as a table
with weights and as an image visualization. The architecture of the processing pipeline
is illustrated in Fig. 1.</p>
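      <p>Outside KNIME, the first step of this pipeline can be sketched in Python. The endpoint URL and the text/confidence/support parameters follow the public DBpedia Spotlight REST API; the sample sentence and function name are our own, so this is a minimal sketch rather than the workflow's actual implementation.</p>
      <preformat><![CDATA[
```python
from urllib.parse import urlencode
from urllib.request import Request

# Public DBpedia Spotlight annotation endpoint (English model).
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def build_annotate_request(text, confidence=0.5, support=0):
    # confidence=0.5 and support=0 mirror the settings used in the
    # workflow (see Section 4) to retrieve as many entities as possible.
    query = urlencode({"text": text, "confidence": confidence, "support": support})
    # Requesting JSON yields the object with the recognized entities.
    return Request(SPOTLIGHT_URL + "?" + query, headers={"Accept": "application/json"})

req = build_annotate_request("Barack Obama is only passing through Germany.")
```
]]></preformat>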
    </sec>
    <sec id="sec-3">
      <title>KNIME</title>
      <p>
        The KNIME information miner is an open source modular platform for visualization
and selective execution of data pipelines. KNIME is a powerful data analysis tool that
enables simple integration of algorithms, data manipulation and visualization methods
in the form of modules or nodes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>There are a number of additional services in the KNIME ecosystem, e.g. KNIME
Server, which connects the different actors (services, teams and individuals) in a central
place and thus offers a platform for collaboration. KNIME Workflow Hub makes
workflows publicly available on the KNIME Examples Server. Members of the user
community can share workflows and receive ratings and comments from other users. In our
work with the KNIME analytical platform we have implemented and performed various
modelling methods to offer complete services around semantic analysis.
</p>
    </sec>
    <sec id="sec-4">
      <title>Workflow for Topic Modeling</title>
      <p>The workflow developed in this work consists of four stages: (1) reading the text under
consideration and recognizing entities using the DBpedia Spotlight API; (2) getting the
properties of the entities included in the JSON response; (3) tagging the text by combining the
entities with the related properties gained from the previous stage; (4) text cleaning and topic
modeling using the LDA algorithm. Fig. 2 gives an overview of the workflow
developed and its modularization into four stages.</p>
      <p>The Reading stage includes a Table Creator node, which provides the settings of the
parameters used to request entities from DBpedia Spotlight. We use confidence=0.5
and support=0 to get as many entities as possible from the text. A File Reader node
reads the text from a path. A String Manipulation node repairs the text, e.g. by replacing
double spaces. A Column Appender node combines the data provided by the Table
Creator and File Reader nodes. A further String Manipulation node prepares the URL request to
DBpedia Spotlight, which is then sent by a Get Request node. The output of this stage
is a table of the text entities recognized by DBpedia Spotlight, as shown in Fig. 3.</p>
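      <p>To illustrate what this stage produces, the following Python sketch flattens a Spotlight-style JSON response into rows of surface forms, URIs and types. The sample response is abbreviated and hand-written; real responses carry further fields per resource, such as @offset and @similarityScore.</p>
      <preformat><![CDATA[
```python
import json

# A trimmed, hand-written Spotlight-style response for illustration.
sample = json.loads("""
{"Resources": [
  {"@URI": "http://dbpedia.org/resource/Barack_Obama",
   "@surfaceForm": "Barack Obama",
   "@types": "DBpedia:Politician,DBpedia:Person"},
  {"@URI": "http://dbpedia.org/resource/Germany",
   "@surfaceForm": "Germany",
   "@types": "DBpedia:Country,DBpedia:Place"}
]}
""")

def entity_table(response):
    # Flatten the Resources array into (surfaceForm, URI, types) rows,
    # roughly what the String to JSON / JSON to Table nodes produce.
    rows = []
    for res in response.get("Resources", []):
        types = [t for t in res.get("@types", "").split(",") if t]
        rows.append((res["@surfaceForm"], res["@URI"], types))
    return rows
```
]]></preformat>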
      <p>The Get properties stage contains a Column Filter node to extract the entities
column from the table and a Java Snippet node to filter resources from the JSON. The String
to JSON, JSON to Table, Transpose and JSON to Table nodes put the column
content into the format required for further processing. A Column Filter node filters types
and surfaceForms. A Java Snippet node sends an HTTP request containing a SPARQL
query to the DBpedia API and retrieves entities and the related tags. A Missing Value node
deletes null values, and a Column Filter node filters surface forms (entities) and tags
from the created table, which form the output of this stage, as illustrated in Fig. 4.</p>
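      <p>The SPARQL query sent by the Java Snippet node is not printed in the paper; as a hypothetical reconstruction, a request for an entity's tags against the public DBpedia endpoint could look as follows. The choice of rdf:type and gold:hypernym is our assumption, matching the kinds of tags (type, hypernym) shown in the evaluation.</p>
      <preformat><![CDATA[
```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

def properties_query(entity_uri):
    # rdf:type and gold:hypernym are assumed properties; they cover the
    # type and hypernym tags mentioned in the architecture description.
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
        "SELECT DISTINCT ?tag WHERE { "
        f"{{ <{entity_uri}> rdf:type ?tag }} UNION "
        f"{{ <{entity_uri}> <http://purl.org/linguistics/gold/hypernym> ?tag }} "
        "}"
    )

def request_url(entity_uri):
    # The query would be sent as an HTTP GET with JSON results requested.
    params = urlencode({"query": properties_query(entity_uri),
                        "format": "application/json"})
    return DBPEDIA_SPARQL + "?" + params
```
]]></preformat>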
      <p>The tagging stage implements a loop that takes the original input text and the
recognized entities with their tags, matches these entities with their mentions in the
original text and enriches the text with the tags, as shown in Fig. 5. This loop consists of a Recursive
Loop Start node to begin the loop, a Row Filter node to get the rows one by one, a Row Table to
Variable node as a converter, a String Manipulation node as a tagger and a Recursive Loop End node to
return to the loop start in case there are still entries in the table. At the end of the loop
a String to Document node converts the text into a document format and forwards it
to the next stage.</p>
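      <p>The effect of this loop can be sketched as a single Python function: each recognized surface form in the running text is replaced by the mention followed by its tags, one entity per iteration. This is a simplified sketch of the idea, not KNIME's node logic, and it ignores e.g. overlapping mentions.</p>
      <preformat><![CDATA[
```python
def tag_text(text, entity_tags):
    # entity_tags maps a surface form to its list of knowledge-base tags.
    # Each iteration performs one string manipulation, mirroring the
    # recursive loop: mention -> "mention [tag1, tag2, ...]".
    for surface, tags in entity_tags.items():
        tagged = surface + " [" + ", ".join(tags) + "]"
        text = text.replace(surface, tagged)
    return text

enriched = tag_text(
    "Barack Obama is only passing through Germany.",
    {"Barack Obama": ["Barack_Obama", "Politician", "Agent"],
     "Germany": ["Germany", "Country", "Place"]},
)
```
]]></preformat>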
      <p>In the text cleaning and topic modeling stage, the produced document is cleaned
by Column Filter, Number Filter, Punctuation Erasure, Stop Word Filter, Case
Converter and Snowball Stemmer nodes. The preprocessed text is passed to a Topic Extractor node
that implements the LDA algorithm. LDA creates the topic model as a list of weighted
terms, which are then visualized using the Color Manager and Tag Cloud nodes.</p>
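      <p>The cleaning chain can be approximated in plain Python as below. The stop word list is a tiny stand-in for KNIME's built-in lists, and the stemming step (Snowball Stemmer) is omitted; only the number, punctuation, stop word and case steps are sketched.</p>
      <preformat><![CDATA[
```python
import string

# Tiny illustrative stop word list; KNIME's Stop Word Filter node ships
# much larger built-in lists.
STOP_WORDS = {"is", "only", "through", "the", "a", "and", "to", "his"}

# Keep underscores so DBpedia resource tags like Barack_Obama survive
# punctuation erasure.
PUNCT = str.maketrans("", "", string.punctuation.replace("_", ""))

def clean(document):
    # Number Filter, Punctuation Erasure, Stop Word Filter and Case
    # Converter in one pass; Snowball stemming is omitted here.
    tokens = document.translate(PUNCT).lower().split()
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]
```
]]></preformat>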
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>
        The focus of this work was on the implementation of the knowledge-based topic
modelling services in a KNIME workflow. For a detailed description and evaluation of
the DBpedia-based enrichment approach and the comparative assessment of enriched
topic models we refer to our earlier work in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In this paper we demonstrate
the KNIME-based proof-of-concept implementation by comparing the results of topic
modeling supported by knowledge base information to traditional LDA using the
following text:
      </p>
      <p>Barack Obama is only passing through Germany on his trip to Europe later this week and
does not plan to hold substantial talks with Angela Merkel. The White House views the
chancellor as difficult and Germany is increasingly being left out of the loop.</p>
      <p>This text is expanded with annotations from DBpedia as follows:</p>
      <p>Barack Obama [Barack_Obama, Politician, Agent, President, Person, Politician] is only
passing through Germany [Germany, Republic, Place, Country, Person, PopulatedPlace,
Location], on his trip to Europe [Europe, Continent, Location, PopulatedPlace, Place, Continent]
later this week and does not plan to hold substantial talks with Angela Merkel [Angela_Merkel,
Politician, Agent, Person, OfficeHolder]. The White House [White_House, Residence,
Location, Building, Place…</p>
      <p>A Column Filter node is commonly needed in these stages because most nodes include, in
addition to the result, their input, which is usually no longer required.</p>
      <p>The results reflect the expectations. LDA provides a naive topic model for the
original text comprising weighted lemmatized terms from the input text, with only one
term having a significantly higher weight than the other terms of the model. The knowledge
base supported method creates a superior topic model that also contains weighted
lemmatized terms from the knowledge base which are not present in the input text. This
topic model enables semantic interpretation by algorithms as well as by humans.</p>
      <p>In particular, the enriched topic model enables algorithms to infer from the topic
model linked to a knowledge base that the input text contains information about politics
and actions of relevant officials, while a classification based on the traditional LDA
topic model might result in a false classification as a geographical text.</p>
    </sec>
    <sec id="sec-6">
      <title>Summary and Future Prospects</title>
      <p>This proof-of-concept work developed a KNIME workflow to perform
comprehensive topic modeling using a knowledge base. The use of information from a knowledge
base is achieved by using the DBpedia Spotlight API for entity recognition and the DBpedia
API to retrieve entity properties. The presented results show that the developed
approach is applicable and delivers results containing more comprehensive insights into
a text than statistical topic models based on words only. The created topic models can
improve the results of various methods used for text mining tasks such as text
classification or fact and relation extraction.</p>
      <p>Topic modeling using knowledge bases is a step towards improved automated
methods for knowledge base population. Other methods in natural language processing
might also be extendable by applying the idea of annotating text with information from
knowledge bases. We expect improved results over word-based approaches for these
tasks in future work, especially when analyzing small corpora.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Allahyari</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kochut</surname>
          </string-name>
          , '
          <article-title>Automatic topic labeling using ontology-based topic models'</article-title>
          ,
          <source>in 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          , '
          <article-title>Latent dirichlet allocation'</article-title>
          ,
          <source>J. Mach. Learn. Res.</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>Jan</issue>
          , pp.
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          , '
          <article-title>Pachinko allocation: DAG-structured mixture models of topic correlations'</article-title>
          ,
          <source>in Proceedings of the 23rd international conference on Machine learning</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>577</fpage>
          -
          <lpage>584</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          , '
          <article-title>Probabilistic latent semantic analysis'</article-title>
          ,
          <source>in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence</source>
          ,
          <source>1999</source>
          , pp.
          <fpage>289</fpage>
          -
          <lpage>296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Hulpus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karnstedt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          , '
          <article-title>Unsupervised graph-based topic labelling using dbpedia'</article-title>
          ,
          <source>in Proceedings of the sixth ACM international conference on Web search and data mining</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>465</fpage>
          -
          <lpage>474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Todor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lukasiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Athan</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Paschke</surname>
          </string-name>
          , '
          <article-title>Enriching topic models with DBpedia'</article-title>
          ,
          <source>in OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>735</fpage>
          -
          <lpage>751</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Berthold</surname>
          </string-name>
          et al., '
          <article-title>KNIME - the Konstanz information miner: version 2.0 and beyond'</article-title>
          ,
          <source>ACM SIGKDD Explor. Newsl.</source>
          , vol.
          <volume>11</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lukasiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Todor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Paschke</surname>
          </string-name>
          , '
          <article-title>Human Perception of Enriched Topic Models'</article-title>
          ,
          <source>in Business Information Systems - 21st International Conference, BIS 2018</source>
          , Berlin, Germany,
          <year>2018</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>