<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Word Embedding Based Extension of Text Categorization Topic Taxonomies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tobias Eljasik-Swoboda</string-name>
          <email>Tobias.Swoboda@fernuni-hagen.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felix Engel</string-name>
          <email>Felix.Engel@ftk.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Kaufmann</string-name>
          <email>m.kaufmann@hslu.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Hemmje</string-name>
          <email>Matthias.Hemmje@ftk.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FTK e.V. Forschungsinstitut für Telekommunikation und Kooperation</institution>
          ,
          <addr-line>Martin-Schmeißer-Weg 4, 44227 Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lucerne University of Applied Sciences and Arts</institution>
          ,
          <addr-line>Technikumstrasse 21, 6048 Horw</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Hagen</institution>
          ,
          <addr-line>Universitätsstraße 47, 58084 Hagen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>15</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>Collaborative interdisciplinary research faces the added difficulty that researchers from different fields have different backgrounds and employ heterogeneous technical vocabularies. A problem may already have been solved in one field, but the solution is described in a way that makes it difficult for researchers from another field to understand, let alone to know the correct terms to search for. Text categorization (TC) is the act of automatically placing text into content-based categories. These categories can be interrelated, forming hierarchical taxonomies of knowledge. Unlike classic query-based information retrieval (IR), TC-based IR allows topics to be explored without prior knowledge about them by inspecting the individual topics and related documents within the taxonomies. TC also plays a major role in argumentation mining (AM), the automated extraction of arguments from large quantities of text. In AM, TC is used to identify argument structures within analyzed texts. Another potential use for TC in AM is the restriction of data sources to relevant topics, because AM in overly large text corpora can be prohibitively time-consuming. As mankind's knowledge constantly expands, the taxonomies organizing this knowledge must expand as well. We propose a method to aid in extending existing topic taxonomies by using word embeddings. These extended topic taxonomies can then be used to categorize texts and to filter argument-extraction sources. We additionally outline an alternative usage of these techniques in argumentation mining.</p>
      </abstract>
      <kwd-group>
        <kwd>Taxonomies</kwd>
        <kwd>word embedding</kwd>
        <kwd>text categorization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction and Motivation</title>
      <p>
        The main goal of argumentation mining (AM) is to automatically extract arguments
from generic texts to provide structured data for computational models of argument
and reasoning engines. To accomplish this goal, argumentation models are used.
These models form parts of individual arguments
        <xref ref-type="bibr" rid="ref11">(Lippi and Torroni, 2016)</xref>
        . According to
Habernal and Gurevych (2015), the prevailing model of arguments in AM is that of a
discourse structure consisting of several argument components, such as premises and
claims. Text categorization (TC) is the act of automatically assigning texts of arbitrary
length to a predefined set of categories
        <xref ref-type="bibr" rid="ref16">(Sebastiani, 2002)</xref>
        . When modeling sentences
within the mined text corpora as texts and argument components—using, for example,
premises and claims as categories—TC is the foundation for a plethora of AM
systems
        <xref ref-type="bibr" rid="ref13 ref13 ref15 ref17 ref6 ref6">(Mochales and Moens, 2011; Feng and Hirst, 2011; Rooney et al., 2012; Stab
and Gurevych, 2014)</xref>
        .
      </p>
      <p>
        ArgumenText is a practical implementation of an AM engine
        <xref ref-type="bibr" rid="ref18">(Stab et al., 2018)</xref>
        . It
employs a two-step mechanism in which a large collection of documents
(http://commoncrawl.org/, in Stab et al.’s experiment with 683 GiB) is first indexed
into an information retrieval (IR) engine. The user can then query the engine using
search terms. The resulting subset of documents is subsequently mined for arguments
(see Figure 1). This is done in order to reduce computation time because AM on this
scale takes too much time with access only to ordinary hardware. In order to query the
engine, the user must know the exact search terms to be used.
      </p>
      <p>
        Having a taxonomy of topics could allow the browsing of different facets of topics
without prior knowledge about their exact structure and common sub-topics. This
way, TC could be used as an alternative to the regular querying-based IR engine,
allowing browsing-based topic exploration. Such taxonomies also directly benefit
collaborative interdisciplinary research. Our research originates from the RecomRatio
project. The goal of RecomRatio is to provide medical professionals with treatment
recommendations that were extracted from current medical literature, arguments for
or against these treatments, and the analyzed medical literature itself. Therefore our
experiments have a strong medical focus. Before TC can be performed, one needs a
set of categories, C. This is obviously given for argument structures but could be
lacking when one models a topic taxonomy for exploration. The aim of our work is to
help in the creation of such a topic taxonomy by suggesting extensions to an existing
proto-taxonomy (see Figure 2). Because TC usually works in a supervised-learning
fashion, one also requires example text-to-category assignments. This need has been
remediated in newer unsupervised TC techniques
        <xref ref-type="bibr" rid="ref14 ref4 ref5">(Dai et al., 2017; Eljasik-Swoboda
et al., 2018)</xref>
        . These techniques are based on word embeddings (see section 2).
Following Dai et al.’s and Eljasik-Swoboda et al.’s examples, we propose a method to
suggest taxonomy extensions for existing topic taxonomies using unsupervised machine
learning while processing no data other than a large collection of example texts and
an existing initial topic taxonomy. Given natural language texts about the topic and an
initial proto-taxonomy as input, our system will then suggest sub-topics for a given
topic in this taxonomy. For example, when analyzing texts about melanoma, the
system will suggest sub-topics for melanoma in a taxonomy tree that models different
diseases. This example taxonomy could model cancer as a family of diseases and
have melanoma as a sub-category of cancer.
      </p>
      <p>
        Even though plenty of resources are available for the medical domain, we limit
ourselves to texts and an initial taxonomy because such resources might not be available for
emerging cutting-edge topics, as medical researchers are likely to first describe them using natural
language before making them machine readable in any fashion or agreeing upon their
technical vocabulary definitions
        <xref ref-type="bibr" rid="ref14">(Nawroth et al., 2018)</xref>
        . This makes our approach
transferable to any natural language and uniquely suited for emerging knowledge
domains because, during the adoption phase of TC for any application, additional—
especially manually compiled—information resources are difficult to obtain. Our
contribution is two-fold. First, we propose a novel unsupervised method to help in the
introduction of TC as an IR method for AM or any other application by proposing
new categories. Second, we analyze and discuss what influences the effectiveness of
our system—such as, for example, the utilized word embedding algorithm.
      </p>
    </sec>
    <sec id="sec-2">
      <title>State of the Art and Related Work</title>
      <p>
        In order to model any topic relationships, one needs a way to model the semantics of
individual terms. Ontologies are manually created encodings of semantics. They are
commonly used in all types of natural language processing applications. Given the
high amount of work put into developing ontologies, the ontologies are very precise
in capturing the semantic understanding of their creators
        <xref ref-type="bibr" rid="ref2">(Busse et al., 2015)</xref>
        . Their
drawback is that they need to be manually created. Blei et al. (2003) proposed another
fundamental approach to capturing semantics for words. Their latent Dirichlet
allocation (LDA) statistically splits documents into topic distributions and divides topics
into term distributions. Each term is assigned a topic vector that comprises its
probability of being part of each topic. LDA is also referred to as topic modeling. Here,
terms are regarded as similar if they occur in the same document. Camiña (2010)
described multiple methods to generate taxonomies based on term similarities in LDA
topic distributions. The same is true for Kashyap et al.’s TaxaMiner experimentation
framework for automated taxonomy bootstrapping. Even though these methods are
appealing, we pursue a different goal in trying to extend an existing proto-taxonomy
instead of starting from scratch.
      </p>
      <p>Another intriguing approach lies in word embeddings. These are unsupervised
learning methods that can capture semantic relatedness by analyzing large texts or
concatenations of multiple smaller texts. Word2Vec is a prominent implementation of
word embeddings that Mikolov et al. (2013) developed. Word2Vec consists of two
algorithms, continuous bag of words (CBOW) and skip-gram. Both produce
high-dimensional coordinates for every word and operate by optimizing the cosine
similarity between word vectors. In CBOW, the similarity of terms that are surrounded by the
same context terms is maximized. In skip-gram, the similarity of the context terms
surrounding the same central terms is optimized. Words are considered to surround a
term if they are in a context window of n words before or after the term. Using this
pattern, semantic relatedness becomes encoded by similar offsets that capture multiple
dimensions of meaning. To the best of our knowledge, this has not been observed in
LDA-based term vectors. A reason for that can be the higher granularity of word
embeddings regarding what terms are in the other words’ contexts.</p>
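      <p>As a minimal sketch of the context-window notion described above, the following Python function lists the (term, context term) pairs that fall inside a symmetric window of n words. The function name context_pairs and the fixed window size are illustrative assumptions; they are not taken from the Word2Vec implementation.</p>
      <preformat><![CDATA[
```python
def context_pairs(tokens, window=5):
    """Return (term, context term) pairs for a symmetric context window.

    `window` is the number of words considered before and after each term.
    This is an illustrative helper, not part of Word2Vec itself.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a term is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical, already normalized example sentence.
example_pairs = context_pairs("melanoma is a type of skin cancer".split(), window=2)
```
]]></preformat>
      <p>CBOW and skip-gram both train on such pairs and differ in whether the central term or its context terms are being predicted.</p>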
      <p>Habernal and Gurevych (2015) utilized this in the context of AM by creating
clusters of terms commonly used in arguments in order to support the annotation of
arguments within text. Fu et al. (2014) also used word embeddings to extract
hypernym/hyponym relationships between terms in order to create an ontology. Their
experiments suggest that a simple hypernym/hyponym vector offset does not exist;
rather, one offset exists per class of terms. For example:</p>
      <p>v(shrimp) - v(prawn) ≈ v(fish) - v(goldfish) and v(laborer) - v(carpenter) ≈ v(actor)
- v(clown) but v(laborer) - v(carpenter) ≉ v(fish) - v(goldfish).</p>
      <p>
        Our objective is similar to that of Fu et al. (2014). Instead of extracting
hypernym/hyponym relationships between terms, we attempt to extend topic taxonomies
with sub-categories. These sub-categories are not necessarily hyponyms, as they
could also cover certain aspects of their parent categories. As we limit ourselves to
only the existing text and initial taxonomies, word embeddings are an optimal
foundation for our method. As previously mentioned, our topical focus is in the medical
domain. A cornerstone of medical literature is PubMed, the National Institutes of
Health’s U.S. National Library of Medicine database
        <xref ref-type="bibr" rid="ref19">(U.S. National Library of
Medicine, 1996)</xref>
        . PubMed includes a querying-based search engine and abstracts for most
indexed articles. The articles themselves are stored elsewhere, with their references
and DOIs available in PubMed. Additionally, articles are annotated with Medical
Subject Headings (MeSH)
        <xref ref-type="bibr" rid="ref20">(U.S. National Library of Medicine, 1999)</xref>
        . MeSH is
updated annually, currently defines 28,378 medical topics, and organizes these topics
58,025 times in 16 topical taxonomies such as anatomy and diseases. In these
taxonomies, every topic has one or multiple paths from the taxonomy root to its entry. Even
though these taxonomies form directed acyclic graphs (DAGs), some topics are listed
multiple times in the same taxonomy.
      </p>
      <p>Kaufmann et al. (2017) created the big-data management canvas (BDMC). The
fundamental insight is that the aim of any big-data project is the effectuation: the
creation of a benefit through the analysis of big amounts of data. The same is true for
any data- or text-mining endeavor. In order to not lose sight of this, the BDMC
planning method splits endeavors into five main fields of activities. These fields form a
loop of activities going from the datafication (which is the capturing of data for later
analysis) to the said effectuation. These are further split into a business aspect and a
technology aspect. The business aspect describes and plans what should be done
whereas the technology aspect describes and plans how it should be implemented. We
used this method during the planning and modeling of our system.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Model and Implementation</title>
      <p>As mentioned, we organized this research using the BDMC that Kaufmann et al.
(2017) proposed. The following sub-sections reflect the BDMC’s fields of activity.
This illustrates the workflow we propose for the extension of topic taxonomies. We
named our system Taxonomy Extension system for Emerging Knowledge (TEEK), as
its primary task is to capture emerging topics for usage in TC. We used the BDMC to
structure the creation of our prototype as well as the performed evaluation
experiments.
      </p>
      <sec id="sec-3-1">
        <title>Datafication</title>
        <p>The BDMC defines datafication as the act of transforming real-world events and
properties into usable data. It also closes the loop to the effectuation, as every
effectuation influences the world we live in and hence creates new data to capture. In our
envisioned application, the relevance feedback provided by the domain experts
curating the taxonomy is the datafication of this endeavor. If a domain expert agrees with
the system and adds a category to the system, the available taxonomy changes. The
datafication of our experiments is performed with the evaluation of the proposed
categories as described in section 3.4.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Data Integration</title>
        <p>The BDMC field of data integration describes which data is used, how it is obtained,
and how it is centrally managed and stored. As mentioned before, our system works
on taxonomies and text files about a given knowledge domain. We performed
experiments for the medical terms neoplasms (cancers/tumors), melanoma, leukemia,
Herpesviridae, and Simplexvirus. Each of these terms has one or multiple entries in a
MeSH taxonomy. Melanoma, leukemia, and neoplasms are part of the diseases
taxonomy whereas Simplexvirus and Herpesviridae are part of the anatomy taxonomy.
Simplexvirus is a descendant of Herpesviridae whereas melanoma and leukemia are
descendants of neoplasms within MeSH.</p>
        <p>The finished system will have access to a multitude of documents from which it
can learn the relationships between terms in order to propose new topics. For our
prototype, we simulate this using PubMed. We queried PubMed for each of the
above-mentioned terms and used the export-to-XML function in order to download all
metadata and abstracts for a given topic. Word2Vec requires lowercase texts without
special characters. Because the resulting XML files were up to 31.01 GiB in size
(neoplasms), we implemented a buffered XML parser in Java that extracts the abstracts
from all articles in the individual result sets and stores them into simple text files,
removing all special characters. For easy integration in multiple applications, we
packaged the original C implementation of Word2Vec
(https://github.com/tmikolov/word2vec) into a Docker container, which we used to
run CBOW and skip-gram on these extracted text files. This means that we have two
word embedding files for each PubMed search term to experiment on. We
parameterized them to have 200 dimensions and use a five-word (before and after) context
window.
        </p>
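        <p>The abstract extraction and normalization described above can be sketched as follows. The original parser was written in Java; this Python version is only an illustration. The element names AbstractText and PubmedArticle, the function names, and the exact character filter are assumptions about the export format and the cleaning rules.</p>
        <preformat><![CDATA[
```python
import io
import re
import xml.etree.ElementTree as ET


def normalize(text):
    """Lowercase the text and replace runs of non-alphanumeric characters by one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def iter_abstracts(source, abstract_tag="AbstractText", record_tag="PubmedArticle"):
    """Stream normalized abstracts from a large XML export.

    The file is parsed incrementally; each record is cleared once it has
    been processed so that memory use stays bounded.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == abstract_tag:
            text = "".join(elem.itertext())
            if text.strip():
                yield normalize(text)
        elif elem.tag == record_tag:
            elem.clear()  # drop the finished record from memory


# Tiny in-memory stand-in for a multi-gigabyte export file.
sample = io.BytesIO(
    b"<PubmedArticleSet><PubmedArticle><Abstract>"
    b"<AbstractText>Melanoma (B-RAF) is <i>studied</i> here.</AbstractText>"
    b"</Abstract></PubmedArticle></PubmedArticleSet>"
)
abstracts = list(iter_abstracts(sample))
```
]]></preformat>
        <p>Writing one normalized abstract per line into a text file would yield an input of the kind that the Word2Vec container consumes.</p>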
      </sec>
      <sec id="sec-3-3">
        <title>Data Analytics</title>
        <p>The data analytics field describes how the available data is analyzed. As previously
stated, the available data is a set of word embeddings and an initial taxonomy. Every
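taxonomy <inline-formula><tex-math>T = \{C, L, E\}</tex-math></inline-formula> has a set of categories, C, and a set of labels for each
category, L. Additionally, the set of edges, E, between the categories forms a DAG
with <inline-formula><tex-math>r \in C</tex-math></inline-formula> at its root. The labels consist of one or multiple words.</p>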
        <p>Word embeddings are high-dimensional vectors for all terms that the algorithm
encounters during training. We denote them as v(word). This way, a word embedding
vector can represent every category with single-word labels. If the label of one category
consists of multiple words, we compute its vector representation by calculating the
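arithmetic mean of all the individual word vectors. Because T forms a DAG, every
<inline-formula><tex-math>c \in C</tex-math></inline-formula> has a path <inline-formula><tex-math>c_0, \ldots, c_n</tex-math></inline-formula>, where <inline-formula><tex-math>c_0 = r</tex-math></inline-formula> is the root and <inline-formula><tex-math>c_n = c</tex-math></inline-formula>. With the word embeddings, every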
node has a representation in vector space. The rest of the taxonomy is ignored (see
Figure 3).</p>
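        <p>A minimal sketch of this vector representation, assuming the embeddings are available as a dictionary from lowercase words to lists of floats. The names label_vector and path_vectors are hypothetical, and skipping words without an embedding is an assumption that the text does not specify.</p>
        <preformat><![CDATA[
```python
def label_vector(label, embeddings):
    """Arithmetic mean of the word vectors of all words in a category label."""
    # Assumption: words that never occurred during training are skipped.
    words = [w for w in label.lower().split() if w in embeddings]
    if not words:
        raise KeyError(f"no embedding for any word of label {label!r}")
    dims = len(embeddings[words[0]])
    return [sum(embeddings[w][d] for w in words) / len(words) for d in range(dims)]


def path_vectors(path_labels, embeddings):
    """Vector for every category on the path, ordered from the root to the category."""
    return [label_vector(label, embeddings) for label in path_labels]


# Toy two-dimensional embeddings, for illustration only.
toy_embeddings = {"skin": [1.0, 0.0], "neoplasms": [0.0, 1.0]}
toy_path = path_vectors(["Neoplasms", "Skin Neoplasms"], toy_embeddings)
```
]]></preformat>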
        <p>The task of suggesting new sub-categories for c is essentially that of extending the
path to <inline-formula><tex-math>c_0, \ldots, c_n, c_{n+1}</tex-math></inline-formula>. Our approach for TEEK is to compute the most-likely next
category label vector by using the information provided by the existing path <inline-formula><tex-math>c_0, \ldots, c_n</tex-math></inline-formula>. Once
this vector is computed, the 10 closest terms to this next vector in word embedding
space are calculated using cosine similarity. It is noteworthy that the closest term in all
our experiments is the label of c. The system therefore creates nine suggestions per
category.</p>
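        <p>The neighbor lookup can be sketched as a brute-force cosine-similarity ranking. The function names, the toy vectors, and the way the category's own label is dropped are assumptions made for illustration; the prototype's Java code may organize this differently.</p>
        <preformat><![CDATA[
```python
import math


def cosine(a, b):
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_terms(target, embeddings, k=10):
    """The k vocabulary terms whose vectors are most similar to `target`."""
    ranked = sorted(embeddings, key=lambda w: cosine(target, embeddings[w]), reverse=True)
    return ranked[:k]


def suggest(target, embeddings, own_label, k=10):
    """Closest terms without the category's own label."""
    return [w for w in closest_terms(target, embeddings, k) if w != own_label]


# Toy two-dimensional vocabulary, for illustration only.
toy_vocab = {
    "melanoma": [1.0, 0.0],
    "melanomas": [0.9, 0.1],
    "lentiginous": [0.8, 0.3],
    "virus": [0.0, 1.0],
}
toy_suggestions = suggest([1.0, 0.05], toy_vocab, own_label="melanoma", k=3)
```
]]></preformat>
        <p>With k = 10 and the category label among the results, this leaves nine suggestions, as described above.</p>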
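        <p>We experimented with two possible variations for this task. The first variation begins
with the computation of the offsets between consecutive categories on the path:
<disp-formula><tex-math>o_i = v(c_i) - v(c_{i-1}), \quad i = 1, \ldots, n</tex-math></disp-formula>
In the next step, it adds the average offset between all categories
on the path to the vector of <inline-formula><tex-math>c_n</tex-math></inline-formula>:
<disp-formula><tex-math>v(c_{n+1}) = v(c_n) + \frac{1}{n} \sum_{i=1}^{n} o_i</tex-math></disp-formula></p>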
        <p>This equation essentially adds the arithmetic mean of the individual offsets to the last
vector. We therefore refer to it as arithmetic vector-stream predictor (AVSP). Figure 4
portrays this approach in an example two-dimensional word embedding space.</p>
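        <p>A minimal sketch of the AVSP step, assuming the path is given as a list of equally long vectors ordered from the taxonomy root to the category to be extended. The function name avsp_next is hypothetical.</p>
        <preformat><![CDATA[
```python
def avsp_next(path):
    """Predict the next vector by adding the mean offset to the last path vector.

    `path` holds the vectors of the categories from the root to the
    category to extend; at least two vectors are required.
    """
    if len(path) < 2:
        raise ValueError("the path needs at least two categories")
    dims = len(path[0])
    # Offsets between consecutive categories on the path.
    offsets = [[b[d] - a[d] for d in range(dims)] for a, b in zip(path, path[1:])]
    mean_offset = [sum(o[d] for o in offsets) / len(offsets) for d in range(dims)]
    return [path[-1][d] + mean_offset[d] for d in range(dims)]


# Toy path in a two-dimensional embedding space.
toy_prediction = avsp_next([[0, 0], [1, 2], [4, 2]])
```
]]></preformat>
        <p>Because the offsets telescope, their mean equals the difference between the last and the first path vector divided by the number of steps, so only the two end points of the path influence the AVSP direction.</p>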
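        <p>The second variation applies linear regression to the problem of finding the
hyperplane closest to all <inline-formula><tex-math>v(c_j)</tex-math></inline-formula>. Using the following equations, this hyperplane is expressed as a
function of the path index <inline-formula><tex-math>j = 0, \ldots, n</tex-math></inline-formula>:
<disp-formula><tex-math>\bar{j} = \frac{1}{n+1} \sum_{j=0}^{n} j, \qquad \bar{v} = \frac{1}{n+1} \sum_{j=0}^{n} v(c_j)</tex-math></disp-formula>
<disp-formula><tex-math>\beta = \frac{\sum_{j=0}^{n} (j - \bar{j}) \cdot (v(c_j) - \bar{v})}{\sum_{j=0}^{n} (j - \bar{j})^2}</tex-math></disp-formula>
<disp-formula><tex-math>\alpha = \bar{v} - \beta \cdot \bar{j}</tex-math></disp-formula></p>
        <p>This way, the next word embedding vector is found through the following equation:
<disp-formula><tex-math>v(c_{n+1}) = \alpha + \beta \cdot (n + 1)</tex-math></disp-formula></p>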
        <p>Because this is standard linear regression applied to vectors instead of scalars, we
named our second approach regression vector-stream predictor (RVSP). Figure 4
portrays this using two-dimensional word embedding spaces. The word embedding space is
shown for each term with the distance from the root of the taxonomy forming an
additional dimension. The dark-red spheres represent the individual categories, the planes
symbolize the word embeddings that the categories are in, and the line represents the
hyperplane closest to all these points. The white sphere represents the suggested
categories, as it extends the taxonomy-depth dimension by one.</p>
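        <p>A minimal sketch of the RVSP step as ordinary least squares, fitted independently for every embedding dimension against the path index and evaluated one step beyond the end of the path. The function name rvsp_next and the zero-based indexing of the path are assumptions.</p>
        <preformat><![CDATA[
```python
def rvsp_next(path):
    """Extrapolate the next vector with per-dimension linear regression.

    `path` holds the vectors of the categories from the root (index 0) to
    the category to extend; the prediction is made for index len(path).
    """
    count = len(path)
    if count < 2:
        raise ValueError("the path needs at least two categories")
    dims = len(path[0])
    indices = range(count)
    j_mean = sum(indices) / count
    denominator = sum((j - j_mean) ** 2 for j in indices)
    prediction = []
    for d in range(dims):
        v_mean = sum(v[d] for v in path) / count
        # Slope and intercept of the least-squares line for dimension d.
        beta = sum((j - j_mean) * (v[d] - v_mean) for j, v in zip(indices, path)) / denominator
        alpha = v_mean - beta * j_mean
        prediction.append(alpha + beta * count)
    return prediction


# Toy path whose points lie exactly on a line.
toy_regression = rvsp_next([[0, 1], [1, 3], [2, 5]])
```
]]></preformat>
        <p>Unlike the mean offset, the regression slope also depends on the intermediate path vectors, which is one way the two predictors can diverge on longer paths.</p>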
        <p>Our prototype implements these two methods in Java after reading word
embeddings as text files and the taxonomy as an XML file from the file system. Both
approaches essentially create a direction in the word embedding space that reflects the
direction of the path from the taxonomy root to the individual topic category. This
captures Fu et al.’s (2014) finding regarding the lack of a common hyponymy
direction between different classes of terms. Here we use the available information
provided by the taxonomy structure to discern different dimensions appropriate for each
topic category.
        </p>
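        <p>A minimal sketch of reading word embeddings from a text file, assuming the plain-text output layout of the Word2Vec C tool: a header line with vocabulary size and dimensionality, followed by one line per word with its vector components. The function name load_embeddings is hypothetical, and the prototype's Java reader may differ.</p>
        <preformat><![CDATA[
```python
def load_embeddings(lines):
    """Parse text-format word embeddings from an iterable of lines."""
    iterator = iter(lines)
    _vocab_size, dims = (int(x) for x in next(iterator).split())
    embeddings = {}
    for line in iterator:
        parts = line.split()
        if len(parts) != dims + 1:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# In practice `lines` would be an open file; a list stands in for it here.
toy_file = ["2 3\n", "melanoma 0.1 0.2 0.3 \n", "leukemia 1 0 -1 \n"]
toy_loaded = load_embeddings(toy_file)
```
]]></preformat>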
      </sec>
      <sec id="sec-3-4">
        <title>Data Interaction and Data Effectuation</title>
        <p>The data interaction field describes how users interact with the data in order to benefit
from the data effectuation. In our case, a domain expert can review the category
suggestions before accepting them for usage in the topic taxonomy. After a suggested
topic is accepted, a TC algorithm can assign content to this category. This review and
acceptance component will be implemented through a Web interface that will show the
user the top nine suggestions for extending a given taxonomy. This forms a type of
relevance feedback for the suggested categories. For our prototype, the results are stored in
a Microsoft Excel file to ease their review and validation by the medical professionals
that support us in this project (see section 3.2). The goal of our proposed system is the
automatic suggestion of additional topic categories in order to extend an existing
taxonomy. A TC-oriented IR system subsequently uses these categories to allow
uninformed IR. This uninformed IR can then be used to narrow down the source material
used for AM. This narrowing down is crucial for performing AM in a timely fashion.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Result Interpretation</title>
      <p>The purpose of our evaluation is to discuss the usefulness of suggested sub-categories
for the given topic. Because our method can be parameterized differently, we can
investigate the effect of the selected parameters on the results. As mentioned before,
we use MeSH and PubMed as data sources for our experiments. The assessed topics
are not leaves in MeSH but rather are inner nodes that already have a set of sub-topics
in MeSH. The already existing sub-topics are hidden from our algorithm. This allows
for four types of true positive (TP) results for our system. The first type comprises suggested
sub-categories that are actual existing sub-categories of the investigated topic. The second
comprises suggested categories that are not already sub-categories in MeSH but would make sense as
sub-categories according to publicly available medical sources. Examples of the second
type are myeloblastic as a sub-class of leukemia and lentiginous melanoma.
Myeloblastic leukemia and lentiginous melanoma are types of their diseases that have been
described in the literature but are not modeled as sub-categories in MeSH. We published
all detailed results, including references for potentially meaningful sub-categories, at:
https://github.com/SirTobiSwobi/TEEKeval . As a third type, we regard different spellings of the
category name as correct sub-category suggestions and therefore as TP because our
system correctly interpreted them as types of the category. Our system correctly
captured that a misspelling of leukemia (like leukeamia) must be some kind of leukemia
because experts wrote about it in the same way. We interpret the plural of a category as the
fourth type of TP result because the system correctly captured that sub-categories are
different types of the original category. For example, it recognized that the
sub-categories of melanoma are (different types of) melanomas. Although plural and
alternative-spelling results are not directly helpful in extending an existing taxonomy, they
aid in comparing the effectiveness of different parameters and approaches. This
allows us to compare which word embedding algorithm and extrapolation method
produce better results. Additional insights can be gained by using different source
material for the word embeddings. Leukemia and melanoma are descendants of neoplasms.
Therefore, we can compare the performance of word embeddings generated from
the PubMed abstracts retrieved with the more specific search term with that of embeddings
generated from the larger number of abstracts retrieved with the more general search term.
The same is true for Herpesviridae and Simplexvirus.</p>
      <p>
        Although melanoma occurs three times in 2018’s MeSH, all other examined
medical terms have only one entry. This allows another investigation about how the path
length influences the performance of the system. Of these three entries, two entries
are six steps removed from the taxonomy root whereas one entry is only four steps
removed. The effectiveness of IR systems is usually measured in precision and recall
        <xref ref-type="bibr" rid="ref16">(Sebastiani, 2002)</xref>
        . These measures are not directly applicable because we do not perform
information retrieval but attempt to extend topic taxonomies. To compare different word
embedding algorithms, source material, and path lengths for individual terms, we use a
modified version of precision. Results that are on the path between the root and the
term, other related terms, and completely unrelated terms are treated as
false positives (FP). Because we know the existing proto-taxonomy, results on the
path or other related terms could be filtered from the result set so that the system
instead outputs the next closest term in word embedding space. We decided not to do
this for the sake of comparing different parameters. Precision is the share of TP among
all suggestions, TP / (TP + FP), with 1.0 meaning only TP and 0.0 meaning only FP. For the recall measure,
one needs to know all possible correct sub-terms, which nobody in our team
did. Therefore, we only measure precision for our system. Table 1 contains the
precision values for all our performed experiments.
      </p>
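      <p>The modified precision can be sketched as follows. The judgement labels mirror the TP and FP types described above, but their spelling and the function name precision are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
# Hypothetical labels for the four TP types and the three FP types.
TP_KINDS = {"existing_subcategory", "plausible_subcategory", "spelling_variant", "plural"}
FP_KINDS = {"on_path", "other_related_term", "unrelated"}


def precision(judgements):
    """Share of true positives among all judged suggestions."""
    tp = sum(1 for j in judgements if j in TP_KINDS)
    fp = sum(1 for j in judgements if j in FP_KINDS)
    if tp + fp == 0:
        raise ValueError("no judged suggestions")
    return tp / (tp + fp)


# Nine judged suggestions for one hypothetical experiment.
toy_judgements = ["plural", "spelling_variant", "existing_subcategory"] * 2 + [
    "plausible_subcategory",
    "plausible_subcategory",
    "on_path",
]
toy_precision = precision(toy_judgements)
```
]]></preformat>
      <p>With nine suggestions per experiment, eight TP and one FP give a precision of 8/9, or roughly 0.89.</p>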
      <p>Table 1. Precision values for all performed experiments. Rows (Term): Neoplasms; Melanoma 1; Melanoma 2; Melanoma 3; Melanoma 1; Melanoma 2; Melanoma 3; Leukemia; Leukemia; Herpesviridae; Simplexvirus; Simplexvirus.</p>
      <p>On average, the AVSP (37%) and the RVSP (35%) performed almost equally well.
The RVSP found no TP in 6 out of 24 experiments. This only happened to the AVSP
in one experiment. The RVSP is almost on par, because it delivered better results in
other experiments. When comparing the average precision of the CBOW (38%) and
skip-gram (35%) word embedding algorithms, both delivered comparable results no
matter the extrapolation method. Although CBOW delivered slightly better results on
average, skip-gram produced the best single result (89%).</p>
      <p>Training the word embeddings on larger (high-level) text collections (e.g.,
neoplasms instead of melanoma) had the biggest impact on performance. Low-level
representations only yielded an average effectiveness of 22% and account for almost all cases
in which no TP were found, while high-level representations had an average
effectiveness of 47%. Metaphorically speaking, this means that the more the system
“knows”, the better it is at generating sub-category suggestions. Another influencing
factor is the depth of the investigated term within the taxonomy. With depth 1, the
average precision was 44%, with depth 3 60%, and with depth 4 54%. With depth 5, only
11% precision was achieved, while depth 6 generated 24% on average. This means
that with a depth greater than 4, less than half of the precision achieved with a depth below 4 was achievable.
We see two reasons for this behavior: The deeper a category is in a taxonomy, the
more intermediate steps the system can use for extrapolation of sub-categories. On the
other hand, the more specialized a topic is, the less likely it is to have many sub-topics.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and potential AM usage</title>
      <p>Our work shows a new way to extend existing topic taxonomies by using no other
information than texts about the knowledge domain and an initial topic taxonomy.
Therefore, domain experts do not need to perform any manual effort besides
accepting suggested categories. As initially explained, text categorization using topic
taxonomies supports uninformed information retrieval in any application and can
specifically be applied for narrowing down documents before AM. To do so, TC systems
require appropriate topic taxonomies. Our prototype aids in creating these taxonomies
with only little labeled data. This directly benefits ArgumenText, RecomRatio, and
other AM systems. Besides this filtering application, another potential utilization for
this approach in AM is the detection of pros and cons for individual terms. To utilize
this system as such, one could construct a taxonomy of topics or simply use an
existing one such as MeSH. Afterward, known pros are manually modeled as
sub-categories for the topic categories. For each new topic-pro-leaf, the average offset or
regression-based extrapolation vector can be computed. Adding this vector to <inline-formula><tex-math>v(c_{n+1})</tex-math></inline-formula>
instead of <inline-formula><tex-math>v(c_n)</tex-math></inline-formula> would allow a user to find other pros for the leaf topic. The same can
be done for cons. Using this technique, potential features for spotting pros and cons
for topics can be extracted. These can then be used in AM TC or by a user to
manually come up with arguments for or against something that are not already written down
in the texts mined for arguments. In addition to describing a new way to extend
taxonomies, we investigated how different parameters influence the approach. Although
CBOW slightly outperformed skip-gram on average, the latter achieved the best individual result.
Similarly, the AVSP slightly outperformed the RVSP on average. The RVSP, however,
had many more results without TPs. Upon investigation, we found that it delivered
many terms describing topics on the path from the root to the term in question. As
mentioned, these can easily be filtered in future work. The strongest influence on
performance comes from the texts that the word embeddings are based on. The more
text about more general topics is analyzed, the better the system performs. This means
that the hypothetical best results would come from word embeddings that are
generated through the use of all abstracts on PubMed. Due to resource constraints, we were
not able to practically test this. The taxonomy depth of the extended category also
plays an ambiguous role: The further the category is removed from the root, the more
intermediate steps can be used in the analysis. However, the more specific a topic is,
the less likely it is to have more sub-topics.</p>
      <sec id="sec-5-1">
        <title>Acknowledgements</title>
        <p>This work has been funded by the Deutsche Forschungsgemeinschaft (DFG) within
the project "Rationalizing Recommendations", Grant Number 376059226, as part of
the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>David M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <year>2003</year>
          .
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>In Journal of Machine Learning Research</source>
          , pages
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Busse</surname>
          </string-name>
          , Bernhard Humm, Christoph Lübbert, Frank Moelter, Anatol Reibold, Matthias Rewald, Veronika Schlüter, Bernhard Seiler, Erwin Tegtmeier, and Thomas Zeh,
          <year>2015</year>
          .
          <article-title>Actually, What Does “Ontology” Mean? A Term Coined by Philosophy in the Light of Different Scientific Disciplines</article-title>
          .
          <source>In Journal of Computing and Information Technology (CIT)</source>
          ,
          <volume>23</volume>
          ,
          <issue>1</issue>
          , pages
          <fpage>29</fpage>
          -
          <lpage>41</lpage>
          . https://doi.org/10.2498/cit.1002508
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Steven L.</given-names>
            <surname>Camiña</surname>
          </string-name>
          .
          <article-title>A Comparison of Taxonomy Generation Techniques Using Bibliometric Methods: Applied to Research Strategy Formulation</article-title>
          ,
          <source>Composite Information Systems Laboratory (CISL) Working Paper 2010-01</source>
          , Massachusetts Institute of Technology,
          <year>2010</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Xiangfeng</given-names>
            <surname>Dai</surname>
          </string-name>
          , Marwan Bikdash, and Bradley Meyer,
          <year>2017</year>
          .
          <article-title>From social media to public health surveillance: Word embedding based clustering method for twitter classification</article-title>
          .
          <source>In Proceedings of SoutheastCon</source>
          , pages -
          <fpage>7</fpage>
          , https://doi.org/10.1109/SECON.2017.7925400
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Tobias</given-names>
            <surname>Eljasik-Swoboda</surname>
          </string-name>
          , Michael Kaufmann, and Matthias Hemmje,
          <year>2018</year>
          .
          <article-title>No Target Function Classifier Fast Unsupervised Text Categorization Using Semantic Spaces</article-title>
          .
          <source>In Proceedings of DATA 2018</source>
          , https://doi.org/10.5220/0006847000350046
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Vanessa Wei</given-names>
            <surname>Feng</surname>
          </string-name>
          and
          <string-name>
            <given-names>Graeme</given-names>
            <surname>Hirst</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Classifying arguments by scheme</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume</source>
          <volume>1</volume>
          , pages
          <fpage>987</fpage>
          -
          <lpage>996</lpage>
          , Portland, Oregon. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Ruiji</given-names>
            <surname>Fu</surname>
          </string-name>
          , Jiang Guo, Bing Qin, Wanxiang Che,
          <string-name>
            <given-names>Haifeng</given-names>
            <surname>Wang</surname>
          </string-name>
          , and Ting Liu,
          <year>2014</year>
          .
          <article-title>Learning Semantic Hierarchies via Word Embeddings</article-title>
          .
          <source>In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>1199</fpage>
          -
          <lpage>1209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Habernal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse</article-title>
          .
          <source>In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>2127</fpage>
          -
          <lpage>2137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Vipul</given-names>
            <surname>Kashyap</surname>
          </string-name>
          , Cartic Ramakrishnan, Christopher Thomas,
          <string-name>
            <given-names>Amit P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <year>2005</year>
          ,
          <article-title>TaxaMiner: An Experimentation Framework for Automated Taxonomy Bootstrapping</article-title>
          .
          <source>International Journal of Web and Grid Services</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>240</fpage>
          -
          <lpage>266</lpage>
          , available online: http://corescholar.libraries.wright.edu/knoesis/744
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Michael Kaufmann, Tobias Eljasik-Swoboda,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Nawroth</surname>
          </string-name>
          , Kevin Berwind, Marco Bornschlegl, and Matthias Hemmje,
          <year>2017</year>
          .
          <article-title>Modeling and Qualitative Evaluation of a Management Canvas for Big Data Applications</article-title>
          .
          <source>In Proceedings of DATA 2017</source>
          , https://doi.org/10.5220/0006397101490156
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Marco</given-names>
            <surname>Lippi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Torroni</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Argumentation Mining: State of the Art and Emerging Trends</article-title>
          .
          <source>In ACM Transactions on Internet Technology (TOIT)</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ), pages
          <fpage>10</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wen-tau</given-names>
            <surname>Yih</surname>
          </string-name>
          , and Geoffrey Zweig,
          <year>2013</year>
          .
          <article-title>Linguistic regularities in continuous space word representations</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          , pages
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Raquel</given-names>
            <surname>Mochales</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marie-Francine</given-names>
            <surname>Moens</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Argumentation mining</article-title>
          .
          <source>In Artificial Intelligence and Law</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ), pages
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Nawroth</surname>
          </string-name>
          , Felix Engel, Tobias Eljasik-Swoboda, and Matthias Hemmje,
          <year>2018</year>
          .
          <article-title>Emerging Named Entity Recognition for Clinical Argumentation Support</article-title>
          . Submitted to
          <source>Proceedings of DATA 2018</source>
          , https://doi.org/10.5220/0006853200470055
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Niall</given-names>
            <surname>Rooney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hui</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fiona</given-names>
            <surname>Browne</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Applying kernel methods to argumentation mining</article-title>
          .
          <source>In Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society Conference</source>
          , pages
          <fpage>272</fpage>
          -
          <lpage>275</lpage>
          .
          <publisher-name>Association for the Advancement of Artificial Intelligence</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Machine learning in automated text categorization</article-title>
          .
          <source>In ACM computing surveys (CSUR)</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ), pages
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Stab</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Identifying argumentative discourse structures in persuasive essays</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>46</fpage>
          -
          <lpage>56</lpage>
          , Doha, Qatar, October. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Stab</surname>
          </string-name>
          , Johannes Daxenberger, Chris Stahlhut,
          <string-name>
            <given-names>Tristan</given-names>
            <surname>Miller</surname>
          </string-name>
          , Benjamin Schiller, Christopher Tauchmann,
          <string-name>
            <given-names>Steffen</given-names>
            <surname>Eger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>ArgumenText: Searching Arguments in Heterogeneous Sources</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations</source>
          , pages
          <fpage>21</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19. U.S. National Library of Medicine.
          <year>1996</year>
          . PubMed. Available at: https://www.ncbi.nlm.nih.gov/pubmed/ (accessed: May 28,
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. U.S. National Library of Medicine.
          <year>1999</year>
          .
          <article-title>Medical Subject Headings</article-title>
          . Public Domain Information. Available at: https://www.nlm.nih.gov/mesh/ (accessed: May 29,
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>