<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>What Company Does My News Article Refer to? Tackling Multiclass Problems With Topic Modeling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <email>max.luebbering@iais.fraunhofer.de</email>
          <aff>
            <institution>Fraunhofer IAIS</institution>
            ,
            <addr-line>Sankt Augustin</addr-line>
            ,
            <country country="DE">Germany</country>
          </aff>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Reading</institution>
          ,
          <addr-line>Reading</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Weierstrass Institute (WIAS)</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>While it is technically trivial to search for the company name to predict the company a news article refers to, this often leads to incorrect results. In this article, we compare two approaches, bag-of-words with k-nearest neighbors and Latent Dirichlet Allocation with k-nearest neighbors, by assessing their applicability for predicting the S&amp;P 500 company that is mentioned in a business news article or press release. Both approaches are evaluated on a corpus of 62k documents containing 84% news articles and 16% press releases. While the bag-of-words approach yields accurate predictions, it is highly inefficient due to its gigantic feature space. The Latent Dirichlet Allocation approach, on the other hand, achieves roughly the same prediction accuracy (0.58 instead of 0.62) but reduces the feature space by a factor of seven.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Classification</kwd>
        <kwd>Divergence</kwd>
        <kwd>Company Prediction</kwd>
        <kwd>Kullback-Leibler</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Labeling a news article with the company that it deals with is not an easy
classification task. In this paper, we develop a model that a) classifies news
articles with hundreds of target labels (i.e., companies), b) does not require
retraining for every additional target label, and c) supports anonymized news
articles (i.e., no company or stock symbol mentions).</p>
      <p>
        To solve this classification problem, we evaluate two alternative approaches,
namely bag-of-words with k-nearest neighbors and Latent Dirichlet Allocation
with k-nearest neighbors, on a large data set of news. The first approach spans a
huge feature space, as every feature of the bag-of-words model is passed directly
to the k-nearest neighbor classifier, leading to an overall marginally better
performance than the second approach, while being highly inefficient. The second
approach transforms the features of the bag-of-words model into a new set of features
(the topics in the training corpus), which, as shown in the evaluation, is smaller by
a factor of seven. The feature transformation is based on Blei's Latent Dirichlet
Allocation (LDA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. (Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).)
      </p>
      <p>
        While state-of-the-art approaches may seem more applicable than LDA, they do
not fulfill all of the three model requirements stated above. Named entity
recognition with subsequent company name matching would fail for anonymized texts.
LSTMs (especially attention-based LSTMs) would work for longer texts, but
require retraining. Even though models based on embedding representations could
fulfill all three requirements, they have poor clustering abilities. Pre-trained word
embedding models like Word2Vec [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and character-level word embedding models
like Flair [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] can encode syntactic and semantic word-level properties. However,
their clustering and classification capabilities (especially on the document level) are
rather limited [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The two models are trained on word and character
substitutability, respectively, which does not provide distinct topical clusters in its
vanilla form [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. They thus require further downstream classification layers that
need to be retrained with every additional target label.
      </p>
      <p>This article is structured in five sections: In Section 2, we first present text
classification methods known in the literature and subsequently detail our
strategies to combine some of them. Section 3 deals with the implementation of the two
proposed approaches from the previous section. The two approaches are further
evaluated on multiple data sets in Section 4. These results are discussed and put
into perspective in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>Before presenting our two specific approaches for predicting the company a news
article refers to in Section 2.2, we present the work other researchers have
performed on text classification.</p>
      <sec id="sec-2-1">
        <title>Related Work</title>
        <p>
          Topic-based text classification has already been covered in different research
areas [
          <xref ref-type="bibr" rid="ref11 ref8">11,8</xref>
          ].
        </p>
        <p>
          A possible way to preprocess documents is to represent them by a topic model,
as done by the authors of [
          <xref ref-type="bibr" rid="ref10 ref9">9,10</xref>
          ]. For a finite set of topics, each document's topic
representation is a categorical distribution over the topics.
        </p>
        <p>
          In contrast to [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], whose authors picked a specific corpus of clinical reports,
the authors in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] chose universal datasets (e.g., a subset of the Wikipedia article
corpus) to train a topic model. The topic model then predicted the topics of a
labeled data set, which were then used as input to train multiple classifiers, e.g.,
naïve Bayes or support vector machines (SVMs). The models they trained were
able to classify short and sparse texts.
        </p>
        <p>
          The application of domain-specific corpora to text classification has been
shown by Sarioglu et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. They built a topic model from clinical reports in
order to represent them in a more compact feature space than one built from a
bag-of-words model. The representations were then used to classify CT (computed
tomography) imaging reports using SVMs. Their results showed that the topic
model approach was competitive with a bag-of-words approach, while reducing
the number of features significantly.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Design</title>
        <p>We train our two approaches on the company descriptions corpus D and evaluate
the models on the news article corpus A, as shown in Figure 1. Each company
description di and news article ai contains the respective document in its
human-readable text representation. In both corpora each document is labeled with the
respective company name.</p>
        <sec id="sec-2-2-1">
          <title>Company descriptions</title>
          <p>D = (d_1, d_2, ..., d_M)</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>News articles</title>
          <p>A = (a_1, a_2, ..., a_L)</p>
          <p>Our two approaches classify news articles and press releases by predicting
their similarity to a set of company descriptions, which are labeled with the
corresponding company name. In order to reduce complexity, our approaches are
trained to predict only one company per news article, and the set of companies
is limited to the companies (denoted by C) listed in the S&amp;P 500 index.</p>
          <p>Next, we explain our two classification approaches. The bag-of-words with
k-nearest neighbor (BOW KNN) approach is the trivial one that we want
to beat with the second approach with respect to prediction accuracy and model
complexity. The second approach, namely Latent Dirichlet Allocation with
k-nearest neighbor (LDA KNN), is more sophisticated and can be seen as
an extension of the first one. The advantage of this extension is its capability to
reduce the set of features tremendously, as we will see later in this article.</p>
          <p>As the k-nearest neighbor models in both approaches will be trained on
categorical distributions, which are defined on a simplex, the Euclidean distance is
not applicable as a metric. This is why we use the Kullback-Leibler divergence
(KL) to determine the similarity between two categorical distributions p and q
according to</p>
          <p>KL(p||q) = Σ_{x ∈ X} p(x) ln(p(x)/q(x)) ∈ [0, ∞), (1)</p>
          <p>where X is the set of all categories (here, the set of all words in the vocabulary v).
Note that, in contrast to the Euclidean norm, the Kullback-Leibler divergence
is not a metric (in the mathematical sense) and is used as a similarity measure,
where a value of zero means maximum similarity. (In the following, for simplicity,
we refer to press releases as news articles as well.)</p>
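          <p>The divergence in Equation (1) can be sketched in a few lines; this is a minimal illustration assuming NumPy, with a small smoothing constant eps (our choice, not from the paper) guarding against division by zero and log(0):</p>
          <p>
```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Smoothed KL(p||q) between two categorical distributions given as
    # 1-D arrays that sum to one; eps avoids log(0) for empty categories.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.7, 0.2, 0.1])

print(kl_divergence(p, p))  # 0.0: identical distributions, maximum similarity
print(kl_divergence(p, q))  # positive, and generally differs from kl_divergence(q, p)
```
          </p>
          <p>The asymmetry of the last two calls illustrates why KL is a similarity measure rather than a metric.</p>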
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Bag-of-words with k-nearest neighbor</title>
        <p>Bag-of-words with k-nearest neighbor (BOW KNN) first builds a vocabulary (the set
of all distinct words in a corpus) from the company description corpus D and
counts the number of occurrences of each word in each document, which can be
represented by a matrix</p>
        <p>D̃ = (d̃_{i,j}) ∈ R^{M × |v|}, (2)</p>
        <p>
          where M is the number of company descriptions in the corpus, |v| is the length
of the vocabulary v, and an element d̃_{i,j} represents how often word j appears in
company description i. Note that usually |v| ≫ M. This absolute word-frequency
representation of the company descriptions is called the bag-of-words
representation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In the following, we always normalize the bag-of-words representation
matrix of a corpus in a row-wise fashion. For the bag-of-words representation D̃
of the company descriptions D this means that every element in the matrix
is divided by its document's length: d̃_{i,j} → d̃_{i,j} / Σ_{j'=1}^{|v|} d̃_{i,j'}. Since the normalization
ensures Σ_{j=1}^{|v|} d̃_{i,j} = 1 and 0 ≤ d̃_{i,j} ≤ 1, every document is represented by a
categorical distribution over the vocabulary.
        </p>
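          <p>The construction of the row-normalized bag-of-words matrix can be sketched with scikit-learn; the two documents below are hypothetical stand-ins for company descriptions from D:</p>
          <p>
```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Hypothetical stand-ins for two company descriptions from D.
docs = [
    "car manufacturer builds electric cars",
    "bank offers credit cards and loans",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)     # absolute word frequencies, Eq. (2)
bow = normalize(counts, norm="l1", axis=1)  # row-wise division by document length

# Every row is now a categorical distribution over the vocabulary v.
print(np.asarray(bow.sum(axis=1)).ravel())  # [1. 1.]
```
          </p>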
        <p>In the following, we will assume every bag-of-words model to be present in its
normalized form.</p>
        <p>As shown in Figure 2, the idea of BOW KNN is to build a bag-of-words
model from the company description corpus D (i.e., determine v) and represent
the corpus in terms of the bag-of-words model, leading to the matrix D̃. These
representations are then used to train the k-nearest neighbor classifier (KNN
classifier).</p>
        <p>For news article prediction, the news article ai is represented in terms of the
same bag-of-words model and then passed to the k-nearest neighbor classifier,
which determines the k best matching company descriptions for the given article
ai. This process is shown in Figure 3.</p>
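        <p>A minimal end-to-end sketch of this prediction flow, assuming scikit-learn; the descriptions, labels, and the article text are hypothetical, and the smoothed KL helper is our own stand-in for the divergence used as the KNN "metric":</p>
        <p>
```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

def kl(p, q, eps=1e-12):
    # Smoothed Kullback-Leibler divergence passed to KNN as its "metric".
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

# Hypothetical company descriptions with their company labels.
descriptions = ["electric car battery factory", "search engine advertising revenue"]
labels = ["CarCo", "AdCo"]

vec = CountVectorizer()
train = normalize(vec.fit_transform(descriptions), norm="l1").toarray()

# A callable metric requires brute-force neighbor search on dense arrays.
knn = KNeighborsClassifier(n_neighbors=1, metric=kl, algorithm="brute")
knn.fit(train, labels)

# The news article is represented in terms of the SAME bag-of-words model.
article = normalize(vec.transform(["a new battery for an electric car"]), norm="l1").toarray()
print(knn.predict(article))  # ['CarCo']
```
        </p>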
        <p>The BOW KNN approach has two shortcomings:
- Bag-of-words models often have a very large feature space (dimensionality =
|v|), which makes similarity calculations between documents expensive and
time-consuming.
- Unimportant features add a lot of noise to the similarity measurement, thus
leading to models that do not generalize well in practice.
While feature selection strategies like mutual information and the χ² test statistic
have been successfully used for years to reduce the dimensionality by identifying
unimportant features, Latent Dirichlet Allocation (LDA), which is part of our
second approach, generates a completely new set of features, namely the topics of
a corpus.</p>
        <p>Using the Latent Dirichlet Allocation with k-nearest neighbor (LDA KNN)
approach, we try to solve the previously listed shortcomings. We will analyze
how LDA KNN can reduce the huge feature space of dimensionality |v| spanned
by the bag-of-words model, while incurring only small losses in accuracy. This
approach takes the normalized bag-of-words representations d̃_i of the company
descriptions and trains a topic model using LDA.</p>
        <p>
          Each of these topics φ_i of the topic model is a categorical distribution over
the vocabulary v, where the importance of a word for a topic is expressed by
the amount of probability assigned to the word. The modeling task in LDA is to
determine the categorical distribution for each topic φ_i and the parameterization
of the Dirichlet distribution (see [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]), such that the log likelihood l(α, β) of
the company description corpus is maximized:
        </p>
        <p>arg max_{α,β} l(α, β) = arg max_{α,β} Σ_{i=1}^{M} log p(d̃_i | α, β). (3)</p>
        <p>Note that the parameter β is the set of all topics and p(d̃_i | α, β) is the probability
of company description d̃_i given the model parameters α and β.</p>
        <p>We set the number of topics (a hyperparameter) in the LDA model to match
the number of all companies in C. Since the company descriptions
belonging to the same company are most similar to each other in terms of
their words, we thereby create an artificial bias towards representing
every company by exactly one topic, albeit articles of different companies
may be similar, e.g., articles from the same industry such as car manufacturers.</p>
        <p>Having trained the LDA model on the company description corpus, i.e.,
estimated α and β, we predict the topic distribution for every company description
with the LDA model, as shown in Figure 4. This yields a normalized topic
prediction matrix D of dimensions M × |topics|, with 0 ≤ D_{i,j} ≤ 1 and
Σ_{j=1}^{|topics|} D_{i,j} = 1 for all i ∈ {1, ..., M}. The matrix entry at position (i, j) denotes how
much the jth topic is represented in the ith company description in relation to
all the other topics.</p>
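        <p>This topic-prediction step can be sketched with scikit-learn's LDA implementation; the mini-corpus below is hypothetical, and in the paper n_components is chosen in the order of |C| rather than 2:</p>
        <p>
```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus of company descriptions.
docs = [
    "car engine battery",
    "bank loan credit",
    "car battery factory",
    "credit card bank",
]
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, learning_method="batch", random_state=0)
topic_matrix = lda.fit_transform(counts)  # shape (M, |topics|)

# Each row is a normalized topic distribution: entries in [0, 1], rows sum to 1.
print(topic_matrix.shape)  # (4, 2)
```
        </p>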
        <p>The k-nearest neighbor classifier is trained on D. As |companies| ≪ |v|, the
feature space of the KNN classifier is reduced significantly compared to BOW
KNN.</p>
        <p>As shown in Figure 5, in order to predict its company, the news article is first
represented in terms of the bag-of-words model, then by the LDA topic model,
and finally passed to the KNN classifier. The KNN classifier picks the k most
similar company descriptions by calculating the Kullback-Leibler divergence
between the topic distribution ai of the news article and the topic distributions
of all company descriptions. We then predict the company whose company
descriptions were selected most frequently among the k closest ones. Note that
both approaches also allow us to quantify a prediction's certainty. When all of the
k neighbors are the same, the model is 100% certain, whereas when two companies are
predicted four and six times out of k = 10, the model is only 60% certain.</p>
        <p>In this section, we explain the implementation of the methodology proposed in
the previous section by means of the general machine learning process: data
acquisition, preprocessing, and modeling.</p>
        <p>To train our models, we need multiple company descriptions per company, which
is why we implemented a company description crawler that retrieves the
company descriptions for every company in the S&amp;P 500 index from a total of eight
relevant financial information providers such as Yahoo! Finance or Google Finance.</p>
        <p>In order to evaluate the models' performance in the field, we crawled news
articles from different outlets. These news articles were unlabeled, i.e., they did not
indicate the company name(s), and were therefore not applicable for evaluation. To
label them, our first approach was to search Google Finance and Google News
by company name or company stock symbol. For each crawled news article URL
that matched one in our news article data set, we were able to label the news
article with the company name. This approach yielded only 666 labeled articles (see
Section 4.2); therefore, we labeled articles using a keyword search in a second
approach (trivial labeler). For each company, the number of its occurrences in a
news article is counted, and the article is labeled with the company contained
most often in the article. Manual inspection revealed many incorrect labels, which
is why we put further constraints on an article to be successfully labeled: The
top company has to appear at least three times in an article, the total number of
companies mentioned in the article has to be less than four, and the top company
has to be mentioned at least two more times than the second most mentioned
company. Applying these constraints, we manually determined a proportion of
99% correctly labeled news articles in a randomly generated subset of 400 news
articles.</p>
        <p>All documents were retrieved in plain HTML, which is why we applied
a two-layered preprocessing: 1) The HTML tree is parsed to extract the article text. 2)
The text is tokenized, POS-tagged, lemmatized, and stop-word cleaned.</p>
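        <p>The trivial labeler's constraints can be sketched as follows; the function name and the naive substring counting are our own illustrative choices, not the paper's implementation:</p>
        <p>
```python
from collections import Counter

def trivial_label(text, companies):
    # Count company-name occurrences; label only if the constraints from
    # Section 3 hold: top company mentioned at least 3 times, fewer than
    # 4 companies mentioned, and a lead of at least 2 over the runner-up.
    counts = Counter({c: text.count(c) for c in companies})
    counts = Counter({c: n for c, n in counts.items() if n > 0})
    if not counts or len(counts) >= 4:
        return None
    ranked = counts.most_common(2)
    top, top_n = ranked[0]
    second_n = ranked[1][1] if len(ranked) > 1 else 0
    if top_n >= 3 and top_n >= second_n + 2:
        return top
    return None

print(trivial_label("Apple beats Apple estimates as Apple grows; Tesla dips.",
                    ["Apple", "Tesla"]))  # Apple
print(trivial_label("Apple and Tesla and Ford and BMW",
                    ["Apple", "Tesla", "Ford", "BMW"]))  # None: four companies
```
        </p>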
      </sec>
      <sec id="sec-2-4">
        <title>Modeling</title>
        <p>Having implemented the pipelines of both approaches using Python's scikit-learn
library (http://scikit-learn.org/stable/index.html), we needed to hyperparameterize
both pipelines before training the models. Since we observed huge differences in the
accuracy on our validation set for different hyperparameterizations, we decided to
train multiple models on a variety of parameter combinations in a grid search
fashion. In Table 1, we list all tested values for each parameter.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Hyperparameters</title>
        <p>Table 1: Tested parameter values per approach.
Bag-of-words with KNN:
- Bag-of-words modeling: max_df: [0.025, 0.05, 0.1, 0.2, 0.4]; min_df: [1, 3, 5, 7, 10]
- KNN modeling: n_neighbors: [3, 8, 15, 20]; weights: [distance, uniform]; metric: [KL, KL']
LDA with KNN:
- Bag-of-words modeling: max_df: [0.025, 0.05, 0.1, 0.2, 0.4]; min_df: [1, 3, 5, 7, 10]
- LDA modeling: n_components: [480, 1000, 2000, 3000, 5000, 10000]; learning_method: batch</p>
        <p>Since words that exist in nearly all or none of the articles are irrelevant
for the analysis, we require all considered tokens to be in a pre-defined document
frequency range when building the bag-of-words model. The hyperparameter
max_df limits the vocabulary to the words that have a lower document frequency
than the given maximum. Accordingly, min_df defines the lower bound. Note
that if we pass absolute values to either of the two hyperparameters, they
are not interpreted as document frequencies but as absolute appearances in the
documents.</p>
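        <p>The float-versus-integer semantics of these two parameters can be sketched with scikit-learn's CountVectorizer on a toy corpus of our own:</p>
        <p>
```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["alpha beta", "alpha gamma", "alpha delta", "beta delta"]

# A float max_df is a document-frequency fraction: 'alpha' (df = 3/4) is
# dropped because 0.75 exceeds 0.5. An integer min_df is an absolute
# document count: 'gamma' (in only 1 document) is dropped.
vec = CountVectorizer(max_df=0.5, min_df=2)
vec.fit(docs)
print(sorted(vec.vocabulary_))  # ['beta', 'delta']
```
        </p>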
        <p>KNN is parameterized such that, for classification, the k most similar
neighbors (n_neighbors) are picked based on the Kullback-Leibler divergence
(metric; note that scikit-learn calls this parameter "metric", even though KL is
a similarity measure and not a metric in the mathematical sense).
Classification is done by either weighting each of the closest neighbors
by their distance to the sample or by weighting them uniformly (weights).</p>
        <p>For LDA modeling, we have to pass the number of topics (n_components);
we picked batch as the learning method.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>Accuracy Measure</title>
        <p>We evaluate the models' performance by the accuracy measure</p>
        <p>a(model) = (1/|L|) Σ_{l=1}^{|L|} TP_l / |X_l|, (4)</p>
        <p>where L is the set of all classes present in the test data set X. The variable TP_l
is the number of instances in X having true class l that have also been predicted
as l (true positives). The subset X_l of X contains the instances that belong to
class l. This measure does not take the true negatives (TN) into account, as
these instances are almost always correctly predicted due to the large size of L
and would thereby lead to misleading accuracies.</p>
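        <p>Equation (4) amounts to averaging the per-class recalls; a minimal sketch with a hypothetical helper name and toy labels (this coincides with macro-averaged recall):</p>
        <p>
```python
import numpy as np

def class_averaged_accuracy(y_true, y_pred):
    # Equation (4): mean over classes of per-class recall TP_l / |X_l|;
    # true negatives are deliberately ignored.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

y_true = ["A", "A", "B", "C"]
y_pred = ["A", "B", "B", "C"]
print(class_averaged_accuracy(y_true, y_pred))  # (0.5 + 1.0 + 1.0) / 3
```
        </p>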
      </sec>
      <sec id="sec-3-2">
        <title>Datasets</title>
        <p>After having trained the models on the company descriptions data set, their
performance is subsequently evaluated on the company descriptions data set
and three additional data sets:
- GoogleSearch: News article set labeled by Google search.
Each news article crawled by the news article crawler (in total 2.2
million) was labeled by matching it against Google results when searching for
the company names. The resulting data set contains 666 news articles, of which
55% are press releases. The data set covers 275
companies of the S&amp;P 500 index.
- TrivialLabeled: News article set labeled by the trivial labeler.
Since the first data set turned out to be rather small, we decided to label
the 2.2 million news articles using a keyword search for company names, as
explained in Section 3.1. The resulting data set contains 62,000 labeled news
articles, of which 16% are press releases. It covers 458 companies of the S&amp;P
500 index while being highly imbalanced. To compensate for this, we limited
the number of news articles per company to 50. The final data set covered
13,000 articles.
- NoCompany: News article set without company names.
For this data set, we used the TrivialLabeled data set and deleted all
company names and stock symbols. The prediction accuracy for this data set
gives insights into whether the models have learned information about a company
beyond relying solely on, e.g., the company name and stock symbol.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Evaluation</title>
        <p>Having performed the grid search, we sorted the results by the accuracy on
the data set GoogleSearch, as shown in Figure 6. The dotted vertical line with
the highlighted points shows the parameter combinations which provided the
highest accuracy over the three data sets.</p>
        <p>There is a strong correlation between all three test sets, whereas the
company descriptions data set, especially for LDA KNN, shows no correlation
at all. The data set TrivialLabeled shows an overall better accuracy than the
data set NoCompany, while the accuracy is still high for well-chosen
parameters. Therefore, the models exploit the knowledge about company names and
stock symbols to some degree.</p>
        <p>Fig. 6: Grid search results. (a) BOW KNN grid search results; (b) LDA KNN
grid search results. Both panels plot accuracy over the measurements (sorted by
accuracy on the data set GoogleSearch) for the data sets Company descriptions,
GoogleSearch, TrivialLabeled, and NoCompany.</p>
        <p>The exact accuracies for the two best classifiers are given in Table 2. For all
three data sets, the best BOW KNN classifier exceeds
the best LDA KNN classifier by a few percentage points in terms of accuracy. The
best BOW KNN classifier's parameterization is max_df=0.1, min_df=1
(bag-of-words modeling) and n_neighbors=20, weights=distance, metric=KL' (KNN
modeling). For the best LDA KNN classifier we got the following
parameterization: max_df=0.2, min_df=1 (bag-of-words modeling), n_components=3000
(LDA modeling), and n_neighbors=20, weights=distance, metric=KL' (KNN
modeling). The accuracies of the BOW KNN model trained on a vocabulary of
size 3,102 are shown in parentheses for each data set in Table 2.</p>
        <p>We have shown that both approaches work in the field and exceed random
company guessing by at least a factor of 263 on the data set TrivialLabeled after a
proper grid search parameter optimization. On all data sets, BOW KNN
outperforms LDA KNN by only about three percentage points in terms of accuracy
(Equation (4)).</p>
        <p>Additionally, the LDA part in LDA KNN reduces the feature space from
21,700 to 3,000 features, which is a reduction by 86%. Training BOW KNN on
a feature space of a size similar to the best LDA KNN model (i.e., 3,000 features)
yields poor results in terms of accuracy, making LDA KNN clearly superior for
smaller feature spaces.</p>
        <p>As stated in Section 2.2, the intention was to create a bias (due to the
company descriptions) such that every company would be represented by exactly
one topic. Our grid search results showed that this bias was not strong enough
and led to an optimal number of topics that was about six times higher than
expected.</p>
        <p>Furthermore, we have shown that both algorithms can learn information
about a company other than its plain company name or its stock symbol. This
is a huge advantage over algorithms that perform a simple keyword search
for the company name. This insight is important when, e.g., texts have been
deliberately anonymized or only the products, not their producer, are named in an
article. In this case a simple keyword search by company or a NER approach
would not yield any results, while BOW KNN and LDA KNN still provide
valuable results.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Akbik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blythe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vollgraf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Contextual string embeddings for sequence labeling</article-title>
          .
          <source>In: COLING 2018, 27th International Conference on Computational Linguistics</source>
          . pp. 1638–1649
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          :
          <article-title>Probabilistic topic models</article-title>
          .
          <source>Commun. ACM</source>
          <volume>55</volume>
          ,
          <fpage>77</fpage>
          –
          <lpage>84</lpage>
          (
          <year>2012</year>
          ), http://doi.acm.org/10.1145/2133806.2133826
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research 3</source>
          , 993–1022
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Diaz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craswell</surname>
          </string-name>
          , N.:
          <article-title>Query expansion with locally-trained word embeddings</article-title>
          .
          <source>arXiv preprint arXiv:1605.07891</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Task-oriented word embedding for text classi cation</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Computational Linguistics</source>
          . pp. 2023–2032
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Schutze, H.: Introduction to Information Retrieval, p.
          <fpage>117</fpage>
          . Cambridge University Press (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vaithyanathan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Thumbs up?: Sentiment classification using machine learning techniques</article-title>
          .
          <source>In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing -</source>
          Volume
          <volume>10</volume>
          . pp.
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          (
          <year>2002</year>
          ), https://doi.org/10.3115/1118693.1118704
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Phan</surname>
            ,
            <given-names>X.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horiguchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Learning to classify short and sparse text &amp; web with hidden topics from large-scale data collections</article-title>
          .
          <source>In: Proceedings of the 17th International Conference on World Wide Web</source>
          . pp.
          <fpage>91</fpage>
          -
          <lpage>100</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2008</year>
          ), http://doi.acm.org/10.1145/1367497.1367510
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sarioglu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yadav</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          :
          <article-title>Topic modeling based classification of clinical reports</article-title>
          .
          <source>In: 51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop</source>
          . pp.
          <fpage>67</fpage>
          -
          <lpage>73</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sriram</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuhry</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demir</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferhatosmanoglu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demirbas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Short text classification in twitter to improve information filtering</article-title>
          .
          <source>In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <fpage>841</fpage>
          -
          <lpage>842</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2010</year>
          ), http://doi.acm.org/10.1145/1835449.1835643
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>