<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural semi-supervised learning for multi-labeled short-texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johnny Torres</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmen Vaca</string-name>
          <email>cvaca@espol.edu.ec</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ESPOL Polytechnic University Department of Electrical and Computer Engineering</institution>
          ,
          <addr-line>FIEC</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The massive data generated by users on online platforms, such as social networks, creates challenges for text classification tasks based on supervised learning. Supervised learning often requires extensive feature engineering or a significant amount of annotated data to achieve good results. However, the scarcity of annotated data is a critical issue, and manual annotation can be both costly and time-consuming. Semi-supervised learning requires far less annotated data and achieves performance similar to supervised approaches. In this paper, we introduce a semi-supervised neural architecture for multi-label settings that combines deep learning representations and k-means clustering. The results show that the semi-supervised approach can leverage large-scale unlabeled data, achieving better results than unsupervised baselines and results close to supervised methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The classification or grouping of short texts is critical to various text mining
and information retrieval tasks in the context of social networks and user-generated
web content. Specifically, these tasks aim to categorize or group similar texts, so
that texts with the same label or group are similar to each other and different from texts
in other categories or groups. Traditional classification or grouping models often use a
sparse representation of the text, such as bag of words (BOW) or TF-IDF [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        However, the characteristics of short texts create problems for both
conventional unsupervised and supervised models. Usually, the number of unique words in
each short text is small (90% of the text instances in the HappyDB dataset have fewer
than 23 words); as a result, this lexical shortage generally leads to poor
grouping quality [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        An alternative to address the lexical shortage is to enrich text representations by
extracting features and relationships from sources such as Wikipedia [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or
ontologies [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]; however, this approach requires curated written knowledge, which also depends on
the language. Another alternative is to encode texts as distributed dense vectors [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] with
neural networks [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Another problem is the definition of the labels for a specific task and the number of
manually annotated instances required for each label. Unsupervised methods learn categories
from the data, but the resulting groupings may not be related to the expected labels.
Supervised methods have predefined labels but often require a considerable number of
labeled instances to learn to categorize. Semi-supervised approaches offer an alternative
that uses a small amount of data labeled according to predefined
classes while taking advantage of the massive availability of unlabeled data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        This paper investigates the research question: how can a semi-supervised approach
learn to categorize short texts in a multi-label taxonomy using a small set of labeled
data while leveraging large amounts of unlabeled data? To that end,
we build on a neural semi-supervised k-means clustering that modifies the
conventional objective function and adds a penalty term for labeled data [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. We extend
the neural semi-supervised clustering and apply it to multi-label settings. The results
show that semi-supervised k-means outperforms other baseline unsupervised models in
multi-label classification tasks.
      </p>
      <p>The rest of the paper is structured as follows: a) we review related work and
k-means clustering, b) we describe the neural semi-supervised clustering for multi-label
settings, c) we analyze the experimental results, and d) finally, we outline the
conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Previous work on semi-supervised clustering distinguishes two families of methods:
constraint-based and representation-based. Constraint-based approaches use a small
percentage of labeled data to restrict the clustering process [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In contrast, representation-based
methods first learn a data representation model that satisfies the labeled data, and then
use it to group both labeled and unlabeled data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        The hybrid approaches try to integrate both methods in a unified framework [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ];
however, the use of linear projections for representation learning limits the achievable
performance. Recent methods use deep neural architectures to learn text
representations that overcome the limitations of linear models [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. However,
separating the training of the data representation model from the clustering model
restricts the benefits and is closer to purely representation-based techniques. In this
work, the proposed model builds on an approach that combines the deep learning
representation and the clustering method into an integrated framework [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Unsupervised Learning</title>
      <p>
        In unsupervised learning, k-means is an algorithm for clustering data used in many
applications, including text mining tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The k-means algorithm partitions the data into
K clusters so as to minimize the distance of each point to the centroid of its
cluster, assigning each point to the nearest cluster. The input to the clustering model is the
set of short texts {s_1, s_2, s_3, ..., s_N} represented by the data points {x_1, x_2, x_3, ..., x_N},
where x_i is a sparse or dense vector.
      </p>
      <p>The k-means algorithm defines a set of binary variables r_nk ∈ {0, 1} for each data
point x_n, where k ∈ {1, ..., K} specifies the assigned cluster. For example, r_nk = 1 if
x_n is assigned to cluster k, and r_nj = 0 for j ≠ k. The objective function of k-means is
defined as:</p>
      <p>
        J_unsup = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖x_n − μ_k‖²   (1)
where μ_k is the centroid of the k-th cluster. The k-means algorithm learns the values
of {r_nk} and {μ_k} that optimize J_unsup. To minimize the objective function,
k-means utilizes an iterative gradient descent procedure [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>Each iteration involves two steps: assign clusters and estimate centroids. In the
assign clusters step, k-means minimizes J_unsup with respect to {r_nk} while keeping
{μ_k} fixed. In this case, J_unsup is linear in {r_nk}, so we can optimize each data
point separately by simply assigning the n-th data point to the closest cluster centroid.</p>
      <p>In the estimate centroids step, k-means minimizes J_unsup with respect to {μ_k} by
keeping {r_nk} fixed. In this case, J_unsup is a quadratic function of {μ_k}, and we
minimize it by setting its derivative with respect to μ_k to zero:
2 Σ_{n=1}^{N} r_nk (x_n − μ_k) = 0   (2)
Then, we can solve for μ_k as
μ_k = (Σ_{n=1}^{N} r_nk x_n) / (Σ_{n=1}^{N} r_nk)   (3)
Thus, μ_k corresponds to the mean of all the data points assigned to cluster k.</p>
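      <p>The two-step iteration above can be sketched in a few lines of Python (a minimal, illustrative sketch with toy data and Euclidean distance; not the implementation used in the paper):</p>
      <preformat>
```python
# Minimal k-means sketch: alternate the assign-clusters and estimate-centroids
# steps described above.  Data, K, and iteration count are toy values.

def kmeans(points, centroids, iters=20):
    assign = [0] * len(points)
    for _ in range(iters):
        # assign clusters: r_nk = 1 for the nearest centroid
        assign = [min(range(len(centroids)),
                      key=lambda k: sum((p - c) ** 2
                                        for p, c in zip(pt, centroids[k])))
                  for pt in points]
        # estimate centroids: mean of the points assigned to each cluster (Eq. 3)
        for k in range(len(centroids)):
            members = [pt for pt, a in zip(points, assign) if a == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign, centroids

points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
assign, cents = kmeans(points, centroids=[[0.0, 0.0], [5.0, 5.0]])
```
      </preformat>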
    </sec>
    <sec id="sec-4">
      <title>Neural Semi-supervised Clustering</title>
      <p>
        The classical k-means algorithm uses unlabeled data to solve the clustering problem
based on an unsupervised learning approach; however, the clustering results may not be
consistent with the expected labels. We extend the semi-supervised approach in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
which injects some supervised information into the learning process to produce useful
and coherent clusters. Similar to the classic k-means algorithm, the training steps for
the neural semi-supervised k-means are:
1. Initialize {μ_k} and the parameters θ of f(·).
2. Repeat until convergence:
(a) assign clusters: assign each short text to its nearest cluster centroid based on
its neural representation.
(b) estimate centroids: estimate the cluster centroids based on the cluster
assignments from the previous step.
(c) update parameters: update the neural network parameters according to the
objective function, keeping the centroids and cluster assignments fixed.
We represent each short-text entry s_i as a sequence of word indices; together with the
initial centroids, these form the input to the semi-supervised neural clustering model.
Then, the embedding layer maps each word in the sequence to a dense vector x = f(s),
using word embeddings initialized randomly or from pre-trained embeddings [
        <xref ref-type="bibr" rid="ref13 ref14">14, 13</xref>
        ].
In this approach, rather than training the text representation model independently, the
semi-supervised clustering integrates it with the k-means training process.
The neural semi-supervised clustering uses a small number of labeled instances to guide
the clustering process and minimizes the objective function defined as:
J_semi = Σ_{c=1}^{C} { α Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖f(s_n) − μ_k‖² + (1 − α) Σ_{n=1}^{L} ( ‖f(s_n) − μ_{g_n}‖² + Σ_{j≠g_n} [l + ‖f(s_n) − μ_{g_n}‖² − ‖f(s_n) − μ_j‖²]_+ ) }   (4)
where {(s_1, y_1), (s_2, y_2), ..., (s_L, y_L)} denote the labeled data, and the unlabeled
data is {s_{L+1}, s_{L+2}, ..., s_N}. The label y_i specifies the cluster for each short text s_i. The
outer sum iterates over the number of labels C defined in the taxonomy, thus extending
the original objective function in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The objective function contains two terms:
1. The first term is the objective function of the classic k-means algorithm (Eq. 1),
and the second term penalizes labeled data according to how far the predicted clusters
are from the ground-truth clusters. The factor α ∈ [0, 1] tunes the importance of the
unlabeled data.
2. The second term contains two sub-terms:
(a) The first sub-term penalizes each labeled instance according to the distance to
its correct cluster centroid, where g_n = G(y_n) indicates the
cluster ID given by the label y_n. The mapping function G(·) uses the Hungarian
algorithm [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
(b) The second sub-term is a hinge loss with margin l, where
[x]_+ = max(x, 0). This term incurs a loss if the distance to the ground-truth
centroid is larger (by the margin l) than the distance to a wrong centroid.
      </p>
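      <p>To make the objective concrete, the following Python sketch evaluates the semi-supervised loss for a single label (C = 1) on toy data; the names and values are illustrative assumptions, with a plain list of vectors standing in for the neural representations f(s_n):</p>
      <preformat>
```python
# Illustrative evaluation of the Eq. 4 objective for C = 1 (toy values only).
# reps[n] stands in for f(s_n); labels maps point index to ground-truth cluster.

def j_semi(reps, assign, centroids, labels, alpha, margin):
    def d2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # unsupervised term: the classic k-means objective over all points
    unsup = sum(d2(reps[n], centroids[assign[n]]) for n in range(len(reps)))
    # supervised term: distance to the ground-truth centroid plus hinge penalties
    sup = 0.0
    for n, g in labels.items():
        sup += d2(reps[n], centroids[g])
        for j in range(len(centroids)):
            if j != g:
                sup += max(margin + d2(reps[n], centroids[g])
                           - d2(reps[n], centroids[j]), 0.0)
    return alpha * unsup + (1.0 - alpha) * sup

value = j_semi(reps=[[0.0], [1.0], [4.0]], assign=[0, 0, 1],
               centroids=[[0.5], [4.0]], labels={0: 0},
               alpha=0.5, margin=1.0)
# value == 0.375 on this toy data
```
      </preformat>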
      <sec id="sec-4-1">
        <title>Model Training</title>
        <p>The parameters in J_semi are: the cluster assignments {r_nk}, the cluster
centroids {μ_k}, and the neural network weights θ of f(·). The goal is to find the values
of {r_nk}, {μ_k}, and θ that minimize J_semi. Following the k-means algorithm, the
semi-supervised model iteratively minimizes J_semi with respect to {r_nk}, {μ_k},
and θ.</p>
        <p>
          First, the model initializes the cluster centroids {μ_k} with the k-means++ method [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ],
and randomly initializes the parameters of the neural network. Then, the model
iteratively carries out three steps (assign clusters, estimate centroids, and update
parameters) until J_semi converges.
        </p>
        <p>The assign clusters step minimizes J_semi with respect to {r_nk} by keeping
f(·) and {μ_k} fixed, assigning a cluster ID to each data point. The second term in Eq. 4
does not depend on {r_nk}; thus, the model only needs to minimize the first term, by
assigning each text to its nearest cluster centroid, i.e., it is identical to the assign
clusters step of the k-means algorithm. In this step, the model also computes the mapping
between the ground-truth clusters specified by {y_i} and the cluster assignments of the
labeled data.</p>
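        <p>For a small K, the label-to-cluster mapping computed in this step can be illustrated with an exhaustive search over permutations, which returns the same optimal mapping the Hungarian algorithm would find (an illustrative sketch; the function name and toy data are our own):</p>
        <preformat>
```python
# Toy stand-in for the mapping G(.) between ground-truth labels and cluster IDs.
# The paper uses the Hungarian algorithm; for tiny K an exhaustive search over
# permutations gives the same optimal mapping.
from itertools import permutations

def best_mapping(true_labels, cluster_assign, num_clusters):
    best, best_hits = None, -1
    for perm in permutations(range(num_clusters)):
        # perm[y] is the cluster ID proposed for label y
        hits = sum(1 for y, a in zip(true_labels, cluster_assign) if perm[y] == a)
        if hits > best_hits:
            best, best_hits = perm, hits
    return dict(enumerate(best))  # label -> cluster ID

mapping = best_mapping([0, 0, 1, 1], [1, 1, 0, 0], num_clusters=2)
# mapping == {0: 1, 1: 0}
```
        </preformat>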
        <p>The estimate centroids step minimizes J_semi with respect to {μ_k} by keeping
{r_nk} and f(·) fixed, which corresponds to the estimate centroids step of the classic
k-means algorithm. It estimates the cluster centroids {μ_k} based on the cluster
assignments {r_nk} from the assign clusters step. In Eq. 4, the second term takes each
labeled instance into account when estimating the cluster centroids. By solving
∂J_semi/∂μ_k = 0, we get
μ_k = (Σ_{n=1}^{N} r_nk f(s_n) + Σ_{n=1}^{L} w_nk f(s_n)) / (Σ_{n=1}^{N} r_nk + Σ_{n=1}^{L} w_nk)   (5)
w_nk = (1 − α)(I′_nk + Σ_{j≠g_n} I″_nkj), with I′_nk = δ(k, g_n) and
I″_nkj = δ(k, j) 𝟙(l + ‖f(s_n) − μ_{g_n}‖² − ‖f(s_n) − μ_j‖² > 0)   (6)
where δ(x_1, x_2) = 1 if x_1 is equal to x_2 and 0 otherwise, and 𝟙(x) = 1 if x is true
and 0 otherwise. In the numerator of Eq. 5, the first term represents the contributions
from all data points, where the weight of s_n for μ_k is r_nk. The second term represents
the labeled data, where w_nk is the weight of instance s_n for μ_k.</p>
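        <p>The weighted mean in Eq. 5 can be sketched as follows (a toy Python illustration; the `weights` structure holding the w_nk of labeled points and all names are our own assumptions):</p>
        <preformat>
```python
# Sketch of the weighted centroid update in Eq. 5: unlabeled points contribute
# with weight r_nk (1 if assigned to cluster k), labeled points with weight w_nk.

def update_centroid(k, reps, assign, weights):
    # weights[n][k] = w_nk for labeled point n; absent entries count as zero
    num = [0.0] * len(reps[0])
    den = 0.0
    for n, rep in enumerate(reps):
        w = (1.0 if assign[n] == k else 0.0) + weights.get(n, {}).get(k, 0.0)
        num = [acc + w * x for acc, x in zip(num, rep)]
        den += w
    return [x / den for x in num] if den else None

# labeled point 2 is pulled into cluster 0 with weight w = 1.0
new_c0 = update_centroid(0, reps=[[0.0], [2.0], [10.0]],
                         assign=[0, 0, 1], weights={2: {0: 1.0}})
# new_c0 == [4.0]
```
        </preformat>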
        <p>
          The update parameters step minimizes J_semi with respect to f(·) by keeping {r_nk}
and {μ_k} fixed; it has no counterpart in the k-means algorithm. Its primary goal is to
learn the parameters of the text representation model. The training uses J_semi as the
loss function and employs the Adam algorithm to optimize it [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <sec id="sec-5-1">
        <title>Experimental Setting</title>
        <p>
          We evaluate the models on the HappyDB dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], comprising individual accounts of
happy moments. The aim is to predict the agency and social labels that indicate the
context of happy moments. For training, we use a small labeled subset and a large
unlabeled subset [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Table 1 summarizes the number of labeled and unlabeled
text instances for training, as well as the number of text instances in the test set. For
the experiments, the splitting strategy is to randomly sample 80% of the labeled instances
for training (training set) and the remaining 20% for validation (validation set).
The unsupervised and semi-supervised models use the unlabeled instances for training
(unlabeled set). We train the models using k-fold cross-validation (k = 10) on the
training set and report the results on the validation set using the F1 metric.
Neural architectures introduce several hyper-parameters, such as the output dimension of
the text representation models, while the semi-supervised k-means clustering adds α in Eq.
4. This subsection analyzes the impact of some of these hyper-parameters and determines
the configuration for further experiments.
        </p>
        <p>Embeddings dimension To evaluate the effect of the output dimension of the text
representation models, we perform experiments with embedding sizes of {50, 100, 200,
300, 500, 1000}, keeping all other parameters fixed. Figure 1 shows that the F1 score
drops for sizes below 100 and falls again for sizes above 500. Based on these results,
we use 300 as the embedding size.</p>
        <p>Alpha We evaluate the effect of α in Eq. 4, which controls the importance of
unlabeled data in the performance of the model. We test values of {0.00001, 0.0001,
0.001, 0.01, 0.1}, keeping the other parameters fixed. Figure 2 shows that the
performance decays for small values. As we increase the value of α, we notice
progressive improvements, reaching a peak F1 score at α = 0.1. Further experiments
use α = 0.1, as it maximizes F1.</p>
        <p>Fig. 3. Influence of the size of labeled data used for training.</p>
        <p>Labeled set size This experiment measures the influence of the size of the labeled
data. We evaluate labeled-data ratios for training between 1% and 10%, keeping the other
parameters fixed. Figure 3 illustrates the performance improvement as the size of the
labeled data increases and confirms the importance of labeled data for training.</p>
        <p>Pre-training This option measures the effect of pre-trained embeddings in the neural
architectures. We pre-train the models on a classification task with labeled data, and
then use the weights (excluding the top layer) to initialize the semi-supervised
clustering. We evaluate several pre-trained embeddings, such as Word2Vec, GloVe, and
FastText. Figure 4 shows that pre-trained embeddings achieve superior performance
compared to random embeddings; for further experiments we use FastText.</p>
        <p>This subsection compares the proposed semi-supervised approach with unsupervised
and supervised models.</p>
        <p>Unsupervised learning: All unsupervised models use k-means for clustering. We
cluster with k = 2 to map the values of each label (0, 1). For representation learning,
we use the following methods:
– BOW: represents each short text as a sparse vector based on term frequency (TF).
– TF-IDF: similar to BOW, uses a sparse vector to represent each short text, based on
term frequency-inverse document frequency.
– AVG-EMB: represents each short text by the average of its word embedding vectors.</p>
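        <p>The AVG-EMB representation can be sketched as follows (a toy embedding table for illustration; the real experiments use trained embeddings):</p>
        <preformat>
```python
# Sketch of the AVG-EMB baseline: each short text is represented by the
# average of its word embedding vectors.  The table below is a toy dict.

def avg_emb(text, table, dim=2):
    vecs = [table[w] for w in text.split() if w in table]
    if not vecs:
        return [0.0] * dim  # out-of-vocabulary text maps to the zero vector
    return [sum(col) / len(vecs) for col in zip(*vecs)]

table = {"happy": [1.0, 0.0], "day": [0.0, 1.0]}
rep = avg_emb("happy day", table)
# rep == [0.5, 0.5]
```
        </preformat>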
        <p>Supervised learning: We evaluate several supervised models for the classification
task; the representation learning depends on each model, as described next:
– LR: uses a sparse vector representation that feeds a logistic regression classifier.
– FastText: uses a dense word vector representation (embeddings layer), followed by
a Global Average Pooling layer, which averages the word embeddings, and then
a Dense layer with sigmoid activation to predict the labels.
– CNN: uses a dense word vector representation (word embeddings layer) followed
by a Dropout layer, then a convolutional layer, and an output layer with sigmoid
activation.
– LSTM: similar to CNN, but the word embeddings layer feeds a recurrent LSTM
layer, which is more suitable for modeling sequences such as texts.
– BiLSTM: uses two LSTM networks to model the text sequences in both directions,
followed by a Dropout layer with rate 0.5, and then a dense layer with sigmoid
activation.
– CNN-LSTM: leverages the ability of the CNN layer to capture salient features and
the sequence modeling capability of the LSTM.</p>
        <p>Table 2 summarizes the scores of the models on the test set. The models fall into
three categories (type): unsupervised, supervised, and semi-supervised. The metrics are
precision, recall, and F1. We report the scores for each label: agency and social. The last
three columns show the total weighted score of the metrics for each model. The results
show that the supervised systems outperform the unsupervised models by a large margin,
which highlights the importance of labeled data. Among the supervised models, deep
neural models perform better than the baseline method (LR), although only by a small
margin. Finally, the semi-supervised model shows promising results, as it achieves
scores close to the supervised models.</p>
        <p>This work builds on a neural semi-supervised clustering that integrates neural
representation learning for short texts and k-means clustering into a unified framework.
To that end, the model utilizes a small percentage of labeled data to guide the
clustering. We extended the model to multi-label clustering of short texts. The results
show that the proposed neural semi-supervised clustering is more effective than the
unsupervised baselines and close to the supervised models. Therefore, the results show
the potential to overcome critical issues, such as the scarcity of labeled data, and to
leverage the availability of massive unlabeled data.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arthur</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vassilvitskii</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>k-means++: The advantages of careful seeding</article-title>
          .
          <source>In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms</source>
          . pp.
          <fpage>1027</fpage>
          -
          <lpage>1035</lpage>
          . Society for Industrial and Applied Mathematics (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evensen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golshan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopatenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suhara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Happydb: A corpus of 100,000 crowdsourced happy moments</article-title>
          .
          <source>In: Proceedings of LREC 2018</source>
          .
          <article-title>European Language Resources Association (ELRA), Miyazaki</article-title>
          , Japan (May
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bair</surname>
          </string-name>
          , E.:
          <article-title>Semi-supervised clustering methods</article-title>
          .
          <source>Wiley Interdisciplinary Reviews: Computational Statistics</source>
          <volume>5</volume>
          (
          <issue>5</issue>
          ),
          <fpage>349</fpage>
          -
          <lpage>361</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanathan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Clustering short texts using wikipedia</article-title>
          .
          <source>In: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          . pp.
          <fpage>787</fpage>
          -
          <lpage>788</lpage>
          . ACM (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bilenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mooney</surname>
            ,
            <given-names>R.J.:</given-names>
          </string-name>
          <article-title>Integrating constraints and metric learning in semisupervised clustering</article-title>
          .
          <source>In: Proceedings of the twenty-first international conference on Machine learning</source>
          . p.
          <fpage>11</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Christopher</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prabhakar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinrich</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Introduction to information retrieval</article-title>
          .
          <source>An Introduction To Information Retrieval</source>
          <volume>151</volume>
          (
          <issue>177</issue>
          ),
          <volume>5</volume>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Davidson</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basu</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A survey of clustering with instance level</article-title>
          .
          <source>Constraints</source>
          <volume>1</volume>
          ,
          <issue>2</issue>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dhillon</surname>
            ,
            <given-names>I.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Information theoretic clustering of sparse cooccurrence data</article-title>
          .
          <source>In: Data Mining</source>
          ,
          <year>2003</year>
          .
          <article-title>ICDM 2003</article-title>
          . Third IEEE International Conference on. pp.
          <fpage>517</fpage>
          -
          <lpage>520</lpage>
          . IEEE (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fodeh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Punch</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          :
          <article-title>On ontology-driven document clustering using core semantic features</article-title>
          .
          <source>Knowledge and information systems 28(2)</source>
          ,
          <fpage>395</fpage>
          -
          <lpage>421</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jaidka</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mumick</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chhaya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The CL-Aff Happiness Shared Task: Results and Key Insights</article-title>
          .
          <source>In: Proceedings of the 2nd Workshop on Affective Content Analysis @ AAAI (AffCon2019)</source>
          . Honolulu,
          <string-name>
            <surname>Hawaii</surname>
          </string-name>
          (
          <year>January 2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Scoring, term weighting and the vector space model</article-title>
          .
          <source>Introduction to information retrieval 100</source>
          ,
          <fpage>2</fpage>
          -
          <lpage>4</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puhrsch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Advances in pre-training distributed word representations</article-title>
          .
          <source>In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Munkres</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Algorithms for the assignment and transportation problems</article-title>
          .
          <source>Journal of the society for industrial and applied mathematics 5(1)</source>
          ,
          <fpage>32</fpage>
          -
          <lpage>38</lpage>
          (
          <year>1957</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Nasrabadi</surname>
            ,
            <given-names>N.M.</given-names>
          </string-name>
          :
          <article-title>Pattern recognition and machine learning</article-title>
          .
          <source>Journal of electronic imaging 16(4)</source>
          ,
          <fpage>049901</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ittycheriah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Semi-supervised clustering for short text via deep representation learning</article-title>
          .
          <source>arXiv preprint arXiv:1602.06797</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Short text clustering via convolutional neural networks</article-title>
          .
          <source>In: VS@ HLT-NAACL</source>
          . pp.
          <fpage>62</fpage>
          -
          <lpage>69</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>