<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A comparison on the classification of short-text documents using Latent Dirichlet Allocation and Formal Concept Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Noel Rogers</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Longo</string-name>
          <email>luca.longo@dit.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing, Dublin Institute of Technology</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the increasing amounts of textual data being collected online, automated text classification techniques are becoming increasingly important. However, a lot of this data is in the form of short-text with just a handful of terms per document (e.g. text messages, tweets or Facebook posts). This data is generally too sparse and noisy to obtain satisfactory classification. Two techniques which aim to alleviate this problem are Latent Dirichlet Allocation (LDA) and Formal Concept Analysis (FCA). Both techniques have been shown to improve the performance of short-text classification by reducing the sparsity of the input data. The relative performance of classifiers that have been enhanced using each technique has not been directly compared so, to address this issue, this work presents an experiment to compare them, using supervised models. It has shown that FCA leads to a much higher degree of correlation among terms than LDA and initially gives lower classification accuracy. However, once a subset of features is selected for training, the FCA models can outperform those trained on LDA expanded data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years the amount of short-text data available online has exploded. Much of this is due to the rise of social media, with a lot of this data taking the form of tweets, Facebook posts or comments on media sites such as YouTube. However, the sparse, noisy nature of short text makes automatic classification a difficult task. Typically a classifier could take tf-idf values as inputs in the form of a Term-Document Matrix (TDM), where entry t_ij relates the frequency with which term j appears in document i to the overall occurrences of the term across the document corpus [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], but for short text the amount of information contained in a TDM is too sparse to facilitate accurate prediction. As a result, this level of sparsity must be reduced by adding weights in the TDM for words which do not already appear in the document. This could be done by incorporating external knowledge bases [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] or by using metadata to add extra features to compensate for the sparsity within the actual text [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Both rely on data external to the textual content, so as an alternative the co-occurrence of words within the document corpus can be used to perform the necessary expansion. Two techniques which adopt this approach are Latent Dirichlet Allocation (LDA) and Formal Concept Analysis (FCA). An investigation into the application of these two techniques to text classification is the primary focus of this work.
      </p>
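As a rough illustration of the sparsity problem, the following sketch (plain Python, with an invented toy corpus) builds a tf-idf term-document matrix for three short documents; even here, most entries are zero.

```python
import math
from collections import Counter

def tfidf_tdm(docs):
    """Build a term-document matrix of tf-idf weights; rows are documents."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n = len(docs)
    # document frequency of each term across the corpus
    df = {t: sum(1 for doc in tokenised if t in doc) for t in vocab}
    tdm = []
    for doc in tokenised:
        counts = Counter(doc)
        # tf (relative frequency) weighted by idf (log inverse document frequency)
        tdm.append([(counts[t] / len(doc)) * math.log(n / df[t]) for t in vocab])
    return vocab, tdm

docs = ["cheap flights to dublin",
        "dublin weather forecast",
        "machine learning course online"]
vocab, T = tfidf_tdm(docs)
zeros = sum(1 for row in T for v in row if v == 0.0)
# For short documents the matrix is dominated by zero entries.
```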
      <p>The rest of this document is organised as follows. Firstly, a brief review of related literature is provided, with particular emphasis on the applications of LDA and FCA to the problem of short-text classification. Section 3 then outlines the design of an experiment with the aim of comparing the improvements in classification accuracy due to each technique. An analysis of the results of this experiment is provided before the paper finishes with conclusions drawn from these results and suggestions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Latent Dirichlet Allocation</title>
        <p>
          Latent semantic analysis was developed to find the latent topics in a set of documents by looking at eigenvectors, using these as a means of dimensionality reduction [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. This was extended to instead use conditional probabilities as a means of modelling the underlying topics, first introduced in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The key idea is that a document can be considered as a mixed distribution over a number of topics. So, supposing there are k possible topics, the probability that a given word w will instantiate some term t is given by

p(w = t) = Σ_k p(w = t | z = k) p(z = k)    (1)
        </p>
        <p>
          By convention we denote φ_k = p(w | z = k) as the word distribution for topic k, and θ_d = p(z) as the distribution over topics for a given document d. Combining the distributions for all values of k and d respectively yields two matrices, denoted Φ and Θ. Generalising to new documents not in the original corpus is non-trivial, so an additional assumption was taken in the seminal work of Blei, Ng and Jordan, which introduced a Dirichlet prior, leading to LDA [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. A Dirichlet distribution is simply a family of distributions parameterised by a vector, α, of real values. In the case of LDA, this family of distributions corresponds to the θ_d, and the values of α can be thought of as a prior count on the number of times a topic k is observed in a document. The same Dirichlet assumption can be extended to the distributions of words within topics, parameterised by a vector β. LDA is a generative model: a document can be generated word by word by first randomly sampling from the topic distribution and then selecting a word, conditioned on the selected topic [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. To generate an LDA model, Markov chain Monte Carlo methods such as Gibbs sampling can be employed; for a detailed example see [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. For the model parameters, the number of topics that should be generated may be known in advance, but typically there needs to be a way to find an optimum value. There is no hard and fast rule for this, though there are heuristics based on information theory, such as measuring the perplexity on a hold-out test sample and then finding the topic number that minimises it. Perplexity gives a measure of how well the model predicts the distribution of the test documents and is computed as per equation 2, where M is the number of documents in the test set, w_d represents document d and N_d is the number of words in document d [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
        <p>perplexity = exp( − ( Σ_{d=1}^{M} log p(w_d) ) / ( Σ_{d=1}^{M} N_d ) )    (2)</p>
        <p>
          There have been a large number of examples applying LDA to text classification problems, with Twitter proving a popular data source for work focusing on short-text problems [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. For other applications LDA is simply one step in a more complex workflow that aids in achieving high classification accuracies [
          <xref ref-type="bibr" rid="ref16 ref6">16, 6</xref>
          ].
        </p>
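The generative view and the perplexity heuristic described above can be sketched as follows (a minimal illustration with made-up Φ and Θ values, not the models fitted in this experiment):

```python
import math
import random

def generate_doc(theta_d, phi, n_words, seed=0):
    """Generate a document word by word: sample a topic z from theta_d,
    then sample a word index from that topic's distribution phi[z]."""
    rng = random.Random(seed)
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta_d)), weights=theta_d)[0]
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]
        doc.append(w)
    return doc

def perplexity(log_likelihoods, doc_lengths):
    """Equation 2: exp(-sum_d log p(w_d) / sum_d N_d)."""
    return math.exp(-sum(log_likelihoods) / sum(doc_lengths))

theta = [0.7, 0.3]          # the document's topic mixture (invented)
phi = [[0.9, 0.1, 0.0],     # per-topic word distributions over a 3-word vocabulary
       [0.0, 0.2, 0.8]]
doc = generate_doc(theta, phi, n_words=20)
```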
      </sec>
      <sec id="sec-2-2">
        <title>Formal Concept Analysis</title>
        <p>
          We provide here a very brief overview of the subject of FCA. For a more detailed
introduction to the topic see [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. FCA was born out of a mathematical attempt to add formal definitions and structure to the notion of a concept. Intuitively a concept is a unit of thought consisting of a set of objects belonging to it (called the extent) and the properties or attributes that they share (the intent). To formally define these ideas, start with a set of objects, X, and a set of attributes, Y, pertaining to elements of X. A binary relation, I, encodes which elements of X have particular attributes of Y. The notation ⟨x, y⟩ ∈ I means that the object x has the attribute y. The collection ⟨X, Y, I⟩ is called a formal context [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
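The two derivation operators implicit in this definition can be written directly in code; a small sketch over a hypothetical context (object and attribute names invented for illustration):

```python
def extent(X, I, B):
    """A = {x in X | for all y in B, (x, y) in I}."""
    return {x for x in X if all((x, y) in I for y in B)}

def intent(Y, I, A):
    """B = {y in Y | for all x in A, (x, y) in I}."""
    return {y for y in Y if all((x, y) in I for x in A)}

# Toy formal context: documents as objects, words as attributes.
X = {"d1", "d2"}
Y = {"cat", "dog"}
I = {("d1", "cat"), ("d1", "dog"), ("d2", "cat")}
```

Applying the two operators in turn closes a set: extent(intent({d1})) recovers exactly the objects sharing all of d1's attributes.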
        <p>A formal concept, then, is a pair ⟨A, B⟩ where A ⊆ X and B ⊆ Y, with A = {x ∈ X | ∀y ∈ B, ⟨x, y⟩ ∈ I} and B = {y ∈ Y | ∀x ∈ A, ⟨x, y⟩ ∈ I}.</p>
        <p>
          The sets A and B are the extent and intent of the concept respectively. The collection of all such concepts for a given context ⟨X, Y, I⟩ is denoted by B(X, Y, I). By ordering concepts using sub/super-set relations, a partial ordering can be added to the set of concepts. The key theorem, taken from the seminal paper of Wille which initially produced this framework, is that B(X, Y, I) forms a lattice when equipped with this partial ordering [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. When applied to short-text classification, the typical approach is to treat documents as the objects and the words appearing within them as the attributes. In this way a corpus of documents can be mapped to a concept lattice to determine the relationships between words [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The most relevant work for this paper is that of Boutari, Carpineto and Nicolussi [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Here, FCA is used as a text expansion technique to improve both supervised and unsupervised classification of short texts. Their main focus is on identifying proximity measures between concepts in the lattice that can be used to expand a TDM with weights from closely related concepts. To formalise this, the authors developed five different metrics to generate these weights, with the resulting matrices used as the input to K-Nearest Neighbour and K-Means classifiers for comparison.
        </p>
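A brute-force sketch of building B(X, Y, I) for a tiny invented context follows (exponential in |Y|, so purely illustrative; dedicated tools such as InClose are used later in the paper):

```python
from itertools import combinations

def all_concepts(X, Y, I):
    """Enumerate the formal concepts of context (X, Y, I) by closing
    every subset of attributes."""
    def ext(B):
        return frozenset(x for x in X if all((x, y) in I for y in B))
    def intn(A):
        return frozenset(y for y in Y if all((x, y) in I for x in A))
    found = set()
    for r in range(len(Y) + 1):
        for B in combinations(sorted(Y), r):
            A = ext(B)
            found.add((A, intn(A)))   # duplicates collapse in the set
    return found

X = {1, 2, 3}
Y = {"a", "b"}
I = {(1, "a"), (2, "a"), (2, "b"), (3, "b")}
lattice_nodes = all_concepts(X, Y, I)
```

Ordering these pairs by set inclusion on the extents yields the partial order under which B(X, Y, I) forms a lattice.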
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiment Design</title>
      <p>
        The key focus of this study is comparing LDA and FCA as sparsity reduction techniques. To determine their comparative performance, classifiers will be trained on inputs derived from each technique and their accuracies compared; for this, both neural networks and SVMs have been chosen. A baseline model will be trained on the unprocessed input TDM. The key steps are shown in figure 1. To reduce the possibility of specific patterns in the distribution of a dataset impacting the results of the study, the experiment will be replicated using two distinct datasets. The first is the Google Snippets corpus (jwebpro.sourceforge.net/data-web-snippets.tar.gz), first employed in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. It consists of snippets of search terms, typically between ten and forty words long, which comprise the documents. Each document is also assigned one of eight class labels. The dataset is already split into training and test subsets. The second dataset chosen is the Reuters-21578 collection (archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection). This is one of the most widely utilised datasets within the text classification domain, employed for example in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This corpus consists of 21,578 different news articles along with additional metadata such as the author, date and title. For this study the articles themselves are too long, so just the titles will be extracted, with each considered a distinct document as per the approach taken in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A number of recommended subsets and splits are included with the dataset; for the purposes of this experiment a subset will be taken consisting of 78 classes, a training set of 7,733 documents and a test set of 3,561. A summary of this is provided in table 1b. As pre-processing steps we first convert to lowercase and remove any punctuation or non-alphanumeric characters. All stop words (for example "the" or "and") will then be extracted. These do not contribute anything to the topics or concepts contained in a document, so they represent noise in the data. Note that the core focus of the experiment is on expansion of short text by enriching it with topics or concepts derived from the entire corpus. Words which only appear in a single document are of little use in this regard, since they cannot form relationships with words from other documents. Therefore any words that only appear in a single document will be removed. Once this processing has been completed, a sparse TDM T is generated for each dataset. Each element t_ij represents the inverse-frequency with which word j appears in document i. LDA and FCA will be employed to address the sparsity of T.
      </p>
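The pre-processing steps above can be sketched as follows (a minimal version; the stop-word list here is an illustrative subset, and real lists are much longer):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "or", "a", "to", "of"}  # illustrative subset only

def preprocess(docs):
    """Lowercase, strip non-alphanumerics, drop stop words, then remove
    any term that appears in only a single document."""
    tokenised = [
        [t for t in re.sub(r"[^a-z0-9 ]", " ", d.lower()).split()
         if t not in STOP_WORDS]
        for d in docs
    ]
    # document frequency: singleton terms cannot relate documents to each other
    df = Counter(t for doc in tokenised for t in set(doc))
    return [[t for t in doc if df[t] > 1] for doc in tokenised]

cleaned = preprocess(["The CAT sat!", "a cat and a dog", "Dog runs"])
```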
      <p>
        LDA. The first point of note is that the LDA step can be tuned by a number of hyperparameters: the Dirichlet priors α and β, and the number of topics. Ideally a range of values would be tried for each so that the optimal value could be found, but this is not feasible for this study, so a good approximation for each needs to be taken upfront. The approach taken follows that employed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Their recommendation is to take α = 50/NT and β = 200/W, where NT is the number of topics and W is the number of words. To derive the number of topics to use, perplexity values will be calculated from the test data as per equation 2. The range of values taken will be from 1 to 246, incrementing by 5 each time. Based on the outcome of this test, a single model will be chosen to proceed with. The key outputs from this model are two probability matrices: one giving the distribution of words within each topic and the other giving the distribution of topics over documents. These correspond to Φ and Θ respectively, as defined in section 2.1. It follows from equation 1 that the probabilities for each word appearing in each document are given by ΘΦ. This new matrix has the same dimensions as T and replaces T as the input for the training step.
      </p>
      <p>
        FCA. The starting point is to note that T can be considered a formal context: if t_ij ≠ 0 then word j appears in document i. As such, a concept lattice can be formed from the documents and words, and related concepts from this will be used to add non-zero terms to T. Once the concept lattice is formed, a proximity measure can be derived, encoding how closely related two concepts are. For this we choose
      </p>
      <p>Proximity = 1 − SD / max(SD)    (3)</p>
      <p>
        where SD is the shortest distance between two points in the graph [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Given any pair of words, equation 3 allows the similarity between them to be computed, yielding a symmetric matrix S where each term s_ij is the proximity between words i and j. Now let d be a vector representing one of the documents, i.e. d corresponds to a row of T. The aim is to obtain a new representation, d′, that takes advantage of the word proximities to reduce the sparsity of the original representation. The value for the ith word should take into account both the proximity between word i and each other word and the frequency with which those other words appear in d, yielding the following equation:

d′_i = Σ_j d_j s_ji    (4)

Extending this over the whole document gives d′ = d S. It follows then that the expanded term-document matrix, T′, is simply T S.
      </p>
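The expansion T′ = T S is a single matrix product; a sketch with invented values for T and S (assuming numpy):

```python
import numpy as np

# T: 2 documents x 3 terms (tf-idf weights); S: symmetric term-proximity
# matrix from equation 3. All values are hypothetical.
T = np.array([[0.5, 0.0, 0.0],
              [0.0, 0.7, 0.0]])
S = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.3],
              [0.0, 0.3, 1.0]])
T_expanded = T @ S   # each row d is replaced by d' = d S (equation 4)
```

Note how the number of non-zero entries grows: terms related to those actually present in a document receive non-zero weight, which is exactly the sparsity reduction sought.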
      <p>
        For the execution of this we adopt a tool (sourceforge.net/projects/inclose/) implementing the InClose algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to obtain the concept set. To construct the lattice from these concepts, a simple algorithm was employed to add edges where a given concept was a lower neighbour of another. The matrix S was generated using a breadth-first search over the lattice. Note that the source and sink nodes (corresponding to the empty and universal sets in the concept lattice) need to be removed first so that concepts cannot be linked via these nodes. Otherwise two unrelated terms could end up with an unnaturally high proximity value on account of both being directly connected to either the source or sink.
      </p>
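The proximity computation can be sketched as a breadth-first search over the lattice's adjacency structure (a hypothetical toy graph with the source and sink nodes already stripped, as described above):

```python
from collections import deque

def shortest_distance(adj, a, b):
    """BFS shortest path length between concepts a and b; None if unreachable."""
    dist = {a: 0}
    q = deque([a])
    while q:
        u = q.popleft()
        if u == b:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None

def proximity(adj, a, b, max_sd):
    """Equation 3: 1 - SD / max(SD)."""
    sd = shortest_distance(adj, a, b)
    return None if sd is None else 1 - sd / max_sd

# Tiny lattice graph (undirected adjacency lists), names invented.
adj = {"c1": ["c2"], "c2": ["c1", "c3"], "c3": ["c2"]}
```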
      <sec id="sec-3-1">
        <title>Modelling</title>
        <p>
          We briefly highlight the choices made for the neural network parameters. As we have a classification problem, the output activation function selected is softmax. For the hidden layer, rectified linear units (ReLU) will be used [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. To help the model generalise well and avoid overfitting, dropout layers will be added between each layer of the network [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The final consideration is the architecture of the model: the number and width of hidden layers and the connections between them. Since the purpose of this study is not to investigate neural network architectures, the simplest setup will be chosen, namely a single hidden layer of width W with all units connected. One additional pre-processing step required before training the neural network is to normalise the input data. The normalised features, or z-scores, are computed by subtracting the mean and dividing by the standard deviation of each feature. For the SVM, a simple linear kernel has been chosen. The only other consideration required with this algorithm is how to deal with the multiple class labels. To handle this, the 'one-versus-rest' approach will be taken [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
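The z-score normalisation step can be sketched as follows (assuming numpy; the feature matrix values are invented):

```python
import numpy as np

def z_scores(X):
    """Normalise each feature column: subtract its mean and divide by its
    standard deviation (constant columns are guarded against)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0   # avoid division by zero for constant features
    return (X - mu) / sigma

X = np.array([[1.0, 3.0],
              [3.0, 5.0]])
Z = z_scores(X)
```

After normalisation every feature has zero mean and unit variance, which keeps the scale of the inputs comparable for the network.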
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and analysis</title>
      <p>LDA. The topic values taken for each LDA model were determined by computing the perplexity associated with each topic number and selecting the minimum. From figures 2a and 2b these topic numbers are 181 and 161.</p>
      <p>[Figure 2: (a) Reuters perplexity values; (b) Snippets perplexity values]</p>
      <p>The initial experiments yielded very poor accuracy on the FCA-enhanced Snippets dataset for neural networks, and it was found that there was a high degree of correlation between the input features. As a result, two further runs of the experiment were performed, the first removing correlated features above a threshold of 0.8. This conservative value gave little improvement and prompted a further run in which the top 10% of features were selected based on the outcome of an ANOVA test.</p>
      <p>To help understand the cause of the high correlations, the distributions of weights in the FCA- and LDA-enhanced TDMs for the Snippets dataset are shown (figures 3a and 3b). For the LDA weights, the majority of terms are close to zero, so it is still just a small subset of terms that contribute most to the classification. Contrast this with FCA; here the weights form a near-normal distribution around a mean of 0.5. The impact is that even totally unrelated terms still contribute significant weight increases. Across both datasets, the greatest distance between any pair of concepts was 12, leading to a small range of values that the proximities could take. We revisit this issue in section 5 with suggestions for how future work can combat this problem.</p>
      <p>[Figure 3: (a) FCA TDM weights; (b) LDA TDM weights]</p>
      <p>Neural network models. The performance of each classifier was determined by comparing precision, recall and F-measure values (denoted P, R and F1). A full breakdown of the results of each run of the experiment is given in tables 2a to 2c. Graphs of the F1 scores can also be seen in figures 4a and 4b for the Reuters and Snippets experiments respectively. As already highlighted, the initial FCA results are quite poor on the Snippets dataset. For Reuters, however, FCA already outperforms both the baseline (BL) and LDA. Removing correlated features does not lead to significant change in the results (table 2b), but in the final run of the experiment, following the selection of just 10% of features using ANOVA, it can be seen that FCA has outperformed across the board.</p>
      <p>
        SVM models. The high correlations which impeded the FCA-trained neural networks did not have the same negative effect on the SVM models. Across both datasets the highest accuracies are achieved on the first run, before features are removed. Comparing FCA and LDA for the SVM models, the highest overall F1 values are again achieved by FCA (0.69 versus 0.62 on the Snippets data and 0.78 versus 0.56 on Reuters). Comparing the best F1 scores for each dataset across all 3 runs shows FCA achieving a 3-5% increase on the baseline and a 5-15% increase over LDA. The highest scoring combination across both datasets is FCA + SVM, with no need for additional feature engineering steps. As a final point, the statistical significance of the obtained results has been evaluated using McNemar's test statistic [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The statistic is given by χ² = (|n01 − n10| − 1)² / (n01 + n10), where, for two models a and b, n01 corresponds to cases misclassified by a and not b, and n10 to those missed by b and not a. From table 3, it can be seen that the results comparing the LDA and FCA models are statistically significant with a p-value &lt; 0.01.
      </p>
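McNemar's statistic, with the standard continuity correction used in Dietterich [8], is straightforward to compute; the disagreement counts below are invented for illustration:

```python
def mcnemar(n01, n10):
    """McNemar's test statistic: (|n01 - n10| - 1)^2 / (n01 + n10).
    Compared against a chi-squared distribution with 1 degree of freedom;
    values above 6.635 are significant at p < 0.01."""
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Hypothetical counts: model a misclassified 40 cases model b got right,
# and vice versa for 10 cases.
stat = mcnemar(40, 10)
```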
      <p>[Figure 4: (a) F1 scores for Reuters data; (b) F1 scores for Snippets data]</p>
      <p>Three different iterations were run: without feature engineering, with the removal of correlated features, and incorporating ANOVA for feature selection. Two classifiers were trained on the resulting feature sets, neural networks and SVM, with FCA showing a 5% increase on the Snippets dataset and a 15% improvement on the Reuters data. The LDA models remained consistent throughout but failed to outperform even the baseline models on either dataset. Analysis was performed to understand the initial poor scores, and a high degree of feature correlation was discovered. As the focus was on term expansion techniques, no parameter engineering was performed on any of the neural network models and only a simple linear kernel was employed for the SVM. Varying the network architectures, dropout weights or learning rates, or employing more sophisticated kernels, could have improved the results for individual models, but these steps were not performed.</p>
      <p>To strengthen the results the experiment was repeated on two datasets. The correlation problem that FCA initially introduced may not have been picked up had only the Reuters dataset been used. LDA is widely used in text analysis, but we have shown that for this particular task FCA is more suitable. We have also shown, though, that FCA adds a high degree of correlation between terms. One of the drawbacks of FCA is the computational resources needed to build the concept lattice and term similarity matrix. The density of the lattice was highlighted as a reason for the high correlations, so a trade-off in computing the full lattice or term similarities would help mitigate both the resources required and reduce the correlations.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>
        This experiment has compared the relative benefits of LDA and FCA for addressing the sparsity of short text. We now list some potential future avenues of work arising from this experiment. This project only focused on "standard" FCA; however fuzzy FCA, as described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], could be examined. In fuzzy FCA, rather than attributes simply being absent or present for a given object, a weight between 0 and 1 is applied to each one, precisely the form that a tf-idf TDM takes. The outcome of the FCA model is the term-term similarity matrix, and this is the key component in this step. One measure was utilised in this work; however there are alternative methods of deriving concept similarity from a lattice, not just the geometric distance of shortest paths. Further work could look at alternatives such as set-based approaches (measuring the size of the intent/extent intersections) or combinations of these with geometric distance [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One issue identified with FCA was the high degree of correlation that was observed. We looked at evaluating term similarities between a concept and all others, but a more restrictive approach, looking just at a small neighbourhood around each concept, might fare better. Within this neighbourhood the proximities could be computed as before, with all concepts outside the neighbourhood being set to 0 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. An alternative approach that could yield the same outcome is to instead use iceberg lattices [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This is simply a concept lattice which has been pruned by introducing a required minimal support for concept inclusion. The removal of edges from the lattice would lead to a wider spread in proximities between concepts.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Andrews</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirsch</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A Tool for Creating and Visualising Formal Concept Trees</article-title>
          .
          <source>In: Proceedings of the Fifth Conceptual Structures Tools &amp; Interoperability Workshop</source>
          . pp.
          <volume>1</volume>
          {
          <issue>9</issue>
          .
          Annecy, France (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Belohlavek</surname>
          </string-name>
          , R.:
          <article-title>Introduction to formal concept analysis</article-title>
          . Palacky University, Department of Computer Science, Olomouc p.
          <volume>47</volume>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet Allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          ,
          <issue>993</issue>
          {
          <fpage>1020</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dunson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Probabilistic topic models</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          <volume>27</volume>
          (
          <issue>6</issue>
          ),
          <volume>55</volume>
          {
          <fpage>65</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Boutari</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carpineto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicolussi</surname>
          </string-name>
          , R.:
          <article-title>Evaluating term concept association measures for short text expansion: Two case studies of classification and clustering</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          <volume>672</volume>
          ,
          <issue>163</issue>
          {
          <fpage>174</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Short text classification improved by learning multi-granularity topics</article-title>
          .
          <source>In: IJCAI</source>
          . pp.
          <volume>1776</volume>
          {
          <issue>1781</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>De Maio</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fenza</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senatore</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Hierarchical web resources retrieval by exploiting fuzzy formal concept analysis</article-title>
          .
          <source>Information Processing and Management</source>
          <volume>48</volume>
          (
          <issue>3</issue>
          ),
          <fpage>399</fpage>
          -
          <lpage>418</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dietterich</surname>
            ,
            <given-names>T.G.</given-names>
          </string-name>
          :
          <article-title>Approximate statistical tests for comparing supervised classification learning algorithms</article-title>
          .
          <source>Neural Computation</source>
          <volume>10</volume>
          (
          <issue>7</issue>
          ),
          <fpage>1895</fpage>
          -
          <lpage>1923</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Eklund</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ducrou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Concept similarity and related categories in information retrieval using formal concept analysis</article-title>
          .
          <source>International Journal of General Systems</source>
          <volume>41</volume>
          (
          <issue>8</issue>
          ),
          <fpage>826</fpage>
          -
          <lpage>846</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Ch. 1 &amp; 2: Contexts, Concepts, and Concept Lattices</article-title>
          .
          <source>Formal Concept Analysis: Methods and Applications in Computer Science</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steyvers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Finding scientific topics</article-title>
          .
          <source>Proceedings of the National Academy of Sciences of the United States of America</source>
          <volume>101 (Suppl. 1)</volume>
          ,
          <fpage>5228</fpage>
          -
          <lpage>5235</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Probabilistic latent semantic indexing</article-title>
          .
          <source>Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval</source>
          pp.
          <fpage>50</fpage>
          -
          <lpage>57</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davison</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Empirical study of topic modeling in twitter</article-title>
          .
          <source>Proceedings of the First Workshop on Social Media Analytics</source>
          pp.
          <fpage>80</fpage>
          -
          <lpage>88</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kelleher</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacNamee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Arcy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies</article-title>
          . The MIT Press (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Knerr</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Personnaz</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dreyfus</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Single-layer learning revisited: a stepwise procedure for building and training a neural network</article-title>
          .
          <source>In: Neurocomputing: Algorithms, Architectures and Applications</source>
          , vol.
          <volume>68</volume>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>A Deep Architecture for Matching Short Texts</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Phan</surname>
            ,
            <given-names>X.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horiguchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Learning to Classify Short and Sparse Text &amp; Web with Hidden Topics from Large-scale Data Collections</article-title>
          .
          <source>Proceedings of the 17th International Conference on World Wide Web - WWW '08</source>
          pp.
          <fpage>91</fpage>
          -
          <lpage>100</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Poelmans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dedene</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Formal concept analysis in knowledge processing: A survey on applications</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>40</volume>
          (
          <issue>16</issue>
          ),
          <fpage>6538</fpage>
          -
          <lpage>6560</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Machine learning in automated text categorization</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>34</volume>
          (
          <issue>1</issue>
          )
          ,
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          :
          <article-title>Genetic algorithm for text clustering based on latent semantic indexing</article-title>
          .
          <source>Computers and Mathematics with Applications</source>
          <volume>57</volume>
          (
          <issue>11-12</issue>
          ),
          <fpage>1901</fpage>
          -
          <lpage>1907</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sriram</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuhry</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demir</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferhatosmanoglu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demirbas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Short Text Classification in Twitter to Improve Information Filtering</article-title>
          .
          <source>Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '10</source>
          pp.
          <fpage>841</fpage>
          -
          <lpage>842</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          ,
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Steyvers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Probabilistic topic models</article-title>
          .
          <source>In: Latent Semantic Analysis: A Road to Meaning</source>
          . Lawrence Erlbaum
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Wille</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Restructuring lattice theory: an approach based on hierarchies of concepts</article-title>
          .
          <source>In: Ordered sets</source>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>470</lpage>
          . Springer (
          <year>1982</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perkins</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ge</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>A heuristic approach to determine an appropriate number of topics in topic modeling</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>16</volume>
          (
          <issue>Suppl 13</issue>
          ),
          <fpage>S8</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>