<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>"How short is a piece of string?" The Impact of Text Length and Text Augmentation on Short-text Classification Accuracy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Austin McCartney</string-name>
          <email>austin.mccartney@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Svetlana Hensman</string-name>
          <email>svetlana.hensman@dit.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Longo</string-name>
          <email>luca.longo@dit.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing, Dublin Institute of Technology</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent increases in the use and availability of short messages have created opportunities to harvest vast amounts of information through machine-based classification. However, traditional classification methods have failed to yield accuracies comparable to those achieved on longer texts. Several approaches have previously been employed to extend traditional methods to overcome this problem, including the enhancement of the original texts through the construction of associations with external data supplementation sources. The existing literature does not precisely describe the impact of text length on classification performance. This work quantitatively examines the changes in accuracy of a small selection of classifiers using a variety of enhancement methods, as text length progressively decreases. Findings, based on ANOVA testing at a 95% confidence level, suggest that the performance of classifiers using simple enhancements decreases with decreasing text length, but that the use of more sophisticated enhancements risks over-supplementation of the text, with consequent concept drift and decreasing classification performance as text length increases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Traditional techniques for machine classification of texts rely on statistical
methods which, in turn, rely on a sufficiency of meaningful data (words) within the
texts to allow classification. In the case of short texts, the performance of such
classifiers is reported as being poor in comparison with performance on longer
texts, the inference being that insufficient data is present within the target texts.
One approach to improving classifier performance has been the
augmentation or enhancement of the short text by the addition of synonyms, or
other semantically linked words, to the body of the original text prior to
classification. The implicit hope in such supplementation is that the additional words
are conceptually related to the words in the original text and will therefore
amplify the underlying meaning and context of the original. Despite quite extensive
coverage in the published literature of the general area of short-text classification,
very little specific information has been available relating to the deterioration of
classifier performance on shorter texts; the exact nature of the relationship
between text length and classifier performance has been unclear and, consequently,
no common definition exists of how short a target text may be before it can be
considered troublesome. An attempt will be made to address the question
of how text length, message enhancement and accuracy interact, through the
repeated classification of enhanced texts of controlled lengths. Three common
classifiers will be used to rule out the possibility of results specific to a single
classifier.</p>
      <p>The remainder of this paper is laid out as follows: Section 2 reviews
the published literature relating to relevant, similar work. Section 3 discusses
the design and execution of the experiments used in this study. Section 4
presents the results of the experiments and further statistical analysis. Section 5
closes with conclusions and suggestions for future related work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        A variety of different techniques have been proposed to enhance or enrich short
texts by the addition of extra features designed to make matching, clustering and
classification easier. Some of these methods rely on the exploitation of external
taxonomies, typically Wikipedia or Probase, whereas others use semantic nets
such as Wordnet. Song, Ye, Du, Huang and Bie [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] present a survey of short-text
classification, first giving an overview of the special conditions which attach
to short text as a problem, and then outlining the major avenues of current
research. They divide approaches into three broad families: semantic approaches
(including LSA), semi-supervised classical methods (e.g. SVM, naïve Bayes) and
ensemble methods, which can combine elements of the other two families.
      </p>
      <p>
        Work presented by Bollegala, Matsuo and Ishizuka [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] incorporates
semantic information extracted from web-based search engines, and this is contrasted
with the same operation using Wordnet: the authors point out that, typically, a
static resource such as Wordnet will fail to produce good results when trying to
judge similarity in the presence of colloquialisms. This use of an explicit external
taxonomy such as Wordnet can be contrasted with much work which makes use
of the implicit taxonomy inherent in the organisation and content of reference
sources such as Wikipedia and Probase, as in the work of Banerjee, Ramanathan,
and Gupta [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where the titles of Wikipedia articles containing terms of
interest are used as features to supplement the sparse text data, or in the work of
Wang, Wang, Li, and Wen [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] in which they coin the term "bag-of-concepts"
to stress the semantic aspect of the additional features that they had mined
from the probabilistic semantic network Probase. Wikipedia is once again the
favoured external source of "world knowledge" in Gabrilovich and Markovitch [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
in which they state that "pruning the inverted index (concept selection) is vital in
eliminating noise", but, unfortunately, they provide no further detail on their
"ablation" process. Gabrilovich and Markovitch go on to claim double-digit
improvements over the then state-of-the-art methods on certain datasets. Genc,
Sakamoto and Nickerson [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] compare three disparate techniques to demonstrate
the utility of Wikipedia as an implicit taxonomic source. In a manner similar
to, but subtly different from, Gabrilovich and Markovitch [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], they use the target
text to mine relevant Wikipedia pages, and then calculate the distances between
Wikipedia pages using a simple shortest-path graph traversal metric to assign
distances between target texts. Their second technique is to simply measure the
String Edit Distance (SED) between texts using the Levenshtein metric. Their
final design uses Latent Semantic Analysis (LSA) coupled with a cosine
distance metric. Their results suggest that the Wikipedia method out-performed
both SED and LSA on most sets, and was inferior on none of the tested datasets.
      </p>
      <p>
        Departing from the common themes above, Sun [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] takes a distinctly
different direction to the main approaches outlined above, and trims short texts
even further in an attempt to retain only key words. Trimming is accomplished
using familiar term-frequency / inverse-document-frequency methods coupled
with a novel clarity measure, and is followed by a classification implemented
through a Lucene search to find similar documents from a corpus: the classes of
the returned documents are used as the class for the document under classification.
Sun reports that results match MaxEnt classifiers. A trend in the short-text
enhancement literature becomes apparent over time: early work concentrated on
well-structured external resources such as Wordnet but, with time, the favoured
approach became the more unstructured Wikipedia-type model. Although
frequent reference is made to the difficulty of classifying short text, as for example
in Song, Ye, Du, Huang and Bie [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], all bar one of the reviewed articles omit
any reference to the quantitative impact of the shortness of the text or any
definition of how short a text must be to be considered "short". Yuan, Cong and
Thalmann [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] in their paper, which is concerned primarily with contrasting
various smoothing methods as applied to naïve Bayes, conclude only that
classifiers perform more poorly with single-word texts than with multi-word texts. It
is this gap in existing research which underpins the motivation for the current
work.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>
        The fundamental design of this project's experimental work centres on
measuring binary classification performance on enhanced variants of messages of
known specific lengths, the classification depending on message contents having either positive or
negative sentiment. The decision to choose binary sentiment classification as the
reference task was motivated by the fact that, although it represents a
real-world application, it remains relatively free of additional complexity that might
complicate analysis of results. The differences in classification performance of
three common classifiers, across message lengths and across enhancement
methods, as measured by the F1 score for accuracy of classification, were analysed to
determine if message length or enhancement has any statistically valid impact
on classification performance. The experimental data was a corpus of 1.8 million
pre-classified and pre-cleaned micro-blog (Twitter) posts of all lengths obtained
from the Sentiment140 sentiment analysis project (http://help.sentiment140.com/for-students/) run by Stanford University
and described by Go, Bhayani and Huang [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>Data Preparation</title>
        <p>The original data set from the Sentiment140 project was split into subsets by
exact message length, each subset containing 5000 tweets, all of exactly the same
length and having an even balance between tweets of positive and negative
sentiment. There were twelve length categories, as measured by the total number
of characters in the original message: 138, 110, 80, 50, 45, 40, 35, 30,
25, 20, 15 and 10 characters.</p>
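The bucketing described above can be sketched as follows; the function and parameter names are illustrative assumptions, not taken from the original pipeline:

```python
import random
from collections import defaultdict

def build_length_subsets(tweets, lengths, per_subset=5000, seed=0):
    """Bucket (text, sentiment) pairs by exact character length, then
    sample a sentiment-balanced subset of `per_subset` tweets for each
    target length. `tweets` yields (text, sentiment), sentiment in {0, 1}."""
    buckets = defaultdict(lambda: {0: [], 1: []})
    for text, sentiment in tweets:
        buckets[len(text)][sentiment].append(text)
    rng = random.Random(seed)
    subsets = {}
    for n in lengths:
        half = per_subset // 2
        pos, neg = buckets[n][1], buckets[n][0]
        if len(pos) >= half and len(neg) >= half:
            sample = ([(t, 1) for t in rng.sample(pos, half)] +
                      [(t, 0) for t in rng.sample(neg, half)])
            rng.shuffle(sample)  # interleave the two classes
            subsets[n] = sample
    return subsets
```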
      </sec>
      <sec id="sec-3-2">
        <title>Data Pre-processing</title>
        <p>Each tweet message in each of the length-determined subsets was pre-treated
with nine text enhancement techniques to produce a total of ten variants of each
message, including the original message. Three approaches to enhancement were
used: basic, Wordnet-based and Wikipedia-based.</p>
        <p>Basic Enhancements Basic enhancements consist of operations such as the
removal of stop words, punctuation and Twitter hashtags, the lemmatisation of
the text and the creation of bigrams. Specific basic enhancements were:</p>
        <p>Original - the original text of the tweet from the Sentiment140 dataset.</p>
        <p>Cleaned - the original text with punctuation and stop words removed, and
Twitter-specific strings (e.g. hashtags, URLs) replaced with standard tokens.</p>
        <p>Lemmatised - the cleaned set (above) lemmatised using the NLTK python
library.</p>
        <p>Bigrams - enhanced by appending all bigrams from the lemmatised tweet back to the
lemmatised tweet.</p>
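A minimal sketch of the Cleaned and Bigrams variants follows; the abbreviated stop-word list and token names are placeholders, and the actual pipeline additionally used NLTK for lemmatisation:

```python
import re

# Abbreviated stop-word list; the real pipeline would use a full list
# (e.g. NLTK's English stop words).
STOP_WORDS = {"the", "a", "an", "is", "to", "and"}

def clean(text):
    """'Cleaned' variant: Twitter-specific strings replaced with standard
    tokens, then punctuation and stop words removed."""
    text = re.sub(r"https?://\S+", " URL ", text)
    text = re.sub(r"#\w+", " HASHTAG ", text)
    text = re.sub(r"@\w+", " USER ", text)
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t if t in ("URL", "HASHTAG", "USER") else t.lower()
            for t in tokens if t.lower() not in STOP_WORDS]

def with_bigrams(tokens):
    """'Bigrams' variant: the token list with all adjacent-pair bigrams
    appended back onto it."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
```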
        <p>
          Wordnet Wordnet [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] is a semantically focused English-language dictionary.
It bears a resemblance to an extended thesaurus but, importantly from the
perspective of this work, it contains not only synonyms but also hypernyms and
hyponyms. Specific Wordnet enhancements were:
        </p>
        <p>Synonyms - enhanced by appending all available Wordnet synonyms for each
word in the lemmatised tweet back to the lemmatised tweet.</p>
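All three Wordnet variants follow one pattern: look up related words for each token and append them to the text. A sketch with an injectable lexicon standing in for Wordnet (in the real pipeline the mapping would be built from NLTK's wordnet corpus; the toy entries below are purely illustrative):

```python
def expand(tokens, lexicon, relation="synonyms"):
    """Append every related word for each token back onto the token list.
    `lexicon` maps word -> {relation: [related words]}; in the paper's
    pipeline this mapping would come from Wordnet via NLTK (synsets,
    hypernyms(), hyponyms()) rather than from the toy table below."""
    extra = []
    for t in tokens:
        extra.extend(lexicon.get(t, {}).get(relation, []))
    return tokens + extra

# Toy stand-in for Wordnet; entries are illustrative only.
TOY_LEXICON = {
    "happy": {"synonyms": ["glad", "cheerful"]},
    "dog": {"synonyms": ["canine"], "hypernyms": ["animal"], "hyponyms": ["puppy"]},
}
```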
        <p>Hypernyms - enhanced by appending all available Wordnet hypernyms for
each word in the lemmatised tweet back to the lemmatised tweet.</p>
        <p>Hyponyms - enhanced by appending all available Wordnet hyponyms for each
word in the lemmatised tweet back to the lemmatised tweet.</p>
        <p>Wikipedia / DBpedia DBpedia is a static, structured database derived
from information contained in the on-line encyclopaedia Wikipedia. DBpedia
returns, in XML format, the Wikipedia taxonomic metadata for the most
relevant Wikipedia pages when a given word or bigram is searched. This metadata
includes page titles, Wikipedia categories and Wikipedia classes. These
metadata each have a "label", which is a text descriptor, possibly containing multiple
words, of the page title, category or class. Specific Wikipedia enhancements were:</p>
        <p>Wiki Words - enhanced by appending all available words in all the labels
contained in the top five Wikipedia hits for each word in the lemmatised
text back to the lemmatised text.</p>
        <p>Wiki Phrases - enhanced by appending all available labels, each treated as
an indivisible string (n-gram), from the top five Wikipedia hits for each word
in the lemmatised text back to the lemmatised text.</p>
        <p>Wiki Bigrams - enhanced by appending all available labels, each treated
as an indivisible string (n-gram), from the top five Wikipedia hits for each
bigram in the lemmatised text back to the lemmatised text.</p>
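The distinction between the word-level and phrase-level Wikipedia variants can be sketched as below, with `lookup` a hypothetical stand-in for the DBpedia label query (shown as a stub rather than a live DBpedia call):

```python
def wiki_words(tokens, lookup, top_n=5):
    """'Wiki Words' variant: split every label from the top-n hits into
    individual words and append them to the text."""
    extra = []
    for t in tokens:
        for label in lookup(t)[:top_n]:
            extra.extend(label.lower().split())
    return tokens + extra

def wiki_phrases(tokens, lookup, top_n=5):
    """'Wiki Phrases' variant: append each label as one indivisible
    n-gram token."""
    extra = []
    for t in tokens:
        for label in lookup(t)[:top_n]:
            extra.append(label.lower().replace(" ", "_"))
    return tokens + extra
```

A usage sketch with a stubbed lookup: `wiki_phrases(["dublin"], lambda w: ["Dublin City"])` keeps the label as the single token `dublin_city`, whereas `wiki_words` would split it into `dublin` and `city`.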
        <p>It may be noted that these three approaches to enhancement can be
categorised into one of two classes: the basic enhancements do not supplement the
text with any external data, if we discount the substitution of a word with its own
lemma, and so they can be considered "non-additive", whereas the Wordnet and
Wikipedia/DBpedia approaches rely primarily on the addition of external data
which, it is implicitly hoped, is in some way conceptually linked to the words
in the original text, thereby amplifying the underlying meaning of the text. The
latter methods may be considered "additive".</p>
      </sec>
      <sec id="sec-3-3">
        <title>Modelling</title>
        <p>
          The three common classifiers used in the experiment were:
naïve Bayes [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], Support Vector Machine (SVM) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and Latent Semantic Analysis (LSA) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>No attempt, beyond the most basic, was made to optimise or tune classifier
performance, and any reference to the comparative performance of classifiers is
made in an informal sense. The use of multiple classifiers was undertaken only in
order to demonstrate the general applicability of the findings, if any, and to rule
out any effect that might arise from the use of any specific classifier: reflecting this
purpose, the three classifiers chosen were used in their most basic configurations
and used the built-in routines from the scikit-learn python library. Each of the
120 resultant data sets of 5000 tweets (10 enhancements for each of 12 text
lengths) was classified by each of the three classifiers over 100 repetitions of
Monte Carlo cross-validation using a 90% training and 10% test split of the
data. The mean F1 score for classification accuracy was calculated over the
100 repetitions. This eventually yielded three result sets of F1
accuracy scores, one for each classifier, each containing an average F1 score for
each of the 120 combinations of enhancement and text length.</p>
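The Monte Carlo cross-validation loop might be implemented roughly as follows with scikit-learn, shown here for the naïve-Bayes case only; the vectoriser choice and function name are assumptions, not taken from the original code:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import MultinomialNB

def mean_f1(texts, labels, n_reps=100, seed=0):
    """Monte Carlo cross-validation: repeat a random 90/10 train/test
    split `n_reps` times and average the test-fold F1 score."""
    X = CountVectorizer().fit_transform(texts)
    y = np.asarray(labels)
    splitter = ShuffleSplit(n_splits=n_reps, test_size=0.1, random_state=seed)
    scores = []
    for train, test in splitter.split(X):
        clf = MultinomialNB().fit(X[train], y[train])
        scores.append(f1_score(y[test], clf.predict(X[test])))
    return float(np.mean(scores))
```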
      </sec>
      <sec id="sec-3-6">
        <title>Evaluation</title>
        <p>
          The sets of mean F1 scores for each classifier-enhancement combination were
subjected to Wilcox's trimmed-means robust 1-way ANOVA testing [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], at
the 95% confidence level, to determine if text length had a significant impact
on classification accuracy. This was followed by Wilcox's robust 2-way ANOVA
testing, at the 95% confidence level, on each classifier's data set to determine
whether there was a statistically significant interaction between text length and
enhancement method which influenced accuracy. An approximate measure of
the overall accuracy of each enhancement-classifier combination was made by
summing the accuracy results for all text lengths for each combination; this
may be thought of as a crude measure of the area-under-the-curve for plots of
accuracy (y-axis) drawn on a text-length abscissa (x-axis). The enhanced data
sets were analysed to calculate the average relative size of their texts compared
to the original texts. For example, if the mean length of synonym enhancements
for original messages of length 20 characters was found to be 140 characters,
the "additive footprint" for synonym enhancement at 20 characters would be
calculated to be 7.0. The additive footprint for a given enhancement was found,
by ANOVA, not to vary significantly as a function of text length and so may
be thought of as characteristic of an enhancement as a whole. Both additive
footprints and overall accuracy for each classifier-enhancement combination were
rank-ordered, and Spearman's rank-order coefficient test was carried out to
determine whether the additive footprint of an enhancement was correlated with
the overall classification accuracy of that enhancement for a given classifier.
        </p>
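The additive footprint and its rank correlation with accuracy can be computed as sketched below; the Spearman implementation ignores tie correction for brevity, and the worked example mirrors the 20-to-140-character case described in the text:

```python
def additive_footprint(original_texts, enhanced_texts):
    """Mean ratio of enhanced-text length to original-text length."""
    ratios = [len(e) / len(o) for o, e in zip(original_texts, enhanced_texts)]
    return sum(ratios) / len(ratios)

def spearman_rho(xs, ys):
    """Spearman rank-order correlation via 1 - 6*sum(d^2)/(n(n^2-1));
    no tie correction, for brevity."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A footprint of 7.0 for 20-character originals enhanced to a mean of 140 characters falls straight out of the ratio; a rho near -1 between rank-ordered footprints and rank-ordered overall accuracies corresponds to the reported finding that heavier supplementation tracks lower accuracy.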
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Numeric accuracy results for all three classifiers are omitted in the interest of
brevity. Instead, accuracy results in graphical form are presented along with
tabular results for additive footprint calculations and rank correlation results.
1-way robust ANOVA conducted on each enhancement for each classifier indicates
that significant (95%, p &lt; 0.0001) differences are present as text length changes
for all combinations. This finding supports rejection of the hypothesis that text
length does not influence classification accuracy. 2-way robust ANOVA across
text lengths and enhancements within each classifier indicates that a significant
interaction (95%, p &lt; 0.001) exists between text length and enhancement for
all classifiers. This finding supports rejection of the hypothesis that the chosen
enhancement method has no significant effect on the way in which the F1 score
changes with changes in text length. Note that, on all three sets of classifier plots,
local or absolute maxima for accuracy are frequently observed at text lengths
from 20 to 25 characters. The mean additive footprint of each enhancement and
the "area under the accuracy curve" for each enhancement-classifier combination
were calculated. These tabular results are displayed in addition to the graphical
accuracy results.</p>
      <p>The values of Spearman's test indicate a strong correlation between
increasing additive footprint and decreasing accuracy, as measured by F1 score, for the
naïve Bayes and SVM classifiers, and a moderate correlation for the LSA
classifier. In all three cases, the one-tailed z-score indicates a significant correlation
between increasing additive footprint and decreasing accuracy at the 95%
confidence level.</p>
      <p>
        This empirical result suggests that enhancements which over-supplement
the original text are likely to be counter-productive in terms of accurate
classification, and that the greater the degree of over-supplementation, the greater
the negative impact on classification accuracy. Visual inspection of the graphical
data shows that not only do additive enhancements under-perform non-additive
enhancements in this experiment, but they also actually decrease classification
performance as the text length increases. It is postulated that additive
enhancement methods, without careful control, may overwhelm any actual signal
present in the text through the addition of noise associated with poorly matched
textual supplementation, and that the associated concept drift will decrease
classification accuracy. Qualitative changes in accuracy can be seen to start as text
length decreases towards 50 characters for all non-additive enhancements, and
become very pronounced below 20 characters for all variants of a message. This
intuitive analysis was supported by post-hoc testing, which also indicated that, for
stable enhancements, statistically significant changes started to occur below 80
characters. This in turn suggests that, if the cases of the naïve Bayes and SVM
classifiers can be taken to be representative, text might be usefully, if subjectively,
considered short at lengths below 80 characters and very short at lengths of less
than 20 characters. The LSA classifier shows a decrease in accuracy across all
enhancements with increasing length beyond 25 characters: both this behaviour,
and the root cause of the comparative under-performance of the LSA classifier,
remain open issues for further investigation, but it should be noted that the
unsupervised nature of the LSA classifier might reasonably be expected to make it perform
less well than the supervised classifiers on this particular problem. In contrast, while
SVM has been recognised as a strong performer, several authors explicitly
suggest that naïve Bayes is often under-estimated [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and, given large, balanced
datasets and consistent document lengths, as in this case, may perform on a
par with more sophisticated algorithms [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In a more general sense,
Holte [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] observes that simple problems often respond very well to simple
classification approaches, and both Halevy, Norvig and Pereira [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Banko and
Brill [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] emphasise the importance of data characteristics over specific algorithm
choice. Against this backdrop, the relative strength of the naïve Bayes classifier
in this experiment should not be considered anomalous.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        Addressing a lack of quantitative experimental work on the often-discussed
impact of text length upon classification accuracy, this work undertook to
investigate the relationship between text length, textual enhancement and classification
accuracy by means of an experiment in which messages of carefully controlled
length were enhanced using variations on common text supplementation
methods and were then repeatedly classified. The primary contribution of this work is
to have provided direct, quantitative, experimental evidence that classification
accuracy, for two of three tested classifiers, declines with declining text length for
non-additive text enhancements, and that the exact quantitative nature of that
decline was dependent upon the enhancement or pre-treatment applied to the
text and on the classifier in use. The concept of "additive footprint" was
introduced to quantify the proportional increase in word count imposed upon a text
by a given enhancement, and it was found that the additive footprint remained,
for this data set, relatively constant for a given enhancement over a range of text
lengths and can thus be considered characteristic of an enhancement method,
independent of text length. The findings related to additive enhancements may
seem, at first glance, to contradict many published successes in the area of short-text
enhancement. However, the particular difficulties encountered in the
supplementation of short text have been obliquely alluded to by several authors [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
The salient finding is that, without some form of filtering, textual supplementation
has proved to be worse than useless. It is perhaps instructive to note that, at the
very shortest text lengths, the highest-performing 'enhancement' was the original
message, which was completely un-enhanced.
      </p>
      <p>Future work might usefully investigate the "bump" in accuracy seen for many
enhancement-classifier combinations at message lengths of 20 to 25 characters.
Some preliminary investigation was conducted to rule out any peculiarity or data
artefact that might cause this small increase in accuracy, but replacement of the
original data sets had no effect. A carefully designed experiment may be able to
determine whether author-created context and structure inherently varies with
text length: for example, it may indicate that texts in the 20 to 25-character
range have a higher degree of author-created clarity, which might, tentatively,
be attributed to an author's avoidance of ambiguity when composing shorter
messages. Another possible avenue for future work on additive enhancement
methods is experimentation with part-of-speech filtering, either at generation
time (e.g. send only adjectives to Wordnet for supplementation) or at application
time (e.g. accept only adjectives as supplemental words), or both together. Such
a filtering mechanism could potentially be used to limit the addition
of non-relevant words to the original text. The narrow focus of the
experimental work described, in terms of classifiers, enhancements, classification
task and datasets, provides ample opportunity for further exploration of the
generalisability of the results presented above.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanathan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Clustering short texts using wikipedia</article-title>
          .
          <source>In: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          . pp.
          <volume>787</volume>
          -
          <fpage>788</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Banko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brill</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scaling to very very large corpora for natural language disambiguation</article-title>
          . In:
          <article-title>Proceedings of the 39th annual meeting on association for computational linguistics</article-title>
          . pp.
          <volume>26</volume>
          -
          <fpage>33</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bollegala</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsuo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ishizuka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Measuring semantic similarity between words using web search engines</article-title>
          .
          <source>www 7</source>
          , 757-
          <fpage>766</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Caruana</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niculescu-Mizil</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An empirical comparison of supervised learning algorithms</article-title>
          .
          <source>In: Proceedings of the 23rd international conference on Machine learning</source>
          . pp.
          <volume>161</volume>
          -
          <fpage>168</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine learning 20(3)</source>
          ,
          <volume>273</volume>
          -
          <fpage>297</fpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Deerwester</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furnas</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harshman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American society for information science</source>
          <volume>41</volume>
          (
          <issue>6</issue>
          ),
          <fpage>391</fpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gabrilovich</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markovitch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge</article-title>
          .
          <source>In: AAAI</source>
          . vol.
          <volume>6</volume>
          , pp.
          <fpage>1301</fpage>
          –
          <lpage>1306</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gabrilovich</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markovitch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Computing semantic relatedness using Wikipedia-based explicit semantic analysis</article-title>
          .
          <source>In: IJCAI</source>
          . vol.
          <volume>7</volume>
          , pp.
          <fpage>1606</fpage>
          –
          <lpage>1611</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Genc</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakamoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nickerson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Discovering context: classifying tweets through a semantic transform based on Wikipedia</article-title>
          .
          <source>Foundations of augmented cognition. Directing the future of adaptive systems</source>
          pp.
          <fpage>484</fpage>
          –
          <lpage>492</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Go</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhayani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Twitter sentiment classification using distant supervision</article-title>
          .
          <source>CS224N Project Report, Stanford</source>
          <volume>1</volume>
          (
          <issue>12</issue>
          ) (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norvig</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>The unreasonable effectiveness of data</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>24</volume>
          (
          <issue>2</issue>
          ),
          <fpage>8</fpage>
          –
          <lpage>12</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Holte</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          :
          <article-title>Very simple classification rules perform well on most commonly used datasets</article-title>
          .
          <source>Machine learning</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>63</fpage>
          –
          <lpage>90</lpage>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text categorization with support vector machines: Learning with many relevant features</article-title>
          .
          <source>Machine learning: ECML-98</source>
          pp.
          <fpage>137</fpage>
          –
          <lpage>142</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rim</surname>
            ,
            <given-names>H.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Myaeng</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          :
          <article-title>Some effective techniques for naive Bayes text classification</article-title>
          .
          <source>IEEE transactions on knowledge and data engineering</source>
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1457</fpage>
          –
          <lpage>1466</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foltz</surname>
            ,
            <given-names>P.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laham</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>An introduction to latent semantic analysis</article-title>
          .
          <source>Discourse processes</source>
          <volume>25</volume>
          (
          <issue>2-3</issue>
          ),
          <fpage>259</fpage>
          –
          <lpage>284</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          :
          <article-title>Naive Bayes at forty: The independence assumption in information retrieval</article-title>
          .
          <source>In: European conference on machine learning</source>
          . pp.
          <fpage>4</fpage>
          –
          <lpage>15</lpage>
          .
          <publisher-name>Springer</publisher-name>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          :
          <article-title>WordNet: a lexical database for English</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <issue>11</issue>
          ),
          <fpage>39</fpage>
          –
          <lpage>41</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rennie</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shih</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teevan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karger</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          :
          <article-title>Tackling the poor assumptions of naive Bayes text classifiers</article-title>
          .
          <source>In: Proceedings of the 20th International Conference on Machine Learning (ICML-03)</source>
          . pp.
          <fpage>616</fpage>
          –
          <lpage>623</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Short text classification: A survey</article-title>
          .
          <source>Journal of Multimedia</source>
          <volume>9</volume>
          (
          <issue>5</issue>
          ) (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Short text classification using very few words</article-title>
          .
          <source>In: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval</source>
          . pp.
          <fpage>1145</fpage>
          –
          <lpage>1146</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Concept-based short text classification and ranking</article-title>
          .
          <source>In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management</source>
          . pp.
          <fpage>1069</fpage>
          –
          <lpage>1078</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Wilcox</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keselman</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Modern robust data analysis methods: measures of central tendency</article-title>
          .
          <source>Psychological methods</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ),
          <fpage>254</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cong</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thalmann</surname>
            ,
            <given-names>N.M.</given-names>
          </string-name>
          :
          <article-title>Enhancing naive Bayes with various smoothing methods for short text classification</article-title>
          .
          <source>In: Proceedings of the 21st International Conference on World Wide Web</source>
          . pp.
          <fpage>645</fpage>
          –
          <lpage>646</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>