<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Classification with Deep Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maaz Amajd</string-name>
          <email>maazamjad@phystech.edu</email>
        </contrib>
        <contrib contrib-type="author">
          <aff>NRU HSE</aff>
          <email>ivoronkov@hse.ru</email>
        </contrib>
      </contrib-group>
      <fpage>362</fpage>
      <lpage>370</lpage>
      <abstract>
        <p>In this paper, we analyze the use of different neural networks for the text classification task. The accuracy of the studied text classifiers can be improved with only a small number of previously classified texts. This is important because in many applications of text classification a large number of unlabeled texts is easily accessible, while obtaining labeled texts is quite difficult. The paper also shows that a convolutional neural network can work well at the word level and does not require knowledge of the syntactic or semantic structure of the language. On the other hand, a recurrent neural network, which represents the data as a sequence, can classify text effectively. Experimental results obtained on text corpora from two different sources show that using a vector data representation can also improve classification accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Text classification is a classic topic in natural language processing, with many
important applications in areas such as parsing, semantic analysis, information
extraction and web search [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It has therefore attracted many researchers.
      </p>
      <p>
        In natural language processing, a core task in text processing is how
to represent features. Most techniques rely on the bag-of-words model, where unigrams, bigrams
and, more broadly, n-grams or other hand-crafted patterns are used for
feature extraction. To extract more useful and distinctive features, many methods
have been developed, such as LDA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], PLSA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and frequency- and MI-based selection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Although many researchers have developed more complex features (such as tree
kernels) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to extract more contextual information and precise word order,
issues such as data sparseness remain, and they have a great impact on
classification accuracy. Some of the most successful deep
learning methods involve deep neural networks. In the past few years, deep neural
networks and rapid advances in pre-trained word embeddings
have become a source of fruitful new ideas for NLP tasks. Word embedding is a
distributed feature representation learned over sequences of words, and it greatly alleviates the data
sparsity issue. It is also worth mentioning that some researchers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have shown that
pre-trained word embeddings can capture useful syntactic and semantic
regularities. In addition, some composition-based methods have been proposed to extract
the semantic representation of a text with the help of word embeddings. To construct
sentence representations, [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed the Recursive Neural Network
(Recursive NN), which has proved effective. A Recursive NN can
extract the semantics of a sentence using a tree structure, but its
performance depends heavily on the quality of the constructed textual tree.
Moreover, the time complexity of constructing such a textual tree is at least O(n²), where n
is the length of the text, so for a long sentence or document this approach becomes
too time-consuming. Additionally, it is very difficult to express the relationship
between two sentences with a tree structure. Consequently, the Recursive NN is
unsuitable for modeling long sentences or documents.
      </p>
      <p>
        The Recurrent Neural Network (Recurrent NN) is another model, with
time complexity O(n). It processes a text word by word and stores the
semantics of all the preceding text in a fixed-size hidden layer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. While it
can undoubtedly capture the semantics of a long text, it is a biased model:
it emphasizes later words over earlier ones, which reduces its effectiveness
at capturing the semantics of a whole document, since all words should
contribute regardless of their position in the sequence.
      </p>
      <p>
        The Convolutional Neural Network (CNN) was introduced into natural language
processing to address this bias. Convolutional neural networks were initially
used for image recognition tasks, but recent work [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] showed that CNNs can also be
applied to NLP (natural language processing). Social media and networks have become a
very interesting topic for scientists, since nowadays more and more people share
their opinions on different subjects online. Compared to recurrent or recursive neural
networks, a CNN can extract the semantics of texts in a very systematic way
with time complexity O(n), and it can capture the most salient parts of a text using a
max-pooling layer. However, previous research on CNNs treats the kernel
as a fixed window [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and it is hard to find the most suitable size of a
kernel (window size): a large kernel leads to a huge
parameter space (which may be hard to train), whereas a small kernel may produce
inaccurate results by missing discriminative information.
      </p>
      <p>
        In text classification, work in this direction of NLP mainly
concentrates on three parts: how to design techniques that capture good features, how to
capture appropriate features using those techniques, and how to
design distinctive machine learning algorithms. In text analysis, bag-of-words
is the most commonly used feature extraction tool. There are also
tools for selecting more complex features, such as noun phrases [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
part-of-speech tags and tree kernels [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The central goal of feature selection is
to eliminate useless and noisy features from the text in order to improve the
performance of classification tasks. The best-known selection strategy is removing
stop words (e.g., "the", "a", "an"). To capture valuable features, more
modern techniques, such as L1 regularization [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], mutual information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], or
information gain, are used. Machine learning approaches
then apply classifiers such as logistic regression (LR), the support
vector machine (SVM) and Naïve Bayes (NB). Nonetheless, these
techniques suffer from the data sparsity problem.
      </p>
      <p>
        Deep learning, in particular deep
neural networks [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and word representation learning [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], emerged to deal
with data sparsity issues. Meanwhile, many other neural network
models have been suggested for word representation learning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. In text analysis, a word embedding is the neural representation of a word as
a real-valued vector. The word embedding technique makes it possible to
assess word relevance simply by using the distance between two embedding vectors.
      </p>
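      <p>As a minimal sketch of this idea, the toy vectors below (invented for illustration, not taken from any trained embedding model) show how cosine similarity between embedding vectors can rank word relevance:</p>
      <preformat>
```python
import numpy as np

def cosine_similarity(u, v):
    # Relevance of two words as the cosine of the angle between their embeddings.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional embeddings (illustrative values only).
king  = np.array([0.8, 0.3, 0.1, 0.9])
queen = np.array([0.7, 0.4, 0.2, 0.8])
apple = np.array([0.1, 0.9, 0.8, 0.0])

# Related words end up closer in the embedding space.
assert cosine_similarity(king, queen) > cosine_similarity(king, apple)
```
</preformat>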
      <p>
        Pre-trained word embeddings play an important role in achieving
the best performance of neural networks in many NLP tasks. To predict the
sentiment of a sentence, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] employed semi-supervised recursive autoencoders. Similarly,
using a recurrent neural network, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] suggested a method for
paraphrase detection. To examine the sentiment of phrases and sentences, [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]
introduced the recursive neural tensor network. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] employed
recurrent neural networks to construct language models. For dialogue act
classification, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] suggested a novel recurrent network, and for semantic role labeling [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
presented a convolutional neural network.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Experimental Setup for Comparison</title>
      <p>A Multi-Layer Perceptron (MLP) with a single hidden layer is used. Word embeddings
are fed as input to the neural network; the word embedding layer is the first layer in the
model, and a 32-dimensional vector is used to represent each word. For the experiment,
only the 5,000 most frequent words in the dataset form the vocabulary. A
movie review is bounded at 500 words: longer reviews are truncated and shorter
reviews are padded with zeros so that they all have the same length for modeling.</p>
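      <p>The truncation-and-padding step can be sketched as follows; padding on the left mirrors the default of Keras-style pad_sequences and is an assumption here, since the text does not specify the padding side:</p>
      <preformat>
```python
def pad_or_truncate(review_ids, maxlen=500, pad_value=0):
    """Truncate reviews longer than maxlen; left-pad shorter ones with zeros."""
    if len(review_ids) >= maxlen:
        return review_ids[:maxlen]
    return [pad_value] * (maxlen - len(review_ids)) + review_ids

short = [5, 18, 2]                 # a 3-word review (as integer word indices)
long_ = list(range(1, 601))        # a 600-word review

assert len(pad_or_truncate(short)) == 500
assert len(pad_or_truncate(long_)) == 500
assert pad_or_truncate(short)[-3:] == [5, 18, 2]   # original words kept at the end
```
</preformat>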
      <p>
        Finally, the MLP model is defined by creating a word embedding layer as the first
layer, with the word vector size set to 32 dimensions and the input length to 500. The output
of this first layer is a 500 x 32 matrix (one 32-dimensional vector per word). The embedding
layer's output is flattened to one dimension and fed to a dense hidden layer of 250 units with a
rectifier activation function. The output layer has one neuron and uses a sigmoid
activation so that the output values lie between 0 and 1 (probability-like values) as
predictions. A batch size of 128 is used for training the model. The Adam algorithm is used for
training because it finds good solutions by controlling the learning rate [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Essentially,
it uses moving averages of the gradients (momentum), which allow it to use a larger
effective step size, and the algorithm converges to this step size without fine
tuning. After the model is trained, it is evaluated on the test
dataset. The proposed model achieves an accuracy of 87.16%. Greater accuracy
could be achieved by training this network with a larger embedding and
additional hidden layers.
      </p>
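      <p>The shapes involved in this MLP can be illustrated with a plain NumPy forward pass; the weights here are randomly initialized stand-ins for the trained parameters, and the actual model was built and trained with a deep learning framework:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
vocab, embed_dim, seq_len, hidden = 5000, 32, 500, 250

E  = rng.normal(size=(vocab, embed_dim))              # embedding table
W1 = rng.normal(size=(seq_len * embed_dim, hidden)) * 0.01
b1 = np.zeros(hidden)
w2 = rng.normal(size=(hidden, 1)) * 0.01
b2 = np.zeros(1)

def forward(word_ids):
    x = E[word_ids]                        # (500, 32): one vector per word
    x = x.reshape(-1)                      # flatten to 16000 values
    h = np.maximum(0.0, x @ W1 + b1)       # dense ReLU layer, 250 units
    p = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))  # sigmoid output neuron
    return float(p)

review = rng.integers(0, vocab, size=seq_len)  # a random 500-word "review"
p = forward(review)
assert p > 0.0 and (1.0 - p) > 0.0             # probability-like value
```
</preformat>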
      <p>Convolutional Neural Networks (CNN) are biologically inspired variants
of the MLP. They were designed to honor the spatial structure in image data while being
robust to the position and orientation of learned objects in the picture. The same
idea can be applied to sequences, such as the one-dimensional sequence of words in a
movie review. The same properties that make the CNN model useful
for learning to identify objects in pictures can help it learn patterns (structure) in
paragraphs of words, namely the method's invariance to the specific position of
features.</p>
      <p>Word embeddings are fed as input to the convolutional neural network. The word
embedding layer is the first layer in the model architecture, and a 32-dimensional vector
is used to represent each word.</p>
      <p>Finally, a convolutional neural network model is defined for the experiment. This
time, a Conv1D layer is inserted after the embedding input layer. This Conv1D layer
has 32 feature maps and reads 3 vector elements (the kernel size) of the word embedding
at a time. The convolutional layer is followed by a 1D max pooling layer with a
length and stride of 2, which halves the size of the feature maps from the convolutional
layer. The rest of the network is the same as the MLP. Note that the
Conv1D layer preserves the dimensionality of the embedding input layer:
32-dimensional vectors for 500 words. The pooling layer then compresses this representation by
halving it. One key feature of CNNs is that units share weights, which greatly
reduces the amount of computation needed to train the model. CNNs are also
better at capturing the spatial relationships between words. After training
and testing the neural network, the model achieves an accuracy of 87.71%.</p>
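      <p>The effect of the Conv1D and max-pooling layers on the feature-map shapes can be sketched in NumPy; the weights are random, and 'same' padding is assumed because the text states that the Conv1D layer preserves the 500-word length:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, embed_dim, n_filters, kernel = 500, 32, 32, 3

x = rng.normal(size=(seq_len, embed_dim))            # embedded review
W = rng.normal(size=(kernel, embed_dim, n_filters)) * 0.1

# 'Same' padding keeps the sequence length at 500.
pad = np.pad(x, ((1, 1), (0, 0)))
conv = np.stack([np.einsum('ke,kef->f', pad[i:i + kernel], W)
                 for i in range(seq_len)])           # (500, 32) feature maps
conv = np.maximum(conv, 0.0)                         # ReLU

# Max pooling with length and stride 2 halves the feature maps to (250, 32).
pooled = conv.reshape(seq_len // 2, 2, n_filters).max(axis=1)

assert conv.shape == (500, 32)
assert pooled.shape == (250, 32)
```
</preformat>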
      <p>LSTM for sequence classification: sequence classification is a predictive
modeling task in which, given a sequence of words as input, the model must predict a
category (class) for the sequence. It is not trivial, because the
sequences can vary in length and be drawn from a very large vocabulary
of input symbols. When
such sequences are fed to a neural network model, the model may need to learn
long-term context or dependencies between patterns (symbols) in the input sequence.
Here, LSTM recurrent neural network models are used for sequence classification on the
movie review dataset.</p>
      <p>A movie review is a variable-length sequence of words: each movie review
contains a different number of words, and the sentiment of each review must be
classified. The words have been replaced by integers that encode the frequency
rank of each word in the dataset (how often the word occurs). In each
review, the sentences are therefore made up of sequences of integers.</p>
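      <p>The frequency-rank encoding can be sketched as follows, on a tiny illustrative corpus (the actual IMDB encoding is precomputed in the dataset):</p>
      <preformat>
```python
from collections import Counter

def build_index(corpus, top_k=5000):
    # Rank words by frequency; 1 = most frequent. Words outside the
    # top_k most frequent are dropped (one common convention).
    counts = Counter(w for doc in corpus for w in doc.split())
    return {w: rank for rank, (w, _) in enumerate(counts.most_common(top_k), start=1)}

corpus = ["the movie was great", "the movie was bad", "the plot was thin"]
index = build_index(corpus)
encoded = [index[w] for w in "the movie was great".split() if w in index]

assert index["the"] == 1      # the most frequent word gets the smallest integer
assert all(isinstance(i, int) for i in encoded)
```
</preformat>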
      <p>The word embedding layer is the first layer in the model; it uses 32-length vectors to
represent each word. The next layer is the LSTM layer with 100 memory units.
A dense output layer with a single neuron and a sigmoid
activation is used, because this is a binary classification task: the sigmoid
activation function produces probability-like values between 0 and 1 for the two classes
(good and bad). A batch size of 64 reviews is used to space out weight
updates. Finally, the model is fit for only a few
epochs, because it quickly overfits the task. With little tuning, the LSTM model
achieves an accuracy of 88.03%.</p>
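      <p>For intuition about what each of the 100 memory units computes, a single step of a generic LSTM cell can be written out in NumPy; this is the standard textbook formulation, not code from the experiments:</p>
      <preformat>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell (gates in input/forget/cell/output order)."""
    z = x @ W + h @ U + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c_new = f * c + i * np.tanh(g)     # memory cell: forget old, write new
    h_new = o * np.tanh(c_new)         # hidden state exposed to the next layer
    return h_new, c_new

rng = np.random.default_rng(2)
embed_dim, units = 32, 100
W = rng.normal(size=(embed_dim, 4 * units)) * 0.1
U = rng.normal(size=(units, 4 * units)) * 0.1
b = np.zeros(4 * units)

h = c = np.zeros(units)
for t in range(5):                     # feed 5 word embeddings through the cell
    h, c = lstm_step(rng.normal(size=embed_dim), h, c, W, U, b)

assert h.shape == (100,) and c.shape == (100,)
```
</preformat>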
      <p>Recurrent neural networks (RNN) such as the LSTM are generally prone to
overfitting. Dropout is a powerful technique for combating overfitting in LSTM models
and can be applied between layers of the neural network. Here, dropout is applied by
adding new Dropout layers between the Embedding and LSTM layers and between the LSTM
and Dense output layers. In the first experiment, the model achieves an
accuracy of 87.17%, indicating slightly slower convergence compared
to the plain LSTM. For comparison, a second technique adds
dropout precisely and separately to the input and recurrent connections of the LSTM
memory units. In this second experiment, the model achieves an accuracy of
86.44%. LSTM-specific dropout thus has a more pronounced effect on
the convergence of the network than layer-wise dropout.</p>
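      <p>The layer-wise variant of dropout amounts to randomly zeroing activations and rescaling the survivors; a minimal sketch, assuming the common inverted-dropout convention:</p>
      <preformat>
```python
import numpy as np

def dropout(x, rate, rng):
    # Inverted dropout: zero a fraction `rate` of units and rescale the
    # survivors so the expected activation is unchanged at test time.
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(3)
activations = np.ones(10000)
dropped = dropout(activations, rate=0.2, rng=rng)

assert abs(dropped.mean() - 1.0) <= 0.05    # expectation approximately preserved
assert (dropped == 0).mean() > 0.15         # roughly 20% of units zeroed
```
</preformat>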
      <p>CNN and LSTM for sequence classification: convolutional neural networks
(CNN) are proficient at learning spatial patterns (structure) in input data, whereas an LSTM
needs to be larger and trained for longer to achieve the same skill. The IMDB review
data does have a one-dimensional spatial pattern in the sequence of words
in movie reviews, so a CNN may be able to select invariant features for positive and
negative sentiment. These learned spatial features can then be treated as sequences
by an LSTM layer. It is straightforward to add one-dimensional CNN and max-pooling
layers after the embedding layer, which then feed the
consolidated features to the LSTM. A fairly small set of 32 feature maps with a small
filter length of 3 (kernel size) is used. In the next step, the pooling layer uses the
standard pool size of 2 to halve the feature maps. The rest of the
architecture is the same as described above. The model achieves an accuracy of 86.74%,
similar to the LSTM with dropout for sequence classification,
but with fewer weights and a faster training time.</p>
    </sec>
    <sec id="sec-3">
      <title>CNN for Twitter sentiment analysis</title>
      <p>
        The model architecture is a modification of the CNN architecture in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The input is a
tweet, represented by a matrix of real numbers in which each column is a word
of the tweet. The number of rows corresponds to the dimensionality of the word
embedding used. Several word embedding models were used in this work,
almost all of them with a dimensionality of 300. The CNN has one convolutional
layer, one max-pooling layer and a fully connected layer with a non-linear activation. The
network's goal is a binary classification task: predict whether a given tweet is positive
or negative.
      </p>
      <p>More formally, a tweet is a sequence of words t = [w1, w2, ..., wn],
where each wi is a d-dimensional embedding vector (d = 300). For a given tweet the model should output
1 or 0, where 1 is the positive class and 0 is the negative one. The convolutional layer has 3
different filter lengths: 3, 4 and 5, with 100 different filters producing a unique feature
map for each length. Since there is no padding in the convolutional layer, the output of each
filter has a different length, so a max-pooling layer takes the single most salient
feature from each feature map. The output of the max-pooling layer then flows to a
fully connected layer with a non-linearity, and the last step generates an output label using
the softmax function. The models were tested with different non-linear activation
functions. The sigmoid function was chosen for the next steps of the experiment, since it
produces almost the same performance in a shorter time (Table 1):</p>
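      <p>The convolution-and-max-over-time-pooling stage described above can be sketched in NumPy, using random filter weights and an invented 20-word tweet; in the real model these filters are learned during training:</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(4)
n_words, dim, n_filters = 20, 300, 100   # a 20-word tweet, 300-dim embeddings

tweet = rng.normal(size=(n_words, dim))
features = []
for k in (3, 4, 5):                      # the three filter lengths
    W = rng.normal(size=(k, dim, n_filters)) * 0.01
    # No padding: a filter of length k yields n_words - k + 1 positions.
    conv = np.stack([np.einsum('ke,kef->f', tweet[i:i + k], W)
                     for i in range(n_words - k + 1)])
    # Max-over-time pooling keeps one value per filter.
    features.append(conv.max(axis=0))

feature_vec = np.concatenate(features)   # 3 sizes x 100 filters = 300 features
assert feature_vec.shape == (300,)
```
</preformat>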
      <p>Table 1 compared the activation functions tested (ReLU, Sigmoid, Tanh); the per-function figures are not recoverable here.</p>
      <sec id="sec-3-2">
        <title>Evaluation on SemEval-2016</title>
        <p>
          The model was tested on the SemEval-2016 [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] dataset using 2 classes of tweets,
positive and negative: 8306 tweets are labeled positive and 3190
negative. Four different pre-trained models were used for the word embeddings: word2vec
[
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], glove-twitter, glove-wikipedia and glove-common. GloVe stands for Global Vectors
for Word Representation; the models were trained on Twitter, Wikipedia or common web data,
respectively. The GloVe models are accessible online from [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].
        </p>
        <p>Words from a tweet that are not found in the pre-trained word embeddings are
initialized with random numbers and corrected during the training stage. The dataset
was evaluated using 10-fold cross-validation.</p>
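        <p>A minimal sketch of the 10-fold splitting; this is a plain sequential split without shuffling or stratification, which is an assumption, since library implementations typically shuffle:</p>
        <preformat>
```python
def ten_fold_splits(n_items, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_items))
    fold = n_items // k
    for j in range(k):
        test = idx[j * fold:(j + 1) * fold]
        train = idx[:j * fold] + idx[(j + 1) * fold:]
        yield train, test

splits = list(ten_fold_splits(100))
assert len(splits) == 10
assert all(len(test) == 10 for _, test in splits)
assert all(len(train) + len(test) == 100 for train, test in splits)
```
</preformat>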
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and conclusion</title>
      <p>The aim of the first experiment is to classify the texts of the IMDB movie review
dataset. The results are judged separately so that the scores of different techniques
can be compared. They are obtained using the commonly applied method of
ten-fold cross-validation, which calculates average classification accuracy, with
accuracy simply defined as the fraction of correctly predicted labels. For evaluation
purposes, accuracy scores are often compared to the majority baseline: the
accuracy obtained when the label of the largest class (i.e. the label with the
highest prior probability) is assigned to every document.</p>
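      <p>The majority baseline follows directly from the class sizes; the sketch below uses the positive/negative counts reported above for the SemEval-2016 subset as a worked example:</p>
      <preformat>
```python
from collections import Counter

def majority_baseline_accuracy(labels):
    # Accuracy obtained by always predicting the most frequent class.
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Class sizes from the SemEval-2016 subset: 8306 positive, 3190 negative.
labels = ["pos"] * 8306 + ["neg"] * 3190
baseline = majority_baseline_accuracy(labels)

assert round(baseline, 6) == round(8306 / 11496, 6)  # about 72% accuracy
```
</preformat>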
      <p>Naïve Bayes (multinomial, on text) was applied to the IMDB data with three
different stop word lists: once with unigrams, once with combinations of unigrams and
bigrams, and once with combinations of unigrams, bigrams and trigrams as features
for classification. The stop word lists were taken from different sources (NLTK
(Natural Language Toolkit) and Google stop words). The NLTK stop word list gave
quite good results for sentiment classification.</p>
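      <p>Feature extraction with stop-word removal plus unigram and bigram features can be sketched in plain Python; the tiny stop-word set below is a stand-in for the NLTK and Google lists used in the experiments:</p>
      <preformat>
```python
STOP_WORDS = {"the", "a", "an", "is", "was"}   # tiny stand-in for the real lists

def ngram_features(text, max_n=2):
    """Unigram and bigram features after stop-word removal
    (trigrams follow the same pattern with max_n=3)."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    feats = []
    for n in range(1, max_n + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

feats = ngram_features("The acting was great but the plot was weak")
assert "great" in feats and "acting great" in feats
assert "the" not in feats                      # stop words removed
```
</preformat>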
      <p>A multilayer neural network and a convolutional neural network were applied to the
IMDB dataset, with the data split 50/50 between training and testing. The models
achieve accuracies of 87.16% and 87.71% after two epochs, respectively. These
are good predictive models compared to traditional machine learning
techniques, and fairly good results can also be achieved after several epochs. The CNN shows
better performance because its architecture is well suited to highlighting text patterns
(sequences) during training.</p>
      <p>The LSTM model with 100 memory units is applied to the
IMDB dataset for sequence classification, with the same 50/50 split between
training and testing, and achieves an accuracy of 88.03% after three epochs.
The LSTM shows better performance because its architecture contains memory cells that
can memorize and forget text patterns (sequences). The combination of a convolutional neural
network and an LSTM was applied to the IMDB dataset for sequence classification with the
same data distribution, achieving an accuracy of 86.74% after three epochs.</p>
      <p>Comparing all the results for the IMDB dataset, the CNN shows the best
performance for the classification task, while the LSTM model achieves
good performance for sequence classification. Their training
times and performances are compared in Table 2.</p>
      <p>Table 2 reported the training times of the compared models in seconds: 1120, 72, 88, 2230, 1999 and 869; the model-to-time correspondence is not recoverable here.</p>
      <p>For the second experiment, on short messages (Twitter), the results
with the different word embedding models should be analyzed bearing in mind that the
proportion of randomly initialized words was quite large, from 46% (for word2vec) to 37%
(for glove-common). Also, the GloVe models trained on Twitter use 200-dimensional vectors,
whereas the others have dimensionality 300. As the results show (Table 3), the
relevance and volume of the training data are crucial for the choice of word
embeddings.</p>
      <p>The results show that convolutional neural networks are more effective for sentiment
analysis than other machine learning algorithms, and that all machine learning algorithms
are much more effective than hand-made features when analyzing big datasets.
Moreover, convolutional neural networks showed a very high learning rate on
different text corpora (IMDB and SemEval-2016).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>C. C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>A survey of text classification algorithms</article-title>
          .
          <source>In Mining text data</source>
          . Springer.
          <fpage>163</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hingmire</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chougule</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Palshikar,
          <string-name>
            <given-names>G. K.</given-names>
            ; and
            <surname>Chakraborti</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Document classification by topic labeling</article-title>
          .
          <source>In SIGIR</source>
          ,
          <fpage>877</fpage>
          -
          <lpage>880</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ducharme</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; Vincent,
          <string-name>
            <given-names>P.</given-names>
            ; and
            <surname>Jauvin</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>A Neural Probabilistic Language Model</article-title>
          .
          <source>JMLR</source>
          <volume>3</volume>
          :
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Text categorization by boosting automatically extracted concepts</article-title>
          .
          <source>In SIGIR</source>
          ,
          <fpage>182</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cover</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Elements of information theory</article-title>
          . John Wiley Sons.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Post</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bergsma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Explicit and implicit syntactic features for text classification</article-title>
          .
          <source>In ACL</source>
          ,
          <fpage>866</fpage>
          -
          <lpage>872</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yih</surname>
          </string-name>
          , W.T. and
          <string-name>
            <surname>Zweig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>Linguistic regularities in continuous space word representations</article-title>
          .
          <source>In hlt-Naacl</source>
          ,
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sahami</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heckerman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horvitz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>A Bayesian approach to filtering junk e-mail.</article-title>
          .
          <source>Learning for Text Categorization: Papers from the AAAI Workshop</source>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>62</lpage>
          .
          <source>Tech. rep. WS-98-05</source>
          , AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Socher</surname>
          </string-name>
          , R.;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>E. H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ng</surname>
          </string-name>
          , A. Y.; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          (
          <year>2011a</year>
          ).
          <article-title>Dynamic pooling and unfolding recursive autoencoders for paraphrase detection</article-title>
          .
          <source>In NIPS</source>
          , volume
          <volume>24</volume>
          ,
          <fpage>801</fpage>
          -
          <lpage>809</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
<string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>E. H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          (
          <year>2011b</year>
          ).
<article-title>Semi-supervised recursive autoencoders for predicting sentiment distributions</article-title>
          .
          <source>In EMNLP</source>
          ,
          <fpage>151</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Elman</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          (
          <year>1990</year>
          ).
          <article-title>Finding structure in time</article-title>
          .
<source>Cognitive Science</source>
          <volume>14</volume>
          (
          <issue>2</issue>
          ):
          <fpage>179</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
<string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
<string-name>
            <surname>Collobert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al. (
          <year>2011</year>
          ).
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          , vol.
          <volume>12</volume>
          ,
          <fpage>2493</fpage>
          -
          <lpage>2537</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kalchbrenner</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Blunsom</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Recurrent convolutional neural networks for discourse compositionality</article-title>
.
          <source>In Workshop on CVSC</source>
          ,
          <fpage>119</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          (
          <year>1992</year>
          ).
          <article-title>An evaluation of phrasal and clustered representations on a text categorization task</article-title>
          .
          <source>In SIGIR</source>
          ,
          <fpage>37</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Three new graphical models for statistical language modelling</article-title>
.
          <source>In ICML</source>
          ,
          <fpage>641</fpage>
          -
          <lpage>648</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Feature selection, l1 vs. l2 regularization, and rotational invariance</article-title>
          .
<source>In ICML</source>
          ,
          <fpage>78</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Post</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bergsma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Explicit and implicit syntactic features for text classification</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G. E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R. R.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Reducing the dimensionality of data with neural networks</article-title>
          .
          <source>Science</source>
          <volume>313</volume>
          (
          <issue>5786</issue>
          ):
          <fpage>504</fpage>
          -
          <lpage>507</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
; and
          <string-name>
            <surname>Vincent</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Representation learning: A review and new perspectives</article-title>
          .
<source>IEEE TPAMI</source>
          <volume>35</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1798</fpage>
          -
          <lpage>1828</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
<string-name>
            <surname>Huang</surname>
            ,
            <given-names>E. H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
<article-title>Improving word representations via global context and multiple word prototypes</article-title>
          .
          <source>In ACL</source>
          ,
          <fpage>873</fpage>
          -
          <lpage>882</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Statistical language models based on neural networks</article-title>
.
          <source>Ph.D. Dissertation</source>
          , Brno University of Technology.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
<string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Perelygin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J. Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chuang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
;
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Potts</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Recursive deep models for semantic compositionality over a sentiment treebank</article-title>
          .
          <source>In EMNLP</source>
          ,
          <fpage>1631</fpage>
          -
          <lpage>1642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
<string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
<source>arXiv preprint arXiv:1412.6980</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
25. SemEval-2016. URL: http://alt.qcri.org/semeval2016/ (accessed 31.05.2017)
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
<string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , et al. (
          <year>2013</year>
          ).
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
GloVe. URL: https://nlp.stanford.edu/projects/glove/ (accessed 31.05.2017)
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
<string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kiritchenko</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets</article-title>
          .
          <source>arXiv preprint arXiv:1308.6242</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
<string-name>
            <surname>Gamallo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Citius: A Naive-Bayes strategy for sentiment analysis on English tweets</article-title>
          .
          <source>Proceedings of SemEval</source>
          ,
          <fpage>171</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
<string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al. (
          <year>2011</year>
          ).
          <article-title>Sentiment analysis of Twitter data</article-title>
          .
          <source>Proceedings of the Workshop on Languages in Social Media</source>
          , Association for Computational Linguistics,
          <fpage>30</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
<string-name>
            <surname>Go</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bhayani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Twitter sentiment classification using distant supervision</article-title>
          .
          <source>CS224N Project Report</source>
          , Stanford, vol.
          <volume>1</volume>
          , no. 12.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>