<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Alignment-Based Topic Extraction Using Word Embedding</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Charleston</institution>
          ,
          <addr-line>Charleston SC 29424</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Being able to extract targeted topics from text can be a useful tool for understanding the large amount of textual data that exists in various domains. Many methods have surfaced for building frameworks that can successfully extract this topic data. However, it is often the case that a large number of training samples must be labeled properly, which requires both time and domain knowledge. This paper introduces new alignment-based methods for predicting topics within textual data that minimize the dependence upon a large, properly labeled training set. Leveraging Word2Vec word embeddings trained using unlabeled data in a semi-supervised approach, we are able to reduce the amount of labeled data necessary during the text annotation process. This allows for higher prediction levels to be attained in a more time-efficient manner with a smaller sample size. Our method is evaluated on both a publicly available Twitter sentiment classification dataset and on a real estate text classification dataset with 30 topics.</p>
      </abstract>
      <kwd-group>
        <kwd>Topic extraction</kwd>
        <kwd>Text annotation</kwd>
        <kwd>Text classification</kwd>
        <kwd>Word vectors</kwd>
        <kwd>Text tagging</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Finding specific topics within textual data is an important task for many
domains. A multitude of approaches for achieving this task have appeared in recent
years [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], no doubt due to the ever growing amount of textual data available
to organizations and researchers. In the most straightforward case, topic labels
are known a priori and non-domain experts can be trained to manually label
examples using systems such as Mechanical Turk [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In a second case, topic
labels are not known prior to annotation and only after a domain expert has
defined them can non-experts be used to label examples. In a third case, even after the
topic labels have been defined, the domain requires an expert throughout the
entire annotation process [
        <xref ref-type="bibr" rid="ref1 ref13">1, 13</xref>
        ]. The fourth case, and the one that motivated
our work specifically, is the case in which a domain expert must concurrently
evolve the set of topic labels through manual annotation of examples.
      </p>
      <p>
        In all of these cases, once an appropriate number of training samples has
been acquired, many different machine learning algorithms have successfully
been used to automatically identify topics in new examples [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Further,
reducing the number of training samples needed to produce an accurate predictive
model is beneficial in all instances, although the fourth case in particular benefits
from a flexible predictive modeling approach that can quickly be retrained on
a small number of examples. This is because the domain expert may revise or
modify the current topic labels while exploring their problem domain. An
example of this scenario arose during a data science project in the real estate domain
where a sentence-level topic annotator was desired. Experts in this domain could
not specify a final or draft set of topic labels, preventing us from utilizing other
approaches that are built to expect a stable set of topics. Further, domain
experts required a lightweight, web-based interface where topics could be easily
defined, modified, and applied at the sentence level while minimizing the
number of training examples needed to produce an automated prediction due to the
large number of topics in their domain.
      </p>
      <p>This paper describes a novel method and application for predicting topics at
the sentence level in order to reach a high level of accuracy with a limited number
of training samples. We evaluate our algorithm on its ability to predict over
30 different sentence-level topic labels within real estate data. We also provide
an evaluation of our algorithm on a standard and publicly available Twitter
sentiment-based prediction dataset. In the real-estate domain, expert knowledge
is required for the annotation of the important topics of interest. The topics were
not available or known a priori, further limiting the number of possible expert
annotators. We show that our algorithm achieved a higher prediction accuracy
with a very small number of examples, resulting in a significant improvement
over standard methods for topic identification with small sample sizes. Finally,
our method maintains its ability to predict well in small sample sizes, even in the
absence of negative training examples which would further reduce the burden
on the domain expert. The rest of this paper is structured as follows: in Section
2 we present a review of related work; in Section 3 we provide descriptions of
relevant machine learning algorithms and introduce our architecture and novel
methods; in Sections 4 and 5 we describe our use cases and evaluation procedure;
and in Section 6 we present our experimental results and discussion, followed by
our conclusions in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Kim et al. propose a method for categorizing text at the sentence and document
level using word vector distances [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Their algorithm helps deal with sparsity
issues in word data, specifically those caused by words that look very different but
mean the same thing. These sparsity issues are often caused by short
documents or a small amount of training data. One of the datasets that they
used was the SemEval 2013 Task B dataset (Twitter), which contains 12,348
tweets that are labeled as positive, negative, or neutral. They found that their
algorithm produced good results quickly and with a relatively small number of
training samples (though sample sizes in at least the hundreds were still required).
      </p>
      <p>
        Topic and feature extraction is popular in the world of data mining, and
considerable work has gone into making this process more efficient and more accurate.
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] proposes an algorithm that combines binary classifiers, Conditional Random
Fields (CRFs), and Expectation Maximization in order to extract information
from business-to-consumer (B2C) emails. Different pieces of the extracted
information are annotated as specific features. This is done by utilizing templates
that have been predefined based on other similar documents. Their approach
is fully unsupervised and requires no manual corrections or fixes to the data.
Instead of using word vectors for matching, they run low-accuracy annotators
trained on weak features which then feed into their CRF model.
      </p>
      <p>
        Nguyen et al. introduce an algorithm that combines preprocessing, pattern
recognition, iterative model development, and active learning to annotate and
classify features found within clinical text records [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In their algorithm, the
textual data is first standardized and normalized to perform tasks such as
correcting spelling, expanding abbreviations, and converting to a standard layout.
The data then moves to the iterative model development process, where models
are trained and evaluated using Support Vector Machines (SVMs) and CRFs.
The model is refined using a visual annotator that allows for some manual
correction, along with active learning that lets the learner select the most informative
data to retrain the model. Their approach requires a relatively large number of
training samples. In one of their tests, they ran various active learning algorithms
on 100 batches of radiology reports with 10 reports per batch. The F-scores for
the active learning algorithms, on average, did not surpass 75% until around
7-10 batches (i.e., 70-100 reports) were run. It also took the algorithms between
30-50 batches to reach a 90% F-score.
      </p>
      <p>
        Wang et al. offer a new approach to modeling targeted subtopics within
text. Instead of extracting all larger topics within a corpus, they search for more
specific subtopics using a targeted topic model (TTM) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. To do this, each
sentence is treated as its own topic that focuses on only one aspect. These topics
are deemed relevant or irrelevant based on a set of specified keywords. Their
model was run on five datasets taken from Twitter that range in size from 10k
to 50k samples.
      </p>
      <p>
        Finally, several attempts have been made to create architectures that can be
used for new domains. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposes a Python framework to simplify the process
of feature extraction across different forms of media such as video, audio, and
text. Their framework, Pliers, attempts to package the benefits of multiple other
machine learning frameworks and services into one coherent feature extraction
toolbox.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <sec id="sec-3-1">
        <title>Standard Approaches</title>
        <p>
          We will now present brief descriptions of several widely applied methods used for
target identification. This section is broken up into two additional subsections:
feature extraction and machine learning methods. All of the methods described
in this work rely on one of two text representation methods: bag-of-words (BOW)
or Word2Vec [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ]. We briefly describe how these methods were implemented
and incorporated into our work.
        </p>
        <p>
          Feature Extraction. In its simplest form, BOW is an orderless representation
of word frequencies in a document. In the context of this sentence-level
target identification problem, the word counts from each sentence are normalized
into a word frequency matrix prior to classification. The Python natural
language toolkit (nltk) and native Python String library [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] were used for this step.
Python’s String library was used to strip punctuation, and stop words were
removed using nltk. This was followed by stemming using nltk’s
SnowballStemmer [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
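        <p>The feature extraction step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the pipeline used in this work: the stop word list is a hypothetical stand-in for nltk's, stemming is omitted, and the function names are invented for this example.</p>
        <preformat><![CDATA[
```python
import string
from collections import Counter

# Hypothetical, tiny stop word list used only for illustration;
# the paper used nltk's stop words and SnowballStemmer instead.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "in", "with"}


def tokenize(sentence):
    """Lowercase, strip punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = sentence.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]


def bow_matrix(sentences):
    """Return (vocabulary, rows); each row holds normalized word frequencies."""
    tokenized = [tokenize(s) for s in sentences]
    vocabulary = sorted({w for words in tokenized for w in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        total = sum(counts.values()) or 1  # avoid dividing by zero for empty sentences
        rows.append([counts[w] / total for w in vocabulary])
    return vocabulary, rows


vocab, matrix = bow_matrix(["The kitchen is large.", "Large fenced backyard!"])
```
]]></preformat>
        <p>Each row of the resulting matrix sums to one for a non-empty sentence, so sentences of different lengths are comparable before classification.</p>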
        <p>
          Word2Vec is an NLP system that utilizes neural networks in order to create a
distributed representation of words in a corpus [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ]. Whereas the BOW pipeline
produces word frequencies for each document, Word2Vec creates a
vector for each word present in a document. Related words have vectors that lie
close together. The words Athens and Greece are examples
of this, along with plural forms or tense changes, such as alumnus and alumni or
walking and walked [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In order to map words to vectors, Word2Vec uses an
underlying shallow neural network in addition to techniques seen in deep
learning tasks. This unsupervised task takes each individual sentence in a given
corpus and, within the neural network, encodes the context of each word in the sentence,
much like the deep learning autoencoders seen in restricted Boltzmann machines
and Stacked Denoising Autoencoders [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ]. This is done through the usage of
skip-grams, which calculate the probabilities of occurrence for words within a
certain distance before and after a given word. Inter-relating these probabilities
creates similar word vectors for those with higher probabilities.
        </p>
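        <p>To make the skip-gram idea concrete, the sketch below enumerates the (center word, context word) pairs that a skip-gram model would be trained on for one sentence. It only illustrates how the context window defines the training pairs; the neural network itself is not shown, and the function name and window size are assumptions made for this example.</p>
        <preformat><![CDATA[
```python
def skipgram_pairs(sentence, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs


pairs = skipgram_pairs(["two", "master", "bedrooms", "downstairs"], window=1)
```
]]></preformat>
        <p>Words that repeatedly share context words across a corpus end up with similar vectors, which is the property the distance metrics in this work rely on.</p>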
        <p>
          For the purposes of evaluation in this paper, two Word2Vec models were used.
The first was a model trained on the real estate corpus, and the second was a
publicly available Twitter Word2Vec pre-trained model available at
http://yuca.test.iminds.be:8900/fgodin/downloads/word2vec_twitter_model.tar.gz.
        </p>
        <p>
          Machine Learning Methods. Four standard machine learning methods that
are often used in topic identification were selected for comparison: Naïve Bayes,
SVM, Random Forest (RF), and K-Nearest Neighbor (KNN). Standard
implementations of these algorithms are available in scikit-learn. Naïve Bayes was
applied to the BOW features [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] using empirical priors and non-parametric
settings. SVMs with BOW features have been shown to perform well on a wide range
of text classification applications [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. SVMs are not prone to error with
high-dimensional datasets and have been previously shown to be useful in text-based
classification problems [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Four standard kernels were tested: linear, polynomial,
radial basis, and sigmoid. The penalty parameter (C) was also varied as 0.01,
0.1, 1, and 10. For the presentation of the results, a single entry for SVMs is
displayed that corresponds to the best parameter selection for each class. KNN
with two standard distance metrics was tested. Both distance metrics are built
upon Word2Vec embeddings as opposed to BOWs. The two metrics used were
the mean and maximum cosine-similarity between all pairs of words.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Novel Approaches</title>
        <p>Two novel approaches were developed and evaluated in this work that stem from
an alignment-based distance metric. The first approach is a standard KNN
classifier utilizing the novel alignment-based distance metric in place of traditional
BOW distance metrics. To classify a new unknown sample, the alignment-based
KNN calculates the distance between the target sample and all labeled data.
The k-nearest neighbors are then found and the majority class is returned. The
second approach is an alignment-based threshold classifier that only requires
positively labeled data and a predefined threshold. This threshold-based
classifier measures the distance from a target unknown sample to only the positively
annotated samples. If the score is above the threshold, a positive class prediction
is returned. The success of both methods is dependent on the alignment-based
distance metric described below.</p>
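        <p>The two classifiers described above can be sketched independently of the scoring function they use. The code below is a minimal illustration, not the implementation evaluated in this work: it assumes a similarity score where larger means closer, it assumes the threshold classifier compares the best score over all positive samples to the threshold, and it uses a toy word-overlap score as a stand-in for the alignment score. All function names are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def knn_predict(target, labeled, score, k=3):
    """Majority label among the k labeled samples most similar to `target`.

    `labeled` is a list of (sample, label) pairs and `score(a, b)` is any
    similarity function where larger means closer (an assumption here).
    """
    ranked = sorted(labeled, key=lambda pair: score(pair[0], target), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def threshold_predict(target, positives, score, threshold):
    """Positive if the best score against any positive sample exceeds the threshold.

    Taking the maximum over the positive samples is an assumption of this sketch.
    """
    best = max((score(p, target) for p in positives), default=float("-inf"))
    return best > threshold


def overlap(a, b):
    """Toy stand-in score: Jaccard overlap between two word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


labeled = [
    (["master", "downstairs"], True),
    (["master", "bedroom", "downstairs"], True),
    (["granite", "counters"], False),
]
target = ["two", "master", "bedrooms", "downstairs"]
knn_label = knn_predict(target, labeled, overlap, k=3)
is_positive = threshold_predict(target, [s for s, y in labeled if y], overlap, 0.3)
```
]]></preformat>
        <p>Note that the threshold classifier never looks at negatively labeled samples, which is what allows it to be trained from positive annotations alone.</p>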
        <p>
          Alignment Distance Metric. All alignment-based approaches were
implemented using the Needleman-Wunsch algorithm, which is best known for
its use in aligning biological sequences [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We have adapted the scoring and
gap penalties for the target identification classification problem. The algorithm
produces a numerical score by aggregating the similarity scores
between aligned words (cosine similarity) and the penalties for skipping a
word in either the labeled or unlabeled sentence. The cosine similarity
score for a pair of aligned words is the dot product between their unit-normalized vector representations.
There is no gap penalty for skipping a word in the unlabeled sentence; however,
skipping a target word carries a high penalty and is therefore avoided by the
algorithm. This forces the algorithm to match all target words that have been
identified by a domain expert. An example alignment of two sentences is shown
in Table 1. For all alignment-based methods, the Word2Vec implementation
described in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] was used.
        </p>
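        <p>The scoring scheme described above can be sketched as a small dynamic program. This is a hedged illustration rather than the implementation used in this work: the gap penalty value, the function names, and the two-dimensional word vectors are all invented for the example, and a real run would look up vectors from a trained Word2Vec model.</p>
        <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity: dot product of the two vectors after unit normalization."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_score(target, unknown, vectors, target_gap=10.0):
    """Needleman-Wunsch style global alignment score between two word lists.

    Skipping a word of `unknown` is free; skipping a word of `target`
    costs `target_gap` (a hypothetical value chosen for this sketch).
    """
    rows, cols = len(target) + 1, len(unknown) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = -target_gap * i  # every target word so far was skipped
    for i in range(1, rows):
        for j in range(1, cols):
            sim = cosine(vectors[target[i - 1]], vectors[unknown[j - 1]])
            match = table[i - 1][j - 1] + sim
            skip_unknown = table[i][j - 1]              # no penalty
            skip_target = table[i - 1][j] - target_gap  # high penalty
            table[i][j] = max(match, skip_unknown, skip_target)
    return table[rows - 1][cols - 1]


# Hypothetical 2-D embeddings, invented purely for illustration.
vectors = {
    "two": (0.1, 0.9), "master": (1.0, 0.0), "bedroom": (0.8, 0.6),
    "bedrooms": (0.8, 0.6), "are": (0.0, 1.0), "located": (-0.6, 0.8),
    "downstairs": (0.6, -0.8),
}
target = ["master", "bedroom", "downstairs"]
unknown = ["two", "master", "bedrooms", "are", "located", "downstairs"]
score = alignment_score(target, unknown, vectors)
```
]]></preformat>
        <p>Because gaps in the unknown sentence are free, extra words such as "two", "are", and "located" do not lower the score, while the high target gap penalty pushes the alignment to match every annotated word.</p>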
        <p>Standard Distance Metrics. In addition to these two novel approaches, we
implemented and tested two standard Word2Vec distance metrics. The first
metric was the maximum cosine similarity between pairs of words in the annotated and
unknown sentences. The second was the average cosine similarity between
pairs of words.</p>
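        <p>As a minimal sketch of the two baseline metrics, the code below computes the maximum and the mean cosine similarity over all word pairs drawn from an annotated and an unknown sentence. The toy vectors and function names are assumptions made for this example, and both sentences are assumed to be non-empty.</p>
        <preformat><![CDATA[
```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_scores(annotated, unknown, vectors):
    """Cosine similarity for every (annotated word, unknown word) pair."""
    return [cosine(vectors[a], vectors[b]) for a, b in product(annotated, unknown)]


def max_similarity(annotated, unknown, vectors):
    return max(pairwise_scores(annotated, unknown, vectors))


def mean_similarity(annotated, unknown, vectors):
    scores = pairwise_scores(annotated, unknown, vectors)
    return sum(scores) / len(scores)


# Hypothetical 2-D embeddings, invented purely for illustration.
vectors = {"master": (1.0, 0.0), "bedroom": (0.0, 1.0), "suite": (1.0, 1.0)}
best = max_similarity(["master", "bedroom"], ["master", "suite"], vectors)
average = mean_similarity(["master", "bedroom"], ["master", "suite"], vectors)
```
]]></preformat>
        <p>Unlike the alignment score, neither metric takes word order into account, and the maximum can be driven by a single shared word.</p>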
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Use Cases</title>
      <sec id="sec-4-1">
        <title>Real Estate</title>
        <p>The domain we designed our original system for involved property descriptions
from real estate data. This data was created by realtors and contains information
about various real estate listings. Each listing has a large amount of metadata
(e.g., location, images, basic features); however, the main piece of information we
use in our system is the description written for each listing. This description is
usually no more than a short paragraph and is created by the real estate agent to
summarize the listing as a whole. It can include any information that the agent
wishes to convey to the potential buyer, such as property features, kitchen
appliances, etc. By analyzing a listing’s description, we attempt to extract specific
features about the listing itself.</p>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Example alignment of an annotated sentence with an unknown sentence. A hyphen marks a gap.</p>
          </caption>
          <table>
            <tbody>
              <tr><td>Annotated:</td><td>-</td><td>master</td><td>bedroom</td><td>-</td><td>-</td><td>downstairs</td></tr>
              <tr><td>Unknown:</td><td>two</td><td>master</td><td>bedrooms</td><td>are</td><td>located</td><td>downstairs</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We created a web application using the Angular JavaScript framework. The
main interface for this application provides a way to easily and quickly annotate
real estate listings. This interface pulls listings that have not yet been annotated
from our server, along with the sentence and word data associated with that
listing. It also retrieves the topic information so that the listing can be properly
annotated.</p>
        <p>Once a listing has been retrieved, the annotator can select a sentence from
the listing description to annotate. They can then toggle the individual words
in the sentence that match a specific topic. If they believe that a pattern should
be mapped to a topic that does not currently exist, then they have the option to
create it. Once a topic has been defined, it can then be used by future annotators.
When an annotator is satisfied with an annotation, they simply click submit and
it is stored inside of the annotations database to be used in future alignment
predictions.</p>
        <p>Along with this annotation interface, we also included other pages within
the web application that provide statistical information regarding the database.
Some of these pages display simple information, such as the specific progress of
different annotators. Other pages give annotators more control over the database
itself, allowing them to take actions such as correcting mistakes in previous
annotations or modifying topic names.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Twitter</title>
        <p>
          Twitter has grown quite large in the past decade, along with the amount of
textual data it has created. This data has proven to be a popular source for
testing and implementing text-based machine learning algorithms [
          <xref ref-type="bibr" rid="ref11 ref17 ref3 ref6">3, 6, 11, 17</xref>
          ]. We
tested our alignment algorithm on the Twitter dataset used in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. This dataset
includes the text data for 12,348 tweets and annotations that have been made for
each tweet. These annotations are based on the entire tweet and represent the
overall sentiments conveyed therein. The possible sentiment values include
positive, neutral, negative, objective, and objective-OR-neutral. In our experiment
and the experiment done by [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], only positive, neutral, and negative sentiments
are used.
        </p>
        <p>In order to annotate this dataset, we created a simple interactive widget that
runs inside of a Python Jupyter Notebook. This publicly available widget is very
similar to the annotation interface we created for the real estate data and can
be seen in supplemental Figure 4. We store the tweets in a text file, which the
interface uses to select tweets at random. Once a tweet is selected, the annotator
can toggle the words in the tweet that they believe match the sentiment that was
predicted. For example, if a tweet was labeled as having a positive sentiment,
then only words that are relevant to that sentiment will be selected and stored
in the annotation. This allows us to store annotations in a very similar format
to the ones that were stored for our real estate data.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>Each classifier was subjected to iterative cross-validation as a function of the
training size for each category. This procedure is summarized in Figure 1. The
average F1 score across all iterations for each training set size was calculated
and plotted as a function of training set size. Two examples are shown in Figure
2. The area under this curve was then calculated to measure the ability of each
classifier to perform well for small as well as larger training set sizes. The 95%
confidence interval was calculated for the area under the F1 curve as a function
of the training set size.</p>
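      <p>The summary statistic described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used in this work: the area is computed with the trapezoidal rule, the interval is a normal-approximation 95% interval over per-iteration areas, and the F1 values are invented. The text does not specify how the interval was obtained, so that part in particular should be read as one plausible choice.</p>
      <preformat><![CDATA[
```python
import math
import statistics


def area_under_curve(sizes, f1_scores):
    """Trapezoidal area under F1 as a function of training set size."""
    area = 0.0
    for k in range(1, len(sizes)):
        width = sizes[k] - sizes[k - 1]
        area += width * (f1_scores[k] + f1_scores[k - 1]) / 2.0
    return area


def confidence_interval_95(values):
    """Normal-approximation 95% interval for the mean (an assumption of this sketch)."""
    mean = statistics.mean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width


sizes = [2, 4, 6, 8]
# Hypothetical F1 curves, one per cross-validation iteration.
curves = [[0.5, 0.6, 0.7, 0.8], [0.4, 0.6, 0.8, 0.8], [0.6, 0.6, 0.6, 0.8]]
areas = [area_under_curve(sizes, f1) for f1 in curves]
low, high = confidence_interval_95(areas)
```
]]></preformat>
      <p>A classifier that reaches a high F1 score with few training samples accumulates area early, so this statistic rewards good performance at small training set sizes as well as at larger ones.</p>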
    </sec>
    <sec id="sec-6">
      <title>Results and Discussion</title>
      <p>[Figure panels: (a) C. Fenced Backyard; (b) D. Eat-in Kitchen]</p>
      <p>We compared our alignment-based algorithms to bag-of-words derived SVM,
Naïve Bayes, random forest, and KNN classifiers on 30 real estate categories
and positive/negative sentiment tweet prediction. Four algorithms based on
Word2Vec word embedding were evaluated. Two standard distance metrics (mean
and max) were evaluated as baseline classifiers in addition to two novel
alignment-based classifiers (Alignment KNN and Alignment Threshold). These results are
summarized in Table 2, which shows the iterative cross-validation confidence
intervals of the area under the F1 curve versus training set size. The table is sorted
by the average score for the Alignment Threshold method. The first column in
the table summarizes how this method performed in comparison to
the others. If the Alignment Threshold method had a higher and non-overlapping
confidence interval when compared to all other methods, this is indicated with
a W+. If the lower bound of the Alignment Threshold interval was higher than
that of any other method but the intervals overlapped, this is indicated with a W. A T was used
if the Alignment Threshold confidence interval was not higher than any other
method but still overlapped the best method. All other cases are indicated with
an L. In total, there are 29% W+, 35% W, 19% T, and 16% L, meaning that the
Alignment Threshold method was as good or better in 84% of the classes. These
results demonstrate that the alignment-based algorithms perform
better than or equivalent to standard approaches on fewer samples. This is
significant even in the case of equivalent accuracy, as the alignment-based threshold
algorithm does not require negative training samples, which reduces the number
of samples an expert must annotate. Further, all alignment-based methods are
built upon a semi-supervised learning approach in which large amounts of
unlabeled data are used to reduce the need for labeled data.</p>
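      <p>The W+/W/T/L labeling can be read as a small decision rule over confidence intervals. The sketch below encodes one interpretation of that rule and should not be taken as the exact procedure used to build Table 2: in particular, it assumes that the "best method" is the competitor with the highest lower bound, and the function name and interval values are hypothetical.</p>
      <preformat><![CDATA[
```python
def summarize_interval(alignment, others):
    """Label how the Alignment Threshold interval compares with the other methods.

    `alignment` is a (low, high) interval; `others` maps a method name
    to its (low, high) interval.
    """
    low, high = alignment
    if all(low > other_high for _, other_high in others.values()):
        return "W+"  # higher than and disjoint from every other interval
    if all(low > other_low for other_low, _ in others.values()):
        return "W"   # highest lower bound, but some intervals overlap
    # Assumption: the best competitor is the one with the highest lower bound.
    best_low, _ = max(others.values())
    if high >= best_low:
        return "T"   # not the highest, but overlaps the best competitor
    return "L"


# Hypothetical intervals, invented purely for illustration.
others = {"SVM": (0.50, 0.60), "KNN": (0.40, 0.55)}
labels = [
    summarize_interval((0.65, 0.70), others),
    summarize_interval((0.55, 0.70), others),
    summarize_interval((0.45, 0.58), others),
    summarize_interval((0.20, 0.30), others),
]
```
]]></preformat>
      <p>Under this reading, W+, W, and T together cover every class in which the Alignment Threshold interval is at least statistically indistinguishable from the best competing method.</p>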
      <p>
        It was our desire to test our algorithm against the multi-level kernel system
developed by Kim et al., discussed in Section 2; however, we were unable to find a
public implementation of this algorithm, and our attempts to reach the authors
were not successful. In their paper, the authors evaluated their algorithm using
the Twitter dataset also used in this paper [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which includes 12,348 tweets that
are each matched with an overall sentiment value. The multi-level kernel system
was able to achieve accuracy between around 80% and 90% using sample sizes
ranging from the low hundreds to the mid thousands, with their best accuracy
result being 0.808 (with a standard test data split of 25%). No results were
presented for sample sizes less than 50.
      </p>
      <p>Further inspection of Table 2 shows that in some cases the Alignment
Threshold classifier performs poorly while the Alignment KNN classifier, which uses
both positive and negative training examples, performs better than or equivalent to
standard approaches. We believe this is due to the underlying word
embedding vectors for the specific targets. Words such as those found in positive and
negative tweets are relatively ubiquitous, while those found when mentioning
a walk-in closet are relatively rare. It is reasonable to assume that hard coded
rules could also be developed in some cases, but that the approach presented in
this paper would be preferred as it is easily extended to additional categories
without the need to maintain a rule management system.</p>
      <p>The original annotation system was researched and built specifically for a
company specializing in real estate technology. Because this data is proprietary,
we also tested our algorithm on a publicly available Twitter
sentiment prediction dataset. The original JavaScript Angular-based system is
proprietary, but we have implemented the core functionality of our system using
Jupyter Notebook widgets. The widget loads directly inside of the Notebook and
displays a tweet’s text, its overall sentiment value, and buttons that represent
each word in the tweet. These buttons can be toggled to create the annotation
pattern. All of this data is stored in a Pandas DataFrame and serialized to
its own file, which can be used to train and evaluate our algorithm. We have
made this version of the annotation interface open source in addition to all
alignment-based implementations and evaluation code
(https://github.com/Anderson-Lab/sentence-annotation).</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>Being able to extract topics from text is an ever-growing problem for many
domains given the large amount of textual data that is constantly being created.
Therefore, it is necessary to minimize the amount of annotating needed to achieve
high levels of accuracy. To accomplish this, we introduced a novel topic prediction
algorithm that requires only a small amount of human annotation. Our results
show that this approach can provide significant performance benefits when the
target labels are not known a priori or when the sample size is small. Future
directions of this work include experiments to determine if these alignment-based
distance metrics continue to provide non-redundant benefits as the sample size
grows significantly. For community and reproducibility purposes, our methods
are available in a public repository that includes a Jupyter Notebook annotation
widget that allows for annotation to easily be carried out on other real world
datasets.</p>
    </sec>
    <sec id="sec-8">
      <title>Supplemental</title>
      <p>Our work was directly motivated by the need to reduce the burden of interaction with human
supervisors. We therefore present an overview of our architecture, which can
be broken up into four distinct phases: preprocessing, annotation, learning, and
evaluation. Figure 3 illustrates these phases and the steps that are taken in each.</p>
      <p>Preprocessing &amp; Cleaning. The first step in our pipeline involves processing
the raw listing description data and converting it into a consistent format. Each
description is stripped of any extra whitespace, transformed into all lowercase
letters, and decoded using the UTF-8 character encoding. Once this has been
done, the description is separated into its constituent sentences. These sentences
are then broken up into lists of words that can be used in our alignment
algorithm.</p>
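      <p>The preprocessing steps above can be sketched with the Python standard library. This is a minimal illustration, not the pipeline used in this work: the sentence splitter is a naive regular expression chosen for the example, the word pattern keeps only ASCII letters, digits, apostrophes, and hyphens, and the function name and sample listing are hypothetical.</p>
      <preformat><![CDATA[
```python
import re


def preprocess(raw_description):
    """Turn a raw listing description (bytes or str) into word lists, one per sentence."""
    if isinstance(raw_description, bytes):
        raw_description = raw_description.decode("utf-8")
    # Collapse runs of whitespace and lowercase the text.
    text = " ".join(raw_description.split()).lower()
    # Naive sentence splitter (an assumption): break after ., !, or ? plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    # Keep runs of letters, digits, apostrophes, and hyphens as words.
    return [re.findall(r"[a-z0-9'-]+", sentence) for sentence in sentences]


listing = b"  Two master bedrooms are located downstairs.\nEat-in kitchen with  granite counters! "
sentences = preprocess(listing)
```
]]></preformat>
      <p>Each inner list is one sentence in the form expected by a word-level alignment, with punctuation removed and casing normalized.</p>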
      <p>Annotation. In order to expedite the annotation process, we made two simple
annotation interfaces, one for each of the datasets that we tested our algorithm
on. A screenshot of the Jupyter widget is shown in Figure 4.</p>
      <p>[Figure 3: the four phases of the architecture: Preprocessing, Annotation, Learning, Evaluation]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Andrzejewski</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craven</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Incorporating domain knowledge into topic modeling via Dirichlet Forest priors</article-title>
          .
          <source>In: Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          (
          <year>2009</year>
          ). https://doi.org/10.1145/1553374.1553378, http://portal.acm.org/citation.cfm?doid=1553374.1553378
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loper</surname>
          </string-name>
          , E.:
          <article-title>Natural language processing with Python</article-title>
          .
          <source>No. January</source>
          <year>2009</year>
          ,
          <string-name>
            <given-names>O</given-names>
            <surname>'Reilly Media</surname>
          </string-name>
          , Inc. (
          <year>2009</year>
          ). https://doi.org/10.17509/ijal.v1i1.106, https://books.google.com/books?id=KGIbfiiP1i4C&amp;pg=PR5
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Boella</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caro</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruggeri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robaldo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Learning from syntax generalizations for automatic semantic annotation</article-title>
          .
          <source>Journal of Intelligent Information Systems</source>
          <volume>43</volume>
          (
          <issue>2</issue>
          ),
          <fpage>231</fpage>
          -
          <lpage>246</lpage>
          (
          <year>2014</year>
          ). https://doi.org/10.1007/s10844-014-0320-9
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Buhrmester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gosling</surname>
            ,
            <given-names>S.D.</given-names>
          </string-name>
          :
          <article-title>Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data?</article-title>
          <source>Perspectives on Psychological Science</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          -
          <lpage>5</lpage>
          (
          <year>2011</year>
          ). https://doi.org/10.1177/1745691610393980
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text categorization with support vector machines: Learning with many relevant features</article-title>
          .
          <source>In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          . vol.
          <volume>1398</volume>
          , pp.
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          (
          <year>1998</year>
          ). https://doi.org/10.1007/BFb0026683
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rousseau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vazirgiannis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Convolutional Sentence Kernel from Word Embeddings for Short Text Categorization</article-title>
          .
          <source>Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <fpage>775</fpage>
          -
          <lpage>780</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          :
          <article-title>Naive (Bayes) at forty: The independence assumption in information retrieval</article-title>
          . pp.
          <fpage>4</fpage>
          -
          <lpage>15</lpage>
          (
          <year>1998</year>
          ). https://doi.org/10.1007/BFb0026666, http://link.springer.com/10.1007/BFb0026666
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>McNamara</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de la Vega</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yarkoni</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Developing a comprehensive framework for multimodal feature extraction</article-title>
          .
          <source>In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          (
          <year>2017</year>
          ). https://doi.org/10.1145/3097983.3098075
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>CoRR abs/1301.3781</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          (
          <year>2013</year>
          ). https://doi.org/10.1162/153244303322533223, http://arxiv.org/abs/1301.3781
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yih</surname>
            ,
            <given-names>W.t.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zweig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Linguistic regularities in continuous space word representations</article-title>
          .
          <source>Proceedings of NAACL-HLT</source>
          ,
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          (
          <year>2013</year>
          ). https://doi.org/10.3109/10826089109058901, http://www.aclweb.org/anthology/N13-1090
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenthal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>SemEval-2013 Task 2: Sentiment Analysis in Twitter</article-title>
          .
          <source>Proceedings of the International Workshop on Semantic Evaluation (SemEval-2013)</source>
          <volume>2</volume>
          ,
          <fpage>312</fpage>
          -
          <lpage>320</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Needleman</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wunsch</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>A general method applicable to the search for similarities in the amino acid sequence of two proteins</article-title>
          .
          <source>Journal of Molecular Biology</source>
          <volume>48</volume>
          (
          <issue>3</issue>
          ),
          <fpage>443</fpage>
          -
          <lpage>453</lpage>
          (
          <year>1970</year>
          ). https://doi.org/10.1016/0022-2836(70)90057-4
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patrick</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Text Mining in Clinical Domain</article-title>
          .
          <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16</source>
          . pp.
          <fpage>549</fpage>
          -
          <lpage>558</lpage>
          (
          <year>2016</year>
          ). https://doi.org/10.1145/2939672.2939720, http://dl.acm.org/citation.cfm?doid=2939672.2939720
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Pawar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gawande</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A comparative study on different types of approaches to text categorization</article-title>
          .
          <source>International Journal of Machine Learning and Computing</source>
          <volume>2</volume>
          (
          <issue>4</issue>
          ),
          <fpage>423</fpage>
          -
          <lpage>426</lpage>
          (
          <year>2012</year>
          ). https://doi.org/10.7763/IJMLC.2012.V2.158
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
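          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Multimodal Learning with Deep Boltzmann Machines</article-title>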
          .
          <source>Advances in neural information processing systems (NIPS) 15</source>
          ,
          <fpage>2222</fpage>
          -
          <lpage>2230</lpage>
          (
          <year>2012</year>
          ). https://doi.org/10.1109/CVPR.2013.49
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Vincent</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lajoie</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manzagol</surname>
            ,
            <given-names>P.-A.</given-names>
          </string-name>
          :
          <article-title>Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>11</volume>
          ,
          <fpage>3371</fpage>
          -
          <lpage>3408</lpage>
          (
          <year>2010</year>
          ). https://doi.org/10.1111/1467-8535.00290
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Emery</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Targeted Topic Modeling for Focused Analysis</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16</source>
          (
          <year>2016</year>
          ). https://doi.org/10.1145/2939672.2939743
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Josifovski</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          :
          <article-title>Annotating Needles in the Haystack without Looking</article-title>
          .
          <source>Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '15</source>
          . pp.
          <fpage>2257</fpage>
          -
          <lpage>2266</lpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1145/2783258.2788580, http://dl.acm.org/citation.cfm?doid=2783258.2788580
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>