<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classification of Insincere Questions with ML and Neural Approaches</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>MT &amp; NLP Lab, LTRC</institution>
          ,
          <addr-line>IIIT-Hyderabad</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Classification of Insincere Questions (CIQ) task at FIRE 2019 focuses on differentiating proper information seeking questions from different kinds of insincere questions. As part of this task, we (team A3-108) submitted different machine learning and neural network based models. Our best performing model was an ensemble of gradient boosting, random forest and 3-nearest neighbor classifiers with majority voting. This model correctly classified 62.37% of the questions, and we secured third position in the task.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Neural Networks</kwd>
        <kwd>Adaboost</kwd>
        <kwd>LSTM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, community question answering forums have seen an upswing.
The number of users of such forums has recorded exponential growth. Toxic,
malicious and hate related posts pose the biggest challenge to most of them.
In this task, an attempt has been made to filter out malicious content from the
Quora forum in order to keep the platform safer for its users.
      </p>
      <p>
        The task aimed at distinguishing true information seeking questions (ISQ) from
non-information seeking questions (NISQ). Six fine grained classes were designed
for this classification, and their distribution in the given training corpus is
shown in table 1. This task is motivated by an earlier task which focused
on the binary classification of sincere versus insincere questions. The
current task is a finer grained counterpart for classifying questions posted on Quora. As
the statistics in the table suggest, the dataset is highly imbalanced:
two classes constitute the majority of the samples.
      </p>
      <p>
        Preprocessing plays a vital role in tasks where the input data is in textual format.
We did not use any external tokenizer for tokenizing the input. Punctuation
was discarded and white space acted as the delimiter between words.
We used TF-IDF vectors at character and word levels for this task. We
experimented with individual classifiers as well as their ensembles. Different voting
procedures were also tried out. In hard voting, the class labels are predicted
by majority voting among the participating classifiers. In soft
voting, the voting classifier picks the class with the largest sum of predicted
probabilities across the constituent classifiers. The following classifiers were
implemented using the scikit-learn [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] machine learning library:
      </p>
      <list list-type="bullet">
        <list-item><p>Linear SVM</p></list-item>
        <list-item><p>Multinomial Naive Bayes (mNB)</p></list-item>
        <list-item><p>AdaBoost (Adaptive Boosting)</p></list-item>
        <list-item><p>Gradient Boosting (GB)</p></list-item>
        <list-item><p>Random Forest (RF)</p></list-item>
        <list-item><p>k-Nearest Neighbor (k-NN)</p></list-item>
        <list-item><p>Voting Classifier</p></list-item>
      </list>
      <p>We tried various combinations of word and character level n-grams for the
classification. Through grid search, we observed that combining word
unigrams and bigrams outperformed character level n-gram TF-IDF vectors as
well as the combination of character and word level n-grams. The final
submission was a hard voting classifier consisting of gradient boosting, random forest
and 3-nearest neighbors classifiers.</p>
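      <p>The configuration described above can be sketched roughly as follows with scikit-learn. This is a minimal illustration under stated assumptions, not the submitted system: the helper names, the lowercasing step, the default hyperparameters and the toy questions with their placeholder labels are all hypothetical and are not taken from the task data.</p>
      <preformat>
```python
import string

from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def whitespace_tokenize(text):
    """Drop punctuation, then split on white space (no external tokenizer)."""
    return text.translate(_STRIP_PUNCT).split()


def build_hard_voting_pipeline(seed=0):
    """Word unigram+bigram TF-IDF feeding a majority-vote ensemble."""
    vectorizer = TfidfVectorizer(
        tokenizer=whitespace_tokenize,
        token_pattern=None,   # unused because a custom tokenizer is supplied
        lowercase=True,       # assumption: case folding is not stated in the text
        ngram_range=(1, 2),   # word unigrams and bigrams
    )
    ensemble = VotingClassifier(
        estimators=[
            ("gb", GradientBoostingClassifier(random_state=seed)),
            ("rf", RandomForestClassifier(random_state=seed)),
            ("knn", KNeighborsClassifier(n_neighbors=3)),
        ],
        voting="hard",  # "soft" would instead sum the predicted probabilities
    )
    return make_pipeline(vectorizer, ensemble)


# Hypothetical toy data; the labels are placeholders, not the task's six classes.
toy_questions = [
    "How do I learn Python quickly?",
    "What is the capital of France?",
    "How does a random forest work?",
    "Why is everyone who likes X so dumb?",
    "Isn't it obvious that Y is a fraud?",
    "Why do fools keep asking about Z?",
]
toy_labels = ["isq", "isq", "isq", "nisq", "nisq", "nisq"]

model = build_hard_voting_pipeline()
model.fit(toy_questions, toy_labels)
predictions = model.predict(["How do I learn French?"])
```
      </preformat>
      <p>Switching the voting argument to "soft" would make the ensemble sum the class probabilities of its members instead of counting their votes, which corresponds to the second voting procedure described earlier.</p>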
      <p>
        <bold>Neural Network Models.</bold>
We also experimented with neural network based sequential classifiers that
take word level features as input. Using the Sequential pipeline of Keras
(https://keras.io), we built an Embedding layer (100 dimensions) followed by an LSTM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] layer (64 units).
The pipeline ends in a dense output layer with softmax activation, and we used
categorical cross-entropy as the loss function along with the Adam optimizer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We
trained this classifier for 20 epochs with an early stopping criterion. Apart from
this classifier, we also tried a CNN+LSTM classifier and an LSTM classifier with
pre-trained GloVe [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] embeddings. The performance of these two
classifiers was considerably poor, so we excluded them from further
experimentation and reporting. We show and discuss the results in detail in the
Results section.
      </p>
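      <p>A sequential classifier of this kind might look roughly like the sketch below, written against the Keras API bundled with TensorFlow. Only the layer sizes, loss, optimizer and epoch budget come from the description above. The vocabulary size, sequence length, early stopping patience, validation split and the random stand-in data are assumptions made for illustration.</p>
      <preformat>
```python
import numpy as np
from tensorflow import keras


def build_lstm_classifier(vocab_size, num_classes, max_len):
    """Embedding (100 dimensions), LSTM (64 units), softmax output layer."""
    model = keras.Sequential(
        [
            keras.Input(shape=(max_len,), dtype="int32"),
            keras.layers.Embedding(input_dim=vocab_size, output_dim=100),
            keras.layers.LSTM(64),
            keras.layers.Dense(num_classes, activation="softmax"),
        ]
    )
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Hypothetical sizes and random stand-in data, used only to exercise the model.
VOCAB_SIZE, NUM_CLASSES, MAX_LEN = 500, 6, 20
rng = np.random.default_rng(0)
x_train = rng.integers(1, VOCAB_SIZE, size=(32, MAX_LEN))
y_train = keras.utils.to_categorical(
    rng.integers(0, NUM_CLASSES, size=32), num_classes=NUM_CLASSES
)

model = build_lstm_classifier(VOCAB_SIZE, NUM_CLASSES, MAX_LEN)
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=2,                 # assumption: the stopping criterion is not specified
    restore_best_weights=True,
)
history = model.fit(
    x_train,
    y_train,
    validation_split=0.25,      # assumption: held-out share used for early stopping
    epochs=20,                  # upper bound; early stopping may end training sooner
    callbacks=[early_stop],
    verbose=0,
)
probabilities = model.predict(x_train[:4], verbose=0)
```
      </preformat>
      <p>In practice the integer sequences would come from mapping each whitespace-separated word to a vocabulary index and padding every question to a common length.</p>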
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>
        Different classifiers were trained to predict the class of each question. We include
the outputs of the top performing systems in table 2.
We observed that the boosting methods, gradient boosting and AdaBoost [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
perform better than the others for this task, with the latter being the best. We attribute
this to the weighted combination of different weak classifiers in AdaBoost. In
community QA forums like Quora, spelling variations are fewer
than on social media, where character constraints apply. Word n-gram based
TF-IDF was therefore superior to its character level counterpart. The machine learning
approaches outperformed the neural networks. This could be due to the large number
of parameters that deep learning approaches try to learn from a very limited
amount of data.
      </p>
      <p>
        Based on the above results, we automatically analyzed the training data to
understand the difficulty inherent in this community question answering task.
After basic tokenization and cleaning, we applied LDA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to the training
data (without considering the labels) and derived 6 text clusters from it. We used
the Gensim toolkit [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for this. Figure 1 shows the derived text clusters.
Topic-5 clearly hints at the Sexual content class, but from the rest of the
topics it is difficult to infer the other classes.
      </p>
      <p>
        We also used the LDA model to analyze the training texts by plotting them in
two dimensions with t-SNE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Figure 2 shows the training texts with the
labels obtained from LDA, and figure 3 shows the same texts with the
annotated class labels from the training data. Both representations show
that classifying these text points is quite difficult, as simple topic modeling
does not provide any major clues about the class boundaries.
      </p>
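      <p>As a rough illustration of this projection step, the sketch below maps six-dimensional topic distributions to two dimensions with the t-SNE implementation in scikit-learn. The document-topic matrix is synthetic, the perplexity is an arbitrary choice, and taking the most probable topic as the LDA label is an assumption about how such labels could be derived.</p>
      <preformat>
```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for an (n_documents x 6) LDA document-topic matrix.
rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(alpha=np.full(6, 0.3), size=60)

# Assumed labelling rule: each document gets its most probable topic.
lda_labels = doc_topics.argmax(axis=1)

tsne = TSNE(
    n_components=2,    # project to two dimensions for plotting
    perplexity=10,     # must be smaller than the number of documents
    init="random",
    random_state=0,
)
points_2d = tsne.fit_transform(doc_topics)
```
      </preformat>
      <p>Plotting points_2d once colored by lda_labels and once colored by the annotated classes would give two views analogous to the two figures discussed above.</p>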
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>We presented our supervised approaches for the FIRE task on classification of
insincere questions (CIQ) in Quora for English. From our experiments, we
argue that for low resource and imbalanced tasks such as CIQ, traditional machine
learning algorithms with feature engineering outperform recent neural network
based approaches. The AdaBoost classifier with word unigram and bigram TF-IDF
features performed the best among all the classifiers. In future work, large amounts of
unlabeled questions from Quora could be explored to improve the clustering techniques.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research 3(Jan)</source>
          ,
          <volume>993</volume>
          –
          <fpage>1022</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Freund</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schapire</surname>
          </string-name>
          , R.E.:
          <article-title>A decision-theoretic generalization of on-line learning and an application to boosting</article-title>
          .
          <source>Journal of Computer and System Sciences 55(1)</source>
          ,
          <volume>119</volume>
          –
          <fpage>139</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation 9(8)</source>
          ,
          <volume>1735</volume>
          –
          <fpage>1780</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Maaten</surname>
          </string-name>
          , L.v.d.,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.:
          <article-title>Visualizing data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research 9(Nov)</source>
          ,
          <volume>2579</volume>
          –
          <fpage>2605</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
          </string-name>
          , E.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <volume>2825</volume>
          –
          <fpage>2830</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : GloVe:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <volume>1532</volume>
          –
          <issue>1543</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Rehurek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          . pp.
          <volume>45</volume>
          –
          <fpage>50</fpage>
          . ELRA, Valletta, Malta (May
          <year>2010</year>
          ), http://is.muni.cz/publication/884893/en
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>