=Paper=
{{Paper
|id=Vol-2517/T5-4
|storemode=property
|title=IIT-BHU at CIQ 2019: Classification of Insincere Questions
|pdfUrl=https://ceur-ws.org/Vol-2517/T5-4.pdf
|volume=Vol-2517
|authors=Akanksha Mishra,Sukomal Pal
|dblpUrl=https://dblp.org/rec/conf/fire/MishraP19a
}}
==IIT-BHU at CIQ 2019: Classification of Insincere Questions==
IIT-BHU at CIQ 2019: Classification of Insincere Questions

Akanksha Mishra and Sukomal Pal
Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi - 221005, India.
{akanksham.rs.cse17,spal}@itbhu.ac.in
https://cse-iitbhu.github.io/irlab/index.html

Abstract. This paper presents the work done by the team of IIT (BHU) Varanasi for the “Classification of Insincere Questions” track organized at FIRE 2019. We implement a bidirectional long short-term memory network using pre-trained GloVe word embeddings. Further, we analysed the results with several other embeddings.

Keywords: Quora insincere question · Bidirectional Long Short-Term Memory · GloVe embedding.

1 Introduction

In today’s era, all the major websites on the internet face the challenge of keeping the content they host appropriate. They rely on their users to flag unsafe content so that it can be removed and the website kept safe. Quora is one of the most widely used websites on the internet, with a user base of 300 million monthly users¹. Users can ask questions on any topic that affects them or the world, seek the opinions of experts about real-world incidents, or ask how others would have tackled a particular situation, and get interesting answers about all the things they care about.

2 Task Definition

People frequently use web forums like StackOverflow, Quora, and many more to get answers to their information-seeking questions. However, with such a vast user base, some people tend to ask questions with objectionable content. Sometimes, questions are posted to target a specific group or to spread hate speech. Due to the vast number of questions, it is challenging for human moderators to filter out insincere questions manually.
The task is the identification of non-information-seeking questions with varying characteristics into one of six categories:

– Rhetorical questions
– Sexual content
– Hate speech
– Hypothetical
– Other
– Not an insincere question

¹ https://expandedramblings.com/index.php/quora-statistics/

“Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 December 2019, Kolkata, India.”

3 Data and Resources

Task organizers provided participants with the question id and the label assigned to each question for 900 training instances. Due to the dataset-sharing constraints of Kaggle, they could not provide the question text; hence the participants were expected to extract the question text from the competition hosted on the Kaggle² website using the question id. We extracted the question text from the data section of the competition on Kaggle and formed the training set. Each instance of the training set consists of the question id, the question text, and the label assigned to it. A sample of the training set is given in Table 1. Figure 1 shows the distribution of training instances among the different categories; it can be seen that the training instances are imbalanced across categories.

Table 1. Sample of training set

  qid                   qtext                                                                             label
  03c5993d2c4898c57e49  Is it OK to be white?                                                             1
  05981e7a85209fe81046  Why have 50% of women in Finland been raped?                                      2
  0599b2f6bf4ce21d0dea  Why Mohajirs hate Punjabi, Sindhi, Baloch, Pathan and pretty much everyone?       3
  05e945cc9bf993ae9a5d  Can I get my desired gazetted officer government job by the help of black magic?  4
  0485c68c75ca7ce02272  Can bleach cure autism?                                                           5
  034dadb82db0211e2ca7  How do I sign up for Quora account?                                               0

4 System Description

This section discusses the implementation of our approach, as shown in Algorithm 1.
Firstly, we perform preprocessing on the question text, followed by feature extraction using pre-trained embeddings and training of a bidirectional Long Short-Term Memory model.

Data Preprocessing: We preprocess the question text by removing punctuation and the ‘#’, ‘@’ and ‘https’ symbols. We keep the stop words and hashtag words to better capture the context during training. We also remove numerals, since they play different roles in different question contexts; hence, we feel it is better to remove them. We also lowercase all question text.

² https://www.kaggle.com/c/quora-insincere-questions-classification

Fig. 1. Number of instances in each category in training set

Algorithm 1: System Description
Data: Training and test instances
Result: Predict true labels of all test instances
1 Generate a list of tokens for each instance of the train and test set
2 Remove punctuation and the ‘#’, ‘@’ and ‘https’ symbols from the generated tokens
3 Remove digits and convert all tokens to lowercase
4 Obtain the maximum instance length from the train and test data
5 Pad all instances whose length is less than the maximum length obtained in Step 4
6 Generate the vocabulary of the dataset
7 Extract a 300-dimensional embedding from the GloVe pre-trained embeddings for each word of the vocabulary
8 If a vocabulary word is not present in the GloVe pre-trained embeddings, generate a random embedding of dimension 300
9 Represent all labels using one-hot encoding
10 Build a sequential model consisting of Bidirectional LSTM, dropout and dense layers

Feature Representation: We use pre-trained GloVe [3] word embeddings to represent words as vectors. Several versions of GloVe pre-trained embeddings exist; we use the 300-dimensional embeddings trained on Common Crawl with 840B tokens and a 2.2M-word vocabulary³.
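The preprocessing steps described above (Steps 1–3 of Algorithm 1) can be sketched as follows; the function name and the exact regular expressions are our own illustrative choices, not taken from the paper:

```python
import re

def preprocess(text):
    """Tokenize a question text as described in Algorithm 1, Steps 1-3."""
    text = text.lower()                    # lowercase all text
    text = re.sub(r"https\S*", "", text)   # drop 'https' URL tokens
    text = re.sub(r"[#@]", "", text)       # remove '#' and '@' symbols,
                                           # keeping the hashtag word itself
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    text = re.sub(r"\d+", "", text)        # remove numerals
    return text.split()                    # list of tokens (stop words kept)
```

For example, `preprocess("Is it OK to be white?")` yields `["is", "it", "ok", "to", "be", "white"]`: stop words survive, while punctuation does not.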
We generated random embeddings of dimension 300 for out-of-vocabulary words.

Model Description: We determine the maximum sentence length over the question texts of all training instances and pad every question text whose length is less than this maximum. We use a bidirectional Long Short-Term Memory [1, 4] layer followed by a dropout layer to avoid overfitting, and add a fully connected dense layer at the end.

5 Results

In this section, we discuss the experimental settings, the results obtained with the model, and further analysis of the results.

Experimental Settings: We use the Keras⁴ neural network library, with TensorFlow as backend, to train our model. The model is trained for ten epochs with a batch size of 32. We use a validation split of 0.3 to monitor, via the validation loss, any overfitting that may occur during training. Table 2 lists the parameter and hyperparameter values used for training the model.

Table 2. Parameters and hyperparameters

  Parameters / Hyperparameters  Values
  BiLSTM activation function    tanh
  Recurrent dropout             0.2
  Dropout                       0.3
  Dense activation function     softmax
  Optimizer                     adam
  Loss                          categorical cross-entropy

Results: We obtained an accuracy of 64.35% on the test set, which consists of 101 instances. The accuracy was calculated and shared by the task organizers.

Analysis: Since the task organizers shared the true labels of the test instances, we used several other word embeddings for further analysis. The accuracy obtained using different embeddings with the Bidirectional LSTM model is listed in Table 3. Model M2 was submitted for the evaluation. In model M1 we trained embeddings on the vocabulary of the train set using the word2vec [2] continuous-bag-of-words architecture, while in model M3 we used the pre-trained paragram [5] embeddings.

³ https://nlp.stanford.edu/projects/glove/
⁴ https://keras.io
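A rough Keras sketch of Steps 7–10 of Algorithm 1 together with the settings in Table 2 could look like the following. This is not the authors' code: the vocabulary size, maximum length, number of LSTM units, and the placeholder `glove` dictionary are all assumptions for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

VOCAB_SIZE = 5000   # hypothetical vocabulary size
MAX_LEN = 60        # hypothetical maximum question length (Step 4)
EMBED_DIM = 300     # GloVe dimension used in the paper
NUM_CLASSES = 6     # the six target categories

# Steps 7-8: GloVe vectors where available, a random 300-d vector otherwise.
# `glove` stands in for the parsed GloVe file (word -> vector); it is empty
# here, so every word falls back to a random embedding.
glove = {}
rng = np.random.default_rng(0)
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))
for idx in range(1, VOCAB_SIZE):   # index 0 reserved for padding
    word = f"word{idx}"            # placeholder for the actual vocabulary
    embedding_matrix[idx] = glove.get(word, rng.standard_normal(EMBED_DIM))

# Step 10 with the Table 2 settings: BiLSTM -> dropout -> dense softmax.
model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM,
              embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)),
    Bidirectional(LSTM(64, activation="tanh", recurrent_dropout=0.2)),  # 64 units assumed
    Dropout(0.3),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Training as described, with one-hot labels (Step 9), would then be:
# model.fit(x_train, y_train_onehot, epochs=10, batch_size=32, validation_split=0.3)
```

The softmax dense layer paired with categorical cross-entropy matches the one-hot label encoding of Step 9; the dropout layers are the paper's stated guard against overfitting on the small 900-instance training set.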
In the case of GloVe, we observed 60 out-of-vocabulary words, whereas only 40 words of the vocabulary were not present in the paragram embeddings. All three embeddings represent each word with a 300-dimensional vector.

Table 3. Accuracy with different word embeddings

  Model  Embedding  #OOV words  Classifier  Accuracy
  M1     Word2Vec   -           BiLSTM      65.34%
  M2     GloVe      60          BiLSTM      64.35%
  M3     Paragram   40          BiLSTM      63.36%

6 Conclusion

We used a bidirectional Long Short-Term Memory model for the classification of insincere questions on Quora. The system can be used for the automatic classification of insincere questions. We obtained an accuracy of 64.35% on the test set. Incorporating linguistic features could further improve the system.

References

1. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735
2. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013)
3. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014), http://www.aclweb.org/anthology/D14-1162
4. Schuster, M., Paliwal, K.: Bidirectional recurrent neural networks. Trans. Sig. Proc. 45(11), 2673–2681 (Nov 1997). https://doi.org/10.1109/78.650093
5. Wieting, J., Bansal, M., Gimpel, K., Livescu, K.: From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics 3, 345–358 (2015)