=Paper= {{Paper |id=Vol-2036/T4-4 |storemode=property |title=SSN_NLP@INLI-FIRE-2017: A Neural Network Approach to Indian Native Language Identification |pdfUrl=https://ceur-ws.org/Vol-2036/T4-4.pdf |volume=Vol-2036 |authors=D. Thenmozi,Kawshik Kannan,Chandrabose Aravindhan |dblpUrl=https://dblp.org/rec/conf/fire/ThenmoziKA17 }} ==SSN_NLP@INLI-FIRE-2017: A Neural Network Approach to Indian Native Language Identification== https://ceur-ws.org/Vol-2036/T4-4.pdf
      SSN_NLP@INLI-FIRE-2017: A Neural Network Approach to
             Indian Native Language Identification
               D. Thenmozhi                                  Kawshik Kannan                          Chandrabose Aravindan
         SSN College of Engineering                     SSN College of Engineering                    SSN College of Engineering
            Chennai, Tamilnadu                             Chennai, Tamilnadu                            Chennai, Tamilnadu
            theni_d@ssn.edu.in                            kawshik98@gmail.com                           aravindanc@ssn.edu.in

ABSTRACT
Native Language Identification (NLI) is the process of identifying the
native language of non-native speakers based on their speech or writing.
It has several applications, namely authorship profiling and
identification, forensic analysis, second language identification, and
educational applications. English is one of the prominent languages used
by many non-native speakers in the world. The native language of
non-native English speakers may be easily identified from their English
accents. However, identifying the native language from users' posts and
comments written in English is a challenging task. In this paper, we
present a neural network approach to identify the native language of an
Indian speaker based on the English comments that are posted in
microblogs. Lexical features are extracted from the text posted by the
user and are used to build a neural network classifier that identifies the
native language of the user. We have evaluated our approach using the data
set given by the INLI@FIRE2017 shared task.

KEYWORDS
Neural Network, Machine Learning, Language Recognition, Indian Native
Language Identification

1   INTRODUCTION
Native Language Identification (NLI) is the process of automatically
identifying the native language of a person based on her/his speech or
writing in another language. It has several applications, namely
authorship profiling and identification [2], forensic analysis [3], second
language identification [8] and educational applications [11]. Several
research works have been reported on NLI based on the speakers' text [13],
[5], [1], [4], [7], [9] and their speech [10], [12]. English is commonly
used by many people in the world, and several shared tasks on NLI have
been conducted since 2013 to identify the native language based on English
text and speech. In this work, we have focused on the shared task of
INLI@FIRE2017 (co-located with the Forum for Information Retrieval
Evaluation (FIRE), 2017), which aims to identify the native language of
Indians based on their comments posted in social media in English [6]. The
focus of the task is to develop techniques for identifying the native
language, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu, from
a set of Facebook comments.

2   PROPOSED APPROACH
We have implemented a supervised approach for this INLI task. The steps
used in our approach are given below.
    • Preprocess the given text
    • Extract linguistic features for the training data
    • Build a neural network model from the features of the training data
    • Predict the class label for an instance as any of the six languages,
      namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu, using
      the model
   We have implemented our methodology in Python for the INLI task. The
data set used to evaluate the task consists of a set of training data for
six Indian languages and test data. The numbers of training instances are
207, 211, 203, 200, 202 and 210 for the languages Tamil, Hindi, Kannada,
Malayalam, Bengali and Telugu respectively, and the number of test
instances is 783. The steps used in our approach are explained in detail
in the following subsections.

2.1     Feature Extraction
As a preprocessing step, all the XML tags are removed from the given text
and only the body part of the given input is considered for further
processing. Punctuation marks such as “, ”, -, _, ‘ and ’ are removed from
the text, and the terms n't, &, 'm and 'll are replaced with 'not', 'and',
'am' and 'will' respectively before extracting the features. Each term of
the text is annotated with part-of-speech (POS) information such as noun,
verb, adjective, adverb, and determiner. In general, nouns present in the
text can be used as features. However, adjectives may also be helpful to
identify the native language. For example, from the post 'I attended my
kutty brother Rams birthday party', the adjective 'kutty' may be used to
identify the language as Tamil. So, in our approach, we have considered
nouns and adjectives as features. All forms of nouns (NN*), namely NN, NNS
and NNP, and all forms of adjectives (JJ*), namely JJ, JJR and JJS, are
extracted from the text. The feature set is constructed by lemmatizing
each extracted term and by eliminating all the duplicate terms. We have
obtained the bag of words (BOW) by processing all the text of the given
training data.
   We have used the NLTK toolkit¹ to preprocess the given data and to
annotate the text with POS information. The WordNet Lemmatizer was used to
lemmatize the terms that are selected using the POS information. We have
obtained a total of 12067 features from the training data. We have used
the boolean model to construct the feature vectors for the instances of
the training data.

¹ http://www.nltk.org/

2.2     Language Identification
We have applied a neural network approach to identify the native language
of the user. The set of BOW features along with the class labels, namely
Tamil, Hindi, Kannada, Malayalam, Bengali and Telugu, from the training
data are used to build a model using a simple neural network with two
hidden layers. The features are
extracted for each instance of the test data, whose class label is unknown
('?'), in the same way as for the training data, using the features
identified from the training data. One of the given labels, namely Tamil,
Hindi, Kannada, Malayalam, Bengali and Telugu, is identified for each test
data instance using the built model.
   We have used the Keras framework² with the TensorFlow backend to
implement a neural network classifier for this problem. The number of BOW
features (12067) constitutes the number of neurons for the input layer of
the network. We have used the sequential model of Keras to construct our
neural network. We have added two hidden layers with 64 and 32 neurons
respectively, both with the ReLU activation function. The output layer was
added by specifying the number of neurons as 6 (to classify the instance
into one of the 6 languages) with the softmax activation function. We used
the 'sparse_categorical_crossentropy' loss function with the SGD optimizer
to compile the model. We trained the model with a batch_size of 10 for 100
epochs and obtained a training accuracy of 98.1%.

² https://keras.io/

3     RESULTS AND DISCUSSIONS
Our approach for native language identification has been evaluated based
on the metrics precision, recall, and F1 measure for each language, and
also on overall accuracy. The results obtained by our approach are
presented in Table 1. A comparative study of the results of all the
participants of INLI@FIRE2017 is available in [6].

                 Table 1: Performance on Test Data

       Class    Precision (%)    Recall (%)    F1-measure (%)
        BE          46.20           76.20          57.60
        HI          49.40           16.30          24.60
        KA          39.60           48.60          43.60
        MA          31.70           21.70          25.80
        TA          27.50           49.00          35.30
        TE          27.00           21.00          23.60

   We have obtained an overall accuracy of 38.80% using our neural network
approach for the Indian native language identification task. This is very
poor compared to the training accuracy of 98.1% and is an indication of
over-fitting. We need to explore regularization techniques such as dropout
during training to avoid this.

4     CONCLUSION
We have presented a system that uses a neural network model for
identifying the native language, namely Tamil, Hindi, Kannada, Malayalam,
Bengali or Telugu, of Indians from the English comments posted by them in
social media. We have extracted linguistic features from the training data
to build a neural network model with two hidden layers. The data set given
by the INLI@FIRE2017 shared task has been used to evaluate our
methodology. We have obtained an overall accuracy of 38.80%. This is very
poor compared to the training accuracy and indicates over-fitting.
Regularization techniques such as dropout may be used to improve
generalization. A lexical database may be used to correct terms such as
pls, sry, fyi, etc., present in social media text to improve the
performance of the system. The performance may also improve if we select
only the significant features using χ² feature selection [14].

ACKNOWLEDGMENTS
We would like to thank the management of SSN Institutions for funding the
High Performance Computing (HPC) lab where this work is being carried out.

REFERENCES
 [1] Serhiy Bykh and Detmar Meurers. 2014. Exploring Syntactic Features for
     Native Language Identification: A Variationist Perspective on Feature
     Encoding and Ensemble Optimization. In Proceedings of COLING 2014, the
     25th International Conference on Computational Linguistics: Technical
     Papers. Ireland, 1962–1973.
 [2] Dominique Estival, Tanja Gaustad, Ben Hutchinson, Son Bao Pham, and
     Will Radford. 2007. Author profiling for English emails. In
     Proceedings of the 10th Conference of the Pacific Association for
     Computational Linguistics. ACL, Australia, 263–272.
 [3] John Gibbons. 2003. Forensic linguistics: An introduction to language
     in the justice system. Wiley-Blackwell.
 [4] Radu-Tudor Ionescu, Marius Popescu, and Aoife Cahill. 2014. Can
     characters reveal your native language? A language-independent
     approach to native language identification. In Proceedings of the 2014
     Conference on Empirical Methods in Natural Language Processing
     (EMNLP). ACL, 1363–1373.
 [5] Scott Jarvis, Yves Bestgen, and Steve Pepper. 2013. Maximizing
     Classification Accuracy in Native Language Identification. In
     Proceedings of the Eighth Workshop on Innovative Use of NLP for
     Building Educational Applications. ACL, Atlanta, Georgia, 111–118.
 [6] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo
     Rosso. 2017. Overview of the INLI PAN at FIRE-2017 Track on Indian
     Native Language Identification. In Notebook Papers of FIRE 2017. CEUR
     Workshop Proceedings, Bangalore, India.
 [7] Shervin Malmasi and Mark Dras. 2017. Native Language Identification
     using Stacked Generalization. arXiv preprint arXiv:1703.06541 (2017).
 [8] Shervin Malmasi, Mark Dras, et al. 2014. Language Transfer Hypotheses
     with Linear SVM Weights. In Proceedings of the 2014 Conference on
     Empirical Methods in Natural Language Processing (EMNLP). ACL, Qatar,
     1385–1390.
 [9] Elham Mohammadi, Hadi Veisi, and Hessam Amini. 2017. Native Language
     Identification Using a Mixture of Character and Word N-grams. In
     Proceedings of the 12th Workshop on Innovative Use of NLP for Building
     Educational Applications. ACL, Copenhagen, Denmark, 210–216.
[10] Taraka Rama and Çağrı Çöltekin. 2017. Fewer features perform well at
     Native Language Identification task. In Proceedings of the 12th
     Workshop on Innovative Use of NLP for Building Educational
     Applications. 255–260.
[11] Alla Rozovskaya and Dan Roth. 2011. Algorithm selection and model
     adaptation for ESL correction tasks. In Proceedings of the 49th Annual
     Meeting of the Association for Computational Linguistics: Human
     Language Technologies-Volume 1. ACL, Portland, Oregon, USA, 924–933.
[12] Charese Smiley and Sandra Kübler. 2017. Native Language Identification
     using Phonetic Algorithms. In Proceedings of the 12th Workshop on
     Innovative Use of NLP for Building Educational Applications. 405–412.
[13] Joel Tetreault, Daniel Blanchard, Aoife Cahill, and Martin Chodorow.
     2012. Native tongues, lost and found: Resources and empirical
     evaluations in native language identification. Proceedings of COLING
     2012 (2012), 2585–2602.
[14] D Thenmozhi, P Mirunalini, and Chandrabose Aravindan. 2016. Decision
     Tree Approach for Consumer Health Information Search. In FIRE (Working
     Notes). CEUR, Kolkata, India, 221–225.
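
The feature extraction of Section 2.1 can be illustrated with a minimal
sketch. It is not the authors' code: the function names are hypothetical,
the POS tags in the example are assumed rather than produced by a tagger,
and lowercasing stands in for the WordNet lemmatizer so that the sketch
runs without NLTK data. Replacing punctuation with a space (rather than
deleting it) is also an assumption.

```python
import re

# Noun and adjective tags named in Section 2.1.
NOUN_ADJ_TAGS = {"NN", "NNS", "NNP", "JJ", "JJR", "JJS"}
# Term rewrites named in Section 2.1, applied before punctuation removal.
REPLACEMENTS = [("n't", " not"), ("&", " and "), ("'m", " am"), ("'ll", " will")]
# Punctuation named in Section 2.1 (curly apostrophes are normalised first).
PUNCTUATION = "\"“”-_‘'"


def preprocess(text):
    """Strip XML tags, rewrite contractions, and remove punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop XML tags
    text = text.replace("’", "'")             # normalise curly apostrophes
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    # Assumption: each punctuation mark is replaced by a space.
    text = text.translate(str.maketrans(PUNCTUATION, " " * len(PUNCTUATION)))
    return " ".join(text.split())


def extract_terms(tagged_tokens, lemmatize=str.lower):
    """Keep nouns and adjectives from (word, tag) pairs.

    With NLTK, the pairs could come from nltk.pos_tag(nltk.word_tokenize(text))
    and lemmatize could be WordNetLemmatizer().lemmatize; str.lower is only
    a stand-in here.
    """
    return [lemmatize(word) for word, tag in tagged_tokens if tag in NOUN_ADJ_TAGS]


def build_vocabulary(term_lists):
    """Bag of words: the sorted set of unique terms over all training texts."""
    return sorted({term for terms in term_lists for term in terms})


def boolean_vector(terms, vocabulary):
    """Boolean model: 1 if the feature occurs in the instance, else 0."""
    present = set(terms)
    return [1 if feature in present else 0 for feature in vocabulary]


# Example post from Section 2.1, with assumed POS tags.
tagged = [("I", "PRP"), ("attended", "VBD"), ("my", "PRP$"), ("kutty", "JJ"),
          ("brother", "NN"), ("Rams", "NNP"), ("birthday", "NN"), ("party", "NN")]
terms = extract_terms(tagged)
vocabulary = build_vocabulary([terms, ["party", "song"]])
vector = boolean_vector(terms, vocabulary)
```

In this toy example the vocabulary has six entries; in the paper the same
construction over the full training set yields 12067 features.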

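
The classifier of Section 2.2 can be sketched in Keras as follows. This is
a hedged reconstruction from the description in the text, not the authors'
code: the function name build_model and the dropout_rate parameter are
hypothetical, and the optional dropout layers only illustrate the
regularization that Section 3 proposes to explore (the reported results
were obtained without it).

```python
from tensorflow import keras

LANGUAGES = ["Tamil", "Hindi", "Kannada", "Malayalam", "Bengali", "Telugu"]


def build_model(num_features=12067, dropout_rate=0.0):
    """Feed-forward network: num_features -> 64 -> 32 -> 6 (softmax)."""
    layers = [keras.Input(shape=(num_features,)),
              keras.layers.Dense(64, activation="relu")]
    if dropout_rate > 0:
        # Assumption: dropout after each hidden layer, not used in the paper.
        layers.append(keras.layers.Dropout(dropout_rate))
    layers.append(keras.layers.Dense(32, activation="relu"))
    if dropout_rate > 0:
        layers.append(keras.layers.Dropout(dropout_rate))
    layers.append(keras.layers.Dense(len(LANGUAGES), activation="softmax"))

    model = keras.Sequential(layers)
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="sgd",
                  metrics=["accuracy"])
    return model


model = build_model()
# Training as described in Section 2.2, given a 0/1 feature matrix X_train
# of shape (n, 12067) and integer labels y_train in 0..5:
# model.fit(X_train, y_train, batch_size=10, epochs=100)
```

With 12067 inputs this network has 774,630 trainable parameters, against
1,233 training instances in total, which is consistent with the
over-fitting reported in Section 3.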