BMSCE_ISE@INLI-FIRE-2017: A simple n-gram based approach
           for Native Language Identification
                                        Sowmya Lakshmi B S¹, Dr. Shambhavi B R²
                                  Department of ISE, BMS College of Engineering, Bangalore, India
                                 sowmyalakshmibs.ise@bmsce.ac.in¹, shambhavibr.ise@bmsce.ac.in²
ABSTRACT
Native Language Identification (NLI) aims to identify the native language L1 of an author by analysing text written by the author in another language L2. NLI is often implemented as a supervised classification problem. In this paper, we report an NLI system built on character tri-grams and word uni-grams and bi-grams, using a linear classifier, Support Vector Machines (SVM). The system participated in the Indian Native Language Identification@FIRE 2017 shared task, achieving an overall accuracy of 0.27 on the test corpus covering 6 native languages. In subsequent evaluations, the best accuracy obtained was 0.73 with 10-fold cross-validation on the training data; this accuracy was achieved by incorporating word uni-grams and bi-grams.

KEYWORDS
Language Identification; Supervised Classification; Feature Selection.

1 INTRODUCTION
Recently, author profiling has gained importance for improving applications such as forensics, security and marketing. Author profiling aims to detect an author's details such as age, educational level and native language. Native Language Identification (NLI) is a sub-task of author profiling in which the native language L1 of a writer is automatically detected by analysing text written in a second language L2. NLI is often implemented as a multiclass supervised classification task.

    The applications of NLI fall into two categories: security-related applications and Second Language Acquisition (SLA)-related applications. Security-related applications include identifying phishing sites or spam e-mails, which often contain unusual sentences that may have been written by non-native speakers. SLA-related applications analyse the effect of L1 on languages learned later.

    As shown by preceding work in this area, there exist quite a few linguistic cues that help in predicting the native language. Under the influence of their native language, authors tend to make characteristic mistakes in spelling, punctuation and grammar while using other languages.

    In this work, we examine the possibility of building native language classifiers while ignoring grammatical errors and semantic analysis of the text written in L2. A naive set of features based on n-grams of words and characters is explored to develop the NLI system.

2 PREVIOUS WORK
The work presented in this study participated in the Indian Native Language Identification@FIRE 2017 shared task. Several researchers have investigated NLI and similar problems; an overview of a few common methods used for NLI prior to this shared task is provided here.

    Most researchers have framed NLI as a supervised classification task, where classifiers are trained on data from different L1s. The most commonly used features for NLI are character n-grams, POS n-grams, content words, function words and spelling mistakes. SVM models trained on these features [1-3] obtained accuracies of 60%-80%.

    More recently, word embeddings and document embeddings have gained attention alongside other features. Continuous Bag of Words (CBOW) and Skip-Gram models were used to obtain word embedding vectors, and vector representations of documents were generated with the distributed bag-of-words architecture of the Doc2Vec tool. In [4], the authors developed a native language classifier using document and word embeddings, with an accuracy of 82% on essays and 42% on speech data.

    LIBSVM, a library implementing SVM, has been shown to be efficient for text classification. In [5], the authors developed an NLI algorithm for Arabic with LIBSVM. They combined production rules, function words and POS bi-grams for the machine learning process and obtained an accuracy of 45%.

    The first NLI shared task was organized with the BEA workshop in 2013. A system that participated in the closed training task was presented in [6]. The model was trained on the 11 L1 languages of the TOEFL11 corpus, and cross-validation testing on unseen essays resulted in an accuracy of about 84.55%. The authors adopted features such as n-grams of words, characters and POS tags, and spelling errors, with TF-IDF weighting, to train an SVM model.

    In [7], the authors reported work that participated in the essay track of the second NLI Shared Task 2017 held at the BEA-12 workshop. A novel 2-stacked sentence-document architecture was introduced, based on lexical and grammatical features of the text. A stack of two SVM classifiers was used, where the first and second classifiers were sentence and document classifiers respectively. The first classifier predicted the native language of each sentence of a document, and these predictions were used as features by the document classifier. Finally, the system was used to predict the native language of unseen documents, resulting in an F1-score of 0.88.
3 TASK DESCRIPTION AND DATA
NLI has drawn the attention of many researchers in recent years. With the influx of new researchers, substantive work in this field has led to the INLI@FIRE 2017 shared task [8]. The task focuses on identifying the native language of a writer based on his or her writing in another language, in this case English. Concretely, the task was to predict the native language of a writer from a given text/XML file containing Facebook comments in English. Six Indian languages were considered for this task: Tamil, Hindi, Kannada, Malayalam, Bengali and Telugu.

Dataset
The training dataset for the task consisted of XML files, each containing a set of Facebook comments written in English by speakers of different native languages. The XML files were annotated as BE, HI, KA, MA, TE and TA for Bengali, Hindi, Kannada, Malayalam, Telugu and Tamil respectively. Table 1 shows the statistics of the training data used for the task.

                      Table 1: Training data

     Native Language               Files
     Bengali                       202
     Hindi                         211
     Kannada                       203
     Telugu                        207
     Malayalam                     200
     Tamil                         210

4 FEATURES
NLI has been formulated as a multiclass classification task. We used language-independent features, namely character tri-grams and word n-grams, as described in [1]. From previous work we observed that character tri-grams are useful for NLI, and it has been suggested that this is due to the influence of the author's native language. To reflect this, we compute character n-grams and word n-grams as features. For characters, we consider tri-grams. The features are generated over the entire training data, i.e., every tri-gram in the training dataset is used as a feature. Similarly, word uni-grams and bi-grams were used as separate features.

5 APPROACH
The training dataset provided consisted of XML files containing Facebook comments in English written by speakers of different native languages, with each file annotated with the native language of the speaker. As a preprocessing step, these XML files were scraped to extract the Facebook comments, and comments belonging to the same native language were saved in a single text file. We then extracted features from the generated text files and developed two methods for NLI using Python, as explained below.
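As an illustration of this preprocessing step, the sketch below parses the XML files and groups comments by native language. The INLI file schema is not reproduced in this paper, so the directory layout, file extension and the <comment> tag name used here are assumptions, not the actual shared-task format.

# Illustrative preprocessing sketch: the per-language directories and the
# <comment> tag are assumed placeholders, not the actual INLI@FIRE schema.
import glob
import os
import xml.etree.ElementTree as ET

LANGUAGES = ["BE", "HI", "KA", "MA", "TE", "TA"]

def collect_comments(train_dir, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    for lang in LANGUAGES:
        comments = []
        # Assumption: one sub-directory of annotated XML files per language.
        for xml_file in glob.glob(os.path.join(train_dir, lang, "*.xml")):
            root = ET.parse(xml_file).getroot()
            # Assumption: each Facebook comment sits in a <comment> element.
            for node in root.iter("comment"):
                if node.text:
                    comments.append(node.text.strip())
        # Comments of the same native language go into one text file,
        # one comment per line.
        with open(os.path.join(out_dir, lang + ".txt"), "w", encoding="utf-8") as f:
            f.write("\n".join(comments))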
Character tri-grams method
The tri-gram model reads the text files and extracts all tri-grams (sequences of three bytes) together with their counts. Tri-gram frequencies are computed for every training language separately. For each language, the frequencies are normalised by dividing the individual tri-gram counts by the total number of tri-grams in the training corpus of that language, and the tri-grams are sorted by this relative frequency (the probability of the tri-gram in the given corpus of a language) to form the language model of that language. A language model was created for every language in the provided corpus.

    The relative frequencies of the tri-grams in the test dataset are then calculated and compared with the tri-grams in the language models. Intuitively, the tri-gram frequencies extracted from two different texts written by speakers of the same native language should be very similar. The absolute difference is calculated by subtracting the relative frequency of each tri-gram in the test dataset from the relative frequency of the corresponding tri-gram in each language model, and these absolute differences are summed up. For instance, if we compare the test data with 5 language models, we obtain 5 values for the sum of absolute differences, and the minimum value indicates the best match for the test data. Algorithm 1 describes the character tri-gram approach.

Algorithm 1: Character tri-grams method
Input: Train Dataset for each language, Test Dataset
Output: Native language of each Test Dataset document
begin
for each language in language set
    for each document in Train Dataset of language
        Derive all possible tri-grams
    end for
    Language model <- relative frequency of each tri-gram in Train Dataset of language
end for
for each document in Test Dataset
    Derive all possible tri-grams
    Calculate relative frequency of each tri-gram in the document
end for
for each language model
    for each tri-gram in the Test Dataset document
        Absolute difference <- |(relative frequency of tri-gram in Test Data) - (relative frequency of corresponding tri-gram in language model)|
    end for
    Sum up the calculated absolute differences for this language model
end for
Best match <- the language model with the minimum sum of absolute differences
end
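The comparison in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under our own naming conventions (the per-language text files produced by the preprocessing step above), not the original system code.

# Sketch of Algorithm 1: character tri-gram language models compared by
# summed absolute differences of relative frequencies (illustrative only).
from collections import Counter

def trigram_profile(text):
    """Relative frequency of every character tri-gram in the text."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values()) or 1
    return {gram: c / total for gram, c in counts.items()}

def predict_language(test_text, language_models):
    """Return the language whose model has the smallest summed
    absolute difference to the test profile."""
    test_profile = trigram_profile(test_text)
    best_lang, best_score = None, float("inf")
    for lang, model in language_models.items():
        score = sum(abs(freq - model.get(gram, 0.0))
                    for gram, freq in test_profile.items())
        if score < best_score:
            best_lang, best_score = lang, score
    return best_lang

# Usage (assumed file layout): one language model per training file, e.g.
# language_models = {lang: trigram_profile(open(lang + ".txt", encoding="utf-8").read())
#                    for lang in ["BE", "HI", "KA", "MA", "TE", "TA"]}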



Word n-grams method
Frequencies of word uni-grams and bi-grams were collected for each language, irrespective of the meanings and the order of the words in the document. We instantiated the CountVectorizer module in Python to obtain word uni-grams and bi-grams. A Document Term Matrix X[i, j] was formed, where i is the document id, j is the dictionary index of a word (or bi-gram), and X[i, j] is its frequency of occurrence in document i. Each uni-gram and bi-gram of the test data was compared with its frequency of occurrence in the documents of all the languages.

    In this experiment, we used the Document Term Matrix built over these n-grams and applied a linear SVM from scikit-learn as the classification algorithm for NLI.
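A minimal sketch of this method with scikit-learn is shown below. The ngram_range setting follows the description above; the file layout (one preprocessed text file per language, one comment per line) and the variable names are illustrative assumptions.

# Sketch of the word n-gram method: document-term matrix of word uni-grams
# and bi-grams fed to a linear SVM (illustrative, not the original code).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

LANGUAGES = ["BE", "HI", "KA", "MA", "TE", "TA"]

# Assumption: one preprocessed text file per language, one comment per line,
# as produced by the preprocessing sketch above.
train_texts, train_labels = [], []
for lang in LANGUAGES:
    with open(lang + ".txt", encoding="utf-8") as f:
        for line in f:
            train_texts.append(line.strip())
            train_labels.append(lang)

vectorizer = CountVectorizer(ngram_range=(1, 2))   # word uni-grams and bi-grams
X_train = vectorizer.fit_transform(train_texts)    # Document Term Matrix X[i, j]

classifier = LinearSVC()
classifier.fit(X_train, train_labels)

# Test documents are mapped into the same vocabulary before prediction, e.g.
# predictions = classifier.predict(vectorizer.transform(test_texts))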
6 RESULT ANALYSIS
We submitted the output of the system on the provided test data to the INLI@FIRE 2017 shared task. A single run of each method for the six languages was submitted, and the results of native language classification for all the languages are summarised in Table 2 and Table 3. The character tri-gram model achieved an overall accuracy of 22%, and the word n-grams model achieved an overall accuracy of 27%.

    The combined word uni-gram and bi-gram features were also used to perform 10-fold cross-validation on the training data. With these features, an improved accuracy of 73% was achieved.
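In outline, this cross-validation step can be reproduced as sketched below; this is an assumed reconstruction on top of the pipeline shown earlier, not the exact evaluation script used for the reported 73%.

# Sketch of 10-fold cross-validation with combined word uni-gram and bi-gram
# features; train_texts and train_labels as built in the earlier sketch.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
scores = cross_val_score(model, train_texts, train_labels, cv=10)
print("Mean 10-fold accuracy: %.2f" % scores.mean())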
                  Table 2: Character tri-grams

        Class        Prec.           Rec.             F1
         BE           0.40           0.292          0.338
         HI           0.50           0.080          0.160
         KA          0.117           0.270          0.163
         MA          0.173           0.641          0.272
         TA          0.533           0.080          0.139
         TE          0.267           0.383          0.315
                Overall Accuracy                     0.22


                    Table 3: Word n-grams

        Class        Prec.           Rec.            F1
         BE          0.389           0.551         0.456
         HI          0.545           0.072         0.127
         KA          0.190           0.446         0.266
         MA          0.223           0.315         0.261
         TA          0.218           0.260         0.237
         TE          0.154           0.123         0.137
                Overall Accuracy                    0.27

7 CONCLUSION
In this paper, a supervised system for Indian Native Language Identification has been presented. We described character tri-gram, word uni-gram and word bi-gram features, which are a subset of the features frequently used for the NLI task. Results of supervised classification using these features on a test dataset covering 6 languages were reported as part of the INLI@FIRE 2017 shared task. Our future work lies in improving the performance of the NLI system by considering features that discriminate between native languages more effectively.

REFERENCES
[1]   Nicolai, Garrett, Md Asadul Islam, and Russ Greiner. "Native Language Identification using probabilistic graphical models." In 2013 International Conference on Electrical Information and Communication Technology (EICT). IEEE, 2014.
[2]   Abu-Jbara, A., Jha, R., Morley, E., and Radev, D. R. "Experimental Results on the Native Language Identification Shared Task." In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (BEA@NAACL-HLT), pp. 82-88, 2013.
[3]   Mizumoto, T., Hayashibe, Y., Sakaguchi, K., Komachi, M., and Matsumoto, Y. "NAIST at the NLI 2013 Shared Task." In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (BEA@NAACL-HLT), pp. 134-139, 2013.
[4]   Vajjala, Sowmya, and Sagnik Banerjee. "A study of N-gram and Embedding Representations for Native Language Identification." In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 240-248, 2017.
[5]   Mechti, Seifeddine, Lamia Hadrich Belguith, Ayoub Abbassi, Rim Faiz, and Carthage IHEC. "An empirical method using features combination for Arabic native language identification."
[6]   Gebre, B. G., Zampieri, M., Wittenburg, P., and Heskes, T. "Improving native language identification with TF-IDF weighting." In Proceedings of the 8th NAACL Workshop on Innovative Use of NLP for Building Educational Applications (BEA8), pp. 216-223, 2013.
[7]   Cimino, A., and Dell'Orletta, F. "Stacked Sentence-Document Classifier Approach for Improving Native Language Identification." In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 430-437, 2017.
[8]   Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. "Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification." In Notebook Papers of FIRE 2017, FIRE-2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings.



