IIITV@INLI-2018: Hyperdimensional Computing for Indian Native Language Identification*

Ashish Patel[0000-0002-0409-736X] and Pratik Shah[0000-0002-4558-6071]
Indian Institute of Information Technology, Vadodara, Gujarat, India
{201771002,pratik}@iiitvadodara.ac.in

* Indian Native Language Identification (INLI) is a shared task held in conjunction with FIRE 2018. The aim of the task is to identify the Indian native language of a user from their social media comments written in English.

Abstract. Native Language Identification (NLI) is the task of determining a user's native language from text written in their second language. NLI is an important task with applications in areas such as social media analysis, authorship identification, second language acquisition, and forensic investigation. In this paper, we propose to use Hyperdimensional Computing (HDC) as a supervised learning method for identifying Indian native languages from users' social media comments written in English. HDC represents language features as high-dimensional vectors called hypervectors. Comments are first broken into character bi-grams and tri-grams, which are used to generate comment hypervectors. These hypervectors are then combined to create a profile hypervector for each language, and the profile hypervectors are used to classify test comments. The proposed approach was validated by training on 70% of the training instances, achieving 65.13% accuracy on the remaining 30%. The classification accuracy on the test set is 29.2% for bi-gram and 31.5% for tri-gram feature encoding.

Keywords: Native Language Identification · Hyperdimensional Computing · n-gram · Indian Languages

1 Introduction

The diversity of languages used in India has given rise to a large number of challenging problems, both for the linguist and for the computer scientist. The 2001 Indian census recorded 30 languages spoken by more than a million native speakers each. English plays a prominent role in every sector in India, especially in higher education, and it is the most preferred language for reading and for official communication. The number of second- or third-language speakers of English is increasing day by day, and the influence of their regional and native languages can be observed in their usage of English. It is possible to identify a speaker's native language from their English pronunciation and accent, but it is challenging to do so from comments written in their non-native language. Native Language Identification (NLI) is important for a number of applications, such as authorship identification, forensic analysis, tracing linguistic influence in multi-author texts, and supporting Second Language Acquisition research. It can also be used in educational applications such as developing grammatical error correction systems. The extensive use of social media and online interactions also gives rise to threatening comments, a common issue faced by users. If a comment or a post poses any type of threat, identifying the native language of its author can be a significant feature in tracing the source.

In the INLI track [1], the task is to identify the native language of a user based on their social media comments in the English language. Six Indian languages, namely Tamil (TA), Hindi (HI), Kannada (KA), Malayalam (MA), Bengali (BE), and Telugu (TE), are considered for this shared task.
2 Related Work

The NLI challenge has gained a lot of attention and has been part of shared tasks in various events worldwide [2, 3, 4]. The challenge can be broken into two sub-tasks: feature selection and classification. Different techniques are used to extract features from non-native text, such as topic biases, spelling usage, and word-usage frequency. Most researchers have considered unigrams, bigrams, and character n-grams [5], as well as POS-tagged n-grams [6], as features. The current state of the art uses Support Vector Machines (SVM) [6, 7], Multinomial Naive Bayes [5], Random Forests [7], Logistic Regression [5], and feed-forward neural networks [8] for classification. The authors of [9] discuss a possible use of HDC for language identification.

2.1 Hyperdimensional Computing

The computer architecture that has been in use for the last 70 years has been a milestone in mechanizing the arithmetic of symbols. Efforts have been reported in the form of architectural proposals covering the representation of entities, memory, and processing units for computing [10]. The convenience and robustness of two-valued (binary) systems, in terms of realization, have been instrumental in their widespread use. It is generally agreed that, at the current state of evolution of the human brain, adding or multiplying large numbers is a difficult task for humans, barring a few exceptions. At the same time, recognizing objects, using language to communicate, and reasoning are tasks at which humans are efficient [11]. In contrast, the computer is good at the former task, whereas it needs considerable effort and hand-holding to perform the latter ones. The human brain contains large circuits with billions of neurons and synapses, and it is believed that concepts and entities are stored distributively in such circuits [10]. Inspired by the architecture of the brain, artificial neural networks have been designed and have gained considerable recognition in the field of machine learning [12].

Properties of high-dimensional spaces such as holistic representation, one-shot learning, and robustness against noise show similarities with the behavior of the brain. Instead of computing with numbers, brain-inspired hyperdimensional computing (HDC) computes with hypervectors: entities are represented using high-dimensional vectors called hypervectors. In the literature, different representation schemes have been proposed for HDC, for example HRR [13], BSC [14], the Bipolar Spatter Code [15], and MBAT [16]. In this paper, we use the Bipolar Spatter Code, whose hypervectors typically comprise randomly distributed +1's and -1's. The basic algebraic operations on hypervectors are binding (multiplication), superposition (addition), and rotation (ρ). Binding of two hypervectors is an element-wise multiplication, whereas superposition is an element-wise addition. Superposition preserves the similarity between the added hypervectors and the resultant hypervector, while binding maps two hypervectors to a hypervector that is nearly orthogonal to both of them. Rotation also generates a nearly orthogonal vector. The similarity between two hypervectors is usually measured by the cosine of the angle between them.
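These operations are easy to make concrete. The following minimal Python/NumPy sketch (ours, not the shared-task implementation) illustrates bipolar spatter code hypervectors and the binding, superposition, rotation, and cosine similarity operations; the dimensionality, seed, and all function names are illustrative.

```python
import numpy as np

D = 10_000  # hypervector dimensionality
rng = np.random.default_rng(0)

def random_hypervector():
    """Bipolar spatter code: a random vector of +1's and -1's."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding: element-wise multiplication; the result is nearly orthogonal to a and b."""
    return a * b

def superpose(*vectors):
    """Superposition: element-wise addition; the result stays similar to its inputs."""
    return np.sum(vectors, axis=0)

def rotate(a, n=1):
    """Rotation (rho): n-fold left circular shift; yields a nearly orthogonal vector."""
    return np.roll(a, -n)

def cosine(a, b):
    """Cosine similarity between two hypervectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x, y = random_hypervector(), random_hypervector()
print(cosine(x, y))                # ~0.0: random hypervectors are nearly orthogonal
print(cosine(x, bind(x, y)))       # ~0.0: binding maps away from its operands
print(cosine(x, rotate(x)))        # ~0.0: rotation also yields a nearly orthogonal vector
print(cosine(x, superpose(x, y)))  # ~0.7: superposition preserves similarity
```

With 10000 dimensions, the cosine similarity between independently drawn hypervectors concentrates tightly around zero, which is what allows a superposition of many n-gram hypervectors to remain recognizably similar to each of its components.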
3 Task Description

INLI is a multiclass classification task. The INLI@FIRE2018 shared task focuses on identifying six native languages, namely Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu, from XML files. Table 1 describes the training data used for the task. The training XML files are annotated with the native language of the user and contain social media comments extracted in English.

Table 1. Languages and training instances

  Language        Training instances
  Bengali (BE)    202
  Hindi (HI)      211
  Kannada (KA)    203
  Malayalam (MA)  200
  Tamil (TA)      207
  Telugu (TE)     210

4 Proposed Approach

For the Indian native language identification task, training instances for six languages are available; the number of training instances (user comments) per language is given in Table 1. It is assumed that the comments are from non-native English users. Since the comments are written in the English (Latin) alphabet, a preprocessing step gets rid of special characters: non-English characters and special characters are removed, and the text is converted to lowercase. This ensures that the resulting string is drawn from an alphabet of twenty-seven symbols in total, i.e. 'a', ..., 'z' and the blank space ' '. Each of these 27 symbols is assigned a hypervector, i.e. a bipolar spatter code: a random string of +1's and -1's of length 10000. It is easy to show that, with very high probability, a generated 10000-dimensional hypervector has a nearly equal number of +1's and -1's. Moreover, the mean bitwise distance between any two randomly generated hypervectors is 5000 bits. Hence the 27 generated hypervectors are all almost equidistant from each other, each with almost the same number of +1's and -1's. For more on bipolar spatter codes, refer to [15].

In the training phase, we scan through the comments sequentially and create a profile hypervector for each language class. For a given comment, we extract the overlapping character tri-grams (or bi-grams) and, for each tri-gram, generate a hypervector from the character hypervectors as shown in Fig. 1. Let H : {a, ..., z, ' '} → {-1, +1}^10000 be the function that maps a symbol to the hyperdimensional space, and let the rotation ρ : {-1, +1}^10000 → {-1, +1}^10000 be the left circular shift operation. Given a tri-gram "xyz", the corresponding tri-gram hypervector is ρρH(x) ∗ ρH(y) ∗ H(z), where ∗ is the binding operation, which for the bipolar spatter code is element-wise multiplication. Finally, all the tri-gram hypervectors of a given class are superposed (element-wise addition) to generate a class profile vector (language hypervector). For more details on the bipolar spatter code and the related algebra, refer to [15, 17].

Fig. 1. Tri-gram hypervector generation

In the proposed approach, HDC acts as a supervised learning algorithm with character bi-grams/tri-grams as features; the occurrence of bi-grams/tri-grams in a language is a reasonable feature of that language. The language hypervector is the superposition (addition) of all bi-gram/tri-gram hypervectors occurring in the training comments of that language, and it remains similar to each of the hypervectors added into it.

Algorithm 1 shows the training procedure, which takes the comments (training set) as input, computes hypervectors for the tri-grams, and generates a class hypervector for each language. A tri-gram hypervector is generated by binding the twice-rotated hypervector of the first character, the once-rotated hypervector of the second character, and the hypervector of the third character. A class hypervector can be created from bi-grams in the same way.
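To make the training procedure of Algorithm 1 concrete, the sketch below builds language profile hypervectors from preprocessed comments. It is a simplified reconstruction under assumed names (H, rho, trigram_hv, build_profiles), not the authors' actual code.

```python
import string
import numpy as np

D = 10_000
SYMBOLS = string.ascii_lowercase + " "   # the 27-symbol alphabet after preprocessing

rng = np.random.default_rng(42)
# The mapping H: one random bipolar hypervector per symbol.
H = {s: rng.choice([-1, 1], size=D) for s in SYMBOLS}

def rho(v, n=1):
    """Rotation: n-fold left circular shift of a hypervector."""
    return np.roll(v, -n)

def trigram_hv(a, b, c):
    """Tri-gram hypervector: rho(rho(H(a))) * rho(H(b)) * H(c), a binding of rotated symbol hypervectors."""
    return rho(H[a], 2) * rho(H[b], 1) * H[c]

def build_profiles(training_data):
    """training_data maps a language label to a list of preprocessed (lowercase) comments.
    Returns one profile hypervector per language: the superposition of all its tri-gram hypervectors."""
    profiles = {}
    for lang, comments in training_data.items():
        profile = np.zeros(D)
        for text in comments:
            for i in range(len(text) - 2):
                profile += trigram_hv(text[i], text[i + 1], text[i + 2])
        profiles[lang] = profile
    return profiles
```

A bi-gram variant would presumably bind ρH(x) with H(y) over a sliding window of length two instead.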
Algorithm 2 is used for predicting the class of test instances. The query hypervector Q is generated by the same process as the language hypervectors, i.e. by superposing the tri-gram hypervectors of the test/query instance. This query hypervector is compared with all the language hypervectors, where the similarity between bipolar hypervectors is calculated using the cosine of the angle between them. The language with the highest similarity (smallest angle) to the query hypervector is written to the output file along with the test dataset name.

Algorithm 1: Training
  Input : Training dataset
  Output: Language profile hypervectors
  N    : number of languages
  M_n  : number of training instances of the n-th language
  H    : {a, ..., z, ' '} → {-1, +1}^10000
  L_n  : profile hypervector of language n
  t^n_j(k) : k-th symbol of the j-th training instance of the n-th language, t^n_j(k) ∈ {a, ..., z, ' '}
  for n ← 1 to N do
    for j ← 1 to M_n do
      for k ← 1 to Length(t^n_j) - 2 do
        L_n ← L_n + (ρρH(t^n_j(k)) ∗ ρH(t^n_j(k+1)) ∗ H(t^n_j(k+2)))   // ∗ binding, + superposition
      end
    end
  end

Algorithm 2: Testing
  Input : Test dataset, language profile hypervectors
  Output: Native language identification for the test dataset
  R    : number of test records
  Q    : query hypervector
  M'   : predicted language
  N    : number of languages
  H    : {a, ..., z, ' '} → {-1, +1}^10000
  L_n  : profile hypervector of language n
  t_j(k) : k-th symbol of the j-th test record, t_j(k) ∈ {a, ..., z, ' '}
  for j ← 1 to R do
    for k ← 1 to Length(t_j) - 2 do
      Q ← Q + (ρρH(t_j(k)) ∗ ρH(t_j(k+1)) ∗ H(t_j(k+2)))
    end
    for i ← 1 to N do
      Angle[i] ← arccos(Cosine(L_i, Q))   // Angle[] stores the angle between L_i and Q
    end
    M' ← Language(Min(Angle))   // Language() returns the language with the smallest angle (highest similarity)
    write predicted language M' to the output file
  end
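A companion sketch of the prediction step in Algorithm 2, reusing D and trigram_hv from the training sketch above; query_hv and classify are assumed names, not the authors' interface.

```python
import numpy as np

def query_hv(text):
    """Superpose the tri-gram hypervectors of a preprocessed query comment (the hypervector Q)."""
    q = np.zeros(D)
    for i in range(len(text) - 2):
        q += trigram_hv(text[i], text[i + 1], text[i + 2])
    return q

def cosine(a, b):
    """Cosine similarity; the predicted language has the highest cosine (smallest angle)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(text, profiles):
    """Return the language whose profile hypervector is most similar to the query hypervector."""
    q = query_hv(text)
    return max(profiles, key=lambda lang: cosine(q, profiles[lang]))
```

For instance, classify(comment, build_profiles(training_data)) would return one of the six language labels, assuming the comment has already been lowercased and stripped to the 27-symbol alphabet.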
5 Results and Discussion

For validation of the classification algorithm, we used 70% of the given training dataset as training instances and the remaining 30% as validation instances. Fig. 2 shows the normalized confusion matrix for the validation dataset. The data was cross-validated 10 times, and the average accuracy is 65.13%.

Fig. 2. Normalized confusion matrix of the validation dataset (rows: true label, columns: predicted label)

         BE    HI    KA    MA    TA    TE
  BE   0.48  0.25  0.00  0.16  0.07  0.05
  HI   0.11  0.73  0.03  0.08  0.05  0.00
  KA   0.08  0.08  0.52  0.16  0.08  0.07
  MA   0.05  0.05  0.05  0.75  0.07  0.03
  TA   0.03  0.05  0.03  0.10  0.69  0.10
  TE   0.00  0.03  0.08  0.05  0.11  0.73

The Run1 and Run2 results correspond to the bi-gram and the tri-gram language profile vectors respectively. The results are presented in Table 2 (TestSet-1) and Table 3 (TestSet-2). TestSet-1 is used for comparing results with the previous year's shared task. On TestSet-1, Run1 (bi-gram) achieves 32.4% accuracy and Run2 (tri-gram) 32.1%. On TestSet-2, the accuracy of Run1 and Run2 is 29.2% and 31.5% respectively. The experiments were conducted on a 3.6 GHz Intel Core i7 PC with 4 GB of RAM; the Python implementation of the tri-gram training algorithm took 2.81667 minutes. The validation results differ from the test results primarily because the test data contains tri-grams (bi-grams) that were not seen during the training phase.

Table 2. Results on TestSet-1 (accuracy: Run1 32.4%, Run2 32.1%)

  Run    Language  Precision  Recall  F1-score
  Run1   BE        47.7%      55.7%   51.4%
         HI        50.0%       7.6%   13.1%
         KA        20.4%      40.5%   27.1%
         MA        25.2%      58.7%   35.3%
         TA        31.9%      29.0%   30.4%
         TE        24.7%      23.5%   24.1%
  Run2   BE        45.1%      55.1%   49.6%
         HI        52.0%       5.2%    9.4%
         KA        22.4%      44.6%   29.9%
         MA        24.7%      60.9%   35.1%
         TA        31.0%      26.0%   28.3%
         TE        28.4%      25.9%   27.1%

Table 3. Results on TestSet-2 (accuracy: Run1 29.2%, Run2 31.5%)

  Run    Language  Precision  Recall  F1-score
  Run1   BE        39.0%      27.5%   32.3%
         HI         7.6%       3.6%    4.9%
         KA        41.5%      38.0%   39.7%
         MA        22.8%      51.5%   31.6%
         TA        24.7%      30.0%   27.1%
         TE        35.8%      17.6%   23.6%
  Run2   BE        45.2%      33.8%   38.7%
         HI         8.8%       3.6%    5.1%
         KA        43.5%      43.2%   43.4%
         MA        23.9%      53.5%   33.0%
         TA        28.1%      29.3%   28.7%
         TE        32.1%      16.8%   22.0%

6 Conclusions

In this paper, we have proposed an HDC-based approach for Indian Native Language Identification. The results are encouraging and can be enhanced further. The features selected for classification are character tri-grams (bi-grams), which are easy to aggregate. Since the effort here went into a proof of concept of HDC-based classification, other aspects such as feature selection and efficient implementation were not addressed in this work. We believe that deleting or rewriting lexical shorthand such as pls (please), u (you), y (why), r (are), sry (sorry), fyi (for your information), etc. would improve the accuracy of the system. We also believe that punctuation marks may have a role to play in native language identification. In the future, we would like to include such features in the HDC framework. We would also like to compare the proposed approach with state-of-the-art methods in terms of training effort, precision, recall, and accuracy.

References

1. Anand Kumar, M., Barathi Ganesh, B., Soman, K.P.: Overview of the INLI@FIRE-2018 track on Indian Native Language Identification. In: Workshop Proceedings of FIRE 2018, Gandhinagar, India, December 6-9, CEUR Workshop Proceedings (2018)
2. Joel, T., Daniel, B., Aoife, C.: A report on the first native language identification shared task. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (2013)
3. Malmasi, S., Evanini, K., Aoife, C., Tetreault, J., Pugh, R., Hamill, C., Napolitano, D., Qian, Y.: A report on the 2017 Native Language Identification shared task. In: Proceedings of the 12th Workshop on Building Educational Applications Using NLP, Copenhagen, Denmark, Association for Computational Linguistics (September 2017)
4. Anand Kumar, M., Barathi Ganesh, H., Singh, S., Soman, K., Rosso, P.: Overview of the INLI PAN at FIRE-2017 track on Indian native language identification. CEUR Workshop Proceedings 2036, pp. 99-105 (2017)
5. Mathur, P., Misra, A., Budur, E.: LIDE: Language identification from text documents. CoRR abs/1701.03682 (2017)
6. Nicolai, G., Islam, M.A., Greiner, R.: Native language identification using probabilistic graphical models. In: 2013 International Conference on Electrical Information and Communication Technology (EICT) (2014) 1-6
7. Ulmer, B., Zhao, A., Walsh, N.: Native language identification from i-vectors and speech transcriptions
8. Sari, Y., Fatchurrahman, M.R., Dwiastuti, M.: A shallow neural network for native language identification with character n-grams. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications (2017) 249-254
9. Joshi, A., Halseth, J.T., Kanerva, P.: Language geometry using random indexing. In: de Barros, J.A., Coecke, B., Pothos, E. (eds.): Quantum Interaction, Cham, Springer International Publishing (2017) 265-274
10. Castro, L.N.D.: Fundamentals of Natural Computing: Basic Concepts, Algorithms, and Applications. Chapman and Hall/CRC (2006)
11. Kahneman, D.: Thinking, Fast and Slow. Farrar, Straus and Giroux (2012)
12. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience (1998)
13. Plate, T.A.: Holographic reduced representations. IEEE Transactions on Neural Networks 6(3) (May 1995) 623-641
14. Kanerva, P.: Binary spatter-coding of ordered k-tuples. In: Artificial Neural Networks - ICANN 96, Springer Berlin Heidelberg (1996) 869-873
15. Gayler, R.W.: Multiplicative binding, representation operators & analogy (workshop poster) (1998)
16. Gallant, S.I., Okaywe, T.W.: Representing objects, relations, and sequences. Neural Computation 25(8) (August 2013) 2038-2078
17. Kanerva, P.: Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation 1(2) (2009) 139-159