<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Tamil Co-Writer: Towards inclusive use of generative AI for writing support</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Antonette</forename><surname>Shibani</surname></persName>
							<email>antonette.shibani@uts.edu.au</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Technology Sydney</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Faerie</forename><surname>Mattins</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Vellore Institute of Technology</orgName>
								<address>
									<settlement>Chennai</settlement>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">University of Southern California</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Srivarshan</forename><surname>Selvaraj</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Vellore Institute of Technology</orgName>
								<address>
									<settlement>Chennai</settlement>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">University of Southern California</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ratnavel</forename><surname>Rajalakshmi</surname></persName>
							<email>rajalakshmi.r@vit.ac.in</email>
							<affiliation key="aff1">
								<orgName type="institution">Vellore Institute of Technology</orgName>
								<address>
									<settlement>Chennai</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gnana</forename><surname>Bharathy</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Technology Sydney</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Tamil Co-Writer: Towards inclusive use of generative AI for writing support</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">CDF7D567253CC40BC989D4191C114F5D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:53+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large language model</term>
					<term>generative AI</term>
					<term>artificial intelligence</term>
					<term>LLM</term>
					<term>Tamil</term>
					<term>writing</term>
					<term>GPT</term>
					<term>keystroke analysis</term>
					<term>CoAuthorViz</term>
					<term>Tamil Co-Writer</term>
					<term>inclusive AI</term>
					<term>equity</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The increasing use of generative AI in education highlights its potential for enriching learning experiences. One application utilising the capabilities of Large Language Models (LLMs) such as the Generative Pre-trained Transformer (GPT) is the creation of writing support tools. In particular, tools that work in partnership with humans to co-write with AI hold great promise and have been tested for English-language writing. However, the adaptation of such tools to languages other than English is limited, presenting a disadvantage for learners from linguistically diverse backgrounds. In the current study, we extend previous work in English to develop Tamil Co-Writer, a writing aid prototype for co-writing with AI in the low-resource Indian regional language Tamil. The tool additionally provides a visual summary of user interaction and co-authorship metrics for each writing session, allowing users to reflect on their usage of AI in their own writing. We posit that such interactive tools using the latest generative AI technologies can help writers improve their writing skills and productivity in their own regional languages, supporting inclusive AI for education.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Large Language Models (LLMs) are sophisticated artificial intelligence (AI) systems trained on massive amounts of textual data to generate text. LLMs such as the Generative Pre-trained Transformer (GPT) are capable of producing language that is grammatically correct and appears human-written. They perform well in a variety of language-related tasks such as translation, summarization, question answering, and content creation, and are increasingly employed across many sectors. Although relatively new, LLMs are starting to be deployed in intelligent support systems for learners and writers <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. For writing support, CoAuthor, a tool built on GPT-3, was used to collect a dataset of collaborative writing between humans and AI <ref type="bibr" target="#b2">[3]</ref>. The tool allowed users to write freely, solicit suggestions from GPT-3, accept or reject those suggestions, and alter previously written texts or accepted suggestions in any sequence they desired. Past research also presented a visual representation (CoAuthorViz) of user interactions with the tool to study the AI-dependency behaviours of users <ref type="bibr" target="#b4">[5]</ref>. This demonstrates the usefulness of AI-based tools in enhancing learner capabilities; however, these tools only support English-language users and do not address the needs of diverse groups of learners to promote inclusivity in education.</p><p>The objective of the current study is to introduce a working prototype of an AI co-authoring tool for the regional language 'Tamil' <ref type="bibr" target="#b3">[4]</ref>. Tamil is a Dravidian language primarily spoken in the Indian state of Tamil Nadu and northeast Sri Lanka.
Although spoken by an estimated 80 million native speakers worldwide, Tamil falls under the category of low-resource languages for Natural Language Processing research, characterized by limited tools and datasets. Recent studies in under-resourced languages aim to bridge this gap by specifically targeting the creation of additional resources <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>. Our prototype tool, called 'Tamil Co-Writer', can help Tamil language learners engage in interactive writing sessions with AI support in the form of auto-generated suggestions. The tool incorporates speech-to-text input using Automatic Speech Recognition (ASR) for increased accessibility. The learner also receives summary statistics and a visual graph at the end of each writing session to reflect on their dependence on AI suggestions in their writing. Our study demonstrates how tools such as this can cater to linguistically and culturally diverse groups of users to aid their writing for equitable use of AI. The contributions of this paper are as follows:</p><p>• Development of a novel writing aid prototype for Tamil, called Tamil Co-Writer, using the open-source GPT-2 model and incorporating an ASR model for speech-to-text input and visual statistics of AI usage. • The conceptualisation and evaluation of how LLMs can be effectively embedded in writing tools for improved accessibility and language support for diverse users whose first language is not English.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>Automated writing analysis, feedback, and digital tools have long been used to support writers in their writing process <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. In learning analytics, the sub-field of writing analytics has examined how tools and analytics techniques can be used to help learners with their writing processes and products <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>.</p><p>With the advent of advanced technologies such as LLMs, more sophisticated tools are now being developed that invite writers to co-write with AI as an active companion. The best documented among AI co-creation tools is CoAuthor <ref type="bibr" target="#b11">[12]</ref>, which combines an interface, a dataset, and an experiment in one. The tool was evaluated with over 60 people on creative and argumentative writing tasks, where they co-wrote with GPT-3-generated text suggestions. The interactions between the writer and the GPT-3 suggestions were also captured using keystroke logging, which has further led to the study of human-AI interaction behaviours <ref type="bibr" target="#b4">[5]</ref>. Another such tool produced 'sparks', which were new sentences generated by the AI to inspire users to write scientific content <ref type="bibr" target="#b0">[1]</ref>. Here, the inspirations helped writers with elaborating sentences with detail, providing interesting angles to engage readers, and showing potential reader perspectives. A web application called Wordcraft provided multimodal machine intelligence to help writers make integrative leaps in creative writing as they wrote stories <ref type="bibr" target="#b12">[13]</ref>.
A similar co-creative story authoring tool utilizes large language models and story grammars, allowing authors to easily engineer text generation to meet their expectations <ref type="bibr" target="#b13">[14]</ref>. While most tools support the drafting process of writing, tools are also being developed that support the revision process without fully handing over the creative process to AI. This includes a human-in-the-loop iterative text revision system called Read, Revise, Repeat (R3), where writers interacted with model-generated revisions for deeper edits <ref type="bibr" target="#b14">[15]</ref>. Commercial tools such as Grammarly, ProWritingAid, Quillbot, and Scribbr that previously focused on grammar and style also appear to have been transformed by LLMs into co-writing tools in English, providing feedback on content relevance, tone, and cohesion, enhancing content upon request, and correcting errors <ref type="bibr" target="#b15">[16]</ref>. As commercial tools, their exact capabilities and limitations are not well documented in the research literature and constantly evolve over time. Plagiarism checkers are also being incorporated into tools such as Quillbot <ref type="bibr" target="#b18">[19]</ref>, which, while suggesting improvements or providing feedback, also check for potential plagiarism in content.</p><p>Propelled by the increasing availability of generative AI, a plethora of products facilitate co-authoring, including content generators, programming automation, and AI-based virtual assistants. Co-pilots and assistants where writers can obtain suggestions "as-they-type" are touted as the future of writing tools. However, all tools discussed above have been developed for writing in English. There are no such authoring tools created for low-resource languages such as the regional Indian languages, e.g. Tamil.
By not focusing on languages other than English, we miss the rich cultural context and nuances in regional languages that are not well-resourced for NLP tasks. Fortunately, this is beginning to change for Indian languages as fine-tuned LLMs are starting to emerge <ref type="bibr" target="#b16">[17]</ref>. Our research aims to contribute to the growing space of Tamil learning technologies and data to cater to linguistically diverse user groups and research areas. This paper explains the technical components in building such an AI co-writing system using open-source technology that can be generalized to other languages for writing support.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Development and evaluation of Tamil Co-Writer</head><p>Our novel collaborative writing tool 'Tamil Co-Writer' consists of a simple user interface for writing Tamil text and several underlying technical components that facilitate AI support and speech recognition. Figure <ref type="figure" target="#fig_0">1</ref> shows a flow chart of the individual components, which are explained below. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Tool functionality</head><p>The front end of Tamil Co-Writer contains an input text field that allows users to start writing their text in a writing session. The user has an additional option to provide input in speech format, which is processed by a fine-tuned ASR model to convert speech to text. This option can improve the accessibility of writing support tools by letting users produce text as they speak their language. The use of non-standard vocabulary, accent differences, and background noise makes it difficult for ASR models to accurately identify speech <ref type="bibr" target="#b17">[18]</ref>, with increased challenges in low-resource languages such as Tamil, which lack sufficient training data. When using the tool, the user first enters the text in Tamil or speaks through the ASR system in Tamil. Upon asking for a GPT-2 suggestion (by clicking the 'Get suggestion' button), this Tamil text is translated to English and sent out for text generation using GPT-2. Five suggestions automatically generated using GPT-2 are displayed. The writer can accept a suggestion as is, reject a suggestion, or accept the suggestion and then modify it. At the end of the writing session, the writer obtains summary statistics and graphs that show their level of collaborative writing with the aid of GPT-2. The website is built using Django 2 and locally tested for prototyping. The back end thus consists of the three main components described next:</p><p>1. The input module processes the written text or speech from the user. 2. The automated text generation module then suggests new text using the GPT-2 model. 3. The metrics generation module populates the visual graph and key metrics of AI usage.</p></div>
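The three-module flow above can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's actual code: all function names are hypothetical stand-ins, and the ASR, translation, and GPT-2 calls are stubbed so the control flow is visible (the real tool wires these modules into a Django application).

```python
# Hypothetical sketch of the Tamil Co-Writer back-end flow.
# ASR, translation, and GPT-2 generation are stubbed for illustration.

def asr_transcribe(audio_bytes):
    """Stub for the fine-tuned XLSR Wav2Vec2.0 speech-to-text model."""
    raise NotImplementedError("ASR model not bundled with this sketch")

def translate(text, dest):
    """Stub for the Google Translate API call; simply echoes its input here."""
    return text

def gpt2_generate(prompt, num_return_sequences=5):
    """Stub for GPT-2 text generation returning five candidate continuations."""
    return [f"{prompt} ...continuation {i + 1}" for i in range(num_return_sequences)]

def get_suggestions(tamil_text):
    """Input module -> translate to English -> GPT-2 -> translate back to Tamil."""
    english = translate(tamil_text, dest="en")
    candidates = gpt2_generate(english, num_return_sequences=5)
    return [translate(c, dest="ta") for c in candidates]
```

A caller would pass the current writing-session text to `get_suggestions` when the 'Get suggestion' button is clicked and display the five returned strings to the writer.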
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Input module</head><p>The input module processes the written text entered by the user directly in Tamil, or converts the user's speech to text. A screenshot of the user interface is in Figure <ref type="figure" target="#fig_1">2</ref>. For ASR, we use a fine-tuned XLSR Wav2Vec2.0 model trained with the Connectionist Temporal Classification (CTC) algorithm <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref>. CTC introduces a special 'blank' symbol that may appear in the target sequence between any two consecutive output symbols. At each time step, the method assigns a probability to every potential output symbol, including the blank. Blank symbols are used during decoding to calculate the most likely output sequence given the input sequence. The ability to accommodate input and output sequences of varying lengths is one of CTC's main features, making it ideal for speech recognition tasks where the input and output lengths differ. Moreover, CTC can handle situations where the alignment between the input and output sequences is not known beforehand, such as when the input speech signal contains pauses or other disturbances.</p><p>For fine-tuning the XLSR Wav2Vec2.0 model for better performance on Tamil, we use the Common Voice open dataset <ref type="bibr" target="#b21">[22]</ref>. This dataset has over 850 voices and 7.83 GB of data. Training on data of this size requires a powerful machine with ample storage and memory running non-stop for a minimum of 40 hours. In this research, we employ Google Colab Pro with 100 compute units to train the model. The dataset was cleaned by removing special characters. The audio data used to train the XLSR-Wav2Vec2 model was captured at 48 kHz for Babel, Multilingual LibriSpeech (MLS), and Common Voice and then down-sampled to 16 kHz for training.
The Wav2Vec2FeatureExtractor was used with a feature size of 1, a sampling rate of 16,000 Hz, and a padding value of zero. Training samples were padded batch-wise to the length of the longest sample in each batch. Table <ref type="table" target="#tab_0">1</ref> shows the hyperparameters of the model. The input Tamil text is then converted to English using the Google Translate API <ref type="bibr" target="#b22">[23]</ref> via the googletrans library, version 4.0. This web service enables programmers to include translation capabilities in their programs by querying the API built by Google. The Google Cloud Platform-provided API offers a straightforward REST-based interface for translating text between languages. A query to the Google Translate API contains the text to be translated and the destination language, and the API returns the translated text in the intended language. The API offers access to both neural and statistical machine translation models and covers over 100 languages, including some less widely spoken ones. Developers may also alter the translation results by selecting glossary words or supplying their own translation models. After translating the input Tamil text to English, Tamil Co-Writer sends it to the GPT-2 text generator to produce suggestions for the writer.</p><p>The accuracy of ASR or machine translation was gauged using the word error rate (WER) measure. WER reflects the proportion of words that the system erroneously recognizes or translates, relative to the total number of words in the reference text <ref type="bibr" target="#b23">[24]</ref>. A lower WER generally denotes higher system accuracy in speech recognition or machine translation.
However, it is important to note that WER is not always a true indicator of the system's performance, as there might be other contextual factors in the language that are better assessed by a native speaker. During manual evaluation of some examples of our Tamil ASR, we observed errors in a few characters and, in some cases, in affixes and suffixes. Since Tamil is a phonetic language, many of the predictions are phonetically close to the ground truth even though there are errors in the written script. An additional metric could be introduced in the future to measure phonetic proximity, that is, comparing the phonetic content in addition to the written content. Our fine-tuned ASR model achieves a WER of 60%, which is adequate for basic applications in low-resource languages such as Tamil that lack large training sets. While noting the limitations, we proceed to use this model for speech recognition in the current prototypical version of our tool, as the main objective of this research is not to enhance Tamil ASR, but rather to showcase how it can be used in an AI-based writing aid.</p></div>
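Two of the mechanisms described in this section lend themselves to short sketches: the role of the CTC 'blank' symbol in decoding, and the WER measure used for evaluation. The following is illustrative code (not the tool's implementation); a greedy decoder is shown for simplicity, whereas production systems typically use beam search.

```python
# Illustrative sketches: CTC greedy decoding with the 'blank' symbol,
# and word error rate (WER) via word-level edit distance.

BLANK = "_"  # stand-in for the CTC blank symbol

def ctc_greedy_decode(frame_symbols):
    """Collapse repeated symbols, then drop blanks, as in CTC decoding.
    frame_symbols: the most likely symbol per audio frame (e.g. from argmax)."""
    out, prev = [], None
    for s in frame_symbols:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)

def wer(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note how the blank lets CTC emit genuinely repeated characters: the frame sequence `a a _ a` decodes to `aa`, whereas `a a a` collapses to a single `a`.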
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Automated text generation module</head><p>The automated text generation module in Tamil Co-Writer generates Tamil text suggestions for the writer using the open-source large language model GPT-2. The large-scale neural language model Generative Pre-trained Transformer 2 (GPT-2) was created by OpenAI as an extension of the original GPT model and performed better across a range of NLP tasks due to its substantially greater size <ref type="bibr" target="#b24">[25]</ref>. The Transformer architecture, a kind of neural network that processes input sequences via self-attention techniques, serves as the foundation for GPT-2. The model is pretrained on a sizable corpus of text data using an unsupervised learning technique, allowing it to pick up on linguistic patterns and structures without direct supervision. Once trained, GPT-2 produces human-like writing by probabilistically predicting the next word in a sequence based on the preceding ones. It can also be applied to specialized tasks such as text categorization, question answering, and language translation. The capability of GPT-2 to produce coherent, fluent language that is difficult to distinguish from human-written material is one of its standout characteristics. Table <ref type="table" target="#tab_1">2</ref> shows the parameters used for the GPT-2 model to generate text suggestions for the writer. While newer models can provide much better performance (and are discussed as part of future work), GPT-2 provides a baseline model for testing that is free to access and deploy. </p></div>
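To make the Table 2 parameters concrete, the toy sketch below implements temperature and top-p (nucleus) sampling over a handful of token scores. This is an illustration of what those two parameters control during generation, not the tool's code; in practice they are simply passed to the model's generation call.

```python
import math
import random

def sample_next(logits, temperature=0.3, top_p=1.0, rng=random):
    """Toy next-token sampler illustrating Table 2's temperature/top-p.
    logits: dict mapping token -> raw score."""
    # Temperature < 1 sharpens the distribution toward the top token;
    # temperature > 1 flattens it.
    scaled = {t: l / temperature for t, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = sorted(((t, math.exp(v) / z) for t, v in scaled.items()),
                   key=lambda kv: -kv[1])
    # Top-p keeps the smallest prefix of tokens whose cumulative
    # probability mass reaches top_p (top_p=1 keeps everything).
    kept, mass = [], 0.0
    for t, p in probs:
        kept.append((t, p))
        mass += p
        if mass >= top_p:
            break
    # Sample from the renormalized kept set.
    total = sum(p for _, p in kept)
    r, acc = rng.random() * total, 0.0
    for t, p in kept:
        acc += p
        if acc >= r:
            return t
    return kept[-1][0]
```

With the paper's low temperature of 0.3, the distribution is strongly peaked, so the five returned sequences stay close to the most likely continuations; the `no repeat ngram size` of 2 (not modelled here) additionally forbids repeating any bigram.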
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Metrics generation module</head><p>The tool additionally logs the keystroke-level actions of users to analyse and provide summaries of their interactions with AI suggestions using the metrics generation module. This kind of learning analytics from user logs of AI interaction helps users reflect on their process of writing and reliance on AI, which is deemed to be the future of assessment <ref type="bibr" target="#b24">[25]</ref>. The visualization follows CoAuthorViz, a graphical representation introduced by recent work <ref type="bibr" target="#b4">[5]</ref> to represent co-authorship behaviors during a writing session when users interact with AI suggestions at the sentence level. The CoAuthorViz visual representation and its interaction metrics provide insight into how writers utilize AI writing assistants and can help writers reflect on their dependence on AI suggestions in their writing practices. This was used to analyze human-AI collaborative writing for English, and how the graph can be read is explained in detail, with examples, in the previous paper <ref type="bibr" target="#b4">[5]</ref>. Here we illustrate CoAuthorViz for our Tamil Co-Writer tool in Figure <ref type="figure" target="#fig_2">3</ref> using an example from a test writer. In this case, the majority of the writing was done independently by the writer (sentences 1, 3, 5, 8, 10, 11, 13, 15, 16, and 18 with black squares), and even when text from GPT-2 was obtained (sentences 2, 4, 6, 7, 9, 12, and 18 with GPT-2 written text), the writer still added more text. Additionally, we can identify instances when a GPT-2 call was made but the writer disregarded it (white triangles denoting empty GPT-2 calls in sentences 6 and 10).
We also see instances where the writer made changes to the phrases recommended by GPT-2, suggesting partial satisfaction with the suggestions and further editing (squares enclosing gray triangles in sentences 7 and 9).</p><p>In addition to the creation of CoAuthorViz for Tamil, we present a summary of the most important events noted in each writing session, which offers concrete metrics that authors may use alongside the visuals to learn more about their writing patterns, as demonstrated in previous work <ref type="bibr" target="#b4">[5]</ref>. This summary includes three kinds of metrics: sentence, API, and ratio. The sentence metrics define co-authorship across all sentences in the writing session; the API metrics track the number of calls made to obtain GPT-2 suggestions; and the ratio metrics consolidate them to study co-authorship behaviours relative to the total number of sentences generated in a writing session. Table <ref type="table">3</ref> shows key co-authorship metrics for the example writing session discussed in Figure <ref type="figure" target="#fig_2">3</ref>. The author wrote independently for most of the writing session, as indicated by the autonomous writing indicator (RB = 0.61). The low total usage of GPT-2 in sentences (RC = 0.38) indicates that the user has a more independent writing style or is not fully satisfied with the AI-offered suggestions. In none of the sentences was the user completely dependent on the GPT-2 suggestion (SC = 0). These metrics can help writers understand how much of their writing is produced by the model and how much is their own work at the end of a writing session. </p></div>
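The ratio metrics can be reproduced from per-sentence co-authorship labels. The sketch below is a hedged illustration: the label names are invented for this example, and the formulas (RA = SC/SA, RB = SB/SA, RC = (SC+SD)/SA) are inferred from the sentence counts and ratio values reported in Table 3, not quoted from the tool's source.

```python
# Illustrative computation of the Table 3 ratio metrics from
# per-sentence labels: "writer" (writer-only), "gpt2" (GPT-2-only),
# or "co" (co-authored). Label names and formulas are assumptions
# inferred from the reported values.

def coauthorship_metrics(sentence_labels):
    sa = len(sentence_labels)             # total sentences (SA)
    sb = sentence_labels.count("writer")  # fully writer-authored (SB)
    sc = sentence_labels.count("gpt2")    # fully model-authored (SC)
    sd = sentence_labels.count("co")      # co-authored (SD)
    return {
        "RA": sc / sa,         # dependence indicator
        "RB": sb / sa,         # autonomous writing indicator
        "RC": (sc + sd) / sa,  # total model usage in sentences
    }
```

Feeding in the example session's counts (13 writer-only and 8 co-authored sentences out of 21) reproduces the reported RB of about 0.61 and RC of about 0.38.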
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3 Co-authorship metrics for the sample writing session in Figure 3</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion and future work</head><p>Despite the increasing popularity of large language models in NLP-based applications, low-resource languages such as the regional Indian language Tamil have a lower adoption rate. In this work, we introduced a novel human-AI collaborative writing tool prototype for Tamil writing called Tamil Co-Writer that uses GPT-2 to auto-generate text suggestions for the user when they require help in their writing session. Additionally, we introduced a fine-tuned ASR model for Tamil with 60% WER, which enables the user to provide Tamil speech as input for improved accessibility. We also demonstrated the use of a visual representation called CoAuthorViz and its co-authorship metrics applied through our tool for writers to improve their understanding of AI usage when co-writing with AI. We posit that AI-generated suggestions can not only help language writers obtain new ideas when writing in Tamil, for instance, in coming up with characters and plots for creative writing outputs such as stories, but can also provide them with exposure to new vocabulary and different ways of structuring sentences.</p><p>In the current prototypical version of our tool, we employed the GPT-2 model for equitable access, because it is open source and freely available for deployment. However, newer models created with larger training data, such as higher versions of GPT, can provide better text suggestions for the user (with the downside of costs). We plan to use open-source large language models such as LlaMA or Falcon LLM in later versions of the tool when deploying to users. Exciting new developments such as the introduction of an early-stage LLM for Tamil (fine-tuned from LlaMA with 16,000 Tamil tokens using the LoRA methodology) are starting to occur <ref type="bibr" target="#b16">[17]</ref>.
LoRA introduces trainable low-rank matrices into specific layers of a pre-trained model, avoiding the need to update the full set of pre-trained weight parameters directly and achieving higher training efficiency. We see this as a synergistic development and a potential opening for an NLP ecosystem in Tamil. If open-source LLMs such as these are developed for global regional languages, tools such as Tamil Co-Writer would be further augmented for supporting learners. Together, they could add to a more inclusive and diverse linguistic AI landscape and ecosystem.</p><p>Currently, the Tamil Co-Writer tool also uses the Google API for translation purposes with trial access; in future versions, we would like to use better, open-source translation models specifically tailored for Tamil to deploy the tool without paid API access. This would also improve the user experience for writers with faster processing times, as no time would be lost in the back-and-forth translation through APIs. There is also scope to create a better fine-tuned ASR model by increasing the training time and input data and by training on more powerful resources. Future work will involve additional support for code-switching between languages <ref type="bibr" target="#b27">[28]</ref> for multi-lingual usage and improved user interaction features.</p><p>Learning analytics offers new opportunities to support effective human-AI collaboration by helping educators and students become aware of the processes involved in learning, in addition to the final products. In the current work, in addition to using AI to support their writing, users can also reflect on their usage of AI-generated suggestions using the CoAuthorViz visualisation and metrics. The creation of visual representations to study co-authorship behaviours and associated metrics opens up new avenues to investigate writing processes, such as studying user characteristics and collaboration dynamics among writers <ref type="bibr" target="#b4">[5]</ref>.
Analytics from generative AI can thus help close the learning analytics cycle by facilitating personalised and adaptive interventions <ref type="bibr" target="#b26">[27]</ref>. Writers may use this data to spot patterns and trends in their writing processes, such as whether they frequently employ AI for particular kinds of material. This language-agnostic approach can help improve students' writing practices and feedback-seeking behaviors when engaging with AI by helping them understand their own and AI's respective roles in the writing process, and aid researchers in studying these processes.</p><p>Future work can also develop feedback mechanisms that alert the user when overreliance on AI is observed in their writing, and build effective models for optimal human-AI collaboration in writing. The research thus provides useful insights from our prototype evaluation that can be extended to other languages to cater to diverse groups of audiences and their writing needs for language development, and can pave the way for more inclusive AI tools for education in the future.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Workflow and components of Tamil Co-Writer</figDesc><graphic coords="3,154.25,94.65,286.39,161.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Tamil Co-Writer user interface for input</figDesc><graphic coords="3,187.10,624.27,235.00,110.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Example of a CoAuthorViz writing session</figDesc><graphic coords="6,190.07,56.70,229.05,187.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="fig_3"><head>Table 3</head><label>3</label><figDesc>Co-authorship metrics for the sample writing session</figDesc><table><row><cell>Parameters</cell><cell>Value</cell></row><row><cell>Total number of sentences (SA)</cell><cell>21</cell></row><row><cell>Number of sentences completely authored by the writer (SB)</cell><cell>13</cell></row><row><cell>Number of sentences completely authored by GPT-2 (SC)</cell><cell>0</cell></row><row><cell>Number of sentences co-authored by GPT-2 and writer (SD)</cell><cell>8</cell></row><row><cell>GPT-2 dependence indicator (RA)</cell><cell>0</cell></row><row><cell>Autonomous writing indicator (RB)</cell><cell>0.61</cell></row><row><cell>Total GPT-2 usage in sentences (RC)</cell><cell>0.38</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 Hyperparameters for the fine-tuned Tamil ASR model</head><label>1</label><figDesc></figDesc><table><row><cell>Parameters</cell><cell>Value</cell></row><row><cell>Epochs</cell><cell>20</cell></row><row><cell>Batch size</cell><cell>16</cell></row><row><cell>Gradient accumulation steps</cell><cell>2</cell></row><row><cell>Evaluation strategy</cell><cell>Steps</cell></row><row><cell>Half-precision floating point format (FP16)</cell><cell>True</cell></row><row><cell>Save strategy</cell><cell>Epoch</cell></row><row><cell>Evaluation steps</cell><cell>100</cell></row><row><cell>Logging steps</cell><cell>10</cell></row><row><cell>Learning rate</cell><cell>0.0001</cell></row><row><cell>Attention dropout</cell><cell>0.1</cell></row><row><cell>Hidden dropout</cell><cell>0.1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 Parameter set for text generation in GPT-2 model</head><label>2</label><figDesc></figDesc><table><row><cell>Parameters</cell><cell>Value</cell></row><row><cell>Maximum length</cell><cell>30</cell></row><row><cell>Number of return sequence</cell><cell>5</cell></row><row><cell>Temperature</cell><cell>0.3</cell></row><row><cell>Top-p</cell><cell>1</cell></row><row><cell>No repeat ngram size</cell><cell>2</cell></row></table></figure>
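Table 2's parameter set corresponds to a standard sampling configuration for GPT-2 text generation. A sketch using the usual `transformers` `generate()` keyword names; the name mapping and the `do_sample` flag are our assumptions, while the values come from the table:

```python
# Table 2 decoding parameters, keyed by the usual generate() argument names
# (the name mapping and do_sample flag are assumptions; values are from the table).
generation_params = {
    "max_length": 30,            # maximum length of each continuation
    "num_return_sequences": 5,   # five suggestions returned per request
    "temperature": 0.3,          # low temperature: conservative sampling
    "top_p": 1.0,                # nucleus sampling effectively disabled
    "no_repeat_ngram_size": 2,   # forbid repeating any bigram
    "do_sample": True,           # temperature/top_p only apply when sampling
}

# Illustrative usage (requires transformers; not executed here):
#   outputs = model.generate(input_ids, **generation_params)
#   suggestions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```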
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Sparks: Inspiration for Science Writing using Language Models</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">I</forename><surname>Gero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chilton</surname></persName>
		</author>
		<idno type="DOI">10.1145/3532106.3533533</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 ACM Designing Interactive Systems Conference, DIS &apos;22</title>
				<meeting>the 2022 ACM Designing Interactive Systems Conference, DIS &apos;22<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1002" to="1019" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Metaphorian: Leveraging Large Language Models to Support Extended Metaphor Creation for Science Writing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Suh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xia</surname></persName>
		</author>
		<idno type="DOI">10.1145/3563657.3595996</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 ACM Designing Interactive Systems Conference, DIS &apos;23</title>
				<meeting>the 2023 ACM Designing Interactive Systems Conference, DIS &apos;23<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="115" to="135" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 CHI conference on human factors in computing systems</title>
				<meeting>the 2022 CHI conference on human factors in computing systems</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1" to="19" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Tamil language</title>
		<ptr target="https://en.wikipedia.org/wiki/Tamil_language" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
		<respStmt>
			<orgName>Wikipedia</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Visual representation of coauthorship with GPT-3: Studying human-machine interaction for effective writing</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shibani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rajalakshmi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Mattins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Selvaraj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Knight</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">16th International Conference on Educational Data Mining</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Feng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Käser</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Talukdar</surname></persName>
		</editor>
		<meeting><address><addrLine>Bengaluru, India</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Muralidaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Priyadharshini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)</title>
				<meeting>the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Khanuja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mehtani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Khosla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Gopalan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Margam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Aggarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Nagipogu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gupta</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.10730</idno>
		<title level="m">Muril: Multilingual representations for indian languages</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">AcaWriter: A learning analytics tool for formative feedback on academic writing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Knight</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shibani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Abel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gibson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ryan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sutton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wight</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sandor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kitto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Digital writing technologies in higher education: theory, research, and practice</title>
		<author>
			<persName><forename type="first">O</forename><surname>Kruse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rapp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Anson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Benetos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cotos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Devitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shibani</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>Springer Nature</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Natural Language Processing-Writing Analytics</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gibson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shibani</surname></persName>
		</author>
		<ptr target="https://solaresearch.org/wp-content/uploads/hla22/HLA22_Chapter_10_Gibson.pdf" />
		<editor>Charles Lang, George Siemens, Alyssa Friend Wise, Dragan Gašević, and Agathe Merceron</editor>
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>SoLAR</publisher>
			<biblScope unit="page" from="96" to="104" />
			<pubPlace>Vancouver, Canada</pubPlace>
		</imprint>
	</monogr>
	<note>2nd ed</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Analytic Techniques for Automated Analysis of Writing</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shibani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Digital Writing Technologies in Higher Education: Theory, Research, and Practice</title>
		<imprint>
			<biblScope unit="page" from="317" to="331" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 CHI conference on human factors in computing systems</title>
				<meeting>the 2022 CHI conference on human factors in computing systems</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Where to hide a stolen elephant: Leaps in creative writing with multimodal machine intelligence</title>
		<author>
			<persName><forename type="first">N</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bernal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Savchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">L</forename><surname>Glassman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Computer-Human Interaction</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="1" to="57" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A hybrid approach to co-creative story authoring using grammars and language models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Riddle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment</title>
				<meeting>the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">M</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Raheja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2204.03685</idno>
		<title level="m">Read, revise, repeat: A system demonstration for human-in-the-loop iterative text revision</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">How we use ai to enhance your writing, Grammarly Spotlight</title>
		<author>
			<persName><surname>Grammarly</surname></persName>
		</author>
		<ptr target="https://www.grammarly.com/blog/how-grammarly-uses-ai/" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Balachandran</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.05845</idno>
		<title level="m">Tamil-Llama: A New Tamil Language Model Based on Llama 2</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Automatic speech recognition errors detection and correction: A review</title>
		<author>
			<persName><forename type="first">R</forename><surname>Errattahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Hannani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ouahmane</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Procedia Computer Science</title>
		<imprint>
			<biblScope unit="volume">128</biblScope>
			<biblScope unit="page" from="32" to="37" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">QuillBot</title>
		<ptr target="https://quillbot.com/" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">wav2vec 2.0: A framework for self-supervised learning of speech representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="12449" to="12460" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd international conference on Machine learning</title>
				<meeting>the 23rd international conference on Machine learning</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<ptr target="https://commonvoice.mozilla.org/en" />
		<title level="m">Common Voice Mozilla</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<ptr target="https://cloud.google.com/translate?hl=en" />
		<title level="m">Google Cloud Translation AI</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Word error rate estimation for speech recognition: e-WER</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Renals</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 56th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">OpenAI blog</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page">9</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Lodge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bearman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dawson</surname></persName>
		</author>
		<author>
			<persName><surname>Associates</surname></persName>
		</author>
		<ptr target="https://www.teqsa.gov.au/guides-resources/resources/corporate-publications/assessment-reform-age-artificial-intelligence" />
		<title level="m">Assessment reform for the age of artificial intelligence</title>
				<imprint>
			<publisher>Australian Government Tertiary Education Quality and Standards Agency (TEQSA)</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Martinez-Maldonado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gašević</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.00087</idno>
		<title level="m">Generative Artificial Intelligence in Learning Analytics: Contextualising Opportunities and Challenges through the Learning Analytics Cycle</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Jose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Suryawanshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sherly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>McCrae</surname></persName>
		</author>
		<title level="m">2020 6th international conference on advanced computing and communication systems (ICACCS)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>A survey of current datasets for code-switching research</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
