
Language dominance prediction in Spanish-English bilingual children using syntactic information: a first approximation

Gabriela Ramirez-de-la-Rosa

gabyrr,solorio@cis.uab.edu 1

Manuel Montes-y-Gomez

mmontesg@inaoep.mx 4

Yang Liu

yangl@hlt.utdallas.edu 3

Lisa Bedore

lbedore@mail.utexas.edu 2

Aquiles Iglesias

iglesias@temple.edu 0

Elizabeth Peña

lizp@mail.utexas.edu 2

0 Temple University

1 Thamar Solorio, University of Alabama at Birmingham, USA

2 The University of Texas at Austin, USA

3 The University of Texas at Dallas, USA

4 University of Alabama at Birmingham, INAOE Mexico

2011

64 69

This paper presents results of a preliminary study using syntactic information to predict language dominance in Spanish-English bilingual children. Our approach uses a bag of syntactic grammar rules taken from narratives in English and Spanish. We then measure the prediction accuracy of categorizing children as Spanish-dominant, English-dominant, or Balanced Bilingual. The results are competitive with previous work that used a much larger and more diverse set of features with shallow syntactic analysis. This paper shows the potential benefit of adding a deeper syntactic analysis for modeling language in young children, even when the language samples are mixed.

This research was partially supported by the National Science Foundation under grants 1018124 and 1017190, and by NIH NIDCD R01 grant DC007439. This work was also supported in part by the UPV, award 1932, under the program Research Visits for Renowned Scientists (PAID-02-11) and by the European Commission as part of the WIQ-EI project (project no. 269180) within the FP7 People Programme.

In the field of communication disorders, the analysis of spontaneous language samples is a common practice to determine the language status of children. Typically, this involves a very expensive process of manually coding and analyzing these samples to find patterns that are known to be good clinical markers. For the analysis of language from monolingual children, especially English-speaking children, there is a vast amount and breadth of research that supports the use of these clinical markers. For bilingual populations, however, the literature is not as extensive, although it is steadily growing. One task considered critical by clinical researchers when analyzing language from bilingual children is the identification of language dominance. That is, in order to make final recommendations or a diagnosis, it has been found to be critical to know which of the two languages, if any, is more developed in the child. Recent research in communication disorders presents two approaches for determining language dominance in bilingual children, one based on measures of language exposure (Bedore et al., 2010) and the other based on measures of language productivity (Paradis et al., 2003), although the former seems to be more widely accepted. However, the exposure-based approach requires asking parents and teachers about the amount of language input and output of the children over a period of time, typically a week, since the children are not monitored 100% of the time.

Previous work by Solorio et al. (2011) from the Natural Language Processing (NLP) community has looked at a corpus-driven approach to the problem of determining language dominance. They framed the problem as a text classification task, where the classes are the three potential language dominance categories: English dominant (ED), Spanish dominant (SD), and balanced bilingual (BB), and they extracted a large variety of features from the language samples to train a machine learning classifier. In this paper we follow the idea of using a machine learning algorithm, but the set of features we explore is purely syntactic and was not explored in the work mentioned above. Our results show that deeper syntactic information carries rich, relevant content for the task of determining the language dominance of Spanish-English bilingual children. We extract features from the parse trees generated by off-the-shelf syntactic parsers for English and Spanish. Then we train a learning algorithm using the set of syntactic rules found in each transcript as features. We call this a bag of rules (BOR) approach. The accuracy obtained by our simple syntax-based features is higher than that of several of the feature sets presented in previous work. We speculate that combining this information with that in Solorio et al.'s paper can lead to even higher accuracies.

Related Work

Previous work has used NLP techniques to help in the area of communication disorders. Gabani et al. (2009) predicted language impairment in monolingual English and Spanish-English bilingual children using six sets of features to build a computational model: language productivity, morphosyntactic skills, vocabulary knowledge, speech fluency, perplexities from language models (LMs), and standard scores. The best result reported in that work was an F-measure of around 60%. More recent work added three sets of features to those: demographic information, syntactic complexity, and POS n-grams were included to predict the dominant language in bilingual children (Solorio et al., 2011). This more recent work added some syntactic information as features, but only at the level of part-of-speech tags. The best result obtained in that work was 72% accuracy.

NLP techniques have also been explored for the detection of mild cognitive impairment (Roark, Mitchell, and Hollingshead, 2007), where features such as Yngve and Frazier scores, together with features derived from automated parse trees, are used to model syntactic complexity. Similar features are used in the classification of language samples as belonging to children with autism, language impairment, or neither (Prud'hommeaux et al., 2011).

The last two approaches inspired us to explore the use of information generated by automatically parsing the language samples. The features, as they are proposed here, have not been used in previous work. In this sense, the novelty of our study is the use of a representation analogous to bag of words that uses syntactic patterns extracted from parse trees. The next section describes our proposed method in more detail.

Proposed Approach

The goal of the task is to classify the language dominance of a child into one of three core categories: BB (balanced bilingual), ED (English dominant), and SD (Spanish dominant). Since we want to streamline the process of language analysis as much as possible, we restrict the feature set to features that can be automatically extracted from the transcripts. Moreover, since previous work on automated language dominance prediction has not explored the use of parse trees, or features derived from parse trees, we study their contribution to developing an accurate model for this task. We expect that children at similar stages of language acquisition will have mastered a similar set of grammatical constructions and that this can be exploited by a learning algorithm. An interesting twist in this classification task is that we have language samples in each of the two languages. While it is widely accepted that in a bilingual population it is important to assess language ability in both languages, it is less clear how to do this in a machine learning scenario. Here, we explore different ways to combine the observed samples in both languages.

The idea of this study is very simple. It consists of the following steps:

1. Automatically parsing the transcripts. In this step we generate a set of parse trees for each transcript using trained monolingual parsers. Because we lack gold standard parse trees of bilingual child language, we assume that a parser trained on mostly adult language will not have a major negative effect on our proposed solution. However, it should be noted that the noise in the parse trees comes not only from the differences between adult language constructs and those of children, but also from the mixed language input. As explained in the following section, the language samples are elicited in one target language, but the children frequently code-switched between their two languages. Our assumption is that the parser will make consistent decisions when unexpected tokens appear during analysis, so the noise from those elements will be added systematically to both training and testing data and will not have a major effect on the accuracy of language dominance classification. We do recognize that if a careful analysis of the parse trees is to be performed, then adaptation of the parsers to both child language and mixed language input might be needed.

2. Finding rules. Using every parse tree for a transcript, we find each rule of the form $\alpha \rightarrow \beta$, where $\alpha$ is the root of a subtree and $\beta$ is the set of children in that particular subtree. Because we are more interested in grammatical structure than in the actual vocabulary, we only add to the list those rules not involving a lexicon entry.

3. Creating the representation of transcripts. Once we gather the lexicon of grammar rules fired in the training set, we use them as features to represent each transcript. This representation is analogous to BOW (bag of words), but instead of words we have rules; thus we refer to this representation as BOR (bag of rules). We also use standard Boolean weights for the rules. The intuition is that it is enough to observe a syntactic construct once to assume the child has mastered that construction.

4. Training a model for language dominance prediction. Each transcript in the training set is transformed into a BOR vector. Then we use a standard machine learning algorithm to train a model. We thus assume that the problem of language dominance prediction can be cast as a classification problem.

5. Classifying a child. To classify the language dominance of a new child, we transform the transcript into a vector of n dimensions, where n is the number of elements in the BOR, and the value of each dimension is either presence (1) or absence (0) of the specific rule. Then we use the trained model generated in the previous step to make a prediction for the new sample.
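Steps 2, 3, and 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the nested-tuple tree format, the function names, and the Penn-style labels in the toy tree are assumptions made for illustration, and the actual parser output and tagset will differ.

```python
from collections import deque

# Assumed tree format (hypothetical): an internal node is (label, [children]);
# a preterminal is (pos_tag, "word"). This is not the parser's native format.

def extract_rules(tree):
    """Collect non-lexical rules 'parent -> child1 child2 ...' breadth-first."""
    rules = set()
    queue = deque([tree])
    while queue:
        label, children = queue.popleft()
        if isinstance(children, str):
            # Preterminal over a word: a lexical rule, skipped on purpose.
            continue
        rules.add(label + " -> " + " ".join(child[0] for child in children))
        queue.extend(children)
    return rules

def build_rule_lexicon(training_transcripts):
    """Sorted list of every rule seen in the training transcripts."""
    lexicon = set()
    for trees in training_transcripts:
        for tree in trees:
            lexicon |= extract_rules(tree)
    return sorted(lexicon)

def to_bor_vector(trees, lexicon):
    """Boolean bag-of-rules vector: 1 if the rule occurs in the transcript."""
    seen = set()
    for tree in trees:
        seen |= extract_rules(tree)
    return [1 if rule in seen else 0 for rule in lexicon]

# Toy transcript with a single utterance: "the frog jumped"
toy_tree = ("S", [
    ("NP", [("DT", "the"), ("NN", "frog")]),
    ("VP", [("VBD", "jumped")]),
])
lexicon = build_rule_lexicon([[toy_tree]])
vector = to_bor_vector([toy_tree], lexicon)
```

Rules that appear only in a new transcript and not in the training lexicon are simply ignored by `to_bor_vector`, which matches the fixed n-dimensional representation described in step 5.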

In the following section we describe the data set used to evaluate our proposed representation.

Data

The data set used in this paper contains transcripts gathered as part of an ongoing longitudinal study of language impairment in bilingual Spanish-English speaking children (Peña et al., 2006). The children in this study were enrolled in kindergarten, with a mean age of about 6 years and 1 month. A total of 180 children participated in this study; however, we worked with only 52 bilingual children, since the data for the rest of the children was not available for analysis at this point. Table 1 shows the distribution of our data.

Table 1: Distribution of the data. Categories: Balanced Bilingual (BB), English Dominant (ED), Spanish Dominant (SD).

The transcripts were gathered following standard procedures for the collection of spontaneous language samples in the field of communication disorders. For each child in the sample, four transcripts of story narratives were collected, two in each language. Children are shown a wordless picture book and are asked to narrate the story behind the book. The story narratives are based on Mayer's wordless picture books. The books used for English were A boy, a dog, and a frog (Mayer, 1967) and Frog, where are you? (Mayer, 1969b). The books used for Spanish were Frog on his own (Mayer, 1973) and Frog goes to dinner (Mayer, 1969a).

Experimental Setting

For extracting the parse trees we used FreeLing (available at http://nlp.lsi.upc.edu/freeling). This parser comes with trained models for English and Spanish. The output of FreeLing is a set of parse trees. We break down the parse trees into grammar rules by traversing each tree in a breadth-first fashion. We only add rules to the BOR vector that are composed of a root and its immediate children. In Table 2 we show an example of a parse tree generated by FreeLing and the rules we extracted from it. Once we have the BORs we use them as features to represent the test transcripts. The value assigned to each rule in the vector is a Boolean weight $w_{i,j}$, which is one if rule i appears in transcript j, and zero otherwise.

As we mentioned in the previous section, we have 4 transcripts per child, but since our data set is small and we are using a corpus-driven approach, we decided to double the number of instances by separating the 4 transcripts per child into 2 pairs. We realize that this halves the amount of information we observe per child to train our model and to test prediction accuracy. However, in this case we believe it is more important to have more data samples for both training and evaluation. Moreover, clinicians and clinical researchers use one transcript per language for the most part, so this is also aligned with current practice. Despite this separation of transcripts per story, we were careful to put all transcripts of the same child in the same partition (training or test). That way we avoid confounding the ultimate goal of the task.

To decide the language dominance of a particular child or instance we consider 2 transcripts, thus $I = \{T_1 \cup T_2\}$. Because we have 4 transcripts per child, we consider the following options for combining the transcripts:

One in English and one in Spanish

Both in the same language (English or Spanish)
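The two combination options can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `make_instances`, the setting labels, the order in which transcripts are paired, and the language prefixes on the rule strings (used here to keep English and Spanish rules apart) are all hypothetical and are not taken from the experimental setup.

```python
def make_instances(child_rules, setting):
    """Build instances as the union of the rule sets of two transcripts.

    child_rules maps a language code ("en", "es") to a list of two rule
    sets, one per transcript. The pairing order is an assumption.
    """
    en1, en2 = child_rules["en"]
    es1, es2 = child_rules["es"]
    if setting == "en-es":
        # One English and one Spanish transcript per instance: two instances.
        pairs = [(en1, es1), (en2, es2)]
    elif setting == "en-en":
        # Both transcripts in English: one instance.
        pairs = [(en1, en2)]
    elif setting == "es-es":
        # Both transcripts in Spanish: one instance.
        pairs = [(es1, es2)]
    else:
        raise ValueError("unknown setting: " + setting)
    return [t1 | t2 for t1, t2 in pairs]

# Toy child with made-up rule sets for its four transcripts.
child = {
    "en": [{"en:S -> NP VP"}, {"en:VP -> VBD"}],
    "es": [{"es:S -> sn grup-verb"}, {"es:sn -> espec grup-nom"}],
}
mixed = make_instances(child, "en-es")
english_only = make_instances(child, "en-en")
```

Each resulting rule set can then be mapped to a Boolean vector over the training lexicon in the same way as a single transcript.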

These two combinations are selected to answer one question: what is more helpful for analyzing language ability in bilingual children, using information from two languages, or more input in a single language? We already know the answer to this question from the point of view of communication disorders, and we speculate that in this case as well the most beneficial scenario will be the one using information from both languages. It is nevertheless interesting to explore whether this pattern holds when using a machine learning algorithm to predict language dominance.

To evaluate the performance of our method we used 5x2 cross-validation, following the recommendations in (Dietterich, 1998) for small sample sets. That is, we did 5 replications of 2-fold cross-validation; in each replication the available data was randomly partitioned into two equal-sized sets. In all our experiments we used the Weka (Witten and Frank, 1999) implementation of the machine learning algorithms.
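A minimal sketch of the 5x2 protocol, combined with the constraint that all transcripts of a child stay in the same partition, could look like the following. The function names, the `seed` parameter, and the `evaluate` callback are hypothetical; the experiments themselves were run in Weka, not with this code.

```python
import random

def five_by_two_splits(child_ids, seed=0):
    """Yield (train_children, test_children) for 5 replications of 2-fold CV.

    The split is made over children rather than instances, so every
    instance derived from one child lands in the same partition.
    """
    rng = random.Random(seed)
    ids = sorted(set(child_ids))
    for _ in range(5):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        fold_a, fold_b = shuffled[:half], shuffled[half:]
        # Each half serves once as training data and once as test data.
        yield fold_a, fold_b
        yield fold_b, fold_a

def mean_accuracy(child_ids, evaluate, seed=0):
    """Average the accuracy returned by evaluate(train, test) over 10 runs."""
    scores = [evaluate(train, test)
              for train, test in five_by_two_splits(child_ids, seed)]
    return sum(scores) / len(scores)
```

Here `evaluate` stands for any routine that trains a classifier on the instances of the training children and returns its accuracy on the instances of the test children.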

Experimental Results

In our first experiment we wanted to determine whether it is possible to learn to distinguish between the three categories by taking into account language samples in only one language. To provide a fair comparison with the setting that uses samples from each language, we took the two samples in the same language from each child. Thus we have two scenarios in this experiment: English-English and Spanish-Spanish. Table 3 shows the accuracy using five of the most common classification methods used in NLP problems: Naive Bayes, Support Vector Machines, C4.5, and k-Nearest Neighbors with k = 1 and k = 5.

Table 3: Accuracy (%) using two transcripts in the same language.

Classifier   Eng.    Spa.
NB           45.9    58.5
SVM          49.62   55.6

The results shown are rather poor, but they are comparable to the results reported in (Solorio et al., 2011) on the same data set when using individual sets of features, even though that work uses information from both languages. Their reported accuracy ranges from 40%, when using only demographic information, to 72%, when using different metrics of syntactic complexity. However, direct comparisons are not possible since they used a leave-one-out cross-validation setting.

Next we test our hypothesis that combining information from both languages is better than looking at only one language. In this setting we used two transcripts per child, one in English and one in Spanish. Table 4 shows the results of this setting for the same 5 classification methods used in the previous experiment. The results improve accuracy by up to 10% in relation to the first experiment.

Table 4: Accuracy (%) using one transcript in English and one in Spanish.

Classifier   Eng. & Spa.
NB           63.3
SVM          67.8

As we mentioned in the related work, the closest work that predicted language dominance using the same set of transcripts (Solorio et al., 2011) reports an accuracy of 72%. However, they used 9 types of features measuring different dimensions of language, combined with some demographic information, and the only type of syntactic information used in that work was at the level of POS n-grams. In this paper we used only the syntactic information extracted from parsing the transcripts, in a BOR representation. While our results are slightly below previous results, they are still relevant in that they show that this syntactic information is valuable and can outperform other feature types from previous work, including speech fluency measures, language productivity measures, demographic information, morphosyntactic features, speaking rate, and POS n-grams. We believe that combining this BOR representation with the features used in (Solorio et al., 2011) can boost accuracy further.

Conclusions and Future Work

We proposed a representation based on a bag of rules from parse trees for the problem of predicting language dominance in Spanish-English bilingual children. Our results show that combining information from transcripts in both languages yields the best results. This study also shows that syntactic information is important for language analysis, even though there could be a considerable amount of noise in the parse trees from mixed language as well as child language.

The results obtained are comparable to recent work on the same problem but, unlike that work, we look at only one dimension of language. We only extract features derived from syntactic trees, while previous work looks at vocabulary, language production, fluency, and measures of readability, among others. We predict that adding this dimension to previous work will help achieve higher prediction accuracy.

As future work we want to explore other syntactic information that can also be extracted from the parse trees to build a more robust language model that can improve the results achieved so far. We are also working on the use of different weighting schemes for the rules, such as TF-IDF, and on the entropy of the grammar rules.
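To make the TF-IDF alternative to Boolean weights concrete, the following Python sketch shows one common variant, with tf as the raw count of a rule in a transcript and idf = log(N / df). The paper does not specify which variant is intended, so the formula, the function name `tfidf_weights`, and the toy counts are assumptions.

```python
import math
from collections import Counter

def tfidf_weights(transcript_rule_counts):
    """Weight each rule by tf * log(N / df).

    transcript_rule_counts is a list with one Counter per transcript,
    mapping a rule to the number of times it occurs in that transcript.
    N is the number of transcripts and df the number of transcripts
    that contain the rule.
    """
    n = len(transcript_rule_counts)
    df = Counter()
    for counts in transcript_rule_counts:
        df.update(counts.keys())
    return [
        {rule: tf * math.log(n / df[rule]) for rule, tf in counts.items()}
        for counts in transcript_rule_counts
    ]

# Toy counts for two transcripts (made-up values).
counts = [
    Counter({"S -> NP VP": 3, "NP -> DT NN": 1}),
    Counter({"S -> NP VP": 2}),
]
weights = tfidf_weights(counts)
```

Under this variant, a rule that occurs in every transcript receives weight zero, so the weighting emphasizes constructions that only some children produce. It also requires counting rule occurrences rather than recording only their presence.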

References

Bedore, Lisa, Elizabeth Peña, Ron B. Gillam, and Tsunghan Ho. 2010. Language sample measures and language ability in Spanish-English bilingual kindergarteners. Journal of Communication Disorders, 43(6):498-510, Nov-Dec.

Dietterich, Thomas G. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1924.

Gabani, Keyur, Melissa Sherman, Thamar Solorio, Yang Liu, Lisa M. Bedore, and Elizabeth D. Peña. 2009. A corpus-based approach for the prediction of language impairment in monolingual English and Spanish-English bilingual children. In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT) 2009, pages 46-55, Boulder, Colorado, June. Association for Computational Linguistics.

Mayer, Mercer. 1967. A boy, a dog, and a frog. Dial Press, New York, NY.

Mayer, Mercer. 1969a. Frog goes to dinner. Dial Press, New York, NY.

Mayer, Mercer. 1969b. Frog, where are you? Dial Press, New York, NY.

Mayer, Mercer. 1973. Frog on his own. Dial Press, New York, NY.

Paradis, Johanne, Martha Crago, Fred Genesee, and Mabel Rice. 2003. French-English bilingual children with SLI: How do they compare with their monolingual peers? Journal of Speech, Language, and Hearing Research, 46:113-127.

Peña, Elizabeth, Lisa Bedore, Ronald B. Gillam, and Thomas Bohman. 2006. Diagnostic markers of language impairment in bilingual children. Grant awarded by the NIDCD, NIH.

Prud'hommeaux, Emily T., Brian Roark, Lois M. Black, and Jan van Santen. 2011. Classification of atypical language in autism. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages 88-96, Portland, Oregon, USA, June. Association for Computational Linguistics.

Roark, Brian, Margaret Mitchell, and Kristy Hollingshead. 2007. Syntactic complexity measures for detecting mild cognitive impairment. In BioNLP 2007: Biological, translational, and clinical language processing, pages 1-8, Prague, June. ACL.

Solorio, Thamar, Melissa Sherman, Yang Liu, Lisa Bedore, Elizabeth Peña, and Aquiles Iglesias. 2011. Analyzing language samples of Spanish-English bilingual children for the automated prediction of language dominance. Natural Language Engineering, 17:367-395.

Witten, Ian H. and Eibe Frank. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA.