<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Language dominance prediction in Spanish-English bilingual children using syntactic information: a first approximation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriela Ramirez-de-la-Rosa</string-name>
          <email>gabyrr,solorio@cis.uab.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Montes-y-Gomez</string-name>
          <email>mmontesg@inaoep.mx</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Liu</string-name>
          <email>yangl@hlt.utdallas.edu</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lisa Bedore</string-name>
          <email>lbedore@mail.utexas.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aquiles Iglesias</string-name>
          <email>iglesias@temple.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elizabeth Peña</string-name>
          <email>lizp@mail.utexas.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Temple University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Thamar Solorio, University of Alabama at Birmingham</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The University of Texas at Dallas</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Alabama at Birmingham</institution>
          ,
          <addr-line>INAOE</addr-line>
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>64</fpage>
      <lpage>69</lpage>
      <abstract>
        <p>This paper presents results of a preliminary study using syntactic information to predict language dominance in Spanish-English bilingual children. Our approach uses a bag of syntactic grammar rules taken from narratives in English and Spanish. We then measure the prediction accuracy of categorizing children as Spanish-dominant, English-dominant, or Balanced Bilingual. The results are competitive with previous work that used a much larger and more diverse set of features with shallow syntactic analysis. This paper shows the potential benefit of adding a deeper syntactic analysis for modeling language in young children, even when the language samples are mixed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In the field of communication disorders, the analysis of spontaneous language samples is a common practice for determining the language status of children. Typically, this involves a very expensive process of manually coding and analyzing these samples to find patterns that are known to be good clinical markers. For the analysis of language from monolingual children, especially English-speaking children, a vast and broad body of research supports the use of these clinical markers. For bilingual populations the literature is not as extensive, although it is steadily growing. One task that clinical researchers consider critical when analyzing language from bilingual children is the identification of language dominance. That is, in order to make final recommendations or a diagnosis, it is critical to know which of the two languages, if either, is more developed in the child. Recent research in communication disorders presents two approaches for determining language dominance in bilingual children, one based on measures of language exposure
        <xref ref-type="bibr" rid="ref1">(Bedore et al., 2010)</xref>
        and the other based on measures of language productivity
        <xref ref-type="bibr" rid="ref8">(Paradis et al., 2003)</xref>
        , although the former seems to be more widely accepted. However, determining language exposure requires asking parents and teachers to estimate the children's amount of input and output over a period of time, typically a week, since the children are not monitored 100% of the time.
      </p>
      <p>
        This research was partially supported by the National Science Foundation under grants 1018124 and 1017190, and by NIH NIDCD R01 grant DC007439. This work was also supported in part by the UPV, award 1932, under the program Research Visits for Renowned Scientists (PAID-02-11) and by the European Commission as part of the WIQ-EI project (project no. 269180) within the FP7 People Programme.
      </p>
      <p>
        Previous work by
        <xref ref-type="bibr" rid="ref12">Solorio et al. (2011)</xref>
        from the Natural Language Processing (NLP) community has looked at a corpus-driven approach to the problem of determining language dominance. They framed the problem as a text classification task, where the classes are the three potential language dominance categories: English dominant (ED), Spanish dominant (SD), and balanced bilingual (BB), and they extracted a large variety of features from the language samples to train a machine learning classifier. In this paper we follow the idea of using a machine learning algorithm, but the features we explore here are purely syntactic and were not explored in the work mentioned above. Our results show that deeper syntactic information carries rich, relevant content for the task of determining the language dominance of Spanish-English bilingual children. We extract features from the parse trees generated by off-the-shelf syntactic parsers for English and Spanish. Then we train a learning algorithm using the set of syntactic rules found in each transcript as features. We call this a bag of rules (BOR) approach. The accuracy obtained with our simple syntax-based features is higher than that of several of the feature sets presented in previous work. We speculate that combining this information with that in Solorio et al.'s paper can lead to even higher accuracies.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Previous work has used NLP techniques to help in the area of communication disorders.
        <xref ref-type="bibr" rid="ref3">Gabani et al. (2009)</xref>
        used six sets of features to build a computational model for predicting language impairment in monolingual English and Spanish-English bilingual children: language productivity, morphosyntactic skills, vocabulary knowledge, speech fluency, perplexities from language models (LMs), and standard scores. The best result reported in that work was an F-measure of around 60%. More recent work added three sets of features to the previous ones: demographic information, syntactic complexity, and POS n-grams were included to predict the dominant language in bilingual children
        <xref ref-type="bibr" rid="ref12">(Solorio et al., 2011)</xref>
        . This more recent work used some syntactic information as features, but only at the level of part-of-speech tags. The best result obtained in that work was an accuracy of 72%.
      </p>
      <p>
        NLP techniques have also been explored for the detection of mild cognitive impairment
        <xref ref-type="bibr" rid="ref11">(Roark, Mitchell, and Hollingshead, 2007)</xref>
        , where features such as Yngve and Frazier scores, together with features derived from automated parse trees, are used to model syntactic complexity. Similar features are used to classify language samples as belonging to children with autism, language impairment, or neither
        <xref ref-type="bibr" rid="ref10">(Prud'hommeaux et al., 2011)</xref>
        .
      </p>
      <p>The last two approaches inspired us to explore the use of information generated by automatically parsing the language samples. The features, as they are proposed here, have not been used in previous work. In this sense, the novelty of our study is the use of a representation analogous to bag of words that uses syntactic patterns extracted from parse trees. The next section describes our proposed method in more detail.</p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Approach</title>
      <p>The goal of the task is to predict the language dominance of a child as one of three core categories: BB (balanced bilingual), ED (English dominant), and SD (Spanish dominant). Since we want to streamline the process of language analysis as much as possible, we restrict the feature set to features that can be automatically extracted from the transcripts. Moreover, since previous work on automated language dominance prediction has not explored the use of parse trees, or features derived from parse trees, we study their contribution to developing an accurate model for this task. We expect that children at similar stages of language acquisition will have mastered a similar set of grammatical constructions and that this can be exploited by a learning algorithm. An interesting twist in this classification task is that we have language samples in each of the two languages. While it is widely accepted that in a bilingual population it is important to assess language ability in both languages, it is less clear how to do this in a machine learning scenario. Here, we explore different ways to combine the observed samples in both languages.</p>
      <p>The idea of this study is very simple. It consists of the following steps:</p>
      <list list-type="order">
        <list-item>
          <p>Automatically parsing the transcripts. In this step we generate a set of parse trees for each transcript using trained monolingual parsers. Because we lack gold-standard parse trees of bilingual child language, we assume that a parser trained mostly on adult language will not have a major negative effect on our proposed solution. It should be noted, however, that the noise in the parse trees comes not only from the differences between adult language constructs and those of children, but also from the mixed-language input. As explained in the following section, children are prompted to produce the language samples in one target language, but they frequently code-switch between their two languages. Our assumption is that the parser will make consistent decisions when unexpected tokens appear during analysis, so the noise from those elements will be added systematically to both training and test data and will not have a major effect on the accuracy of classification into language dominance categories. We do recognize that if a careful analysis of the parse trees is to be performed, adapting the parsers to both child language and mixed-language input might be needed.</p>
        </list-item>
        <list-item>
          <p>Finding rules. Using every parse tree for a transcript, we find each rule of the form α → β, where α is the root of a subtree and β is the set of children in that particular subtree. Because we are more interested in grammatical structure than in the actual vocabulary, we only add to the list those rules that do not involve a lexicon entry.</p>
        </list-item>
        <list-item>
          <p>Creating the representation of transcripts. Once we gather the lexicon of grammar rules fired in the training set, we use them as features to represent each transcript. This representation is analogous to BOW (bag of words), but instead of words we have rules; thus we refer to this representation as BOR (bag of rules). We also use standard Boolean weights for the rules. The intuition is that it is enough to observe a syntactic construct once to assume the child masters that construction.</p>
        </list-item>
        <list-item>
          <p>Training a model for language dominance prediction. Each transcript in the training set is transformed into a BOR vector. Then we use a standard machine learning algorithm to train a model. We thus assume that the problem of language dominance prediction can be cast as a classification problem.</p>
        </list-item>
        <list-item>
          <p>Classifying a child. To classify the language dominance of a new child, we transform the transcript into a vector of n dimensions, where n is the number of elements in the BOR, and the value of each dimension is either presence (1) or absence (0) of the specific rule. Then we use the trained model generated in the previous step to make a prediction for the new sample.</p>
        </list-item>
      </list>
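      <p>The representation steps above (building the rule lexicon from the training set and mapping a transcript to a Boolean vector) can be illustrated with a minimal sketch. The function names, the string form of the rules, and the Penn-Treebank-style labels are assumptions made for illustration only; they are not taken from the original implementation.</p>
      <preformat>
```python
from typing import Iterable


def build_rule_lexicon(training_rule_sets: Iterable[set[str]]) -> list[str]:
    """Collect every rule observed in the training transcripts, in sorted order."""
    lexicon: set[str] = set()
    for rules in training_rule_sets:
        lexicon.update(rules)
    return sorted(lexicon)


def to_bor_vector(rules: set[str], lexicon: list[str]) -> list[int]:
    """Boolean weights: 1 if the rule occurs in the transcript, 0 otherwise."""
    return [1 if rule in rules else 0 for rule in lexicon]


# Hypothetical rule sets, one per training transcript (labels are illustrative).
train_rules = [
    {"S -> NP VP", "NP -> DT NN", "VP -> VBD NP"},
    {"S -> NP VP", "NP -> PRP", "VP -> VBD"},
]
lexicon = build_rule_lexicon(train_rules)
train_vectors = [to_bor_vector(rules, lexicon) for rules in train_rules]

# A new transcript: a rule never seen in training ("PP -> IN NP") has no
# dimension in the lexicon and is therefore ignored.
new_vector = to_bor_vector({"S -> NP VP", "NP -> PRP", "PP -> IN NP"}, lexicon)
```
      </preformat>
      <p>The resulting vectors can then be passed, together with the dominance labels, to any standard classifier.</p>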
      <p>In the following section we describe the
data set used to evaluate our proposed
representation.</p>
    </sec>
    <sec id="sec-4">
      <title>Data</title>
      <p>
        The data set used in this paper contains transcripts gathered as part of an ongoing longitudinal study of language impairment in bilingual Spanish-English speaking children
        <xref ref-type="bibr" rid="ref9">(Peña et al., 2006)</xref>
        . The children in this study were enrolled in kindergarten, with a mean age of about 6 years and 1 month. A total of 180 children participated in the study; however, we only worked with 52 bilingual children, since the data for the rest of the children was not available for analysis at this point. Table 1 shows the distribution of our data.
      </p>
      <p>Category
Balanced Bilingual (BB)
English Dominant (ED)
Spanish Dominant (SD)</p>
      <p>
        The transcripts were gathered following standard procedures for the collection of spontaneous language samples in the field of communication disorders. For each child in the sample, four transcripts of story narratives were collected, two in each language. Children are shown a wordless picture book and are asked to narrate the story behind the book. The story narratives are based on Mayer's wordless picture books. The books used for English were A boy, a dog, and a frog
        <xref ref-type="bibr" rid="ref4">(Mayer, 1967)</xref>
        and Frog, where are you?
        <xref ref-type="bibr" rid="ref5 ref6">(Mayer, 1969b)</xref>
        . The books used for Spanish were Frog on his own
        <xref ref-type="bibr" rid="ref7">(Mayer, 1973)</xref>
        and Frog goes to dinner
        <xref ref-type="bibr" rid="ref5 ref6">(Mayer, 1969a)</xref>
        .
      </p>
    </sec>
    <sec id="sec-5">
      <title>Experimental Setting</title>
      <p>For extracting the parse trees we used FreeLing. This parser comes with trained models for English and Spanish. The output of FreeLing is a set of parse trees. We break down the parse trees into grammar rules by traversing each tree in a breadth-first fashion. We only add rules to the BOR vector that are composed of a root and its immediate children. Table 2 shows an example of a parse tree generated by FreeLing and the rules we extracted from it. Once we have the BORs we use them as features to represent the test transcripts. The value assigned to each rule in the vector is a Boolean weight, w<sub>i,j</sub>, which is one if rule i appears in transcript j and zero otherwise.</p>
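      <p>As an illustration of the breadth-first rule extraction described above, the following is a minimal sketch. The nested (label, children) tree format and the node labels are assumptions chosen for the example; they are not FreeLing's actual output format or tag set. Rules whose right-hand side contains a word are skipped, following the restriction to rules that do not involve a lexicon entry.</p>
      <preformat>
```python
from collections import deque


# Assumed tree format: an internal node is (label, [children]) and a leaf
# token (a word) is a plain string. This is not FreeLing's output format.
def extract_rules(tree):
    """Return the set of non-lexical rules 'parent -> child1 child2 ...'."""
    rules = set()
    queue = deque([tree])
    while queue:
        label, children = queue.popleft()
        # Keep the rule only if it has children and none of them is a word.
        if children and all(not isinstance(child, str) for child in children):
            rules.add(label + " -> " + " ".join(child[0] for child in children))
        # Enqueue only internal nodes; leaf tokens have no subtree to visit.
        queue.extend(child for child in children if not isinstance(child, str))
    return rules


example = ("S", [
    ("NP", [("DT", ["the"]), ("NN", ["frog"])]),
    ("VP", [("VBD", ["jumped"])]),
])
rules = extract_rules(example)
```
      </preformat>
      <p>For this toy tree the lexical rules (such as the one rewriting DT as "the") are discarded, and only the three structural rules remain.</p>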
      <p>FreeLing is available at the website: http://nlp.lsi.upc.edu/freeling</p>
      <p>As we mentioned in the previous section, we have 4 transcripts per child, but since our data set is small and we are using a corpus-driven approach, we decided to double the number of instances by separating the 4 transcripts per child into 2 pairs. We realize that this halves the amount of information we observe per child to train our model and to test prediction accuracy. However, in this case we believe it is more important to have more data samples for both training and evaluation. Moreover, clinicians and clinical researchers use one transcript per language for the most part, so this is also aligned with current practice. Despite this separation of transcripts per story, we were careful to put all transcripts of the same child in the same partition (training or test). That way we avoid confounding the ultimate goal of the task.</p>
      <p>To decide the language dominance of a particular child or instance we consider 2 transcripts, thus I = {T<sub>1</sub> ∪ T<sub>2</sub>}. Because we have 4 transcripts per child, we consider the following options for combining the transcripts:</p>
      <list list-type="bullet">
        <list-item>
          <p>One in English and one in Spanish</p>
        </list-item>
        <list-item>
          <p>Both in the same language (English or Spanish)</p>
        </list-item>
      </list>
      <p>These two combinations are selected to answer one question: what is more helpful for analyzing language ability in bilingual children, using information from two languages, or more input in a single language? We already know the answer to this question from the point of view of communication disorders, and we speculate that in this case as well the most beneficial scenario will be the one that uses information from both languages. But it is interesting to explore whether this pattern holds when using a machine learning algorithm to predict language dominance.</p>
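      <p>The two ways of combining transcripts can be sketched as follows, with each transcript represented by its set of rules and an instance formed as the union of two transcripts. The function name, the mode names, the pairing of the first English story with the first Spanish story, and the "en:"/"es:" prefixes are all assumptions made for this illustration; the text does not specify how the cross-language pairs were formed.</p>
      <preformat>
```python
def make_instances(english, spanish, mode):
    """Combine one child's transcripts (rule sets) into instances I = T1 | T2."""
    e1, e2 = english
    s1, s2 = spanish
    if mode == "cross":
        # One English and one Spanish transcript per instance (two instances).
        # Pairing by story order is an assumption of this sketch.
        return [e1 | s1, e2 | s2]
    if mode == "english":
        # Both transcripts in English (one instance).
        return [e1 | e2]
    if mode == "spanish":
        # Both transcripts in Spanish (one instance).
        return [s1 | s2]
    raise ValueError("unknown mode: " + mode)


# Hypothetical rule sets; the language prefixes keep rules from the two
# parsers apart and are an assumption of this sketch.
english = ({"en:S -> NP VP"}, {"en:S -> NP VP", "en:NP -> DT NN"})
spanish = ({"es:S -> sn grup-verb"}, {"es:sn -> espec grup-nom"})

cross_instances = make_instances(english, spanish, "cross")
english_instances = make_instances(english, spanish, "english")
```
      </preformat>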
      <p>
        To evaluate the performance of our method we used 5x2 cross-validation, following the recommendations in
        <xref ref-type="bibr" rid="ref2">(Dietterich, 1998)</xref>
        for small sample sets. That is, we performed 5 replications of 2-fold cross-validation; in each replication the available data was randomly partitioned into two equal-sized sets. In all our experiments we used the Weka
        <xref ref-type="bibr" rid="ref13">(Witten and Frank, 1999)</xref>
        implementation of the machine learning algorithms.
      </p>
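      <p>A minimal sketch of how 5x2 cross-validation partitions could be generated is given below. It splits at the level of children, so that all transcripts of a child fall in the same partition, as required earlier in this section. The function name, the fixed seed, and the use of integer child identifiers are assumptions for illustration; the experiments themselves were run with Weka.</p>
      <preformat>
```python
import random


def five_by_two_splits(child_ids, seed=0):
    """Yield (train_children, test_children) for 5 replications of 2-fold CV."""
    rng = random.Random(seed)
    children = sorted(set(child_ids))
    for _ in range(5):
        shuffled = children[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        fold_a, fold_b = set(shuffled[:half]), set(shuffled[half:])
        # Each half is used once for training and once for testing.
        yield fold_a, fold_b
        yield fold_b, fold_a


# 52 hypothetical child identifiers, matching the size of the data set.
splits = list(five_by_two_splits(range(52)))
```
      </preformat>
      <p>Each of the ten (train, test) pairs would be used to train and evaluate a classifier, and the accuracies averaged.</p>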
    </sec>
    <sec id="sec-6">
      <title>Experimental Results</title>
      <p>In our first experiment we wanted to determine whether it is possible to learn to distinguish between the three categories by taking into account language samples in only one language. To provide a fair comparison with the setting that uses samples from each language, we took the two samples in the same language from each child. Thus we have two scenarios in this experiment: English-English and Spanish-Spanish. Table 3 shows the accuracy using five of the most common classification methods used in NLP problems: Naive Bayes, Support Vector Machines, C4.5, and k-Nearest Neighbors with k = 1 and k = 5.</p>
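      <table-wrap>
        <label>Table 3</label>
        <table>
          <thead>
            <tr>
              <th />
              <th>NB</th>
              <th>SVM</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Eng.</td>
              <td>45.9</td>
              <td>49.62</td>
            </tr>
            <tr>
              <td>Spa.</td>
              <td>58.5</td>
              <td>55.6</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>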
      <p>
        The results shown are rather poor, but they are comparable to the results reported in
        <xref ref-type="bibr" rid="ref12">(Solorio et al., 2011)</xref>
        on the same data set when using individual sets of features, even though that work uses information from both languages. Their reported accuracy ranges from 40%, when using only demographic information, to 72%, when using different metrics of syntactic complexity. However, direct comparisons are not possible since they used a leave-one-out cross-validation setting.
      </p>
      <p>We next test our hypothesis that combining information from both languages is better than looking at only one language. In this setting we used two transcripts per child, one for English and one for Spanish. Table 4 shows the results of this setting for the same 5 classification methods used in the previous experiment. Accuracy improves by up to 10% relative to the first experiment.</p>
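      <table-wrap>
        <label>Table 4</label>
        <table>
          <thead>
            <tr>
              <th />
              <th>NB</th>
              <th>SVM</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Eng. &amp; Spa.</td>
              <td>63.3</td>
              <td>67.8</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>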
      <p>
        As we mentioned in the related work section, the closest work that predicted language dominance and used the same data set of transcripts
        <xref ref-type="bibr" rid="ref12">(Solorio et al., 2011)</xref>
        reports an accuracy of 72%. However, they used 9 types of features measuring different dimensions of language combined with some demographic information, and the only type of syntactic information used in that work was at the level of POS n-grams. In this paper we used only the syntactic information extracted from parsing the transcripts, in a BOR representation. While our results are slightly below previous results, they are still relevant in that they show that this syntactic information is valuable and can outperform other feature types from previous work, including speech fluency measures, language productivity measures, demographic information, morphosyntactic features, speaking rate, and n-grams of POS. We believe that combining this BOR representation with the features used in
        <xref ref-type="bibr" rid="ref12">(Solorio et al., 2011)</xref>
        can boost accuracy further.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and Future Work</title>
      <p>We proposed a representation based on a bag of rules from parse trees for the problem of predicting language dominance in Spanish-English bilingual children. Our results show that combining information from transcripts in both languages yields the best results. This study also shows that syntactic information is important for language analysis, even though there could be a considerable amount of noise in the parse trees from mixed-language input as well as from child language.</p>
      <p>The results obtained are comparable to those of recent work on the same problem, but unlike that work we look at only one dimension of language. We only extract features derived from syntactic trees, while previous work looks at vocabulary, language production, fluency, and measures of readability, among others. We predict that adding this dimension to previous work will help achieve higher prediction accuracy.</p>
      <p>As future work we want to explore other syntactic information that can be extracted from the parse trees to build a more robust language model that can improve the results achieved so far. We are also working on the use of different weighting schemes for the rules, such as TF-IDF and the entropy of the grammar rules.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Bedore</surname>, <given-names>Lisa M.</given-names></string-name>,
          <string-name><surname>Peña</surname>, <given-names>Elizabeth D.</given-names></string-name>,
          <string-name><surname>Gillam</surname>, <given-names>Ron B.</given-names></string-name>, and
          <string-name><given-names>Tsunghan</given-names> <surname>Ho</surname></string-name>.
          <year>2010</year>.
          <article-title>Language sample measures and language ability in Spanish-English bilingual kindergarteners</article-title>.
          <source>Journal of Communication Disorders</source>,
          <volume>43</volume>(<issue>6</issue>):<fpage>498</fpage>–<lpage>510</lpage>, Nov-Dec.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Dietterich</surname>, <given-names>Thomas G.</given-names></string-name>
          <year>1998</year>.
          <article-title>Approximate statistical tests for comparing supervised classification learning algorithms</article-title>.
          <source>Neural Computation</source>,
          <volume>10</volume>(<issue>7</issue>):<fpage>1895</fpage>–<lpage>1924</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Gabani</surname>, <given-names>Keyur</given-names></string-name>, Melissa Sherman, Thamar Solorio, Yang Liu, Lisa M. Bedore, and Elizabeth D. Peña.
          <year>2009</year>.
          <article-title>A corpus-based approach for the prediction of language impairment in monolingual English and Spanish-English bilingual children</article-title>.
          <source>In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT) 2009</source>, pages
          <fpage>46</fpage>–<lpage>55</lpage>, Boulder, Colorado, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Mayer</surname>
          </string-name>
          , Mercer.
          <year>1967</year>
          .
          <article-title>A boy, a dog, and a frog</article-title>
          . Dial Press, New York, NY.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Mayer</surname>
          </string-name>
          , Mercer. 1969a.
          <article-title>Frog goes to dinner</article-title>
          . Dial Press, New York, NY.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Mayer</surname>
          </string-name>
          , Mercer. 1969b.
          <article-title>Frog, where are you?</article-title>
          Dial Press, New York, NY.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Mayer</surname>
          </string-name>
          , Mercer.
          <year>1973</year>
          .
          <article-title>Frog on his own</article-title>
          . Dial Press, New York, NY.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Paradis</surname>, <given-names>Johanne</given-names></string-name>, Martha Crago, Fred Genesee, and Mabel Rice.
          <year>2003</year>.
          <article-title>French-English bilingual children with SLI: How do they compare with their monolingual peers?</article-title>
          <source>Journal of Speech, Language, and Hearing Research</source>,
          <volume>46</volume>:<fpage>113</fpage>–<lpage>127</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Peña</surname>, <given-names>Elizabeth D.</given-names></string-name>, Lisa M. Bedore, Ronald B. Gillam, and Thomas Bohman.
          <year>2006</year>.
          <article-title>Diagnostic markers of language impairment in bilingual children</article-title>. Grant awarded by the NIDCD, NIH.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name><surname>Prud'hommeaux</surname>, <given-names>Emily T.</given-names></string-name>, Brian Roark, Lois M. Black, and Jan van Santen.
          <year>2011</year>.
          <article-title>Classification of atypical language in autism</article-title>.
          <source>In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics</source>, pages
          <fpage>88</fpage>–<lpage>96</lpage>, Portland, Oregon, USA, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Roark</surname>, <given-names>Brian</given-names></string-name>, Margaret Mitchell, and Kristy Hollingshead.
          <year>2007</year>.
          <article-title>Syntactic complexity measures for detecting mild cognitive impairment</article-title>.
          <source>In BioNLP 2007: Biological, translational, and clinical language processing</source>, pages
          <fpage>1</fpage>–<lpage>8</lpage>, Prague, June. ACL.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name><surname>Solorio</surname>, <given-names>Thamar</given-names></string-name>, Melissa Sherman, Yang Liu, Lisa Bedore, Elizabeth Peña, and Aquiles Iglesias.
          <year>2011</year>.
          <article-title>Analyzing language samples of Spanish-English bilingual children for the automated prediction of language dominance</article-title>.
          <source>Natural Language Engineering</source>,
          <volume>17</volume>:<fpage>367</fpage>–<lpage>395</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name><surname>Witten</surname>, <given-names>Ian H.</given-names></string-name> and Eibe Frank.
          <year>1999</year>.
          <article-title>Data Mining, Practical Machine Learning Tools and Techniques with Java Implementations</article-title>. Morgan Kaufmann, San Francisco, CA.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>