Detecting Age-Related Linguistic Patterns in Dialogue: Toward Adaptive Conversational Systems

Lennert Jansen1, Arabella Sinclair1, Margot J. van der Goot2, Raquel Fernández1, Sandro Pezzelle1
1 Institute for Logic, Language and Computation (ILLC), University of Amsterdam
2 Amsterdam School of Communication Research (ASCoR), University of Amsterdam
lennertjansen95@gmail.com
{a.j.sinclair|m.j.vandergoot|raquel.fernandez|s.pezzelle}@uva.nl

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This work explores an important dimension of variation in the language used by dialogue participants: their age. While previous work showed differences at various linguistic levels between age groups when experimenting with written discourse data (e.g., blog posts), previous work on dialogue has largely been limited to acoustic information related to voice and prosody. Detecting fine-grained linguistic properties of human dialogues is of crucial importance for developing AI-based conversational systems that are able to adapt to their human interlocutors. We therefore investigate whether, and to what extent, current text-based NLP models can detect such linguistic differences, and what the features driving their predictions are. We show that models achieve fairly good performance on age-group prediction, though the task appears to be more challenging than in discourse. Through an in-depth analysis of the best models' errors and the most predictive cues, we show that, in dialogue, differences among age groups mostly concern stylistic and lexical choices. We believe these findings can inform future work on developing controlled generation models for adaptive conversational systems.

  age 19-29
  A: oh that's cool
  B: different sights and stuff
  A: oh

  age 50+
  A: well quite and I'd have to come back as well
  B: that's of course
  A: and make up for you know

Figure 1: Example dialogue snippets from speakers of different age groups in the British National Corpus. We conjecture that stylistic and lexical differences between age groups can be detected. Here, we experiment at the level of the utterance.

1 Introduction

Research on developing conversational agents has experienced impressive progress, particularly in recent years (McTear, 2020). However, artificial systems that can tune their language to that of a particular individual or group of users continue to pose more of a challenge. Recent examples of this line of research include adaptation at the level of style (Ficler and Goldberg, 2017), persona-specific traits (Zhang et al., 2018), or other traits such as sentiment (Dathathri et al., 2020).

Personalised interaction is of crucial importance to obtain systems that can be trusted by users and perceived as natural (van der Goot and Pilgrim, 2019), but most of all to be accessible to varying user profiles, rather than targeted at one particular user group (Zheng et al., 2019; Zeng et al., 2020).

In this work, we focus on one particular aspect that may influence conversational agent success: user age profile. We investigate whether the linguistic behaviour of conversational participants differs across age groups, using state-of-the-art NLP models on purely textual data, without considering vocal cues. We aim to detect age from characteristics of language use and adapt to this signal, rather than work from ground-truth metadata about user demographics. This is in the interest of preserving privacy, and reflects the perspective that while age and language use may be related, the relationship will not be linear (Pennebaker and Stone, 2003) and is subject to individual differences.
Previous work on age detection in dialogue has focused on speech features, which are known to vary systematically across age groups. For example, Wolters et al. (2009) learn logistic regression age classifiers from a small dialogue dataset using various acoustic cues supplemented with a small set of hand-crafted lexical features, while Li et al. (2013) develop SVM classifiers using acoustic and prosodic features extracted from scripted utterances spoken by participants interacting with an artificial system. In contrast to this line of work, we investigate whether different age groups can be detected from textual linguistic information rather than voice-related cues. We explore whether, and to what extent, various state-of-the-art NLP models are able to capture such differences in dialogue data, as a preliminary step toward age-group adaptation by conversational agents.

We build on the work of Schler et al. (2006), who focus on age detection in written discourse using a corpus of blog posts. The authors learn a Multi-Class Real Winnow classifier leveraging a set of pre-determined style- and content-based features, including part-of-speech categories, function words, and the 1000 unigrams with the highest information gain in the training set. They find that content features (lexical unigrams) yield higher accuracy (74%) than style features (72%), while their best results (76.2%) are obtained with their combination. We extend this investigation in several key ways: (1) we leverage state-of-the-art NLP models that allow us to learn representations end-to-end, without the need to specify concrete features in advance; (2) we apply this approach to dialogue data, using a large-scale dataset of transcribed, spontaneous open-domain dialogues, and also use it to replicate the experiments of Schler et al. (2006) on discourse; (3) we show that text-based models can indeed detect age-related differences, even in the case of very sparse signal at the level of dialogue utterances; and finally (4) we carry out an in-depth analysis of the models' predictions to gain insight into which elements of language use are most informative.1

Our work can be considered a first step toward the modeling of age-related linguistic adaptation by AI conversational systems. In particular, our results can inform future work on controlled text generation for dialogue agents (Dathathri et al., 2020; Madotto et al., 2020).

1 Code and data available at: https://github.com/lennertjansen/detecting-age-in-dialogue

2 Data

We use a dataset of dialogue data where information about the age of the speakers involved in the conversation is available (see the dialogue snippets in Figure 1), i.e., the spoken partition of the British National Corpus (Love et al., 2017). This partition includes spoken informal open-domain conversations between people, collected between 2012 and 2016 via crowd-sourcing and then recorded and transcribed by the corpus creators. Dialogues can be between two or more interlocutors, and are annotated along several dimensions, including age and gender, together with geographic and social indicators. Speaker ages are categorized in ten brackets: 0-10, 11-18, 19-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, and 90-99.

We focus on conversations that took place between two interlocutors, and only consider dialogues between people of the same age group. We then restrict our investigation to a binary opposition: younger vs. older age group.
We split the dialogues into their constituent utterances (e.g., from each dialogue snippet in Figure 1 we extract three utterances), and further pre-process them by removing non-alphabetical characters. Only samples that are not empty after pre-processing are kept. For the younger group, we consider the 19-29 bracket, which contains 138,662 utterances. For the older group, we merge conversations from five brackets: 50-59, 60-69, 70-79, 80-89, and 90-99 (hence, 50+), which sum up to a total of 33,641 utterances. The choice of grouping these brackets is a trade-off between experimenting with fairly distinct age groups (the age difference between them is at least 20 years) and obtaining large-enough data for each of them.

We randomly sample 33,641 utterances from the 19-29 group in order to experiment with a balanced number of samples per group. The resulting dataset, which we use for our experiments, includes around 67K utterances with an average length of 11.7 tokens. Descriptive statistics are given in Table 1.

  age    | #samples | #tokens | mean L (± sd) | min-max L
  19-29  | 33,641   | 381,195 | 11.3 (±15.98) | 1-423
  50+    | 33,641   | 406,157 | 12.1 (±21.62) | 1-1246
  all    | 67,282   | 787,352 | 11.7 (±19.0)  | 1-1246

Table 1: Descriptive statistics of the dataset. L means length, i.e., number of tokens in a sample.
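To make the pre-processing concrete, the following is a minimal sketch of the filtering and balancing steps described above. The lowercasing, the pandas-based bookkeeping, and the column names (utterance, age_group) are our own assumptions for illustration, not details from the released code.

```python
import re
import pandas as pd

def clean_utterance(text: str) -> str:
    """Keep only alphabetical characters and spaces, then collapse whitespace.
    (Lowercasing is our assumption, not stated in the paper.)"""
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical dataframe with one row per utterance and its speaker's age group.
df = pd.DataFrame({
    "utterance": ["oh that's cool!", "well quite, and I'd have to come back as well"],
    "age_group": ["19-29", "50+"],
})

df["utterance"] = df["utterance"].apply(clean_utterance)
df = df[df["utterance"].str.len() > 0]   # drop samples that are empty after cleaning

# Balance the two classes by downsampling the larger group.
n = df["age_group"].value_counts().min()
df = df.groupby("age_group").sample(n=n, random_state=42)
```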
3 Method

We frame the problem as a binary classification task: given some text, we seek to predict whether the age class of its speaker is younger or older.

3.1 Models

We experiment with various models, which we briefly describe below. Details on model training and evaluation are given at the end of the section.

n-gram. Our simplest models are based on n-grams, which have the advantage of being highly interpretable. Each data entry (i.e., a dialogue utterance) is split into all possible contiguous sequences of n tokens. The resulting vectorized features are used by a logistic regression model to estimate the odds of a text sample belonging to a certain age group. We experiment with unigram, bigram, and trigram models: a bigram model uses unigrams and bigrams, and a trigram model unigrams, bigrams, and trigrams.
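For concreteness, here is a minimal sketch of the trigram classifier using scikit-learn; the CountVectorizer defaults are our assumption, while the cumulative n-gram range, the logistic regression classifier, and the L-BFGS solver with up to 10^6 iterations follow the description in this section and under Experimental details below. Note that with only two classes, the One-vs-Rest training mentioned there reduces to plain binary logistic regression.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Trigram model: unigrams, bigrams, and trigrams as features (ngram_range=(1, 3)),
# fed to a logistic regression classifier trained with L-BFGS.
trigram_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),
    LogisticRegression(solver="lbfgs", max_iter=int(1e6)),
)

# Hypothetical toy data; labels are the two age groups.
texts = ["oh that's cool", "well quite and I'd have to come back as well"]
labels = ["19-29", "50+"]
trigram_clf.fit(texts, labels)
print(trigram_clf.predict(["that's so cool"]))
```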
LSTM and BiLSTM. We use a standard Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) with two layers, embedding size 512, and hidden layer size 1024. Batch-wise padding is applied to variable-length sequences. We also use the original model's bidirectional extension, the bidirectional LSTM (BiLSTM) (Schuster and Paliwal, 1997). Padding is similarly applied to this model, and the following optimal architecture is found experimentally: embedding size 64, 2 layers, and hidden layer size 512. Both RNN models are found to perform optimally with a learning rate of 10^-3.
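For illustration, the following is a minimal PyTorch sketch of the BiLSTM variant with the architecture reported above (embedding size 64, 2 layers, hidden size 512). How the final hidden states are pooled into a two-way classification layer is not specified in the paper, so that part is our assumption.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over token embeddings, followed by a linear layer
    mapping the concatenated final hidden states to the two age groups."""

    def __init__(self, vocab_size: int, emb_size: int = 64,
                 hidden_size: int = 512, num_layers: int = 2, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embedding(token_ids)            # (batch, seq_len, emb_size)
        _, (h_n, _) = self.lstm(emb)               # h_n: (num_layers * 2, batch, hidden)
        # Concatenate the last layer's forward and backward hidden states.
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # (batch, 2 * hidden)
        return self.fc(h)                          # unnormalized class logits
```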
BERT. We experiment with a Transformer-based model, i.e., BERT (Devlin et al., 2019). BERT is pre-trained to learn deeply bidirectional language representations from massive amounts of unlabeled textual data. We experiment with the base, uncased version of BERT in two settings: using its pre-trained frozen embeddings (BERTfrozen) and fine-tuning the embeddings on our age classification task (BERTFT). BERT embeddings are followed by dropout with probability 0.1 and a linear layer with input size 768.

Experimental details. The dataset is randomly split into a training (75%), validation (15%), and test (10%) set. Each model with a given configuration of hyperparameters is run 5 times with different random initializations. All models are trained on an NVIDIA TitanRTX GPU.

The n-gram models are trained in a One-vs-Rest (OvR) fashion and optimized using the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Liu and Nocedal, 1989), with a maximum of 10^6 iterations. The n-gram models are trained until convergence or for the maximum number of iterations.

LSTMs and BERT models are optimized using Adam (Kingma and Ba, 2015), and trained for 10 epochs with an early stopping patience of 3 epochs. The RNN-based models' embeddings are trained jointly with the classifier, and optimal hyperparameters (i.e., learning rate, embedding size, hidden layer size, and number of layers) are determined on the validation set via a guided grid-search. BERTFT is fine-tuned for 10 epochs, or until the early stopping criterion, monitored on the validation set, is met. BERT has a maximum input length of 512 tokens; sequences exceeding this length are truncated.
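A sketch of the BERTFT setting is given below, using the Hugging Face Transformers library, which we assume here for illustration; BertForSequenceClassification matches the head described above (dropout with probability 0.1 and a linear layer over the 768-dimensional pooled output). The learning rate is a placeholder, as it is not reported in this section.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# bert-base-uncased with a sequence-classification head: dropout (p=0.1) plus
# a linear layer over the 768-dim pooled output, as described above.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # lr is a placeholder

batch = tokenizer(["oh that's cool", "well quite"], padding=True,
                  truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([0, 1])  # 0 = 19-29, 1 = 50+

outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()
optimizer.step()
```

For the BERTfrozen setting, one would instead freeze the encoder (e.g., setting requires_grad=False on model.bert.parameters()) so that only the classification head is trained.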
4 Results

We report accuracy and F1 for each age group in Table 2. As can be seen, the performance of all models is well beyond chance level, which indicates that age-related linguistic differences can be detected, to some extent, even by a simple model based on unigrams. At the same time, BERT fine-tuned on the task turns out to be the best-performing model in terms of both accuracy (0.729) and F1 scores, which confirms the effectiveness of Transformer-based representations for encoding fine-grained linguistic differences. However, it can be noted that the model based on trigrams is basically on par with BERT in terms of accuracy (0.722), and well above both the LSTM and BiLSTM models (0.693 and 0.691, respectively). A similar pattern is observed for F1 scores, where BERTFT and the trigram model achieve comparable performance, with the LSTMs being overall behind.

  Model       | Accuracy      | F1 (19-29)    | F1 (50+)
  Random      | 0.500         | 0.500         | 0.500
  unigram     | 0.701 (0.007) | 0.708 (0.009) | 0.693 (0.004)
  bigram      | 0.719 (0.002) | 0.724 (0.003) | 0.714 (0.003)
  trigram     | 0.722 (0.001) | 0.727 (0.003) | 0.717 (0.001)
  LSTM        | 0.693 (0.003) | 0.696 (0.005) | 0.691 (0.007)
  BiLSTM      | 0.691 (0.009) | 0.702 (0.017) | 0.679 (0.007)
  BERTfrozen  | 0.675 (0.003) | 0.677 (0.008) | 0.673 (0.010)
  BERTFT      | 0.729 (0.002) | 0.730 (0.011) | 0.727 (0.010)

Table 2: Test set results averaged over 5 random initializations. Format: average metric (standard error).

Overall, our results indicate that text-based models are effective, to some extent, at predicting the age group of a speaker involved in a dialogue. This complements previous evidence that age-related features can be detected in discourse (Schler et al., 2006), and shows that in dialogue the task appears to be somewhat more challenging: the improvement in accuracy with respect to the majority/random baseline is lower in our dialogue results (+22.9%) than what was observed in discourse both by Schler et al. (2006) (+32.4%) and by us (+27%) when replicating their study using the models and experimental setup described in Section 3.1. Similarly to dialogue, BERTFT achieves the highest results in discourse (0.742); in contrast, both LSTMs (0.663) and n-grams (0.625) significantly lag behind it. Note that, although based on the same corpus of texts, i.e., the Blog Authorship Corpus,2 and the same 3 age groups, i.e., 13-17, 23-27, and 33+, our replicated results are not fully comparable to those of Schler et al. (2006): due to our more cautious data pre-processing, we experiment with more samples than they do (677K vs. 511K), which in turn leads to a different majority baseline.

There can be several reasons why age-group detection is more challenging in dialogue than in discourse. For example, in dialogue there may be dimensions of variation, such as turn-taking patterns, that are not captured by our models and experimental setup. Yet, the present results do reveal a few interesting insights. In particular, the very good performance of the trigram model suggests that leveraging 'local' linguistic features captured by n-grams is extremely effective in dialogue. This could indicate that differences among age groups lie at the level of local lexical constructions. This deserves further analysis, which we carry out in the next section.

2 The corpus contains blog posts that appeared on https://www.blogger.com, gathered in or before August 2004.

5 Analysis

We compare the two best-performing models, i.e., BERTFT and the trigram model, and aim to shed light on what cues they use to solve the task. We first compare the prediction patterns of the two models, which allows us to detect easy and hard examples. Second, we focus on the trigram model and report the n-grams that turn out to be most informative for distinguishing between age groups.

5.1 Comparing Model Predictions

We split the data for analysis according to whether both models make the same correct or incorrect prediction, or whether they differ. Table 3 shows the breakdown of these results. As can be seen, a rather large fraction of samples is correctly classified by both models (63.17%), while in 19.78% of cases neither of the models makes a correct prediction. The remaining cases are almost evenly split between those where only one of the two is correct. As shown in Figure 2, the 19-29 age group appears to be slightly easier than the 50+ group, where models make more errors.

  case                  | % cases | avg. length (±std)
  both correct          | 63.17%  | 13.51 (±18.98)
  both wrong            | 19.78%  | 5.82 (±8.33)
  only trigram correct  | 7.91%   | 10.44 (±11.66)
  only BERT correct     | 9.14%   | 11.53 (±12.12)

Table 3: Percentage of (non-)overlapping (in)correctly predicted cases between trigram and BERTFT. Utterance length is measured in tokens.

Figure 2: Distribution of predicted cases by trigram and BERTFT models, split by age group.

To qualitatively inspect what the utterances falling into these classes look like, in Table 4 we show a few cherry-picked cases for each age group. We notice that, not surprisingly, both models have trouble with backchanneling utterances consisting of a single word, such as yeah, mm, or really?, which are used by both age groups. For example, both models seem to consider yeah a 'young' cue, which leads to wrong predictions when yeah is used by a speaker in the 50+ group. As for the utterance really?, BERTFT assigns it to the 50+ group, while the trigram model makes the opposite prediction. This indicates that certain utterances simply do not contain sufficient distinguishing information, and model predictions based on them should therefore not be considered reliable. This seems to be particularly the case for short utterances. Indeed, comparing the average length of the utterances incorrectly classified by both models (rightmost column of Table 3), we notice that they are much shorter than those belonging to the other cases. This is interesting, and indicates a key challenge in the analysis of dialogue data: on average, shorter utterances contain less signal. On the other hand, short utterances can provide rich conversational signal in dialogue, for example backchanneling, exclamations, or other acknowledging acts. As a consequence, using length alone as a filter is not an appropriate approach, as it can remove aspects of language use that are key to differentiating speaker groups.

  age    | both correct                     | both wrong            | only BERTFT correct                  | only trigram correct
  19-29  | I don't know?                    | sounded crazy         | that's a lot of people for one house | yeah okay really?
  19-29  | yeah well there you go           | oh                    | I'm not very good at that            | I've got a pen I've got a pen
  19-29  | do you have exams again?         | mm                    | empty promises isn't it?             | day of death and ice-cream
  50+    | and as I say                     | yeah                  | really?                              | well if I were you
  50+    | yes that would be controversial  | yeah                  | it seems to                          | that's it
  50+    | oh really?                       | he's got that already | that we caused it                    | oh I thought you said Godzilla

Table 4: Examples where both models are correct/wrong or only BERTFT/trigram is correct.

5.2 Most Informative N-grams

Analyzing the most informative n-grams used by the trigram model allows us to qualitatively compare the linguistic differences inherent to each age group. In Table 5 we report the top 15 n-grams per group. We find, firstly and intuitively, that colloquial language seems somewhat generational, with unigrams particularly indicative of younger speakers including words such as cool and massive, and, for older speakers, words like wonderful. These unigrams are both informative to the model and indicative of differences in both formality and 'slang' use across age groups.

These most informative n-grams also indicate differences in backchanneling use between age groups: younger speakers' language is more characterized by the use of um and hmm, while speakers in the older category will more likely use yes, right, and right right. Another feature of younger language apparent from these examples is the use of more informal language, which also extends to foul language, making up a portion of the most informative unigrams shown in Table 5.

Interestingly, while topic words make up many of the most informative n-grams for older speakers in Table 5, younger speakers are more defined by their use of slang words such as wanna, foul language, or adjectives such as cute, cool, and massive. A key finding of Schler et al. (2006) is that the sentiment of language plays an important role, something which some of the most informative n-grams suggest may also be true for the dialogue dataset. As Table 5 demonstrates, younger speakers use more dramatic language, such as negative foul words and the positive love, cute, and cool: all words with a strong connotative meaning. We believe that further inspection is needed to determine whether the same sentiment pattern holds in dialogue as has been reported for discourse.

         19-29                        50+
  coef.  | n-gram             coef.  | n-gram
  -3.20  | um                  2.37  | yes
  -2.84  | cool                2.12  | you know
  -2.58  | s**t                2.09  | wonderful
  -2.12  | hmm                 1.90  | how weird
  -2.09  | like                1.84  | chinese
  -2.02  | was like            1.73  | right
  -1.96  | love                1.71  | building
  -1.96  | as well             1.66  | right right
  -1.88  | as in               1.55  | so erm
  -1.84  | cute                1.43  | mm mm
  -1.82  | uni                 1.41  | cheers
  -1.79  | massive             1.39  | shed
  -1.79  | wanna               1.37  | pain
  -1.79  | f**k                1.36  | we know
  -1.72  | tut                 1.08  | yeah exactly

Table 5: Top 15 most informative n-grams per age group used by the trigram model. coef. is the coefficient (and sign) of the corresponding n-gram in the logistic regression model: the higher its absolute value, the more it shifts the odds of an utterance toward one of the two age groups. * indicates foul language.
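A ranking like the one in Table 5 can be read off a fitted scikit-learn pipeline such as the trigram sketch in Section 3.1; the snippet below assumes that pipeline (trigram_clf) and scikit-learn's default step naming. With the two labels sorted as scikit-learn sorts them, negative coefficients point toward 19-29 and positive ones toward 50+, matching the signs in Table 5.

```python
import numpy as np

# Assumes `trigram_clf` is the fitted CountVectorizer + LogisticRegression
# pipeline from the earlier sketch. Negative coefficients push predictions
# toward the first class (19-29), positive ones toward the second (50+).
vectorizer = trigram_clf.named_steps["countvectorizer"]
logreg = trigram_clf.named_steps["logisticregression"]

features = vectorizer.get_feature_names_out()
coefs = logreg.coef_[0]

order = np.argsort(coefs)
print("Most indicative of 19-29:", [(features[i], coefs[i]) for i in order[:15]])
print("Most indicative of 50+:  ", [(features[i], coefs[i]) for i in order[-15:][::-1]])
```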
6 Conclusion

We investigated whether, and to what extent, NLP models can detect age-related linguistic features in dialogue data. We showed that, in line with what we observed for discourse, state-of-the-art models are capable of doing so with reasonable accuracy, in particular when the dialogue fragment is long enough to contain discriminative signal. At the same time, we found that much simpler models based on n-grams achieve comparable performance, which suggests that, in dialogue, 'local' features can be indicative of the language of speakers from different age groups. We showed this to be the case, with both lexical and stylistic cues being informative to these models in this task.

While we performed the classification task at the level of single dialogue utterances, future work may take into account larger dialogue fragments, such as the entire dialogue or a fixed number of turns. This would make the setup more comparable to discourse, but would require making further experimental choices and dealing with extra computational challenges. Moreover, it could be tested whether the language used by a speaker is equally discriminative when talking to a same-age (this work) or a different-age interlocutor.

Finally, we believe our findings could inform future work on developing adaptive conversational systems. Since consistent language style differences were found between age groups (for example, at the level of exclamatives and acknowledgements), systems whose language generation capabilities aim to be consistent with a given age group should reproduce these patterns. This could be achieved, for example, by embedding one or more discriminative modules that control the generation of a system's output, which could lead to better, more natural interactions between human speakers and a conversational system.
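As a toy illustration of the kind of discriminative module envisioned here (our own sketch, not a method evaluated in this work), a fitted age classifier such as the trigram pipeline above could rerank a generator's candidate responses toward a target age group:

```python
def rerank_by_age_group(candidates: list[str], target_group: str, clf) -> list[str]:
    """Order candidate responses by how strongly a fitted age classifier
    (e.g., the trigram pipeline above) associates them with target_group."""
    idx = list(clf.classes_).index(target_group)
    scores = clf.predict_proba(candidates)[:, idx]
    return [c for _, c in sorted(zip(scores, candidates), reverse=True)]

# Toy usage with the trigram classifier sketched earlier.
candidates = ["that's so cool", "how wonderful, you know"]
print(rerank_by_age_group(candidates, target_group="50+", clf=trigram_clf))
```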
Acknowledgements

This work received funding from the University of Amsterdam's Research Priority Area Human(e) AI and from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 819455).

References

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104, Copenhagen, Denmark, September. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Ming Li, Kyu J. Han, and Shrikanth Narayanan. 2013. Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Computer Speech & Language, 27(1):151–167.

Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528.

Robbie Love, Claire Dembry, Andrew Hardie, Vaclav Brezina, and Tony McEnery. 2017. The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3):319–344.

Andrea Madotto, Etsuko Ishii, Zhaojiang Lin, Sumanth Dathathri, and Pascale Fung. 2020. Plug-and-play conversational models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2422–2433, Online, November. Association for Computational Linguistics.

Michael McTear. 2020. Conversational AI: Dialogue systems, conversational agents, and chatbots. Synthesis Lectures on Human Language Technologies, 13(3):1–251.

James W. Pennebaker and Lori D. Stone. 2003. Words of wisdom: Language use over the life span. Journal of Personality and Social Psychology, 85(2):291–301.

Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker. 2006. Effects of age and gender on blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, volume 6, pages 199–205.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Margot J. van der Goot and Tyler Pilgrim. 2019. Exploring age differences in motivations for and acceptance of chatbot communication in a customer service context. In International Workshop on Chatbot Research and Design, pages 173–186. Springer.

Maria Wolters, Ravichander Vipperla, and Steve Renals. 2009. Age recognition for spoken dialogue systems: Do we need it? In Tenth Annual Conference of the International Speech Communication Association (Interspeech).

Guangtao Zeng, Wenmian Yang, Zeqian Ju, Yue Yang, Sicheng Wang, Ruisi Zhang, Meng Zhou, Jiaqi Zeng, Xiangyu Dong, Ruoyu Zhang, Hongchao Fang, Penghui Zhu, Shu Chen, and Pengtao Xie. 2020. MedDialog: Large-scale medical dialogue datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9241–9250, Online, November. Association for Computational Linguistics.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia, July. Association for Computational Linguistics.

Yinhe Zheng, Guanyi Chen, Minlie Huang, Song Liu, and Xuan Zhu. 2019. Personalized dialogue generation with diversified traits. CoRR, abs/1901.09672.