Detecting Age-Related Linguistic Patterns in Dialogue: Toward Adaptive Conversational Systems

Lennert Jansen1, Arabella Sinclair1, Margot J. van der Goot2, Raquel Fernández1, Sandro Pezzelle1
1 Institute for Logic, Language and Computation (ILLC), University of Amsterdam
2 Amsterdam School of Communication Research (ASCoR), University of Amsterdam
lennertjansen95@gmail.com
{a.j.sinclair|m.j.vandergoot|raquel.fernandez|s.pezzelle}@uva.nl

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This work explores an important dimension of variation in the language used by dialogue participants: their age. While previous work showed differences at various linguistic levels between age groups when experimenting with written discourse data (e.g., blog posts), previous work on dialogue has largely been limited to acoustic information related to voice and prosody. Detecting fine-grained linguistic properties of human dialogues is of crucial importance for developing AI-based conversational systems that are able to adapt to their human interlocutors. We therefore investigate whether, and to what extent, current text-based NLP models can detect such linguistic differences, and what the features driving their predictions are. We show that models achieve fairly good performance on age-group prediction, though the task appears to be more challenging than in discourse. Through an in-depth analysis of the best models' errors and the most predictive cues, we show that, in dialogue, differences among age groups mostly concern stylistic and lexical choices. We believe these findings can inform future work on developing controlled generation models for adaptive conversational systems.

  age 19-29
  A: oh that's cool
  B: different sights and stuff
  A: oh

  age 50+
  A: well quite and I'd have to come back as well
  B: that's of course
  A: and make up for you know

Figure 1: Example dialogue snippets from speakers of different age groups in the British National Corpus. We conjecture that stylistic and lexical differences between age groups can be detected. Here, we experiment at the level of the utterance.

1 Introduction

Research on developing conversational agents has experienced impressive progress, particularly in recent years (McTear, 2020). However, artificial systems that can tune their language to that of a particular individual or group of users continue to pose more of a challenge. Recent examples of this line of research include adaptation at the level of style (Ficler and Goldberg, 2017), persona-specific traits (Zhang et al., 2018), or other traits such as sentiment (Dathathri et al., 2020).

Personalised interaction is of crucial importance to obtain systems that can be trusted by users and perceived as natural (van der Goot and Pilgrim, 2019), but most of all to be accessible to varying user profiles, rather than targeted at one particular user group (Zheng et al., 2019; Zeng et al., 2020).

In this work, we focus on one particular aspect that may influence conversational agent success: user age profile. We investigate whether the linguistic behaviour of conversational participants differs across age groups, using state-of-the-art NLP models on purely textual data, without considering vocal cues. We aim to detect age from characteristics of language use and adapt to this signal, rather than work from ground-truth metadata about user demographics. This is in the interest of preserving privacy, and reflects the perspective that while age and language use may be related, the relationship will not be linear (Pennebaker and Stone, 2003) and is subject to individual differences.
Previous work on age detection in dialogue has focused on speech features, which are known to vary systematically across age groups. For example, Wolters et al. (2009) learn logistic regression age classifiers from a small dialogue dataset using various acoustic cues supplemented with a small set of hand-crafted lexical features, while Li et al. (2013) develop SVM classifiers using acoustic and prosodic features extracted from scripted utterances spoken by participants interacting with an artificial system. In contrast to this line of work, we investigate whether different age groups can be detected from textual linguistic information rather than voice-related cues. We explore whether, and to what extent, various state-of-the-art NLP models are able to capture such differences in dialogue data, as a preliminary step toward age-group adaptation by conversational agents.

We build on the work of Schler et al. (2006), who focus on age detection in written discourse using a corpus of blog posts. The authors learn a Multi-Class Real Winnow classifier leveraging a set of pre-determined style- and content-based features, including part-of-speech categories, function words, and the 1000 unigrams with the highest information gain in the training set. They find that content features (lexical unigrams) yield higher accuracy (74%) than style features (72%), while their best results (76.2%) are obtained with their combination. We extend this investigation in several key ways: (1) we leverage state-of-the-art NLP models that allow us to learn representations end-to-end, without the need to specify concrete features in advance; (2) we apply this approach to dialogue data, using a large-scale dataset of transcribed, spontaneous open-domain dialogues, and also use it to replicate the experiments of Schler et al. (2006) on discourse; (3) we show that text-based models can indeed detect age-related differences, even in the case of very sparse signal at the level of dialogue utterances; and finally (4) we carry out an in-depth analysis of the models' predictions to gain insight into which elements of language use are most informative.1

Our work can be considered a first step toward the modeling of age-related linguistic adaptation by AI conversational systems. In particular, our results can inform future work on controlled text generation for dialogue agents (Dathathri et al., 2020; Madotto et al., 2020).

1 Code and data available at: https://github.com/lennertjansen/detecting-age-in-dialogue

2 Data

We use a dataset of dialogue data where information about the age of the speakers involved in the conversation is available (see the dialogue snippets in Figure 1), i.e., the spoken partition of the British National Corpus (Love et al., 2017). This partition includes spoken informal open-domain conversations between people, collected between 2012 and 2016 via crowd-sourcing and then recorded and transcribed by the corpus creators. Dialogues can be between two or more interlocutors, and are annotated along several dimensions, including age and gender, together with geographic and social indicators. Speaker ages are categorized in ten brackets: 0-10, 11-18, 19-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, and 90-99.

We focus on conversations that took place between two interlocutors, and only consider dialogues between people of the same age group. We then restrict our investigation to a binary opposition: younger vs. older age group.
We split the dialogues into their constituent utterances (e.g., from each dialogue snippet in Figure 1 we extract three utterances), and further pre-process them by removing non-alphabetical characters. Only samples that are not empty after pre-processing are kept. For the younger group, we consider the 19-29 bracket, which contains 138,662 utterances. For the older group, we merge conversations from five brackets: 50-59, 60-69, 70-79, 80-89, and 90-99 (hence, 50+), which sum up to a total of 33,641 utterances. The choice of grouping these brackets is a trade-off between experimenting with fairly distinct age groups (the age difference between them is at least 20 years) and obtaining large-enough data for each of them.

We randomly sample 33,641 utterances from the 19-29 group in order to experiment with a balanced number of samples per group. The resulting dataset, which we use for our experiments, includes around 67K utterances with an average length of 11.7 tokens. Descriptive statistics are given in Table 1.

  age    | #samples | #tokens | mean L (± sd) | min-max L
  19-29  | 33,641   | 381,195 | 11.3 (±15.98) | 1-423
  50+    | 33,641   | 406,157 | 12.1 (±21.62) | 1-1246
  all    | 67,282   | 787,352 | 11.7 (±19.0)  | 1-1246

Table 1: Descriptive statistics of the dataset. L means length, i.e., number of tokens in a sample.
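To make the pre-processing concrete, the following is a minimal sketch of the filtering and balancing steps described above. The lowercasing, the pandas-based bookkeeping, and the column names (utterance, age_group) are our own assumptions for illustration, not details from the released code.

```python
import re
import pandas as pd

def clean_utterance(text: str) -> str:
    """Keep only alphabetical characters and spaces, then collapse whitespace.
    (Lowercasing is our assumption, not stated in the paper.)"""
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical dataframe with one row per utterance and its speaker's age group.
df = pd.DataFrame({
    "utterance": ["oh that's cool!", "well quite, and I'd have to come back as well"],
    "age_group": ["19-29", "50+"],
})

df["utterance"] = df["utterance"].apply(clean_utterance)
df = df[df["utterance"].str.len() > 0]   # drop samples that are empty after cleaning

# Balance the two classes by downsampling the larger group.
n = df["age_group"].value_counts().min()
df = df.groupby("age_group").sample(n=n, random_state=42)
```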
3 Method

We frame the problem as a binary classification task: given some text, we seek to predict whether the age class of its speaker is younger or older.

3.1 Models

We experiment with various models, which we briefly describe below. Details on model training and evaluation are given at the end of the section.

n-gram. Our simplest models are based on n-grams, which have the advantage of being highly interpretable. Each data entry (i.e., a dialogue utterance) is split into all possible contiguous sequences of n tokens. The resulting vectorized features are used by a logistic regression model to estimate the odds of a text sample belonging to a certain age group. We experiment with unigram, bigram, and trigram models: a bigram model uses unigrams and bigrams, and a trigram model unigrams, bigrams, and trigrams.
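For concreteness, here is a minimal sketch of the trigram classifier using scikit-learn; the CountVectorizer defaults are our assumption, while the cumulative n-gram range, the logistic regression classifier, and the L-BFGS solver with up to 10^6 iterations follow the description in this section and under Experimental details below. Note that with only two classes, the One-vs-Rest training mentioned there reduces to plain binary logistic regression.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Trigram model: unigrams, bigrams, and trigrams as features (ngram_range=(1, 3)),
# fed to a logistic regression classifier trained with L-BFGS.
trigram_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),
    LogisticRegression(solver="lbfgs", max_iter=int(1e6)),
)

# Hypothetical toy data; labels are the two age groups.
texts = ["oh that's cool", "well quite and I'd have to come back as well"]
labels = ["19-29", "50+"]
trigram_clf.fit(texts, labels)
print(trigram_clf.predict(["that's so cool"]))
```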
LSTM and BiLSTM. We use a standard Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) with two layers, embedding size 512, and hidden layer size 1024. Batch-wise padding is applied to variable-length sequences. We also use the original model's bidirectional extension, the bidirectional LSTM (BiLSTM) (Schuster and Paliwal, 1997). Padding is similarly applied to this model, and the following optimal architecture is found experimentally: embedding size 64, 2 layers, and hidden layer size 512. Both RNN models are found to perform optimally with a learning rate of 10^-3.
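For illustration, the following is a minimal PyTorch sketch of the BiLSTM variant with the architecture reported above (embedding size 64, 2 layers, hidden size 512). How the final hidden states are pooled into a two-way classification layer is not specified in the paper, so that part is our assumption.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over token embeddings, followed by a linear layer
    mapping the concatenated final hidden states to the two age groups."""

    def __init__(self, vocab_size: int, emb_size: int = 64,
                 hidden_size: int = 512, num_layers: int = 2, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embedding(token_ids)            # (batch, seq_len, emb_size)
        _, (h_n, _) = self.lstm(emb)               # h_n: (num_layers * 2, batch, hidden)
        # Concatenate the last layer's forward and backward hidden states.
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # (batch, 2 * hidden)
        return self.fc(h)                          # unnormalized class logits
```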
BERT. We experiment with a Transformer-based model, i.e., BERT (Devlin et al., 2019). BERT is pre-trained to learn deeply bidirectional language representations from massive amounts of unlabeled textual data. We experiment with the base, uncased version of BERT in two settings: using its pre-trained frozen embeddings (BERTfrozen) and fine-tuning the embeddings on our age classification task (BERTFT). BERT embeddings are followed by dropout with probability 0.1 and a linear layer with input size 768.

Experimental details. The dataset is randomly split into a training (75%), validation (15%), and test (10%) set. Each model with a given configuration of hyperparameters is run 5 times with different random initializations. All models are trained on an NVIDIA TitanRTX GPU.

The n-gram models are trained in a One-vs-Rest (OvR) fashion and optimized using the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Liu and Nocedal, 1989), with a maximum of 10^6 iterations. The n-gram models are trained until convergence or for the maximum number of iterations.

LSTMs and BERT models are optimized using Adam (Kingma and Ba, 2015), and trained for 10 epochs with an early stopping patience of 3 epochs. The RNN-based models' embeddings are trained jointly with the classifier, and optimal hyperparameters (i.e., learning rate, embedding size, hidden layer size, and number of layers) are determined on the validation set via a guided grid-search. BERTFT is fine-tuned for 10 epochs, or until the early stopping criterion, monitored on the validation set, is met. BERT has a maximum input length of 512 tokens; sequences exceeding this length are truncated.
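A sketch of the BERTFT setting is given below, using the Hugging Face Transformers library, which we assume here for illustration; BertForSequenceClassification matches the head described above (dropout with probability 0.1 and a linear layer over the 768-dimensional pooled output). The learning rate is a placeholder, as it is not reported in this section.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# bert-base-uncased with a sequence-classification head: dropout (p=0.1) plus
# a linear layer over the 768-dim pooled output, as described above.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # lr is a placeholder

batch = tokenizer(["oh that's cool", "well quite"], padding=True,
                  truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([0, 1])  # 0 = 19-29, 1 = 50+

outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()
optimizer.step()
```

For the BERTfrozen setting, one would instead freeze the encoder (e.g., setting requires_grad=False on model.bert.parameters()) so that only the classification head is trained.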
4 Results

We report accuracy and F1 for each age group in Table 2. As can be seen, the performance of all models is well beyond chance level, which indicates that age-related linguistic differences can be detected, to some extent, even by a simple model based on unigrams. At the same time, BERT fine-tuned on the task turns out to be the best-performing model in terms of both accuracy (0.729) and F1 scores, which confirms the effectiveness of Transformer-based representations for encoding fine-grained linguistic differences. However, it can be noted that the model based on trigrams is basically on par with BERT in terms of accuracy (0.722), and well above both the LSTM and BiLSTM models (0.693 and 0.691, respectively). A similar pattern is observed for F1 scores, where BERTFT and the trigram model achieve comparable performance, with the LSTMs being overall behind.

  Model       | Accuracy      | F1 (19-29)    | F1 (50+)
  Random      | 0.500         | 0.500         | 0.500
  unigram     | 0.701 (0.007) | 0.708 (0.009) | 0.693 (0.004)
  bigram      | 0.719 (0.002) | 0.724 (0.003) | 0.714 (0.003)
  trigram     | 0.722 (0.001) | 0.727 (0.003) | 0.717 (0.001)
  LSTM        | 0.693 (0.003) | 0.696 (0.005) | 0.691 (0.007)
  BiLSTM      | 0.691 (0.009) | 0.702 (0.017) | 0.679 (0.007)
  BERTfrozen  | 0.675 (0.003) | 0.677 (0.008) | 0.673 (0.010)
  BERTFT      | 0.729 (0.002) | 0.730 (0.011) | 0.727 (0.010)

Table 2: Test set results averaged over 5 random initializations. Format: average metric (standard error).

Overall, our results indicate that text-based models are effective, to some extent, at predicting the age group of a speaker involved in a dialogue. This complements previous evidence that age-related features can be detected in discourse (Schler et al., 2006), and shows that in dialogue the task appears to be somewhat more challenging: the improvement in accuracy with respect to the majority/random baseline is lower in our dialogue results (+22.9%) than what was observed in discourse both by Schler et al. (2006) (+32.4%) and by us (+27%) when replicating their study using the models and experimental setup described in Section 3.1. Similarly to dialogue, BERTFT achieves the highest results in discourse (0.742); in contrast, both LSTMs (0.663) and n-grams (0.625) significantly lag behind it. Note that, although based on the same corpus of texts, i.e., the Blog Authorship Corpus,2 and the same 3 age groups, i.e., 13-17, 23-27, and 33+, our replicated results are not fully comparable to those of Schler et al. (2006): due to our more cautious data pre-processing, we experiment with more samples than they do (677K vs. 511K), which in turn leads to a different majority baseline.

There can be several reasons why age-group detection is more challenging in dialogue than in discourse. For example, in dialogue there may be dimensions of variation, such as turn-taking patterns, that are not captured by our models and experimental setup. Yet, the present results do reveal a few interesting insights. In particular, the very good performance of the trigram model suggests that leveraging 'local' linguistic features captured by n-grams is extremely effective in dialogue. This could indicate that differences among age groups lie at the level of local lexical constructions. This deserves further analysis, which we carry out in the next section.

2 The corpus contains blog posts that appeared on https://www.blogger.com, gathered in or before August 2004.

5 Analysis

We compare the two best-performing models, i.e., BERTFT and the trigram model, and aim to shed light on what cues they use to solve the task. We first compare the prediction patterns of the two models, which allows us to detect easy and hard examples. Second, we focus on the trigram model and report the n-grams that turn out to be most informative for distinguishing between age groups.

5.1 Comparing Model Predictions

We split the data for analysis according to whether both models make the same correct or incorrect prediction, or whether they differ. Table 3 shows the breakdown of these results. As can be seen, a rather large fraction of samples is correctly classified by both models (63.17%), while in 19.78% of cases neither of the models makes a correct prediction. The remaining cases are almost evenly split between those where only one of the two is correct. As shown in Figure 2, the 19-29 age group appears to be slightly easier than the 50+ group, where models make more errors.

  case                  | % cases | avg. length (±std)
  both correct          | 63.17%  | 13.51 (±18.98)
  both wrong            | 19.78%  | 5.82 (±8.33)
  only trigram correct  | 7.91%   | 10.44 (±11.66)
  only BERT correct     | 9.14%   | 11.53 (±12.12)

Table 3: Percentage of (non-)overlapping (in)correctly predicted cases between trigram and BERTFT. Utterance length is measured in tokens.

Figure 2: Distribution of predicted cases by trigram and BERTFT models, split by age group.

To qualitatively inspect what the utterances falling into these classes look like, in Table 4 we show a few cherry-picked cases for each age group. We notice that, not surprisingly, both models have trouble with backchanneling utterances consisting of a single word, such as yeah, mm, or really?, which are used by both age groups. For example, both models seem to consider yeah a 'young' cue, which leads to wrong predictions when yeah is used by a speaker in the 50+ group. As for the utterance really?, BERTFT assigns it to the 50+ group, while the trigram model makes the opposite prediction. This indicates that certain utterances simply do not contain sufficient distinguishing information, and model predictions based on them should therefore not be considered reliable. This seems to be particularly the case for short utterances. Indeed, comparing the average length of the utterances incorrectly classified by both models (rightmost column of Table 3), we notice that they are much shorter than those belonging to the other cases. This is interesting, and indicates a key challenge in the analysis of dialogue data: on average, shorter utterances contain less signal. On the other hand, short utterances can provide rich conversational signal in dialogue, for example backchanneling, exclamations, or other acknowledging acts. As a consequence, using length alone as a filter is not an appropriate approach, as it can remove aspects of language use that are key to differentiating speaker groups.

  age    | both correct                     | both wrong            | only BERTFT correct                  | only trigram correct
  19-29  | I don't know?                    | sounded crazy         | that's a lot of people for one house | yeah okay really?
  19-29  | yeah well there you go           | oh                    | I'm not very good at that            | I've got a pen I've got a pen
  19-29  | do you have exams again?         | mm                    | empty promises isn't it?             | day of death and ice-cream
  50+    | and as I say                     | yeah                  | really?                              | well if I were you
  50+    | yes that would be controversial  | yeah                  | it seems to                          | that's it
  50+    | oh really?                       | he's got that already | that we caused it                    | oh I thought you said Godzilla

Table 4: Examples where both models are correct/wrong or only BERTFT/trigram is correct.

5.2 Most Informative N-grams

Analyzing the most informative n-grams used by the trigram model allows us to qualitatively compare the linguistic differences inherent to each age group. In Table 5 we report the top 15 n-grams per group. We find, firstly and intuitively, that colloquial language seems somewhat generational, with unigrams particularly indicative of younger speakers including words such as cool and massive, and, for older speakers, words like wonderful. These unigrams are both informative to the model and indicative of differences in both formality and 'slang' use across age groups.

These most informative n-grams also indicate differences in backchanneling use between age groups: younger speakers' language is more characterized by the use of um and hmm, while speakers in the older category will more likely use yes, right, and right right. Another feature of younger language apparent from these examples is the use of more informal language, which also extends to foul language, making up a portion of the most informative unigrams shown in Table 5.

Interestingly, while topic words make up many of the most informative n-grams for older speakers in Table 5, younger speakers are more defined by their use of slang words such as wanna, foul language, or adjectives such as cute, cool, and massive. A key finding of Schler et al. (2006) is that the sentiment of language plays an important role, something which some of the most informative n-grams suggest may also be true for the dialogue dataset. As Table 5 demonstrates, younger speakers use more dramatic language, such as negative foul words and the positive love, cute, and cool: all words with a strong connotative meaning. We believe that further inspection is needed to determine whether the same sentiment pattern holds in dialogue as has been reported for discourse.

         19-29                        50+
  coef.  | n-gram             coef.  | n-gram
  -3.20  | um                  2.37  | yes
  -2.84  | cool                2.12  | you know
  -2.58  | s**t                2.09  | wonderful
  -2.12  | hmm                 1.90  | how weird
  -2.09  | like                1.84  | chinese
  -2.02  | was like            1.73  | right
  -1.96  | love                1.71  | building
  -1.96  | as well             1.66  | right right
  -1.88  | as in               1.55  | so erm
  -1.84  | cute                1.43  | mm mm
  -1.82  | uni                 1.41  | cheers
  -1.79  | massive             1.39  | shed
  -1.79  | wanna               1.37  | pain
  -1.79  | f**k                1.36  | we know
  -1.72  | tut                 1.08  | yeah exactly

Table 5: Top 15 most informative n-grams per age group used by the trigram model. coef. is the coefficient (and sign) of the corresponding n-gram in the logistic regression model: the higher its absolute value, the more it shifts the odds of an utterance toward one of the two age groups. * indicates foul language.
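A ranking like the one in Table 5 can be read off a fitted scikit-learn pipeline such as the trigram sketch in Section 3.1; the snippet below assumes that pipeline (trigram_clf) and scikit-learn's default step naming. With the two labels sorted as scikit-learn sorts them, negative coefficients point toward 19-29 and positive ones toward 50+, matching the signs in Table 5.

```python
import numpy as np

# Assumes `trigram_clf` is the fitted CountVectorizer + LogisticRegression
# pipeline from the earlier sketch. Negative coefficients push predictions
# toward the first class (19-29), positive ones toward the second (50+).
vectorizer = trigram_clf.named_steps["countvectorizer"]
logreg = trigram_clf.named_steps["logisticregression"]

features = vectorizer.get_feature_names_out()
coefs = logreg.coef_[0]

order = np.argsort(coefs)
print("Most indicative of 19-29:", [(features[i], coefs[i]) for i in order[:15]])
print("Most indicative of 50+:  ", [(features[i], coefs[i]) for i in order[-15:][::-1]])
```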
6 Conclusion

We investigated whether, and to what extent, NLP models can detect age-related linguistic features in dialogue data. We showed that, in line with what we observed for discourse, state-of-the-art models are capable of doing so with reasonable accuracy, in particular when the dialogue fragment is long enough to contain discriminative signal. At the same time, we found that much simpler models based on n-grams achieve comparable performance, which suggests that, in dialogue, 'local' features can be indicative of the language of speakers from different age groups. We showed this to be the case, with both lexical and stylistic cues being informative to these models in this task.

While we performed the classification task at the level of single dialogue utterances, future work may take into account larger dialogue fragments, such as the entire dialogue or a fixed number of turns. This would make the setup more comparable to discourse, but would require making further experimental choices and dealing with extra computational challenges. Moreover, it could be tested whether the language used by a speaker is equally discriminative when talking to a same-age (this work) or a different-age interlocutor.

Finally, we believe our findings could inform future work on developing adaptive conversational systems. Since consistent language style differences were found between age groups (for example, at the level of exclamatives and acknowledgements), systems whose language generation capabilities aim to be consistent with a given age group should reproduce these patterns. This could be achieved, for example, by embedding one or more discriminative modules that control the generation of a system's output, which could lead to better, more natural interactions between human speakers and a conversational system.
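As a toy illustration of the kind of discriminative module envisioned here (our own sketch, not a method evaluated in this work), a fitted age classifier such as the trigram pipeline above could rerank a generator's candidate responses toward a target age group:

```python
def rerank_by_age_group(candidates: list[str], target_group: str, clf) -> list[str]:
    """Order candidate responses by how strongly a fitted age classifier
    (e.g., the trigram pipeline above) associates them with target_group."""
    idx = list(clf.classes_).index(target_group)
    scores = clf.predict_proba(candidates)[:, idx]
    return [c for _, c in sorted(zip(scores, candidates), reverse=True)]

# Toy usage with the trigram classifier sketched earlier.
candidates = ["that's so cool", "how wonderful, you know"]
print(rerank_by_age_group(candidates, target_group="50+", clf=trigram_clf))
```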
Acknowledgements

This work received funding from the University of Amsterdam's Research Priority Area Human(e) AI and from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 819455).

References

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104, Copenhagen, Denmark, September. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Ming Li, Kyu J. Han, and Shrikanth Narayanan. 2013. Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Computer Speech & Language, 27(1):151–167.

Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528.

Robbie Love, Claire Dembry, Andrew Hardie, Vaclav Brezina, and Tony McEnery. 2017. The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3):319–344.

Andrea Madotto, Etsuko Ishii, Zhaojiang Lin, Sumanth Dathathri, and Pascale Fung. 2020. Plug-and-play conversational models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2422–2433, Online, November. Association for Computational Linguistics.

Michael McTear. 2020. Conversational AI: Dialogue systems, conversational agents, and chatbots. Synthesis Lectures on Human Language Technologies, 13(3):1–251.

James W. Pennebaker and Lori D. Stone. 2003. Words of wisdom: Language use over the life span. Journal of Personality and Social Psychology, 85(2):291–301.

Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker. 2006. Effects of age and gender on blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, volume 6, pages 199–205.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Margot J. van der Goot and Tyler Pilgrim. 2019. Exploring age differences in motivations for and acceptance of chatbot communication in a customer service context. In International Workshop on Chatbot Research and Design, pages 173–186. Springer.

Maria Wolters, Ravichander Vipperla, and Steve Renals. 2009. Age recognition for spoken dialogue systems: Do we need it? In Tenth Annual Conference of the International Speech Communication Association (Interspeech).

Guangtao Zeng, Wenmian Yang, Zeqian Ju, Yue Yang, Sicheng Wang, Ruisi Zhang, Meng Zhou, Jiaqi Zeng, Xiangyu Dong, Ruoyu Zhang, Hongchao Fang, Penghui Zhu, Shu Chen, and Pengtao Xie. 2020. MedDialog: Large-scale medical dialogue datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9241–9250, Online, November. Association for Computational Linguistics.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia, July. Association for Computational Linguistics.

Yinhe Zheng, Guanyi Chen, Minlie Huang, Song Liu, and Xuan Zhu. 2019. Personalized dialogue generation with diversified traits. CoRR, abs/1901.09672.