<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Linguistic and Cognitive Approaches to Dialog Agents Workshop, Nov</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Generation of Interlocutor Profiling Sentences from Utterances and Their Implicit Context</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shinji Muraji</string-name>
          <email>shinjimuraji@ist.hokudai.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafal Rzepka</string-name>
          <email>rzepka@ist.hokudai.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toshihiko Itoh</string-name>
          <email>t-itoh@ist.hokudai.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hokkaido University</institution>
          ,
          <addr-line>Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido, 060-0814</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>19</volume>
      <issue>2024</issue>
      <fpage>45</fpage>
      <lpage>55</lpage>
      <abstract>
        <p>Recent studies report that chat quality improves when LLM-based response generation uses information about the persona of the chat interlocutor and about the personality the system itself mimics. In these studies, personality data is stored as statements describing the interlocutor's profile. To control an LLM with such profile-describing sentences in actual chats, preparing them in advance is not enough; they must also be added actively as the chat proceeds. However, little research has addressed generating profiling sentences from the utterances in chat dialogs. In particular, no studies address profiling sentences that can only be generated with the help of the context. Therefore, in this study, we propose a task of generating profiling sentences from target utterances based on the context of a Japanese chat dialog, and we create a dataset for this task. Experiments on the created dataset and subsequent analysis show that an LLM can generate profiling sentences while taking the context into account.</p>
      </abstract>
      <kwd-group>
        <kwd>Chat system</kwd>
        <kwd>Profiling sentence generation</kwd>
        <kwd>Persona modelling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, the performance of large language models (LLMs) has dramatically improved, leading
to significant advancements in dialog systems. However, in the case of chat, dialog systems based solely on
language models suffer from limitations: they cannot remember users in the long term, and their
utterances are inconsistent when simulating a personality. Related research [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] has shown that the quality of
dialog in a chat system can be improved by maintaining personality information, such as what kind
of person one is talking to and what kind of personality one imitates, separately from the dialog history,
and by utilizing this information for response generation. In studies on long-term memory, maintaining
personality information separately from the dialog history is useful for remembering the personality of
the user, enabling appropriate utterances tailored to that interlocutor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In the study of
utterance consistency, it is useful to learn the relationship between a personality and an utterance and
to map a candidate utterance to a coherent utterance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Many studies [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ] have employed personality
information in the form of a list of natural language sentences (hereafter referred to as profiling sentences) and have
shown its effectiveness.
      </p>
      <p>
        In order to utilize profiling sentences for response generation in actual conversations, it is essential
to prepare profiling sentences that express the personality of the interlocutor (hereinafter referred to as
user profiling sentences) and profiling sentences that represent the personality that the system should
assume as its own (hereinafter referred to as system profiling sentences), both of which are
collected during previous dialogs. Previous studies have either prepared profiling sentences in advance
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or substituted sentences extracted by rules [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for profiling sentences. Although user profiling from
utterances is important for both dialog consistency and user adaptation, research on this topic is rather
scarce.
      </p>
      <p>
        In previous research on improving the consistency of the system’s own dialog by using system
profiling sentences, about five profiling sentences are prepared as a part of the personalities commonly used
in chats before starting a chat and are added to the input for dialog generation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, methods
that use only the pre-defined profiling sentences as the system profile do not control personalities
beyond these sentences, and thus cannot maintain consistency over a long period of time. It is not
realistic to prepare in advance information about all aspects of a person’s personality as system profiling
sentences to maintain consistency; hence the system needs to make sure that the utterance it is about
to generate is consistent with past utterances.
      </p>
      <p>CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073</p>
      <p>In addition, it is desirable for the system to be able to recognize user profiling sentences from
utterances during chats, since it would be a great burden to have the users themselves prepare their
own profiling sentences in advance.</p>
      <p>Thus, both long-term memory and utterance coherence research require the recognition of
individuality from utterances. We address this issue by generating profiling sentences from utterances as a task
of personality recognition.</p>
      <p>We propose a task to generate profiling sentences with LLMs that take into account the context
of chats, and construct a dataset by LLM generation and human consistency checks. Table 1 shows
examples of sentences inferred by humans from utterances and their implicit context. In addition, we
propose a method to train an LLM on that dataset to generate profiling sentences, and report on the
automatic and manual evaluation of the generated results. Furthermore, we confirm the importance
of considering the context of the chat by manually checking whether the dataset itself and the output of the
trained profiling sentence generation model are context-sensitive.</p>
      <p>The contributions of this paper are as follows:
• We propose a task of generating profiling sentences that considers the implicit chat context.
• We propose a method for training an LLM to generate profiling sentences.
• We identify differences between automatic and manual evaluation of profiling sentence generation.
• We confirm that it is difficult for LLM-based profiling sentence generation to take the chat context into account.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Currently, there are three main methods for linking utterances to profiling sentences, but the existing
methods are not sufficient for the task of recognizing profiling sentences in chit-chat utterances.</p>
      <p>
        First, there is an approach that uses classification to recognize whether an utterance corresponds to a
profiling sentence. For example, some studies [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] treat the task of determining whether an utterance
contradicts a profiling sentence as a classification problem for implication relations. Other studies
[
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ] focus on preparing several profile sets, each consisting of about five sentences, and then selecting
the profile set that corresponds to the utterance. However, these methods are difficult to use in practice:
they classify the relationship between utterances and profiling sentences prepared in advance, so
applying them to actual chats, where any topic may appear, would require preparing all kinds of
profiling sentences in advance.
      </p>
      <p>
        Second, there is an approach to extract profiling sentences from dialog by rules. For instance, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
searched Reddit comments for sentences that fit several rules, such as “contains either the word I or
my,” and used them as a substitute for the commenter’s profiling sentences. Although such a method for
extracting literal profiling sentences is effective to a certain extent, it may miss profiling sentences that require
contextual processing when applied to chat dialogs. (For example, “A is a college student.” in Table 1
cannot be extracted from the target utterance alone.) In addition, the creation of rules is labor-intensive, and
the same rules may not be applicable to some languages. In particular, the subject I is often omitted in
Japanese utterances, which are the target of this study, and many profiling sentences would be overlooked
if this rule were applied.
      </p>
      <p>
        Third, there is a method to generate profiling sentences from dialogs using an LLM. However, existing
research [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] ignores the context of the previous chats and does not perform human evaluation of the
generated profiling sentences.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Profile generation task</title>
      <p>During a chat, humans recognize and update information about the personality of the person they are
talking to each time they receive the other person’s utterance. Regarding the recognition part, humans
can to some extent write down the information of the other party as short sentences for each utterance.
It is important to note that the profiling sentences written out by the human are not extracted solely
from the target utterance, as shown in Table 1. Our goal is to enable the system to extract and record
the interlocutor’s information as profiling sentences from the utterance, similarly to human recognition.
As in previous studies, we do not define the exact nature of a profiling sentence. Instead, we treat all
sentences that describe personality or personal facts as profiling sentences, consistent with their usage
and validation in prior research. It would be preferable to generate human-like profiles in the system,
but the annotation becomes very complex when trying to rigorously create such tagged dialog data.
Therefore, this paper focuses on demonstrating the importance of using context when generating profiling sentences with
an LLM. In this section, we summarize the differences between the ideal task we originally
wanted to perform, the conventional task, and the proposed task.</p>
      <sec id="sec-3-1">
        <title>3.1. Problems of previous studies</title>
        <p>First, we describe the ideal task. Let $\Delta p_i$ be one of the interlocutor’s profiling
sentences that can be inferred from the $n$-th utterance $u_n$ of the dialogue, and let $\Delta P_n = \{\Delta p_1, \Delta p_2, \Delta p_3, \ldots\}$ be
all the profiling sentences inferred from utterance $u_n$. In addition, let $P_n = \{p_1, p_2, p_3, \ldots\}$ be all of the interlocutor’s profiling
sentences collected from the previous dialogs, updated by $\Delta P_n$ after receiving $u_n$.</p>
        <p>Note that some profiling sentences, such as the place of residence or hobbies, are updated over time, so
$P_n$ is not simply the union of $\Delta P_k$ from $k = 1$ to $k = n$.</p>
        <p>Let $C_n = \{c_1, c_2, c_3, \ldots\}$ be the context, other than the target utterance, necessary to generate the
information about another person known from the utterance. Then the profiling sentences generated
by a model $f_h$ with (theoretically) human-like capabilities can be written as $\Delta P_n = f_h(u_n, C_n)$. The
ideal profiling sentence generation is to obtain $f_h$ and generate human-like profiling sentences.
However, $C_n$ can contain a variety of elements, which is problematic when creating a dataset. When
a human predicts an interlocutor’s profiling sentences from an utterance, $C_n$ can be assumed to be
the already known profile of the dialog interlocutor, the dialog history, or any other concept that is shared
with the interlocutor. The elements of $C_n$ in this paper are the interlocutor’s profiling sentences $P_{n-1}$ held
before receiving the utterance and the history of utterances $H_n = \{u_1, u_2, u_3, \ldots, u_{n-1}\}$. When starting to talk,
$H_1$, $C_1$, and $P_0$ are empty sets. In this study, we do not update the profile, but for the sake of explanation,
we denote the profile update model as $f_u$. The formula for the profile update is $P_n = f_u(\Delta P_n, P_{n-1})$. In
summary, ideally the profiling sentences should be extracted and updated as follows:
• Extraction: $\Delta P_n = f(u_n, C_n) = f(u_n, P_{n-1}, H_n)$
• Update: $P_n = f_u(\Delta P_n, P_{n-1})$</p>
        <p>It is most desirable to create a dataset by storing $P_n$ and $\Delta P_n$ for each $u_n$, but in reality, problems such as
the burden on annotators and the annotation cost arise. The longer the conversation is, the more profiling sentences
$P_n$ are collected for each chat, and it is not easy to annotate each utterance while keeping track of all of the
profiling sentences.</p>
        <p>
          On the other hand, the task setting of the profiling sentence generation in the previous study [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] can
be expressed as $\Delta P_n = f(u_n)$, which is the ideal task minus the context information $C_n$. However, there
is also persona information for which contextual information is essential. An example that requires
contextual information is shown in Table 1. In a previous study, [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] analyzed how much contextual
information is needed and found that humans use context for about 15% of profiling sentences when
inferring them. Therefore, in a task setting that excludes contextual information, even
humans have an upper limit of 85% recall, which means that roughly one out of every seven profiling sentences
cannot be discovered in the first place.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task to generate profiling sentences from context and utterance</title>
        <p>We propose a task to generate profiling sentences from a target utterance while considering the two
previous utterances as context. Here, the profile $P_{n-1}$ is an important element of the context, but we ignore
it in this study. In our task setting, it is crucial to effectively utilize the two preceding utterances as
context $C_n$ for generation. In other words, with the utterance history $H_n = \{u_{n-1}, u_{n-2}\}$, our proposed
task can be expressed as follows.</p>
        <p>$\Delta P_n = f(u_n, C_n) = f(u_n, H_n) = f(u_n, u_{n-1}, u_{n-2})$</p>
        <p>This task aims to generate zero or more profiling sentences $\Delta P_n$ from the target utterance
$u_n$ using the contextual information $u_{n-1}, u_{n-2}$. Table 1 shows examples of utterances with 4 profiling
sentences, and Table 2 shows examples of utterances with 0 and 1 profiling sentences. In the previous
study, utterances that humans inferred to be without profiling sentences were not included in the
target utterances. However, in this study, we included them to simulate a more realistic chatting
scenario. Since humans can also use the implicit contextual information in $u_{n-1}$ and $u_{n-2}$ to predict profiling
sentences from target utterances $u_n$, the results are compared with those of humans performing
$\Delta P_n^{\mathrm{human}} = f_{\mathrm{human}}(u_n, u_{n-1}, u_{n-2})$ on the same task.</p>
        <p>We create a dataset for this task and provide a benchmark. We also compare and analyze $\Delta P_n$ and
$\Delta P_n^{\mathrm{human}}$ in the proposed task to confirm the importance of context in profiling sentence generation.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Data collection</title>
      <p>In this section, we first describe the chat data we use to build our dataset, and then explain how we
create profiling sentences from the chat data and ensure their quality.</p>
      <sec id="sec-4-1">
        <title>4.1. Original chat data</title>
        <p>
          We extend an existing chat dataset by creating a dataset that links target utterances with
context-sensitive profiling sentences. The original dataset for the extension is the JPersonaChat dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
This dataset is a Japanese version of the PersonaChat dataset [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. It is a collection of
dialogues between crowdworkers who were each given five profiling sentences in advance and who
play their assigned roles. In order to play their assigned roles, crowdworkers tend to make statements
related to those roles. We therefore chose this dataset because we considered it suitable for linking
utterances to profiles. It is important to note that our goal is to predict the profiling sentences that
can be inferred from an utterance, and the profiling sentences given in advance are not necessarily the
ones that will be recognized from the utterance. While playing the role, a crowdworker
may add to the utterance profile information that was not given in advance. There are also cases
where the profiling sentences given in advance are not used in the target utterance. Therefore, we do
not use the profiling sentences for the roles given to the crowdworkers in the original dataset. We use
only the dialogues as the chat dataset.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Interlocutor profiling sentence creation</title>
        <p>
          As described in the task setup section, we add the two most recent utterances in the utterance history,
including the interlocutor’s utterance, to a single target utterance as contextual information. We use
a subset of the original dataset divided by dialogue: 500 dialogs, containing 5,448 utterances in total,
randomly selected from the JPersonaChat dataset. All utterances in
the obtained subset are considered as target utterances, and each target utterance and the two
utterances immediately preceding it are considered as one case of data. To choose the context length, one of the authors
inferred profiling sentences for 100 target utterances randomly
selected from the original chat data and counted the number of utterances required; only one target utterance required
three or more utterances of context, so we set the number of utterances used for context to two. If the
target utterance is within two utterances from the start of the conversation, the entire dialogue history
is added. We want to obtain the corresponding profiling sentences for each target utterance. However, it
takes a lot of effort to obtain comprehensive and accurate profiling sentences entirely by hand. A method
in which a single annotator infers the profiling sentences for a target utterance is unreliable. A
method in which multiple people check the profiling sentences inferred by one person is considered
to be more accurate, but it lacks comprehensiveness because it misses profiling sentences that were
not recognized by the annotator who made the inference. Ideally, the profiling sentences written by
several people should be merged, and the merged profiling sentences should be checked by several
people, but this is very costly. Therefore, we use an LLM to create the profiling sentence dataset to reduce
the costs. This method of generating data using a language model and manually checking it has often
been used in recent years [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and although LLMs have limited capabilities, it is an efficient way
to create data. The profiling sentences are intended for use in LLM response control and profile updating,
which are subsequent processes; thus maintaining their accuracy is crucial. Therefore, the LLM is used to
write out as many different profiling sentences as possible, and human verification is used to create a
dataset with guaranteed accuracy.
        </p>
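        <p>The case construction described above can be sketched as follows. This is a minimal illustration, not the authors’ code: the names Case and build_cases and the sample dialog are hypothetical, and a dialog is assumed to be a plain list of utterance strings in order.</p>
        <preformat><![CDATA[
```python
from typing import List, NamedTuple


class Case(NamedTuple):
    """One data case: a target utterance plus up to two preceding utterances."""
    context: List[str]  # preceding utterances, oldest first (may hold fewer than two)
    target: str


def build_cases(dialog: List[str], window: int = 2) -> List[Case]:
    """Treat every utterance as a target and attach the preceding `window` utterances."""
    cases = []
    for n, target in enumerate(dialog):
        # Near the start of the dialog, the whole available history is used.
        start = max(0, n - window)
        cases.append(Case(context=dialog[start:n], target=target))
    return cases


# Hypothetical sample dialog, for illustration only.
dialog = [
    "Hello!",
    "Hi, how was school today?",
    "Tiring, I had three lectures.",
    "That sounds tough.",
]
for case in build_cases(dialog):
    print(case.context, "->", case.target)
```
]]></preformat>
        <p>Keeping the window size as a parameter makes it easy to repeat the context-length check with three or more preceding utterances.</p>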
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Generate candidate profiling sentences using LLM</title>
        <p>This section describes the process of writing profiling sentences from utterances using an LLM. We use
gpt-4-0613 as the LLM when creating the dataset. As a preliminary experiment, we asked the LLM to infer
profiling sentences using several prompts, and the results with instructions only (zero-shot) were more
comprehensive than the results generated with examples of correct answers (few-shot). Therefore,
we use a zero-shot setup for LLM inference. The LLM is given a prompt that is a concatenation of
the instruction, the two prior utterances from the dialog history, and the target utterance, and is asked to
generate as many profiling sentences as possible for every target utterance. Some
target utterances have no profiling sentences; in that case, the output is “none”.
As noted in the task description (Section 3.2), the target utterances for this study include those from which
humans do not infer any profiling information. However, adding target utterances without profiling
sentences to the dataset means that the linkage between utterances and profiling sentences is no longer one-to-many,
rendering some existing automatic evaluation methods unusable. For a more accurate evaluation, the
dataset is checked manually for validity. The human evaluation also compares the results with human
inferences $\Delta P_n^{\mathrm{human}} = f_{\mathrm{human}}(u_n, u_{n-1}, u_{n-2})$ on a small number of utterances (100).</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Evaluation of profiling sentence candidates</title>
        <p>Next, to ensure the accuracy of the profiling sentences generated by the LLM, the correspondence
between the target utterance and the profiling sentences is manually checked. This annotation work
is done by crowdworkers. In order to check whether each profiling sentence can be inferred from
the target utterance, we first pair each “target utterance, two prior utterances” (hereafter referred to
as “utterance group”) with a single “profiling sentence” to achieve one-to-one alignment. Annotators are
assigned to evaluate these pairs of utterance groups and profiling sentences. Whether the information in a profiling
sentence is derived from the target utterance is assessed by three raters using three categories: correct,
possibly correct, and incorrect. An example of an annotator’s decision is shown in Table 3. A majority
vote is used as the final decision, and when all three workers disagree, the intermediate
label, possibly correct, is adopted. Note that our goal is to generate profiling sentences derived from
the target utterance; thus any profiling sentences that are mentioned only in the utterance history
and are unrelated to the target utterance are judged as incorrect. Profiling sentences with incorrect
Japanese or profiling sentences that can be applied to any utterance (e.g., “I can speak the language”)
are also judged to be incorrect. The total number of profiling sentences inferred by the LLM is 16,971,
of which 9,475 (55.83%) are judged correct, 1,763 (10.39%) are judged possibly correct, and 5,733 (33.78%)
are judged incorrect. Data annotated as incorrect are not used in this experiment. The inter-annotator
agreement (three-way average of weighted kappa coefficients) is 0.557, indicating moderate agreement.
This suggests that there are some individual differences in what is perceived as a profiling sentence.
After the annotation is completed, the dataset is created by regrouping the correct and possibly correct
pairs of “utterance groups” and “profiling sentences” by target utterance. When processing
the data for each target utterance, the profiling sentences are left blank for target utterances that do not
have any profiling sentences associated with them, as shown in Table 2. The dataset we created is
publicly available at https://github.com/shingetsu-ak/generation-of-interlocutor-profiling.</p>
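        <p>The label aggregation rule described above (majority vote among three raters, with a three-way split mapped to the intermediate label) can be sketched as follows. The function name final_label and the label strings are assumptions made for illustration; the annotation tooling actually used is not described in the text.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Sequence

LABELS = ("correct", "possibly correct", "incorrect")


def final_label(votes: Sequence[str]) -> str:
    """Majority vote over three ratings; a three-way split becomes 'possibly correct'."""
    if len(votes) != 3 or any(v not in LABELS for v in votes):
        raise ValueError("expected exactly three votes drawn from LABELS")
    label, count = Counter(votes).most_common(1)[0]
    # With three votes, a count of 1 for the top label means every rater disagreed.
    return label if count >= 2 else "possibly correct"


print(final_label(["correct", "correct", "incorrect"]))
print(final_label(["correct", "possibly correct", "incorrect"]))
```
]]></preformat>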
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Context-sensitive profiling sentence generation experiment</title>
      <p>We train models and conduct experiments to validate the datasets we create. In this section, we describe
the details of model construction and training, evaluation method, experimental results, and an analysis
of contextual influences.</p>
      <sec id="sec-5-1">
        <title>5.1. Experimental settings</title>
        <p>For training, the dataset is randomly shuffled by target utterance and divided into training,
validation, and evaluation subsets at a ratio of 8:1:1 by the number of target utterances. Although
a target utterance may appear in the utterance history of another target utterance, our task is to map
target utterances to profiling sentences, so we split the data in this way on the view that
training, validation, and evaluation should each be performed on unique target utterances.</p>
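        <p>A split of this kind can be sketched as below. The function name split_8_1_1, the fixed seed, and the choice to give any remainder to the evaluation subset are assumptions for illustration; the text only states the 8:1:1 ratio over target utterances.</p>
        <preformat><![CDATA[
```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_8_1_1(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle target-utterance cases and split them 8:1:1 into train/valid/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to the evaluation subset
    return train, valid, test


train, valid, test = split_8_1_1(range(5448))
print(len(train), len(valid), len(test))
```
]]></preformat>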
        <p>
          Although the previous study [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] on profiling sentences trains and evaluates on concatenated
profiling sentences, simple concatenation has several possible problems.
First, a language model predicts tokens causally and sequentially, so when predicting
multiple profiling sentences, later predictions may be influenced by earlier ones. When multiple
profiling sentences are obtained from a single utterance, it is unclear whether one profiling sentence
should be generated using another profiling sentence as additional input alongside the utterance. Therefore, we
propose an alternative to training on simple concatenations: training on utterance groups and profiling
sentences on a one-to-one basis. For our experiments, we created models for both of these profiling
sentence generators and compared their performance.
        </p>
        <sec id="sec-5-1-1">
          <title>5.1.1. Model training details</title>
          <p>
            In both the method for training simple concatenations (concat) and the method for training utterances
and profiling sentences on a one-to-one basis (profile-wise), profiling sentences are generated using
a Transformer-based decoder, following the approach in the previous study [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], and causal language
modeling (CLM) as the objective function. Specifically, the model is trained by LoRA fine-tuning of
cyberagent/open-calm-7b, a Japanese open-source LLM. LoRA fine-tuning uses the target utterance
group as input and profiling sentences as output. No instructions are used. As hyperparameters, the learning
rate is set to 5e-5, the LoRA rank r to 32, the weight decay to 0.01, and the number of warm-up epochs
to 1; AdamW is used as the optimizer, and the batch size is set to 8. The model with the lowest loss out
of 10 epochs is used as the best model for each method. When testing the model, top-p sampling is
employed, with p = 0.95. As in Ribeiro et al., we do not compute the loss during training for utterance
groups, but only for the generation of profiling sentences.
          </p>
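          <p>Restricting the loss to the profiling sentence tokens is commonly implemented by masking the prompt positions in the label sequence. The sketch below assumes that convention and uses made-up token ids; the helper name build_example is hypothetical, and the value -100 is the default ignore_index of PyTorch’s CrossEntropyLoss rather than something stated in the text.</p>
          <preformat><![CDATA[
```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(prompt_ids: List[int], target_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate utterance-group and profiling-sentence token ids.

    Label positions that belong to the utterance group are masked, so only the
    profiling sentence tokens contribute to the causal language modeling loss.
    """
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


# Hypothetical token ids, for illustration only.
input_ids, labels = build_example([11, 12, 13, 14], [21, 22, 2])
print(input_ids)
print(labels)
```
]]></preformat>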
          <p>
            We propose a one-to-one training method that learns a profiling sentence from a target utterance,
but with this method the model can only generate one profiling sentence per inference. Therefore,
in order to obtain multiple profiling sentences for one utterance, it is necessary to let the model infer
multiple times and remove identical profiling sentences from the generated results. To ensure
that the diversity of the learned profiling sentences is reflected in the generated results,
profiling sentences are generated 10 times per utterance group. Since many of the 10 generated profiling sentences are semantically similar,
their semantic similarity is measured by sentenceBERT [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], and pairs exceeding the threshold value
(experimentally set to 0.8 in this case) are treated as duplicates and merged. If the model
outputs nothing at least once out of the 10 times, the absence of a profiling sentence is given priority and
the other outputs are discarded.
          </p>
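          <p>The merging of repeated generations can be sketched as follows. To stay self-contained, the sketch swaps in a character-bigram Jaccard similarity as a stand-in for the sentenceBERT similarity used in the paper; the function names and sample sentences are hypothetical, and an empty string is assumed to represent an output with no profiling sentence.</p>
          <preformat><![CDATA[
```python
from typing import Callable, List


def bigram_jaccard(a: str, b: str) -> float:
    """Stand-in similarity (character-bigram Jaccard), not sentenceBERT."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def merge_samples(
    samples: List[str],
    similarity: Callable[[str, str], float] = bigram_jaccard,
    threshold: float = 0.8,
) -> List[str]:
    """Merge repeated generations for one utterance group into profiling sentences."""
    # If any sample is empty, "no profiling sentence" takes priority.
    if any(s.strip() == "" for s in samples):
        return []
    kept: List[str] = []
    for s in samples:
        # Keep s only if it does not exceed the threshold against any kept sentence.
        if all(similarity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept


print(merge_samples(["A is a student.", "A is a student.", "A likes cats."]))
print(merge_samples(["A is a student.", "", "A likes cats."]))
```
]]></preformat>
          <p>Passing the similarity function as a parameter lets an embedding-based cosine similarity replace the stand-in without changing the merging logic.</p>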
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Generative models compared</title>
          <p>We compare models that learn profiling sentences concatenated together with models that learn utterance
groups and profiling sentences one-to-one. Profiling sentences that are confusing to humans may
confuse the model during training; thus we created a model that uses the profiling sentences that the
crowdworkers judged to be “possibly correct” during dataset construction and a model that does not
use those profiling sentences for training, and compared them. As a result, there are four models to
compare: concat (correct), concat (correct+possibly correct), profile-wise (correct), and profile-wise
(correct+possibly correct).</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Metrics</title>
        <p>
          We use not only the same automatic utterance-level metrics as in previous studies, but also human
ratings of the profiling sentences. In previous studies [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], profiling sentences generated for each
utterance were concatenated for automatic evaluation, but our interest is in the degree to which each
generated profiling sentence corresponds to the target utterance. Evaluating simple concatenations
would give an utterance that produces only one short profiling sentence the same weight as one
that produces multiple longer profiling sentences. This means that the more profiling sentences an
utterance generates, the less each individual profiling sentence contributes to the score, so profiling
sentences cannot be evaluated equally. Therefore, in addition to the evaluation of each target utterance
as in previous studies, we also conduct a human evaluation of each profiling sentence to see whether the
two types of evaluation differ.
        </p>
        <p>
          For automatic evaluation of each utterance, a test set of the created dataset is used. BLEU [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], ROUGE
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and BERTScore [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] are used as automatic evaluation indicators. However, since these evaluation
metrics cannot be applied to target utterances without profiling sentences, only target utterances with
profiling sentences are considered.
        </p>
        <p>For the human evaluation at the profiling sentence level, we use 100 target utterances randomly
selected from the test set of the created dataset. Here, a target utterance without a profiling
sentence is also evaluated as a single data point, for which generating no profiling sentence is
the correct answer. Each generated profiling sentence is paired with its target utterance and the
two preceding utterances, and an annotator manually judges it as correct or incorrect. Since our goal is
to generate all the profiling sentences that a human would infer from the target utterance, we count
possibly correct profiling sentences as correct and make a binary decision. Three
annotators are hired for the human evaluation, and the decision is made by majority vote. We also
create a set of profiling sentences Δ _ , all of which are manually extracted for the 100 test utterances used
in the manual evaluation. For Δ _ , three annotators are asked to write out profiling sentences, and
one of the authors checks for duplicates and removes them.</p>
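        <p>The majority-vote decision over three binary judgments can be sketched as follows. The function name majority_vote and the example votes are hypothetical; the sketch assumes that a possibly correct judgment has already been mapped to correct before voting.</p>
        <preformat>
```python
from collections import Counter
from typing import List


def majority_vote(judgments: List[bool]) -> bool:
    """Return the label chosen by most annotators (odd panel size assumed)."""
    if len(judgments) % 2 == 0:
        raise ValueError("an odd number of annotators is needed to avoid ties")
    counts = Counter(judgments)
    return counts[True] > counts[False]


# Each inner list holds three binary judgments for one generated profiling
# sentence; "possibly correct" has already been folded into True.
votes = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
]
decisions = [majority_vote(v) for v in votes]
print(decisions)  # [True, False, True]
```
        </preformat>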
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Experimental results</title>
        <p>The experimental results of the automatic evaluation are shown in Table 4. Comparing the combined
method with our one-to-one training method, the combined method scored higher overall in the automatic
evaluation. This indicates that training on and outputting the combined profiling sentences is advantageous
when the automatic evaluation is performed against the combined correct sentences. Next, when comparing the model
trained on only correct answers with the model trained on both correct and possibly correct answers,
the model trained solely on correct answers obtained a lower score. This may be because the reference
sentences treated as the correct answer in the automatic evaluation included possibly correct
profiling sentences, which likely favored the model with a more diverse
output.</p>
        <p>The experimental results of the human evaluation are shown in Table 5. Here, LLM+GOLD is the
result for the dataset itself. The total number of profiling sentences in Δ _ is used to calculate the recall and F1 scores.
Overall, the results differed from those of the automatic evaluation. In particular, profile-wise
(correct), which had the lowest score in the automatic evaluation, achieved the highest F1 score in the
human evaluation. This suggests that automatic evaluation at the utterance level correlates only weakly with
human evaluation at the profiling sentence level.</p>
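        <p>One plausible way to compute sentence-level precision, recall, and F1 from the majority-vote decisions is sketched below. It assumes that precision is taken over all generated profiling sentences and that recall is taken over the size of the manually extracted set; the function name precision_recall_f1 and the counts in the example are hypothetical and are not results from Table 5.</p>
        <preformat>
```python
from typing import Tuple


def precision_recall_f1(
    num_correct: int, num_generated: int, num_reference: int
) -> Tuple[float, float, float]:
    """Sentence-level precision, recall, and F1.

    num_correct   -- generated sentences judged correct by majority vote
    num_generated -- all sentences the model generated
    num_reference -- size of the manually extracted reference set
    """
    precision = num_correct / num_generated if num_generated else 0.0
    recall = num_correct / num_reference if num_reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts for illustration only; they are not results from Table 5.
p, r, f1 = precision_recall_f1(num_correct=3, num_generated=4, num_reference=6)
print(round(p, 3), round(r, 3), round(f1, 3))
```
        </preformat>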
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Context influence analysis</title>
        <p>The profiling sentences generated for the human evaluation are further analyzed to check whether the
model trained on the created dataset is able to generate profiling sentences while being influenced by
the context. First, for comparison, we manually count the profiling sentences that could not be predicted
without context in Δ _ , the set of profiling sentences that are all extracted manually. The results
show that 20.92% (32/153) of these profiling sentences required context. Thus, a
non-negligible share of the profiling sentences in the test set require context. Next, we analyze the profiling
sentences included in the dataset we have created. The results show that 3.40% (5/147) of the profiling
sentences referred to the context. This shows that the proportion of contextual references in LLM
profile writing is much lower than that in human profile writing. This is a limitation of using LLMs to
extend the dataset, but it could be mitigated as the capabilities of LLMs increase. The percentage of
context-referencing profiling sentences generated by each trained model is as follows:
• concat (correct): 5.06% (4/79)
• concat (correct + possibly correct): 7.14% (6/84)
• profile-wise (correct): 3.95% (3/76)
• profile-wise (correct + possibly correct): 7.14% (6/84)
We see that the profiling sentence generation models generate profiling sentences that refer
to the context at about the same rate as the dataset, although less often than the human-written profiling sentences.
Comparing the models trained on only the correct answers with the models trained on both the correct
and the possibly correct answers, we find that the latter refer to the context more often. On the other hand, as can
be seen from Table 5, these models have lower prediction accuracy; achieving both is an issue to be
solved in the future.</p>
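        <p>The context-reference rates reported in this section follow directly from the counts given in parentheses. The short check below recomputes them; the dictionary keys are descriptive labels chosen here for readability, while the counts are those stated in the text.</p>
        <preformat>
```python
# (context-referencing sentences, total sentences) as reported in Section 5.4.
counts = {
    "manual reference set": (32, 153),
    "dataset (LLM-written)": (5, 147),
    "concat (correct)": (4, 79),
    "concat (correct + possibly correct)": (6, 84),
    "profile-wise (correct)": (3, 76),
    "profile-wise (correct + possibly correct)": (6, 84),
}

# Percentage of profiling sentences that refer to the context.
rates = {name: 100 * hits / total for name, (hits, total) in counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")
```
        </preformat>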
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and future work</title>
      <p>In this study, we proposed a task to generate profiling sentences from target utterances by adding two
utterances as context, and created and evaluated a dataset linking profiling sentences that can be inferred
from a single utterance. For the generation of profiling sentences, we proposed a one-to-one training
method in which profiling sentences are learned from target utterances, in addition to the conventional
method of training profiling sentences that can be inferred from a single utterance by concatenating
them. For the evaluation, automatic evaluation at the utterance level and human evaluation at the
profiling sentence level were performed. The experimental results show a distinct discrepancy between
automatic and human evaluations of inferred profiling sentences. We also confirmed that the proposed
training method is as effective as, or even more effective than, existing approaches. Analysis of the
generated profiling sentences confirmed that the profiling sentence generation model is able to generate
implicit profiling sentences at approximately the same rate as the dataset. In the future, we plan to
work on creating a dataset that includes more contextually referenced profiling sentences. We also
intend to analyze the relationship between profiling sentences for each interlocutor by converting the
utterances in the created dataset into dialogue-by-dialogue data.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations</title>
      <p>We conducted our experiments in Japanese, so further verification is needed because the results may
change if the experiments are conducted in English or any other language. This paper has shown the
differences between the automatic evaluation methods used in existing research and human evaluation,
but human evaluation is very costly. It is therefore important to study context-based one-to-many automatic evaluation
metrics. Although this research focuses only on the chat domain, the method itself may be applicable
to other domains such as opinion extraction. Experiments in other domains are desirable in the future.
Our research is based on dialogue data created by crowdworkers who pretended to be non-existent
people. Therefore, our dataset does not include profiles of real people. However, when considering
actual applications, collecting profiles of real interlocutors may be a problem from the perspective
of privacy. We do not recommend collecting user information without the users' permission. To ensure
smooth and reliable interactions between users and systems, interlocutor profiling is essential, and this
research focuses on achieving that goal.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This research was supported by JST Next Generation Challenging Researchers Program JPMJSP2119
and JST CREST Grant Number JPMJCR20D2, Japan.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , E. Dinan,
          <string-name>
            <given-names>J.</given-names>
            <surname>Urbanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Szlam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Personalizing dialogue agents: I have a dog, do you have pets too?</article-title>
          , in:
          <source>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics
          , Melbourne, Australia,
          <year>2018</year>
          , pp.
          <fpage>2204</fpage>
          -
          <lpage>2213</lpage>
          . URL: https://aclanthology.org/P18-1205. doi:10.18653/v1/P18-1205.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Szlam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Beyond goldfish memory: Long-term open-domain conversation</article-title>
          ,
          in:
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>5180</fpage>
          -
          <lpage>5197</lpage>
          . URL: https://aclanthology.org/2022.acl-long.356. doi:10.18653/v1/2022.acl-long.356.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , T. Liu,
          <article-title>BoB: BERT over BERT for training persona-based dialogue models from limited personalized data</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics
          , Online,
          <year>2021</year>
          , pp.
          <fpage>167</fpage>
          -
          <lpage>177</lpage>
          . URL: https://aclanthology.org/2021.acl-long.14. doi:10.18653/v1/2021.acl-long.14.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          , L. Coheur,
          <article-title>PGTask: Introducing the task of profile generation from dialogues</article-title>
          , in: S. Stoyanchev,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schlangen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dusek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kennington</surname>
          </string-name>
          , M. Alikhani (Eds.),
          <source>Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue</source>
          , Association for Computational Linguistics, Prague, Czechia,
          <year>2023</year>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>189</lpage>
          . URL: https://aclanthology.org/2023.sigdial-1.17. doi:10.18653/v1/2023.sigdial-1.17.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.-E.</given-names>
            <surname>Mazaré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Humeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <article-title>Training millions of personalized dialogue agents</article-title>
          , in:
          <string-name>
            <given-names>E.</given-names>
            <surname>Riloff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hockenmaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>2775</fpage>
          -
          <lpage>2779</lpage>
          . URL: https://aclanthology.org/D18-1298. doi:10.18653/v1/D18-1298.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Welleck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Szlam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Dialogue natural language inference</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Traum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>3731</fpage>
          -
          <lpage>3741</lpage>
          . URL: https://aclanthology.org/P19-1363. doi:10.18653/v1/P19-1363.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Detecting speaker personas from conversational texts</article-title>
          , in:
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>1126</fpage>
          -
          <lpage>1136</lpage>
          . URL: https://aclanthology.org/2021.emnlp-main.86. doi:10.18653/v1/2021.emnlp-main.86.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shinji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Masashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Toshihiko</surname>
          </string-name>
          ,
          <article-title>Verification of LLM's ability to extract speaker information from utterance and context during chats</article-title>
          ,
          <source>in: Proceedings of the Thirtieth Annual Meeting of the Association for Natural Language Processing</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>3050</fpage>
          -
          <lpage>3054</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mizukami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Arimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Narimatsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chiba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nakajima</surname>
          </string-name>
          , T. Meguro,
          <article-title>Empirical analysis of training strategies of transformer-based Japanese chit-chat systems</article-title>
          ,
          <year>2021</year>
          . arXiv:2109.05217.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <article-title>Keep me updated! memory management in long-term conversations</article-title>
          , in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kozareva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2022</source>
          , Association for Computational Linguistics
          , Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>3769</fpage>
          -
          <lpage>3787</lpage>
          . URL: https://aclanthology.org/2022.findings-emnlp.276. doi:10.18653/v1/2022.findings-emnlp.276.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Association for Computational Linguistics
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3982</fpage>
          -
          <lpage>3992</lpage>
          . URL: https://aclanthology.org/D19-1410. doi:10.18653/v1/D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          , in:
          <string-name>
            <given-names>P.</given-names>
            <surname>Isabelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Charniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics
          , Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>Manual and automatic evaluation of summaries</article-title>
          , in:
          <source>Proceedings of the ACL-02 Workshop on Automatic Summarization</source>
          , Association for Computational Linguistics, Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>51</lpage>
          . URL: https://aclanthology.org/W02-0406. doi:10.3115/1118162.1118168.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          , CoRR abs/1904.09675 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1904.09675. arXiv:1904.09675.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>