<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Benchmarking Sentence Alignment Techniques for Automatic Review-Response Generation in the Hospitality Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Renate Hauser</string-name>
          <email>renate.hauser@uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tannon Kew</string-name>
          <email>kew@uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computational Linguistics, University of Zurich</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lugano</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Recently, online customer reviews have surged in popularity, placing additional demands on businesses to respond to these reviews. Conditional text generation models, trained to generate a response for a given input review, have been proposed to assist human authors in composing high-quality responses. However, this approach has been shown to yield rather unsatisfying, generic responses while, in practice, responses are required to address reviews specifically and individually. We hypothesise that this issue could be tackled by changing the alignment paradigm and using sentence-aligned training data instead of document-aligned data. Yet, finding correct sentence alignments in the review-response document pairs is not trivial. In this paper, we investigate methods for aligning sentences based on the surface and semantic similarity between source and target pairs, and benchmark performance on this rather challenging alignment problem.</p>
      </abstract>
      <kwd-group>
        <kwd>Hospitality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Online reviews have become an extremely popular and useful tool for both
businesses and consumers. Today, there are numerous online platforms such as
TripAdvisor, Yelp, or Booking.com, where customers can rate restaurants and
hotels and write reviews about their visit. These reviews are an increasingly
important source of information for potential or future customers [1]. This
has led to a growing emphasis on effective online customer feedback management,
which gives businesses the opportunity to influence the public discourse.</p>
      <p>However, many businesses lack the resources required to efficiently
respond to such a high influx of reviews. This has given rise to research into
how artificial intelligence can support the process of writing a review
response. Katsiuba et al. [2] investigate a neural sequence-to-sequence model
trained to automatically generate a full response for a given review text.
However, their proposed system tends to produce generic responses rather than
addressing specific issues raised in the input review, which limits its
applicability in practice. Since review responses vary greatly in both style
and specificity, we hypothesise that the high degree of generic automatically
generated responses is due to a fundamental alignment problem: at the document
level, alignments between semantic units in the review text and the response
text are often scarce.</p>
      <p>One potential solution would be to go below the document level and
investigate generation at the sentence level. However, such an approach
involves first extracting aligned sentence pairs from document-level
review-response pairs. In this paper, we investigate sentence alignment
methods for review-response texts in the hospitality domain. Specifically, we
consider two different approaches: one working at the surface level by making
use of character n-grams, the other leveraging sentence embeddings to assess
the semantic similarity of a given sentence pair.</p>
    </sec>
    <sec id="sec-2">
      <title>Work</title>
      <p>[2] investigate a neural sequence-to-sequence model
trained to automatically generate a full response for a Automatic Review-Response Generation
given review text. However, their proposed system tends cess of sequence-to-sequence (seq2seq) encoder-decoder
The
sucto produce generic responses rather than addressing
specific issues raised in the input review, which limits its
applicability in practice. Since review responses vary agents in industry as well as in academic research5[].
greatly in both style and specificity, we hypothesise
that the high degree of generic automatically generated ilar to that of conversational agents, where the goal is to
responses is due to a fundamental alignment problem: at generate an adequate response for a given input. Also,</p>
      <sec id="sec-2-1">
        <title>The task of automatic review response generation is simmodels [3] in the task of conversational modelling [4] has led to a significant interest in chatbots and conversational</title>
        <p>have been a popular choice for the task of automatic
review response generation in various domains at the
vidual or specific responses these proposed approaches published on TripAdvisor. Scripts to reproduce our data
have typically focused on extending the basic seq2seq will be made publicly available1.
architecture to incorporate additional contextual
information. Yet this fails to alleviate the problem entirely.</p>
        <sec id="sec-2-1-1">
          <title>3.2. Method</title>
          <p>Sentence Alignment Sentence-aligned parallel cor- In order to quantify an alignment, we rely on the intuition
pora are a crucial prerequisite for language transductionthat aligned sentences should be semantically similar. For
tasks such as machine translation (MT) or conversational example, a review sentence that praises the quality of a
modelling. Yet the quality of systems trained on parallel hotel bed should be aligned with a response sentence that
data is largely dependent on the quality of the training mentions sleep. As a first step, we need to segment the
data and poor alignments can severely harm the perfor-review-response pair documents into their constituent
mance of the downstream application 1[0]. Consequently, sentences. For this we usespaCy2. We keep
preprocessthere has been a great deal of research towards improvinging minimal and simply apply lowercasing since casing is
alignment algorithms. Algorithms operating on surface- of little importance for the alignment task 1[1]. Secondly,
level overlap as well as more complex neural approaches following our underlying assumption, we compute a
simithat consider deep semantic representations, have been larity score for each combination of review and response
proposed for MT [11, 12], automatic text simplification sentences in a document. To this end, we investigate
[13, 14] and paraphrasing [15]. To the best of our knowl- two diferent approaches, namely, surface-level similarity
edge, we are the first to study the alignment of sentence based on character n-gram overlap and semantic
simipairs from online review-response documents and thus larity based on dense sentence embeddings. While the
set out to benchmark this task. former ofers a computationally cheap approach, it fails
to account for sentences that are semantically similar
but expressed diferently, such as in the example above.
3. Review-Response Sentence Thus, we expect the latter approach to be most suitable.</p>
          <p>Alignment In a final step, we determine suitable thresholds for
classifying an alignment unit and derive alignments based
Review response pairs typically resemble paragraphs, of- on these scores (Section3.2).
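        <p>To make the segmentation step concrete, the following minimal sketch
shows one possible implementation. The pipeline name en_core_web_sm and the
helper name segment are our own illustrative choices and are not prescribed
above.</p>
        <preformat>
import spacy

# Any spaCy pipeline with a sentence segmenter would do here;
# en_core_web_sm is an assumption for illustration.
nlp = spacy.load("en_core_web_sm")

def segment(document: str) -> list[str]:
    """Split a review or response document into lowercased sentences."""
    doc = nlp(document)
    # Lowercasing is the only preprocessing step, since casing is of
    # little importance for the alignment task.
    return [sent.text.lower() for sent in doc.sents]

review_sents = segment("The bed was wonderful. Breakfast was cold.")
response_sents = segment("We are glad you slept well. "
                         "We apologise for the breakfast.")
        </preformat>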
        <p>Surface-Level Similarity To compute surface-level similarity between
source and target sentences, we use the chrF metric presented by Popović [16].
The formula of chrF is as follows:</p>
        <p>chrFβ = (1 + β²) · (chrP · chrR) / (β² · chrP + chrR)</p>
        <p>where chrP is the percentage of character n-grams in the hypothesis
that are also present in the reference (i.e. precision) and chrR is the
percentage of character n-grams in the reference that are also present in the
hypothesis (i.e. recall). We investigate several settings. As we want to focus
on content words rather than stopwords, we only consider n-gram orders
starting from n=4. On the other hand, too high n-gram lengths might be too
restricting to derive any useful alignments. We therefore set an upper limit
of n=6.</p>
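        <p>A direct implementation of the formula above, averaging precision
and recall over the n-gram orders n=4 to n=6, might look as follows. This is a
simplified sketch; an off-the-shelf implementation such as the chrF scorer in
sacrebleu could equally be used.</p>
        <preformat>
import re
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # chrF operates on character n-grams with whitespace removed.
    chars = re.sub(r"\s+", "", text)
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hyp: str, ref: str, n_min: int = 4, n_max: int = 6,
         beta: float = 1.0) -> float:
    """chrF over character n-gram orders n_min..n_max."""
    precisions, recalls = [], []
    for n in range(n_min, n_max + 1):
        hyp_ngrams = char_ngrams(hyp, n)
        ref_ngrams = char_ngrams(ref, n)
        overlap = sum((hyp_ngrams &amp; ref_ngrams).values())
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
        recalls.append(overlap / max(sum(ref_ngrams.values()), 1))
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    if chr_p + chr_r == 0.0:
        return 0.0
    return (1 + beta**2) * chr_p * chr_r / (beta**2 * chr_p + chr_r)
        </preformat>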
        <p>Semantic Similarity To compute semantic similarity, we make use of
BERT-based sentence embeddings (SBERT) [17]
(https://www.sbert.net/docs/pretrained_models.html) and compute the cosine
similarity between sentence pairs. We consider two alternate framings for our
task and compare SBERT models accordingly. The first of these frames a
response sentence as a paraphrase of a review sentence, for which we use the
paraphrase-MiniLM-L3-v2 model. The second considers the task as a type of
natural language inference (NLI), in which a response sentence may be
logically inferred from a review sentence. Thus we also test the
nli-mpnet-base-v2 model.</p>
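        <p>A minimal sketch of this scoring step using the sentence-transformers
library, reusing the segmented sentences from above (the variable names are
ours; either of the two models can be plugged in):</p>
        <preformat>
from sentence_transformers import SentenceTransformer, util

# The paraphrase framing; swap in "nli-mpnet-base-v2" for the NLI framing.
model = SentenceTransformer("paraphrase-MiniLM-L3-v2")

review_emb = model.encode(review_sents, convert_to_tensor=True)
response_emb = model.encode(response_sents, convert_to_tensor=True)

# Cosine similarity for every review/response sentence combination,
# yielding a |review| x |response| similarity matrix.
sim_matrix = util.cos_sim(review_emb, response_emb)
        </preformat>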
        <p>To Align or not to Align Since we cannot assume the alignments to be
monotonic, every pair of review and response sentences in a document is a
potential candidate. For each such pair, a similarity score needs to be
computed, resulting in a similarity matrix. An example is provided in Appendix
A. As the time complexity of this comparison is O(|R| · |S|), where R is the
review text and S the response text, this is an expensive step. However, since
the vast majority of the review-response documents contain less than ten
sentences, this is still feasible. Given this matrix of similarity scores, the
challenge is to determine an appropriate threshold for classifying an aligned
sentence pair. In the following section we investigate suitable thresholds by
inspecting the trade-off between precision and recall on a small
manually-annotated gold standard.</p>
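        <p>Alignment extraction then reduces to thresholding the similarity
matrix. The sketch below illustrates this under our assumptions; the threshold
value in the usage comment is for illustration only.</p>
        <preformat>
def extract_alignments(sim_matrix, threshold: float):
    """Return all (review_idx, response_idx) pairs whose similarity
    exceeds the threshold. Since alignments are not monotonic, every
    cell is a candidate: 1:1, 1:N and N:M alignments can all arise,
    and documents with no score above the threshold yield none."""
    return [
        (i, j)
        for i, row in enumerate(sim_matrix)
        for j, score in enumerate(row)
        if score >= threshold
    ]

# e.g. with the cosine similarity matrix from above:
# alignments = extract_alignments(sim_matrix.tolist(), threshold=0.5)
        </preformat>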
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>Gold Standard To be able to automatically validate our candidate
aligners, we compiled a manually annotated gold standard containing 115
review-response pair documents. These pairs were randomly sampled from the
test split of our dataset. We then tasked two annotators, who were familiar
with the alignment task, to annotate each review sentence with zero, one or
multiple corresponding response sentences. This is a non-trivial task, as
there is often no obvious distinction between a vague, generic response and no
correspondence. The manual annotation yielded approximately 130 aligned
sentence pairs. To measure the inter-annotator agreement (IAA) we used the
Kappa statistic [18, 19]. The IAA for the gold standard reached a Kappa value
of 0.64. This rather low agreement reflects the difficulty of the task.</p>
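      <p>For reference, the agreement computation might look like this. We
assume the annotations are flattened to one binary align/no-align decision per
candidate sentence pair; scikit-learn is one possible implementation of
Cohen's kappa [18], and the label lists are invented for illustration.</p>
      <preformat>
from sklearn.metrics import cohen_kappa_score

# One label per candidate (review sentence, response sentence) pair:
# 1 = aligned, 0 = not aligned, in the same order for both annotators.
annotator_a = [1, 0, 0, 1, 0, 1, 0, 0]
annotator_b = [1, 0, 1, 1, 0, 0, 0, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")
      </preformat>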
      <p>Metrics We validate the output of the aligners with precision, recall
and F1 score. The total number of alignments in the gold standard serves as
the expected number of alignments that an aligner should extract. Because of
the range of possible correct alignments, only considering complete matches
would be too restricting. Therefore, we follow Jiang et al. [14] and report
metrics for completely matching alignments (vs. partially matching alignments
+ non-alignments) as well as for partially + completely matching alignments
(vs. non-alignments). We considered an alignment to be partially correct if at
least one review sentence and one response sentence assigned by the aligner
appears in an alignment in the gold standard.</p>
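      <p>Under one possible reading of this criterion, the validation metrics
can be computed as follows. Here an alignment unit is a pair of index sets,
which covers 1:1 as well as N:M units; the function names are ours.</p>
      <preformat>
def is_partial_match(pred_unit, gold_units):
    """True if at least one review sentence and one response sentence of
    the predicted unit appear together in some gold alignment unit."""
    pred_reviews, pred_responses = pred_unit
    return any(pred_reviews &amp; gold_revs and pred_responses &amp; gold_resps
               for gold_revs, gold_resps in gold_units)

def score(pred_units, gold_units):
    """Precision, recall and F1 for complete and partial matches. A unit
    is a (frozenset of review indices, frozenset of response indices)."""
    complete = sum(1 for unit in pred_units if unit in gold_units)
    partial = sum(1 for unit in pred_units
                  if is_partial_match(unit, gold_units))

    def prf(hits):
        p = hits / len(pred_units) if pred_units else 0.0
        r = hits / len(gold_units) if gold_units else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Complete matches also count as partial matches, mirroring the
    # "partially + completely matching" setting above.
    return {"complete": prf(complete), "partial": prf(partial)}
      </preformat>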
      <p>Similarity Thresholds Low thresholds lead to large, unmotivated
N:M-alignments, while high thresholds constrain the space of possible aligned
segments too harshly. Therefore, we considered thresholds ranging from 0.02 to
0.16 for the chrF-based approach and from 0.1 to 0.6 for the cosine similarity
approach. Manual investigation showed that 0.16 and 0.6, respectively, were
reasonable thresholds, above which alignments were not found.</p>
      <p>Performance We consider the results for complete matches to be a
measure for how well the alignments reflect the human judgement in the gold
standard. As can be seen in Figure 1, higher n-gram orders (n), as well as
higher thresholds (t), yield better measures for complete matches. However,
looking at Figure 3, we can see a clear trade-off between the total number of
alignments and a high F1 score, using the expected number of alignments of 130
from the gold standard. In fact, the three aligners with n=6/t=0.06,
n=5/t=0.08, and n=4/t=0.12 yield comparable results in all three metrics while
yielding a reasonable number of alignments. As the focus lies on good
precision, we consider thresholds that are above the critical point where
precision exceeds recall for the semantic similarity based approach, namely
between t=0.4 and t=0.5. This choice is also confirmed by the number of
extracted alignments, which starts to drop off steeply at t=0.4. As can be
seen in Figure 2, while partial matches are consistently higher for the NLI
SBERT model, performance in terms of complete alignments is relatively equal
for both SBERT models.</p>
      <sec id="sec-4-1">
        <title>4.1. Results</title>
        <p>To assess how well the five most promising candidate aligners
reflect the judgement of a human annotator, we conducted a small manual
evaluation. We randomly sampled 50 alignments produced by each aligner,
including 1:1 and N:M-alignments. We manually labeled these as either "valid"
or "not valid". For both surface-level and semantic approaches, misalignments
are typically caused by the occurrence of named entities such as hotel or
personal names. This influence is particularly observable for the n=4 chrF
aligner. Meanwhile, the NLI SBERT aligner is somewhat greedy and suffers from
large, unmotivated N:M-alignments. Despite this, we found no evidence of one
of the candidate aligners substantially outperforming the others, with all
yielding between 27 and 34 valid alignments out of 50.</p>
      </sec>
        <sec id="sec-2-2-1">
          <title>4.2. Discussion &amp; Future Work</title>
          <p>Based on the results, we are able to derive five candi- [1] V. Browning, K. K. F. So, B. Sparks, The
Infludate approaches that yield approximately equally good ence of Online Reviews on Consumers’
Attriburesults when evaluated manually. Our qualitative evalu- tions of Service Quality and Control for Service
ation shows that our methods are capable of extracting Standards in Hotels, Journal of Travel &amp; Tourism
relatively high-precision alignments, but sufer in terms Marketing 30 (2013) 23–40. URL: https://doi.
of recall, leading to a low overall F1 score. org/10.1080/10548408.2013.750971. doi:10.1080/</p>
          <p>Given a large dataset of review-response documents, 10548408.2013.750971.
future work will benefit from the methods presented [2] D. Katsiuba, T. Kew, M. Dolata, G. Schwabe,
Suphere to derive aligned sentence pairs for training review- porting online customer feedback management
response generation models on the sentence level. We with automatic review response generation, in:
hope that this will encourage the model to learn more The 55th Hawaii International Conference on
Syssemantically related mappings between the source and tem Sciences, HICSS, 2022, pp. 226–236. URL: https:
target texts. //doi.org/10.5167/uzh-212773.</p>
          <p>We acknowledge that the gold standard used to vali- [3] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to
date a range of thresholds for our methods is relatively Sequence Learning with Neural Networks, in:
small and a larger gold standard would be beneficial for NIPS’14: Proceedings of the 27th International
Conenhancing the reliability of these results. Furthermore, ference on Neural Information Processing Systems,
evaluation of sentence-level review response generation volume 2, Montreal, 2014. URL: https://arxiv.org/
systems is also dependent on sentence-level test data. abs/1409.3215v3.</p>
          <p>Thus, additional human annotation is required to con- [4] O. Vinyals, Q. Le, A Neural Conversational Model,
struct a suitable evaluation set. arXiv:1506.05869 [cs] (2015). URL:http://arxiv.org/
abs/1506.05869.
5. Conclusion [5] P. Gentsch, Künstliche Intelligenz für Sales,
Marketing und Service, Springer Fachmedien
In this paper we investigated possible methods for deriv- Wiesbaden, Wiesbaden, 2018. URL: http://link.
ing aligned sentences from hospitality review-response springer.com/10.1007/978-3-658-19147-4. doi:10.
pairs. We believe that such alignments will be useful 1007/978-3-658-19147-4.
for improving the performance of downstream review [6] S. Diederich, M. Janßen-Müller, A. Brendel,
response generation models by better mapping seman- S. Morana, Emulating Empathetic Behavior in
Ontically related segments between the source and target line Service Encounters with Sentiment-Adaptive
texts. Automatic validation results and a small qualita- Responses: Insights from an Experiment with a
tive evaluation reveal that a relatively cheap character Conversational Agent, in: Proceedings of
Internan-gram overlap metric allows us to align sentence pairs tional Conference on Information Systems (ICIS),
based purely on surface-level similarity with comparable Munich, 2019. URL:https://aisel.aisnet.org/icis2019/
results to a more expensive approach based on semantic smart_service_science/smart_service_science/.2/
similarity. [7] C. Gao, J. Zeng, X. Xia, D. Lo, M. R. Lyu, I. King,
Automating App Review Response Generation, in:
Proceedings of the 34th IEEE/ACM International
Conference on Automated Software Engineering
(ASE), San Diego, USA, 2019, pp. 163–175. doi:10. matic MT evaluation, in: Proceedings of the Tenth
1109/ASE.2019.00025. Workshop on Statistical Machine Translation,
Lis[8] L. Zhao, K. Song, C. Sun, Q. Zhang, X. Huang, bon, 2015, pp. 392–395. URL: https://aclanthology.</p>
          <p>X. Liu, Review Response Generation in E- org/W15-3049. doi:10.18653/v1/W15-3049.
Commerce Platforms with External Product Infor-[17] N. Reimers, I. Gurevych, Sentence-BERT: Sentence
mation, in: The World Wide Web Conference on Embeddings using Siamese BERT-Networks, in:
- WWW ’19, San Francisco, 2019, pp. 2425–2435. Proceedings of the 2019 Conference on Empirical
URL: http://dl.acm.org/citation.cfm?doid=3308558. Methods in Natural Language Processing and the
3313581. doi:10.1145/3308558.3313581. 9th International Joint Conference on Natural
Lan[9] T. Kew, M. Amsler, S. Ebling, Benchmark- guage Processing (EMNLP-IJCNLP), Hong Kong,
ing Automated Review Response Generation for 2019, pp. 3980–3990. URL: https://www.aclweb.org/
the Hospitality Domain, in: Proceedings of anthology/D19-1410. doi:10.18653/v1/D19-1410.
Workshop on Natural Language Processing in E- [18] J. Cohen, A coeficient of agreement for nominal
Commerce, Barcelona, 2020, pp. 43–52. URL:https: scales, Educational and Psychological Measurement
//aclanthology.org/2020.ecomnlp-1.5. 20 (1960) 37–46. doi:https://doi.org/10.1177/
[10] H. Khayrallah, P. Koehn, On the Impact of Various 001316446002000104.</p>
          <p>Types of Noise on Neural Machine Translation, in: [19] J. Carletta, Assessing agreement on classification
Proceedings of the 2nd Workshop on Neural Ma- tasks: the kappa statistic, CoRR cmp-lg/9602004
chine Translation and Generation, Melbourne, 2018, (1996). URL: http://arxiv.org/abs/cmp-lg/9602004.
pp. 74–83. URL: https://aclanthology.org/W18-270 9.</p>
          <p>doi:10.18653/v1/W18-2709.
[11] R. Sennrich, M. Volk, MT-based Sentence Align- A. Appendix
ment for OCR-generated Parallel Texts, in:
Proceedings of the 9th Conference of the Association for
Machine Translation in the Americas: Research
Papers, Denver, 2010. URL:https://aclanthology.org/
2010.amta-papers.14.
[12] B. Thompson, P. Koehn, Vecalign: Improved
Sentence Alignment in Linear Time and Space, in:
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the
9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), Hong Kong,
2019, pp. 1342–1348. URL: https://www.aclweb.org/
anthology/D19-1136. doi:10.18653/v1/D19-1136.
[13] W. Xu, C. Callison-Burch, C. Napoles, Problems in</p>
          <p>Current Text Simplification Research: New Data
Can Help, Transactions of the Association for
Computational Linguistics 3 (2015) 283–297. URL:
https://doi.org/10.1162/tacl_a_00139. doi:10.1162/
tacl_a_00139.
[14] C. Jiang, M. Maddela, W. Lan, Y. Zhong, W. Xu,</p>
          <p>Neural CRF Model for Sentence Alignment in Text
Simplification, in: Proceedings of the 58th
Annual Meeting of the Association for
Computational Linguistics, Online, 2020, pp. 7943–7960. URL:
https://aclanthology.org/2020.acl-main.709.doi:10.</p>
          <p>18653/v1/2020.acl-main.709.
[15] R. Barzilay, N. Elhadad, Sentence Alignment for</p>
          <p>Monolingual Comparable Corpora, in:
Proceedings of the 2003 Conference on Empirical Methods
in Natural Language Processing, Sapporo, Japan,
2003, pp. 25–32. URL: https://aclanthology.org/</p>
          <p>W03-1004.
[16] M. Popović, chrF: character n-gram F-score for
autoResponse
0 Dear Jim H, Thanks for your review, and feedback after your recent stay.,
1 We are happy to know you enjoyed your stay, and that our renovations at the hotel did not disturb you at all!,
2 We do apologize for the missteps with the housekeeping department, and appreciate your comments.,
3 We do our best to extend the Marriott standards, and quality our guests have come to expect, and have taken note of
your feedback for our team to review.,
4 We appreciate your review, and hope you will return to enjoy all of the new renovations being done at the hotel!</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>