<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Benchmarking Sentence Alignment Techniques for Automatic Review-Response Generation in the Hospitality Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Renate Hauser</string-name>
          <email>renate.hauser@uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tannon Kew</string-name>
          <email>kew@uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computational Linguistics, University of Zurich</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lugano</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Recently, online customer reviews have surged in popularity, placing additional demands on businesses to respond to these reviews. Conditional text generation models, trained to generate a response for a given input review, have been proposed to assist human authors in composing high-quality responses. However, this approach has been shown to yield rather unsatisfying, generic responses while, in practice, responses are required to address reviews specifically and individually. We hypothesise that this issue could be tackled by changing the alignment paradigm and using sentence-aligned training data instead of document-aligned data. Yet, finding correct sentence alignments in the review-response document pairs is not trivial. In this paper, we investigate methods for aligning sentences based on the surface and semantic similarity between source and target pairs, and benchmark performance on this rather challenging alignment problem.</p>
      </abstract>
      <kwd-group>
        <kwd>Hospitality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Online reviews have become an extremely popular and useful tool for both
businesses and consumers. Today, there are numerous online platforms such as
TripAdvisor, Yelp, or Booking.com, where customers can rate restaurants and
hotels and write reviews about their visit. These reviews are an increasingly
important source of information for potential or future customers [1]. This
has led to a growing emphasis on effective online customer feedback management,
which gives businesses the opportunity to influence the public discourse.</p>
      <p>However, many businesses lack the resources required to efficiently
respond to such a high influx of reviews. This has given rise to research into
how artificial intelligence can support the process of writing a review
response. Katsiuba et al. [2] investigate a neural sequence-to-sequence model
trained to automatically generate a full response for a given review text.
However, their proposed system tends to produce generic responses rather than
addressing specific issues raised in the input review, which limits its
applicability in practice. Since review responses vary greatly in both style
and specificity, we hypothesise that the high degree of generic automatically
generated responses is due to a fundamental alignment problem: at the document
level, alignments between semantic units in the review text and the response
text are often scarce.</p>
      <p>One potential solution would be to go below the document level and
investigate generation at the sentence level. However, such an approach
involves first extracting aligned sentence pairs from document-level
review-response pairs. In this paper, we investigate sentence alignment
methods for review-response texts in the hospitality domain. Specifically, we
consider two different approaches: one working at the surface level by making
use of character n-grams, the other leveraging sentence embeddings to assess
the semantic similarity of a given sentence pair.</p>
    </sec>
    <sec id="sec-2">
      <title>Work</title>
      <p>[2] investigate a neural sequence-to-sequence model
trained to automatically generate a full response for a Automatic Review-Response Generation
given review text. However, their proposed system tends cess of sequence-to-sequence (seq2seq) encoder-decoder
The
sucto produce generic responses rather than addressing
specific issues raised in the input review, which limits its
applicability in practice. Since review responses vary agents in industry as well as in academic research5[].
greatly in both style and specificity, we hypothesise
that the high degree of generic automatically generated ilar to that of conversational agents, where the goal is to
responses is due to a fundamental alignment problem: at generate an adequate response for a given input. Also,</p>
      <sec id="sec-2-1">
        <title>The task of automatic review response generation is simmodels [3] in the task of conversational modelling [4] has led to a significant interest in chatbots and conversational</title>
        <p>have been a popular choice for the task of automatic
review response generation in various domains at the
vidual or specific responses these proposed approaches published on TripAdvisor. Scripts to reproduce our data
have typically focused on extending the basic seq2seq will be made publicly available1.
architecture to incorporate additional contextual
information. Yet this fails to alleviate the problem entirely.</p>
        <sec id="sec-2-1-1">
          <title>3.2. Method</title>
          <p>Sentence Alignment Sentence-aligned parallel cor- In order to quantify an alignment, we rely on the intuition
pora are a crucial prerequisite for language transductionthat aligned sentences should be semantically similar. For
tasks such as machine translation (MT) or conversational example, a review sentence that praises the quality of a
modelling. Yet the quality of systems trained on parallel hotel bed should be aligned with a response sentence that
data is largely dependent on the quality of the training mentions sleep. As a first step, we need to segment the
data and poor alignments can severely harm the perfor-review-response pair documents into their constituent
mance of the downstream application 1[0]. Consequently, sentences. For this we usespaCy2. We keep
preprocessthere has been a great deal of research towards improvinging minimal and simply apply lowercasing since casing is
alignment algorithms. Algorithms operating on surface- of little importance for the alignment task 1[1]. Secondly,
level overlap as well as more complex neural approaches following our underlying assumption, we compute a
simithat consider deep semantic representations, have been larity score for each combination of review and response
proposed for MT [11, 12], automatic text simplification sentences in a document. To this end, we investigate
[13, 14] and paraphrasing [15]. To the best of our knowl- two diferent approaches, namely, surface-level similarity
edge, we are the first to study the alignment of sentence based on character n-gram overlap and semantic
simipairs from online review-response documents and thus larity based on dense sentence embeddings. While the
set out to benchmark this task. former ofers a computationally cheap approach, it fails
to account for sentences that are semantically similar
but expressed diferently, such as in the example above.
3. Review-Response Sentence Thus, we expect the latter approach to be most suitable.</p>
          <p>Alignment In a final step, we determine suitable thresholds for
classifying an alignment unit and derive alignments based
Review response pairs typically resemble paragraphs, of- on these scores (Section3.2).
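        <p>To make the segmentation step concrete, the following minimal sketch
shows one possible implementation. The pipeline name en_core_web_sm and the
helper name segment are our own illustrative choices and are not prescribed
above.</p>
        <preformat>
import spacy

# Any spaCy pipeline with a sentence segmenter would do here;
# en_core_web_sm is an assumption for illustration.
nlp = spacy.load("en_core_web_sm")

def segment(document: str) -> list[str]:
    """Split a review or response document into lowercased sentences."""
    doc = nlp(document)
    # Lowercasing is the only preprocessing step, since casing is of
    # little importance for the alignment task.
    return [sent.text.lower() for sent in doc.sents]

review_sents = segment("The bed was wonderful. Breakfast was cold.")
response_sents = segment("We are glad you slept well. "
                         "We apologise for the breakfast.")
        </preformat>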
        <p>Surface-Level Similarity To compute surface-level similarity between
source and target sentences, we use the chrF metric presented by Popović [16].
The formula of chrF is as follows:</p>
        <p>chrFβ = (1 + β²) · (chrP · chrR) / (β² · chrP + chrR)</p>
        <p>where chrP is the percentage of character n-grams in the hypothesis
that are also present in the reference (i.e. precision) and chrR is the
percentage of character n-grams in the reference that are also present in the
hypothesis (i.e. recall). We investigate several settings. As we want to focus
on content words rather than stopwords, we only consider n-gram orders
starting from n=4. On the other hand, too high n-gram lengths might be too
restricting to derive any useful alignments. We therefore set an upper limit
of n=6.</p>
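        <p>A direct implementation of the formula above, averaging precision
and recall over the n-gram orders n=4 to n=6, might look as follows. This is a
simplified sketch; an off-the-shelf implementation such as the chrF scorer in
sacrebleu could equally be used.</p>
        <preformat>
import re
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # chrF operates on character n-grams with whitespace removed.
    chars = re.sub(r"\s+", "", text)
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hyp: str, ref: str, n_min: int = 4, n_max: int = 6,
         beta: float = 1.0) -> float:
    """chrF over character n-gram orders n_min..n_max."""
    precisions, recalls = [], []
    for n in range(n_min, n_max + 1):
        hyp_ngrams = char_ngrams(hyp, n)
        ref_ngrams = char_ngrams(ref, n)
        overlap = sum((hyp_ngrams &amp; ref_ngrams).values())
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
        recalls.append(overlap / max(sum(ref_ngrams.values()), 1))
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    if chr_p + chr_r == 0.0:
        return 0.0
    return (1 + beta**2) * chr_p * chr_r / (beta**2 * chr_p + chr_r)
        </preformat>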
        <p>Semantic Similarity To compute semantic similarity, we make use of
BERT-based sentence embeddings (SBERT) [17]
(https://www.sbert.net/docs/pretrained_models.html) and compute the cosine
similarity between sentence pairs. We consider two alternate framings for our
task and compare SBERT models accordingly. The first of these frames a
response sentence as a paraphrase of a review sentence, for which we use the
paraphrase-MiniLM-L3-v2 model. The second considers the task as a type of
natural language inference (NLI), in which a response sentence may be
logically inferred from a review sentence. Thus we also test the
nli-mpnet-base-v2 model.</p>
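        <p>A minimal sketch of this scoring step using the sentence-transformers
library, reusing the segmented sentences from above (the variable names are
ours; either of the two models can be plugged in):</p>
        <preformat>
from sentence_transformers import SentenceTransformer, util

# The paraphrase framing; swap in "nli-mpnet-base-v2" for the NLI framing.
model = SentenceTransformer("paraphrase-MiniLM-L3-v2")

review_emb = model.encode(review_sents, convert_to_tensor=True)
response_emb = model.encode(response_sents, convert_to_tensor=True)

# Cosine similarity for every review/response sentence combination,
# yielding a |review| x |response| similarity matrix.
sim_matrix = util.cos_sim(review_emb, response_emb)
        </preformat>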
        <p>To Align or not to Align Since we cannot assume the alignments to be
monotonic, every pair of review and response sentences in a document is a
potential candidate. For each such pair, a similarity score needs to be
computed, resulting in a similarity matrix. An example is provided in Appendix
A. As the time complexity of this comparison is O(|R| · |S|), where R is the
review text and S the response text, this is an expensive step. However, since
the vast majority of the review-response documents contain less than ten
sentences, this is still feasible. Given this matrix of similarity scores, the
challenge is to determine an appropriate threshold for classifying an aligned
sentence pair. In the following section we investigate suitable thresholds by
inspecting the trade-off between precision and recall on a small
manually-annotated gold standard.</p>
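        <p>Alignment extraction then reduces to thresholding the similarity
matrix. The sketch below illustrates this under our assumptions; the threshold
value in the usage comment is for illustration only.</p>
        <preformat>
def extract_alignments(sim_matrix, threshold: float):
    """Return all (review_idx, response_idx) pairs whose similarity
    exceeds the threshold. Since alignments are not monotonic, every
    cell is a candidate: 1:1, 1:N and N:M alignments can all arise,
    and documents with no score above the threshold yield none."""
    return [
        (i, j)
        for i, row in enumerate(sim_matrix)
        for j, score in enumerate(row)
        if score >= threshold
    ]

# e.g. with the cosine similarity matrix from above:
# alignments = extract_alignments(sim_matrix.tolist(), threshold=0.5)
        </preformat>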
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>Gold Standard To be able to automatically validate our candidate
aligners, we compiled a manually annotated gold standard containing 115
review-response pair documents. These pairs were randomly sampled from the
test split of our dataset. We then tasked two annotators, who were familiar
with the alignment task, to annotate each review sentence with zero, one or
multiple corresponding response sentences. This is a non-trivial task, as
there is often no obvious distinction between a vague, generic response and no
correspondence. The manual annotation yielded approximately 130 aligned
sentence pairs. To measure the inter-annotator agreement (IAA) we used the
Kappa statistic [18, 19]. The IAA for the gold standard reached a Kappa value
of 0.64. This rather low agreement reflects the difficulty of the task.</p>
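      <p>For reference, the agreement computation might look like this. We
assume the annotations are flattened to one binary align/no-align decision per
candidate sentence pair; scikit-learn is one possible implementation of
Cohen's kappa [18], and the label lists are invented for illustration.</p>
      <preformat>
from sklearn.metrics import cohen_kappa_score

# One label per candidate (review sentence, response sentence) pair:
# 1 = aligned, 0 = not aligned, in the same order for both annotators.
annotator_a = [1, 0, 0, 1, 0, 1, 0, 0]
annotator_b = [1, 0, 1, 1, 0, 0, 0, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")
      </preformat>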
      <p>Metrics We validate the output of the aligners with precision, recall
and F1 score. The total number of alignments in the gold standard serves as
the expected number of alignments that an aligner should extract. Because of
the range of possible correct alignments, only considering complete matches
would be too restricting. Therefore, we follow Jiang et al. [14] and report
metrics for completely matching alignments (vs. partially matching alignments
+ non-alignments) as well as for partially + completely matching alignments
(vs. non-alignments). We considered an alignment to be partially correct if at
least one review sentence and one response sentence assigned by the aligner
appears in an alignment in the gold standard.</p>
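      <p>Under one possible reading of this criterion, the validation metrics
can be computed as follows. Here an alignment unit is a pair of index sets,
which covers 1:1 as well as N:M units; the function names are ours.</p>
      <preformat>
def is_partial_match(pred_unit, gold_units):
    """True if at least one review sentence and one response sentence of
    the predicted unit appear together in some gold alignment unit."""
    pred_reviews, pred_responses = pred_unit
    return any(pred_reviews &amp; gold_revs and pred_responses &amp; gold_resps
               for gold_revs, gold_resps in gold_units)

def score(pred_units, gold_units):
    """Precision, recall and F1 for complete and partial matches. A unit
    is a (frozenset of review indices, frozenset of response indices)."""
    complete = sum(1 for unit in pred_units if unit in gold_units)
    partial = sum(1 for unit in pred_units
                  if is_partial_match(unit, gold_units))

    def prf(hits):
        p = hits / len(pred_units) if pred_units else 0.0
        r = hits / len(gold_units) if gold_units else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Complete matches also count as partial matches, mirroring the
    # "partially + completely matching" setting above.
    return {"complete": prf(complete), "partial": prf(partial)}
      </preformat>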
      <p>Similarity Thresholds Low thresholds lead to large, unmotivated
N:M-alignments, while high thresholds constrain the space of possible aligned
segments too harshly. Therefore, we considered thresholds ranging from 0.02 to
0.16 for the chrF-based approach and from 0.1 to 0.6 for the cosine similarity
approach. Manual investigation showed that 0.16 and 0.6, respectively, were
reasonable thresholds, above which alignments were not found.</p>
      <p>Performance We consider the results for complete matches to be a
measure for how well the alignments reflect the human judgement in the gold
standard. As can be seen in Figure 1, higher n-gram orders (n), as well as
higher thresholds (t), yield better measures for complete matches. However,
looking at Figure 3, we can see a clear trade-off between the total number of
alignments and a high F1 score, using the expected number of alignments of 130
from the gold standard. In fact, the three aligners with n=6/t=0.06,
n=5/t=0.08, and n=4/t=0.12 yield comparable results in all three metrics while
yielding a reasonable number of alignments. As the focus lies on good
precision, we consider thresholds that are above the critical point where
precision exceeds recall for the semantic similarity based approach, namely
between t=0.4 and t=0.5. This choice is also confirmed by the number of
extracted alignments, which starts to drop off steeply at t=0.4. As can be
seen in Figure 2, while partial matches are consistently higher for the NLI
SBERT model, performance in terms of complete alignments is relatively equal
for both SBERT models.</p>
      <sec id="sec-4-1">
        <title>4.1. Results</title>
        <p>To assess how well the five most promising candidate aligners
reflect the judgement of a human annotator, we conducted a small manual
evaluation. We randomly sampled 50 alignments produced by each aligner,
including 1:1 and N:M-alignments. We manually labeled these as either "valid"
or "not valid". For both surface-level and semantic approaches, misalignments
are typically caused by the occurrence of named entities such as hotel or
personal names. This influence is particularly observable for the n=4 chrF
aligner. Meanwhile, the NLI SBERT aligner is somewhat greedy and suffers from
large, unmotivated N:M-alignments. Despite this, we found no evidence of one
of the candidate aligners substantially outperforming the others, with all
yielding between 27 and 34 valid alignments out of 50.</p>
      </sec>
        <sec id="sec-2-2-1">
          <title>4.2. Discussion &amp; Future Work</title>
          <p>Based on the results, we are able to derive five candi- [1] V. Browning, K. K. F. So, B. Sparks, The
Infludate approaches that yield approximately equally good ence of Online Reviews on Consumers’
Attriburesults when evaluated manually. Our qualitative evalu- tions of Service Quality and Control for Service
ation shows that our methods are capable of extracting Standards in Hotels, Journal of Travel &amp; Tourism
relatively high-precision alignments, but sufer in terms Marketing 30 (2013) 23–40. URL: https://doi.
of recall, leading to a low overall F1 score. org/10.1080/10548408.2013.750971. doi:10.1080/</p>
          <p>Given a large dataset of review-response documents, 10548408.2013.750971.
future work will benefit from the methods presented [2] D. Katsiuba, T. Kew, M. Dolata, G. Schwabe,
Suphere to derive aligned sentence pairs for training review- porting online customer feedback management
response generation models on the sentence level. We with automatic review response generation, in:
hope that this will encourage the model to learn more The 55th Hawaii International Conference on
Syssemantically related mappings between the source and tem Sciences, HICSS, 2022, pp. 226–236. URL: https:
target texts. //doi.org/10.5167/uzh-212773.</p>
          <p>We acknowledge that the gold standard used to vali- [3] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to
date a range of thresholds for our methods is relatively Sequence Learning with Neural Networks, in:
small and a larger gold standard would be beneficial for NIPS’14: Proceedings of the 27th International
Conenhancing the reliability of these results. Furthermore, ference on Neural Information Processing Systems,
evaluation of sentence-level review response generation volume 2, Montreal, 2014. URL: https://arxiv.org/
systems is also dependent on sentence-level test data. abs/1409.3215v3.</p>
          <p>Thus, additional human annotation is required to con- [4] O. Vinyals, Q. Le, A Neural Conversational Model,
struct a suitable evaluation set. arXiv:1506.05869 [cs] (2015). URL:http://arxiv.org/
abs/1506.05869.
5. Conclusion [5] P. Gentsch, Künstliche Intelligenz für Sales,
Marketing und Service, Springer Fachmedien
In this paper we investigated possible methods for deriv- Wiesbaden, Wiesbaden, 2018. URL: http://link.
ing aligned sentences from hospitality review-response springer.com/10.1007/978-3-658-19147-4. doi:10.
pairs. We believe that such alignments will be useful 1007/978-3-658-19147-4.
for improving the performance of downstream review [6] S. Diederich, M. Janßen-Müller, A. Brendel,
response generation models by better mapping seman- S. Morana, Emulating Empathetic Behavior in
Ontically related segments between the source and target line Service Encounters with Sentiment-Adaptive
texts. Automatic validation results and a small qualita- Responses: Insights from an Experiment with a
tive evaluation reveal that a relatively cheap character Conversational Agent, in: Proceedings of
Internan-gram overlap metric allows us to align sentence pairs tional Conference on Information Systems (ICIS),
based purely on surface-level similarity with comparable Munich, 2019. URL:https://aisel.aisnet.org/icis2019/
results to a more expensive approach based on semantic smart_service_science/smart_service_science/.2/
similarity. [7] C. Gao, J. Zeng, X. Xia, D. Lo, M. R. Lyu, I. King,
Automating App Review Response Generation, in:
Proceedings of the 34th IEEE/ACM International
Conference on Automated Software Engineering
(ASE), San Diego, USA, 2019, pp. 163–175. doi:10. matic MT evaluation, in: Proceedings of the Tenth
1109/ASE.2019.00025. Workshop on Statistical Machine Translation,
Lis[8] L. Zhao, K. Song, C. Sun, Q. Zhang, X. Huang, bon, 2015, pp. 392–395. URL: https://aclanthology.</p>
          <p>X. Liu, Review Response Generation in E- org/W15-3049. doi:10.18653/v1/W15-3049.
Commerce Platforms with External Product Infor-[17] N. Reimers, I. Gurevych, Sentence-BERT: Sentence
mation, in: The World Wide Web Conference on Embeddings using Siamese BERT-Networks, in:
- WWW ’19, San Francisco, 2019, pp. 2425–2435. Proceedings of the 2019 Conference on Empirical
URL: http://dl.acm.org/citation.cfm?doid=3308558. Methods in Natural Language Processing and the
3313581. doi:10.1145/3308558.3313581. 9th International Joint Conference on Natural
Lan[9] T. Kew, M. Amsler, S. Ebling, Benchmark- guage Processing (EMNLP-IJCNLP), Hong Kong,
ing Automated Review Response Generation for 2019, pp. 3980–3990. URL: https://www.aclweb.org/
the Hospitality Domain, in: Proceedings of anthology/D19-1410. doi:10.18653/v1/D19-1410.
Workshop on Natural Language Processing in E- [18] J. Cohen, A coeficient of agreement for nominal
Commerce, Barcelona, 2020, pp. 43–52. URL:https: scales, Educational and Psychological Measurement
//aclanthology.org/2020.ecomnlp-1.5. 20 (1960) 37–46. doi:https://doi.org/10.1177/
[10] H. Khayrallah, P. Koehn, On the Impact of Various 001316446002000104.</p>
          <p>Types of Noise on Neural Machine Translation, in: [19] J. Carletta, Assessing agreement on classification
Proceedings of the 2nd Workshop on Neural Ma- tasks: the kappa statistic, CoRR cmp-lg/9602004
chine Translation and Generation, Melbourne, 2018, (1996). URL: http://arxiv.org/abs/cmp-lg/9602004.
pp. 74–83. URL: https://aclanthology.org/W18-270 9.</p>
          <p>doi:10.18653/v1/W18-2709.
[11] R. Sennrich, M. Volk, MT-based Sentence Align- A. Appendix
ment for OCR-generated Parallel Texts, in:
Proceedings of the 9th Conference of the Association for
Machine Translation in the Americas: Research
Papers, Denver, 2010. URL:https://aclanthology.org/
2010.amta-papers.14.
[12] B. Thompson, P. Koehn, Vecalign: Improved
Sentence Alignment in Linear Time and Space, in:
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the
9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), Hong Kong,
2019, pp. 1342–1348. URL: https://www.aclweb.org/
anthology/D19-1136. doi:10.18653/v1/D19-1136.
[13] W. Xu, C. Callison-Burch, C. Napoles, Problems in</p>
          <p>Current Text Simplification Research: New Data
Can Help, Transactions of the Association for
Computational Linguistics 3 (2015) 283–297. URL:
https://doi.org/10.1162/tacl_a_00139. doi:10.1162/
tacl_a_00139.
[14] C. Jiang, M. Maddela, W. Lan, Y. Zhong, W. Xu,</p>
          <p>Neural CRF Model for Sentence Alignment in Text
Simplification, in: Proceedings of the 58th
Annual Meeting of the Association for
Computational Linguistics, Online, 2020, pp. 7943–7960. URL:
https://aclanthology.org/2020.acl-main.709.doi:10.</p>
          <p>18653/v1/2020.acl-main.709.
[15] R. Barzilay, N. Elhadad, Sentence Alignment for</p>
          <p>Monolingual Comparable Corpora, in:
Proceedings of the 2003 Conference on Empirical Methods
in Natural Language Processing, Sapporo, Japan,
2003, pp. 25–32. URL: https://aclanthology.org/</p>
          <p>W03-1004.
[16] M. Popović, chrF: character n-gram F-score for
autoResponse
0 Dear Jim H, Thanks for your review, and feedback after your recent stay.,
1 We are happy to know you enjoyed your stay, and that our renovations at the hotel did not disturb you at all!,
2 We do apologize for the missteps with the housekeeping department, and appreciate your comments.,
3 We do our best to extend the Marriott standards, and quality our guests have come to expect, and have taken note of
your feedback for our team to review.,
4 We appreciate your review, and hope you will return to enjoy all of the new renovations being done at the hotel!</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>