<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Accommodation Review Ranking using Sentence Embeddings and Nearest-Neighbor Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rajorshi Chaudhuri</string-name>
          <email>rajorshi.chaudhuri@bookmyshow.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pranav Bhatki</string-name>
          <email>pranav.bhatki@bookmyshow.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yash Dubal</string-name>
          <email>yash.dubal@bookmyshow.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Recommender Systems, Bari, Italy.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>BookMyShow</institution>
          ,
          <addr-line>Mumbai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Recommender Systems</institution>
          ,
          <addr-line>RecTour 2024 Challenge, Tourism, Hotels, Accommodations, Reviews, SBERT, BallTree</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present our 2nd place solution to the RecTour 2024 Challenge. The competition task was to rank the reviews of accommodations for users based on the characteristics of the user and the accommodation. For the final solution, our team, “BMS Hunters”, used a combination of sentence embeddings and nearest-neighbor search. The implementation of our model is available on GitHub.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the vast number of reviews available for popular accommodations, users often struggle to quickly discover the most relevant and helpful ones. An intuitive solution to this problem would be to display the reviews solely in chronological order or based on “helpfulness” votes. However, this does not account for the preferences or specific needs of the individual user. For example, a family might value different aspects of a hotel compared to a solo traveler, and the ranked reviews should reflect these nuances. Furthermore, many reviews lack “helpfulness” votes, leading to presentation bias where only a few reviews dominate visibility.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset Description</title>
      <p>The competition training dataset consisted of three files:</p>
      <p>1. Users - This file contains information regarding anonymized users and accommodation features. It has over 1.6 million unique rows and includes the following columns:
• user_id: Unique identifier for the user.
• accommodation_id: Unique identifier for the accommodation.
• guest_type: Type of guest (e.g., solo, couple, family).
• guest_country: Country of the guest.
• room_nights: Number of room nights booked.
• month: Month of the booking.
• accommodation_type: Type of accommodation.
• accommodation_country: Country of the accommodation.
• accommodation_score: Overall score of the accommodation.
• accommodation_star_rating: Star rating of the accommodation.
• location_is_ski: Indicator if the location is a ski resort.
• location_is_beach: Indicator if the location is a beach destination.
• location_is_city_center: Indicator if the accommodation is located in the city center.</p>
      <p>2. Review - This file contains information regarding reviews. It has over 1.6 million rows and includes the following columns:
• review_id: Unique identifier for the review.
• accommodation_id: Unique identifier for the accommodation being reviewed.
• review_title: Title of the review.
• review_positive: Positive comments in the review.
• review_negative: Negative comments in the review.
• review_score: Score given in the review.
• review_helpful_votes: Number of helpful votes the review received.</p>
      <p>3. Matches - This file contains the true labels for the dataset, representing positive examples of user-accommodation-review relationships. It has over 1.6 million rows and includes the following columns:
• user_id: Unique identifier for the user.
• accommodation_id: Unique identifier for the accommodation.
• review_id: Unique identifier for the review.</p>
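      <p>To make the relationship between the three files concrete, the following minimal sketch joins a few toy rows in memory. Only the column names come from the description above; the row values and the helper names candidate_reviews and true_review are hypothetical, and the sketch assumes that each user-accommodation pair has exactly one matched review.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Toy rows that mirror the column names described above (all values are made up).
users = [
    {"user_id": "u1", "accommodation_id": "a1", "guest_type": "family"},
    {"user_id": "u2", "accommodation_id": "a2", "guest_type": "solo"},
]
reviews = [
    {"review_id": "r1", "accommodation_id": "a1", "review_score": 9.0},
    {"review_id": "r2", "accommodation_id": "a1", "review_score": 7.5},
    {"review_id": "r3", "accommodation_id": "a2", "review_score": 8.0},
]
matches = [
    {"user_id": "u1", "accommodation_id": "a1", "review_id": "r2"},
    {"user_id": "u2", "accommodation_id": "a2", "review_id": "r3"},
]


def candidate_reviews(review_rows):
    """Group review ids by accommodation: the candidate set to be ranked."""
    by_accommodation = defaultdict(list)
    for row in review_rows:
        by_accommodation[row["accommodation_id"]].append(row["review_id"])
    return dict(by_accommodation)


def true_review(match_rows):
    """Map each (user_id, accommodation_id) pair to its labelled review id."""
    return {(m["user_id"], m["accommodation_id"]): m["review_id"] for m in match_rows}


candidates = candidate_reviews(reviews)
labels = true_review(matches)

# For every user row: the reviews that must be ranked, and the labelled positive.
for user in users:
    key = (user["user_id"], user["accommodation_id"])
    print(key, candidates[user["accommodation_id"]], labels[key])
```
]]></preformat>
      <p>In this reading, the ranking task for a user is to order the candidate reviews of the booked accommodation so that the labelled review appears as close to the top as possible.</p>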
      <p>
        Each accommodation in the dataset is associated with a minimum of 10 unique reviews. To ensure
data quality, each review is required to cover at least 3 distinct topics, as identified by the text2topic method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Consequently, reviews that are too simplistic, such as those containing only the word “awesome”, are
filtered out because they do not provide sufficient informative content.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Model Architectures</title>
      <p>Our approach to the RecTour 2024 challenge evolved through several stages. We describe our three
main models below:</p>
      <sec id="sec-3-1">
        <title>3.1. Baseline</title>
        <p>Our initial approach was a simple “review score” based ranking model. This model assumed that reviews
with higher ratings were more informative and helpful to users. Consequently, the reviews of each
accommodation were ranked in descending order of their scores, without incorporating additional user
or accommodation characteristics. As a baseline model, it provided a reference point for evaluating
more advanced models.</p>
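        <p>A minimal sketch of this baseline is given below. It assumes reviews are available as dictionaries keyed by the column names from Section 2; the function name rank_by_review_score and the toy rows are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def rank_by_review_score(review_rows):
    """Return review ids per accommodation, highest review_score first."""
    grouped = defaultdict(list)
    for row in review_rows:
        grouped[row["accommodation_id"]].append(row)
    ranking = {}
    for accommodation_id, rows in grouped.items():
        # sorted() is stable, so reviews with equal scores keep their input order.
        ordered = sorted(rows, key=lambda r: r["review_score"], reverse=True)
        ranking[accommodation_id] = [r["review_id"] for r in ordered]
    return ranking


toy_reviews = [
    {"review_id": "r1", "accommodation_id": "a1", "review_score": 7.0},
    {"review_id": "r2", "accommodation_id": "a1", "review_score": 9.5},
    {"review_id": "r3", "accommodation_id": "a1", "review_score": 8.0},
    {"review_id": "r4", "accommodation_id": "a2", "review_score": 6.0},
]
ranking = rank_by_review_score(toy_reviews)
print(ranking)
```
]]></preformat>
        <p>Because the ordering depends only on the accommodation, every user who booked the same accommodation receives the same ranked list.</p>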
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Sentence-BERT for User and Review Embeddings</title>
        <p>Although the initial review score-based model provided a solid baseline, it did not take advantage of
the textual content of the reviews or the detailed profiles of users and accommodations. To improve
the personalization of review rankings, we therefore incorporated the review content.</p>
        <p>
          To achieve this, we selected Sentence-BERT [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] for its ability to generate high-quality sentence
embeddings. Sentence-BERT is particularly well suited for capturing the semantic meaning of sentences,
which is crucial for understanding and ranking reviews based on their relevance to user preferences.
        </p>
        <p>We used the following features to generate user and review embeddings:
• User Features: guest_type, guest_country, accommodation_type, accommodation_country.
• Review Features: review_title, review_positive, review_negative.</p>
        <p>For user embeddings, we selected these features because we did not have a direct user history. Instead,
we aimed to create user cohorts based on the guest type (e.g., solo, family, etc.), the origin of the guest,
and the common travel patterns between guest types and the host countries they usually visit. This
allowed us to approximate user preferences more effectively.</p>
        <p>We then encoded these user profiles and review texts using the Sentence-BERT model. By computing
the cosine similarity between the encoded user profiles and reviews, we ranked the reviews according
to their relevance to each user’s profile. This method allowed us to capture deeper contextual relevance
between users and reviews beyond simple scoring.</p>
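        <p>The ranking step described above can be sketched as follows. This is an illustration under stated assumptions, not our released implementation: the way the feature fields are joined into one string, the helper names, and the toy_encode function are hypothetical. In particular, toy_encode is only a keyword-count stand-in so that the example runs without downloading a model; in practice the encode argument would wrap a Sentence-BERT model, for example the encode method of a SentenceTransformer object from the sentence-transformers package.</p>
        <preformat><![CDATA[
```python
import math


def user_profile_text(user):
    """Join the four user features listed in the text into one string."""
    fields = ("guest_type", "guest_country", "accommodation_type", "accommodation_country")
    return " ".join(str(user[f]) for f in fields)


def review_text(review):
    """Join the three review text fields, skipping empty or missing ones."""
    fields = ("review_title", "review_positive", "review_negative")
    return " ".join(review[f] for f in fields if review.get(f))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_reviews(user, reviews, encode):
    """Rank review ids by cosine similarity to the user profile, best first."""
    user_vec = encode(user_profile_text(user))
    scored = [(cosine(user_vec, encode(review_text(r))), r["review_id"]) for r in reviews]
    return [rid for _, rid in sorted(scored, key=lambda pair: pair[0], reverse=True)]


def toy_encode(text, vocab=("family", "solo", "kids", "quiet", "hostel")):
    """Stand-in encoder that counts a few hypothetical keywords. NOT Sentence-BERT."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


user = {"guest_type": "family", "guest_country": "IN",
        "accommodation_type": "hotel", "accommodation_country": "IT"}
reviews = [
    {"review_id": "r1", "review_title": "Quiet hostel",
     "review_positive": "great for solo trips", "review_negative": ""},
    {"review_id": "r2", "review_title": "Family stay",
     "review_positive": "kids loved the pool", "review_negative": "small room"},
]
ranked = rank_reviews(user, reviews, toy_encode)
print(ranked)
```
]]></preformat>
        <p>Passing the encoder in as an argument keeps the ranking logic independent of the embedding model, so the stand-in can be replaced by a real sentence encoder without changing rank_reviews.</p>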
      </sec>
      <sec id="sec-3-3">
        <title>3.3. ProfileRec: Sentence-BERT + BallTree</title>
        <p>Our final model, ProfileRec, extended the previous approach by incorporating additional user features
and utilizing BallTree for efficient nearest-neighbor search. BallTree is a spatial data structure that
partitions the data into a binary tree for faster querying of nearest neighbors. This is particularly
advantageous when working with embeddings generated by Sentence-BERT, as it speeds up the process
of finding similar items.</p>
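        <p>As an illustration of this kind of lookup, the sketch below builds a BallTree with scikit-learn over review vectors and queries it with user vectors. It shows one plausible arrangement rather than our exact pipeline: the random vectors stand in for Sentence-BERT embeddings, and the sizes, the leaf_size value, and the choice to index reviews rather than users are assumptions. The vectors are L2-normalized first because, for unit vectors a and b, the squared Euclidean distance equals 2 - 2 cos(a, b), so the nearest neighbors under Euclidean distance are also the most similar under cosine similarity.</p>
        <preformat><![CDATA[
```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)

# Hypothetical stand-ins for Sentence-BERT outputs: 1000 review vectors, 5 user vectors.
review_vecs = rng.normal(size=(1000, 32))
user_vecs = rng.normal(size=(5, 32))

# L2-normalize so that Euclidean distance orders neighbors like cosine similarity.
review_vecs /= np.linalg.norm(review_vecs, axis=1, keepdims=True)
user_vecs /= np.linalg.norm(user_vecs, axis=1, keepdims=True)

# Build the tree once over the review embeddings, then query all users in one batch.
tree = BallTree(review_vecs, leaf_size=40, metric="euclidean")
distances, indices = tree.query(user_vecs, k=10)

# Cross-check the first user against brute-force cosine similarity.
brute_force = np.argsort(-(review_vecs @ user_vecs[0]))[:10]
print(indices[0])
print(brute_force)
```
]]></preformat>
        <p>The tree is built once and reused for every query, which is where the saving over recomputing all pairwise similarities for each user comes from.</p>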
        <p>In addition to the features used in the previous model, we included “room_nights” and “month” as
additional user features. The decision to incorporate room nights was informed by our observation of a
positive trend between review ratings and the number of room nights booked, as illustrated in Figure 1.</p>
        <p>This scatter plot shows that accommodations with higher review scores tend to have more room
nights, suggesting that users who stay longer might leave more detailed and potentially higher-rated
reviews. We included “month” as a temporal feature to account for potential seasonal variations in
review scores and booking patterns.</p>
        <p>The inclusion of these features, combined with the BallTree-based nearest-neighbor search, enhanced
the ability of the model to rank reviews based on a more nuanced understanding of user preferences
and accommodation characteristics.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we presented our solution for the RecTour 2024 Challenge. Our final approach,
ProfileRec, combined Sentence-BERT embeddings with BallTree for efficient nearest-neighbor search,
significantly enhancing ranking accuracy. We hope that our approach provides valuable insights and
contributes to the development of cohort-based recommendation systems in the RecSys field.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors wish to thank the organizers of the RecTour 2024 Challenge for the opportunity to participate
in this exciting competition.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Livne</surname>
          </string-name>
          , E. Fainman,
          <article-title>Booking.com RecTour 2024 challenge</article-title>
          , in: ACM RecSys RecTour '24, Bari, Italy,
          <year>2024</year>
          . Retrieved from https://workshops.ds-ifs.tuwien.ac.at/rectour24/rectour-2024-challenge/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beladev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kleinfeld</surname>
          </string-name>
          , E. Frayerman,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shachar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fainman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Assaraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizrachi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Text2topic: Multi-label text classification system for efficient topic detection in user generated content with zero-shot capabilities</article-title>
          , in: M.
          <string-name>
            <surname>Wang</surname>
          </string-name>
          , I. Zitouni (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          , in:
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>