=Paper= {{Paper |id=Vol-2974/invited1 |storemode=property |title=Presentation of the expedia group RecTour Research Dataset |pdfUrl=https://ceur-ws.org/Vol-2974/invited1.pdf |volume=Vol-2974 |authors=Adam Woznica,Jan Krasnodebski |dblpUrl=https://dblp.org/rec/conf/rectour/WoznicaK21 }} ==Presentation of the expedia group RecTour Research Dataset== https://ceur-ws.org/Vol-2974/invited1.pdf
  Expedia Group RecTour Research Dataset

  ADAM WOZNICA∗ and JAN KRASNODEBSKI∗, Expedia Group, Switzerland
  This document provides details on the dataset that Expedia Group released to the RecTour community at the 15th ACM Conference
  on Recommender Systems. This dataset is based on real traveler lodging searches and bookings on Brand Expedia websites, which
  have been anonymized to protect identities of consumers and suppliers. The intention is to provide the recommendation system
  research community, and more specifically travel researchers, an open and rich dataset for their work. The motivation for this dataset
  was multiple requests originating from Expedia Group-sponsored competitions, where participants wanted to use the data that was
  provided for research purposes. This dataset was designed to meet that specific demand while preserving confidentiality.

  Additional Key Words and Phrases: datasets

  1   INTRODUCTION
  Expedia Group is the world’s travel platform that offers consumers a broad selection of travel products across brands
  such as Expedia, Hotels.com and Vrbo. In 2019, bookings exceeded $107 billion and the platform served hundreds of
  millions of travelers [4].
      To foster research in recommendation systems for travel, Expedia Group has provided a real-world dataset that
  consists of lodging shopping and purchase data. This builds upon Expedia Group’s previous efforts to share data
  with recommendation system and tourism researchers via competitions [6, 7] and educational challenges [1, 3].
  Participants were often interested in using the contest data for additional research of their own. However,
  datasets from contests are not directly fit for general research, as they are designed for the smooth operation of a
  specific competition, which imposes requirements unrelated to research uses such as doctoral theses or other
  academic work. The authors consulted with leading researchers from the RecTour community [2] to create a dataset
  inspired by these competitions but oriented towards research use. There was also a perceived desire within the
  wider RecSys community for datasets similar in concept to MovieLens [5] in other fields, in order to provide diversity
  and additional avenues for recommendation research.
      The dataset is available under a Creative Commons license, subject to appropriate acknowledgement.

  2   DATASET
  The Expedia Group dataset consists of global lodging shopping and purchase data from consumers in multiple countries
  across tens of thousands of destinations. The data are organized around a set of “search result impressions”, i.e. the
  ordered list of properties that a consumer sees after a lodging search at one of the Brand Expedia websites. The user
  response is provided as a click on a property and/or a purchase of a property room. Only clicks and purchases that
  occurred after a search and before the next search, within a 180-minute time limit, are attributed to that search.
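  The attribution rule above can be sketched as follows. This is only an illustration for a single user: the function
  name attribute_events and the tuple layouts are hypothetical, the 180-minute limit is assumed to be inclusive, and
  the released dataset already contains the attributed labels, so this step does not need to be repeated.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=180)  # time limit stated in the text (assumed inclusive)


def attribute_events(searches, events):
    """Attribute each click/purchase to the most recent preceding search.

    searches: list of (search_id, timestamp) tuples for one user (hypothetical layout)
    events:   list of (timestamp, kind) tuples, kind in {"click", "purchase"}
    Returns {search_id: [(timestamp, kind), ...]}.
    """
    searches = sorted(searches, key=lambda s: s[1])
    attributed = {search_id: [] for search_id, _ in searches}
    for ts, kind in sorted(events):
        # Searches that happened at or before the event.
        prior = [(sid, t) for sid, t in searches if t <= ts]
        if not prior:
            continue  # event before any search: nothing to attribute it to
        sid, t = prior[-1]
        # prior[-1] is the latest search before the event, so the event
        # already lies before the next search; only the window is checked.
        if ts - t <= WINDOW:
            attributed[sid].append((ts, kind))
    return attributed
```

  Events that fall outside the window, or that precede every search, are simply left unattributed in this sketch.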
      A property refers to one of over a million hotels, vacation rentals, apartments, B&Bs, hostels and other properties
  appearing on Brand Expedia’s websites. Room types are not distinguished and the data can be assumed to apply to the
  least expensive room type.
      The data span a period from 2021-06-01 to 2021-07-31 and contain searches for a random sample of consumers who
  made at least one click during the above time frame. Consumers who booked more than 4 distinct properties during
  this period are excluded. The data span more than 800k unique users and approximately 2.5M searches, and include
  desktop and mobile device traffic. The data also include traveler inputs, such as applied filters and selected sort
  types (e.g. price ascending).
      Figure 1 outlines the relationship between the search and property data in the dataset and the values displayed on
  the Brand Expedia site. Figure 2 outlines the click and purchase pathways on Brand Expedia’s site.

                                                 Fig. 1. Data labels as seen on Brand Expedia sites.

  ∗ Both authors contributed equally to this research.

  Authors’ address: Adam Woznica, awoznica@expediagroup.com; Jan Krasnodebski, jkrasnodebski@expediagroup.com, Expedia Group, Rue du 31
  Décembre 40-42, Geneva, Switzerland, 1207.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


            2.1    Data Anonymization and Resampling
            Several steps have been taken to anonymize the data and obfuscate the true data distribution to protect users and
            commercial sensitivities.
               First, the point_of_sale, geo_location_country and destination_id columns were mapped to frequency-based indexes.
            The prop_id column was indexed based on a random order. Next, the distributions of the following categorical attributes
            were obfuscated by randomly changing proportions of users:

                  • point_of_sale
                  • geo_location_country
                  • destination_id
                  • sort_type
                  • is_mobile








                                Fig. 2. Representation of click and purchase on Brand Expedia sites.


  For example, the proportion of mobile searches (given by the is_mobile column) is similar but not identical to the “true”
  proportion. Finally, we changed proportions of the num_clicks and is_trans “label” attributes at the property (prop_id)
  level. In other words, the click-through rate (CTR) and conversion rate (CVR) at the property level computed from
  the above attributes do not exactly match the “true” CTR and CVR values.
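  As an illustration of the property-level rates mentioned above, the following sketch computes CTR and CVR from
  impression records. It assumes the impressions have already been parsed into dictionaries with prop_id, num_clicks
  and is_trans keys. The paper does not give exact definitions, so CTR is assumed here to be the share of impressions
  with at least one click and CVR the share of impressions with a transaction; the function name is hypothetical.

```python
from collections import defaultdict


def property_rates(impressions):
    """Return {prop_id: (ctr, cvr)} from an iterable of impression dicts.

    Assumed definitions (not taken from the paper):
      ctr = impressions with num_clicks > 0 / all impressions of the property
      cvr = impressions with is_trans true  / all impressions of the property
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    booked = defaultdict(int)
    for imp in impressions:
        pid = imp["prop_id"]
        shown[pid] += 1
        if imp["num_clicks"] > 0:
            clicked[pid] += 1
        if imp["is_trans"]:
            booked[pid] += 1
    return {pid: (clicked[pid] / n, booked[pid] / n) for pid, n in shown.items()}
```

  Because the label proportions were deliberately changed, rates computed this way are suitable for relative
  comparisons within the dataset rather than as estimates of the real values.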

  2.2   Attributes
  In this section we provide a detailed list of attributes.

                                                   Table 1. Attribute description.


    Attribute Name               DataType       Description                                            Comments
    user_id                      String         Unique user id (i.e. browser cookie)
    search_id                    String         Unique search id
    search_timestamp             Timestamp      Date and time of the search                            Rounded to minutes







    point_of_sale                Integer        ID of the Expedia point of sale (e.g. Expedia.com,     Frequency-based indexing.
                                                Expedia.co.uk, Expedia.fr, ...)                        Obfuscated true distribution.
    geo_location_country         Integer        The ID of the country in which the consumer is         Frequency-based indexing.
                                                located                                                Obfuscated true distribution.
    is_mobile                    Boolean        Whether the search was made from a mobile device       Obfuscated true distribution.
    destination_id               Integer        ID of the destination where the hotel search was       Obfuscated true distribution.
                                                performed
    checkin_date                 Date           Stay start date
    checkout_date                Date           Stay stop date
    adult_count                  Integer        The number of adults specified in the search
    child_count                  Integer        The number of children specified in the search
    infant_count                 Integer        The number of infants specified in the search
    room_count                   Integer        Number of rooms specified in the search
    sort_type                    String         Sort type                                              Obfuscated true distribution.
    applied_filters              String         Pipe delimited list of applied filters. Each filter is Anonymized Property Name and
                                                identified by its name and value. Sample value:        Point of Interest filters.
                                                STAR:4.0|LODGING:HOTEL
    impressions                  List[Impr]     "|" delimited list of impressions. Each impression
                                                consists of the following "," delimited attributes:
                                                • rank
                                                • prop_id
                                                • is_travel_ad
                                                • review_rating
                                                • review_count
                                                • star_rating
                                                • is_free_cancellation
                                                • is_drr
                                                • price_bucket
                                                • num_clicks
                                                • is_trans
    Impr.rank                    Integer        Hotel position on Expedia’s search results page.
    Impr.prop_id                 Long           The ID of the property. It matches prop_id from        Indexed based on a random order.
                                                Table 2.
    Impr.is_travel_ad            Boolean        Whether the impressed property is a travel ad
                                                (labelled "Ad"; a pay-per-click advertisement).







    Impr.review_rating           Float          The mean customer review score for the property on
                                                a scale out of 5, rounded to the nearest integer. A 0
                                                means there have been no reviews; null means that
                                                the information is not available.
    Impr.review_count            Integer        The number of reviews for the property, rounded to
                                                the nearest 25.
    Impr.star_rating             Float          The star rating of the hotel, from 1 to 5. A null
                                                indicates that the property has no stars, or that the
                                                star rating is not known or cannot be publicized.
    Impr.is_free_cancellation    Boolean        Whether a booking can be cancelled without extra fees.
    Impr.is_drr                  Boolean        Whether the property had a discount price reduction
                                                specifically displayed ("strikeout" price).
    Impr.price_bucket            Integer        Price bucket (1-5) based on percentile of the
                                                distribution of impressed prices; lower values of
                                                price_bucket correspond to lower prices. A null value
                                                means that the property was not available.
    Impr.num_clicks              Integer        Number of clicks within 180 minutes                    Obfuscated true distribution.
    Impr.is_trans                Boolean        Whether there was a transaction within 180 minutes     Obfuscated true distribution.
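  The delimited impressions and applied_filters columns described in Table 1 can be unpacked along the following
  lines. This is a minimal sketch, not an official loader: the function names are hypothetical, and it assumes that
  booleans are encoded as "0"/"1" and that null values appear as empty fields, neither of which is specified in the
  paper. The field order follows the attribute list of the impressions row.

```python
def _to_bool(text):
    # Assumption: booleans are stored as "0" / "1".
    return text == "1"


def _optional(cast):
    # Assumption: a null value is stored as an empty field.
    def convert(text):
        return None if text == "" else cast(text)
    return convert


# Field order as listed for the impressions attribute in Table 1.
IMPRESSION_FIELDS = [
    ("rank", _optional(int)),
    ("prop_id", _optional(int)),
    ("is_travel_ad", _optional(_to_bool)),
    ("review_rating", _optional(float)),
    ("review_count", _optional(int)),
    ("star_rating", _optional(float)),
    ("is_free_cancellation", _optional(_to_bool)),
    ("is_drr", _optional(_to_bool)),
    ("price_bucket", _optional(int)),
    ("num_clicks", _optional(int)),
    ("is_trans", _optional(_to_bool)),
]


def parse_impressions(raw):
    """Split a "|"-delimited impressions string into a list of dicts."""
    if not raw:
        return []
    records = []
    for item in raw.split("|"):
        values = item.split(",")
        if len(values) != len(IMPRESSION_FIELDS):
            raise ValueError(f"expected {len(IMPRESSION_FIELDS)} fields, got {len(values)}")
        records.append({name: conv(v) for (name, conv), v in zip(IMPRESSION_FIELDS, values)})
    return records


def parse_filters(raw):
    """Split e.g. "STAR:4.0|LODGING:HOTEL" into (name, value) pairs."""
    if not raw:
        return []
    return [tuple(part.split(":", 1)) for part in raw.split("|")]
```

  For example, parse_filters applied to the sample value from Table 1 yields the pairs (STAR, 4.0) and
  (LODGING, HOTEL), both kept as strings.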




  2.2.1 Property amenities. In addition to the main dataset from Table 1 we also released a property amenities dataset
  described in Table 2. This dataset spans approximately 1.5 million properties. Properties from the main table which
  cannot be matched with properties from the amenities table can be assumed to have missing amenities.
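  Joining the two tables on prop_id amounts to a left join in which unmatched properties receive missing amenity
  values. A minimal sketch follows, assuming the amenities table has been loaded into a dictionary keyed by prop_id
  and that impressions are dictionaries with a prop_id key; the function name and the shortened column list are
  illustrative only.

```python
# A subset of the Table 2 columns, shortened for brevity.
AMENITY_COLUMNS = ["AirConditioning", "FreeBreakfast", "FreeWiFi"]


def attach_amenities(impressions, amenities_by_prop):
    """Left-join amenity flags onto impressions via prop_id.

    impressions:       list of dicts, each with a "prop_id" key
    amenities_by_prop: {prop_id: {amenity_name: bool}}
    Properties absent from the amenities table get None (missing) for
    every amenity column, rather than False.
    """
    enriched = []
    for imp in impressions:
        row = dict(imp)  # copy so the input records are not mutated
        amenities = amenities_by_prop.get(imp["prop_id"])
        for col in AMENITY_COLUMNS:
            row[col] = None if amenities is None else amenities.get(col)
        enriched.append(row)
    return enriched
```

  Keeping unmatched amenities as None rather than False preserves the distinction between "not offered" and
  "unknown".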


  3     CONCLUSIONS
  Expedia Group has provided a dataset based on real traveler behavior specifically for academic researchers and students.
  This dataset should address the demand expressed during past competitions and events. It can also be used by
  instructors for courses. Feedback is welcome on how we can improve this dataset in the future, and on what other
  datasets may be useful for the RecTour and recommendation system research community.


  4     ACKNOWLEDGEMENTS
  We would like to acknowledge Julia Neidhardt for her initiative in proposing an industry-based real-world
  dataset for recommendation system and tourism researchers, and for her efforts to make it a reality at RecTour 2021.
  We also thank Dr. Wolfgang Wörndl for his contribution to this project.





                                                                    Table 2. Property amenities table.

                                    Attribute Name                        DataType                      Comments
                                    prop_id                                 Long           It matches Impr.prop_id from Table 1.
                                    AirConditioning                       Boolean
                                    AirportTransfer                       Boolean
                                    Bar                                   Boolean
                                    FreeAirportTransportation             Boolean
                                    FreeBreakfast                         Boolean
                                    FreeParking                           Boolean
                                    FreeWiFi                              Boolean
                                    Gym                                   Boolean
                                    HighSpeedInternet                     Boolean
                                    HotTub                                Boolean
                                    LaundryFacility                       Boolean
                                    Parking                               Boolean
                                    PetsAllowed                           Boolean
                                    PrivatePool                           Boolean
                                    SpaServices                           Boolean
                                    SwimmingPool                          Boolean
                                    WasherDryer                           Boolean
                                    WiFi                                  Boolean


            REFERENCES
            [1] 2021. EXPEDIA GROUP X ENTER21 Data Science Competition Socially Responsible and Inclusive Tourism.
                https://enter-conference.org/compete/expedia-group-x-enter21/.
            [2] 2021. RecTour: Workshop on Recommenders in Tourism. https://recsys.acm.org/recsys21/rectour/.
            [3] American Statistical Association. 2017. ASA DataFest 2017.
                https://www.dropbox.com/s/eafdup47fpcqvam/UofT%20Stats%20data%20than%20v5%20-%20FINAL.mp4?dl=0.
            [4] Expedia Group. 2020. Form 10-K. https://s27.q4cdn.com/708721433/files/doc_financials/2020/ar/Expedia-Group-Annual-Report.pdf.
            [5] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (Dec.
                2015), 19 pages. https://doi.org/10.1145/2827872
            [6] Adam Woznica and Jan Krasnodebski. 2013. Personalize Expedia Hotel Searches - ICDM 2013 Learning to rank hotels to maximize purchases.
                https://www.kaggle.com/c/expedia-personalized-sort.
            [7] Adam Woznica and Jan Krasnodebski. 2016. Expedia Hotel Recommendations. Which hotel type will an Expedia customer book?
                https://www.kaggle.com/c/expedia-hotel-recommendations.



