Recommending the Duration of Stay in Personalized
Travel Recommender Systems
Abhishek Agarwal1,2 , Linus W. Dietz1,2
1
    Department of Informatics, Technical University of Munich, Garching, 85748, Germany
2
    Both authors equally contributed to the paper


                                         Abstract
                                         The main focus of recommender systems research has been recommending fitting items to the users.
                                         However, in some domains, not only which item but also the quantity the target user should consume
                                         could be part of the recommendation. In this work, we tackle the under-researched problem of recom-
                                         mending the duration of stay in the domain of destination recommendation. Using two data sets, one
                                         based on hotel bookings and the other on mobility derived from geotagged Tweets, we perform extensive
                                         feature engineering with unsupervised learning to discover types of users and graph embeddings of the
                                         cities. In our experiments, we compare the performance of supervised learning algorithms with varying
                                         features to statistical baselines for predicting the duration of stay at a destination. The results underline
                                         the task’s difficulty: we obtain the best results for the hotel bookings data set using personalized mobility
                                         embeddings with CatBoost. At the same time, the simple strategy of recommending the mode duration
                                         of all users is competitive in the noisy Twitter data set.

                                         Keywords
                                         destination recommendation, duration of stay, feature engineering, graph embeddings, offline evaluation,


1. Introduction
The mainstream of recommender systems research has been to use sparse matrices of ratings to
predict the suitability of items for a given user with specific needs. Since the early beginnings
of Collaborative Filtering, recommendation algorithms have reached tremendous maturity [1]
to the extent that it is unclear whether novel approaches can outperform long-established
algorithms in terms of ranking accuracy [2]. Current recommendation challenges include
ensuring fairness [3, 4], multi-sided markets [5], and sequential recommendations [6].
   Unlike these recommendation research problems, which are concerned with choosing suitable
items and determining rankings, we analyze the problem of recommending the quantity in
which a given user should consume an item. Concretely, our goal is to evaluate algorithms that
compute the optimal duration of stay at a touristic destination based on hotel booking data and
traveler mobility from location-based social networks. In tourism recommendations, this is a
significant challenge as determining the duration of stay at specific locations is a cornerstone
of online tour planners [7, 8]. Note that this problem differs from the question of how many


RecSys Workshop on Recommenders in Tourism (RecTour 2022), September 22th, 2022, co-located with the 16th ACM
Conference on Recommender Systems, Seattle, WA, USA
Envelope-Open abhishek.agarwal@tum.de (A. Agarwal); linus.dietz@tum.de (L. W. Dietz)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
items we should recommend, e.g., Knapsack optimization approaches for resolving the Tourist
Trip Design Problem [8, 9], or recommending items multiple times within a travel package [10].
   To compute a personalized duration of stay at a travel destination, Dietz and Wörndl suggest
an algorithm that relies on a statistical analysis of a user’s previous trip duration relative to other
travelers [11]. Our work has a similar goal; however, we present a systematic experimental
design based on two different data sets. We compare the predictive performance of multiple
statistical baselines – including the method proposed in [11] – to classic supervised machine
learning algorithms with different features that we derive from unsupervised machine learning.
This feature engineering process is specific to this domain and gives further insights into
essential factors for algorithmic travel planning. Framing the recommendation problem as a
quantitative prediction task instead of using a survey to evaluate the algorithms makes the
results reproducible and comparable for future research.
   The structure of the paper is the following: In the upcoming Section 2, we survey the limited
prior work on this open research problem. Section 3 is devoted to the two data sets and the
pre-processing steps employed. In Section 4, we describe our features engineering efforts, which
we will use in our experiments described in Section 5. We discuss our results in Section 6 and
finally draw our conclusions in Section 7.


2. Related Work
Actively recommending the duration of item consumption received limited attention from the
scientific community. The expected duration of item consumption plays a role in sequence-aware
recommendation scenarios, such as video recommendation [12] or content recommendation on
social media to predict dwell times [13]. It is distinct from our approach since the problem in
our paper is to recommend the duration of stay, not which destination to visit. The approaches
mentioned above use the predicted item consumption duration to optimize business metrics
such as the overall time spent on the platform [14].
   Most literature mentioning recommending the amount of item consumption can be found
in the area of travel recommendation, where it is often resolved in an ad-hoc manner [9, 15]
or regarded as future work [16, 17]. For example, in a region recommender system, the travel
regions are first scored based on the user preferences and combined into a composite trip [9].
The authors’ proposal to determine the duration of stay is to apply a gradual decrease in the
score of the travel regions by 5-10% per week, concluding the stay at the current region as soon
as the score of the following region exceeds the score of the first.
   The duration of stay is also necessary when planning a stay in points of interest. However,
the approaches we found in the literature also do not systematically address this question.
For example, the typical duration of stays for tourists in different point-of-interest categories
has been analyzed using a Foursquare data set to improve context-aware services [18], but
this approach is not personalized. Other systems incorporate average durations of stay to
recommend trips within a city [15, 19] but only report the durations to the user as part of
the overall trip duration. These durations can stem from commercial services such as Google
Maps, which showcase information based on human mobility derived from mobile phones.
Such information is not personalized, and due to the aggregation, we can not use it to compute
recommendations for individual users.
   It leaves us with the approach proposed by Dietz and Wörndl [11], which analyzes past user
trips from Foursquare1 , a location-based social network, to understand their pace of travel. The
mean percentile duration of the previously visited cities quantifies a given user’s pace. To find
the corresponding period of stay for the target destination, the mean percentile is computed
using the duration distribution for all the users in that destination. In our work, we aim to
systematically evaluate this approach with our novel machine learning efforts on two data sets
described in the upcoming section.


3. Data Sets and Preprocessing
Currently, there are few suitable data sets for determining the duration of stay at a destination on
a global level. We conduct experiments on two real-world data sets, one released by Booking.com
and a self-collected data set of trips derived from geotagged Tweets.

3.1. Booking.com
The Booking.com Multi-Destination Trips Dataset [17] was made available as a part of the
Booking.com WSDM WebTour 2021 Challenge2 . The original challenge involved the sequential
recommendation problem of predicting the last destination of a four-destination trip; however,
the data set also includes the duration of stay at the individual destinations. Overall, the data set
consists of over a million anonymized hotel bookings making it a suitable data set for predicting
the duration of stays. Unfortunately, the anonymization masks the real destinations, which
prevented us from gaining additional insights. Another artifact of the release of this data set is
that the users originate from only five countries but have traveled to 107 different countries in
total. Since this data set includes trips comprising exactly four destinations, it is a clean data set
with a rather specific portion of the reality of international travel.

3.2. Twitter
To complement the Booking.com data set, we used the Twitter API3 to query user timelines
who have enabled sharing their geolocation in their Tweets. Using the tripmining library4 , we
segmented the users’ mobility into periods of being at home and away from their home city [20].
Consecutive periods abroad are regarded as trips, as long as specific data quality criteria are
fulfilled. Given the nature of check-in-based data, we only know the user’s location when they
tweet, resulting in an incomplete view of the periods between two check-ins. However, given
the large corpus of mobility data available to us, we could work with the subset of trips with
at least one check-in each day, thus ensuring high data quality. Furthermore, to match the
characteristics of the Booking.com hotel reservation data set, we only included multi-destination
trips with four or more destinations in the analysis. Similar to the Booking.com data set, we

    1
      https://foursquare.com
    2
      https://www.bookingchallenge.com
    3
      https://developer.twitter.com/en/docs/twitter-api
    4
      https://github.com/LinusDietz/tripmining
did not include the available geographic information as training features, despite knowing the
real city names in this data set. Thus, we know that most users come from countries where
the platform is prevalent, notably the USA, Europe, and Japan. Since these trips do not suffer
from an artificial selection bias as the Booking.com challenge data set, the Twitter trips have
a balanced ratio of 97 origin countries to 105 destination countries; however, most trips are
domestic.

3.3. Preprocessing
Although more information would be available for the Twitter data set, we decided to use the
same basic features in both data sets. The list of available features for each stay is as follows:
user ID, checkin date, check-out date, user country, destination country, destination ID, and trip
ID.

Table 1
Overview of the two data sets.
                                              Booking.com                 Twitter
                #Users                                96,643                24,146
                #Trips                               734,102               852,131
                #Origin Countries                          5                    97
                #Destination Countries                   107                   105
                Domestic trips                          4.5%                 91.3%
                Date Range               Jan 2016 – Feb 2017   Oct 2010 – Jul 2021


   Even though we have many trips available in both data sets, some travelers and destinations
have an insufficient number of trips. Thus, we discard destinations and users with few trips to
eliminate users and items still in a cold-start phase. For Booking, we keep travelers who have
visited five unique destinations and destinations visited at least 15 times. In the more noisy
Twitter data set, we enforce ten unique destinations per user, and each destination needs to
be visited at least 25 times. We present the overview of the preprocessed data sets in Table 1.
Of the total trips, 95.5% are international trips in the Booking.com data set, whereas 91.3% are
domestic trips in the Twitter data set. The difference shows that the two data sets need to be
handled and analyzed separately since the Booking.com trips are biased towards international
trips. In contrast, the Twitter trips reflect that most travel is actually domestic.
   An initial look at the distribution of the stay duration in Figure 1 shows that most travelers
visit a given destination for a relatively short period, with the mode being 1 in both data sets.

3.4. Splitting into Training and Test Data
As is common in machine learning, we randomly split the stays into a training and testing set.
The test set consists of 20% of the randomly chosen stays of each user. The remaining trips are
used for feature engineering to extract advanced representations based on the preferences and
mobility patterns of the users. These representations and other features are then used to train
different statistical and machine learning models to predict the duration of stays. Consequently,
the predictive performance is evaluated with unseen test data.
                  (a) Booking.com                                    (b) Twitter

Figure 1: Distribution of the duration of stay


4. Feature Engineering
Nowadays, the performance of machine learning models heavily relies on the information
the algorithms can use in the form of directly available but also derived features [21]. For
this problem, we derive additional features based on the mobility patterns of the travelers:
Using cluster analysis, we uncover types of travelers and compute graph embeddings of the
destinations and users. In this way, we transform the information about travelers, trips, and
destinations into a representation suitable for the machine learning algorithms to use this
information to predict the duration of stays.

4.1. Traveler Types Characterization
The characterization of traveler types can be essential for personalized travel recommender
systems. The objective is to discover the users’ travel behavior based on their past check-in
durations and frequencies. There is quite some research about traveler types [22, 23, 24, 25];
however, most relevant for our work is a clustering approach to identify traveler types by
analyzing check-in data from location-based social networks [20], which we adopt to identify
such groups of travelers that we use later to predict the duration of the next trip of a given user.
   Cluster analysis is an unsupervised learning approach to discover the new groups such
that members in the same groups are more similar to other data members in the same group
than those in other groups. After extensive experimentation, we decided to use six features
for the clustering task: number of domestic trips, number of unique domestic cities visited,
mean duration of domestic trips, number of international trips, number of unique international
cities visited, and mean duration of international trips. Since we use the Euclidean Distance,
we normalize these features using min-max scaling. Using the K-Means algorithm to cluster
the data, we separate users into groups in which the users are similar to each other, but the
groups are different from each other. Since this algorithm requires the number of clusters to be
specified, we run the K-means clustering algorithm by varying the number of clusters between
2 and 20 and assessing the cluster fit quality using the average silhouette score. The silhouette
                 (a) Booking.com                                    (b) Twitter

Figure 2: Traveler Types Clustering - Silhouette Scores


score helps us understand how close each point in one cluster is to the neighboring clusters
and, thus, acts as a suitable metric for assessing the number of clusters. The Silhouette score is
normalized from -1 to 1, with 1 being a perfect score, i.e., a dense cluster well-separated from
other clusters. A lower score indicates overlapping clusters with samples very close to the
decision boundary of the neighboring clusters. We plot the two silhouette scores and the sum
of squared errors (SSE) for the two data sets in Figure 2 and Figure 3. Based on these plots,
we can infer that a solution of 𝐾 = 6 clusters is suitable for both data sets, as there is a sharp
decrease in the silhouette score for higher values.


                 (a) Booking.com                                    (b) Twitter

Figure 3: Traveler Types Clustering - Sum of squared error (SSE)


   The final silhouette plots for the discovered clusters for 𝐾 = 6 of Booking.com and Twitter
are presented in Figure 4. The six clusters each represent a distinct type of traveler that is not
uniformly distributed. In the Booking.com data set, for example, the two large clusters, 1 and
5, represent the users who travel relatively rarely and take only short international trips with
an average of 1.28 and 2.07 days. While for Twitter, the largest cluster represents the users
that frequently travel to different cities in their country. To summarize, we have successfully
                          (a) Booking                         (b) Traveler Types Clustering - Twitter

Figure 4: Traveler Types Clustering - Final Silhouette Plot


characterized the different traveler types in our data sets using the clustering approach. The
characterization can help us predict the duration of future trips based on the traveler type label
a user has.

4.2. Traveler Mobility Patterns
The analysis of human mobility patterns has a wide range of applications in several domains
like urban planning, traffic forecasting, epidemic prevention, and location-based services [26].
Since our domain is strongly influenced by mobility, we derive several metrics based on the
mobility patterns manifested in the users’ trips. These patterns can give us insights into the
users’ traveling habits and reveal interconnections between the destinations. For example, it has
been shown that coherent travel regions can be discovered from past user trips [27] and such
features have been proven helpful in the Booking.com WSDM WebTour 2021 Challenge [28].
   However, simply clustering the destinations into regions might not be helpful when dealing
with the anonymized Booking.com data set, as we cannot map the destinations to previously
identified regions. Also, this would lead to further scalability problems in the encoding since the
number of dimensions will significantly increase using, e.g., one-hot encoding when training
different machine learning-based models. Additionally, one-hot encoding treats the categorical
variables as independent and does not consider their relationship. To retain relationship
information between destinations, we automatically learn the representation of such categorical
features in multi-dimensional spaces using entity embeddings [29].

4.2.1. Graph Embeddings
Several studies on learning latent representations of graphs have emerged in recent years. We
can use embedding-based models to learn latent representations from complex data structures
like graphs. These representations map each node in the graph into a low-dimensional space,
providing insights into nodes’ similarity and network structure. The embedding captures the
semantics of the input graph by placing similar nodes close together in the embedding space.
Such embeddings are applicable in many problems ranging from detecting protein-to-protein
interaction in biological networks to friendship recommendations in social networks [30].
Therefore, we explore the idea of learning such latent representations of users’ travel behavior
using node embeddings to predict the duration of the next trip.

4.2.2. Destination Travel Graph Creation
We represent the destination cities as nodes in an undirected graph. To quantify the weight of
the edges between two nodes, we use the ”traveled-together” relation, as our data sets comprise
multi-destination trips. Therefore, we draw an edge between each pair of cities a user visits
on the same trip. For example, if a user first visits New York City to Philadelphia and then
Philadelphia to Washington D.C, we add an edge from New York City to Washington D.C. We
argue that this additional transitive connection captures the underlying mobility better than
omitting the information that the user visited these destinations within the same trip. The final
weight of each edge is the number of co-occurrences of two nodes in the same trip for all trips
in the data set. We account for repetition, i.e., if the same user took the same multi-destination
trip twice during the observation period, it will be counted twice in the travel graph. To reduce
the noise in the graph, we remove all the edges with a weight equal to one, i.e., only one user
has traveled between these two destinations.

Table 2
Destination Travel Graph Metrics
                                           Booking.com     Twitter
                                 #Nodes            5,046     3,523
                                 #Edges           88,623    62,260
                                 Density          0.0069      0.01

   Some key metrics of the resulting graphs of the two data sets are tabulated in Table 2.
Booking.com has slightly more nodes and edges, resulting in Twitter being slightly denser;
however, both graphs are very sparse. Furthermore, analyzing the degree distribution depicted
in Figure 5, one can see that there are relatively few high-degree nodes.


                   (a) Booking                                       (b) Twitter

Figure 5: Empirical Cumulative Distribution Function (ECDF) for the Destination Travel Graphs
4.2.3. Destination Embeddings
The goal of destination embedding is to encode the cities represented by nodes in the graph
so that machine learning algorithms can predict and discover new patterns in such complex
networks. Therefore, finding a method that can efficiently generate representations for these
networks is essential. Popular methods like DeepWalk [31] and Node2Vec [32] use random
walks to generate such embeddings, which is very suitable for capturing mobility patterns.
Other deep learning methods like Structural Deep Network Embeddings [33] and Hierarchical
Attention Networks [34] employ autoencoders and attention mechanisms, which we don’t
expect to add value due to the nature of our task and the size of available datasets and number
of features.
   Experimenting with DeepWalk and Node2Vec, we find these approaches do well at encoding
the destination travel graphs we constructed in the previous section. Both software libraries
transform each node in a graph into a vector by relying on direct encoding and use a decoder
based on the inner product. The local context of nodes is captured using the statistics of
random walks, i.e., random walks of size 𝑙 try to simulate the context of a word with window
size 𝑙 used in SkipGram. DeepWalk runs fixed-length, unbiased random walks starting from
each node, whereas Node2Vec uses flexible, biased random walks that can trade-off between
local and global views of the network. It uses the parameter 𝑝 to control the likelihood of
the walk immediately revisiting a node, and 𝑞 controls the likelihood of the walk revisiting a
node’s one-hop neighborhood. This additional feature made us choose Node2Vec to learn a
low-dimensional destination representation in the two data sets. A fixed-length vector of size
100 represents each destination.
   To create a visual overview, we map the learned Node2Vec city embeddings to 2-D space
using t-distributed stochastic neighbor embedding (t-SNE) [35]. The color-coding in Figure 7
represents countries, although due to the anonymization of the Booking.com data set, we do
not know which country is which. However, we have this information present in our Twitter
data set. As one can see in Figure 7b, countries like USA and Canada, as well as Philipines and
Myanmar, are close in embedding space, indicating awareness of geographical context despite
that this information was not available to the algorithm. Further zooming in, we can also see
that embeddings also capture the relative positions of cities well, as shown in an example with
the cities in the Netherlands in Figure 6.

4.2.4. Traveler Embeddings
The previous section defined an approach to represent the destination; however, we can also
extend the use of such embeddings to personalize the prediction of the duration of stay. For
example, it is relatively common in natural language processing to create a sentence embedding
using the weighted average of word embeddings [36]. Similarly, it is common to summarize a
section of the graph in graph neural networks by averaging the node embeddings and using
these embeddings for several downstream tasks later on [37]. We follow a similar idea to create
traveler embedding that computes the average embedding of all the past cities the user has
visited. The average embedding can help us personalize the recommended duration of stay for
each user since it is a feature of their past travel behavior.
Figure 6: T-SNE Projection for City Embeddings for the Netherlands generated using Destination Travel
Graph


5. Experiments
In this section, we describe how we evaluate the performance of various algorithms utilizing
different settings of the embeddings as mentioned above against some statistical baselines.

5.1. Evaluation Settings
In our experiments, we use both Mean Absolute Error (MAE) and Root Mean Square Error
(RMSE) to measure the performance of our different approaches [38]. Both these metrics
have been widely used to assess the performance of such problems. RMSE does not use the
absolute value, which makes it desirable in many cases, mainly when calculating the gradient
or sensitivity for certain model parameters. Thus, combining both these metrics helps us assess
the performance better.
   The prediction task is to determine the number of calendar days at a destination, as this
information is available in both data sets. It also means that all algorithms that output continuous
values need to be rounded to integers since the durations in the test set are also integers. It
would be misleading to include the decimal places when computing the evaluation metrics [39].
Therefore, we also analyze the effect of rounding on the overall performance of our models.

5.2. Baseline Methods
We compare the machine-learning-based approaches with four simple baselines. These comprise
simple statistical measures and two variations of the method proposed by Dietz and Wörndl [11].

    • User Mean: The mean value of the duration of all the past trips for a given user.
    • User Mode: The mode value of the duration of all the past trips for a given user.
    • User Percentile - Country Level: Computes the user’s trip duration by comparing their
      mean percentile to the quantiles of all other travelers who have visited the same country
      as proposed in [11].
    • User Percentile - City Level: Computes the user’s trip duration by comparing their
      mean percentile to the quantiles of all other travelers who have visited the same city as
      proposed in [11].

5.3. Features Variations
To learn which features are useful for recommending the duration of stay, we evaluate three
variations that take the above-discussed advanced representations and basic features as inputs.
The basic features are trip type (domestic or international), user home country, and destination
country. The three different variations of the representation of the advanced features are as
follows:

    • Mobility One-Hot Encoded (M-OHE): a baseline machine learning-based approach that
      uses the conventional one-hot encoding of the booker and destination country instead of
      embeddings.
    • Mobility Global Embeddings (M-GE): uses city embeddings discussed in Section 4.2.3
      instead of the booker and destination country.
    • Mobility Personalized Embeddings (M-PE): uses city embeddings as well as traveler
      embeddings discussed in Section 4.2.3 and Section 4.2.4, respectively.


Table 3
Features used in the different evaluation scenarios.
                                                            M-OHE         M-GE      M-PE
                               Trip type                        Y           Y            Y
                               Traveler type clustering         Y           Y            Y
                               User home country                Y           N            N
                               Destination country              Y           N            N
                               City embeddings                  N           Y            Y
                               Traveler embeddings              N           N            Y

We summarize the features used in the above approaches in Table 3. We always use the trip
and the traveler types as input features, as they are fundamental to the approach. The M-GE
and M-PE implicitly encode the home and destination countries via the city embeddings; thus,
we do not input this information once more to avoid redundancy.

5.4. Algorithms
To evaluate these variations of features and encodings, we use the decision trees from scikit-
learn5 and Gradient Boosting from CatBoost6 . CatBoost is a state-of-the-art open-source library
    5
        scikit-learn decision trees: https://scikit-learn.org/stable/modules/tree.html
    6
        CatBoost library: https://catboost.ai
for Gradient Boosting on Decision Trees developed by Yandex, which has similar or better
performance than most other Gradient Boosting libraries, even with default parameters [40]. We
chose these algorithms to gain additional analytic insights into the importance and success of the
different embedding strategies. Even though these tree-based models are long-established, their
performance is on par with deep neural network approaches, and they are easier to tune [41]. To
speed up the training process, we use GPU-accelerated learning in the Google Cloud Platform7 .
Finally, we use four-fold cross-validation to tackle overfitting and determine the best set of
hyperparameters for each model towards the best RMSE value using scikit-optimize package,
which employs a Bayesian search method [42]. We summarize the optimal parameters for the
different models in Table 4.

Table 4
Optimal hyperparameters as determined using the Bayesian search method of scikit-optimize .
                                        CatBoost                               Scikit Decision Tree
 Data set     Approach
                          itr   l2_leaf_reg   lr     max_depth   max_depth   min_samples_leaf min_samples_split
              M-OHE      1100           49   0.049          11          49                39                 48
 Booking      M-GE       2100           50   0.024          12          40                39                 48
              M-PE       2100           50   0.231          15          14                45                 81
              M-OHE      1000           29   0.034           8          44                50                 76
 Twitter      M-GE        550           26   0.048          12          49                39                 48
              M-PE       1050           30   0.046          15          24                42                 71


The recommended duration of stay is the outcome of the machine learning regression models
of the respective algorithms using the different feature configurations.

5.5. Discussion
This evaluation procedure comprises two data sets, four statistical baselines, and two algorithms
with three encoding variations as independent variables. This setup results in 20 experimental
variations for which we compute the evaluation using four dependent variables.
   The main goal is to measure the difference between the different embeddings, providing
insights into which input features are helpful and which are not. Furthermore, we are interested
in whether more features improve the results in a significant way, as it might be that the training
costs of tuning a model with more parameters might not be worth the effort due to diminishing
improvements in accuracy. Finally, we emphasize that the input for all models is solely based
on traveler mobility. We can further improve the embeddings in real-world scenarios with
contextual info and metadata, typically available on commercial platforms.


6. Results
We evaluate the overall performance of both data sets. The MAE and RMSE values for all
methods of the test set are presented in Tables 5 and 6 for the Booking.com and Twitter data
sets, respectively.

    7
        Google Cloud Platform https://cloud.google.com/
6.1. Booking.com
For the Booking.com data set, we can observe that our proposed approaches consistently
outperform the baselines in terms of MAE and RMSE. Interestingly, the user mean duration
provides worse results than the mode, which we attribute to the mode being more robust
against outliers. Recall that one-day stays at a destination are most frequent across all trips.
Furthermore, the user percentile approach [11] performs better when the aggregation target is
on a country level instead of the city level. Analyzing the embedding-based methods, M-GE
and M-PE perform better than the M-OHE approach, indicating the node embeddings are better
in capturing travelers’ mobility patterns than the conventional one-hot encoding of the travel
destinations. The traveler embedding in the M-PE approach helps the model understand the
user’s preferences better than the M-GE approach, which only considers the city embeddings.
In the direct comparison of the decision tree model, CatBoost performs better than the scikit-
learn Decision Tree in the embedding-based encoding, with similar performance for one-hot
encoding. We can also observe that the MAE Rounded is much lower than MAE, while RMSE
Rounded is higher than RMSE. Thus, rounding the predicted values helps make more accurate
recommendations for most users, except for outlier users who travel for more extended periods.

Table 5
Prediction error metrics for Booking.com. The best scores are bold with a dagger.
        Approach                      MAE      MAE Rounded        RMSE      RMSE Rounded
        User Mean                      0.835              0.821     1.168            1.214
        User Mode                      0.742              0.742     1.233            1.233
        User Percentile – Country      0.726              0.726     1.185            1.185
        User Percentile – City         0.769              0.769     1.209             1.21
                   Scikit-DT          0.678                 0.6    0.955               1.01
        M-OHE
                   CatBoost           0.678              0.601     0.954               1.01
                   Scikit-DT          0.545              0.483     0.787             0.837
        M-GE
                   CatBoost           0.541              0.475     0.777             0.827
                   Scikit-DT          0.563              0.491     0.804             0.853
        M-PE
                   CatBoost          †0.534             †0.466    †0.767            †0.815


6.2. Twitter
We can not replicate the clear results of the Booking.com data set in the Twitter data set. First
of all, the data set seems to be noisier since, without exception, all error metrics are higher
compared to the Booking.com data. Further, our proposed methods only outperform the baseline
in terms of RMSE and RMSE Rounded, while the baseline approach of simply taking the mode
of the stay duration of the user achieves the lowest MAE score. We attribute this surprising
result to the characteristics of check-in-based data as opposed to hotel bookings: If a user visits
two or more cities on the same day, each destination will be recorded with a stay duration of
one calendar day, even though it was only a few hours. This effect does not happen with the
Booking.com hotel reservations, as one typically does not book two accommodations for the
same night. Besides that, we observe similar trends for machine learning methods as for the
Booking.com data set, with M-PE performing better than M-GE and M-OHE. Again, the best
model in terms of the more outlier-robust RMSE metric is CatBoost using the M-PE encoding.

Table 6
Prediction error metrics for Twitter. The best scores are bold with a dagger.
        Approach                       MAE      MAE Rounded         RMSE        RMSE Rounded
        User Mean                      0.962              0.958      1.307              1.336
        User Mode                     †0.806             †0.806       1.49               1.49
        User Percentile – Country      0.823              0.823      1.471              1.472
        User Percentile – City         0.856              0.856      1.442              1.443
                   Scikit-DT            0.932              0.974     1.261              1.286
        M-OHE
                   CatBoost             0.931              0.975     1.259              1.283
                   Scikit-DT            0.884              0.872     1.229              1.268
        M-GE
                   CatBoost             0.882              0.876     1.222              1.262
                   Scikit-DT            0.892              0.898     1.236              1.278
        M-PE
                   CatBoost             0.844              0.820    †1.183             †1.228


6.3. Discussion
We tested ten different models to recommend the optimal duration of stay at a destination
and evaluated them using two data sets. Our most relevant conclusion is that based on the
performance of both data sets, we can see that more advanced representations based on traveler
types characterization and mobility patterns improve the predictive performance regarding the
duration of stay for travelers.
   We obtained clear results from the Booking.com data sets that more sophisticated feature
engineering efforts lead to increased predictive performance. We further confirm this in Table 7
with a brief ablation analysis where we revisit the impact of various features of the M-PE model
in The values are computed using the CatBoost library and are based on the influence of the
individual input features. We observe that both the embedding types play an important role in
predicting the duration of the stay. The City Embeddings have the highest importance among
all features, followed by Traveller Embeddings and Traveler Cluster. The Trip Type (domestic or


Table 7
Importance of the different features used in CatBoost M-PE model for the Booking.com and Twitter
data sets
                                                       Importance
                                 Feature
                                                  Booking.com Twitter
                           City Embedding                  44.84%     46.32%
                           Traveler Embedding              26.45%     44.77%
                           Traveler Cluster                28.60%      7.54%
                           Trip Type                        0.09%      1.35%
international) feature does not play a role in the predictions, as both data sets are predominately
formed of one trip type.
   In the case of the Twitter data, the results of Table 6 underline the task’s difficulty compared
to simple baselines, such as taking the mode duration. Recall from Figure 1 that the distribution
of the stays at individual destinations is negatively skewed, with about half of all stays being
only one day. This circumstance makes it easy for the baselines to perform well. While this
is roughly the same for both data sets, the most international (95.5%) hotel reservations from
Booking.com are way less noisy than the mostly domestic (91.3%) mobility derived from Twitter
users. For this reason, we see limited direct comparability between the two data sets and observe
that the duration of hotel reservations is better suited to predict with the proposed algorithms
than unconstrained mobility observed from a location-based social network.
   When it comes to the generalizability of the results, we have strong doubts about the Book-
ing.com challenge data set as stemming from a data science challenge; it is probably too clean to
be a realistic benchmark for the challenge of recommending the personalized duration of stay.
The Twitter data set is likely a closer reflection of reality, which is a sobering insight: Due to the
reality of most trips in the data lasting precisely one day, a strategy of simply recommending
one day – essentially the user mode approach – can be competitive with machine learning
approaches.


7. Conclusions
In this paper, we explored how we can effectively leverage traveler behavior and mobility
patterns to model the user’s preferences when predicting the duration of stay at a destination.
Using two data sets, one based on hotel bookings and one on check-in-based traveler mobility
from location-based social networks, we evaluate machine learning algorithms with domain-
driven feature engineering against statistical baselines. The unavailability of high-quality data
impeded the feature engineering; however, we could automatically identify different destination
embeddings even though the Booking.com data set was anonymized. Consequently, to the best
of our knowledge, we are first to evaluate the duration of stays for destination recommendation
systematically. Given the input features used to do the feature engineering, one can see the
potential to generalize our approaches and experimental setup to any check-in-based data
source.
   In the future, we plan to extend this work to not only recommending the duration of stay at
one destination but determining the optimal durations for all legs of composite trips in relation
to each other [43, 9]. Furthermore, we are interested in exploring the benefits of additional
methods to embed user mobility patterns and experiment with deep learning to optimize the
prediction task’s performance further.
   Recommending the amount of item consumption is not only relevant on a destination-level,
but could also be analyzed on a finer level, e.g., determining how long to stay at specific
points-of-interest [44] or the duration of exercise activities [45]. We believe our approach could
supplement existing approaches and establish an evaluation standard for this problem.
References
 [1] Y. Koren, R. Bell, Advances in collaborative filtering, in: Recommender Systems Handbook,
     Springer US, 2015, pp. 77–118. doi:10.1007/978- 1- 4899- 7637- 6_3 .
 [2] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? a
     worrying analysis of recent neural recommendation approaches, in: 13th ACM Conference
     on Recommender Systems, RecSys’19, ACM, New York, NY, USA, 2019, pp. 101–109.
     doi:10.1145/3298689.3347058 .
 [3] A. Singh, T. Joachims, Fairness of exposure in rankings, in: 24th ACM SIGKDD Inter-
     national Conference on Knowledge Discovery & Data Mining, KDD ’18, Association for
     Computing Machinery, New York, NY, USA, 2018, pp. 2219–2228. doi:10.1145/3219819.
     3220088 .
 [4] A. J. Biega, K. P. Gummadi, G. Weikum, Equity of attention: Amortizing individual fairness
     in rankings, in: The 41st International ACM SIGIR Conference on Research, SIGIR ’18,
     ACM, New York, NY, USA, 2018, pp. 405–414. doi:10.1145/3209978.3210063 .
 [5] H. Abdollahpouri, G. Adomavicius, R. Burke, I. Guy, D. Jannach, T. Kamishima,
     J. Krasnodebski, L. Pizzato, Multistakeholder recommendation: Survey and research
     directions, User Modeling and User-Adapted Interaction 30 (2020) 127–158. doi:10.1007/
     s11257- 019- 09256- 1 .
 [6] S. Wang, L. Hu, Y. Wang, L. Cao, Q. Z. Sheng, M. Orgun, Sequential recommender systems:
     Challenges, progress and prospects, in: 28th International Joint Conference on Artificial
     Intelligence, International Joint Conferences on Artificial Intelligence Organization, 2019.
     doi:10.24963/ijcai.2019/883 .
 [7] D. Herzog, L. W. Dietz, W. Wörndl, Tourist trip recommendations – foundations, state
     of the art and challenges, in: M. Augstein, E. Herder, W. Wörndl (Eds.), Personalized
     Human-Computer Interaction, de Gruyter Oldenbourg, Berlin, Germany, 2019, pp. 159–182.
     doi:10.1515/9783110552485- 006 .
 [8] D. Gavalas, C. Konstantopoulos, K. Mastakas, G. Pantziou, A survey on algorithmic
     approaches for solving tourist trip design problems, Heuristics 20 (2014) 291–328. doi:10.
     1007/s10732- 014- 9242- 5 .
 [9] D. Herzog, W. Wörndl, A travel recommender system for combining multiple travel
     regions to a composite trip, in: CBRecSys, 2014, pp. 42–48.
[10] M. Xie, L. V. S. Lakshmanan, P. T. Wood, Composite recommendations: from items to pack-
     ages, Frontiers of Computer Science 6 (2012) 264–277. doi:10.1007/s11704- 012- 2014- 1 .
[11] L. W. Dietz, W. Wörndl, How long to stay where? On the amount of item consumption in
     travel recommendation, in: ACM RecSys 2019 Late-breaking Results, 2019, pp. 31–35.
[12] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations,
     in: 10th ACM Conference on Recommender Systems, RecSys’16, ACM, New York, NY, USA,
     2016, pp. 191––198. URL: https://doi.org/10.1145/2959100.2959190. doi:10.1145/2959100.
     2959190 .
[13] X. Yi, L. Hong, E. Zhong, N. N. Liu, S. Rajan, Beyond clicks: Dwell time for personalization,
     in: 8th ACM Conference on Recommender Systems, RecSys’14, ACM, New York, NY, USA,
     2014, pp. 113–120. URL: https://doi.org/10.1145/2645710.2645724. doi:10.1145/2645710.
     2645724 .
[14] Y. Tian, K. Zhou, D. Pelleg, What and how long: Prediction of mobile app engagement, ACM
     Transactions on Information Systems 40 (2022) 1–38. URL: https://doi.org/10.1145/3464301.
     doi:10.1145/3464301 .
[15] D. Kitayama, K. Ozu, S. Nakajima, K. Sumiya, A Route Recommender System Based on
     the User’s Visit Duration at Sightseeing Locations, Springer, Cham, 2014, pp. 177–190.
     doi:10.1007/978- 3- 319- 11265- 7_14 .
[16] L. W. Dietz, Data-driven destination recommender systems, in: 26th Conference on User
     Modeling, Adaptation and Personalization, UMAP ’18, ACM, New York, NY, USA, 2018,
     pp. 257–260. doi:10.1145/3209219.3213591 .
[17] D. Goldenberg, P. Levin, Booking.com multi-destination trips dataset, in: 44th International
     ACM SIGIR Conference on Research and Development in Information Retrieval, ACM,
     New York, NY, USA, 2021. doi:10.1145/3404835.3463240 .
[18] J. Melià-Seguí, R. Zhang, E. Bart, B. Price, O. Brdiczka, Activity duration analysis for
     context-aware services using foursquare check-ins, in: International Workshop on Self-
     aware Internet of Things, Self-IoT ’12, ACM, New York, NY, USA, 2012, pp. 13–18. doi:10.
     1145/2378023.2378027 .
[19] D. Herzog, C. Laß, W. Wörndl, Tourrec: A tourist trip recommender system for individuals
     and groups, in: 12th ACM Conference on Recommender Systems, RecSys ’18, ACM, New
     York, NY, USA, 2018, pp. 496–497. doi:10.1145/3240323.3241612 .
[20] L. W. Dietz, A. Sen, R. Roy, W. Wörndl, Mining trips from location-based social networks
     for clustering travelers and destinations, Information Technology & Tourism 22 (2020)
     131–166. doi:10.1007/s40558- 020- 00170- 6 .
[21] L. Bernardi, T. Mavridis, P. Estevez, 150 successful machine learning models: 6 lessons
     learned at Booking.com, in: 25th ACM SIGKDD International Conference on Knowledge
     Discovery & Data Mining, KDD ’19, ACM, New York, NY, USA, 2019, pp. 1743–1751.
     doi:10.1145/3292500.3330744 .
[22] H. Gibson, A. Yiannakis, Tourist roles: Needs and the lifecourse, Annals of Tourism
     Research 29 (2002) 358–383.
[23] J. Neidhardt, R. Schuster, L. Seyfang, H. Werthner, Eliciting the users’ unknown preferences,
     in: 8th ACM Conference on Recommender Systems, RecSys ’14, ACM, New York, NY, USA,
     2014, pp. 309–312. doi:10.1145/2645710.2645767 .
[24] M. Sertkan, J. Neidhardt, H. Werthner, Mapping of tourism destinations to travel be-
     havioural patterns, in: B. Stangl, J. Pesonen (Eds.), Information and Communication
     Technologies in Tourism, Springer, Cham, 2017, pp. 422–434.
[25] M. Sertkan, J. Neidhardt, H. Werthner, Eliciting touristic profiles: A user study on picture
     collections, in: 28th ACM Conference on User Modeling, Adaptation and Personalization,
     UMAP ’20, ACM, New York, NY, USA, 2020, pp. 230–38. doi:10.1145/3340631.3394868 .
[26] M. C. González, C. A. Hidalgo, A.-L. Barabási, Understanding individual human mobility
     patterns, Nature 453 (2008) 779–782. doi:10.1038/nature06958 .
[27] A. Sen, L. W. Dietz, Identifying travel regions using location-based social network check-in
     data, Frontiers in Big Data 2 (2019). doi:10.3389/fdata.2019.00012 .
[28] D. Goldenberg, K. Kofman, P. Levin, S. Mizrachi, M. Kafry, G. Nadav, Booking.com wsdm
     webtour 2021 challenge, in: ACM WSDM Workshop on Web Tourism, WebTour’21, ACM,
     New York, NY, USA, 2021.
[29] C. Guo, F. Berkhahn, Entity embeddings of categorical variables, arXiv preprint
     arXiv:1604.06737 (2016).
[30] W. L. Hamilton, R. Ying, J. Leskovec, Representation learning on graphs: Methods and
     applications, arXiv preprint arXiv:1709.05584 (2017).
[31] B. Perozzi, R. Al-Rfou, S. Skiena, Deepwalk: Online learning of social representations, in:
     Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery
     and data mining, 2014, pp. 701–710.
[32] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings
     of the 22nd ACM SIGKDD international conference on Knowledge discovery and data
     mining, 2016, pp. 855–864.
[33] D. Wang, P. Cui, W. Zhu, Structural deep network embedding, in: International Conference
     on Knowledge Discovery and Data Mining, ACM, 2016. doi:10.1145/2939672.2939753 .
[34] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks
     for document classification, in: 2016 Conference of the North American Chapter of the
     Association for Computational Linguistics: Human Language Technologies, ACL, 2016,
     pp. 1480–1489. doi:10.18653/v1/n16- 1174 .
[35] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning
     Research 9 (2008) 2579–2605.
[36] Q. Chen, Z.-H. Ling, X. Zhu, Enhancing sentence embedding with generalized pooling, in:
     27th International Conference on Computational Linguistics, ACL, Santa Fe, New Mexico,
     USA, 2018, pp. 1815–1826. URL: https://aclanthology.org/C18-1154.
[37] S. Ivanov, E. Burnaev, Anonymous walk embeddings, in: J. Dy, A. Krause (Eds.), 35th
     International Conference on Machine Learning, volume 80 of Proceedings of Machine
     Learning Research, PMLR, 2018, pp. 2186–2195.
[38] T. Chai, R. R. Draxler, Root mean square error (rmse) or mean absolute error
     (mae)?–arguments against avoiding rmse in the literature, Geoscientific Model Develop-
     ment 7 (2014) 1247–1250.
[39] W. Wang, Y. Lu, Analysis of the mean absolute error (mae) and the root mean square
     error (rmse) in assessing rounding model, in: IOP conference series: materials science and
     engineering, volume 324, IOP Publishing, 2018.
[40] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, A. Gulin, CatBoost: unbiased
     boosting with categorical features, Advances in neural information processing systems 31
     (2018).
[41] R. Shwartz-Ziv, A. Armon, Tabular data: Deep learning is not all you need, Information
     Fusion 81 (2022) 84–90. doi:10.1016/j.inffus.2021.11.011 .
[42] R. Turner, D. Eriksson, M. McCourt, J. Kiili, E. Laaksonen, Z. Xu, I. Guyon, Bayesian
     optimization is superior to random search for machine learning hyperparameter tuning:
     Analysis of the black-box optimization challenge 2020, in: H. J. Escalante, K. Hofmann
     (Eds.), NeurIPS 2020 Competition and Demonstration Track, volume 133 of Proceedings of
     Machine Learning Research, PMLR, 2021, pp. 3–26.
[43] R. Roy, L. W. Dietz:, Triprec – a recommender system for planning composite city trips
     based on travel mobility analysis, in: ACM WSDM Workshop on Web Tourism, WebTour’21,
     ACM, New York, NY, USA, 2021.
[44] D. Herzog, Recommending a sequence of points of interest to a group of users in a mobile
     context, in: 11th ACM Conference on Recommender Systems, RecSys ’17, ACM, New
     York, NY, USA, 2017, pp. 402–406. doi:10.1145/3109859.3109860 .
[45] B. Smyth, A. Lawlor, J. Berndsen, C. Feely, Recommendations for marathon runners: on
     the application of recommender systems and machine learning to support recreational
     marathon runners, User Modeling and User-Adapted Interaction (2021). doi:10.1007/
     s11257- 021- 09299- 3 .
(a) Booking.com – Due to the pseudonymous labels, it is impossible to judge the correctness of the
    projection. Compared to the Twitter plot below, the number of cities per country is quite balanced,
    and the cities within one country are very close to each other, indicating a high quality of the
    embeddings.


(b) Twitter – Here we have the true labels for the countries and cities. Unsurprisingly, most data is from
    the United States, followed by Great Britain. In the top center, one can observe the proximity of
    Canada to the US (pink and blue clusters).

Figure 7: T-SNE Projection of City Embeddings generated using Destination Travel Graphs