Characterizing Multi-Source Data for Effective Urban Mobility Modelling: The Case of New York City Tonny Rutayisire1,∗ , Jun Liu1,† , Samuel Moore1,† and Carlos Balsas2,† 1 School of Computing, Ulster University, UK 2 School of Architecture & Built environment, Ulster University, UK Abstract To explore a more holistic view of urban human mobility, a few researchers have previously attempted to build models driven by a fusion of multi-source datasets. Yet, owing to limited understanding of the intrinsic characteristics and relationships underlying multi-source datasets, most of such interventions naively merge these datasets, producing sub-optimal prediction models. To address this issue, we propose a three-step methodological framework to capture urban mobility from an integrative point of view. The novelty of our framework is three-fold: (i) a systematic characterization of the multi-source data to leverage hidden relationships when integrating the data; (ii) integrating data with contextual information of the urban environments to improve explainability; and (iii) conforming the model building to incremental learning paradigm to adapt to changing patterns of mobility data. Using the New York City (NYC) case study, extensive analyses show salient relationships within datasets which could form a basis for optimal data fusion and subsquently improving the mobility modelling pipeline in line with the proposed three-step methodological framework. Keywords human mobility, data fusion, multi-view modelling, 1. Introduction Human mobility patterns in urban settings are indicative of various phenomena such as travel demand, traffic congestion, resource allocation, CO2 emission, and infectious disease spread. Therefore, under- standing such patterns is crucial to applications such as transportation management, urban planning, resource optimization, and epidemiology among others. In recent years, the pervasive use of sensing technologies has produced an enormous amount of human mobility data that researchers have used to answer critical questions pertaining to when, why, and how people move from one place to the other [1]. A wide range of data-driven models and techniques have been proposed to capture urban human mobility, based on a variety of dataset: taxi trip records [2] [3], call detail records (CDRs) [4] [5], social media [6], [7]. Despite the large number of studies on urban human mobility, majority of existing techniques have typically been built on single-source empirical data in isolation from other mobility flows. Human mobility, especially in urban settings is multi-faceted, mainly due to diverse travel modes and different spatial/temporal scales. It inevitably introduces a bias, against uninvolved flows, when the capture of urban-scale mobility is single source data-driven. To explore a more holistic view of urban mobility dynamics, a few researchers have attempted to build models and techniques driven by a fusion of multi-source datasets. Zhang et al. [8] proposed coMobile, a multi-view learning framework based on integration of transit view and cellphone view. The approach outperforms 2 single-view models WHERE and TRANSIT by 51% and 58% in terms of Mean Average Percetage error (MAPE), respectively. Jiang et al. [9] also utilized taxi GPS trajectories, smart card transaction data of subway and bus from Beijing to model human mobility in space. It is reasonable to assume that the fusion of these mobility datasets addresses the biases to some degree and provides a representative view of a broad spectrum of the population. Yet, owing to limited understanding of the intrinsic characteristics and relationships of AICS’24: 32nd Irish Conference on Artificial Intelligence and Cognitive Science, December 09–10, 2024, Dublin, Ireland ∗ Corresponding author. † These authors contributed equally. $ rutayisire-t@ulster.ac.uk (T. Rutayisire); j.liu@ulster.ac.uk (J. Liu); s.moore2@ulster.ac.uk (S. Moore); c.balsas@ulster.ac.uk (C. Balsas) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings these multi-source datasets, a considerable number of such interventions naively merge these datasets, which end up introducing hidden drawbacks to the data fusion and subsequently producing sub-optimal prediction models. For instance, in scenarios where space is a limitation, merging all multi-source datasets can become an issue, yet correlations within different datasest can be leveraged to use part of the dataset to infer the rest. Moreover, simplistic merging of datasets could actually worsen the sampling bias problem where datasets with low levels of representativeness contribute equally or more, to the data fusion. To this end, we propose a three-step methodological framework to capture urban mobility from an integrative point of view: (i) In the first step, we leverage intrinsic characteristics of multi-source data to learn, select, and integrate the most optimal vectors of the data for effective urban mobility modelling; (ii) we then integrate the modelling process with prior knowledge contexts of the domain to minimize the amount of data requirement, manage the uncertainties, and achieve model interpretability; (iii) and in the third step, we conform the modelling process to incremental learning paradigm where the model learns and updates by integrating new data streams as and when they become available. For multi-view learning, we argue that the effectiveness of the model can be enhanced when the above-mentioned aspects are incorporated together in the modeling process. The rest of the paper is organized as follow: Section 2 summarizes the challenges and our motivation for this research. In Section 3, we introduce our generic methodological framework for integrative human mobility modeling. Section 4 introduces our case study area, the mobility datasets, and the approaches we have used to characterize the multi-source datasets. The results and their discussions are presented in Section 5, while the conclusion and future work are provided in Section 6. 2. Motivation Urban mobility is multi-faceted in nature, mainly due to different modes of transport. As such, we have recently seen a growing shift from single source to multi-source data-based models for human mobility modelling in urban settings [9] , [10], [11], [12]. In reality, single data sources provide a limited perspective on human mobility. For instance, taxi trip data may not represent the people who cannot afford taxi and choose to use other means of transport such as bus, ride-sharing, and subway. On the other hand, data from bike-sharing apps is not as representative during the winter season, and or across unfavourable topography. Figure 1 shows the differences in trip counts for Yellow taxi & Green taxi across NYC taxi zones. Even with the same mode of transport (but owing to different urban contexts), we observe a huge divergence in the flow of mobility produced by yellow and green taxi, hitting as high as 380,000 trips in some zones. Hence, whereas fusion of multiple data sources may provide a more representative understanding of urban mobility patterns, simply adding together a series of features extracted from multi-source datasets lays a sub-optimal foundation for multi-view mobility modeling. It can’t simply be assumed that all multi-source datasets contribute equally to the mobility flow aspect being learned. In [11] , authors highlight that the representativeness of each source largely depends on the demographics of the service users in relation to the demographics of the local population, and this dynamic is yet to be fully investigated. An effective approach to multi-view mobility modeling has to capture the true spatial-temporal context of the datasets, and their respective qoutas for the fusion. We argue that leveraging intrinsic characteristics of the datasets, coupled with existing domain knowledge could result in an appropriate weighting mechanism when multi-source mobility datasets are being fused. Therefore, our motivation stems from the need for a novel framework to effectively integrate and learn multi-source data dynamically with domain knowledge for optimal multi-view mobility modeling. 3. Proposed Integrative Methodological Framework To optimize multi-view mobility modeling, we propose an integrative three-step methodological frame- work, as shown in Figure 2. The framework combines multi-source mobility data for a more representa- tive picture of travel behaviours, then integrates it with existing domain-knowledge to minimize data requirement while enhancing explainability, and finally conforms the mobility modeling to incremental Figure 1: Difference in trip counts by NYC taxi zones learning, adapting to ever-changing urban mobility conditions. Each component is briefly overviewed in the following subsequent sections. 3.1. Multi-Source Data Fusion In the context of multi-view learning, data fusion captures the comprehensive view of urban mobility patterns, and also diffuses the limitations that come along with relying solely on single-source data. However, most existing works make general assumption of the multi-source data when it is being fused. For example in [13] , Zhang et al. (2014) used the transport data to simply complement modeling primarily based on cellphone data, and later in [8], Zhang et al. (2017) adjusted and treated the two kinds of data sources equally, as two independent views. This is bound to integrate the data in proportions that don’t reflect the true representation of each view. Increasingly, authors in this field [9] [11] [14], acknowledge that data fusion requires a full understanding of each dataset and their efficient utilization, which is currently still limited. In our proposed framework, the characteristic context of multi-source data is leveraged to workout an effective weighting scheme upon which data fusion is done. This will ensure spatial-temporal patterns are realistically and dynamically captured. 3.2. Data – Knowledge Integration It is widely known that data-driven models are only as good as the quantity and quality of data they are trained on. Given issues relating to incompleteness, inconsistencies, sparsity and uncertainity, it is often difficult to acquire large amounts of quality mobility data. However, integrating empirical data with domain-specific knowledge often validates the data, enhancing model reliability and interpretability, which is still an open challenge in human mobility research. To the best of our knowledge, very few mobility modeling approaches exist in literature today, which incorporate domain-specific knowledge. In MobTCast [15], the influence of semantic, social, and geographical contexts is incomporated with historical data to tackle the sparsity problem, while predicting POIs. Likewise in [16], a framework was proposed to leverage a knowledge graph to accommodate the influence of users, locations and semantics, in next location recommendation. Meanwhile in [17], authors proposed an MaaS framework Figure 2: Integrative methodological framework for multi-view mobility modeling to combine multi-source data with an understanding of travelers’ contexts to present them personalized and explainable services based on their preferences. Challenges withstanding, their framework aims to achieve the possibility of providing the right information to the right user with understandable explanations. The common principal about these methods is the integration of empirical data with one or more kinds of contextual understanding of the urban environment. Whereas existing efforts have focused on the influence of somewhat stable-static contexts, our integrative framework goes a step further to incorporate situational contexts which is rather a dynamically changing context. 3.3. Incremental Learning Urban mobility is highly dynamic, with context and patterns that typically change from time to time. Conventional data-driven models, though they have shown considerable performance, they often struggle to keep at per with evolving urban environments, where mobility patterns keep on changing due to changing policy, transportation systems, road networks, weather conditions, and events. As an emerging approach, Incremental Learning (IL) which enhances adaptability of models by learning continuously from new incoming data streams, is poised to address these challenges suffered by static models. In our framework, the knowledge-enhanced mobility model will be tuned to continuously update on only new incoming data, as and when it becomes available, while retaining previously learned knowledge. 4. Case Study: Urban Mobility in New York City In the previous section, we give an overview of the proposed integrative three-step methodological framework. Note however that in the present work, the focus is to preliminarily use the case of New York City (NYC) to analyze the underlying characteristics of multi-source data, setting up an optimal foundation for effective data fusion and subsquent multi-view mobility modeling in the line of the proposed framework. 4.1. Study Area In this work, we characterize and compare mobility flows extracted from NYC’s taxi and Citi bike trip records. NYC is the largest city in the United States, known for its fast pace, dynamism and cultural diversity. It continues to be a global hub for finance, culture, commerce, and innovation. It covers a total area of 784 km2 with an approximate population of 8.5 million people as of 2023. The city is composed of five boroughs: Manhattan, Brooklyn, Queens, Bronx, and Staten Island, each with its own unique character, and it is demarcated into 263 zones, as shown in Figure 3. The city is covered with an extensive and a well-integrated transport system that is comprised of public transit, commuter rails, taxis, and bicycles. A number of researchers have utilized NYC as a case study to investigate different aspects of urban mobility. For example, in [2] F. Miao et al. designed and evaluated the performance of the data-driven vehicle balancing framework on four years’ long taxi trip data from NYC, in [18, 19] authors narrowed in on mobility data from NYC in 2019 and 2020 to analyze the impact of COVID-19 on people’s mobility, whereas in [20] Dong et al. utilized anonymous mobile phone location and crash report data in NYC to study the association of human mobility and road crashes. We also chose NYC as a case study for this work for three main reasons; (1) open data policy/portal of NYC, (2) shape files for NYC demarcations, and (3) NYC mobility survey report data. Figure 3: NYC Boroughs & Zones 4.2. Datasets In this section, the three NYC datasets used in this study, are summarized. The Yellow taxi and Green taxi datasets are provided by the Taxi & Limousine Commission of NYC, while the bike-sharing dataset is provided by Citi bike NYC. For all datasets, each record is a distinctive trip extracted, with an origin, destination, duration, distance and fare among other trip attributes. After pre-processing of the datasets, a total of 10,047,135 trips were successfully extracted from the Yellow taxi, 1,080,844 trips from the Green taxi, and 1,000,000 trips from the NYC Citi bikes. To ensure temporal compatibility, the collection period was the same for all the three datasets, running from April 1, 2017 to April 30, 2017. 4.3. Approach We characterize multi-source datasets based on three meta-metrics: (1) correlation; and (2) representa- tiveness. We argue that examining and leveraging these three meta-metrics will form a basis for optimal data fusion and subsquently improving the multi-view mobility modelling pipeline. 4.3.1. Trip extraction Firstly, we extract trips – which is the basic unit of mobility for this study. With all the three datasets having pickup and drop-off attributes, extracting origins and destinations (OD) is simple and straight- forward. However, to reduce the size and address errors, irrelevant attributes and errant records were intuitively eliminated right away before any form of feature engineering was done. For example, all records with trip duration < 5 minutes and/or trip distance > 30 miles, were automatically dropped for all datasets. 4.3.2. Correlation analysis We examine mobility (travel demand) to evaluate whether there exist temporal correlation between Yellow taxi, Green taxi, and Citi bikes trip patterns in NYC. Owing to significant difference in scale among the datasets , we observe in Figure 4 that both Green taxi and Citi bike data temporally lag the Yellow taxi data, but there exist similar mobility trends worth further investigation. To provide fair and meaningful comparison, we normalize the data to capture the hourly variations as a proportion of total mobility flow for each dataset in Figure 5. We then use three different measures (Cosine-similarity, Pearson Correlation, and Spearman Rank Correlation), to explore similarities in their temporal distribution. These measures have been widely used in previous studies to quantify the similarity between two vectors in a multi-dimensional space, depending on the scale and context of the data. In our context, we opted for them because they quantify similarity in patterns regardless the difference in scale. Figure 4: Temporal variation of mobility flow (a) Yellow taxi, (b) Green taxi, and (c) Citi bikes 4.3.3. Representation analysis Taxi and Bike-sharing mode choices provide NYC residents with certain levels of convenience, timeliness, and flexibility. However, they co-exist and complement other modes in an integrated transport system. It is thus crucial to examine if travel demand extracted from the three datasets reflect their actual respective mobility activity in NYC. We perform a spatial coverage analysis to ensure that all significant zones are truly represented. For each dataset, we compute travel demand as the proportion of total outgoing trip counts at the zone level. We then use a choropleth map to visualize and compare the spatial distribution of these values, for the three datasets. We then use official statistics from the 2017 citywide mobility survey of NYC, to benchmark patterns derived from our datasets. Particularly we compare the trip proportions extracted Figure 5: Normalized hourly trip counts for Yellow taxi, Green taxi, and Citi bikes Figure 6: Spatial distribution of trip counts for (a) Yellow taxi, (b) Green taxi, and (c) Citi bike in NYC from the datasets to the relative trip proportions extracted from the 2017 citywide mobility survey of NYC. And consequently we perform statistical tests on some aspects of the observed trips in comparison with surveyed trips. 5. Results & Discussions This section summarizes the results obtained along with detailed discussions. 5.1. Correlation Figure 5 depicts the temporal distribution of normalized travel demand as extracted from Yellow taxi, Green taxi, and Cite bike trip records. Considerable relationships can be observed among the three datasets within the time dimension. For instance, we can see a similar drop in demand from midnight up until 5 a.m., across all the datasets. We also see a shoot-up in demand at 8 a.m. and another one at 6 p.m. To probe these relationships further, Table 1 summarizes results of three well-known measures of similarity against pairs of the three datasets. From the table 1, we can see that travel demand across all pairs of datasets generally has a positive correlation in the temporal space. Particularly, travel demand patterns from Yellow taxi & Green taxi exhibit a very strong alignment across time, for all the three measures utilized, whereas the correlation Table 1 similarity measures for hourly distribution of trips extracted from Yellow taxi, Green taxi, and Citi bike datasets Cosine-similarity Pearson Correlation Spearman Rank Correlation yellow - green 0.978 0.919 0.906 green - bike 0.899 0.681 0.657 yellow - bike 0.914 0.759 0.710 between green taxi & cite bikes is relatively weak but still considerably above average. 5.2. Representativeness Figure 6 depicts the spatial distribution of travel demand extracted from the 3 datasets. The hue progressions represent the number of outgoing trip counts aggregated at each zone, divided by the total trip counts for each of the dataset. Yellow taxi & Green taxi are regulated to mainly serve distinct areas of NYC. Yellow taxi primarily serves the core (lower) Manhatten and the major airports, whereas the Green taxi are regulated to operate in upper Manhatten and outer boroughs. We can see how the illustrations in the Figure 6 are compatible with the obvious expectation owing to the above fact. For example in Figure 6.a, we can observe a very sharp concetration of yellow taxi trips in inner Manhatten, JFK Airport, and LaGuardia Airport, in contrast with other zones. Figure 7: Comparison of Trip data and Survey data To validate this representativeness, we statistically compare our trip data with the 2017 citywide mobility survey of NYC. Figure 7 illustrates variations in the mode composition as extracted from the trip data and survey data. We observe that among the three modes, Green taxi exhibits a strong fit, Figure 8: Comparison of cumulative distributions for trip duration when trip data is compared against survey data. This is further collaborated by the CDF distribution of their respective trip durations as shown in Figure 8. We can observe that though Yellow taxi and Citi bikes both have considerable similar distribution (p-value=>0.05), Green taxi once again exhibit remarkable values of Kolmogorov-Smirnov (KS) test, which demonstrating a strong representativeness of the Green taxi records against the actual mode split as by the 2017 city-wide mobility survey of NYC. 6. Conclusions, Limitations & Future Work In this study, we proposed a three-step methodological framework for modelling urban mobility from an integration of multi-source data. The preliminary step of our framework involves the systematic characterization of multi-source data as a foundation for a weight-based data fusion. Using NYC as a case study, we characterized three mobility datasets based on two meta-metrics: (1) correlation; and (2) representativeness. Our findings revealed that there exist strong correlation between datasets, which can be leveraged in data fusion. For instance, we see a strong temporal correlation between Yellow taxi & Green taxi, in terms of relative travel demand, which could be used to infer missing data in one dataset using the other. It could also be used in weighted average fusion of the data, where weights are based on the correlation. Also revealed via our findings is the level of representativeness between the respective mobility datasets and the actual urban scenarios. For example we see in our case study how the Green taxi has the most alignment with actual sample data surveyed from the population in the 2017 city wide mobility survey of NYC. This could also be leveraged in the weight scheme to ensure the most representative and reliable dataset contributes most to the data fusion. It should be noted that this work does not attempt to examine all the three steps of the proposed framework, but rather to give its overview, and the characterization of the multi-source data to be used as ingredients in the subsequent steps of the framework. One of the limitations of this work however, results from the fact that currently, we only focus on data sources from NYC. If matching multi-source data from other mega cities like Shenzhen and Beijing was readily available, a comparison of mobility trends from different urban cities against the two meta-metrics would further validate the characterization of multi-source data. Another limitation stems from the age of data used (2017), given major mobility shifts that have occurred with the coming Covid-19 pandemic. While very recent mobility data is available at the NYC open data portal, for the interest of measuring representativeness of the data, we wanted to avoid a time difference of data collections between trip records data and the mobility survey data, given that the city-wide mobility survey data available is of 2017. In our future work, while acknowledging and working around the above limitations, we plan to leverage the findings in this preliminary work, to design an effective weighting scheme for the data fusion. We shall then improvise effective ways to incorporate contextual knowledge in the fused data, and finally employ incremental learning to build an efficient mobility prediction model in line with the proposed methodological framework for different application contexts. References [1] Y. Zhou, B. P. L. Lau, C. Yuen, B. Tuncer, E. Wilhelm, Understanding urban human mobility through crowdsensed data, IEEE Communications Magazine 56 (2018) 52–59. doi:10.1109/MCOM. 2018.1700569. [2] F. Miao, S. Han, A. M. Hendawi, M. E. Khalefa, J. A. Stankovic, G. J. Pappas, Data-driven distri- butionally robust vehicle balancing using dynamic region partitions, in: Proceedings of the 8th International Conference on Cyber-Physical Systems, 2017, pp. 261–271. [3] H. Rong, X. Zhou, C. Yang, Z. Shafiq, A. Liu, The rich and the poor: A markov decision process ap- proach to optimizing taxi driver revenue efficiency, in: Proceedings of the 25th ACM international on conference on information and knowledge management, 2016, pp. 2329–2334. [4] C. Rodrigues, M. Veloso, A. Alves, C. Bento, Using cdr data to understand post-pandemic mobility patterns, in: EPIA Conference on Artificial Intelligence, Springer, 2023, pp. 438–449. [5] Y. Li, Z. Ran, L. Tsai, S. Williams, Using call detail records to determine mobility patterns of different socio-demographic groups in the western area of sierra leone during early covid-19 crisis, Environment and Planning B: Urban Analytics and City Science 50 (2023) 1298–1312. [6] Y. Jiang, X. Huang, Z. Li, Spatiotemporal patterns of human mobility and its association with land use types during covid-19 in new york city, ISPRS International Journal of Geo-Information 10 (2021) 344. [7] J. C. Wo, E. M. Rogers, M. T. Berg, C. Koylu, Recreating human mobility patterns through the lens of social media: using twitter to model the social ecology of crime, Crime & Delinquency 70 (2024) 1943–1970. [8] D. Zhang, T. He, F. Zhang, Real-time human mobility modeling with multi-view learning, ACM Transactions on Intelligent Systems and Technology (TIST) 9 (2017) 1–25. [9] S. Jiang, W. Guan, W. Zhang, X. Chen, L. Yang, Human mobility in space from three modes of public transportation, Physica A: Statistical Mechanics and its Applications 483 (2017) 227–238. [10] D. Zhang, T. He, F. Zhang, National-scale traffic model calibration in real time with multi-source incomplete data, ACM Transactions on Cyber-Physical Systems 3 (2019) 1–26. [11] X. Huang, Z. Li, Y. Jiang, X. Ye, C. Deng, J. Zhang, X. Li, The characteristics of multi-source mobility datasets and how they reveal the luxury nature of social distancing in the us during the covid-19 pandemic, International Journal of Digital Earth 14 (2021) 424–442. [12] Z. Huang, X. Ling, P. Wang, F. Zhang, Y. Mao, T. Lin, F.-Y. Wang, Modeling real-time human mobility based on mobile phone and transportation data fusion, Transportation research part C: emerging technologies 96 (2018) 251–269. [13] D. Zhang, J. Huang, Y. Li, F. Zhang, C. Xu, T. He, Exploring human mobility with multi-source data at extremely large metropolitan scales, in: Proceedings of the 20th annual international conference on Mobile computing and networking, 2014, pp. 201–212. [14] J. Wang, X. Kong, F. Xia, L. Sun, Urban human mobility: Data-driven modeling and prediction, ACM SIGKDD explorations newsletter 21 (2019) 1–19. [15] H. Xue, F. Salim, Y. Ren, N. Oliver, Mobtcast: Leveraging auxiliary trajectory forecasting for human mobility prediction, Advances in Neural Information Processing Systems 34 (2021) 30380–30391. [16] Q. Guo, Z. Sun, J. Zhang, Y.-L. Theng, An attentional recurrent neural network for personalized next location recommendation, in: Proceedings of the AAAI Conference on artificial intelligence, volume 34, 2020, pp. 83–90. [17] E. Rajabi, S. Nowaczyk, S. Pashami, M. Bergquist, G. S. Ebby, S. Wajid, A knowledge-based ai framework for mobility as a service, Sustainability 15 (2023) 2717. [18] X. Hao, R. Jiang, J. Deng, X. Song, The impact of covid-19 on human mobility: A case study on new york, in: 2022 IEEE International Conference on Big Data (Big Data), IEEE, 2022, pp. 4365–4374. [19] A. A. Rajput, Q. Li, X. Gao, A. Mostafavi, Revealing critical characteristics of mobility patterns in new york city during the onset of covid-19 pandemic, Frontiers in Built Environment 7 (2022) 654409. [20] N. Dong, J. Zhang, X. Liu, P. Xu, Y. Wu, H. Wu, Association of human mobility with road crashes for pandemic-ready safer mobility: A new york city case study, Accident Analysis & Prevention 165 (2022) 106478.