<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Mining on the Use of Railway Stations</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computational Mathematics and Lomonosov Moscow State University GSP-1</institution>
          ,
          <addr-line>1-52, Leninskiye Gory, Moscow, 119991</addr-line>
          ,
          <institution>Russia and Center of digital high-speed transport systems Russian University of Transport (MIIT) Obraszova 9, bld. 9, 127994, Russia and National Competence Center for Digital Economy Lomonosov Moscow State University GSP-1</institution>
          ,
          <addr-line>1-52, Leninskiye Gory, Moscow, 119991</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article deals with the processing of data on passenger entrances and exits at railway stations in Moscow and its suburbs. Smart transport cards are used worldwide as a payment tool in transport applications. For railways (and cities), their use creates large and constantly updated collections of transaction data from card validation equipment. The deployment model for railways in the Moscow region lets us know exactly the starting and ending points of each route. This detailed information allows us to obtain generalized information on the modes (models) of the actual use of railway transport. The detected travel patterns can be mapped to a model of the social and economic behavior of residents of the capital region. Conversely, we can use known artifacts of the behavior of the inhabitants of the region as search patterns for transport data. It is well known that mobility is one of the main characteristics and key components of a smart city.</p>
      </abstract>
      <kwd-group>
        <kwd>urban railways</kwd>
        <kwd>smart card</kwd>
        <kwd>transport cards</kwd>
        <kwd>data mining</kwd>
        <kwd>mobility</kwd>
        <kwd>smart city</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper deals with the processing of data on the check-ins (entrances) and check-outs (exits) of passengers at railway stations in Moscow and its suburbs. Under the existing model of travel on suburban and urban railways, each passenger presents (validates) a travel document twice: at the entrance to the railway station before the trip (check-in) and at the exit from the railway station after the trip (check-out). This feature, together with the unique identity of the travel document, lets us know the starting and ending points of each route exactly. Accordingly, it becomes possible to analyze these movement data in order to obtain generalized information on the modes (models) of the actual use of the railway. We believe that the information obtained from this analysis can be useful to railways for assessing their activities and planning changes, and to city services for assessing ongoing changes in the urban environment and planning future ones.</p>
      <p>The aim of the work is to search for travel patterns of railway passengers and to map these patterns to a model of the social and economic behavior of residents of the capital region. It is also possible to consider the inverse problem: the search for (the confirmation of) known artifacts of the behavior of the inhabitants of the region in the movement data.</p>
      <p>
        As we pointed out above, in the existing system a railway ticket (travel document) is presented twice: at the entrance and at the exit. Accordingly, for each trip, we know the station where the ticket was used at the entrance and the station where the ticket was used at the exit. In terms of social networks, we know the pair of check-in and check-out. This is a useful difference from, for example, information on the validation of travel documents in the metro or on buses. There (in Moscow) we have information only about the entrance. Accordingly, to obtain the starting and ending points of a trip, it is necessary to use some heuristic algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, let us describe, from the point of view of the card validation system (e.g., the Troika card, one of the main travel cards in Moscow), a typical person's trip from home to work and back using the metro. In the morning, the passenger uses the card to enter the metro. Thus, the starting point of the route, tied to the card, appears. Having reached the workplace, the person stays there until the evening, after which the card is validated again. This gap (a time interval between trips) lets us draw a conclusion about the final point of the route: it is the first check-in station after the gap. Naturally, these data are approximate. For data collected at railway stations, as indicated above, there is accurate information about the starting and ending points of travel. Thus, these data are in a sense complete: they contain all information about trips. Note that travel tickets (disposable or reusable) are anonymous, and there is no information on the passengers themselves.
      </p>
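      <p>The gap heuristic described above for entrance-only data (as in the metro) can be sketched as follows. This is a minimal illustration, not the algorithm of any cited work: the function name, the four-hour threshold, and the station labels are assumptions introduced here.</p>
      <preformat>
```python
from datetime import datetime, timedelta


def infer_destinations(check_ins, min_gap=timedelta(hours=4)):
    """Infer (origin, destination) pairs for one card from entrance-only data.

    check_ins: list of (timestamp, station) tuples for a single card.
    min_gap:   assumed minimum pause that separates two trips.
    """
    ordered = sorted(check_ins)
    trips = []
    for (t_prev, s_prev), (t_next, s_next) in zip(ordered, ordered[1:]):
        # A long pause suggests the passenger stayed near the station of the
        # next check-in, so that station is taken as the destination.
        if t_next - t_prev >= min_gap:
            trips.append((s_prev, s_next))
    return trips
```
      </preformat>
      <p>With railway check-in/check-out data this inference step is unnecessary, because the exit record already names the other end of the trip.</p>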
      <p>To date, detailed information on passenger activity at railway stations is practically unused. One of the few applications that could be mentioned here is the calculation of the total number of passengers by station. We believe that these detailed data contain much more valuable information, which not only reflects the patterns of use of railway transport in the city but also allows us to identify other artifacts of city life.</p>
      <p>If we can explain the relationship between information obtained from registration data and some patterns of behavior in the city, then tracking changes in the station data (which is technically feasible on the part of the railway, for example) will allow us to detect (or predict) some changes in the life of the city. In other words, changes identified by constant monitoring of registration data will indicate a change in the processes in the city. Accordingly, data on passages can be used to track changes in the behavior patterns of urban residents (passengers). Conversely, understanding these relationships will allow us to predict how changes in the city will affect the use of the railway.</p>
      <p>The rest of the paper is organized as follows. In Section 2, we describe related work. In Section 3, we describe railway data processing and the discovered links with urban life.</p>
    </sec>
    <sec id="sec-2">
      <title>Related works</title>
      <p>
        Modeling urban behavior by mining geo-tagged data is a popular research topic [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Social network data are used first of all to assess behavior. The concept of check-in was introduced by social networks. Technically, check-in data reveal who spends time where and when. They can also be used for detecting types of activities. The obtained data can be used to describe city regions in terms of the activity that takes place in them. The next natural question is how to distinguish one region from another by the types of activity. The mathematical tools used here are mainly related to the construction of clusters based on probabilistic models. An example is a group of users who are more likely to be in a given place at the next moment or to engage in a certain type of activity [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        For transport data, the models should look slightly different. For example, for urban railways, the routes of passengers are precisely known. Activities, naturally, are also limited. The paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] contains a review of smart card data mining in smart cities. The typical tasks are traffic pattern detection, trip generation, and route-based studies. In our case, traffic pattern (transit pattern) tasks are not applicable, because we deal only with fixed railroad routes. Trip generation problems can be partially interesting because we can use the same mathematical tools.
      </p>
      <p>
        The paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] provides a rich overview of transport-related studies that target behavior extraction. While the main task for most transport studies is still obtaining the origin and destination pair (which is not an issue in our case), this paper also enumerates other interesting models, e.g., detection that the movement flow structure is polycentric; detection of a power-law flow distribution and a negative binomial distribution of rides; and spatial and temporal pattern mining. The Hellinger coefficient can be used to measure the similarity of temporal patterns of human mobility between each pair of days and to provide a basis for variability analysis. According to the reviewed studies, intra-urban trips have peak hours over a day, differ between weekdays and weekends (which is almost obvious), and have a periodicity (which is not so obvious).
      </p>
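      <p>In our calculations, we used the following definition of the Hellinger coefficient. Let <italic>p</italic><sub><italic>i</italic></sub>(<italic>x</italic>), <italic>i</italic> = 1, 2, ..., <italic>N</italic>, be probability density functions. The Hellinger coefficient among these distributions is:</p>
      <disp-formula id="eq-1">
        <label>(1)</label>
        <tex-math>R = \sum_{x} \left( \prod_{i=1}^{N} p_i(x) \right)^{1/N}</tex-math>
      </disp-formula>
      <p>
        The value of <italic>R</italic> is between 0 and 1. The larger <italic>R</italic> is, the more closely related the probability density functions are [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>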
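      <p>For discrete data (for example, passenger counts per time interval), the coefficient in (1) can be computed as in the following minimal sketch. The function name and the normalization of raw counts to probabilities are assumptions made for illustration; every input row is assumed to have a positive total.</p>
      <preformat>
```python
import math


def hellinger_coefficient(distributions):
    """Sum over bins of the geometric mean of N discrete distributions.

    distributions: N equally long sequences of non-negative counts or
    probabilities; each row is normalised to sum to 1 before use.
    """
    rows = [[float(v) for v in row] for row in distributions]
    n = len(rows)
    normalised = [[v / sum(row) for v in row] for row in rows]
    total = 0.0
    # zip(*...) walks over the bins x; each column holds p_1(x), ..., p_N(x).
    for column in zip(*normalised):
        total += math.prod(column) ** (1.0 / n)
    return total
```
      </preformat>
      <p>Identical daily profiles give a value of 1, and profiles with no common non-empty interval give 0, which matches the stated range of <italic>R</italic>.</p>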
      <p>
        In general, for our kind of research, time-dependent analysis of urban movement patterns [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] looks somewhat more suitable. E.g., the paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] describes the measurement of temporally based regularity of commuting. The temporal patterns can be detected by the similarity of departure time and the number of traveling days.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Railway data processing</title>
      <p>Data for analysis for the years 2016/2017 on suburban and city stations of the Moscow region were provided by the Center for Digital High-Speed Transportation Systems of the Russian University of Transport. The data files contain the following information for each pass (entrance or exit):</p>
      <list list-type="bullet">
        <list-item><p>date and time,</p></list-item>
        <list-item><p>type of the event: entrance or exit,</p></list-item>
        <list-item><p>the current station,</p></list-item>
        <list-item><p>the station from which the passenger arrived (if this is an exit),</p></list-item>
        <list-item><p>type of the tariff (price characteristic): full or preferential,</p></list-item>
        <list-item><p>type of the ticket: one-time one-way ticket, one-time round-trip ticket, subscription (reusable ticket, travel card).</p></list-item>
      </list>
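      <p>One possible in-memory representation of such a record is sketched below. The type name, the field names, and the string values are hypothetical and are not taken from the actual data files.</p>
      <preformat>
```python
from datetime import datetime
from typing import NamedTuple, Optional


class PassEvent(NamedTuple):
    """One validation record; all field names are illustrative assumptions."""
    timestamp: datetime            # date and time of the pass
    is_exit: bool                  # True for an exit, False for an entrance
    station: str                   # station where the pass was registered
    origin_station: Optional[str]  # filled only for exits
    tariff: str                    # "full" or "preferential"
    ticket_type: str               # "one-way", "round-trip" or "subscription"


def trip_of(event):
    """Return (origin, destination) for an exit record, otherwise None."""
    if event.is_exit:
        return (event.origin_station, event.station)
    return None
```
      </preformat>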
      <p>What was included in the first phase of our research? Firstly, the usage patterns of the stations. We can assume that there are differences in how passengers use railway stations. Moscow (more precisely, workplaces in Moscow) is the center of attraction, and accordingly we can expect that suburban stations (outside the city boundary in Figure 1) will show a peak in passenger entrances in the morning hours (Figure 2).</p>
      <p>These peaks at the entrances (see 1 in Figure 3) should, after the travel time t, flow into peaks at the exits of stations located in the city (see 2 in Figure 3).</p>
      <p>In general, the check-ins between the two peak hours, as well as after the second peak hour, probably correspond to non-obligatory activities. This is the simplest and most obvious pattern. By the attenuation of the peaks at the exits, we can determine at which stations those who come to the city leave the railway and transfer to other modes of transport. Of course, these stations will be different for each railway direction.</p>
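      <p>The entrance and exit peaks discussed above can be located with a simple hourly profile. The sketch below is only an illustration: the function names, the ISO timestamp format, and the choice of one-hour bins are assumptions.</p>
      <preformat>
```python
from collections import Counter
from datetime import datetime


def hourly_profile(iso_timestamps):
    """Count events per hour of day (index 0..23) for one station."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in iso_timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def peak_hour(profile, start, end):
    """Hour with the largest count inside the window [start, end)."""
    return max(range(start, end), key=lambda hour: profile[hour])
```
      </preformat>
      <p>Comparing the morning peak hour of entrances at a suburban station with the morning peak hour of exits at city stations on the same direction gives a rough estimate of the shift t.</p>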
      <p>Another possible direction for future research is the analysis of this damping.
At least, the primary results show that it is not absolutely stable for the chosen
direction. Passengers from time to time change their habits of leaving the train
in the morning. This lasts 1-2 days (not for all directions), after which everything
returns to the basic scheme.</p>
      <p>An additional finding in this connection is the detection of peaks at the exits of stations where the geo-information system does not show connections with other modes of transport. Why do passengers go to such a station? A possible explanation is the presence of some point of attraction (business center, shopping center, etc.).</p>
      <p>
        According to this simple model, we should see the opposite picture in the evening: peaks at the entrances to the "internal" (urban) stations and peaks at the exits (with a time gap) at the "external" stations. The finding here is that the picture is not symmetrical. Passengers do not necessarily leave from the stations they arrived at (of course, we operate only with quantitative differences, since there is no information on individual passengers). We can probably propose a natural explanation for this: there is some mobility after the end of the working day. Outgoing traffic (large peaks) is tied to stations with connections to other modes of transport (where transport accessibility is better). This corresponds, for example, to the results presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for a metro. According to the authors, for each metro station, the temporal trip patterns are influenced by the surrounding land uses. For example, homogeneity and high density of land uses around a metro station result in pronounced morning and afternoon peak hours in metro transportation. So each station has temporal patterns that can be characterized mainly by check-ins and check-outs during peak hours and working hours.
      </p>
      <p>Hypotheses that require verification: the asymmetry in traffic is greater in the warm season (higher mobility) and on Fridays (for the same reason).</p>
      <p>The analysis shows the presence of stations without pronounced peaks in the morning and evening hours. At present, the reasonable explanation is that these stations are not connected with work traffic. For example, outside the city this is more typical of stations serving holiday villages. In the city, it is typical of stations in former industrial zones (where mass housing construction is only beginning). Another possible explanation for the absence of peaks is that the station is linked to a large transport (interchange) node, where passenger traffic is always high (so commuting does not add anything significant to the constantly existing traffic).</p>
      <p>No station in the examined dataset showed a single regular peak at the entrance and/or exit. It should be noted that there is no reasonable "urban" explanation for such a hypothetical situation.</p>
      <p>Another point related to traffic is the confirmation (or refutation) of the move pattern found above on the data for days off (e.g., Saturday and Sunday). Obviously, for stations with predominantly "working" traffic, we should see no peaks in the morning and evening hours on weekends. Isolated spikes are possible and are most likely connected with mass events held over the weekend.</p>
      <p>
        To classify stations by traffic patterns, we can follow the model presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The standard score indicates how many standard deviations the volume of a metro station is above or below the mean. This means that we can use the mean of the standardized volume in the two peak hours and the working hours as a metric for exploring the characteristics of railway transportation by station. Check-ins (check-outs) in the two peak hours and the working hours can be compared with the mean value of the standardized volume across stations. There are three possible situations: the volume is below the mean, about the same as the mean, or above the mean. The figures can be calculated, for example, for each two-hour interval.
      </p>
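      <p>The three-way comparison with the mean can be sketched with standard scores as follows. This is an illustrative sketch only: the function name, the tolerance of half a standard deviation that defines "about the same as the mean", and the use of the population standard deviation are assumptions.</p>
      <preformat>
```python
from statistics import mean, pstdev


def classify_stations(volumes, tolerance=0.5):
    """Label each station by the standard score of its volume.

    volumes:   dict mapping station name to passenger count in one interval
               (for example, a two-hour window).
    tolerance: assumed half-width, in standard deviations, of the band
               treated as "about the mean".
    """
    mu = mean(volumes.values())
    sigma = pstdev(volumes.values())
    labels = {}
    for station, volume in volumes.items():
        # Standard score: distance from the mean in standard deviations.
        z = (volume - mu) / sigma if sigma > 0 else 0.0
        if z > tolerance:
            labels[station] = "above"
        elif -z > tolerance:
            labels[station] = "below"
        else:
            labels[station] = "about the mean"
    return labels
```
      </preformat>
      <p>Running the function separately for the morning peak, the working hours, and the evening peak gives each station a triple of labels that can serve as its usage class.</p>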
      <p>The next issue investigated in this work is the ratio of one-time tickets to reusable tickets (travel cards). The idea of this comparison is based on the following fact: reusable tickets (travel cards) are cheaper. Accordingly, those who travel constantly will most likely use them. Therefore, a greater percentage of travel cards corresponds to more constant (robust) traffic. Tickets are bought before the trip, so it is interesting to compare the ratio of one-time and reusable tickets at the entrances to the stations. In this way, stations with deviations from the average ratios of ticket types were identified. So far, the explanation being considered is the presence of interchange transport nodes near such stations, from which "random" railway passengers arrive.</p>
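      <p>A minimal sketch of this comparison is given below. The function names, the ticket type label "subscription", the unweighted average over stations, and the deviation threshold are assumptions chosen for illustration.</p>
      <preformat>
```python
def travel_card_share(entries):
    """Share of reusable tickets among entrance events, per station.

    entries: iterable of (station, ticket_type) pairs for entrances.
    """
    totals, cards = {}, {}
    for station, ticket_type in entries:
        totals[station] = totals.get(station, 0) + 1
        if ticket_type == "subscription":
            cards[station] = cards.get(station, 0) + 1
    return {s: cards.get(s, 0) / n for s, n in totals.items()}


def unusual_stations(shares, max_deviation=0.2):
    """Stations whose share differs from the unweighted average share
    by more than max_deviation (an assumed threshold)."""
    average = sum(shares.values()) / len(shares)
    return sorted(s for s, v in shares.items() if abs(v - average) > max_deviation)
```
      </preformat>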
      <p>One-time tickets are of two types: one-way and round-trip. The next possible step is to analyze the ratio of such tickets. Also interesting is the question of how the ratio of one-time tickets to travel cards changes over the days of the week (first of all, comparing working days and days off). The increase in the number of one-time tickets at weekends shows that on these days the railway really is "acquiring" new passengers who do not use it during the week.</p>
      <p>
        In addition to the above-mentioned Hellinger coefficient, we also used methods for analyzing the similarity of time series. According to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the measures of time series similarity can be categorized as lock-step, elastic, threshold-based, and pattern-based measures.
      </p>
      <p>The lock-step measure in this classification is the well-known Euclidean distance. It is defined as the square root of the sum of the squared differences between corresponding data points in two time series. As is widely noted in the literature, the main limitation of the Euclidean distance is that the time series must have the same length. This is not a problem in our case, because we can, for example, divide the day into five-minute intervals and thus construct time series of equal length for the number of entrances and exits.</p>
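      <p>The five-minute binning and the Euclidean distance can be sketched as follows. The function names and the ISO timestamp format are assumptions made for the example.</p>
      <preformat>
```python
import math
from datetime import datetime

BINS_PER_DAY = 24 * 60 // 5  # 288 five-minute intervals in a day


def five_minute_series(iso_timestamps):
    """Turn the event times of one day into a fixed-length count series."""
    series = [0] * BINS_PER_DAY
    for ts in iso_timestamps:
        t = datetime.fromisoformat(ts)
        series[(t.hour * 60 + t.minute) // 5] += 1
    return series


def euclidean_distance(a, b):
    """Square root of the sum of squared point-wise differences."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```
      </preformat>
      <p>Because every day is mapped to the same 288 bins, the series for entrances and exits, or for two different stations or days, can be compared point by point.</p>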
      <p>
        Elastic measures use dynamic programming to align sequences of different lengths. A typical example is dynamic time warping (DTW) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Threshold-based measures assume a user-provided threshold T and convert the sequence data into so-called threshold crossings. These are treated as points in a two-dimensional space composed only of data points above the threshold T. Pattern-based measures first find some representative matching segments (called local patterns) in a time series by focusing on amplitude and trajectories (up or down). This approach takes into account such factors as the number of local patterns, gap bound, time shifting factor, amplitude shifting factor, time scale factor, and amplitude scale factor [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
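      <p>As an illustration of an elastic measure, the classic dynamic-programming form of DTW is sketched below. It uses the absolute difference as the local cost and no warping window; both choices, and the function name, are assumptions rather than the configuration used in this work.</p>
      <preformat>
```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```
      </preformat>
      <p>Unlike the Euclidean distance, this measure accepts sequences of different lengths, at the price of quadratic time in the sequence length.</p>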
      <p>
        In our work, we successfully used a shape-based similarity measure, the so-called Angular Metric for Shape Similarity (AMSS) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This approach treats a time series as a vector sequence, focuses on the shape of the data, and compares data shapes using a variant of cosine similarity. It is illustrated in Fig. 4.
      </p>
      <p>
        The paper analyzes data on the entrances and exits of passengers at railway stations in the Moscow region. The main task considered in this work was building models that map the results of processing check-in/check-out data to the socio-economic aspects of the life of the inhabitants of the region. The article considers methods for detecting (extracting) usage patterns of railway stations linked with work traffic. We classified the railway stations according to the usage patterns obtained. We also propose an approach for assessing how changes in the city (for example, construction in former industrial zones) will be reflected in, and hence can be tracked through, the modes of use of railway stations. Similarity analysis of distributions and methods for measuring the similarity of time series were used as analysis tools. The results of the work are of practical use in the development of the system of urban railways in Moscow.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Namiot</surname>
            , <given-names>Dmitry</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Manfred</given-names>
            <surname>Sneps-Sneppe</surname>
          </string-name>
          .
          <article-title>"A Survey of Smart Cards Data Mining."</article-title>
          <ext-link ext-link-type="uri" xlink:href="http://ceur-ws.org/Vol-1975/paper33.pdf">http://ceur-ws.org/Vol-1975/paper33.pdf</ext-link>
          . Retrieved: Apr,
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Zhang, Chao, et al.
          <article-title>"Gmove: Group-level mobility modeling using geo-tagged social media."</article-title>
          <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . ACM
          ,
          <year>2016</year>
          , pp.
          <fpage>1305</fpage>
          -
          <lpage>1314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Namiot</surname>
            , <given-names>Dmitry</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Elena</given-names>
            <surname>Zubareva</surname>
          </string-name>
          .
          <article-title>"Data-driven Cities."</article-title>
          <source>International Journal of Open Information Technologies</source>
          <volume>4</volume>
          .12 (
          <year>2016</year>
          ):
          <fpage>79</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Çelikten</surname>
            , <given-names>Emre</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Géraud</given-names>
            <surname>Le Falher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Mathioudakis</surname>
          </string-name>
          .
          <article-title>"Modeling urban behavior by mining geotagged social data."</article-title>
          <source>IEEE Transactions on Big Data</source>
          3.2
          (
          <year>2017</year>
          ):
          <fpage>220</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Yue</surname>
            , <given-names>Yang</given-names>
          </string-name>
          , et al.
          <article-title>"Zooming into individuals to understand the collective: A review of trajectory-based travel behaviour studies."</article-title>
          <source>Travel Behaviour and Society</source>
          1.2 (
          <year>2014</year>
          ):
          <fpage>69</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gong</surname>
            , <given-names>Yongxi</given-names>
          </string-name>
          , et al.
          <article-title>"Exploring spatiotemporal characteristics of intra-urban trips using metro smartcard records."</article-title>
          <source>Geoinformatics (GEOINFORMATICS), 2012 20th International Conference on</source>
          . IEEE,
          <year>2012</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Yue</surname>
            , <given-names>Yang</given-names>
          </string-name>
          , et al.
          <article-title>"Mining time-dependent attractive areas and movement patterns from taxi trajectory data."</article-title>
          <source>Geoinformatics, 2009 17th International Conference on</source>
          . IEEE,
          <year>2009</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Ma, Xiaolei, et al.
          <article-title>"Understanding commuting patterns using transit smart card data."</article-title>
          <source>Journal of Transport Geography</source>
          <volume>58</volume>
          (
          <year>2017</year>
          ):
          <fpage>135</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ding</surname>
            , <given-names>Hui</given-names>
          </string-name>
          , et al.
          <article-title>"Querying and mining of time series data: experimental comparison of representations and distance measures."</article-title>
          <source>Proceedings of the VLDB Endowment</source>
          1.2
          (
          <year>2008</year>
          ):
          <fpage>1542</fpage>
          -
          <lpage>1552</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Namiot</surname>
            , <given-names>Dmitry</given-names>
          </string-name>
          .
          <article-title>"Time Series Databases."</article-title>
          <source>DAMDID/RCDL</source>
          .
          <year>2015</year>
          . pp.
          <fpage>132</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nakamura</surname>
            , <given-names>Tetsuya</given-names>
          </string-name>
          , et al.
          <article-title>"A shape-based similarity measure for time series data with ensemble learning."</article-title>
          <source>Pattern Analysis and Applications</source>
          16.4 (
          <year>2013</year>
          ):
          <fpage>535</fpage>
          -
          <lpage>548</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>