<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Networks: The Case of St. Petersburg</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Boris Nizomutdinov</string-name>
          <email>boris-wels@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Begen</string-name>
          <email>petyabegen@mail.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daria Lipatova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>Kronverksky Pr. 49, Saint-Petersburg, 197101</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>235</fpage>
      <lpage>240</lpage>
      <abstract>
        <p>The paper presents an approach to extracting the address and type of incident from a text data array built from social network posts. Data were collected from the Vkontakte community dedicated to incidents in St. Petersburg. A total of 48,943 records were collected and processed. A service was developed for automatic recognition of the post topic and extraction of the address (if available) using natural language processing and machine learning methods based on the free natasha library for Russian-language texts. The extracted addresses were geocoded with the Geocoding API service from Google, and an array in GeoJSON format was obtained, which allows working with the dataset in various map services in real time.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language processing</kwd>
        <kwd>address extraction</kwd>
        <kwd>parser</kwd>
        <kwd>natasha</kwd>
        <kwd>yargy</kwd>
        <kwd>geocoding</kwd>
        <kwd>GeoJSON</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Currently, the interconnection of heterogeneous urban elements is supported primarily by
information technologies that provide communication links between residents, the management sector
and infrastructure. On the one hand, the modern information space allows residents to constantly
observe the life of the city, to meet their needs and interests in a mobile way, and on the other hand, it
creates high expectations in relation to the urban environment and increases the level of responsibility
of city authorities.</p>
      <p>Due to the lack of capacity to limit the impact of hazards, many cities still face a high level of
threats. As threats to cities increase, improving the resilience of cities becomes a major challenge. To
increase the sustainability of cities, there is a growing need for information that is relevant to all
stages of urban development. Thus, a better understanding of the spatio-temporal patterns of public
response is a key step towards reducing damage and improving the resilience of cities.</p>
      <p>
        For example, a team of researchers from New York University analyzed incident data from two different
sources: a traditional data provider that collects incident reports from multiple agencies, and user
messages posted on Twitter during Hurricane Sandy, which flooded many areas of New York in 2012 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The results showed that Twitter can provide detailed information about the location of a particular
incident, as well as its intensity, duration, and other characteristics.
      </p>
      <p>
        In recent years, interest in Twitter has been growing because its data is published in real
time. Microblogs are increasingly attracting attention as an important source of information in
emergency management. Twitter is used to predict accidents, natural disasters, and traffic [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For example, in China, local floods are studied using geodata from Twitter [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There are methods
for monitoring the traffic situation in real time based on social network data using modern
machine learning algorithms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Metro passenger flow forecasting is strategically important in the
management of the metro transit system. Predicting the occurrence of events is a very difficult
task, so passenger transport forecasting is now being developed on the basis of social
network data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In most cases, Twitter data is used [
        <xref ref-type="bibr" rid="ref6 ref7 ref8 ref9">6-9</xref>
        ]; the use of the social network Vkontakte has
not yet become widespread.
      </p>
      <p>The purpose of this work is to study and analyze methods for extracting addresses from
Russian-language text messages about incidents in the social network Vkontakte in order to generate geospatial
data in the GeoJSON format, which can later be used in GIS or in the hardware-software
complex “Safe City”.</p>
      <p>
        The result of this approach is a map of the distribution of incidents in the city, since each post
yields two entities: the address and the type of incident. At the first stage of the study, five main topics
were selected: Car theft, Accident, Fire, Robbery and Assault. Using this approach, we can identify
the most dangerous or problematic areas in the city, for example the area where the most thefts occur [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Although this information is available in official sources, it is not always accessible to
urbanists and researchers and does not always reflect the current state of the city. This approach can
expand the data set available to researchers and citizens.
      </p>
      <p>In addition, the data obtained can be used in the hardware-software complex “Safe City”, since
an incident message often does not arrive in the system immediately, whereas this approach allows
information to be generated online. For example, information about an accident may be included in the
statistics only after a week or even a month if the accident was registered according to the Euro protocol. Active
citizens, however, quickly report road accidents in this community, which allows
problem areas in the city to be identified online.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Development of an address extraction and incident recognition service</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Data preparation</title>
      <p>In the case of St. Petersburg, we considered the possibility of extracting data about incidents that
citizens write about in the social network called Vkontakte. The community about road accidents and
emergencies in St. Petersburg (https://vk.com/spb_today) was chosen as a site for the study.</p>
      <p>The collection of information for the study was carried out using a set of tools that included the
Vkontakte API, a content parser, and a public service. The Vkontakte API method “wall.get” returns
a list of posts from a user's or community's wall; using this method, all posts
in the community can be collected. Research on social networks raises the complex problem of
personal data security during parsing. According to Art. 3 of the federal law "On
Personal Data", personal data is "any information related directly or indirectly to a certain or identifiable
individual (subject of personal data)". No personal information was collected or stored in this study.
A total of 48,943 records were collected.</p>
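      <p>Collecting a whole community wall with “wall.get” requires paging, because a single call returns a limited number of posts. The following is a minimal sketch of how the page requests could be built, not the collection script used in the study. It assumes a page size of 100 posts per call; the API version string, the placeholder token, and the helper name build_page_urls are illustrative assumptions. No network request is made.</p>
      <preformat>
```python
from urllib.parse import urlencode

API_URL = "https://api.vk.com/method/wall.get"
PAGE_SIZE = 100  # assumed per-call limit for wall.get


def build_page_urls(domain, total_posts, token, version="5.131"):
    """Return one wall.get request URL per page of posts (hypothetical helper)."""
    urls = []
    for offset in range(0, total_posts, PAGE_SIZE):
        params = {
            "domain": domain,            # short name of the community
            "offset": offset,            # index of the first post on this page
            "count": min(PAGE_SIZE, total_posts - offset),
            "access_token": token,       # placeholder, not a real token
            "v": version,                # assumed API version
        }
        urls.append(API_URL + "?" + urlencode(params))
    return urls


# 48,943 is the number of records reported in the text.
urls = build_page_urls("spb_today", 48943, token="HYPOTHETICAL_TOKEN")
print(len(urls), "requests needed")
```
      </preformat>
      <p>Under these assumptions, the wall is covered by consecutive offsets, and the last request asks only for the remaining posts.</p>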
      <p>The collected sample underwent preliminary automatic processing, which
consisted of deleting entries starting with the words “News of our metropolis:”. These
daily news reports were excluded from the final sample for analysis because they only
summed up the results of the day. The size of the final sample after deleting such posts was 48,408
entries.</p>
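      <p>The filtering step described above can be sketched as a simple prefix check. This is an illustration only: the prefix is the English rendering used in this paper (the original posts are in Russian), and the function name and sample posts are invented for the example.</p>
      <preformat>
```python
DIGEST_PREFIX = "News of our metropolis:"  # English rendering of the digest header


def drop_daily_digests(posts, prefix=DIGEST_PREFIX):
    """Keep only posts whose text does not start with the digest prefix."""
    return [text for text in posts if not text.lstrip().startswith(prefix)]


# Invented sample posts, not taken from the dataset.
posts = [
    "News of our metropolis: summary of the day",
    "Accident on Moscow Street, two cars collided",
    "  News of our metropolis: evening digest",
    "Fire in a residential building",
]
filtered = drop_daily_digests(posts)
print(len(posts) - len(filtered), "digest posts removed")
```
      </preformat>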
    </sec>
    <sec id="sec-4">
      <title>2.2. Using natasha library and yargy parser for extracting address and incident topic from posts</title>
      <p>One of the tasks of the research was the development of tools for automatic processing and
analysis of the array of posts from social networks. The main functionality of the tool is the
automatic detection of the topic of a post and recognition of the address or its component parts:
street, block, house number, district, etc. To obtain initial results, it was decided to
recognize five post topics: Car theft, Accident, Fire, Robbery and Assault.</p>
      <p>
        To develop the toolkit, we used the open neural network library natasha (v.1.4.0), which was
updated in 2020, for recognizing addresses in Russian-language text [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. To recognize one of the five
incident topics, a yargy parser was configured. To extract user-defined entities in yargy, special rules are
created using context-free grammars and specialized dictionaries. As part of the research work, simple
rules were added with ready-made parser predicates that recognize words indicating the topic:
“theft, stolen, stealing, car theft” for Car theft, or “accident, road accident, lapped, collided, crash” for
Accident.
      </p>
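      <p>The idea behind such dictionary rules can be illustrated with a simplified keyword matcher. Note that this sketch deliberately does not use yargy: it is a plain standard-library stand-in that matches English keywords as whole words, whereas the actual rules operate on Russian word forms through yargy predicates. The dictionary layout and the function name detect_topic are assumptions made for the example.</p>
      <preformat>
```python
import re

# Keyword lists taken from the examples in the text; the real rules used
# yargy predicates over Russian word forms, which this sketch does not do.
TOPIC_KEYWORDS = {
    "Car theft": ["car theft", "theft", "stolen", "stealing"],
    "Accident": ["road accident", "accident", "lapped", "collided", "crash"],
}


def detect_topic(text):
    """Return the first topic whose keyword occurs as a whole word, else None."""
    lowered = text.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        for keyword in keywords:
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                return topic
    return None


print(detect_topic("Two cars collided on the ring road"))
print(detect_topic("A car was stolen from the yard"))
print(detect_topic("Concert in the city centre"))
```
      </preformat>
      <p>A post that matches none of the keyword lists gets no topic, which corresponds to the “Other” category in the sample.</p>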
      <p>To recognize the topics of posts, we used custom rules for the yargy parser. As a result of
setting up the parser, a sample of posts was obtained that contained only the five topics.
Figure 1 shows the ratio of posts with the five selected topics to the total number of posts.</p>
      <p>Figure 1: Share of posts by topic: Car theft 2%, Accident 25%, Fire 4%, Robbery 0%, Assault 0%, Other 69%.</p>
      <p>Based on this sample, the average
accuracy with which the yargy parser recognized the five selected topics was calculated in semi-automatic mode (SQL query + manual markup). The results are
presented in Table 1.</p>
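      <p>One possible way to compute such a per-topic figure is to compare the topic assigned by the parser with the manually assigned topic for each checked post. The sketch below is an assumption about the calculation, since the text does not give the exact formula; the function name and the sample labels are made up and are not the paper's data.</p>
      <preformat>
```python
from collections import defaultdict


def per_topic_accuracy(pairs):
    """pairs: iterable of (parser_topic, manual_topic), one per checked post."""
    checked = defaultdict(int)
    correct = defaultdict(int)
    for parser_topic, manual_topic in pairs:
        checked[parser_topic] += 1
        if parser_topic == manual_topic:
            correct[parser_topic] += 1
    # Share of parser decisions confirmed by manual markup, per topic.
    return {topic: correct[topic] / checked[topic] for topic in checked}


# Made-up labels for illustration only.
sample = [
    ("Accident", "Accident"),
    ("Accident", "Accident"),
    ("Accident", "Fire"),
    ("Accident", "Accident"),
    ("Robbery", "Robbery"),
    ("Robbery", "Assault"),
]
print(per_topic_accuracy(sample))
```
      </preformat>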
      <p>The average accuracy of determining the topics in the text using the configured yargy
parser is good (more than 90%, except for Robbery). Among the disadvantages of this approach is the
long running time of the parser, which suggests that the
algorithm itself is slow, especially as the data sample grows. Thus, the next stage will
require optimization of the parsing rules and more efficient data processing.</p>
      <p>To recognize addresses in the text, the built-in “AddrExtractor” extractor from the natasha library
was used. Address recognition was performed on a sample of posts with recognized
topics. A total of 15,167 records were selected in the sample. To calculate the average recognition
accuracy, an address was considered recognized if at least one of its parts
(for example, street name, house number, etc.) was recognized in the text of the post. The
results of address recognition are shown in Table 2.</p>
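      <p>The “at least one part” criterion can be expressed in a few lines. In this sketch the extractor output is represented as a plain list of (type, value) pairs per post; this representation, the function names, and the sample values are assumptions for illustration and do not reproduce the natasha API or the study's data.</p>
      <preformat>
```python
def is_address_recognized(parts):
    """An address counts as recognized if at least one part was extracted."""
    return bool(parts)


def recognition_rate(parts_per_post):
    """Share of posts in which at least one address part was found."""
    recognized = sum(1 for parts in parts_per_post if is_address_recognized(parts))
    return recognized / len(parts_per_post)


# Hypothetical extractor output: one list of (type, value) parts per post.
extracted = [
    [("street", "Moscow"), ("house", "10")],
    [("street", "Moscow")],
    [],                     # nothing recognized in this post
    [("house", "5")],       # a house number alone still counts
]
print(recognition_rate(extracted))
```
      </preformat>
      <p>Because a single part is enough, this measure is lenient: a post with only a house number counts as recognized even though it cannot be placed on a map reliably.</p>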
      <p>The table shows that the average accuracy of address recognition in Russian text is
satisfactory for solving the problem (more than 75%), but in the future the algorithm and rules for
address recognition need to be refined to improve the accuracy and quality of the
recognized addresses. It was also noted that the highest
accuracy is achieved when a name such as “Moscow” is accompanied by the type “street”, for example, in the
format “Moscow Street”. However, if the marker word “street” is removed, the recognition accuracy
decreases significantly, even if there are other markers nearby, such as “house number”,
“building”, and other parts of the address.</p>
      <p>As a result of the analysis, we can conclude that open and free
ready-made solutions with flexible rule settings are sufficient for
this kind of analysis.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3. Representing incidents from the posts on a map</title>
      <p>
        The next step was geocoding the recognized addresses and formatting the dataset as GeoJSON.
For this purpose, we used the Geocoding API service from Google [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which converts address parts
into coordinates. One disadvantage of the Geocoding API is that it returns standard JSON, so
we also needed to convert the result to GeoJSON format afterwards.
      </p>
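      <p>The conversion from a geocoding response to GeoJSON can be sketched as follows. The sketch assumes the usual shape of a Geocoding API JSON response (a status field and a results list whose items carry geometry.location with lat and lng) and works on an already parsed, trimmed response, so no request is sent. The function names are illustrative, and the coordinates are those of the first point in the GeoJSON sample in this section.</p>
      <preformat>
```python
import json


def geocode_result_to_feature(response, topic):
    """Turn a parsed Geocoding API response into a GeoJSON Feature, or None."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    location = response["results"][0]["geometry"]["location"]
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {
            "type": "Point",
            "coordinates": [location["lng"], location["lat"]],
        },
        "properties": {"topic": topic},
    }


def to_feature_collection(features):
    """Wrap the successfully geocoded features in a FeatureCollection."""
    kept = [feature for feature in features if feature is not None]
    return {"type": "FeatureCollection", "features": kept}


# Trimmed example response used only for illustration.
response = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 59.931058, "lng": 30.36091}}}],
}
not_found = {"status": "ZERO_RESULTS", "results": []}

collection = to_feature_collection([
    geocode_result_to_feature(response, "Accident"),
    geocode_result_to_feature(not_found, "Fire"),
])
print(json.dumps(collection))
```
      </preformat>
      <p>Addresses that could not be geocoded are skipped rather than written with empty coordinates, which keeps the resulting file loadable in map services.</p>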
      <p>As we need to store the topic and show it on a map, we used the Feature and FeatureCollection objects
defined in the GeoJSON specification. Here is an example of the format we used:</p>
      <p>{"features": [{"geometry": {"coordinates": [30.36091, 59.931058], "type": "Point"}, "properties":
{"topic": "nan"}, "type": "Feature"}, {"geometry": {"coordinates": [30.516726, 59.73777], "type":
"Point"}, "properties": {"topic": "Угон"}, "type": "Feature"}], "type": "FeatureCollection"}</p>
      <p>The dataset of 15,167 records with all recognized addresses was geocoded and converted to
GeoJSON. Figure 2 shows the points for each of the five incident
topics.</p>
      <p>Figure 2: Car theft (a); Accident (b); Assault (c); Robbery (d); Fire (e)</p>
      <p>Meanwhile, we discovered that not all address parts were recognized properly: for instance,
a post may contain only a house number without a street, or only a street, which can be long. Thus,
future research needs to increase the recognition accuracy and quality of address part
extraction.</p>
    </sec>
    <sec id="sec-6">
      <title>3. Conclusions and Discussion</title>
      <p>
        The selected source of information in the social network Vkontakte showed that users generate a
large amount of information about incidents. In Russia, Vkontakte is more popular than Twitter [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
which is why it is important for researchers to have tools to work with this social network. The
method has shown its promise, and the data obtained can be used by both researchers and
representatives of government departments.
      </p>
      <p>During the research work, a toolkit was developed for automatic recognition of the post topic and
addresses. The experiment confirmed the effectiveness of the selected open library
natasha (v.1.4.0) with the yargy parser, which managed to extract the topic and address from the text of
the posts. With the help of the Geocoding API from Google, we obtained the coordinates of the
addresses and converted the geocoding result into the standard GeoJSON format, which allows us to use
this data in different map services, GIS, and the hardware-software complex “Safe City”. In
the future, it is planned to enlarge and improve the data sample by using methods of automatic
collection and extraction of entities, and to improve the accuracy of extraction and recognition of post
topics and address parts.</p>
    </sec>
    <sec id="sec-7">
      <title>4. Acknowledgement</title>
      <p>The work was done under the research topic of ITMO University No. 620179 “Development of a
map service for monitoring residents' needs in urban infrastructure development using
automated data processing systems from social networks”.</p>
    </sec>
    <sec id="sec-8">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kurkcu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Morgul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ozbay</surname>
          </string-name>
          ,
          <article-title>Crowdsourcing Incident Information for Disaster Response using Twitter</article-title>
          ,
          <source>Transportation Research Board</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ammari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Petalas</surname>
          </string-name>
          ,
          <article-title>Traffic Event Detection Framework Using Social Media</article-title>
          ,
          <source>in: International Conference on Smart Grid and Smart Cities</source>
          ,
          <year>2017</year>
          . doi:10.1109/ICSGSC.2017.8038595.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.P.Y.</given-names>
            <surname>Loo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhen</surname>
          </string-name>
          , G. Xie,
          <article-title>Urban resilience from the lens of social media data: Responses to urban flooding in Nanjing, China</article-title>
          ,
          <source>Cities</source>
          , volume
          <volume>106</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . URL: https://doi.org/10.1016/j.cities.2020.102884.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Patra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <article-title>A City Traffic Dashboard using Social Network Data</article-title>
          ,
          <source>in: the 2nd IKDD Conference</source>
          ,
          <year>2015</year>
          . doi:10.1145/2778865.2778873.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Forecasting the Subway Passenger Flow Under Event Occurrences With Social Media</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          ,
          <year>2016</year>
          , PP(
          <volume>99</volume>
          ):
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . doi:10.1109/TITS.2016.2611644.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>From Twitter to detector: Real-time traffic incident detection using social media data</article-title>
          ,
          <source>Transportation Research Part C: Emerging Technologies</source>
          , volume
          <volume>67</volume>
          ,
          <year>2016</year>
          , pp.
          <fpage>321</fpage>
          -
          <lpage>342</lpage>
          . URL: https://doi.org/10.1016/j.trc.2016.02.011.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hawelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sitko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Beinat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sobolevsky</surname>
          </string-name>
          ,
          <article-title>Geo-Located Twitter as Proxy for Global Mobility Patterns</article-title>
          ,
          <source>Cartography and Geographic Information Science</source>
          <volume>41</volume>
          (
          <issue>3</issue>
          ),
          <year>2013</year>
          . doi:10.1080/15230406.2014.890072.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dabiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Heaslip</surname>
          </string-name>
          ,
          <article-title>Developing a Twitter-based traffic event detection model using deep learning architectures</article-title>
          ,
          <source>Expert Systems with Applications</source>
          , volume
          <volume>118</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>425</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alomari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mehmood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Katib</surname>
          </string-name>
          ,
          <article-title>Road Traffic Event Detection Using Twitter Data, Machine Learning, and Apache Spark</article-title>
          ,
          <source>in: The 3rd IEEE International Conference on Smart City Innovations (SCI 2019)</source>
          ,
          <year>2019</year>
          . doi:10.1109/SmartWorld-UIC-ATC-SCALCOM-IOPSCI.2019.00332.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez-Osorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pedraza</surname>
          </string-name>
          ,
          <article-title>Modern data sources and techniques for analysis and forecast of road accidents: A review</article-title>
          ,
          <source>Journal of Traffic and Transportation Engineering (English Edition)</source>
          <volume>7</volume>
          (
          <issue>4</issue>
          ),
          <year>2020</year>
          , pp.
          <fpage>432</fpage>
          -
          <lpage>446</lpage>
          . URL: https://doi.org/10.1016/j.jtte.2020.05.002.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>natasha/natasha: Solves basic Russian NLP tasks, API for lower level Natasha projects</article-title>
          ,
          <year>2021</year>
          . URL: https://github.com/natasha/natasha.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <article-title>Overview | Geocoding API | Google Developers</article-title>
          ,
          <year>2021</year>
          . URL: https://developers.google.com/maps/documentation/geocoding/overview?hl=ru.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <article-title>Social networks in Russia: figures and trends, autumn 2020</article-title>
          ,
          <year>2021</year>
          . URL: https://branalytics.ru/blog/social-media-russia-2020.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>