Collecting data with postal address on the Internet

Valentina R. Fedorova1 and Roman K. Fedorov2[0000-0002-2944-7522]

1 Irkutsk State University, Irkutsk, Russia
2 Matrosov Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences, Irkutsk 664033, Russia
fedorov@icc.ru

Abstract. An information system has been developed for collecting data with postal addresses. The system makes it possible to publish new information in the form of maps more promptly. Found addresses require expert judgment on their suitability and usefulness for publication. The software system can be applied to texts imported from different sources, such as Microsoft Word documents, PDF documents, social networks, etc.

Keywords: GIS, parsing postal address, web scraping, web data extraction, web information extraction.

1 Introduction

There are now many systems that present information on maps, for example 2GIS, Yandex.Maps and Google Maps. Essentially, these systems include a city map with an address plan and a description of organizations and services. There are also geoportals of municipal and regional authorities, which provide information about the economy, resources, infrastructure, etc. The main disadvantage of these systems is that they are slow to update: new information appears on the map after a long delay, or does not appear at all. Meanwhile, a huge amount of information is published on the Internet every day in the form of text with addresses. For example, news sites publish articles within a few hours of an event, posts with comments on specific events appear on social networks, and articles about work being carried out appear on municipal and regional portals every day. Therefore, it is useful to collect such data on the Internet automatically and publish it on a map.

2 Review of existing methods

Let us consider the existing text processing methods that search for postal addresses of geographic objects.

Parsing addresses with "fuzzy regular expressions". The main idea of this method is to create a dictionary of possible spellings of street names. The similarity of names is estimated by the Levenshtein distance [1]. Every expression and subexpression is enclosed in parentheses, and the type of expression is determined by the character following the opening parenthesis. For example, the symbol "=" means that the words must be present either in the given order or in reverse order, and expressions in parentheses with a "?" sign denote optional parts: if they are absent, they are skipped and do not affect the error count. This method takes into account spelling errors, permutations in street names (for example, st. October Revolution and st. Revolution October), and the omission of any part of the name. The disadvantages of this method are that expressions must be created manually for specific streets, and that only the string containing the address is submitted for parsing, not the full text [2].

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Parsing of postal addresses using the FIAS database (the Russian Federal Information Address System) [3]. In this method, a string containing the address is divided into address elements, and each element is checked in turn for presence in the FIAS database. If an element is in the database, its level in the hierarchy is remembered, and the next element is searched for with a larger hierarchy value and with PARENTGUID fixed to the GUID of the previously found element. However, this method does not parse free text; it can process only a line with a correctly formed address [4].

Parsing the postal address using neural networks [5]. In this method, the address is encoded as a number. A code is assigned to each category of information: strings, numbers, abbreviations, etc.
Next, the list of addresses is scanned, and addresses matching unique patterns are extracted from it. After that, each pattern indicates which category the analyzed part of the address belongs to. In this way a training dataset is prepared. The disadvantage of this method is that training a neural network requires a large training dataset of marked-up text.

3 Program system for collecting data with postal address

A software system has been developed that collects data on the Internet, searches for postal addresses and publishes them on a map. The software system consists of three components:

• a component for collecting data on the Internet, which downloads HTML pages, saves them in the file system, and parses them to find links to other HTML pages;
• a component for extracting postal addresses from text, which parses the text of HTML pages and extracts postal addresses;
• a component for geocoding addresses, which converts a postal address into geographic coordinates using geocoding services (for example, OpenStreetMap, Yandex, etc.).

Let us consider the component for extracting postal addresses from text in more detail. Its workflow is shown in Fig. 1.

Fig. 1. The scheme of the component work for extracting addresses from text.

Initialization. At this stage, data about settlements, streets and houses are loaded into arrays. It should be taken into account that the name of a settlement may consist of several separate parts, and that settlement names are not always written in the nominative case. Therefore, another array is created in which names are stored in parts with the endings cut off (in Russian, word endings change with case); that is, "Большое Голоустное" is stored as two separate items, "Больш" and "Голоустн". Each element of this array of name parts has a link to the full name. An example of the distribution of settlements over the two arrays is shown in Fig. 2.

Fig. 2. Name arrays.

Dividing into tokens. The file with the text of the HTML page is read and divided into tokens. Then each token is filtered and the HTML markup is discarded.

Extracting sentences from the text. The postal address is assumed to lie within one sentence. The text is divided into sentences, taking abbreviations into account.

Search for a postal address. The input of the algorithm is a sentence represented as a set of tokens. Each token in the sentence is checked for presence in the arrays of settlements (cities) and streets. If a match is found, the index of the array element is stored. If a generally accepted word for a settlement or its abbreviation is next to the current token, the token is checked only against settlements. Likewise, if a generally accepted word for a street or its abbreviation is next to the token, the token is checked only against streets. Each token can have several possible matches in the arrays. To store this information, a special object is used, the sentence context, which consists of possible settlements, streets and houses (see Fig. 3). If there are settlements in the context, the search for streets is performed only within these settlements.

Fig. 3. Sentence context.

Consider the following sentence: "Planned repair work is being carried out at the address Русская Аларь, ул. Заречная, 4". Since each token is matched against the database, the following values are included in the context for the "Русск" token: the settlements "Русская Аларь" and "Русский Мельхитуй", and the street "Русская". There are two values for the "Аларь" token: "Аларь" and "Русская Аларь". Since the generally accepted street abbreviation "ул" (like "st.") precedes the "Заречная" token, this token is checked only against the array of streets, so only "Заречная" street is included in the context.
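
The candidate lookup for this sentence can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy name lists, the crude ending list in `ENDINGS`, the marker sets, and the rule that a marker is looked for only on the immediately preceding token are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy data standing in for the settlement and street arrays.
SETTLEMENTS = ["Русская Аларь", "Русский Мельхитуй", "Аларь"]
STREETS = ["Русская", "Заречная"]

# Assumed, deliberately crude list of endings to cut off (longest first).
ENDINGS = ("ого", "ому", "ая", "ий", "ое", "ой", "ую", "ые",
           "ь", "и", "я", "е", "у", "а")

# Assumed marker words; the paper only mentions "ул" explicitly.
STREET_MARKERS = {"ул", "улица"}
SETTLEMENT_MARKERS = {"с", "село", "г", "город", "д", "деревня"}


def stem(word):
    """Lower-case a word and cut off one known ending, if any."""
    w = word.lower()
    for ending in ENDINGS:
        if w.endswith(ending) and len(w) - len(ending) >= 3:
            return w[:-len(ending)]
    return w


def build_index(names):
    """Map each stemmed name part to the full names that contain it."""
    index = defaultdict(list)
    for name in names:
        for part in name.split():
            index[stem(part)].append(name)
    return index


settlement_index = build_index(SETTLEMENTS)
street_index = build_index(STREETS)


def candidates(tokens):
    """Return (token, possible settlements, possible streets) per token."""
    result = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1].lower().rstrip(".") if i > 0 else ""
        key = stem(tok)
        settlements = list(settlement_index.get(key, []))
        streets = list(street_index.get(key, []))
        if prev in STREET_MARKERS:        # e.g. "ул. Заречная"
            settlements = []
        elif prev in SETTLEMENT_MARKERS:  # e.g. "с. Аларь"
            streets = []
        result.append((tok, settlements, streets))
    return result


for item in candidates(["Русская", "Аларь", "ул", "Заречная", "4"]):
    print(item)
```

With this toy data, "Русская" yields two candidate settlements and one candidate street, while "Заречная" is matched against streets only because it follows "ул". A production system would need a proper morphological stemmer instead of the ending list used here.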
Since "Заречная" street is found only in the settlement "Русская Аларь", it is linked only to that settlement, and since there is indeed a house with number 4 on this street, the house in the context is linked to "Заречная" street. Context items that are not linked to anything are removed from it.

In the case of a compound name of a settlement or a street, each part of the name is looked up in the array, and several candidate items may be found. If the names of these items consist of several parts, the presence of all these parts in the sentence is checked. If a complete match is found, the remaining items are discarded (see Fig. 4).

Fig. 4. Example of parsing compound name.

Suppose that the sentence contains the settlement name "Русская Аларь". Since each token is matched separately, the database returns the items "Аларь" and "Русская Аларь" for the "Аларь" token, and the items "Русская Аларь" and "Русский Мельхитуй" for the "Русская" token. For each of these items, the presence of all its constituent parts in the sentence is tested. Of all these items, only "Русская Аларь" remains.

After all settlements and streets have been saved in the context, the house number is searched for. Each token stores its position in the sentence. The house number is assumed to be immediately next to the street name, so the sentence string is truncated accordingly. In the resulting substring, a house number is matched by a regular expression. The matched numbers (for example 267/5, 18А, 144г/3) are checked for presence in the array of houses located on the streets that are in the context.

After all the sentence tokens have been processed, the context is analyzed. The context of a sentence can contain several related combinations of settlements, streets, and houses. If at least one part of the address is missing from a combination, the combination is incomplete and is discarded.
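
The house-number matching and the completeness check described above could look roughly like the following Python sketch. The paper does not give its regular expression, so `HOUSE_RE`, the `window` size used to truncate the sentence, and the function names are assumptions chosen to cover the example numbers 267/5, 18А and 144г/3.

```python
import re

# Assumed pattern: digits, an optional Cyrillic or Latin letter,
# and an optional "/digits" part; not followed by more word characters.
HOUSE_RE = re.compile(r"\b\d+[A-Za-zА-Яа-я]?(?:/\d+)?(?![\w/])")


def find_house(sentence, street_end, known_houses, window=20):
    """Look for a known house number right after the street name.

    street_end is the character position where the street name ends;
    window is an assumed limit on how far the number may be from it.
    """
    fragment = sentence[street_end:street_end + window]
    for match in HOUSE_RE.finditer(fragment):
        if match.group() in known_houses:
            return match.group()
    return None


def complete_combinations(combinations):
    """Keep only (settlement, street, house) triples with no missing part."""
    return [c for c in combinations if all(part is not None for part in c)]


sentence = "Русская Аларь, ул. Заречная, 4"
street_end = sentence.index("Заречная") + len("Заречная")
house = find_house(sentence, street_end, known_houses={"4", "18А"})

combos = [("Русская Аларь", "Заречная", house), ("Аларь", None, None)]
print(complete_combinations(combos))
```

Checking the match against the set of houses known for the street, rather than accepting any number, keeps unrelated figures in the sentence (dates, phone fragments) from being taken as house numbers.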
Initially, all possible variants of settlements are added to the context, but as soon as a street is found, all settlements that do not have such a street are removed from the context. If a street is not linked to any settlement, it is also removed from the context.

After incomplete combinations are removed, there may still be several variants in the context, so it is necessary to identify the most relevant ones. For this purpose, a score is computed for each element of the context. Compound names of settlements or streets are preferred: a compound name scores 2 points and a single-word name scores 1 point. The score is calculated by formula (1), where $x_1$ is the score of the settlement, equal to 2 if the settlement name is compound and 1 otherwise, and $x_2$ is the score of the street, equal to 2 if the street name is compound and 1 otherwise.

\[ f(x_1, x_2) = x_1 + x_2 \qquad (1) \]

This is done in order not to miss a compound name. For example, if the sentence contains "Октябрьской революции" Street, two candidates appear in the context: "Октябрьской революции" Street and "Революции" Street. Of these two streets, the compound name is the more relevant, so it receives a higher score. After the scoring procedure, only the values with the highest score remain in the context. If the context is fully assembled, that is, it contains exactly one instance of each of the three address components, the address is saved.

4 Testing

During a run of the address search system, 12918 URLs were found and 995 HTML pages were downloaded. The downloaded HTML pages contained 681 addresses. Fig. 5 shows a map with the tagged addresses and the URLs attached to them.

Fig. 5. Found addresses.

At the URL https://irkobl.ru/news/floods.php (see Fig. 6) the system found addresses of collection points for humanitarian aid to victims of floods in the Irkutsk region. Thus, the information system automatically generated a map of the collection points for humanitarian aid (see Fig. 7).

Fig. 6. HTML page https://irkobl.ru/news/floods.php.

Fig. 7. The collection points for humanitarian aid.

5 Conclusion

The developed information system for collecting data with postal addresses makes it possible to publish new information in the form of maps more promptly. Found addresses require expert judgment on their suitability and usefulness for publication. The software system can be applied to texts imported from different sources, such as Microsoft Word documents, PDF documents, social networks, etc.

The work was carried out with the support of RAS (projects: AAAA-A17-117032210079-1, AAAA-A19-119111990037-0), RFBR (projects: 18-07-00758-а, 17-57-44006-Mong-a) and the Ministry of Science and Higher Education of the RF, the grant for implementation of large scientific projects on priority areas of scientific and technological development (project no. 13.1902.21.0033). The results were obtained using the Centre of collective usage «Integrated information network of Irkutsk scientific educational complex».

References

1. Levenshtein V.I.: Binary codes with corrected dropouts, insertions and character replacements. Reports of the USSR Academy of Sciences (1965).
2. Parsing addresses with fuzzy regular expressions, https://habr.com/ru/post/192518 (last accessed 2020/05/08).
3. Parsing postal addresses from a string in C#, https://habr.com/ru/post/232347 (last accessed 2020/05/08).
4. Parsing the postal address into components, https://basegroup.ru/community/articles/addresses (last accessed 2020/05/08).
5. Author, F.: Contribution title. In: 9th International Proceedings on Proceedings, pp. 1–2. Publisher, Location (2010).