Methods of Automating the Construction of Systems for Collecting Data from the Internet

Ageykin M.A., Andrianov A.V., Chugunov V.R., Lychagin K.A., Novopashin M.A.

JSC EC-leasing, Moscow, Varshavskoe shosse 125, mageykin@ec-leasing.ru

Abstract. The article describes a set of methods for automating the construction of specialized Internet crawlers that collect data from both static and dynamic web pages. The described methods combine well with a micro-service architecture in large projects.

Keywords: Internet, html, scrapy, xhtml, Ajax, crawling

According to IBM's strategic forecast, over the next five years all companies will be divided into winners and losers depending on the quality of their corporate decision making. Research and case studies provide evidence that a well-designed and appropriate computerized decision support system can encourage fact-based decisions, improve decision quality, and improve the efficiency and effectiveness of decision processes. There is one resource we all have in abundance: a large amount of open data, both structured and unstructured. This report introduces the concept of acquiring data from big data sources such as social media, news, mobile and smart devices, weather information, and data collected via sensors, and of using this data to bring predictive analytics to a new level of quality. In many companies predictive analytics is today perceived as an evolutionary step in business analytics and is used primarily to build forecasts on the same data on which reports are built. This, however, ignores the enormous importance of external factors in forecasting and nowcasting. This report presents cases from various areas in which external data form the basis of predictive analytics and make it possible to obtain results unattainable for forecasts based only on enterprise data.

Customers are increasingly asking what data can be used to improve the quality of analytics and forecasting. Strange as it may seem, there is a lot of such data, but even data that can be accessed legally rarely has an API for retrieval, so the data often has to be collected from various sites that generally differ considerably from one another. The authors would like to share their experience in the construction of a typical parsing project. If there are strict requirements on the quality of the collected data, the data is gathered by specialized crawlers tuned to specific sources. For each data source a separate crawler is created (a class written for a particular site or group of sites). The following handlers are defined in a crawler:

• a handler for a specific page that isolates the semantic information, separating it from the page design;
• a handler that scans links and sends requests to the queue for downloading and further processing of pages;
• a handler for the archive or site map (if available) that walks the required time interval of records and queues page requests; pages are then crawled on a schedule and their content is retained.

Specialized crawlers usually differ depending on the type of sites they have to work with, and first of all on whether the pages are static or dynamic. In the case of static pages we generally use Python tools such as the Scrapy framework or RoboBrowser for crawling, link extraction, and content parsing. Scrapy is a web crawling framework that has already done all the heavy lifting needed to write a crawler. It is the most effective way to create crawlers, and probably no article on crawling today can do without mentioning this framework. RoboBrowser combines the best of two excellent Python libraries, Requests and BeautifulSoup: it represents browser sessions using Requests and HTML responses using BeautifulSoup, transparently exposing the methods of both libraries. RoboBrowser is usually convenient when we want the crawler to behave as much like a user as possible and not to arouse suspicion of being a robot. Regardless of the technology chosen, you create a spider that performs GET requests, extracts data from the HTML documents, and processes and exports the data. To work with the HTML you need to study the structure of the site and describe for the spider which pages to visit and which information to collect, using XPath or CSS selectors. There is no universal answer as to whether XPath or CSS selectors are better; this is primarily a matter of personal preference. The authors recommend XPath, but in a number of projects the CSS selector can be more convenient, so one needs to know both methods of addressing HTML elements.
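As an illustration, a minimal Scrapy spider for a static news archive might be sketched as follows. The start URL, element classes, and field names are hypothetical placeholders rather than a description of any real site and would have to be adapted to the concrete source; the sketch deliberately mixes XPath and CSS addressing to show both methods.

import scrapy


class NewsArchiveSpider(scrapy.Spider):
    # All site-specific names below (URL, element classes) are illustrative
    # placeholders and have to be adapted to the concrete source.
    name = "news_archive"
    start_urls = ["http://example.com/archive/"]

    def parse(self, response):
        # XPath addressing: take every teaser link on the archive page
        # and queue the linked article pages for further processing.
        for href in response.xpath('//div[@class="news-item"]//a/@href').getall():
            yield response.follow(href, callback=self.parse_article)
        # CSS addressing: follow pagination to walk the whole archive.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response):
        # Separate the semantic content from the page design.
        yield {
            "url": response.url,
            "title": response.xpath("//h1/text()").get(),
            "text": " ".join(response.css("div.article-body ::text").getall()),
        }

Such a spider can be run with the standard command line tooling (for example, scrapy runspider news_archive_spider.py -o items.json), and the exported items can then be handed to the rest of the processing pipeline.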
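For the RoboBrowser variant, where the goal is for the crawler to look as much like an ordinary browser session as possible, a session might be sketched as follows; again, the URL, the element classes, and the form field name are illustrative assumptions rather than parts of a real site.

from robobrowser import RoboBrowser

# A desktop user agent string makes the session look like an ordinary browser;
# the target URL, element classes and form field name are illustrative only.
browser = RoboBrowser(
    parser="html.parser",
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    history=True,
)

browser.open("http://example.com/archive/")

# HTML responses are exposed as BeautifulSoup objects, so CSS selection
# and text extraction work exactly as in BeautifulSoup.
for link in browser.select("div.news-item a"):
    print(link.get("href"), link.get_text(strip=True))

# Forms can be filled in and submitted the way a user would do it.
search_form = browser.get_form(action="/search")
if search_form is not None:
    search_form["query"].value = "leasing"
    browser.submit_form(search_form)
    print(browser.url)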
However, in the modern world working with static sites alone no longer gives access to all the required information. Many modern sites have so-called dynamic pages, on which the information is updated without reloading the page, by executing JavaScript code that replaces the displayed part. In this case the loaded page itself will most likely not contain any content. The content to be displayed is obtained after the page has loaded through Ajax requests, and not necessarily in HTML format; most likely the data will arrive in JSON format and then be rendered on the page in accordance with the markup. There are two approaches to retrieving information from such sites.

A simple but resource-intensive method is preliminary JavaScript rendering with subsequent analysis and retrieval of information from the resulting HTML page. In this case you need a browser that handles JavaScript, and in principle any browser is suitable: Internet Explorer, Mozilla Firefox, Google Chrome, PhantomJS, and so on; alternatively, you can build your own on top of node.js. These browsers are launched, and the contents of the Internet pages are obtained from them, through Selenium and the corresponding browser drivers.

PhantomJS was originally designed to render JavaScript without displaying the content to the user, so for the data collection task it works faster than the others and does not open additional windows regardless of the operating system. Its main disadvantages are that debugging is more laborious because of the lack of a visual interface, and that rendering problems are possible, since many developers do not follow the standards and aim only at making their sites work in the popular browsers.

Google Chrome currently seems the most promising and convenient option. It is built on the most widespread browser engine family, WebKit (in the form of its Blink fork); websites are primarily optimized for Chrome, and it is also convenient for debugging thanks to the visual display of all actions. In addition, a headless mode has recently been added. For now it works only on Linux and macOS; the MS Windows version with headless support is expected in October of this year.
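A sketch of this rendering approach using Selenium with headless Chrome is given below. The page address and the selector are illustrative, the exact way options are passed can differ slightly between Selenium versions, and running it requires a matching ChromeDriver binary to be available.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Headless mode keeps Chrome from opening a window, which is what we want
# on a crawling server; the page address and selector are illustrative.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--disable-gpu")

driver = webdriver.Chrome(options=options)
try:
    driver.get("http://example.com/dynamic-news/")
    # Give the page's JavaScript time to request and render the content.
    driver.implicitly_wait(10)
    for item in driver.find_elements(By.CSS_SELECTOR, "div.news-item a"):
        print(item.get_attribute("href"), item.text)
    # The fully rendered document can also be passed on to the same
    # parsing code that is used for static pages.
    rendered_html = driver.page_source
finally:
    driver.quit()

Switching to another browser, such as Firefox or PhantomJS, generally means changing only the driver class and its options.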
The second approach requires additional knowledge about the site code responsible for filling the page with information. Consider the example of the site www.rbc.ru. To analyze the information exchange between the page in the browser and the server, one can use Firefox with the Firebug plugin; any other similar product is also suitable. The plugin is opened with the F12 hot key, and we are interested in its "Network" section, which shows all the additional requests made by the page. Consider one of the news sections, http://www.rbc.ru/spb_sz/. This page contains a table of contents with links to the pages holding the text of the news. Scrolling the page down to the end, you can notice that at a certain moment additional news items are loaded. The network panel shows that the first two requests go to a domain that is not visually associated with the site and that the last three requests download pictures. We are interested in the Ajax GET request to rbc.ru:

http://www.rbc.ru/filter/ajax?region=spb_sz&offset=10&limit=12

Three parameters are transmitted by the GET method. In this case nothing is transmitted by the POST method, but if you need to fill out a form or refine the data with filters, that method can be used as well. Here all three parameters are self-explanatory: the region, the offset relative to the latest news, and the number of news items per request. In this example the response is HTML code with an understandable structure, from which links to the news pages can be selected. This HTML is then processed in the same way as any static web page.

In this article we have examined the basic options for creating specialized crawlers, which are often used to obtain open data and for competitive intelligence, for example monitoring competitors' prices and the dynamics of their changes. Such crawlers store information in a structured form, so it is convenient to keep it in relational databases, and the results they obtain do not require cognitive technologies for further processing.