Intelligent Instrumentation for Opinion Mining in Social Media N Yarushkina1, A Filippov1, V Moshkin1, G Guskov1 and A Romanov1 1 Ulyanovsk State Technical University, Ulyanovsk, Russia Abstract. The paper presents a developed intelligent tool for Opinion Mining of social media. In addition, the article presents new algorithms to the hybridization of ontological analysis and methods of knowledge engineering with methods of nature language processing (NLP) for ex- tracting the semantic and emotional component of semi-structured and unstructured text re- sources. These approaches will improve the efficiency of the analysis of social media content- specific data and fuzziness of natural language. 1. Introduction Active growth of social media audience on the Internet (social networks, forums, blogs and online media) made them a new source of data and knowledge. The specifics of working with social media has several advantages and disadvantages. Advantages include: ─ high speed of access to information; ─ a broad audience; ─ a wide range of data topics; ─ large amount of data. The disadvantages are: ─ large amount of data; ─ unstructured presentation of information; ─ absence of a single conceptual framework. A large amount of social media data is both an advantage and a disadvantage at the same time. Monthly in Russian social networks about 30 million unique authors publish 580 billion messages according to statistics for 2017. However, a large amount of data makes it possible to obtain a large training sets, for machine learning methods and a large statistical sample for social studies. The monthly billions of unstructured text messages and publications that users leave monthly can- not be processed manually. There is a need for methods of automated intelligent and sentimental analysis of text data. These methods handle large amounts of data and understand their meaning (Text Mining), determine the sentiment (Opinion Mining) of user messages and publications in a short time [1-5]. Understanding the meaning and sentiment of publications in social media is the most important and complex element of automated text processing [6-11]. Our scientific group has created an intelligent tool for Opinion Mining of social media. This tool includes new approaches to the hybridization of ontological analysis and methods of knowledge engi- 50 neering with methods of nature language processing (NLP) for extracting the semantic and emotional component of semi-structured and unstructured text resources [12-16]. These approaches will improve the efficiency of the analysis of social media content-specific data and fuzziness of natural language. 2. The architecture of the software system for Opinion Mining social media Service-oriented approach is the basis of the architecture of the software system for Opinion Mining Social Media (SOM). This approach allows: 1. To increase the overall fault tolerance of the SOM by performing services in different address spaces. 2. To increase the scalability of the SOM by running several instances of services and balancing the load between them. 3. To provide the ability to use different operating systems, programming languages, storage technologies, etc. 4. To reduce the downtime of SOM when making changes, correcting errors, etc. 5. To provide an opportunity to completely replace services while maintaining the interface of interaction with other parts of the SOM. REST in conjunction with the HTTP protocol [0] is the basis for the organization of the interface for the interaction of SOM services. REST allows a distributed system of any type to have the follow- ing properties: performance, extensibility, simplicity, updatability, intelligibility, portability and relia- bility. The architecture of SOM is shown in Figure 1. Figure 1. Architectural diagram of the software system for Opinion Mining social media. The SOM consists of the following subsystems: 51 1. Subsystem for importing data from social media. This subsystem works with popular Internet services (Vkontakte, Facebook, Odnoklassniki, Twitter, Instagram, Youtube) through the public appli- cation programming interface (Public API). The data loader from the Intranet media retrieves data from HTML pages based on rules. You need to create your own rule for each Internet media. The rule should consist of a set of CSS-selectors. The ontology loader loads into the storage subsystem a de- scription of the features of the problem area (PrA) in the form of ontologies in the language RDF or OWL. 2. The data storage subsystem provides the representation of information extracted from social media in a unified structure that is convenient for further processing. The data is stored in the context of users, collections, data sources, versions, etc. As database management systems (DBMS) are used: ─ Elasticsearch for indexing and retrieving data [0]; ─ MongoDB for storing data in JSON format [0]; ─ Neo4j for storing graphs of social interaction (social graph) and ontology [0]. The data converter converts the data imported from social media into an internal SOM submission. The social graph builder constructs a social graph. The social graph based on the relationship of users and social media communities. The translator OWL/RDF-ontology in the graph translates the ontology into the graph knowledge base [0]. 3. The subsystem of semantic data analysis performs preprocessing of text resources. In addition, this subsystem performs statistical and linguistic analysis of text resources. 4. The subsystem of sentimental data analysis determines the attitude of a speaker, writer, or other subject with respect to some topic or emotional reaction to a document, interaction, or event from text. 5. The data search subsystem searches for objects related to a specific task. The task presented in the form of a set of keywords. In this case, the user's query can be extended semantically using an on- tology. Ontology contains descriptions of features of the PrA. 2.1. The graph knowledge base and a social graph as data models of SOM The SOM storage subsystem stores the following kinds of data: ─ data extracted from social media; ─ description of PrA in the form of a graphical knowledge base; ─ social graph that reflects the users and their connections of in social media. The graph DBMS Neo4j used to store the description of the PrA in the form of a graph knowledge base and a social graph. The main advantages of Neo4j are: 1. Native storage format for graphs. 2. One copy of the DBMS can control graphs with billions of nodes and links. 3. Neo4j can control graphs that do not completely fit into RAM. 4. Graph-oriented query language - Cypher. The search engine Elasticsearch used to organize data retrieval. The main advantages of Elasticsearch are: 1. Elasticsearch can process petabytes of structured and unstructured data. 2. Using denormalization to increase the search efficiency. 3. Elasticsearch is one of the most popular search engines that is currently used by many large organizations and services such as Wikipedia, The Guardian, StackOverflow, GitHub, etc. Document-oriented DBMS MongoDB is used to store data extracted from social media. The main advantages of MongoDB are: 1. High performance. 2. Document-oriented query language. 3. Fault tolerance. 4. Scaling. 2.2. Description the main concepts of the Social Media and their relations in knowledge base The main SOM data model concepts are: 52 Mass media concept stores information about different social media (VKontakte, Facebook, Twit- ter, etc.) or news site. The SOM import subsystem downloads data from these social media using their API and from news site by using set of CSS-selectors. The Person concept is a list of users extracted from social media. The Person concept has a set of attributes often used in social networks: surname, first name, date of birth, hobbies, education, etc. The Group concept stores information about communities extracted from social media. The Group concept has a set of attributes often used in social networks: group name, group description, age re- strictions, creation date etc. The Post concept stores information about records in social media. The Post concept has the fol- lowing attributes: author, title, content, creation date, attachments etc. The Comment concept stores information about comments in social media. The Comment concept has the following attributes: author, title, content, creation date, attachments etc. The Attachment concept stores information about the attachments of entries and comments in so- cial media. The Attachment concept has several types and allows you to store the following types of attachments: photos, photo albums, audio, video, links, documents (files), surveys etc. Table 1 shows the correspondence of the social media concepts and SOM concepts. Table 1. The correspondence of the social media concepts and SOM concepts. SOM VKontakte, Twitter Instagram Youtube Social Facebook, media ok.ru MassMedia URL, URL URL URL URL For example, vk.com Person User User User User - Group Group - - - - Post Post Twit Photo Video News, Article Comment Comment Comment Comment Comment Comment Attachment Attachment Attach- tags, links Link Attach- ment ment The main concepts of the SOM data model allow storing data downloaded from most existing so- cial media. Unified presentation of SOM data allows efficient processing, analysis and search. The data converter is used to transform data downloaded from social media into the internal representation of the SOM. It is necessary to develop a data converter module for each new Internet resource. The Internet media loader generates the same data representation for all sites. Therefore, the converter for each site separately is not necessary to adapt. 3. Conclusion Intelligent tool for Opinion Mining social media developed by our research group will allow you to download data from the social network VKontakte and Internet media. The social graph is formed during the download of data from the social network VKontakte. This social graph contains the following types of relationships: is a friend, is a subscriber, is a relative, is in a relationship, is in the community. The statistical index of text data is formed when data is loaded using the search engine Elasticsearch. The data is converted into the SOM data model concept and stored in MongoDB. The data search subsystem searches for data by keywords in the context of data sources and con- cept types: users, communities, entries, comments and attachments. The user's initial search query can be extended during the search based on the graphical knowledge base. The graph knowledge base is formed during the translation of the ontology in the OWL format into nodes and the relationship of the graph knowledge base. 53 Further development of the SOM consists of: 1. Development of downloaders for social networks Twitter, Facebook, Instagram, Youtube, ok.ru. 2. Testing the storage subsystem on large amounts of data. 3. Development of a subsystem of sentimental data analysis. 4. Development of a subsystem of semantic data analysis. 5. Finalization of the user interface. The resulting SOM should improve the effectiveness of analyzing the content of social media tak- ing into account the specifics of data representation and the fuzziness of natural language. 4. References [1] Leskovec J., Faloutsos C. Sampling from large graphs //Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. pp 631-636 (2006). [2] Gjoka M. et al. Practical recommendations on crawling online social networks //Selected Areas in Communications, IEEE Journal on. Vol. 29. №. 9. pp 1872-1892 (2011). [3] Boyd D., Ellison N. Social network sites: Definition, history, and scholarship. Journal of Com- puter-Mediated Communication. Vol. 13(1). article 11. (2007) [4] Pallis G., Zeinalipour-Yazti D., Dikaiakos M.. Online Social Networks: Status and Trends. New Directions in Web Data Management 1, Studies in Computational Intelligence Volume 331, pp 213- 234 (2011). [5] Key Trends to Watch in Gartner 2012 Emerging Technologies Hype Cycle. http://www.forbes.com/sites/gartnergroup/2012/09/18/key-trends-to-watch-in-gartner2012-emerging- technologies-hype-cycle-2, last accessed 2018/05/11. [6] Korshunov A. Tasks and methods for determining the attributes of users of social networks // Proceedings of the 15th All-Russian Scientific Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" - RCDL'2013 [7] Korshunov A., Beloborodov I., Gomzin A., Chuprina K., Astrakhantsev N., Nedumov J., Turdakov D. Determination of demographic attributes of users of microblogging // Proceedings of the Institute of System Programming of RAS. Vol. 25, 2013 DOI : 10.15514 / ISPRAS-2013-25-10. [8] Fleuret F. Fast Binary Feature Selection with Conditional Mutual Information // JMLR, 5:1531– 1555 (2004). [9] Crammer K., Dekel O., Keshet J., Shalev-Shwartz S., Singer Y. Online Passive-Aggressive Al- gorithms // JMLR, 7(Mar): pp 551–585 (2006). [10] Pang B., Lee L., Vaithyanathan S. Thumbs up? Sentiment Classification using Machine Learn- ing Techniques. pp 79–86 (2002). [11] Turney P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Clas- sification of Reviews // Proceedings of the Association for Computational Linguistics. pp 417–424. arΧiv: LG/0212032 (2002) [12] Chetviorkin I., Loukachevitch N. Sentiment Analysis Track at ROMIP-2012. Computer linguis- tics and intellectual technologies. Computer linguistics and intellectual technologies: Dialogue-2013. Sat. scientific articles volume 2, pp 40-50. [13] Antonova A., Soloviev A., Using the method of conditional random fields for processing texts in Russian. Computer linguistics and intellectual technologies: Dialogue-2013. Sat. scientific articles / Issue. 12 (19)- Moscow: Publishing house of the RSUH. pp 27-44 (2013). [14] Pazelskaya A., Soloviev A. Method of definition of emotions in texts in Russian. Computer linguistics and intellectual technologies. Computer linguistics and intellectual technologies: Dialogue- 2011. Sat. scientific articles / Issue. 11 (18). Moscow: Publishing House of the RSUH. pp 510-523 (2011). [15] García-Moya, L., Anaya-Sanchez, H., Berlanga-Llavori, R.: Retrieving product features and opinions from customer reviews. IEEE Intelligent Systems 28(3), pp 19–27 (2013) [16] Tarasov D. Deep Recurrent Neural Networks for Multiple Language Aspect-Based Sentiment Analysis // Computational Linguistics and Intellectual Technologies: Proceedings of Annual Interna- tional Conference “Dialogue-2015”. Issue 14(21), Vol.2, pp 65-74 (2015). 54 [17] Representational state transfer, https://en.wikipedia.org/wiki/ Representational_state_transfer, last accessed 2018/05/11. [18] The Heart of the Elastic Stack, https://www.elastic.co/products/elasticsearch, last accessed 2018/05/11. [19] MongoDB. For Giant ideas, https://www.mongodb.com, last accessed 2018/05/11. [20] Introducing the Neo4j Graph Platform, https://neo4j.com, last accessed 2018/05/11. [21] Yarushkina N., Filippov A., Moshkin V. Development of the unified technological platform for constructing the domain knowledge base through the context analysis. Communications in Computer and Information Science. 2017. Vol. 754. pp 62-72. Acknowledgments This study was supported Ministry of Education and Science of Russia in framework of project № 2.4760.2017/8.9 and by the Russian Foundation for Basic Research (Grants No. 18-47-730035 and 18- 37-00450). 55