Mining Real-Estate Listings Based on Decision Systems over Ontological Graphs Extended Abstract Krzysztof Pancerz1,2 and Olga Mich1 1 University of Management and Administration Akademicka Str. 4, 22-400 Zamość, Poland kpancerz@wszia.edu.pl 2 University of Information Technology and Management Sucharskiego Str. 2, 35-225 Rzeszów, Poland Abstract. In the paper, we describe the process of mining real-estate listings based on decision systems over ontological graphs. Such decision systems have been proposed to deal with data in the form of concepts linked by different semantic relations. A special attention is focused on preprocessing steps transforming advertisements in the textual form into decision systems (decision tables) defined over ontological graphs. Keywords: decision systems, ontological graphs, data mining, real-estate listings. 1 Mining Real-Estate Listings Text mining is a rapidly growing application of knowledge discovery in data (cf. [2]). A special kind of data in the textual form is constituted by advertisements, for example, real-estate ones. In case of advertisements, data have the form of loosely coupled words (terms, concepts) rather than full, grammatically correct sentences. Moreover, underlying data are of qualitative character. We can dis- tinguish several challenges posed by textual data: understanding data semantics and semantic relations between them, considering the external knowledge in pro- cesses of data classification, encoding textual data for classifiers working with numerical data. The semantic relations between concepts play an important role, among oth- ers, in cognitive psychology, linguistics and currently also in computer science. In [4], two general types of relations between words (concepts) were distinguished: Paradigmatic relations (relations between words belonging to the same gram- matical category) and Syntagmatic relations (relations between words that go together in a syntactic structure). Paradigmatically related words are, to some degree, grammatically substitutable for each other. In our research, we are inter- ested in paradigmatic relations. As it will be shown later, in real-estate listings, we do not analyze a semantic structure of sentences, but we try to derive some knowledge about concepts (terms) included in them, for example, whether they are synonyms, whether one concept can be replaced with another, for example, more general one, etc. In our research, we use the following taxonomy of types of semantic rela- tions (which is modeled on the project called Wikisaurus [1] aiming at creating a thesaurus of semantically related terms): synonymy, antonymy, hyponymy/ hyperonymy (subclass - superclass), and meronymy/ holonymy (part - whole). In the approach presented in this paper, we are interested in synonymy and hyponymy/ hyperonymy. We will use the following notation: isSyn denotes syn- onymy, (u, v) ∈ isSyn means that ”u is a synonym of v”, isGen denotes hy- ponymy, (u, v) ∈ isGen means that ”u is a hyponym of v” (”u is generalized by v”), and isSpec denotes hyperonymy, (u, v) ∈ isSpec means that ”u is a hyperonym of v” (”u is specialized by v”). Additionally, we take into considera- tion a semantic relation called ”being an instance”. Being an instance concerns an example (instance) of a given concept. This kind of relations is important in mining real-estate listings because they include, for example, instances of places. The knowledge about semantic relations between concepts is included in on- tologies. Our approach is based on the definitions of ontology given by Neches et al. [5] and Kohler [3]. That is, ontology is constructed on the basis of a controlled vocabulary and the relationships of the concepts in the controlled vocabulary. Formally, the ontology can be represented by means of graph structures. Let O be a given ontology. An ontological graph is a quadruple OG = (C, E, R, ρ), where C is a nonempty, finite set of nodes representing concepts in the ontology O, E ⊆ C × C is a finite set of edges representing relations between concepts from C, R is a family of semantic descriptions of types of relations (represented by edges) between concepts, and ρ : E → R is a function assigning a semantic description of the relation to each edge. Sometimes, we can also consider sub- graphs of ontological graphs, called by us local ontological graphs. In Figure 1, exemplary ontological graphs representing the real-estate domain are shown. a) b) Fig. 1. Exemplary ontological graphs representing the real-estate domain. In [12], information (decision) systems were proposed as the knowledge repre- sentation systems. In simple case, they consist of vectors of numbers or symbols (attribute values) describing objects from a given universe of discourse. In our research, we are interested in mining textual data in the form of concepts (words, terms). Therefore, in [7] and [10], ontologies were incorporated into information (decision) systems, i.e., attribute values were considered in the ontological (se- mantic) space. Information (decision) systems over ontological graphs can be created in different ways. In [7], two approaches were mentioned. In the first one, attribute values are concepts from ontologies assigned to attributes - a sim- ple information (decision) system over ontological graphs. In the second one, attribute values are local ontological graphs of ontologies assigned to attributes - a complex information (decision) system over ontological graphs. In case of real-estate listings, whether a client was interested in a given property can be considered as a decision attribute. The main goal of this paper is to show how to transform real-estate listings into simple information (decision) systems over ontological graphs. This is a very important preprocessing step that can be generally depicted as it is shown in Figure 2. We can distinguish three main steps: – Stemming - defining basic grammatical forms (roots) for particular words ex- isting in advertisements, for example, using a quite popular Porter stemming algorithm [13]. – Attributation - assigning concepts (built from words) existing in advertise- ments to proper attributes as their values, according to defined ontological graphs. – Deinstantiation - replacing instances existing in advertisements with the most specific concepts (with respect to the hyponymy / hyperonymy rela- tion) whose instances they are. Deinstantiation is an important step if we are interested in a more general knowl- edge derived from real-estate listings, for example, some client is interested only in houses in a village (not a particular one). Fig. 2. A procedure for transformation of real-estate listings into simple information (decision) systems over ontological graphs. Let us consider, as an example, the following advertisement: ”For sale, War- saw, Poland, Villa, bedrooms: 4, 437 m2 ”. Let us also assume that we take into consideration three attributes: Transaction, Place, and Property with proper ontological graphs assigned to them. After performing our procedure, the adver- tisement becomes one object (row) in an information (decision) system: U/A T ransaction P lace P roperty u1 Sale Capital City Villa Having an information (decision) system over ontological graphs, we can ap- ply different machine learning and data mining methods to extract some valuable knowledge. In our previous papers, for such systems, we considered: rough sets [9], decision rules [8], decision rules based on the DRSA approach [6], neural networks [11]. References 1. The Wikisaurus Homepage: http://en.wiktionary.org/wiki/Wiktionary: Wik- isaurus 2. Kargupta, H., Joshi, A., Sivakumar, K., Yesha, Y.: Data Mining: Next Generation Challenges and Future Directions. The MIT Press, Cambridge, MA (2004) 3. Köhler, J., Philippi, S., Specht, M., Rüegg, A.: Ontology based text indexing and querying for the semantic web. Knowledge-Based Systems 19, 744–754 (2006) 4. Murphy, M.L.: Semantic Relations and the Lexicon. Cambridge University Press (2008) 5. Neches, R., Fikes, R., Finin, T., Gruber, T., Patil, R., Senator, T., Swartout, W.: Enabling technology for knowledge sharing. AI Magazine 12(3), 36–56 (1991) 6. Pancerz, K.: Dominance-based rough set approach for decision systems over onto- logical graphs. In: Ganzha, M., Maciaszek, L., Paprzycki, M. (eds.) Proceedings of the FedCSIS 2012. pp. 323–330. Wroclaw, Poland (2012) 7. Pancerz, K.: Toward information systems over ontological graphs. In: Yao, J., Yang, Y., Slowiński, R., Greco, S., Li, H., Mitra, S., Polkowski, L. (eds.) Rough Sets and Current Trends in Computing, Lecture Notes in Artificial Intelligence, vol. 7413, pp. 243–248. Springer-Verlag, Berlin Heidelberg (2012) 8. Pancerz, K.: Decision rules in simple decision systems over ontological graphs. In: Burduk, R., Jackowski, K., Kurzynski, M., Wozniak, M., Zolnierek, A. (eds.) Proceedings of the CORES’2013, Advances in Intelligent Systems and Computing, vol. 226, pp. 111–120. Springer International Publishing (2013) 9. Pancerz, K.: Semantic relationships and approximations of sets: An ontological graph based approach. In: Proceedings of the HSI’2013. pp. 62–69. Sopot, Poland (2013) 10. Pancerz, K.: Some remarks on complex information systems over ontological graphs. In: Gruca, A., Czachórski, T., Kozielski, S. (eds.) Man-Machine Inter- actions 3, Advances in Intelligent Systems and Computing, vol. 242, pp. 377–384. Springer International Publishing (2014) 11. Pancerz, K., Lewicki, A.: Encoding symbolic features in simple decision systems over ontological graphs for PSO and neural network based classifiers. Neurocom- puting 144, 338–345 (2014) 12. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991) 13. Porter, M.: An algorithm for suffix stripping. Program 14, 130–137 (1980)