fcaR, Spreading FCA to the Data Science World

Pablo Cordero1,∗, Manuel Enciso2, Domingo López-Rodríguez1 and Ángel Mora1

1 Departamento de Matemática Aplicada, Universidad de Málaga, Spain
2 Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, Spain

Published in Pablo Cordero, Ondrej Kridlo (Eds.): The 16th International Conference on Concept Lattices and Their Applications, CLA 2022, Tallinn, Estonia, June 20–22, 2022, Proceedings, pp. 199–205.
∗ Corresponding author.
Email: pcordero@uma.es (P. Cordero); enciso@uma.es (M. Enciso); dominlopez@uma.es (D. López-Rodríguez); amora@uma.es (Á. Mora)
ORCID: 0000-0002-5506-6467 (P. Cordero); 0000-0002-0531-4055 (M. Enciso); 0000-0002-0172-1585 (D. López-Rodríguez); 0000-0003-4548-8030 (Á. Mora)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Abstract
Formal concept analysis (FCA) has become a mature tool for extracting useful knowledge from real problems, based on solid mathematical foundations rooted in logic and lattice theory. However, it remains a stranger in areas such as machine learning, big data, artificial intelligence and databases. R is one of the main languages used in data science, and this work describes an R package called fcaR that implements FCA's core notions and techniques. One of the main goals is to spread FCA to the rest of the world. The main facilities of the tool are shown with a running example.

Keywords
R programming language, Data science, Formal concept analysis

1. Introduction

We assume the main FCA works [1, 2] are known. This short introduction presents some features of the developed package and the main references for the mathematical methods implemented in fcaR.

Classic FCA is devoted to the study of binary datasets (formal contexts), where variables are called attributes. Extensions of FCA (see [3, 4]) have been developed to model real-world problems for datasets containing imprecise, graded or vague information that is not adequately represented by binary values. This fuzzy extension can model problems with numerical and categorical attributes, since these can be scaled to a truth value describing the degree to which the attribute is fulfilled.

As is well known, from a dataset (binary or fuzzy), FCA computes maximal clusters of objects and attributes, named concepts, together with a hierarchy among the concepts. Relationships between the attributes (rules or implications) are computed with the same computational cost.

We emphasize the notion of if-then rules as an efficient way to compact knowledge and enable its automatic handling by means of logic. In this direction, [5] introduced a logic, named simplification logic for functional dependencies (SLFD), firmly based on a simplification rule, which allows us to narrow the functional dependency set by removing redundant attributes. Although the semantics of implications or if-then rules differs in other areas, the logic can be used there too. Several automated deduction methods based directly on this inference system have been developed for classical and fuzzy systems [6, 7, 8, 9, 10]. A generalization of SLFD to the fuzzy framework was also developed [11]: FASL, fuzzy attribute simplification logic, has become a helpful reasoning tool for the fuzzy extension.

As we have said, one of the main goals of the fcaR package is to provide a user-friendly computational interface to the principal operators and methods of binary and fuzzy FCA, including the logic tools mentioned above. The use of the R language can spread FCA to other communities.
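To make the idea of removing redundant attributes concrete, the following base R sketch applies one commonly stated form of the simplification rule: if A and B are disjoint and A is contained in C, then the pair {A -> B, C -> D} can be replaced by {A -> B, (C minus B) -> (D minus B)}. This is an illustration under stated assumptions, not the package's implementation: the helper names implication and simplify_with are hypothetical, and the attributes a, b, c, d are made up.

```r
# Hypothetical helper (not part of fcaR): an implication is a pair of
# attribute sets stored as sorted character vectors.
implication <- function(lhs, rhs) {
  list(lhs = sort(unique(lhs)), rhs = sort(unique(rhs)))
}

# Simplify C -> D using A -> B. The rule is applicable only when A and B
# are disjoint and A is contained in C; otherwise C -> D is returned unchanged.
simplify_with <- function(ab, cd) {
  disjoint  <- length(intersect(ab$lhs, ab$rhs)) == 0
  contained <- all(ab$lhs %in% cd$lhs)
  if (!disjoint || !contained) {
    return(cd)
  }
  # Attributes of B are redundant in C -> D once A -> B is known.
  implication(setdiff(cd$lhs, ab$rhs), setdiff(cd$rhs, ab$rhs))
}

r1 <- implication("a", "b")                       # {a} -> {b}
r2 <- implication(c("a", "b", "c"), c("b", "d"))  # {a, b, c} -> {b, d}

simplified <- simplify_with(r1, r2)
cat("{", toString(simplified$lhs), "} -> {", toString(simplified$rhs), "}\n")
```

Here the attribute b is dropped from both sides of the second rule, leaving {a, c} -> {d}; the first rule still allows b to be recovered whenever a is present, so no knowledge is lost.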
At the time of writing, the package has 25000 downloads. It is published in the CRAN repositories (https://cran.rstudio.com/web/packages/fcaR/index.html), has an active life cycle (https://github.com/Malaga-FCA-group/fcaR) and provides vignettes to help spread the package (https://neuroimaginador.github.io/fcaR/).

The work is organized as follows: Section 2 describes the internal classes implemented in the library. Section 3 shows how to use the package. In Section 4, a real application of the package is shown. Finally, some conclusions and future works are presented in Section 5.

2. Structure of fcaR

The fcaR package provides data structures which allow the user to work seamlessly with formal contexts and sets of implications. More explicitly, the following main classes are implemented, using the R6 object-oriented programming paradigm in R:

• FormalContext encapsulates the definition of a formal context (G, M, I), where G is the set of objects, M the set of attributes and I the (fuzzy) relationship matrix, and provides methods to operate on the context using FCA tools.
• ImplicationSet represents a set of implications over a specific formal context.
• Set encapsulates a class for storing variables (attributes or objects) in an efficient way.

An advantage of the object-oriented programming style is that all the knowledge (concepts, implications, minimal generators, etc.) is stored inside the formal context object fc. The main, computationally hard methods of FCA have been developed in C and linked to fcaR.

3. fcaR

In this section, we present the essential methods of the FCA framework using a well-known running example about planets. From a dataset, we build a formal context object, named fc, in R using the function FormalContext. Sets of attributes or objects are stored in variables of type Set.

For the variable fc containing the formal context, several methods are available: fc$clarify(), fc$attributes, fc$objects, fc$concepts, fc$implications, etc.
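As a package-independent illustration of what the derivation operators compute, the sketch below implements intent, extent and closure in base R on a binary incidence matrix. The incidence values are assumed to be those of the planets dataset distributed with fcaR; the names incidence, intent, extent and closure are plain functions defined here for illustration, not the fcaR methods.

```r
# Incidence matrix of the planets context (rows: objects, columns: attributes).
# Values are assumed to match the planets dataset shipped with fcaR.
attrs <- c("small", "medium", "large", "near", "far", "moon", "no_moon")
objs  <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter",
           "Saturn", "Uranus", "Neptune", "Pluto")
incidence <- matrix(c(
  1, 0, 0, 1, 0, 0, 1,  # Mercury
  1, 0, 0, 1, 0, 0, 1,  # Venus
  1, 0, 0, 1, 0, 1, 0,  # Earth
  1, 0, 0, 1, 0, 1, 0,  # Mars
  0, 0, 1, 0, 1, 1, 0,  # Jupiter
  0, 0, 1, 0, 1, 1, 0,  # Saturn
  0, 1, 0, 0, 1, 1, 0,  # Uranus
  0, 1, 0, 0, 1, 1, 0,  # Neptune
  1, 0, 0, 0, 1, 1, 0   # Pluto
), nrow = 9, byrow = TRUE, dimnames = list(objs, attrs))

# Intent: attributes shared by every object in the given set.
intent <- function(object_set) {
  sub <- incidence[object_set, , drop = FALSE]
  colnames(incidence)[colSums(sub == 0) == 0]
}

# Extent: objects that have every attribute in the given set.
extent <- function(attribute_set) {
  sub <- incidence[, attribute_set, drop = FALSE]
  rownames(incidence)[rowSums(sub == 0) == 0]
}

# Closure of an attribute set: intent of its extent.
closure <- function(attribute_set) intent(extent(attribute_set))

print(intent(c("Mars", "Earth")))
print(extent(c("medium", "far")))
print(closure("medium"))
```

In the binary case the two operators reduce to column and row filters on the matrix. A fuzzy context would need graded versions of the same operators, which this sketch does not cover.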
As an example, with the planets dataset (Table 1), we compute the intent, the extent and the closure of a set of attributes:

         small  medium  large  near  far  moon  no_moon
Mercury  ×                     ×                ×
Venus    ×                     ×                ×
Earth    ×                     ×          ×
Mars     ×                     ×          ×
Jupiter                 ×            ×    ×
Saturn                  ×            ×    ×
Uranus          ×                    ×    ×
Neptune         ×                    ×    ×
Pluto    ×                           ×    ×

Table 1: Planets dataset.

> set_objects <- Set$new(fc$objects)
> set_objects$assign(Mars = 1, Earth = 1)
> fc$intent(set_objects)
{small, near, moon}
> set_attributes1 <- Set$new(fc$attributes)
> set_attributes1$assign(medium = 1, far = 1)
> fc$extent(set_attributes1)
{Uranus, Neptune}
> set_attributes2 <- Set$new(fc$attributes)
> set_attributes2$assign(medium = 1)
> fc$closure(set_attributes2)
{medium, far, moon}

To extract knowledge, we use some methods associated with the variable fc. Some concepts and implications are shown next:

> fc$find_concepts()
> fc$concepts[3:4]
A set of 2 concepts:
1: ({Jupiter, Saturn, Uranus, Neptune, Pluto}, {far, moon})
2: ({Jupiter, Saturn}, {large, far, moon})
> fc$find_implications()
> fc$implications[1:2]
Implication set with 2 implications.
Rule 1: {no_moon} -> {small, near}
Rule 2: {far} -> {moon}

For the concepts and implications stored inside the variable fc, the main methods and algorithms are available:

• For concepts: infimum(), supremum(), top(), bottom(), plot(), size(), join_irreducibles(), meet_irreducibles(), lower_neighbours(), etc.
• For implications: apply_rules(), cardinality(), to_basis(), filter(), etc.

4. A case study

In this section, we present a case study that applies fcaR to a real-world problem. The goal is to extract knowledge about the features of tourist destinations given a user profile. The dataset vegas (see the package for more information) stores more than 500 TripAdvisor reviews of hotels on the Las Vegas Strip. The main attributes are:

• Period of stay: 4 categories are present in the original data, which produce as many binary variables: Dec-Feb, Mar-May, Jun-Aug and Sep-Nov.
• Traveler type: five binary categories are created from the original data: Business, Couples, Families, Friends and Solo.
• Pool, Gym, Tennis court, Spa, Casino, Free internet: binary variables for the services offered by each destination hotel.
• Stars: five binary variables are created, according to the number of stars of the hotel: 3, 3.5, 4, 4.5 and 5.
• Score: the score assigned in the review, from 1 to 5; five binary variables are created.

We can load the dataset, create a FormalContext object, and compute concepts and implications with:

> data(vegas)
> fc <- FormalContext$new(vegas)
> fc$find_implications()

In this case, the lattice has 2082 concepts and is hard to visualize, so we opt for plotting a sublattice where we impose a minimum support:

[Figure: sublattice of concepts with minimum support. Its 16 node labels are the subsets of {Pool, Gym, Casino, Free internet}, from {} up to {Pool, Gym, Casino, Free internet}.]

This exploration gives some hints about the most important attributes in the dataset. After that, the set of implications is manipulated to remove redundancies and to discard the rules with zero support:

> fc$implications$apply_rules(c("simplification",
+                               "composition",
+                               "generalization"))
> fc$implications <- fc$implications[fc$implications$support() > 0]

We are now in a position to pose the question that must be answered by means of the extracted knowledge: for a couple searching for a hotel in Las Vegas with a spa, which additional services would lead to the highest score (5)?
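The zero-support filter used above can be illustrated with a small base R sketch. It assumes the usual convention that the support of an implication is the fraction of objects satisfying its left-hand side; the matrix reviews, the helper support and the two rules are hypothetical toy data, not the vegas dataset or fcaR internals.

```r
# Toy binary context (hypothetical data, not the vegas dataset).
reviews <- matrix(c(
  1, 1, 0,  # review1: Pool, Gym
  0, 1, 1,  # review2: Gym, Spa
  0, 1, 0,  # review3: Gym
  1, 0, 0   # review4: Pool
), nrow = 4, byrow = TRUE,
dimnames = list(paste0("review", 1:4), c("Pool", "Gym", "Spa")))

# Support of an attribute set: fraction of objects having all its attributes.
support <- function(attribute_set) {
  sub <- reviews[, attribute_set, drop = FALSE]
  mean(rowSums(sub == 0) == 0)
}

# Two implications that hold in the toy context. The first one holds only
# vacuously, because no review has both Pool and Spa.
rules <- list(
  list(lhs = c("Pool", "Spa"), rhs = "Gym"),
  list(lhs = "Spa",            rhs = "Gym")
)

# Keep only the rules whose left-hand side occurs in at least one object.
kept <- Filter(function(rule) support(rule$lhs) > 0, rules)
for (rule in kept) {
  cat("{", toString(rule$lhs), "} -> {", toString(rule$rhs), "}\n")
}
```

A rule with zero support is true in the context but describes no actual object, which is why discarding it loses nothing for recommendation purposes.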
To answer this question, let us begin with a subset of the implications, those related to couples travelling:

> base_implications <- fc$implications$filter("Traveler type=Couples")

Then, we specify the minimum services (Spa) in a Set:

> Setattr1 <- Set$new(fc$attributes)
> Setattr1$assign("Traveler type=Couples" = 1, "Spa" = 1)

And we compute the closure by using the simplification logic, since we are interested in the knowledge that can be inferred from the condition given by the set:

> cl <- base_implications$closure(Setattr1, reduce = TRUE)
> specific_implications <- cl$implications

There are 36 implications representing the knowledge in the formal context for the required case. Since the problem asks for the additional features needed to get a score of 5, let us filter the new ImplicationSet by this condition on the right-hand side (RHS), once redundancies have been removed:

> specific_implications$filter(rhs = c("Score=5"))
Implication set with 5 implications.
Rule 1: {Period of stay=Mar-May, Stars=4.5} -> {Score=5}
Rule 2: {Period of stay=Jun-Aug, Stars=4.5} -> {Score=5}
Rule 3: {Period of stay=Jun-Aug, Tennis court, Stars=3.5} -> {Score=5}
Rule 4: {Period of stay=Dec-Feb, Tennis court, Stars=3.5} -> {Score=5}
Rule 5: {Period of stay=Dec-Feb, Tennis court, Stars=3} -> {Score=5}

From these implications, we can infer the additional services that would make a perfect stay for the user.

5. Conclusions

The main objective of this work has been the development of an R package that is useful not just for the FCA community but, more generally, for knowledge retrieval from binary or fuzzy (graded) datasets. It is the first R package implementing the core methods in FCA. To sum up, the fcaR package is designed to:

• Manage formal contexts (datasets), implementing the core notions of formal concept analysis: objects, attributes, derivation operators, concepts, closures, implications, etc.
• Extract the concepts and the concept lattice from a context.
• Find implications (exact association rules) that are true in the context.
• Provide tools to visualize the extracted knowledge.
• Implement the simplification logic for fuzzy and binary settings as the core of automated methods based on logic to remove redundancy in an easy way (only applying the rules of the logic), to compute closures and make recommendations.

Thus, fcaR implements a wide range of features, and with the help of the included documentation and vignettes, any user can start analysing datasets with FCA tools.

From the point of view of efficiency, the fcaR package uses the vectorial and parallelization capabilities of the R language, whereas algorithmic bottlenecks have been implemented in C. In addition, we have used sparse matrices as the main internal data structure of the package.

Currently, the package is under active development, with several extensions or enhancements in progress: improving the efficiency of fuzzy algorithms, adding other algorithms from the FCA community to compute the concept lattice or the implication basis, and incorporating advanced algorithms, such as the calculation of direct bases of implications and minimal generators, that have proved useful in practical applications.

Acknowledgments
Supported by Grants TIN2017-89023-P, UMA2018-FEDERJA-001 and PGC2018-095869-B-I00 of the Junta de Andalucia, and European Social Fund.

References
[1] R. Wille, Restructuring lattice theory: An approach based on hierarchies of concepts, in: Ordered Sets, Springer, 1982, pp. 445–470.
[2] B. Ganter, R. Wille, Formal Concept Analysis - Mathematical Foundations, Springer, 1999. URL: https://doi.org/10.1007/978-3-642-59830-2.
[3] R. Belohlávek, V. Vychodil, Attribute dependencies for data with grades I, International Journal of General Systems 45 (2016) 864–888. URL: https://doi.org/10.1080/03081079.2016.1205711.
[4] R. Belohlávek, V. Vychodil, Attribute dependencies for data with grades II, International Journal of General Systems 46 (2017) 66–92. URL: https://doi.org/10.1080/03081079.2016.1205712.
[5] P. Cordero, M. Enciso, A. Mora, I. P. de Guzmán, SLFD logic: Elimination of data redundancy in knowledge representation, in: IBERAMIA, volume 2527 of LNCS, Springer, 2002, pp. 141–150. URL: https://doi.org/10.1007/3-540-36131-6_15.
[6] A. Mora, M. Enciso, P. Cordero, I. P. de Guzmán, An efficient preprocessing transformation for functional dependencies sets based on the substitution paradigm, in: CAEPIA 2003, volume 3040 of LNCS, Springer, 2003, pp. 136–146. URL: https://doi.org/10.1007/978-3-540-25945-9_14.
[7] P. Cordero, M. Enciso, A. Mora, M. Ojeda-Aciego, Computing minimal generators from implications: a logic-guided approach, in: CLA 2012, volume 972 of CEUR W.Proc., CEUR-WS.org, 2012, pp. 187–198. URL: http://ceur-ws.org/Vol-972/paper16.pdf.
[8] A. Mora, P. Cordero, M. Enciso, I. Fortes, G. Aguilera, Closure via functional dependence simplification, International Journal of Computer Mathematics 89 (2012) 510–526. URL: https://doi.org/10.1080/00207160.2011.644275.
[9] E. Rodríguez Lorenzo, K. Bertet, P. Cordero, M. Enciso, A. Mora, The direct-optimal basis via reductions, in: CLA 2014, volume 1252 of CEUR W.Proc., CEUR-WS.org, 2014, pp. 145–156. URL: http://ceur-ws.org/Vol-1252/cla2014_submission_18.pdf.
[10] E. Rodríguez Lorenzo, K. V. Adaricheva, P. Cordero, M. Enciso, A. Mora, From an implicational system to its corresponding d-basis, in: CLA 2015, volume 1466 of CEUR W.Proc., CEUR-WS.org, 2015, pp. 217–228. URL: http://ceur-ws.org/Vol-1466/paper18.pdf.
[11] R. Belohlávek, P. Cordero, M. Enciso, A. Mora, V. Vychodil, Automated prover for attribute dependencies in data with grades, International Journal of Approximate Reasoning 70 (2016) 51–67. URL: https://doi.org/10.1016/j.ijar.2015.12.007.