
Case Studies and Education in Data Science At Universities in Slovakia

Ľubomír Antoni

lubomir.antoni@upjs.sk 1

Erik Bruoth

erik.bruoth@upjs.sk 1

Ján Guniš

Peter Gurský

peter.gursky@upjs.sk 1

Stanislav Krajči

stanislav.krajci@upjs.sk 1

Ondrej Krídlo

ondrej.kridlo@upjs.sk 1

Radim Navrátil

Ľubomír Šnajder

Gabriela Andrejková

gabriela.andrejkova@upjs.sk 1

Dušan Šveda

0 Department of Mathematics and Statistics, Faculty of Science, Masaryk University in Brno, Czech Republic

1 Institute of Computer Science, Faculty of Science, Pavol Jozef Šafárik University in Košice, Jesenná 5, 040 01 Košice

2 Institute of Mathematics, Faculty of Science, Pavol Jozef Šafárik University in Košice, Jesenná 5, 040 01 Košice

Data analysis solutions are already used in many areas of the technical, natural, humanitarian and economic sciences. The information hidden within data can help to solve many pending issues in the community, in enterprise and in science. Turning data into knowledge and wisdom is beneficial and necessary. Moreover, students with knowledge of Data Science have the potential to be highly desirable in the labor market. We present selected case studies in the area of Data Science and review the Data Science subjects that are being introduced and innovated at universities in Slovakia within the IT Academy Project.

1 Introduction

The term "science" implies knowledge gained through systematic study. In one definition, it is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions. Data science might therefore imply a focus involving data and, by extension, statistics, or the systematic study of the organization, properties, and analysis of data and its role in inference, including our confidence in the inference [ 1 ].

The huge amount of data and the need for decision-making create a demand for data analysis and for data processing professionals. This issue is highly relevant: data analysis solutions are already used in many areas of the technical, natural, humanitarian and economic sciences. Moreover, data analysis experts are already scarce, and students with knowledge of Data Science have the potential to be highly desirable in the labor market. In developing this area, it is necessary to address both the amount of available data that is regularly generated by many commercial devices and scientific instruments, and the available methods of analysis, which allow new knowledge to be discovered. The aim is to extract new, valid and potentially useful knowledge from available data in various areas of academic and corporate life. Meeting this objective involves several phases: data comprehension, data preparation, modeling, evaluation of results and putting the results into practice [ 1, 2, 3, 4 ].

Section 2 describes classification tasks in general. Section 3 provides a basic overview of university subjects at the five partner universities of the IT Academy Project in the area of Data Science, including classification tasks. In Section 4, we present the analysis of the educational tasks and objectives of five real teachers giving lessons in computer science.

2 Classification tasks

Data Science is an interdisciplinary area that uses scientific methods, processes, algorithms, and systems to acquire knowledge and insights from data in various forms, both structured and unstructured. It introduces the concept of data unification, data analysis, machine learning and related methods to analyze current data processes. Data Science uses techniques and practices from the fields of mathematics, statistics, or informatics [ 1-4 ].

In the past, computers were used to process numerical data. In recent decades, representation in the form of relational data has also been promoted, because it allows the relationships between instances of objects to be examined in a much wider dimension. Moreover, the knowledge found in the data helps to solve many issues related to science, business or society. The continuous transformation of input data into knowledge is therefore beneficial and necessary. The data are usually in tabular form, and we can obtain this form of data by recording the results of doctors' examinations, customer behavior and consumer habits, or the results of students.

Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data) [ 3 ].

In the first step, a model is created that describes a set of data classes. The model is constructed based on the analysis of input objects (also called examples, instances, data points or samples) described by attributes. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or “learning from” a training set made up of database tuples and their associated class labels. A tuple, $X$, is represented by an $n$-dimensional attribute vector, $X = (x_1, x_2, \ldots, x_n)$, depicting $n$ measurements made on the tuple from $n$ database attributes, respectively, $A_1, A_2, \ldots, A_n$. Each tuple, $X$, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute.

Because the class label of each training tuple is provided, this step is also known as supervised learning (i.e., the learning of the classifier is “supervised” in that it is told to which class each training tuple belongs). It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance. The learned model can be represented in the form of classification rules, decision trees, or a mathematical formula. For example, for a database of customer credit information, classification rules can be used to identify customers with excellent or fair credit ratings. In the second step, the model is used to classify new objects [ 3 ].
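The two steps described above can be illustrated by a minimal sketch. It assumes a nearest-centroid classifier, which is only one of many possible models and is not taken from the cited literature; the function names `learn` and `classify`, the two attributes and all data values are hypothetical.

```python
from collections import defaultdict
from math import dist


def learn(training_set):
    """Learning step: build a model holding one centroid per class label."""
    sums = {}
    counts = defaultdict(int)
    for x, label in training_set:
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {label: tuple(s / counts[label] for s in total)
            for label, total in sums.items()}


def classify(model, x):
    """Classification step: predict the label of the nearest centroid."""
    return min(model, key=lambda label: dist(model[label], x))


# Hypothetical training tuples X = (income in thousands, years in current job)
# together with their class labels (credit rating).
training_set = [
    ((60.0, 10.0), "excellent"),
    ((55.0, 8.0), "excellent"),
    ((20.0, 1.0), "fair"),
    ((25.0, 2.0), "fair"),
]

model = learn(training_set)            # step 1: supervised learning
print(classify(model, (50.0, 7.0)))    # step 2: label a new, unlabeled tuple
print(classify(model, (22.0, 3.0)))
```

Here the learned model is a mathematical formula (distance to class centroids); a rule-based or tree-based model would follow the same learn-then-classify pattern.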

Data prediction can be viewed as the construction of a model to assess the class of an unlabeled sample, or to assess the value of a target attribute that a given object is likely to have. In this view, classification and regression are two major types of prediction problems, where classification is used to predict discrete or nominal values, while regression is used to predict continuous or ordered values. In another view, data prediction differs from the classification of data by the fact that the class label variable can acquire continuous values [ 1, 2, 3, 4 ].
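The difference between the two types of prediction can be sketched on a toy example. The code below assumes a single input attribute and an ordinary least-squares line; the data values, the threshold and the function names are hypothetical and serve only to contrast a continuous prediction with a discrete one.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one input attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data: inhabitants of a territory (thousands) and the
# corresponding water consumption (arbitrary units).
inhabitants = [1.0, 2.0, 3.0, 4.0]
consumption = [52.0, 98.0, 151.0, 199.0]

a, b = fit_line(inhabitants, consumption)


def predict_consumption(x):
    """Regression: the predicted target is a continuous value."""
    return a * x + b


def predict_demand_class(x, threshold=120.0):
    """Classification: the predicted target is a discrete label.

    The threshold is an assumed value chosen only for this illustration.
    """
    return "high" if predict_consumption(x) >= threshold else "low"


print(predict_consumption(5.0))
print(predict_demand_class(2.0), predict_demand_class(3.0))
```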

Classification and prediction have numerous applications, including credit approval, performance prediction, medical diagnosis, selective marketing, prediction of water consumption in a given territory, prediction of user ratings of films, and recommender systems [ 1, 2, 3, 4 ].

3 Innovation of Data Science University Subjects

As part of the IT Academy Project, new subjects are being prepared and existing study materials are being innovated in the area of Data Science. In addition to the tasks of classification, prediction and evaluation of their accuracy presented above, many other topics covered by university subjects belong to this attractive area. A basic overview of these topics at the five partner universities of the IT Academy Project follows.

Case studies in data mining

(Pavol Jozef Šafárik University in Košice) The course deals with prepared case studies using data mining methods. Data mining is an interdisciplinary topic that draws on knowledge from a variety of disciplines, such as mathematics, statistics, and various specific areas of informatics. The course introduces methods for the automated analysis of large volumes of data and the extraction of knowledge from these data. It is aimed at solving practical problems in the area of graphing using the presented methods and appropriate software tools. The R project as a computing environment. Programming in R: data structures, graphics, applied mathematics, data analysis. Java libraries for data mining and their use. Testing data mining algorithms. Dissemination of data mining libraries. Solution of practical examples. Solutions for application projects.

Machine learning

(Pavol Jozef Šafárik University in Košice) The course is focused on modeling and algorithmization of the processes by which computer systems learn from data sets, the result of learning being objects that take the input data sets into account. This modeling is motivated by the needs of many areas of human activity (biomedical informatics, creation of expert systems, robot construction, speech and text processing, prediction of events, etc.). The basis for this modeling is knowledge of linear regression, gradient methods of regression and classification, Bayesian learning, clustering, decision trees and hidden Markov models, as well as theoretical knowledge of probably approximately correct (PAC) learning and the Vapnik-Chervonenkis dimension. Searching for patterns in data, advanced classification methods, and prediction methods.

Large data processing technologies

(Technical University in Košice) Upon completion of the course the student will acquire basic knowledge of large data processing, methods, approaches and technologies used in this area. From a theoretical point of view the student will gain knowledge about grid and cloud computing concepts, distributed, NoSQL and in-memory database systems, parallel and distributed computing methods. The student will also acquire basic skills for designing and implementing large data processing applications.

Knowledge discovery

(Technical University in Košice) Basic concepts of knowledge discovery and data mining. The knowledge discovery process, the CRISP-DM methodology, step-by-step description. Data mining methods: predictive and descriptive data mining. Discovery of association rules, generalization, classification, prediction, clustering. Data warehouses. Overview of selected methods for discovering knowledge. Mining knowledge from collections of text documents. Selected case studies from research and development projects. Logical and physical data warehouse model. Temporary and operational data repositories. Searching for patterns in data.

Business Analyst

(Technical University in Košice) A business analyst makes extensive use of data and advanced statistical methods to optimize individual areas and business processes. The course focuses on knowledge, technology, applications and methods for continuously exploring the company’s historical performance with a view to a deeper understanding and management of business planning. The R project as a computing environment, programming in R: data structures, graphics, applied mathematics, data analysis. Practical examples.

NoSQL databases

(Pavol Jozef Šafárik University in Košice) The course introduces different ways of storing and working with large data. It compares the basic types of NoSQL databases (key-value, document, column and graph databases) and gives practical examples of working with selected representatives of the individual types. The different types of data representation with which these databases work are presented. The course focuses not only on the use of NoSQL databases, but also on their architecture and on parallel, distributed and transactional processing.

Programming, algorithms, complexity

(Pavol Jozef Šafárik University in Košice) The subject builds on a programming course based on the "OO-first" approach (i.e., starting directly with object-oriented programming). An important element is visualization and the use of metaphors in conjunction with the graphics implemented by the JPAZ2 framework. In the second part of the semester, a systematic exposition with an emphasis on good object design takes place. The course is focused on the ability to implement programs in the Java programming language and on basic knowledge of the principles of object-oriented programming.

Basics of knowledge systems

(Pavol Jozef Šafárik University in Košice) The aim of the course is to apply the knowledge from advanced logic courses to computer science, especially to database and knowledge systems. The course covers the procedural semantics of logic programming, the declarative semantics of logic programming and its correctness, and the relationship between relational database models and logic programming.

Data mining

Educational data mining

(Constantine the Philosopher University in Nitra) The aim of the course is to acquaint students with computer data analysis and the possibilities of applying selected analytical methods in IT. Through examples and case studies, students will become acquainted with selected data analysis methods and their application to specific issues. The case studies are focused on the evaluation of information technologies and information systems, specifically on the methodology for evaluating user interfaces, systems and processes. Logical and physical data warehouse model. Temporary and operational data repositories. Data warehouses. Educational Data Mining (EDM) can be characterized as a research area that develops new techniques and methods, tests new approaches to learning, and looks for user behavior patterns in unstructured and structured data created by the interaction between different types of users in a particular virtual learning environment, educational software, an intelligent or adaptive learning system, or specialized testing software. Learning Analytics (LA) is a related research area that aims to support decision-making processes at the various levels of management of an educational organization. In terms of learning outcomes and competences, the aim of the subject is to introduce the students to the above-mentioned problems, to familiarize them with the current trends in EDM and learning analytics research, and to present, through practical examples, data mining methods in the area of management of an educational organization. Searching for patterns in data. Solution of practical examples.

Web content mining

Neural Networks

(Constantine the Philosopher University in Nitra) Neural networks represent a bio-inspired approach to intelligent information processing computational algorithms. Recognition of knowledge in memory, interpretation of input data, knowledge relations, generalization, decision making, etc. spontaneously appear as a result of the massive parallel interaction of a large number of relatively simple calculation elements. From a content point of view, the subject deals with topics such as: introduction to neural networks, Binary perceptron, Backpropagation, Reinforcement learning, Recurrent neural networks, their training and application, Self-organization.

Algorithms of numerical mathematics and optimization

(Constantine the Philosopher University in Nitra) The course provides an introduction to the basic methods of numerical mathematics and optimization. It is based on four aspects: the mathematical basics of numerical analysis and optimization, algorithmization, programming of algorithms in MATLAB, and algorithm visualization. In terms of content, the subject is focused on numerical methods of algebra, numerical methods of mathematical analysis and optimization methods: single-parameter optimization methods, unconstrained multi-parameter optimization, least squares, and constrained multi-parameter optimization.

Programming in Java

(Matej Bel University in Banská Bystrica) Programming paradigms. Java programming language. JVM. Encapsulation. Class declaration. Instances of classes (objects). Constructor. Access rights and scope in a class. Static attributes and methods. Inheritance. Exceptions. Abstract classes. Polymorphism. Interface. Modularisation, abstraction, coupling, cohesion. Selected parts of the Java API. Dynamic data structures. Streams of data. Solution of practical tasks.

Business Intelligence

(Matej Bel University in Banská Bystrica) Business Intelligence (BI) and BI type projects - focus, reason, goal. Repetition of necessary knowledge and skills from database systems, software systems and modeling of data structures. OLTP vs. OLAP. Data warehouses, data warehouse development, OLAP analysis, data mining (architecture, metadata, implementation, ETL, OLAP, MDX, reports). Waterfall model working on BI project. Agile methods of working on BI project. Processing of input information from the customer. Forms of offers (customer response) for the customer. Project manager role in the BI project. The activity of the project architect, his competence and the relevant documentation. Data Architect and its activity in the project. ETL developer work. Types of inputs provided to customers. Work Report developer. Participation in the project of individual actors as a function of time. Output forms for the customer. Active work with the customer. Logical and physical data warehouse model. Temporary and operational data repositories. Data warehouses. Searching patterns in data. Solution of practical examples.

Algorithms and Data Structures

(University of Žilina) Students will acquire basic knowledge of the theory of data structures and will learn how to implement them effectively. Upon completion of the course, the student is familiar with the basic data structures, is able to use them in solving practical problems, and masters the process of implementing them effectively.

Statistics for practice

Advanced statistical methods

Fundamentals of numerical methods

(Pavol Jozef Šafárik University in Košice) The aim is to provide students with the theoretical knowledge and practical skills needed to process real data and create mathematical models of them using multipurpose computer algebra systems or dedicated software for specific categories of problems.

4 Educational tasks analysis

The study of structures and mappings that allow data in various forms to be analyzed is a challenging task. The first attempts to interpret lattice theory as concretely as possible, and to promote better communication between lattice theorists and potential users of lattice theory, mark the inception of data analysis that takes into account binary relations between sets of objects and attributes. Since concept hierarchies play an important role here, the term formal concept analysis has been adopted for this reasoning. Briefly, formal concept analysis scrutinizes an object-attribute block of relational data in bivalent form [ 5 ]. Formal concept analysis allows us to explore meaningful groupings of educational tasks (referred to as objects) with respect to common objectives (referred to as attributes), and it provides visualization capabilities. Conceptual difficulties in mathematics education [ 6 ] and integrated care pathways [ 7 ] have also been analyzed by formal concept analysis. An extensive overview of the various application domains, which include software mining, web analytics, medicine, biology and chemistry data, is given in [ 8, 9 ]. The statements that people use to communicate facts about the world are usually not bivalent. The truth of such statements is a matter of degree, rather than being only true or false. Fuzzy logic and fuzzy set theory are frameworks which extend formal concept analysis in various independent ways. In [ 10 ], our aim was to provide a system of objectives and tasks that is expected to fill the gap in the National Education Program of the Slovak Republic. In general, the National Education Program is formulated concisely, and our long-term emphasis is on particularizing other supplementary curricular documents and expressing the educational objectives more explicitly in various areas.
Therefore, we have focused on the area of algorithmic thinking and chosen the array data structure as the educational content to which we have fruitfully applied formal concept analysis. In this area, we also focus on algorithms including searching, sorting and text processing. In particular, we have investigated the educational tasks and objectives of five real teachers giving lessons in computer science. We aim to specify particular and relatively precise objectives for teaching the array data structure in the area of algorithmic thinking. Based on our long-term cooperation with teachers in the field, we declare the following input set of objectives for the array data structure:

1) to specify an array as a structured homogeneous data type with elements denoted by a single identifier,
2) to name real examples of a one-dimensional array data structure (e.g. rooms in a hotel, seats in a plane, etc.),
3) to interpret the notions of an array index (an array key) and an array element and to explain the difference between them,
4) to distinguish an array index type and an array element type,
5) to reason that an array index type is an ordinal type (numbers, characters, other enumerations),
6) to declare a variable of an array type,
7) to read and to write out the array elements,
8) to manipulate the array elements, to assign an array element to other variables, to increment the array elements,
9) to name the common errors related to an array data structure (incorrect index type, overflow, incompatibility of types),
10) to apply an array data structure in simple problems (e.g. to store an array, to find the maximal value, to modify the elements of an array, etc.),
11a) to apply an array data structure in searching,
11b) to apply an array data structure in sorting,
11c) to access the array elements multiple times,
11d) to apply an array data structure in text processing,
11e) to apply an array data structure in simple game programming,
12) to recognize the problems in which an array data structure can be applied effectively, and to name the advantages and disadvantages of an array in comparison with other simple data structures (access to elements, space complexity).

The specified aims are classified according to the revised Bloom's taxonomy in order to categorize statements of what we expect or intend students to learn as a result of education. The revised taxonomy focuses on four knowledge dimensions: factual knowledge (basic elements), conceptual knowledge (interrelationships among the basic elements), procedural knowledge (how to do something) and metacognitive knowledge (awareness and knowledge of one’s own cognition). In general, an educational process consists of a motivation phase, a phase of first acquisition, a fixation phase and a diagnostic phase. A systematization phase, a propaedeutic phase or an application phase can also be involved.

We submitted the previous list of aims to teachers in secondary schools in Slovakia. The teachers were instructed to list the tasks which they usually use when teaching the array data structure in programming. The teachers were not limited in the number of tasks and, moreover, they could add further aims. From the data of five teachers and one additional set of 10 tasks proposed by two of the authors, we have analyzed the 102 tasks and 23 educational aims obtained in this research. We have generated the summary concept lattice and made the following observations: (i) 45 tasks (the first row in the summary concept lattice) are representatives, i.e. each of them covers a unique set of aims and no other task covers a superset of these aims; (ii) 5 tasks (out of the 45 representatives) are such that each of them covers a unique set of aims and no other task covers either a superset or a subset of these aims; (iii) 3 aims (the first row in the summary concept lattice) are unique, i.e. each of them is introduced by only one task. In an effort to prepare graduated sets of tasks, we have explored the longest paths extracted from the summary concept lattice with reduced labeling of all 102 tasks. The longest path is shown in Fig. 2. Every path contains a graduated system of tasks depending on the final task we want to reach. An object label, for instance 3.5, corresponds to the fifth task of the third teacher. The set of tasks labeled 6.1 - 6.10 comes from the authors.
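The notions used above (formal concepts of a task-aim context and representative tasks) can be sketched on a toy context. The context, the labels and the function names below are hypothetical and do not reproduce the data of the study; the brute-force enumeration over all subsets of tasks is an assumption suitable only for very small contexts.

```python
from itertools import combinations

# Hypothetical formal context: task label -> set of aims the task covers.
context = {
    "T1": {"A1", "A2"},
    "T2": {"A1", "A2", "A3"},
    "T3": {"A2", "A4"},
    "T4": {"A1"},
}
all_aims = set().union(*context.values())


def common_aims(tasks):
    """Aims shared by all given tasks (all aims for the empty set of tasks)."""
    result = set(all_aims)
    for t in tasks:
        result &= context[t]
    return result


def tasks_covering(aims):
    """Tasks that cover every aim in the given set."""
    return {t for t, a in context.items() if aims <= a}


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every subset of tasks."""
    concepts = set()
    names = sorted(context)
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            intent = common_aims(subset)
            extent = tasks_covering(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def representatives():
    """Tasks whose aim set is not strictly contained in another task's aim set.

    Tasks with identical aim sets would all be reported here; a real
    analysis would keep only one of them.
    """
    return {t for t, a in context.items()
            if not any(a < b for b in context.values())}


print(len(formal_concepts()))
print(sorted(representatives()))
```

In this toy context, T1 and T4 are not representatives because T2 covers a superset of their aims.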

The longest path illustrates that if a student has a problem with Task6.5, we ask him/her to solve Task3.12. Moreover, if we find that a student has a problem with Aim6 in Task6.5, we assign him/her one or more tasks for fixation from the node which contains the set of equal tasks Task1.3, Task1.4, Task1.5, Task1.8. Conversely, if a student has no problem with Task6.5, we suppose that he/she will also pass Task3.12. There is only one path with seven task nodes; however, paths with six nodes appear in the summary concept lattice several times.
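The search for a longest graduated path can be sketched as follows. As a simplification, the code orders tasks directly by strict inclusion of their aim sets instead of traversing the nodes of a concept lattice; the tasks, aims and function names are hypothetical.

```python
from functools import lru_cache

# Hypothetical tasks and the aims they cover.
tasks = {
    "Task1": frozenset({"Aim1"}),
    "Task2": frozenset({"Aim1", "Aim2"}),
    "Task3": frozenset({"Aim1", "Aim2", "Aim3"}),
    "Task4": frozenset({"Aim2", "Aim4"}),
}


@lru_cache(maxsize=None)
def longest_chain_from(task):
    """Longest sequence starting at `task` in which every next task
    covers strictly more aims than the previous one."""
    best = (task,)
    for other, aims in tasks.items():
        if tasks[task] < aims:  # strict subset: `other` is more demanding
            candidate = (task,) + longest_chain_from(other)
            if len(candidate) > len(best):
                best = candidate
    return best


def longest_graduated_path():
    """Longest chain over all possible starting tasks."""
    return max((longest_chain_from(t) for t in tasks), key=len)


print(longest_graduated_path())
```

Read from the end, such a chain suggests which simpler task to assign when a student struggles with a more demanding one.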

Moreover, we present some interesting educational tasks which appear mostly in the first row of the summary concept lattice and which can be recommended for use in the educational process related to the array data structure. The formulations are shortened in comparison with the original texts.

Propose a way to denote the parking places in front of a hotel. How are train carriages enumerated? How would you denote the overall and final results of six teams in a television knowledge contest?

We have recorded GPS data containing ten altitudes on our tourist route. Write a program to print out the altitudes on the reverse route.

Imagine that you have received an SMS from your friend. Write a program to count the number of words in the text message.

A musical instrument, like a piano, can be simulated by a computer program. Some of the keys are assigned a particular tone frequency. Write a program to play a tone when a particular key is pressed.

Consider the starting sequence of children’s names and the final shift of a Ferris wheel as the input. Write a program to make a list of the children's names in the order in which they will get out of the Ferris wheel.

Write a program to generate twelve random values expressing the number of visits to your website in particular months. Draw a histogram, highlight the maximum and minimum and show the average value as a horizontal line.
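To indicate the level of these tasks, possible solutions to three of them can be sketched as follows. These are not the teachers' reference solutions; the function names are hypothetical, and the Ferris wheel solution assumes that the shift rotates the starting sequence so that the child at position `shift` gets out first.

```python
def reverse_route(altitudes):
    """Return the altitudes in the order seen on the way back."""
    reversed_altitudes = []
    for i in range(len(altitudes) - 1, -1, -1):  # last index down to 0
        reversed_altitudes.append(altitudes[i])
    return reversed_altitudes


def count_words(message):
    """Count words, treating any run of whitespace as one separator."""
    return len(message.split())


def exit_order(children, shift):
    """Rotate the starting sequence by `shift` positions.

    Assumed reading of the task: after the wheel stops, the child at
    index `shift` gets out first, followed by the others in order.
    """
    if not children:
        return []
    shift %= len(children)
    return children[shift:] + children[:shift]


print(reverse_route([310, 355, 420, 480]))
print(count_words("See you  at 5 pm"))
print(exit_order(["Anna", "Boris", "Cyril", "Dana"], 1))
```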

We have also fruitfully applied formal concept analysis as a powerful tool in a parallel analysis that involves the teaching of programming skills in the open-source web application App Inventor 2. The tutorial website¹ provides materials in the form of learning cards for building basic applications, but one of the authors of this paper has prepared a set of ten complex educational tasks which together cover the 129 elements (components and their elements, event handlers, call, set and get instructions, data structures, etc.) available at present. The added value includes the proposal of the introductory set of complex tasks and its further modification in an effort to teach different target groups. Talented lower secondary school pupils participate in our optional university courses, and secondary school teachers attend the didactic workshops at our university. Our results concern the inclusion of the programming language elements (available at the App Inventor website at present) in the complex educational tasks and the effort to extract appropriate tasks for the different phases of an educational process. The formal context contains 10 tasks as the set of objects and 129 App Inventor programming elements as the set of attributes.

By exploring own attributes, the resulting concept lattice and its attribute labels provide information about the elements introduced uniquely by a particular task. In conclusion, we recommend the following methodology: a task with a high ratio of own elements and a low total number of elements is advised for the first acquisition phase of education; a task with a low ratio of own elements and a low total number of elements is recommended for the fixation phase; a task with a low ratio of own elements and a high total number of elements is suggested for the systematization or diagnostic phase; and a task with a high ratio of own elements and a high total number of elements is the least appropriate for an educational process, because it brings many new elements without introducing them in a simpler task.

¹ http://www.appinventor.org/

Data can be viewed simply as the observations, entities or values which are collected in an effort to form information. To ensure readability, data are usually converted into tables and graphs. Tabular data appear frequently, and they can be obtained either directly from medical test results, customer habits or the scores of students, or by various transformations from other forms, for instance, from database tables or questionnaires. The information hidden within data can help to solve many pending issues in the community, in enterprise and in science. Turning data into knowledge and wisdom is beneficial and necessary, whether by simple computing in spreadsheet calculators or by various more complex methods of data analysis. Data collection, preprocessing, reduction, visualization, exploration of dependencies and provision of metadata are important parts of scientific research as well.

Our ability to generate and collect data has increased rapidly in recent decades. This has prompted the need for new techniques and automated tools to help transform a large amount of data into useful information and knowledge. In other words, it is characteristic of the present time that we are rich in data but poor in information.

[1] V. Dhar, Data science and prediction. Communications of the ACM, vol. 56, n. 12, 2013, 64-73.

[2] E. Alpaydin, Introduction to Machine Learning, 2nd ed. Boston: MIT Press, 2010.

[3] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann Publishers, 2012.

[4] C. O'Neil, R. Schutt, Doing Data Science: Straight Talk from the Frontline. O'Reilly Media, 2014.

[5] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations. Springer, Heidelberg, 1999.

[6] U. Priss, P. Riegler, N. Jensen, Using FCA for Modelling Conceptual Difficulties in Learning Processes, in: F. Domenach, D. I. Ignatov, J. Poelmans (Eds.), Contrib. to the 10th Intern. Conf. ICFCA 2012, Leuven, Belgium, 2012, 161-173.

[7] J. Poelmans, G. Dedene, G. Verheyden, H. Van der Mussele, S. Viaene, E. Peters, Combining business process and data discovery techniques for analyzing and improving integrated care pathways, in: P. Perner (Ed.), ICDM 2010, Springer-Verlag Berlin Heidelberg, 2010, 505-517.

[8] J. Poelmans, D. I. Ignatov, S. O. Kuznetsov, G. Dedene, Formal concept analysis in knowledge processing: A survey on applications, Expert Syst. Appl., vol. 40, n. 16, 2013, 6538-6560.

[9] C. Carpineto, G. Romano, Concept Data Analysis: Theory and Applications, J. Wiley, 2004.

[10] Ľ. Antoni, J. Guniš, S. Krajči, O. Krídlo, Ľ. Šnajder, The educational tasks and objectives system within a formal context. CLA 2014: 35-46.