<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">EasyMiner/R Preview: Towards a Web Interface for Association Rule Learning and Classification in R</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Stanislav</forename><surname>Vojíř</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Information and Knowledge Engineering</orgName>
								<orgName type="institution">Faculty of Informatics and Statistics</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Economics</orgName>
								<address>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Václav</forename><surname>Zeman</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Information and Knowledge Engineering</orgName>
								<orgName type="institution">Faculty of Informatics and Statistics</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Economics</orgName>
								<address>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jaroslav</forename><surname>Kuchař</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">Faculty of Information Technology</orgName>
								<orgName type="laboratory">Web Intelligence Research Group</orgName>
								<orgName type="institution">Czech Technical University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tomáš</forename><surname>Kliegr</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Information and Knowledge Engineering</orgName>
								<orgName type="institution">Faculty of Informatics and Statistics</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Economics</orgName>
								<address>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">EasyMiner/R Preview: Towards a Web Interface for Association Rule Learning and Classification in R</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">2309C49EC7B3318556684A50C2908678</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:19+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>association rules</term>
					<term>R</term>
					<term>web service</term>
					<term>classification</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>EasyMiner is a web-based visual interface for association rule learning. This paper presents a preview of the next release, which uses the R environment as the data processing backend. EasyMiner/R uses the arules package to learn rules. It uses the Classification Based on Associations (CBA) algorithm both as a classifier and for rule pruning. Experimental results show that EasyMiner with the R-based backend is able to handle larger datasets than the previous version.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>This paper describes a preview version of the next generation of the EasyMiner system for interactive association rule learning and classification. EasyMiner consists of an interactive web-based user interface and a web-service layer that wraps an association rule learning implementation. The system was first introduced at ECML'12 <ref type="bibr" target="#b0">[1]</ref> with the LISp-Miner (lispminer.vse.cz) system as the association rule learning backend. In this paper, we introduce a new version, which uses the popular arules package <ref type="bibr" target="#b1">[2]</ref> for the R environment. In addition to association rule learning, the new release supports classification based on association rules using the Classification Based on Associations (CBA) algorithm <ref type="bibr" target="#b2">[3]</ref>.</p><p>The benefits of the new version are as follows. The arules package, which implements the apriori algorithm <ref type="bibr" target="#b3">[4]</ref>, provides better performance on larger datasets. The addition of the CBA algorithm enables new use cases: apart from supporting the classification task, CBA can be used as a rule pruning algorithm. Reducing the number of rules in the output is vital for many applications, including business rule learning <ref type="bibr" target="#b4">[5]</ref>.</p><p>This paper is organized as follows. Section 2 walks through the system's user interface. Section 3 describes the backend. Section 4 briefly describes the CBA component. Section 5 covers the evaluation of the system. The conclusions summarize the contribution and outline future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Workflow in the User Interface</head><p>As the first step, the user logs in using a local account or a social network account. The user then uploads the dataset as a CSV or zipped CSV file and selects the mining backend (R or LISp-Miner). In the background, this operation creates a named miner associated with a database table holding the raw data and an empty table holding the preprocessed data.</p><p>The user can optionally perform data preprocessing. This is especially needed for numerical attributes due to limitations of the apriori algorithm. To preprocess a field, the user drags it from the Data fields palette to the Attributes palette (Fig. <ref type="figure" target="#fig_0">1A,B</ref>) and selects the preprocessing type (e.g. equidistant binning). This creates, from a field in the input CSV file, an attribute that can be used in the rule pattern. In the background, the data are processed immediately. To facilitate processing of larger datasets, the system allows the preprocessing step to be skipped, in which case verbatim copies of the fields in the input dataset are used as attributes.</p><p>In the main mining interface (Fig. <ref type="figure" target="#fig_0">1</ref>), the user defines preprocessing instructions and the rule pattern. The definition is based on drag-and-drop operations: the user drags an attribute from the Attributes palette (Fig. <ref type="figure" target="#fig_0">1B</ref>) and drops it into the antecedent or consequent part of the rule pattern (Fig. <ref type="figure" target="#fig_0">1C</ref>).</p><p>Finally, the user executes the task, which sends the task definition to the mining backend. The system supports ordering of the discovered rules (Fig. <ref type="figure" target="#fig_0">1D</ref>) by the values of interest measures. Selected rules can be stored in the Rule Clipboard (Fig. <ref type="figure" target="#fig_0">1E</ref>), the contents of which persist across multiple tasks on the same miner, or in the Knowledge Base, which persists across multiple miners.</p><p>The user interface layer is implemented in PHP (using the Nette Framework) and JavaScript.</p></div>
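<div xmlns="http://www.tei-c.org/ns/1.0"><p>The equidistant binning mentioned among the preprocessing types can be illustrated with a minimal Python sketch. It is not taken from the EasyMiner code base: the function names, the interval label format and the sample values are assumptions made for illustration only. The sketch splits the observed range of a numeric field into intervals of equal width and maps each value to an interval label that apriori can treat as a discrete item.</p>
<p><code lang="python"><![CDATA[
```python
from typing import List, Tuple


def equidistant_bins(values: List[float], bin_count: int) -> List[Tuple[float, float]]:
    """Split the observed value range into bin_count intervals of equal width."""
    if bin_count < 1:
        raise ValueError("bin_count must be at least 1")
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    return [(low + i * width, low + (i + 1) * width) for i in range(bin_count)]


def assign_bin(value: float, bins: List[Tuple[float, float]]) -> str:
    """Return an interval label (hypothetical format) for a numeric value."""
    last = len(bins) - 1
    for index, (start, end) in enumerate(bins):
        # Intervals are left-closed; only the last one also includes its upper bound.
        if start <= value < end or (index == last and value == end):
            closing = "]" if index == last else ")"
            return f"[{start:g};{end:g}{closing}"
    raise ValueError(f"{value} lies outside the binned range")


# Illustrative values of a numeric field (assumed, not from the paper's dataset).
ages = [18, 22, 35, 41, 58, 60]
bins = equidistant_bins(ages, 3)
labels = [assign_bin(age, bins) for age in ages]
print(labels)
```
]]></code></p>
<p>For the sample values, the range from 18 to 60 is cut into three intervals of width 14, so that 35 and 41 fall into the same bin and become the same item in the rule pattern.</p></div>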
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">R backend</head><p>EasyMiner-Apriori-R is a REST service wrapper over the R arules library. It serves as a backend in EasyMiner, but can also be used as an independent web service.</p><p>This wrapper processes HTTP requests, transforms them into R scripts and forwards them to the R environment. The web service is implemented in the Scala language as a stand-alone application; it can therefore run on the JVM (version 7+) without additional containers.</p><p>It provides an HTTP facade for submitting association rule mining tasks in the GUHA AR PMML format, an extension of the PMML AssociationRules model<ref type="foot" target="#foot_0">3</ref> that supports both standard association rules mined by apriori and the more expressive GUHA rules mined by LISp-Miner, the original backend in EasyMiner. The PMML input is processed asynchronously and transformed into R scripts that use MySQL queries for data loading and itemset pruning, and the arules library for association rule mining. The input data to be mined must be stored in a MySQL database, so the PMML task input has to contain both information about the task (interest measure, antecedent and consequent definitions) and information about the database connection.</p><p>For data transmission between the REST service and the R environment we use the Rserve server. Any R script sent to the R server has to be completely initialized, and all required libraries must be loaded for each request; this initialization may take several seconds. To solve this problem, we implemented an Rserve connection pooling system that pre-initializes R scripts: when a user posts a mining task request to the server, the system pulls a prepared connection from the connection pool, and the R apriori mining method is called immediately without waiting for initialization. The result of the mining process is a PMML output file in which the discovered association rules are saved. EasyMiner-Apriori-R is completely thread-safe and can handle several mining requests concurrently; the number of parallel connections depends on the actor system settings, since the implementation uses the Akka (http://akka.io) framework with Spray (http://spray.io).</p></div>
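<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea behind the connection pooling can be sketched as follows. This is a simplified Python illustration, not the Scala/Akka implementation of EasyMiner-Apriori-R: the classes PreparedConnection and ConnectionPool are hypothetical, the initialization cost is simulated with a short sleep, and returning a connection to the pool after use is an assumption of the sketch. The point is that the expensive initialization is paid once, when the pool is filled, rather than on each mining request.</p>
<p><code lang="python"><![CDATA[
```python
import queue
import time
from contextlib import contextmanager


class PreparedConnection:
    """Hypothetical stand-in for an R session whose libraries are already loaded."""

    def __init__(self, init_seconds: float) -> None:
        time.sleep(init_seconds)  # simulates the costly per-connection initialization
        self.ready = True

    def run(self, script: str) -> str:
        # A real connection would evaluate the script in R; here we only echo it.
        return f"executed: {script}"


class ConnectionPool:
    """Holds a fixed number of connections that were initialized ahead of time."""

    def __init__(self, size: int, init_seconds: float = 0.01) -> None:
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(PreparedConnection(init_seconds))

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks only if every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # assumed policy: reuse the connection afterwards


pool = ConnectionPool(size=2)
with pool.connection() as conn:
    # No initialization delay here: the connection was prepared in advance.
    result = conn.run("apriori task")
print(result)
```
]]></code></p>
<p>Because queue.Queue is thread-safe, several request handlers can draw from the same pool concurrently; the pool size then bounds the number of mining tasks that run in parallel.</p></div>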
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Classification Based on Associations</head><p>CBA <ref type="bibr" target="#b2">[3]</ref> is considered the first algorithm for classification based on association rules. The algorithm was proposed in 1998 and has multiple successors, such as CPAR <ref type="bibr" target="#b5">[6]</ref> and CMAR <ref type="bibr" target="#b6">[7]</ref>. We selected the original algorithm rather than any of its successors because it provides a desirable balance between accuracy and low rule count.</p><p>In general, CBA proceeds as follows. The rules output by the apriori algorithm are sorted according to some criteria and then pruned, i.e. some rules are removed. Finally, a default rule is added at the bottom of the rule list, ensuring that all instances are covered. An unlabeled instance is classified by the top-ranked matching rule.</p><p>Since CBA only removes rules from the original list output by association rule learning, it can also be used for rule pruning. More compact rule sets have the advantage of better interpretability. The pruned rule set generated by CBA has the following desirable properties: a) each training case is covered by the rule with the highest precedence among the rules that can cover the case, and b) every rule correctly classifies at least one training case.</p><p>The CBA implementation in EasyMiner/R follows the optimized M2 version of the algorithm <ref type="bibr" target="#b2">[3]</ref>. Pessimistic pruning, an optional step in the original algorithm, is currently not included. The software is implemented in Java and wrapped with rJava into an R package. It is executed as an optional step within the R backend.</p></div>
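<div xmlns="http://www.tei-c.org/ns/1.0"><p>The general procedure described above (sort, prune, add a default rule, classify by the first matching rule) can be sketched in Python. This is a deliberately simplified illustration and not the M2 variant used in EasyMiner/R: it uses a naive coverage pass, omits the error-based cut-off of the rule list and pessimistic pruning, and all names (Rule, build_classifier, classify) as well as the toy data are assumptions.</p>
<p><code lang="python"><![CDATA[
```python
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple

Case = Tuple[Dict[str, str], str]  # (attribute values, class label)


class Rule(NamedTuple):
    antecedent: Dict[str, str]
    label: str
    confidence: float
    support: float


def matches(rule: Rule, attributes: Dict[str, str]) -> bool:
    return all(attributes.get(key) == value for key, value in rule.antecedent.items())


def build_classifier(rules: List[Rule], training: List[Case]) -> Tuple[List[Rule], str]:
    # Precedence: higher confidence first, then higher support.
    # sorted() is stable, so remaining ties keep their original order.
    ranked = sorted(rules, key=lambda r: (-r.confidence, -r.support))
    remaining = list(training)
    kept: List[Rule] = []
    for rule in ranked:
        covered = [case for case in remaining if matches(rule, case[0])]
        # Keep a rule only if it correctly classifies at least one uncovered case.
        if any(label == rule.label for _, label in covered):
            kept.append(rule)
            remaining = [case for case in remaining if not matches(rule, case[0])]
    # Default rule: majority class of the uncovered cases (or of all cases if none remain).
    pool = remaining or training
    default_label = Counter(label for _, label in pool).most_common(1)[0][0]
    return kept, default_label


def classify(attributes: Dict[str, str], kept: List[Rule], default_label: str) -> str:
    for rule in kept:  # the top-ranked matching rule decides
        if matches(rule, attributes):
            return rule.label
    return default_label


training = [
    ({"genre": "fantasy", "media": "print"}, "1"),
    ({"genre": "fantasy", "media": "ebook"}, "1"),
    ({"genre": "horror", "media": "print"}, "0"),
    ({"genre": "horror", "media": "ebook"}, "0"),
    ({"genre": "poetry", "media": "print"}, "0"),
]
rules = [
    Rule({"genre": "fantasy"}, "1", 1.0, 0.4),
    Rule({"genre": "fantasy", "media": "print"}, "1", 1.0, 0.2),
    Rule({"genre": "horror"}, "0", 1.0, 0.4),
    Rule({"media": "print"}, "0", 0.67, 0.4),
]
kept, default_label = build_classifier(rules, training)
prediction = classify({"genre": "fantasy", "media": "print"}, kept, default_label)
```
]]></code></p>
<p>In this toy example the more specific fantasy rule is dropped, because the general fantasy rule has already covered every case it matches. This is how the procedure acts as a pruning step as well as a classifier.</p></div>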
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation</head><p>This section compares the time requirements of EasyMiner/R with those of the previous version. We also empirically describe the effect of pruning on rule count and computation time.</p><p>Dataset. As the evaluation dataset we used the DBpedia Binary training dataset, generated from the ESWC 2014 Recommender Systems Challenge.<ref type="foot" target="#foot_1">4</ref> This dataset contains 72,371 rows with the following attributes: author (2,551 unique values), country (77 unique values), language (53 unique values), literaryGenre (307 unique values), mediaType (32 unique values), publisher (676 unique values) and rating (values 0 and 1). The dataset is very sparse, with many missing values. No preprocessing was performed. The benchmark dataset is available on the easyminer.eu website.</p><p>Task setting. The mining task was constrained so that the attribute rating is present in the consequent and the remaining attributes in the antecedent. The minimum confidence threshold was set to 0.5.</p><p>The count of discovered rules depends on the missing value treatment. Columns denoted as with pruning contain the time required for the combined task of mining association rules and pruning them with the CBA algorithm.</p><p>Benchmark setup. We evaluate three configurations: a) the time required by the previous backend (LISp-Miner) run on a desktop, b) the time required by the current backend (arules) run on a desktop, and c) the time required if mining is run from the frontend, i.e. from the user interface in a web browser. This allows us to demonstrate the improvement over the previous version as well as to show what additional latencies or gains EasyMiner/R introduces compared to running the new arules backend directly on the user's computer.</p><p>All three setups exclude the time required to preprocess the datasets and import them into a database.</p><p>The time reported for EasyMiner/R essentially amounts to: transmission of the task from the user's browser to the front-end server, serialization of the task to PMML on the front-end server, transmission to the backend server, execution of the mining task, serialization of the results to PMML, transmission to the front-end server, and display of the first ten rules in the user's browser. The reported result was measured in Firefox 38.0.1 (average across three runs).</p><p>Hardware and software configuration. The evaluation was performed using the latest version of LISp-Miner (25.14.06) run on Intel Core i7-3930K @ 3.2GHz, 3.5GB RAM.</p><p>Results. The results shown in Table <ref type="table" target="#tab_0">1</ref> indicate that EasyMiner with the arules package as the backend is significantly faster than the LISp-Miner backend at lower support thresholds. An interesting observation is that for tasks without pruning that produce a smaller number of rules, the user even gets the results faster with EasyMiner/R than from the R console on her own computer. This is due to a saving associated with the database connection pooling.</p><p>For tasks with pruning, the mining takes longer, but on this particular dataset the resulting rule set is up to nine times smaller, saving the time of the human analyst or computer system doing subsequent processing.</p></div>
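<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reduction factor quoted for pruning can be checked against the rule counts in Table 1. The short Python sketch below copies the counts from the table and assumes that the pruned counts correspond to the runs with missing values, as the column header suggests; the variable names are illustrative.</p>
<p><code lang="python"><![CDATA[
```python
# (minimum support, rules mined with missing values, rules left after CBA pruning),
# copied from Table 1.
rows = [
    (0.010, 163, 54),
    (0.009, 186, 68),
    (0.008, 213, 73),
    (0.007, 295, 90),
    (0.006, 397, 107),
    (0.005, 552, 141),
    (0.004, 765, 184),
    (0.003, 1147, 253),
    (0.002, 2699, 430),
    (0.001, 6034, 697),
]

# Ratio of mined to pruned rule counts for each support threshold.
reduction = {support: mined / pruned for support, mined, pruned in rows}
largest = max(reduction.values())
best_support = max(reduction, key=reduction.get)
print(f"largest reduction: {largest:.1f}x at support {best_support}")
```
]]></code></p>
<p>The ratio grows as the support threshold decreases and reaches its maximum at the lowest threshold in the table, where 6034 rules are reduced to 697.</p></div>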
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions and future work</head><p>EasyMiner/R, available at http://easyminer.eu, is an experimental academic system for association rule learning and for building classifiers composed of association rules. As empirically validated, the new version with the R backend can handle larger datasets. Compared to the previous version using the LISp-Miner system, the new version does not support expressive constructs such as disjunctions and negations in the rule pattern. Another limitation of the current version, stemming from the use of the MySQL database, is a limit on the maximum number of fields in the input dataset. We plan to remove the latter limitation in the next release, which will use a column-oriented database as the storage backend. Another much needed extension is execution of the benchmark on a representative sample of datasets.</p><p>EasyMiner/R is designed to be used as a complete integrated system; however, its individual components, the EasyMiner-Apriori-R<ref type="foot" target="#foot_2">5</ref> web service wrapper for the R arules package and the CBA implementation<ref type="foot" target="#foot_3">6</ref>, can also be used independently.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. User interface of EasyMiner/R</figDesc><graphic coords="3,135.63,115.83,344.96,325.72" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Time requirements of rule mining (confidence=0.5)</figDesc><table><row><cell></cell><cell cols="3">rule count</cell><cell cols="2">backend-only</cell><cell cols="2">EasyMiner/R</cell></row><row><cell>support</cell><cell>w/o miss.</cell><cell>w miss.</cell><cell>w miss. pruned</cell><cell>LISp-Miner</cell><cell>arules</cell><cell>mining</cell><cell>with prun.</cell></row><row><cell>0.010</cell><cell>79</cell><cell>163</cell><cell>54</cell><cell>3 s</cell><cell>6.2 s</cell><cell>5.4 s</cell><cell>33.8 s</cell></row><row><cell>0.009</cell><cell>95</cell><cell>186</cell><cell>68</cell><cell>6 s</cell><cell>6.4 s</cell><cell>5.4 s</cell><cell>27.8 s</cell></row><row><cell>0.008</cell><cell>112</cell><cell>213</cell><cell>73</cell><cell>16 s</cell><cell>6.2 s</cell><cell>5.4 s</cell><cell>31.7 s</cell></row><row><cell>0.007</cell><cell>144</cell><cell>295</cell><cell>90</cell><cell>27 s</cell><cell>6.3 s</cell><cell>5.5 s</cell><cell>31.7 s</cell></row><row><cell>0.006</cell><cell>187</cell><cell>397</cell><cell>107</cell><cell>1 m 10 s</cell><cell>6.3 s</cell><cell>5.5 s</cell><cell>35.6 s</cell></row><row><cell>0.005</cell><cell>256</cell><cell>552</cell><cell>141</cell><cell>4 m 38 s</cell><cell>6.3 s</cell><cell>5.7 s</cell><cell>35.5 s</cell></row><row><cell>0.004</cell><cell>396</cell><cell>765</cell><cell>184</cell><cell>28 m 04 s</cell><cell>6.5 s</cell><cell>6.0 s</cell><cell>37.8 s</cell></row><row><cell>0.003</cell><cell>602</cell><cell>1147</cell><cell>253</cell><cell>&gt;5 h</cell><cell>6.5 s</cell><cell>8.6 s</cell><cell>43.3 s</cell></row><row><cell>0.002</cell><cell>1391</cell><cell>2699</cell><cell>430</cell><cell>&gt;6 h</cell><cell>6.5 s</cell><cell>14.0 s</cell><cell>1 m 04.1 s</cell></row><row><cell>0.001</cell><cell>3394</cell><cell>6034</cell><cell>697</cell><cell>&gt;6 h</cell><cell>6.7 s</cell><cell>15.1 s</cell><cell>1 m 59.0 s</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">http://www.dmg.org/v4-0-1/AssociationRules.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">Dataset: http://easyminer.eu/images/data/dbpedia_train_small.zip</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">https://github.com/KIZI/EasyMiner-Apriori-R</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">https://github.com/jaroslav-kuchar/rCBA</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgment</head><p>The research and development of EasyMiner is performed under grant IGA 21/2013. The development of the EasyMiner-Apriori-R component was supported by the European Union's 7th Framework Programme via the LinkedTV project (no. FP7-287911). The development of the CBA component was supported by CESNET grant no. 540/2014. Stanislav Vojíř and Tomáš Kliegr were supported in writing this paper by the University of Economics, Prague within the "institutional support for long term research" scheme.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Association rule mining following the web search paradigm</title>
		<author>
			<persName><forename type="first">R</forename><surname>Škrabal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šimůnek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vojíř</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hazucha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Marek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chudán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kliegr</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning and Knowledge Discovery in Databases</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">P</forename><forename type="middle">R</forename><surname>Flach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>De Bie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Cristianini</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">7524</biblScope>
			<biblScope unit="page" from="808" to="811" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">arules - a computational environment for mining association rules and frequent item sets</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hahsler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Grün</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hornik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Statistical Software</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">15</biblScope>
			<biblScope unit="page" from="1" to="25" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Integrating classification and association rule mining</title>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">KDD&apos;98</title>
				<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="80" to="86" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Mining association rules between sets of items in large databases</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Imielinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Swami</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGMOD</title>
				<imprint>
			<date type="published" when="1993">1993</date>
			<biblScope unit="page" from="207" to="216" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Learning business rules with association rule classifiers</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kliegr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kuchař</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sottara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vojíř</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">RuleML</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">CPAR: Classification based on predictive association rules</title>
		<author>
			<persName><forename type="first">X</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SDM, SIAM</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Barbará</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Kamath</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="331" to="335" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">CMAR: Accurate and efficient classification based on multiple class-association rules</title>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2001 IEEE International Conference on Data Mining. ICDM &apos;01</title>
				<meeting>the 2001 IEEE International Conference on Data Mining. ICDM &apos;01<address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="369" to="376" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
