<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Improved bio-inspired technique for big data analytics and machine learning speed optimization</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Andronicus</forename><forename type="middle">A</forename><surname>Akinyelu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Informatics</orgName>
								<orgName type="institution">University of the Free State</orgName>
								<address>
									<settlement>Bloemfontein</settlement>
									<region>Free State</region>
									<country key="ZA">South Africa</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Improved bio-inspired technique for big data analytics and machine learning speed optimization</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C7E665004532782007D57C08D9FBE1E3</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T10:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Big Data Analytics (BDA) is progressively becoming a popular practice in many organizations because of its potential to discover valuable insights for improved decision-making. The International Data Corporation predicts that the Global Datasphere will grow from 33 Zettabytes in 2018 to 175 Zettabytes in 2025. Clearly, we are in the era of Big Data (BD), and the rate of data growth is alarming. Unfortunately, BD does not offer much value in its unprocessed form. Therefore, to unlock the great potential of BD, we need efficient BDA methods. Machine Learning (ML) algorithms are among the most efficient tools for data analytics; however, some ML algorithms cannot effectively handle BD, because their computational complexity increases with data size. Researchers have therefore introduced various techniques for improving the speed of ML algorithms, including feature selection, instance selection, sampling, and distributed computing. However, most of them fail to achieve a balanced trade-off between storage reduction and predictive accuracy <ref type="bibr" target="#b0">[1]</ref>. Therefore, this paper introduces a boundary detection and instance selection technique for improving the speed of ML-based BDA, called the Ant Colony Optimization Instance Selection Algorithm for Machine Learning (ACOISA_ML). The key highlights of ACOISA_ML are outlined below:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Boundary identification:</head><p>The first stage of ACOISA_ML is the boundary identification stage. Unlike other ACO-based instance selection techniques that use the ACO algorithm directly for instance or feature selection, ACOISA_ML uses the ACO algorithm for boundary identification. It adopts the concept of ACO edge selection to search for different boundaries (not to select instances). To the best of the author's knowledge, this study is one of the first to adopt the concept of ACO edge detection for instance selection problems; this concept is mostly used for image edge detection (not data boundary identification) <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref>.</p></div>
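<div xmlns="http://www.tei-c.org/ns/1.0"><p>The paper does not give the transition or deposit formulas here, so the following is only a minimal illustrative sketch of the idea of ACO-style boundary identification: ants walk over instances, transition probabilities follow the classic pheromone-times-heuristic rule, and pheromone accumulates on instances lying close to the opposite class. The function names, the nearest-opposite-class heuristic, and all parameter values are assumptions for illustration, not the authors' exact design.</p><p>
```python
import math
import random

def nearest_enemy_distance(X, y, i):
    # Distance from instance i to its nearest opposite-class instance.
    return min(math.dist(X[i], X[j]) for j in range(len(X)) if y[j] != y[i])

def aco_boundary_scores(X, y, n_ants=20, n_steps=30, evaporation=0.1, seed=0):
    # Toy ACO walk (assumed design): ants hop between instances with
    # probability proportional to pheromone * heuristic, and deposit
    # pheromone on the instances they visit. Instances near the class
    # boundary (small enemy distance -> large heuristic) accumulate
    # the most pheromone.
    rng = random.Random(seed)
    n = len(X)
    eta = [1.0 / (1e-9 + nearest_enemy_distance(X, y, i)) for i in range(n)]
    tau = [1.0] * n                       # pheromone per instance
    for _ in range(n_ants):
        pos = rng.randrange(n)
        for _ in range(n_steps):
            # Transition rule: p(j) proportional to tau[j] * eta[j].
            weights = [tau[j] * eta[j] for j in range(n)]
            r, acc = rng.random() * sum(weights), 0.0
            for j, w in enumerate(weights):
                acc += w
                if acc >= r:
                    pos = j
                    break
            tau[pos] += eta[pos]          # deposit on the visited instance
        tau = [(1.0 - evaporation) * t for t in tau]  # evaporation
    return tau

# Two 1-D clusters; the boundary instances are at 0.9 and 1.1.
X = [[0.0], [0.2], [0.9], [1.1], [1.8], [2.0]]
y = [0, 0, 0, 1, 1, 1]
scores = aco_boundary_scores(X, y)
best = max(range(len(X)), key=scores.__getitem__)
```
</p><p>On this toy data the highest-pheromone instance is one of the two points straddling the class boundary, which is exactly the signal the boundary identification stage needs.</p></div>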
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Boundary instance selection:</head><p>The second stage of the proposed technique is the boundary instance selection stage. After identifying different boundaries, ACOISA_ML selects the best boundary and uses k-NN to select the relevant instances for training (that is, instances close to the best-identified boundary).</p></div>
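<div xmlns="http://www.tei-c.org/ns/1.0"><p>The exact k-NN selection rule is not spelled out in this excerpt; a plausible reading, sketched below, is that for each class only the k instances nearest to the chosen boundary instance are kept for training. The function name, the per-class nearest-k rule, and the sample data are illustrative assumptions, not the paper's exact procedure.</p><p>
```python
import math

def select_boundary_instances(X, y, boundary_idx, k=2):
    # Keep, for each class, the k instances nearest to the chosen
    # boundary instance; everything far from the boundary is dropped.
    by_class = {}
    for i in range(len(X)):
        by_class.setdefault(y[i], []).append(i)
    selected = []
    for label, idxs in by_class.items():
        idxs.sort(key=lambda i: math.dist(X[i], X[boundary_idx]))
        selected.extend(idxs[:k])
    return sorted(selected)

X = [[0.0], [0.2], [0.9], [1.1], [1.8], [2.0]]
y = [0, 0, 0, 1, 1, 1]
subset = select_boundary_instances(X, y, boundary_idx=2, k=2)  # -> [1, 2, 3, 4]
```
</p><p>The reduced subset keeps the instances on either side of the boundary and discards the interior points of each cluster, which is what lets the hybrid models train faster without losing the decision-relevant instances.</p></div>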
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Heuristic value computation:</head><p>This study introduces a novel method for computing the heuristic value for ACO. This method is suitable for boundary instance selection problems. ACOISA_ML is designed to use the proposed computation method to calculate</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>the heuristic value for each instance in the dataset. As mentioned above, ACO is used to identify the best boundary instance, that is, the instance with the highest pheromone value. Hence, the heuristic value for each instance is designed to reflect the boundary information of that instance.</p><p>The technique was evaluated on five ML algorithms, namely Artificial Neural Network (ANN), Random Forest (RF), Naïve Bayes (NB), k-Nearest Neighbor (k-NN), and Logistic Regression (LR). In this study, we refer to the models produced by the full dataset as standard models and the models produced by the reduced subset as hybrid models. Finally, we compare the hybrid models to the standard models based on the following criteria: (i) the ability to preserve prediction accuracy, (ii) training speed, (iii) storage reduction percentage, and (iv) algorithm time (or instance selection time). All the datasets used in this study were obtained from the UCI data repository <ref type="bibr" target="#b4">[5]</ref>. Annexures 1 and 2 show the average training speed and prediction accuracy produced by the standard models (denoted as Standard) and hybrid models (denoted as Hybrid). As shown in the Annexures, the hybrid models achieved better training speed than the standard models without significantly affecting their prediction accuracy. Moreover, the right-hand side of Annexure 1 shows the average algorithm time (denoted as Alg-T) and the average storage reduction percentage (denoted as Av-Sto) achieved by ACOISA_ML. The storage reduction percentage represents the fraction of instances selected after data reduction. As shown in the Annexure, ACOISA_ML reduced the storage size of the evaluated big datasets by over 55% (in most cases) without substantially affecting their quality. Moreover, ACOISA_ML achieved good instance selection time: it used an average of 36.8 seconds to reduce the largest dataset evaluated in this study (i.e. the Twitter dataset). This shows the effectiveness of ACOISA_ML for BDA.</p><p>In addition, ACOISA_ML was compared to four recent instance selection algorithms, namely LDIS, LSSM, LSBO, and ISDSP. These algorithms were evaluated on SVM; hence, we first evaluated ACOISA_ML on SVM before comparing it to them. Annexure 3 shows the prediction accuracy (denoted as Accuracy) and storage reduction percentage (denoted as Storage) for the algorithms. The best prediction accuracy for each dataset is underlined. As shown, ACOISA_ML outperformed LSSM in prediction accuracy on 6 out of 11 datasets and outperformed LSBO on 7 out of 11 datasets. Moreover, the results show that ACOISA_ML outperformed LDIS and ISDSP on 9 out of 11 datasets. Furthermore, a t-test statistical analysis was performed to evaluate the speed-improving capacity of ACOISA_ML. Specifically, we compared the training speed produced by the hybrid models of ANN and RF to the training speed produced by the standard algorithms. The p-values produced by all the test analyses are less than 0.05; hence, we can conclude with a 95% confidence level that ACOISA_ML is significantly faster, in terms of training speed, than the analyzed standard algorithms. Overall, the results show that the proposed technique is suitable for fast and simplified BDA and ML speed optimization.</p><p>Keywords: Big data analytics, Machine learning, Instance selection, Data reduction, Speed optimization.</p></div>
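<div xmlns="http://www.tei-c.org/ns/1.0"><p>The paired t-test described above can be reproduced with the standard library alone. The sketch below computes the paired t statistic on hypothetical standard-vs-hybrid training times (the numbers are illustrative, not the paper's measurements) and compares it to the two-tailed critical value for df = 4 at alpha = 0.05, which is 2.776.</p><p>
```python
import math
import statistics

def paired_t_statistic(a, b):
    # Paired t-test statistic: mean of the per-dataset differences
    # divided by the standard error of those differences.
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)        # sample std dev of differences
    return mean_d / (sd_d / math.sqrt(len(diffs)))

# Hypothetical training times in seconds (standard vs hybrid models).
standard = [115.2, 365.3, 79.4, 278.2, 255.0]
hybrid = [40.8, 178.0, 19.3, 157.8, 109.0]

t = paired_t_statistic(standard, hybrid)
# With df = 4, the two-tailed critical value at alpha = 0.05 is 2.776,
# so t > 2.776 implies p < 0.05 (the hybrid models are significantly faster).
```
</p><p>In practice one would report the exact p-value (e.g. via scipy.stats.ttest_rel); the critical-value comparison above is the equivalent stdlib-only check.</p></div>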
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Annexures</head></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Annexure 1.</head><label>1</label><figDesc>Average training time for the hybrid model and standard model, per dataset and algorithm (k-NN, ANN, RF, LR, NB). Key: Standard: time produced by the standard algorithms, Hybrid: time produced by the hybrid model, Sel-T: average instance selection time (in seconds), Av-Sto: average storage percentage.</figDesc></figure><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Annexure 2.</head><label>2</label><figDesc>Average prediction accuracy for the hybrid and standard model. Key: Standard: average prediction accuracy (%) produced by the standard algorithm, Hybrid: average prediction accuracy produced by the hybrid model.</figDesc><table><row><cell>Datasets</cell><cell cols="2">KNN</cell><cell cols="2">ANN</cell><cell cols="2">RF</cell><cell cols="2">LR</cell><cell cols="2">NB</cell></row><row><cell></cell><cell>Standard</cell><cell>Hybrid</cell><cell>Standard</cell><cell>Hybrid</cell><cell>Standard</cell><cell>Hybrid</cell><cell>Standard</cell><cell>Hybrid</cell><cell>Standard</cell><cell>Hybrid</cell></row><row><cell>Landstat</cell><cell>90.55</cell><cell>86.86</cell><cell>88.5</cell><cell>85.165</cell><cell>83.75</cell><cell>82.1</cell><cell>91.05</cell><cell>88.085</cell><cell>79.6</cell><cell>78.705</cell></row><row><cell>Letter</cell><cell>95.725</cell><cell>90.995</cell><cell>80.975</cell><cell>79.39</cell><cell>77.375</cell><cell>75.9675</cell><cell>96.175</cell><cell>91.9125</cell><cell>62.3</cell><cell>62.2725</cell></row><row><cell>Mushroom</cell><cell>100</cell><cell>99.91</cell><cell>98.966</cell><cell>99.015</cell><cell>95.4825</cell><cell>99.97</cell><cell>100</cell><cell>99.95</cell><cell>90.8296</cell><cell>91.945</cell></row><row><cell>Optdigit</cell><cell>97.8297</cell><cell>93.7841</cell><cell>96.5498</cell><cell>93.4613</cell><cell>92.3205</cell><cell>86.9004</cell><cell>97.3845</cell><cell>90.384</cell><cell>89.4268</cell><cell>83.9232</cell></row><row><cell>Page-bloc</cell><cell>96.0168</cell><cell>96.916</cell><cell>96.2361</cell><cell>97.328</cell><cell>96.4553</cell><cell>97.408</cell><cell>97.5333</cell><cell>97.948</cell><cell>90.846</cell><cell>92.764</cell></row><row><cell>Shuttle</cell><cell>99.9103</cell><cell>99.7752</cell><cell>99.7517</cell><cell>99.7062</cell><cell>96.8345</cell><cell>96.76</cell><cell>99.9931</cell><cell>99.9028</cell><cell>92.2069</cell><cell>92.5297</cell></row><row><cell>Twitter</cell><cell>96.0911</cell><cell>94.6939</cell><cell>96.4109</cell><cell>94.7831</cell><cell>96.5566</cell><cell>95.3936</cell><cell>96.6881</cell><cell>95.1853</cell><cell>94.9611</cell><cell>93.2472</cell></row><row><cell>USPS</cell><cell>95.1171</cell><cell>93.7369</cell><cell>94.3199</cell><cell>93.2835</cell><cell>89.5366</cell><cell>86.871</cell><cell>93.3732</cell><cell>92.3119</cell><cell>76.7813</cell><cell>75.1022</cell></row><row><cell>Pentdigit</cell><cell>97.7416</cell><cell>92.8845</cell><cell>89.8228</cell><cell>89.1367</cell><cell>89.8228</cell><cell>89.1367</cell><cell>96.5981</cell><cell>90.7919</cell><cell>82.1326</cell><cell>81.6352</cell></row><row><cell>Waveform</cell><cell>80.24</cell><cell>81.8</cell><cell>83.84</cell><cell>85.2667</cell><cell>87.08</cell><cell>87.5167</cell><cell>85.24</cell><cell>85.5167</cell><cell>81.02</cell><cell>80.6083</cell></row></table></figure><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Annexure 3.</head><label>3</label><figDesc>Comparison between ACOISA_ML and LDIS, LSSM, LSBO, ISDSP (SVM classifier) on the Cardiotocography, Ecoli, Heart-statlog, Ionosphere, Landsat, Letter, Optdigits, Page-blocks, Parkinson, Segment, and Wine datasets. Key: Accuracy: average prediction accuracy (%) produced by the hybrid models, Storage: average storage reduction percentage produced by the instance selection techniques.</figDesc></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Three new instance selection methods based on local sets: A comparative study with several approaches from a bi-objective perspective</title>
		<author>
			<persName><forename type="first">E</forename><surname>Leyva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>González</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pérez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="1523" to="1537" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">An ant colony optimization algorithm for image edge detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence)</title>
				<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="751" to="756" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Edge Detection Improvement by Ant Colony Optimization Compared to Traditional Methods on Brain MRI Image</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nayak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dash</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications on Applied Electronics (CAE)</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="19" to="23" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Edge Technique Using ACO with PSO for Noisy Image</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gautam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Biswas</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="383" to="396" />
			<pubPlace>Singapore</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">UCI machine learning repository</title>
		<author>
			<persName><forename type="first">K</forename><surname>Bache</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lichman</surname></persName>
		</author>
		<ptr target="http://archive.ics.uci.edu/ml" />
		<imprint>
			<date type="published" when="2013">2013 (accessed 12-May-2017)</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
