<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Detecting SQL Injection Attacks using Machine Learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Béatrice</forename><surname>Moissinac</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Elie</forename><surname>Saad</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Miranda</forename><surname>Clay</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Maialen</forename><surname>Berrondo</surname></persName>
						</author>
						<title level="a" type="main">Detecting SQL Injection Attacks using Machine Learning</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">B68E9D9C737A159FF221B0741862EC21</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:58+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>SQL injection</term>
					<term>Machine Learning</term>
					<term>Language mixture</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Injection attacks such as SQL injection attacks (SQLia) are commonly used against systems. The consequences of those attacks include financial, data, and reputational loss, or worse. SQLia can be detected by analyzing the HyperText Transfer Protocol (HTTP) request data through which the SQLia is transmitted to the target resource. Various statistical and analytical tools exist today to detect SQLia; however, they are prone to false positives, which limits their usage in production environments.</p><p>In this paper, we (1) propose a method of feature engineering to generate SQL and HTTP language mixtures, (2) use these mixtures to significantly reduce the time and effort needed by Subject Matter Experts (SMEs) to label data, and (3) evaluate supervised Machine Learning models using this feature engineering method. Furthermore, a major contribution of this paper is that our proposed solution is developed and evaluated using real-world HTTP request data sampled from authentication transactions served by a major Identity &amp; Access Management (IAM) company. Thus, we believe that our results are a strong representation of the real-world effect of this detection method. Finally, we also show that this technique can be trivially extended to other types of injection attacks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><note place="foot">CAMLIS'23: Conference on Applied Machine Learning for Information Security, October 19-20, 2023, Arlington, VA. * Corresponding author. Email: beatrice.moissinac@okta.com (B. Moissinac); elie.saad@okta.com (E. Saad); miranda.clay@okta.com (M. Clay); maialen.berrondo@okta.com (M. Berrondo)</note><p>Injection attacks such as SQL injection attacks (SQLia) are commonly used against platforms to extract, delete, or otherwise corrupt valuable resources <ref type="bibr" target="#b0">[1]</ref>. The consequences of those attacks include financial, data, and reputational loss, or worse <ref type="bibr" target="#b1">[2]</ref>. Technically, SQLia are carried out by "injecting" (or inserting) SQL queries in the HyperText Transfer Protocol (HTTP) request data sent between the client and the server. Once the attacker has sent the request containing the nefarious SQL query, she expects the server to read the SQL query; a vulnerable system may then execute the query and return, delete, or alter sensitive data. Furthermore, the risk of SQLia has recently increased with the introduction of Large Language Models (e.g., ChatGPT) to the general public, lowering the barrier of entry for potential new threat actors <ref type="bibr" target="#b2">[3]</ref>.</p><p>A diverse landscape of analytical and statistical tools exists to detect SQLia. In Section 2.1, we review Threat Intelligence techniques and libraries. These techniques focus on creating rule sets, which are prone to false positives, and this restricts their usage in real-world settings. In Section 2.2, we review Machine Learning (ML) techniques to detect SQLia. These data-based approaches are usually developed using "synthetic data", that is, data generated by a Subject Matter Expert (SME) rather than collected from the real world.</p><p>On one hand, generating rule sets is time-consuming, prone to false positives, and potentially not exhaustive enough. On the other hand, Machine Learning techniques have been limited to synthetic data and weak statistical modeling. In this paper, we propose to address those issues with the following contributions:</p><p>1. Propose a novel method of feature engineering to generate SQL and HTTP language mixtures inspired by topic modeling <ref type="bibr" target="#b3">[4]</ref>;</p><p>2. Use these mixtures to significantly reduce the time and effort needed by Subject Matter Experts (SMEs) to label data;</p><p>3. Evaluate supervised Machine Learning models using this feature engineering method.</p><p>Furthermore, a major contribution of this paper is that our proposed solution is developed and evaluated using real-world HTTP request data sampled from authentication transactions served by a major IAM company. Thus, we believe that our results are representative of how the method would perform in the real world.</p><p>Finally, the novel feature engineering approach presented in this paper can be trivially extended to the parent attack class of injection <ref type="bibr" target="#b4">[5]</ref>. Thus, this model is more useful than existing techniques and covers a wider range of attack classes than what is available today.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work on Detection of SQLia</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Detection using Threat Intelligence Techniques</head><p>Released in 2012, the Libinjection project <ref type="bibr" target="#b5">[6]</ref> proposed a novel way to detect SQLia. Most SQLia detectors were based on rule sets and regular expressions, whereas Libinjection identifies attack vectors by digesting previously observed patterns and deriving an algorithm from them. Libinjection was published as a library to be integrated into application-layer defenses. It is commonly used by Open Source Web Application Firewalls (WAF), Intrusion Detection Systems (IDS), and Open Source Security software, such as ModSecurity, an Apache module <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>, which, in turn, is used by other tools <ref type="bibr" target="#b8">[9]</ref>. Libinjection has been extended to support a number of languages (i.e., C, Python, PHP, JavaScript, Go, Ruby, and Java).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Detection using Machine Learning</head><p>Many industry players have also focused on developing and improving signature-based models using ML, for instance, Fortinet <ref type="bibr" target="#b9">[10]</ref> or CloudFlare <ref type="bibr" target="#b10">[11]</ref>. Some other companies (such as F5 [12] and Imperva <ref type="bibr" target="#b11">[13]</ref>) implemented ML models to enable a wider set of signatures, and to suppress or trigger alerts based on the confidence of the ML model. However, their algorithms are not publicly available for comparison. On the other hand, academic research has published multiple ML-based approaches to SQLia detection <ref type="bibr" target="#b12">[14]</ref>, whose details are described below.</p><p>Feature Sets. ML models require a feature set, that is, a set of signals (e.g., presence/absence, counts, etc.) to be correlated with the desired output (i.e., is/isn't SQLia). ML-based SQLia detection relies heavily on SQL language markers. For instance, in <ref type="bibr" target="#b13">[15]</ref>, the authors used the presence of any comment character, the number of semicolons, the presence of a tautology (i.e., a statement that is always true, such as 1 = 1), the number of commands per statement, and the presence of abnormal commands or special keywords. Similarly, in <ref type="bibr" target="#b14">[16]</ref>, the features included single-line and multi-line comments, SQL operators, punctuation, logical operators, keywords, etc. Virtually all prior work relied on some variation of SQL language markers, and only on SQL language markers.</p><p>Algorithms. Many ML-based SQLia detection models have been developed in recent years, using Naive Bayes <ref type="bibr" target="#b14">[16,</ref><ref type="bibr" target="#b15">17,</ref><ref type="bibr" target="#b16">18,</ref><ref type="bibr" target="#b17">19]</ref>, SVM <ref type="bibr" target="#b16">[18,</ref><ref type="bibr" target="#b18">20,</ref><ref type="bibr" target="#b19">21]</ref>, or an Ensemble method <ref type="bibr" target="#b13">[15,</ref><ref type="bibr" target="#b16">18,</ref><ref type="bibr" target="#b17">19,</ref><ref type="bibr" target="#b19">21]</ref>. However, we note that Naive Bayes approaches may not be statistically robust. Naive Bayes assumes that features are independent, but programming language markers are not independent of each other. For example, 'SELECT' is strongly correlated with 'FROM' in SQL.</p><p>Data. Within the Threat Intelligence research on SQLia detection, models are developed from data collected from "red teams", that is, teams of security SMEs who generate injection attacks for the purpose of testing a platform's vulnerabilities. This type of data collection is omnipresent in ML research on SQLia as well <ref type="bibr" target="#b13">[15,</ref><ref type="bibr" target="#b14">16,</ref><ref type="bibr" target="#b16">18,</ref><ref type="bibr" target="#b17">19]</ref>. From an ML point of view, this type of data is called 'synthetic' and presents a major risk of not being representative of real-world data, as well as being too small. In <ref type="bibr" target="#b13">[15]</ref>, the authors trained their models on 105 SQL statements, and <ref type="bibr" target="#b14">[16]</ref> used 178 examples. In <ref type="bibr" target="#b16">[18]</ref>, the authors collected 4,000 rows of plain text sentences from HTML forms collected "from user input" via a "web application". Overall, data collection and labeling is the most expensive problem to solve in ML-based SQLia detection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data</head><p>HTTP URL Request. The data is a set of 1 million HTTP URL requests from identity-centric traffic. It was sampled from a large Customer Identity and Access Management (CIAM) platform between 2021 and 2023, at the network edge<ref type="foot" target="#foot_0">2</ref>. The volume of traffic from which this is sampled is substantial enough to be representative of US customer Internet traffic, and the platform may be considered a giant Honey Pot<ref type="foot" target="#foot_1">3</ref>.</p><p>In this paper, the proposed solution focuses only on the URL request data. We do not consider the IP or any other signals within the transaction, because we want to specifically evaluate the statistical robustness of language mixtures as signals for SQLia detection. Furthermore, methods such as this one are not meant to be a silver bullet, but could be integrated into a layered security architecture.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Building a Language Mixture</head><p>Intuition behind Language Mixtures. The idea described below is similar in spirit to the Latent Dirichlet Allocation (LDA) <ref type="bibr" target="#b3">[4]</ref> approach for topic modeling. The traditional LDA model estimates a mixture of topics per document (i.e., what the document is about), and a mixture of words per topic. LDA aims at answering the question: "What topics are present in this document, and how much are those topics discussed in this document?". Similarly, we want to calculate "how much" HTTP and "how much" SQL is in a URL request. In this sense, HTTP and SQL are the "topics" of the URL request (document). However, LDA is based on word count<ref type="foot" target="#foot_2">4</ref> and relies on repetition of the same words within a document to estimate the mixture of topics. Thus, LDA did not work well for this problem, because obfuscated SQLia have only very few (and unique) SQL markers in the URL request.</p><p>Instead, we propose a "language mixture", which is not affected by word count. For each URL request, we score "how much" SQL-like and "how much" HTTP-like the request is. We want to automatically estimate a dictionary of language markers for each language (SQL and HTTP). Each marker is associated with a weight based on how important (or common) the marker is to this language. For instance, 'SELECT' is very representative of SQL. Using real-world data is crucial to guarantee that the markers and their weights are representative of real-world usage.</p><p>Building a Language Mixture. To build a language mixture for SQL, we used 1 million SQL queries from open source SQL repositories on GitHub. For the HTTP language mixture, we used 1 million URL requests<ref type="foot" target="#foot_3">5</ref> from the same provider described in Section 3. We did not start with a known dictionary of SQL or HTTP operators, but rather extracted everything present in the data and sorted it into three categories of tokens, in order to be representative of the real world. For each data set separately, we extracted three types of markers:</p><p>• Keywords (any character chain of length 2 or more);</p><p>• Delimiters (parentheses, brackets, commas, etc.);</p><p>• Operators (+, −, *, etc.).</p><p>The weight of each token is the fraction of "documents" (URL requests or SQL queries) that contain that token at least once. For instance, the token 'FROM' has a weight of 0.47 because it was present in 47% of the SQL queries. We kept tokens with a weight greater than 0.10.</p><p>Estimating the Language Mixture. Consider the following SQL injection attack found in the data set:</p><p>/yyoa/ext/trafaxserver/downloadAtt.jsp?attach_ids=(1)%20and%201=2%20union%20select%201,2,3,4,5,md5(203735726),7--</p><p>From this string, we extracted the tokens listed in Table <ref type="table" target="#tab_0">1</ref>. Each mixture is the sum of the weights of the tokens present in the URL request. A weight is summed only once, even if the token appears multiple times. That is, the mixture score is not the sum of the weights multiplied by the number of occurrences of the token in the string. Weighting by occurrence count would make the method insensitive to short SQL queries obfuscated within a long HTTP string. We also do not normalize the score, because SQL queries and HTTP strings rarely contain all of their language's markers. Thus, the score is an absolute representation of the SQL-likeness or HTTP-likeness rather than a relative percentage of completeness.</p></div>
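<div xmlns="http://www.tei-c.org/ns/1.0"><p>The weighting and scoring steps described in this section can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the regular-expression tokenizer, the lower-casing, and the URL-decoding step are all assumptions made for the example. The two weight dictionaries are copied from Table 1 and cover only the tokens of that example.</p>

```python
import re
from collections import Counter
from urllib.parse import unquote

# Assumed tokenizer: keywords are runs of 2+ word characters; every other
# non-space, non-word character counts as a delimiter or operator.
TOKEN_PATTERN = re.compile(r"[a-z0-9_]{2,}|[^a-z0-9_\s]")


def tokenize(text):
    """Return the set of distinct markers in a string (decoded, lower-cased)."""
    return set(TOKEN_PATTERN.findall(unquote(text).lower()))


def build_weights(documents, min_weight=0.10):
    """Weight of a token = share of documents containing it at least once."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(tokenize(doc))
    total = len(documents)
    return {tok: n / total for tok, n in doc_freq.items() if n / total > min_weight}


def mixture_score(request, weights):
    """Sum the weight of each token present, once, whatever its repetitions."""
    return sum(weights.get(tok, 0.0) for tok in tokenize(request))


# Weights copied from Table 1 (only the tokens of the worked example).
SQL_WEIGHTS = {"and": 0.286, "select": 0.385, "-": 0.508, "=": 0.484,
               ".": 0.573, "/": 0.208, "(": 0.818, ",": 0.747, ")": 0.818}
HTTP_WEIGHTS = {"=": 0.732, ".": 0.727, "/": 0.999, "?": 0.738}

request = ("/yyoa/ext/trafaxserver/downloadAtt.jsp?attach_ids="
           "(1)%20and%201=2%20union%20select%201,2,3,4,5,md5(203735726),7--")
print(round(mixture_score(request, SQL_WEIGHTS), 3))   # 4.827, as in Table 1
print(round(mixture_score(request, HTTP_WEIGHTS), 3))  # 3.196, as in Table 1
```

<p>Because the score sums over the set of distinct tokens, repeating a marker does not change the result, which matches the choice of not weighting by occurrence count.</p></div>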
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">How to use language mixtures to detect SQLia</head><p>Conveniently, the example presented in the previous section has an SQL mixture greater than its HTTP mixture. Unfortunately, comparing the language mixtures is generally not sufficient to decide whether a URL request contains an SQLia. For instance, we found that highly obfuscated SQLia will have a low SQL mixture and a high HTTP mixture. Nevertheless, we can use the language mixtures to (1) label the data set more efficiently, and (2) build an ML model using the mixture tokens and weights as features to learn to classify within the non-linear space of SQL/HTTP mixtures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Language Mixture as Labeling Heuristic</head><p>We computed the SQL/HTTP mixtures for 1 million URL requests from identity-centric authentication transactions. The distribution of mixtures over the data set is not linear, as shown in Figure <ref type="figure" target="#fig_0">1</ref>. Labeling such a large dataset is not realistic, but the language mixtures can be used as a heuristic to efficiently select batches of URL requests "of interest". From an ML point of view, we want to label examples near the</p><p>/upload/mobile/index.php?c=category&amp;a=asynclist&amp;price_max=1.0%20AND%20(SELECT%201%20FROM(SELECT%20COUNT(*),CONCAT(0x7e,md5(1),0x7e,FLOOR(RAND(0)*2))x%20FROM%20INFORMATION_SCHEMA.CHARACTER_SETS%20GROUP%20BY%20x)a)''</p><p>(B) Lowest HTTP mixture: Labeling requests that are not "HTTP"-enough (and also not "SQL"-enough) revealed the ability of this technique to discover other types of command injection, such as this XSS injection<ref type="foot" target="#foot_4">6</ref>:</p><formula xml:id="formula_0">/?q=%27%3E%22%3Csvg%2Fonload=confirm%28%27testing-xss1%27%29%3E&amp;s=%27%3E%22%3Csvg%2Fonload=confirm%28%27testing-xss2%27%29%3E&amp;search=%27%3E%22%3Csvg%2Fonload=confirm%28%27testing-xss3%27%29%3E&amp;id=%27%3E%22%3Csvg%2Fonload=confirm%28%27testing-[...]</formula><p>(C) Random sample across the entire set: A sample of 1,000 instances across the entire set was labeled to explore other areas of the search space, and to increase the chances of labeling diverse types of SQLia (in terms of SQL/HTTP mixtures). Then, using the SQLia found in this batch, we selected more URL requests to be labeled by sampling URL requests whose mixture scores were within ±𝑥 of the scores of those SQLia examples, with 𝑥 varying from 0.05 to 1.</p><p>Inter-Rater Reliability. The data was labeled by threat intelligence and security engineer SMEs. We reached an inter-rater reliability rate of 94.9%, with 1,705 innocuous URL requests (labeled 'HTTP') and 114 SQLia (labeled 'SQL').</p></div>
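<div xmlns="http://www.tei-c.org/ns/1.0"><p>The neighborhood sampling used in step (C) can be illustrated with a short sketch. It assumes that "within ±𝑥" is applied to the SQL and HTTP scores independently (a box around each labeled SQLia); the text does not specify this, and a Euclidean radius would be an equally plausible reading. The function names and the example scores are hypothetical.</p>

```python
def within_radius(candidate, seed, radius):
    """True when both mixture scores lie within +/- radius of the seed's scores."""
    return all(radius >= abs(c - s) for c, s in zip(candidate, seed))


def select_neighbors(scored_requests, seeds, radius):
    """Return the requests whose (sql, http) scores fall near any labeled SQLia."""
    return [
        url
        for url, scores in scored_requests
        if any(within_radius(scores, seed, radius) for seed in seeds)
    ]


# Hypothetical (url, (sql_mixture, http_mixture)) pairs, for illustration only.
scored_requests = [
    ("/a", (4.80, 3.20)),
    ("/b", (1.00, 3.00)),
    ("/c", (4.50, 3.00)),
]
seeds = [(4.827, 3.196)]  # scores of one labeled SQLia (the Table 1 example)

for radius in (0.05, 0.5, 1.0):
    print(radius, select_neighbors(scored_requests, seeds, radius))
```

<p>Widening the radius step by step trades labeling effort for coverage: a small 𝑥 returns near-duplicates of known attacks, while a larger 𝑥 reaches requests with less similar mixtures.</p></div>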
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">ML-based SQLia Detection</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Experimental Setup</head><p>Features. In this paper, we evaluated a feature engineering method based on creating language mixture scores for each URL request. Thus, for each URL request, the feature vector has two parameters: the SQL mixture and the HTTP mixture.</p><p>Benchmark. To benchmark our proposed feature vector, we compare it with previously proposed feature vectors using word counts<ref type="foot" target="#foot_5">7</ref> and presence/absence flags<ref type="foot" target="#foot_6">8</ref> of SQL tokens (see Section 2.2). We used the 61 SQL tokens generated by the method presented in Section 4.</p><p>Algorithm. In order to fairly compare the efficacy of the feature vectors described above, we needed to use the same algorithm. Furthermore, the benchmark features have some statistical particularities that restrict which algorithm to use. The features are correlated with each other due to the nature of programming languages (e.g., 'SELECT' and 'FROM' in SQL are likely to go together in a query). Thus, we used Decision Trees<ref type="foot" target="#foot_7">9</ref>, an algorithm family which is not sensitive to correlation between features and can optimize a solution within a non-linear search space.</p><p>Training &amp; Testing Sets. We used the entire 1,819 labeled examples for training without hold-out, because the selection of training set examples was biased by our goal to find more SQLia examples to train the model. Thus, we did not test and validate on the training set. Instead, we randomly selected 638 examples from the remaining 1 million URL requests, and applied the fitted model to predict an 'http' or 'sql' label. In parallel, our security SMEs also labeled the testing set for ground truth. This way, the model is evaluated fairly, without biases that may have been introduced by the selection procedure of Section 5.</p><p>XSS and other types of injections. While labeling the training and testing sets, our SMEs found that a URL request whose mixtures are not 'HTTP-enough' and not 'SQL-enough' is likely to be some other sort of command injection (e.g., template, code, OS, XXE, etc.). While those other types of command injections were removed from the training set, we decided to include XSS examples in the testing set to highlight an avenue for future work: the expansion of language mixtures to other types of injections. From a production perspective, it is desirable to develop one ML model capable of detecting and classifying various types of injections. However, the more injection types are added, the more confusion is introduced. For example, in Table <ref type="table" target="#tab_4">5</ref>, SQL and HTTP have overlapping tokens, that is, they "share a feature". When using a presence/absence or word count type of featurization, overlapping features may create ambiguity that makes the problem more difficult for an ML model. Intuitively, a language mixture approach may help alleviate the overlapping of markers between languages, by biasing them with their weights (i.e., their importance within that language).</p><p>Future work will focus on 'unknown unknowns' and previously unidentified vulnerabilities. This will include research combining language mixtures with behavioral analysis of the response to the URL request. Correlating request, response, and their effect should help the model identify potential zero-days<ref type="foot" target="#foot_10">13</ref> before they are known. For instance, this may help organizations identify attacks such as data exfiltration, and reduce the false positive rate on benign requests.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Number of URL requests per SQL/HTTP mixture in the 1M row sample</figDesc><graphic coords="6,154.65,84.19,283.48,261.32" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Calculation of SQL and HTTP mixture scores for the example request /yyoa/ext/trafaxserver/downloadAtt.jsp?attach_ids=(1)%20and%201=2%20union%20select%201,2,3,4,5,md5(203735726),7--</figDesc><table><row><cell>Token</cell><cell>SQL Mixture</cell><cell>HTTP Mixture</cell></row><row><cell>and</cell><cell>0.286</cell><cell>0.000</cell></row><row><cell>select</cell><cell>0.385</cell><cell>0.000</cell></row><row><cell>-</cell><cell>0.508</cell><cell>0.000</cell></row><row><cell>=</cell><cell>0.484</cell><cell>0.732</cell></row><row><cell>.</cell><cell>0.573</cell><cell>0.727</cell></row><row><cell>/</cell><cell>0.208</cell><cell>0.999</cell></row><row><cell>(</cell><cell>0.818</cell><cell>0.000</cell></row><row><cell>,</cell><cell>0.747</cell><cell>0.000</cell></row><row><cell>)</cell><cell>0.818</cell><cell>0.000</cell></row><row><cell>?</cell><cell>0.000</cell><cell>0.738</cell></row><row><cell>Total</cell><cell>4.827</cell><cell>3.196</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Feature Vector (A) Confusion Matrix (Language Mixture Scores)</figDesc><table><row><cell>Prediction/True label</cell><cell>HTTP</cell><cell>SQL</cell><cell>XSS</cell><cell/><cell/></row><row><cell>HTTP</cell><cell>613</cell><cell>3</cell><cell>4</cell><cell>FPR</cell><cell>0.16%</cell></row><row><cell>SQL</cell><cell>1</cell><cell>14</cell><cell>3</cell><cell>FNR</cell><cell>17.65%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Feature Vector (B) Confusion Matrix (Word count of SQL tokens)</figDesc><table><row><cell>Prediction/True label</cell><cell>HTTP</cell><cell>SQL</cell><cell>XSS</cell><cell/><cell/></row><row><cell>HTTP</cell><cell>604</cell><cell>1</cell><cell>4</cell><cell>FPR</cell><cell>1.63%</cell></row><row><cell>SQL</cell><cell>10</cell><cell>16</cell><cell>3</cell><cell>FNR</cell><cell>5.88%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 4</head><label>4</label><figDesc>Feature Vector (C) Confusion Matrix (Presence/Absence of SQL tokens)</figDesc><table><row><cell>Prediction/True label</cell><cell>HTTP</cell><cell>SQL</cell><cell>XSS</cell><cell/><cell/></row><row><cell>HTTP</cell><cell>603</cell><cell>0</cell><cell>4</cell><cell>FPR</cell><cell>1.79%</cell></row><row><cell>SQL</cell><cell>11</cell><cell>17</cell><cell>3</cell><cell>FNR</cell><cell>0.00%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Overlapping SQL &amp; HTTP tokens and their language weights.</figDesc><table><row><cell>Token</cell><cell>SQL</cell><cell>HTTP</cell></row><row><cell>=</cell><cell>0.480</cell><cell>0.730</cell></row><row><cell>_</cell><cell>0.780</cell><cell>0.660</cell></row><row><cell>-</cell><cell>0.350</cell><cell>0.540</cell></row><row><cell>.</cell><cell>0.570</cell><cell>0.730</cell></row><row><cell>/</cell><cell>0.210</cell><cell>1.000</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">While the dataset may have been filtered before entering our line of vision, our methodology and results still represent a real-world use-case.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">In Threat Intelligence research, a Honey Pot is a system mimicking real-world vulnerabilities to attract the attacker and collect useful data about the attacker and the attacker pattern.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">"Word count" is a featurization in ML which counts the occurrences of a word in an instance.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">While it is possible that those URL requests contain injections and other "impurities", we assume that the low volume of those attacks on this type of traffic sufficiently guarantees that the HTTP tokens extracted are correct and representative.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">This example was truncated due to space limitation.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">A "word count" feature vector is a vector where each word is a feature, and the value of each feature is the number of times the word appeared in the instance (i.e., the URL request).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_6">A 'presence/absence flags' feature vector is a vector of booleans where each token is a feature, and the value of each feature is a flag set to 1 if the token is present in the instance (i.e., the URL request), and 0 otherwise.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7">We used the sklearn Python package, which implements the CART algorithm.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_8">The testing set also contains 7 XSS command injections (see our discussion below). Those XSS command injections are removed from the metrics calculation in order to strictly evaluate the model on SQLia detection.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_9">Extrapolating the FPR of 1.63% to our original 1M URL requests would cause 16,300 requests to fail or be delayed.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_10">A Zero-day is a vulnerability that is not yet known, and that can be exploited.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>We thank Kim Berry for initial discussion on this approach. We thank Mathew Woodyard and George Vauter for their initial work on labels.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Results &amp; Discussion</head><p>In this paper, we proposed a feature engineering method to calculate how "HTTP-like" and "SQL-like" a URL request is, in order to detect SQLia. These features are used as a heuristic to reduce the time and effort needed to create a data set to train an ML model for detecting SQLia. We implemented a supervised ML model using Decision Trees to compare this feature set with traditional feature sets (i.e., boolean flags and word counts). The notation for those feature vectors, used in the rest of this section, follows the captions of Tables 2, 3, and 4: (A) language mixture scores, (B) word counts of SQL tokens, and (C) presence/absence flags of SQL tokens.</p><p>Evaluation Metrics. From the ML point of view, much of the difficulty in evaluating ML methods for SQLia detection lies in the strong imbalance of the data set. The testing set contains 17 SQLia for 614 innocuous HTTP URL requests<ref type="foot" target="#foot_8">11</ref>; thus, accuracy is not a good measure, because even if we mislabeled all the SQL injections, we would still have 97% accuracy. Instead, we focused on False Positive Rate (FPR) and False Negative Rate (FNR). In the rest of this paper, we consider a 'Positive' to be an SQL injection and a 'Negative' to be an innocuous HTTP URL request, and the rest of this section refers to the results presented in Tables 2, 3, and 4.</p><p>We observed a trade-off between models (A), (B), and (C): model (A) is less likely than models (B) and (C) to falsely identify a legitimate HTTP URL request as an SQLia (FPR 0.16%), while model (C) is the best at identifying SQLia (FNR 0.0%). These results should be nuanced, however. The inspection of the HTTP URL request marked as SQLia by model (A) revealed that the request did contain an SQL command, which is expected by that customer's CIAM implementation. The SQLia sample is not large enough to extrapolate an updated FPR. Overall, model (A) had the fewest misclassifications.</p></div>
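<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a check on the arithmetic, the FPR and FNR values can be recomputed from the confusion matrices in Tables 2, 3, and 4. The sketch below uses the counts from those tables with the XSS column excluded, as stated in the footnote on the testing set; the function name and the dictionary layout are illustrative choices, not part of the original work.</p>

```python
def error_rates(true_neg, false_pos, false_neg, true_pos):
    """Return (FPR, FNR) as percentages, with SQLia as the positive class."""
    fpr = 100.0 * false_pos / (false_pos + true_neg)
    fnr = 100.0 * false_neg / (false_neg + true_pos)
    return fpr, fnr


# (true_neg, false_pos, false_neg, true_pos) read from Tables 2, 3, and 4.
CONFUSION = {
    "A (language mixture scores)": (613, 1, 3, 14),
    "B (word counts)": (604, 10, 1, 16),
    "C (presence/absence flags)": (603, 11, 0, 17),
}

for name, counts in CONFUSION.items():
    fpr, fnr = error_rates(*counts)
    print(f"{name}: FPR {fpr:.2f}%, FNR {fnr:.2f}%")

# Accuracy of a degenerate model that labels every request 'HTTP':
# all 614 innocuous requests are right, all 17 SQLia are wrong.
baseline_accuracy = 100.0 * 614 / (614 + 17)
print(f"All-HTTP baseline accuracy: {baseline_accuracy:.1f}%")
```

<p>The baseline line shows why accuracy is uninformative here: a model that never flags anything still scores about 97%.</p></div>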
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The Lesser of Two Evils</head><p>In a real-world production environment, the minimization of FPR vs FNR will depend on the use-case. On one hand, letting through SQL command injections may be dangerous, although we may also assume that this model would be part of a layered approach to security. On the other hand, the friction caused to legitimate users by a large quantity<ref type="foot" target="#foot_9">12</ref> of false positives might become very undesirable. A preference for each feature vector will depend on the use-case.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Real-World Cost</head><p>A great advantage of Decision Trees is the full explainability and coverage of the rule set generated (if the tree is allowed to go to its full depth). Hence, the model could be trained off-line (computation takes less than 1 second), for nearly free, and the rules generated by the Decision Tree could be added to the current rules of any system. Thus, we argue that the cost of this method is comparable to the cost of current rule-based methods.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title/>
		<author>
			<persName><surname>Cve</surname></persName>
		</author>
		<ptr target="http://cve.mitre.org/" />
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Hern</surname></persName>
		</author>
		<ptr target="https://www.theguardian.com/business/2016/oct/05/talktalk-hit-with-record-400k-fine-over-cyber-attack" />
		<title level="m">TalkTalk hit with record £400k fine over cyber-attack</title>
				<imprint>
			<publisher>The Guardian</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<ptr target="https://owasp.org/www-project-top-10-for-large-language-model-applications" />
		<title level="m">OWASP Top 10 for Large Language Model Applications</title>
				<imprint>
			<publisher>OWASP Foundation</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Latent Dirichlet allocation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">Shadowd</forename><surname>Zecure</surname></persName>
		</author>
		<ptr target="https://capec.mitre.org/data/definitions/248.html" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title/>
		<author>
			<persName><surname>Libinjection</surname></persName>
		</author>
		<ptr target="https://github.com/client9/libinjection/blob/master/README.md" />
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title/>
		<author>
			<persName><surname>Modsecurity</surname></persName>
		</author>
		<ptr target="https://coreruleset.org/faq" />
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title/>
		<author>
			<persName><surname>Libmodsecurity</surname></persName>
		</author>
		<ptr target="https://github.com/SpiderLabs/ModSecurity/blob/ec86b242e15f9df1d143c1b7f86a27889658b4cb/README.md" />
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title/>
		<author>
			<persName><surname>Naxsi</surname></persName>
		</author>
		<ptr target="https://github.com/nbs-system/naxsi/blob/master/README.md" />
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title/>
		<author>
			<persName><surname>Fortinet</surname></persName>
		</author>
		<ptr target="https://docs.fortinet.com/document/fortiweb/6.3.7/administration-guide/193258/machine-learning" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title/>
		<author>
			<persName><surname>Cloudflare</surname></persName>
		</author>
		<ptr target="https://blog.cloudflare.com/waf-ml/" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title/>
		<author>
			<persName><surname>Imperva</surname></persName>
		</author>
		<ptr target="https://www.imperva.com/products/attack-analytics/" />
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">SQLIA Detection and Prevention Techniques</title>
		<author>
			<persName><forename type="first">M</forename><surname>Al Rubaiei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Al Yarubi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Al Saadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kumar</surname></persName>
		</author>
		<idno type="DOI">10.1109/SMART50582.2020.9336795</idno>
	</analytic>
	<monogr>
		<title level="m">9th International Conference System Modeling and Advancement in Research Trends (SMART)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="115" to="121" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Detection of SQL injection attacks: A machine learning approach</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Balbahaith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tarique</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2019 International Conference on Electrical and Computing Technologies and Applications (ICECTA)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">SQL injection attack detection and prevention techniques using machine learning</title>
		<author>
			<persName><forename type="first">I</forename><surname>Jemal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Cheikhrouhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hamam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mahfoudhi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Applied Engineering Research</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="569" to="580" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Improving Web Application Firewalls to detect advanced SQL injection attacks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Makiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Begriche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Serhrouchni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2014 10th International Conference on Information Assurance and Security</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="35" to="40" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">SQL injection detection using machine learning</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mishra</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Multi-source data analysis and evaluation of machine learning techniques for SQL injection detection</title>
		<author>
			<persName><forename type="first">K</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Moh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-S</forename><surname>Moh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACMSE 2018 Conference</title>
				<meeting>the ACMSE 2018 Conference</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Applied machine learning predictive analytics to SQL injection attack detection and prevention</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">O</forename><surname>Uwagbole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">J</forename><surname>Buchanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IFIP/IEEE Symposium on Integrated Network and Service Management (IM)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1087" to="1090" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">SQL Injection Detection Using Machine Learning Techniques and Multiple Data Sources</title>
		<author>
			<persName><forename type="first">K</forename><surname>Ross</surname></persName>
		</author>
		<idno type="DOI">10.31979/etd.zknb-4z36</idno>
		<ptr target="https://scholarworks.sjsu.edu/etd_projects/650" />
		<imprint>
			<date type="published" when="2018">2018</date>
			<pubPlace>San Jose, CA, USA</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Master of Science, San Jose State University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
