<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits!</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tirth Patel</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fred Lu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward Raff</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles Nicholas</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cynthia Matuszek</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Holt</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Booz Allen Hamilton</institution>
          ,
          <addr-line>8283 Greensboro Drive, McLean, VA 22102</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratory for Physical Sciences</institution>
          ,
          <addr-line>5520 Research Park Drive, Catonsville, MD 21228</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Maryland</institution>
          ,
          <addr-line>Baltimore County, 1000 Hilltop Cir, Baltimore, MD 21250</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>1118</fpage>
      <lpage>1129</lpage>
      <abstract>
        <p>Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines, meaning a 0.1% change can cause an overwhelming number of false positives. However, academic research is often restricted to public datasets on the order of ten thousand samples, which are too small to detect improvements that may be relevant to industry. Working within these constraints, we devise an approach to generate a benchmark of configurable difficulty from a pool of available samples. This is done by leveraging malware family information from tools like AVClass to construct training/test splits that have different generalization rates, as measured by a secondary model. Our experiments demonstrate that using a less accurate secondary model with disparate features is effective at producing benchmarks for a more sophisticated target model that is under evaluation. We also ablate against alternative designs to show the need for our approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Malware detection, determining if a given file is benign or malicious, is an important safety
problem, since malware causes billions in financial damage each year [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, it is not
easy for academic researchers to know that they have produced an improvement using freely
available data. This is because industry uses tens of millions of executables at tens of terabytes in
scale to detect meaningful improvements in accuracy [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ]. In contrast, academic datasets
with raw executables available are measured in tens of thousands of samples [6, 7, 8]. This
small scale has made it easy for academic work to over-fit to the data [9, 10, 11], and best
practices like a train and test set split by time (by when the executable was created) are not
possible due to lack of information [11].
      </p>
      <p>The goal of this work is to provide academic researchers with a means of constructing new train/test splits, using publicly available information for Microsoft Windows malware, that can increase the predictive difficulty of the task by removing common biases that lead to overfitting. The crux of our method is that malware can be grouped into families of related samples [12], and an ideal malware detector is one that can detect new families that were not seen during training. This insight gives us an objective way to group samples into train/test splits that do not cause significant information leakage by having the same malware families in both training and testing, as some prior academic works do [11]. By searching for malware families of the right difficulty to place in each train and test split, we can produce new benchmark splits for researchers to use that are smaller than the source datasets, but avoid the bias problems mentioned above.</p>
      <p>The rest of our paper is organized as follows. In section 2 we discuss the important related work, including prior issues in malware detection research and the work in reproducibility and model selection that can be better leveraged by our benchmarks. Then in section 3 we describe how we use a simpler base model with a search procedure to construct these benchmark datasets. The goal is that our splits will have a lower baseline accuracy for existing methods, showing that we can produce a harder dataset, which in turn makes it easier to detect improvements in generalization and thus effect size. We demonstrate this for three difficulty levels (Easy, Medium, and Hard) in section 4, and show that two intuitive ablation strategies are ineffective in subsection 4.1. Finally, our article concludes in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Malware detection research using machine learning has been active since 1995 [13], and includes approaches based on raw bytes [14], API calls and assembly [15, 16], graphs [17], and exogenous metadata [18]. However, much industrial research has indicated that academic methods do not often transfer well to industrial data, and so increasingly industry is trying to release more representative datasets [
        <xref ref-type="bibr" rid="ref2">19, 2</xref>
        ]. Such efforts are commendable, but these datasets often still require a VirusTotal license (which costs $400,000/year) to get the original files, and they can be prohibitively large. The SOREL-20M corpus has over 20 million files in a train/validation/test split to detect small improvements that matter in real-world use. Our work is the first attempt (that we know of) to develop methods to decrease the amount of data necessary to detect an improvement, rather than simply add more data.
      </p>
      <p>With respect to the issue of detecting improvements in our models, much of the machine
learning literature has tackled this problem. Early works explained that ordinary t-tests and
other statistical methods are not reliable for machine learning cases for a variety of technical
reasons [20]. More recent works have consistently found that a non-parametric Wilcoxon
test is a reliable way to detect which algorithm performs best, if multiple trials (i.e., datasets)
are available [21, 22, 23]. Other approaches to testing over the space of hyper-parameter
values have also been proposed to better measure the improvement achieved, if any, by a new
algorithm [24, 25]. The goal of our work is to provide a better foundation for using these prior
model selection strategies, as simple cross-validation over an existing biased academic dataset
is unlikely to produce a robust conclusion [26, 27].</p>
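      <p>For example, a minimal sketch (not from this paper) of such a paired comparison over matched splits, using the non-parametric Wilcoxon signed-rank test from SciPy and made-up accuracy values, could look like the following:</p>
      <preformat>
# Hypothetical per-split accuracy scores for a baseline and a proposed
# algorithm, evaluated on the same matched train/test splits.
from scipy.stats import wilcoxon

acc_baseline = [0.72, 0.75, 0.71, 0.74, 0.73, 0.76, 0.70, 0.74, 0.72, 0.75]
acc_proposed = [0.74, 0.76, 0.73, 0.75, 0.74, 0.78, 0.71, 0.76, 0.73, 0.77]

# Paired, non-parametric test over the matched splits; a small p-value
# suggests a consistent difference between the two algorithms.
stat, p_value = wilcoxon(acc_baseline, acc_proposed)
print(f"Wilcoxon statistic={stat:.3f}, p-value={p_value:.4f}")
      </preformat>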
      <sec id="sec-2-1">
        <title>2.1. Dataset</title>
        <p>
          To perform our study, it was critical that we had a representative population of benign samples, as crawling publicly available sources has been demonstrated to produce models with insufficient diversity, which do not generalize to new malware [
          <xref ref-type="bibr" rid="ref2">9, 10, 2, 11</xref>
          ]. Because our interest is in producing train/test splits that are also of a reasonable size, so that academics can use them, we use the EMBER 2018 dataset [19], which contains 300,000 training and 100,000 testing benign files. The EMBER dataset also includes malicious files, but they are not evenly distributed by malware family or type, which is problematic for our dataset construction approach.
        </p>
        <p>For this reason, we use the VirusShare corpus [28] as a source of freely available malware.
Malware family labels can also be obtained freely via the AVClass [29, 30] tool combined with
the VirusTotal reports of [31]. Using these sources we are able to get hundreds of malware
families with thousands of samples each. Following [32] we use the same top 184 most frequent
malware families with 10,000 samples each, 8,000 for training and 2,000 for testing.</p>
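        <p>A minimal sketch of this per-family 8,000/2,000 sampling, assuming a hypothetical pandas DataFrame with one row per sample and an AVClass-derived family column, might be:</p>
        <preformat>
import pandas as pd

def sample_family_splits(df: pd.DataFrame, n_train=8000, n_test=2000, seed=0):
    """df is a hypothetical table with one row per malware sample and a
    'family' column derived from AVClass labels. For each family, draw
    n_train + n_test samples and split them into train/test portions."""
    train_parts, test_parts = [], []
    for family, group in df.groupby("family"):
        picked = group.sample(n=n_train + n_test, random_state=seed)
        train_parts.append(picked.iloc[:n_train])
        test_parts.append(picked.iloc[n_train:])
    return pd.concat(train_parts), pd.concat(test_parts)
        </preformat>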
        <p>[Figure 1. X-axis: Malware Family used for Testing. While the average recall rate is reasonably large, the recall per-family has an extremely high variance, which makes it challenging to determine if a performance difference comes from luck or true effect.]</p>
      </sec>
    </sec>
    <sec id="sec-approach">
      <title>3. Approach</title>
      <p>To make our benchmarks of configurable difficulty, we will start with the all-pairs cross-errors shown in Figure 1. Each row corresponds to selecting one of the 184 malware families to be the only family used during training. The resulting classifier is then tested on itself and all 183 of the other malware families, with the recall score in the corresponding columns. (The main diagonal shows the recall we get when testing on the same malware as was used for training.) This gives us information on how useful each malware family is, on its own, in predicting all other malware families. In every case a random set of benign files is down-sampled to the same number of malicious files. That is, 8,000 malicious and 8,000 benign files are used to train a model for each row.</p>
      <p>The goal will be to generate train/test splits in the three categories mentioned earlier, namely Easy, Medium, and Hard. We will start by generating ten different train/test splits in each of the categories. Note here that each train/test split must have no overlap of malware families between the train and test splits, but different train/test splits might share some families. That is to say, the family “cycbot” may occur in training splits 1, 3, and 4, but that means the “cycbot” family cannot occur in test splits 1, 3, and 4. In this way every individual split is a meaningful test of generalization to new malware families, the ultimate goal of any malware detector. Each training split is trained on independently (not cross-validated), and so overlap between splits will not impact the results. So we will have 30 distinct train/test splits in total, ten each for the Easy, Medium, and Hard categories.</p>
      <p>The algorithm we use to create these train/test splits is shown in Algorithm 1. The strategy is to apply a random search to obtain a set of training families F_train and a set of testing families F_test, which satisfy the constraint that none of the training families perform much better or worse than a target recall threshold τ on any of the testing families. That is, |R[f_tr, f_te] − τ| &lt; ε for all f_tr ∈ F_train and f_te ∈ F_test, where R is the all-pairs recall matrix of Figure 1. This is done by first finding the elements in the matrix which are ε-close to τ. The candidate pairs (f_tr, f_te) of training and testing families corresponding to those elements are then randomly sampled.</p>
      <p>[Algorithm 1: given the recall matrix R from MalConv, a target recall τ, and a tolerance ε, repeatedly sample candidate pairs (f_tr, f_te) that are ε-close to τ; discard a pair if |R[f_tr, f_te] − τ| &gt; ε against any already-selected testing family, or against any already-selected training family; otherwise add f_tr and f_te to the training and testing sets; if the iteration budget is exhausted, set ε = ε + 0.05 and return to step 2; finally return the training and testing family sets.]</p>
      <p>At each iteration, while the sampled pair satisfies the performance constraint by design, it must also satisfy the constraint pairwise among all the families already selected in the training and testing sets. If this condition holds and neither member of the pair is already selected, then the pair is added to the growing training and testing sets. This procedure runs until 10 distinct families have been chosen for both training and testing. If the algorithm is unable to converge for ε, then ε is loosened (increased) and we try again. For efficiency, when this happens we do not discard the progress we have already made, and in line 2 we use the previous ε as a lower bound and the new ε as an upper bound for identifying new candidate pairs. For the Easy, Medium, and Hard splits we use τ = 0.9, 0.5, and 0.25 respectively. We set the number of iterations N = 1000 and ε = 0.05 throughout.</p>
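      <p>As a rough illustration only, a Python sketch of this ε-tolerance random search over the recall matrix is shown below; the exact bookkeeping of Algorithm 1 (such as reusing the previous ε as a lower bound when the tolerance is loosened) is simplified, and the function and variable names are our own. Setting tau to 0.9, 0.5, or 0.25 corresponds to the Easy, Medium, and Hard settings described above.</p>
      <preformat>
import numpy as np

def search_split(R, families, tau, eps=0.05, n_iter=1000, n_families=10, seed=0):
    """Randomly grow disjoint train/test family sets so that every selected
    (train, test) pair has cross-recall within eps of the target tau."""
    rng = np.random.default_rng(seed)
    idx_of = {f: i for i, f in enumerate(families)}
    train, test = [], []
    while len(train) &lt; n_families:
        # Candidate (train, test) pairs whose cross-recall is eps-close to tau.
        rows, cols = np.where(np.abs(R - tau) &lt; eps)
        order = rng.permutation(len(rows))
        progressed = False
        for k in order[:n_iter]:
            f_tr, f_te = families[rows[k]], families[cols[k]]
            if f_tr == f_te or f_tr in train or f_tr in test \
                    or f_te in train or f_te in test:
                continue
            # The new pair must also stay eps-close to tau against every
            # family already selected, in both directions.
            tr = [idx_of[f] for f in train + [f_tr]]
            te = [idx_of[f] for f in test + [f_te]]
            if np.all(np.abs(R[np.ix_(tr, te)] - tau) &lt; eps):
                train.append(f_tr)
                test.append(f_te)
                progressed = True
                break
        if not progressed:
            eps += 0.05  # loosen the tolerance but keep the progress so far
    return train, test
      </preformat>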
    </sec>
    <sec id="sec-3">
      <title>4. Results</title>
      <p>In our tests, we use four models for evaluation. First is a byte 6-gram model that has been popular in academic malware detection research for several years [14, 9, 33]. Second, we use MalConv [34] and its extended approach MalConvGCT [35]. Finally, we use the EMBER feature vectors [19] with the XGBoost algorithm [36] as the standard domain-knowledge approach, which we will refer to as just “XGBoost” for brevity. In our experiments, each train/test split we produce has 160,000 training and 40,000 testing samples. However, because within a given train/test split no family is used for both training and testing, we note that if memory is a constraint, the experiments can be performed using just the training or testing sets alone. We remind the reader that randomly sampling a split of this size will produce a model with ≥ 90% accuracy in all cases.</p>
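      <p>As an illustration of the kind of per-split baseline evaluation we run, a sketch using EMBER-style feature vectors with XGBoost is shown below; the hyper-parameters are placeholders rather than the settings used in our experiments:</p>
      <preformat>
import xgboost as xgb
from sklearn.metrics import accuracy_score

def evaluate_split(X_train, y_train, X_test, y_test):
    """X_* are EMBER-style feature matrices and y_* are 0/1 labels for one
    generated train/test split; hyper-parameters below are placeholders."""
    model = xgb.XGBClassifier(n_estimators=400, max_depth=6, n_jobs=-1)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))
      </preformat>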
      <p>
        Having defined our approach to generating harder train/test splits, we begin with the primary results as shown in Table 1. As can be seen, we are able to successfully produce datasets that are more challenging than the original dataset. With a lower baseline level of accuracy, it becomes possible to measure effect sizes with a moderate number of samples, and to avoid the over 30 million files that are needed to reliably detect improvements of XGBoost-like models on regular malware data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Recall that the same splits are being used for all four algorithms. This shows an unusual, but important, kind of generalization. Even though MalConv is less accurate in normal use than MalConvGCT, and significantly less accurate than domain-knowledge-wielding XGBoost, the Hard split is able to reduce XGBoost down to just 72.80% accuracy. This shows that our benchmark search is 1) finding correlations of intrinsic difficulty, and 2) allowing us to avoid overly biasing a test-set against a specific approach. That is to say, if we used an XGBoost model to produce the splits to evaluate an improved XGBoost, we may unfairly over-compensate by having produced a dataset split that is too difficult.</p>
      <p>To show the consistency of our results in producing train/test splits of comparable difficulty, we show the result for multiple splits as a function of how many epochs MalConvGCT has trained for in Figure 3. Here it is clear that each of the Easy, Medium, and Hard difficulty levels exhibits a high degree of similarity in its difficulty. This is important to avoid a naive solution where the target difficulty is obtained by averaging splits that are too hard against others that are too easy. Such an undesirable scenario would make performing multiple trials to use statistical tests difficult, as the overly easy and hard splits would degrade to adding noisy samples to the test (because each model easily gets all the easy splits correct and misclassifies all the hard splits, making the differences between two models indistinguishable) and reduce the total power of the test to conclude if one method was really better than another [21].</p>
      <sec id="sec-3-1">
        <title>4.1. Ablation</title>
        <p>Having established the efficacy of our approach to producing datasets of the desired difficulty level, we will now demonstrate two alternative but intuitive strategies that do not meet our needs. In the case of a desired “Easy” benchmark, one may naively select the top-k “best” families from Figure 1, which have the highest average recall against other malware families. Second, one may similarly decide that a “Hard” dataset should be produced by selecting the families with the lowest average recall.</p>
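        <p>For concreteness, a sketch of these two naive selection rules, reusing the recall matrix R and family list from the earlier sketch, might look like the following (this is illustrative only, not the code used for the ablation):</p>
        <preformat>
import numpy as np

def naive_topk_families(R, families, k=5, hardest=False):
    """Pick the k families with the highest (or lowest, if hardest=True)
    average recall against all other families in the cross-recall matrix R."""
    mask = ~np.eye(len(families), dtype=bool)      # ignore the diagonal
    avg_recall = np.array([row[m].mean() for row, m in zip(R, mask)])
    order = np.argsort(avg_recall)                 # ascending: hardest first
    chosen = order[:k] if hardest else order[-k:]
    return [families[i] for i in chosen]
        </preformat>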
        <p>For the “Easy” case of selecting the top-k, we first show as an example the results of this strategy when picking the top-5 most frequent families in Figure 2. Though this produces a recall of 70%, the variance of the results is extremely high. This huge variance is undesirable for the same reason as our results from Figure 3. We want reasonably similar performance characteristics for each split to maximize the power of subsequent conclusions about improvement. Each overly easy or hard split is one that does not provide meaningful information to the question of whether a new algorithm would perform better.</p>
        <p>One may wonder instead if the issue would improve by selecting more families. This is unfortunately not the case, and there is relatively little variation as the top-k is altered from k = 5 to k = 35, as shown in Figure 4 (we note that all values of k look qualitatively similar to Figure 2 as well).</p>
        <p>A different kind of issue occurs when selecting the worst-k malware families to produce a “Hard” dataset. k = 10 is shown as an example in Figure 5, where the 10 chosen families each have 100% recall, and the model does not meaningfully learn to detect any of the remaining malware families. In this case, the hardest families are so distinct on their own that the model easily learns to overfit to the specific malware families, and the default for any other input becomes “benign”. This is similar to the overly-strong data leakage signal discussed by [9] when building a benign dataset from scraping a clean install of Microsoft Word. We again note that using multiple values of k all result in qualitatively the same results for the worst-k strategy.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>We have now shown it is possible to use malware family information to construct better train/test splits for benchmarking purposes, where the difficulty of the split is configurable. This was demonstrated with Easy, Medium, and Hard splits, and in all cases a weaker model is able to produce splits that are effective against a more powerful model. This is a necessary condition of utility, as the purpose of the splits is to test a hopefully more powerful alternative model. We further validate our approach by ablating against simpler design alternatives, which do not produce benchmarks of usable quality.</p>
      <p>[6] E. Raff, C. Nicholas, A Survey of Machine Learning Methods and Challenges for Windows Malware Classification, in: NeurIPS 2020 Workshop: ML Retrospectives, Surveys &amp; Meta-Analyses (ML-RSA), 2020. URL: http://arxiv.org/abs/2006.09271, arXiv: 2006.09271.</p>
      <p>[7] M. Eskandari, S. Hashemi, A graph mining approach for detecting unknown malwares, Journal of Visual Languages &amp; Computing 23 (2012) 154–162. doi:10.1016/j.jvlc.2012.02.002.</p>
      <p>[8] R. Perdisci, A. Lanzi, W. Lee, McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables, in: 2008 Annual Computer Security Applications Conference (ACSAC), IEEE, 2008, pp. 301–310. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4721567. doi:10.1109/ACSAC.2008.22.</p>
      <p>[9] E. Raff, R. Zak, R. Cox, J. Sylvester, P. Yacci, R. Ward, A. Tracy, M. McLean, C. Nicholas, An investigation of byte n-gram features for malware classification, Journal of Computer Virology and Hacking Techniques (2016). URL: http://link.springer.com/10.1007/s11416-016-0283-1. doi:10.1007/s11416-016-0283-1.</p>
      <p>[10] J. Seymour, How to build a malware classifier [that doesn’t suck on real-world data], in: SecTor, Toronto, Ontario, 2016. URL: https://sector.ca/sessions/how-to-build-a-malware-classifier-that-doesnt-suck-on-real-world-data/.</p>
      <p>[11] F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, L. Cavallaro, TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time, in: 28th USENIX Security Symposium (USENIX Security 19), USENIX Association, Santa Clara, CA, 2019, pp. 729–746. URL: https://www.usenix.org/conference/usenixsecurity19/presentation/pendlebury.</p>
      <p>[12] R. J. Joyce, D. Amlani, C. Nicholas, E. Raff, MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels, in: The AAAI-22 Workshop on Artificial Intelligence for Cyber Security (AICS), 2022. URL: https://github.com/boozallen/MOTIF. doi:10.48550/arXiv.2111.15031, arXiv: 2111.15031v1.</p>
      <p>[13] J. O. Kephart, G. B. Sorkin, W. C. Arnold, D. M. Chess, G. J. Tesauro, S. R. White, Biologically Inspired Defenses Against Computer Viruses, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1995, pp. 985–996. URL: http://dl.acm.org/citation.cfm?id=1625855.1625983, series Title: IJCAI’95.</p>
      <p>[14] J. Z. Kolter, M. A. Maloof, Learning to Detect and Classify Malicious Executables in the Wild, Journal of Machine Learning Research 7 (2006) 2721–2744. URL: http://dl.acm.org/citation.cfm?id=1248547.1248646, publisher: JMLR.org.</p>
      <p>[15] M. K. Shankarapani, S. Ramamoorthy, R. S. Movva, S. Mukkamala, Malware Detection Using Assembly and API Call Sequences, J. Comput. Virol. 7 (2011) 107–119. URL: http://dx.doi.org/10.1007/s11416-010-0141-5. doi:10.1007/s11416-010-0141-5, publisher: Springer-Verlag New York, Inc. Place: Secaucus, NJ, USA.</p>
      <p>[16] R. Zak, E. Raff, C. Nicholas, What can N-grams learn for malware detection?, in: 2017 12th International Conference on Malicious and Unwanted Software (MALWARE), IEEE, 2017, pp. 109–118. URL: http://ieeexplore.ieee.org/document/8323963/. doi:10.1109/MALWARE.2017.8323963.</p>
      <p>[17] B. J. Kwon, J. Mondal, J. Jang, L. Bilge, T. Dumitraș, The Dropper Effect: Insights into Malware Distribution with Downloader Graph Analytics, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS ’15), 2015.</p>
      <p>[29] M. Sebastián, R. Rivera, P. Kotzias, J. Caballero, AVClass: A Tool for Massive Malware Labeling, in: Research in Attacks, Intrusions, and Defenses (RAID), Springer, 2016. doi:10.1007/978-3-319-45719-2_11.</p>
      <p>[30] S. Sebastián, J. Caballero, AVClass2: Massive Malware Tag Extraction from AV Labels, in: ACSAC, 2020. URL: http://arxiv.org/abs/2006.10615, arXiv: 2006.10615.</p>
      <p>[31] J. Seymour, C. Nicholas, Labeling the VirusShare Corpus: Lessons Learned, in: BSidesLV, Las Vegas, NV, 2016.</p>
      <p>[32] E. Raff, R. Zak, G. L. Munoz, W. Fleming, H. S. Anderson, B. Filar, C. Nicholas, J. Holt, Automatic Yara Rule Generation Using Biclustering, in: 13th ACM Workshop on Artificial Intelligence and Security (AISec’20), 2020. URL: http://arxiv.org/abs/2009.03779. doi:10.1145/3411508.3421372, arXiv: 2009.03779.</p>
      <p>[33] E. Raff, W. Fleming, R. Zak, H. Anderson, B. Finlayson, C. K. Nicholas, M. McLean, KiloGrams: Very Large N-Grams for Malware Classification, in: Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS’19), 2019. URL: https://arxiv.org/abs/1908.00200.</p>
      <p>[34] E. Raff, J. Barker, J. Sylvester, R. Brandon, B. Catanzaro, C. Nicholas, Malware Detection by Eating a Whole EXE, in: AAAI Workshop on Artificial Intelligence for Cyber Security, 2018. URL: http://arxiv.org/abs/1710.09435, arXiv: 1710.09435.</p>
      <p>[35] E. Raff, W. Fleshman, R. Zak, H. S. Anderson, B. Filar, M. McLean, Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection, in: The Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021. URL: http://arxiv.org/abs/2012.09390, arXiv: 2012.09390.</p>
      <p>[36] T. Chen, C. Guestrin, XGBoost: Reliable Large-scale Tree Boosting System, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. arXiv: 1603.02754.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Gantz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Florean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sikdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Madhaven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K. S.</given-names>
            <surname>Lakshmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nagappan</surname>
          </string-name>
          ,
          <source>The Link between Pirated Software and Cybersecurity Breaches How Malware in Pirated Software Is Costing the World Billions, Technical Report</source>
          , IDC,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Harang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Rudd</surname>
          </string-name>
          ,
          <article-title>SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection</article-title>
          , arXiv (
          <year>2020</year>
          ). URL: http://arxiv.org/abs/2012.07634, arXiv: 2012.07634.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Raff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nicholas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Holt</surname>
          </string-name>
          ,
          <article-title>Leveraging Uncertainty for Improved Static Malware Detection Under Extreme False Positive Constraints</article-title>
          ,
          <source>in: IJCAI-21 1st International Workshop on Adaptive Cyber Defense</source>
          ,
          <year>2021</year>
          . URL: http://arxiv.org/abs/2108.04081, arXiv: 2108.04081.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Soska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Roundy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Christin</surname>
          </string-name>
          ,
          <source>Automatic Application Identification from Billions of Files, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>2021</fpage>
          -
          <lpage>2030</lpage>
          . URL: https://doi.org/10.1145/3097983.3098196. doi:10.1145/3097983.3098196, series Title: KDD '17.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Dahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Stokes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Large-scale malware classification using random projections and neural networks</article-title>
          ,
          <source>in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          , IEEE,
          <year>2013</year>
          , pp.
          <fpage>3422</fpage>
          -
          <lpage>3426</lpage>
          . URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6638293. doi:10.1109/ICASSP.2013.6638293.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>