<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Malware Text Collection and Mining Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giambattista Amati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Angelini</string-name>
          <email>sangelini@fub.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Caterina Carli</string-name>
          <email>annacaterina.carli@mise.gov.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Majorani</string-name>
          <email>cmajorani@fub.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Riccardi</string-name>
          <email>ariccardi@fub.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Ugo Bordoni, Rome, Italy; Istituto Superiore delle Comunicazioni e delle Tecnologie dell'Informazione (ISCOM)</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We have released a malware collection in TREC style. It contains scripts, HTML documents and text files extracted from the binary files of about 650K malware samples. The objective of the project is to index the collection, extract significant features and classify the samples into malware families. To this end, we will also release a TREC-style set of queries for classification tasks. In this abstract, we briefly describe the test collection, the project aims and the problems underlying the application of text mining and information retrieval techniques to malware classification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Malware analysis is a growing research area that still has many open
problems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, signatures for anti-virus toolkits are
created manually using malware-analysis techniques and tools that
analyze programs either by executing them (dynamic analysis) or by
inspecting them (static analysis). Static analysis can extract information
from the binary representation of the program. Data mining techniques
for detecting malware were first introduced by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on three different static
features: Portable Executable (PE), strings and byte sequences.
Interpretable text is a high-level specification of malicious behavior; for
example, &lt;html&gt;&lt;script language='javascript'&gt; window.open('readme.eml')
always occurs in worms of type Nimda [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Text mining classification can
be useful, but it can also be prohibitive because of the tokenization process,
which may either produce a very high dimensionality of features or lose
relevant information when a standard text IR tokenization is used.
Nevertheless, Big Data technologies and massive clustering techniques are
now available, so the release of a TREC-style collection, which is still
missing, will help the IR and the cyber security communities to
explore in depth to what extent Information Retrieval and Text Mining
classification can be effective and useful for malware detection. Our text collection
contains about 650K documents with the text extracted from malware
and will be extended with a malware-free collection of similar size.
      </p>
      <table-wrap id="tab-1">
        <table>
          <thead>
            <tr>
              <th></th>
              <th>Nr Docs</th>
              <th>#Tokens</th>
              <th>Nr Occurrences</th>
              <th>Index Dimensions</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>MW-TaggedText</td>
              <td>655,361</td>
              <td>153,587,253</td>
              <td>4,222,109,462</td>
              <td>21GB</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-2">
      <title>The malware collection</title>
      <p>The malware collection was obtained from the VirusShare.com project.
VirusShare was launched in 2011 with the aim of collecting, indexing and
freely sharing malware samples with analysts, researchers and the
computer security community. At the moment the site provides about 30
million malware samples. We downloaded a portion containing 655,361 of
the most recent malware files (i.e. those collected by VirusShare in the last
6 months). Initially the collection amounted to about 286 GB of compressed data (11
zip archives). We extracted the text part and formed a collection of
about 66 GB of uncompressed data, or equivalently about 30 GB of
compressed data, from which we obtained 21 GB of indexes. The text part of the
whole collection should therefore amount to approximately 14 TB of
compressed data, yielding about 9 TB of Terrier indexes.</p>
      <p>The malware collection underwent the following operations.</p>
      <p>Text extraction. The text part was first extracted with the Unix
strings utility. From 286 GB of compressed data, 30 GB of compressed
data were obtained.</p>
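      <p>As an illustration of this step, the following Python sketch approximates what the strings utility does: it returns the runs of printable ASCII characters found in binary data. The function name extract_strings, the minimum run length of 4 and the sample bytes are assumptions made for this example; the options actually used to build the collection are not stated here.</p>
      <preformat><![CDATA[
```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII bytes (tab, space to tilde)."""
    # min_len = 4 is an assumption for this sketch, not a documented project setting.
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Hypothetical bytes standing in for a malware binary.
sample = (
    b"\x00\x01MZ\x90\x00"
    b"This program cannot be run in DOS mode.\x00"
    b"\xff\xfehttp://example.invalid/a.exe\x00ab\x00"
)

for text in extract_strings(sample):
    print(text)
```
]]></preformat>
      <p>Runs shorter than the minimum length, such as the two-byte header in the sample, are discarded, which keeps most binary noise out of the extracted text.</p>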
      <p>Tagging. The collection was then labeled by introducing the following
new tags: DOCNO, DOC_TYPE, SCRIPT, TYPE_SCRIPT, CDATA,
DOMAIN, SOURCE, RUN_MODE, RUN_MODE_NOT. The labeling
module was implemented as a set of regex-style syntactic rules.
We extract all domains and URLs in order to index them separately, transforming
strings such as http://xxx.yy.201.53/guodanpi/dhnchia.exe into
&lt;DOMAIN&gt;xxx.yy.201.53&lt;/DOMAIN&gt; and
&lt;SOURCE&gt;xxx.yy.201.53/guodanpi/dhnchia.exe&lt;/SOURCE&gt;.
The new tags contain the following information:</p>
      <list list-type="bullet">
        <list-item>
          <p>DOC, the initial malware tag; DOCNO, the malware file identifier,
which contains the MD5 hash value of the file; and DOC_TYPE, a tag stating
whether the document is HTML or not.</p>
        </list-item>
        <list-item>
          <p>SCRIPT, the tag that encloses a script, and CDATA, the tag that
contains data in a markup document.</p>
        </list-item>
        <list-item>
          <p>SOURCE, a complete URL address, and DOMAIN, an internet
domain.</p>
        </list-item>
        <list-item>
          <p>TYPE_SCRIPT, a tag with the type of script (VBScript, JavaScript,
etc.), and RUN_MODE and RUN_MODE_NOT, tags for Win32, DOS,
etc.</p>
        </list-item>
      </list>
      <p>The VirusShare project is available at https://VirusShare.com. The indexes of the collection are Terrier indexes, with both inverted and direct files (http://terrier.org).</p>
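      <p>The URL rule described above can be sketched in Python as a single regular-expression substitution. This is only a minimal illustration: the pattern URL_RE and the function tag_urls are hypothetical names, and the actual rule set of the labeling module is not reproduced here.</p>
      <preformat><![CDATA[
```python
import re

# Assumed rule: scheme, then the host (up to the first "/"), then an optional path.
URL_RE = re.compile(r"https?://([^/\s]+)(/\S*)?")


def tag_urls(text):
    """Replace each URL with a DOMAIN element followed by a SOURCE element."""
    def repl(match):
        domain = match.group(1)
        path = match.group(2) or ""
        return f"<DOMAIN>{domain}</DOMAIN> <SOURCE>{domain}{path}</SOURCE>"
    return URL_RE.sub(repl, text)


print(tag_urls("http://xxx.yy.201.53/guodanpi/dhnchia.exe"))
```
]]></preformat>
      <p>Emitting the domain and the full address as separate elements is what later allows each of them to be indexed on its own.</p>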
      <p>Pre-classification. Documents are pre-classified according to their
document type and script type (JavaScript, for example). Figure 1 shows two
examples of processed malware.</p>
      <p>Tokenization. Finally, the collection was indexed by considering the
text within tags (e.g. HTML, script, etc.) and treating any sequence of
characters as a meaningful token. The separators between indexed
tokens were all blank (whitespace) characters according to the UTF-8
encoding. Therefore, the typical punctuation characters (commas, periods, etc.)
were not considered separators. This extremely loose and permissive
indexing allows the lexicon (that is, the dictionary of indexed terms)
to contain both words belonging to the
natural language of text documents (such as those with the DOC_TYPE tag
equal to html) and commands or tokens belonging to scripts (within the
SCRIPT tags). We obtained 4.2 billion tokens and 154 million unique
terms across the whole collection. The average frequency of terms is 3.58
occurrences per malware, much higher than the average word frequency
in natural language texts.</p>
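      <p>The whitespace-only tokenization can be illustrated with a short Python sketch. The function name tokenize and the two sample documents are assumptions for this example; Python's str.split() with no argument is used as a stand-in for splitting on blank characters, and the indexing itself was done with Terrier rather than with this code.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def tokenize(text):
    """Split on runs of whitespace only; punctuation stays inside the token."""
    return text.split()


# Hypothetical documents: one script fragment and one natural-language fragment.
docs = [
    "window.open('readme.eml') window.open('readme.eml')",
    "Hello, world. Hello again.",
]

occurrences = Counter(tok for doc in docs for tok in tokenize(doc))
n_occurrences = sum(occurrences.values())  # total number of tokens
n_terms = len(occurrences)                 # number of unique terms in the lexicon

print(n_occurrences, n_terms)
print(sorted(occurrences))
```
]]></preformat>
      <p>Because punctuation is not a separator, a whole script call survives as one term, while "Hello," and "Hello" become two distinct lexicon entries. This is the source of the very large number of unique terms.</p>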
      <p>Indexing. Thanks to the tagging operations, one can enable or disable
any tag during indexing to obtain indexes dedicated
only to scripts (SCRIPT), only to textual parts, only to URL addresses
(SOURCE), only to domains (DOMAIN), or to a combination of these
tags. Thanks to the DOC_TYPE and TYPE_SCRIPT fields, one can
also obtain statistics on the distribution of malware across the different types
of documents.</p>
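      <p>The idea of enabling only some tags can be sketched as follows. This Python fragment is not Terrier's configuration mechanism; it is a simplified stand-in that assumes flat, non-nested fields, and the function select_fields, the sample document and its DOCNO value are hypothetical.</p>
      <preformat><![CDATA[
```python
import re

# Assumes flat fields of the form <TAG>content</TAG> with no nesting.
FIELD_RE = re.compile(r"<([A-Z_]+)>(.*?)</\1>", flags=re.DOTALL)


def select_fields(doc, enabled):
    """Return the content of the enabled tags only, in document order."""
    return [content.strip()
            for tag, content in FIELD_RE.findall(doc)
            if tag in enabled]


# Hypothetical tagged document; the DOCNO value is a placeholder, not a real hash.
doc = (
    "<DOCNO>0123abcd</DOCNO>\n"
    "<DOMAIN>xxx.yy.201.53</DOMAIN>\n"
    "<SOURCE>xxx.yy.201.53/guodanpi/dhnchia.exe</SOURCE>\n"
    "<SCRIPT>window.open('readme.eml')</SCRIPT>\n"
)

print(select_fields(doc, {"DOMAIN"}))
print(select_fields(doc, {"SOURCE", "SCRIPT"}))
```
]]></preformat>
      <p>Feeding only the selected fields to the indexer yields a dedicated index, for instance one built on domains alone.</p>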
    </sec>
    <sec id="sec-3">
      <title>Classification tasks</title>
      <p>Malware classification is a hard task because very few labeled training
sets exist for detection. Clustering can be an alternative because it can
automatically aggregate malware into different groups. However, the very
first step of a classification task would be to separate malware
files from non-malware ones in a very large document collection. We have
released a TREC-like collection of malware text, a resource that was still missing.
This collection allows researchers from different areas to cooperate
and apply IR, Data Mining and Big Data technologies to the problem of
malware classification. To this end, a set of classification queries will be
released soon.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Egele</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scholte</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirda</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kruegel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>A survey on automated dynamic malware-analysis techniques and tools</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>44</volume>
          ,
          <issue>2</issue>
          (Mar.
          <year>2008</year>
          ),
          <fpage>6:1</fpage>
          -
          <lpage>6:42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Schultz</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eskin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zadok</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Stolfo</surname>
            ,
            <given-names>S. J.</given-names>
          </string-name>
          <article-title>Data mining methods for detection of new malicious executables</article-title>
          .
          In
          <source>Proceedings of the 2001 IEEE Symposium on Security and Privacy</source>
          (Washington, DC, USA,
          <year>2001</year>
          ), SP '01, IEEE Computer Society, pp.
          <fpage>38</fpage>
          -
          <lpage>.</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adjeroh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Iyengar</surname>
            ,
            <given-names>S. S.</given-names>
          </string-name>
          <article-title>A survey on malware detection using data mining techniques</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>50</volume>
          ,
          <issue>3</issue>
          (June
          <year>2017</year>
          ),
          <fpage>41:1</fpage>
          -
          <lpage>41:40</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>