=Paper= {{Paper |id=Vol-2140/paper16 |storemode=property |title=The Malware Text Collection and Mining Project |pdfUrl=https://ceur-ws.org/Vol-2140/paper16.pdf |volume=Vol-2140 |authors=Giambattista Amati,Simone Angelini,Anna Caterina Carli,Carlo Majorani,Alessandro Riccardi |dblpUrl=https://dblp.org/rec/conf/iir/AmatiACMR18 }} ==The Malware Text Collection and Mining Project== https://ceur-ws.org/Vol-2140/paper16.pdf
The Malware Text Collection and Mining Project

   Giambattista Amati∗, Simone Angelini∗, Anna Caterina Carli•, Carlo
                 Majorani∗, and Alessandro Riccardi∗

                      ∗ Fondazione Ugo Bordoni, Rome, Italy
                 {gba,sangelini,cmajorani,ariccardi}@fub.it
   • Istituto Superiore delle Comunicazioni e delle Tecnologie dell’Informazione
                               (ISCOM), Rome, Italy
                         annacaterina.carli@mise.gov.it



     Abstract. We have released a malware collection in TREC style. It
     contains scripts, HTML documents and text files extracted from the binary
     files of about 650K malware samples. The objective of the project is to
     index them, extract significant features and classify them into malware
     families. To this end we will also release a TREC-style set of queries for
     classification tasks. In this abstract, we briefly describe the test
     collection, the project aims and the problems underlying the application of
     text mining and information retrieval techniques to malware classification.


     1    Introduction

     Malware analysis is a growing research area that still has many open
     problems [1]. For example, signatures for anti-virus toolkits are created
     manually using malware-analysis techniques and tools that can analyze
     programs either by executing them (dynamic analysis) or by inspecting
     them (static analysis). Static analysis can extract information
     from the binary representation of the program. Data mining techniques
     for detecting malware were first introduced by [2] on three different static
     features: Portable Executable (PE), strings and byte sequences.
     Interpretable text is a high-level specification of malicious behavior; for
     example, window.open('readme.eml')
     always occurs in worms of type Nimda [3]. Text mining classification can
     therefore be useful, but it can also be prohibitive because of the
     tokenization process, which may either produce a very high-dimensional
     feature space or lose relevant information when a standard text IR
     tokenization is used. Nevertheless, Big Data technologies and massive
     clustering techniques are now available, so the release of a TREC-style
     collection, which is still missing, will help the IR and cyber security
     communities to explore in depth to what extent Information Retrieval and
     Text Mining classification can be effective and useful for malware
     detection. Our text collection contains about 650K documents with the text
     extracted from malware and will be extended with a malware-free collection
     of similar size.
 IIR 2018, May 28-30, 2018, Rome, Italy. Copyright held by the author(s).
 https://trec.nist.gov/data.html
  Collection       Nr Docs    Nr Unique Tokens   Nr Occurrences   Index Size
  MW-TaggedText    655,361    153,587,253        4,222,109,462    21GB


Table 1. Collections that were collected and processed. The VS-TaggedText collection
contains the text of a subset of the collection available at VirusShare.com and occupies
30GB of compressed malware text data.



       2    The malware collection
       The malware collection was obtained from the VirusShare.com project.
       VirusShare was started in 2011 with the aim of collecting, indexing and
       freely sharing malware samples with analysts, researchers and the
       computer security community. At the moment the site provides about 30
       million malware samples. We downloaded a portion containing 655,361 of
       the most recent malware files (i.e. those collected by VirusShare in the
       last 6 months). Initially the collection was about 286GB compressed (11
       zip archives). We extracted the text part and formed a collection of
       about 66GB of uncompressed data, or equivalently about 30GB of
       compressed data, from which we obtained 21GB of indexes. The text part of
       the whole collection should therefore contain approximately 14 TeraBytes of
       compressed data for 9TB of Terrier’s indexes.
       The malware collection has been subjected to the following operations:


       Text extraction The text part was first extracted with the Unix
       strings utility. From 286GB of compressed data, 30GB of compressed
       data were obtained.
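
       The extraction step can be illustrated with a minimal Python sketch that
       approximates what the Unix strings utility does: it keeps runs of at least
       four printable ASCII bytes. The function name extract_strings, the minimum
       length parameter and the sample bytes are assumptions made for
       illustration; the project itself used the strings utility directly.

```python
import re
from typing import List


def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least `min_len` printable ASCII bytes found in `data`.

    This only approximates the Unix `strings` utility (space through tilde);
    it is not the tool used by the project.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Hypothetical binary blob: printable text surrounded by non-printable bytes.
sample = (
    b"\x00\x01MZ\x90\x00window.open('readme.eml')"
    b"\x00\xff\xfeabc\x00GetProcAddress\x00"
)
strings_found = extract_strings(sample)
print(strings_found)
```

       Runs shorter than the minimum length (here "MZ" and "abc") are dropped,
       which is why the extracted text is much smaller than the original binaries.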


       Tagging The collection was then labeled by introducing the following
       new tags: DOCNO, DOC_TYPE, SCRIPT, TYPE_SCRIPT, CDATA,
       DOMAIN, SOURCE, RUN_MODE, RUN_MODE_NOT. The labeling
       module consists of a set of syntactic rules expressed as regular
       expressions. We extract all domains and URLs in order to index them
       separately, transforming strings such as
       http://xxx.yy.201.53/guodanpi/dhnchia.exe into:
       <DOMAIN> xxx.yy.201.53 </DOMAIN> and
       <SOURCE> xxx.yy.201.53/guodanpi/dhnchia.exe </SOURCE>
       The new tags contain the following information:
         – DOC: the initial malware tag; DOCNO: the malware file identifier,
            which contains the MD5 hash value of the file; DOC_TYPE: a tag
            indicating whether or not the document is an HTML document.
         – SCRIPT: the tag that encloses a script; CDATA: the tag that
            contains data in a markup document.
         – SOURCE: a complete URL address; DOMAIN: an internet domain.
  https://VirusShare.com
  These are Terrier indexes, with both inverted and direct files http://terrier.org.
Fig. 1. Examples of a tagged malware script and of a labeled html document.




    – TYPE_SCRIPT: a tag with the type of script (VBScript, JavaScript,
      etc.); RUN_MODE, RUN_MODE_NOT: tags for the run mode (Win32, DOS,
      etc.).
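
    As an illustration of how a regex-based rule might produce the DOMAIN and
    SOURCE tags, the following Python sketch rewrites each URL found in a
    string. The pattern, the function names tag_url and tag_urls, and the exact
    layout of the emitted tags (including the closing tags) are assumptions;
    the actual rule set of the labeling module is not given in the paper.

```python
import re

# Assumed rule: an http(s) scheme, then the host (DOMAIN), then an optional path.
URL_RE = re.compile(r"https?://([^/\s]+)(/\S*)?")


def tag_url(match):
    """Build the DOMAIN and SOURCE tags for one matched URL (hypothetical layout)."""
    domain = match.group(1)
    path = match.group(2) or ""
    return f"<DOMAIN> {domain} </DOMAIN> <SOURCE> {domain}{path} </SOURCE>"


def tag_urls(text):
    """Replace every URL in `text` with its tagged form."""
    return URL_RE.sub(tag_url, text)


tagged = tag_urls("GET http://xxx.yy.201.53/guodanpi/dhnchia.exe now")
print(tagged)
```

    Emitting the domain and the full address as separate fields is what allows
    them to be indexed separately later on.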


   Pre-classification Documents are pre-classified according to their
   document type and script type (JavaScript, for example). Figure 1 shows
   two examples of processed malware.


   Tokenization Finally, the collection was indexed by considering the
   text within tags (e.g. html, script, etc.) and treating any sequence of
   non-blank characters as a meaningful token. The separator characters
   between indexed tokens were all blank (whitespace) characters of the
   UTF-8 encoding. Therefore, the typical punctuation characters (commas,
   periods, etc.) were not considered separators. This extremely loose and
   permissive indexing allows the lexicon (the dictionary of indexed terms)
   to contain both words belonging to the natural language of text documents
   (such as those with the DOC_TYPE tag equal to html) and commands or
   tokens belonging to scripts (within the SCRIPT tags). 4.2 billion tokens
   were obtained, with 154 million unique terms across the whole collection.
   The average frequency of terms is 3.58 occurrences per malware, much
   higher than the average word frequency in natural language texts.
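
   The effect of splitting on whitespace only can be seen in a small Python
   sketch. The function name tokenize and the sample text are assumptions for
   illustration; the collection itself was indexed with Terrier, not with this
   code.

```python
from collections import Counter


def tokenize(text):
    """Split on runs of whitespace only; punctuation stays inside the tokens."""
    # str.split() with no argument splits on any Unicode whitespace.
    return text.split()


# Hypothetical document mixing script code and natural language.
doc = "var x=window.open('readme.eml'); Hello, world. Hello, again"
tokens = tokenize(doc)
term_frequencies = Counter(tokens)
print(tokens)
print(term_frequencies["Hello,"])
```

   Because punctuation is not a separator, "Hello," and "Hello" would be
   distinct terms, and a whole script statement without blanks becomes a single
   term. This is one reason why the lexicon grows so large.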


    Indexing Thanks to the tagging operations, any tag can be enabled or
   disabled during indexing to obtain indexes dedicated to scripts only
   (SCRIPT), to textual parts only, to URL addresses only (SOURCE), to
   domains only (DOMAIN), or to a combination of these tags. Thanks to the
   DOC_TYPE and TYPE_SCRIPT fields, one can also obtain statistics on the
   distribution of malware across the different types of documents.
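
   The idea of enabling only some tags can be sketched in plain Python as
   follows. This is a stand-in for illustration and not Terrier's
   configuration mechanism; the function name extract_fields, the sample
   document and its DOCNO value are hypothetical.

```python
import re


def extract_fields(doc, enabled_tags):
    """Return the text enclosed by each enabled tag, in the order of `enabled_tags`."""
    parts = []
    for tag in enabled_tags:
        name = re.escape(tag)
        pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
        parts.extend(content.strip() for content in pattern.findall(doc))
    return parts


# Hypothetical tagged document; the DOCNO is a placeholder, not a real hash.
doc = """<DOC>
<DOCNO> 0123456789abcdef0123456789abcdef </DOCNO>
<DOC_TYPE> html </DOC_TYPE>
<SCRIPT> window.open('readme.eml') </SCRIPT>
<DOMAIN> xxx.yy.201.53 </DOMAIN>
<SOURCE> xxx.yy.201.53/guodanpi/dhnchia.exe </SOURCE>
</DOC>"""

script_only = extract_fields(doc, ["SCRIPT"])
network_only = extract_fields(doc, ["DOMAIN", "SOURCE"])
print(script_only)
print(network_only)
```

   Feeding only the selected parts to the indexer yields one dedicated index
   per tag combination.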
Fig. 2. Parts of a lexicon extracted and labeled in the textual part of a malware.
Nt denotes the number of files in which the token occurs; TF is the number of times
it occurs. Some tokens appear to be related to Windows PE executable files encoded
in Base64. For example, the sequence 6!6,626@6t6‘6e6p6v6 occurs 8 times in 8
malware files. term298683 instead indicates the entire coding of the term in the system.
The phenomenon of obfuscation is evident. The strings in the figure are all generated
by the regex "(d.)+d", where d is the digit 6.
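
The regex in the caption can be checked directly. The following Python sketch
instantiates "(d.)+d" with d = 6 and tests it against the sequence quoted above.
The helper names is_interleaved and strip_filler are assumptions, and no claim
is made about what the characters between the 6s mean.

```python
import re

# The caption's pattern "(d.)+d" with d replaced by the digit 6.
OBFUSCATION_RE = re.compile(r"(6.)+6")


def is_interleaved(token):
    """True if the whole token alternates the digit 6 with arbitrary characters."""
    return OBFUSCATION_RE.fullmatch(token) is not None


def strip_filler(token):
    """Keep the characters at odd positions, i.e. those sitting between the 6s."""
    return token[1::2]


sample = "6!6,626@6t6‘6e6p6v6"
print(is_interleaved(sample))
print(strip_filler(sample))
```

A lexicon could be scanned with such a pattern to group the tokens that share
this interleaving scheme.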




      3    Classification tasks
      Malware classification is a hard task because very few labeled training
      sets exist for detection. Clustering can be an alternative because it can
      automatically aggregate malware into different groups. However, the very
      first step for a classification task would be to separate malware
      files from non-malware ones in a very large document collection. We have
      released a TREC-like collection of malware text, which was still missing.
      This collection allows researchers from different areas to cooperate
      and apply IR, Data Mining and Big Data technologies to the problem of
      malware classification. To this end, a set of classification queries will
      soon be released.


      References
      1. Egele, M., Scholte, T., Kirda, E., and Kruegel, C. A survey
         on automated dynamic malware-analysis techniques and tools. ACM
         Comput. Surv. 44, 2 (Mar. 2008), 6:1–6:42.
      2. Schultz, M. G., Eskin, E., Zadok, E., and Stolfo, S. J.
         Data mining methods for detection of new malicious executables.
         In Proceedings of the 2001 IEEE Symposium on Security and Privacy
         (Washington, DC, USA, 2001), SP ’01, IEEE Computer Society,
         pp. 38–.
      3. Ye, Y., Li, T., Adjeroh, D., and Iyengar, S. S. A survey on
         malware detection using data mining techniques. ACM Comput. Surv.
         50, 3 (June 2017), 41:1–41:40.