<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automated Knowledge Graph Construction From Raw Log Data</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff1">
          <institution>WU (Vienna University of Economics and Business)</institution>
          ,
          <addr-line>Welthandelsplatz, Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
          ,
          <email>first.last@ai.wu.ac.at</email>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Wien (Vienna University of Technology)</institution>
          ,
          <addr-line>Favoritenstraße 9-11/194, 1040 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Logs are a crucial source of information to diagnose the health and status of systems, but their manual investigation typically does not scale well and often leads to a lack of awareness and incomplete transparency about issues. To tackle this challenge, we introduce SLOGERT, a flexible framework and workflow for the automated construction of knowledge graphs from arbitrary raw log messages. To this end, we combine a variety of techniques to facilitate a knowledge-based approach to log analysis.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Log files are a vital source of runtime information about a system's state and
activities in many areas of information systems development and operations,
e.g., in the context of security monitoring, compliance auditing, forensics, and
error diagnosis.</p>
      <p>
        Around these varied applications, a market has developed for log management
solutions that assist in the process of storing, indexing, and searching log data;
the latter typically happens through some combination of manual inspection and
regular expressions to locate specific messages or patterns [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Commercially available
log management solutions (e.g., Splunk (https://splunk.com) or Logstash
(https://logstash.net)) facilitate aggregation, normalization, and storage, but
provide limited integration, contextualization, linking, enrichment, and querying
capabilities. Consequently, although they somewhat ease manual analytical processes,
investigations across multiple heterogeneous log sources with unknown content and
message structures remain a challenging and time-consuming task. Analysts therefore
typically have to cope with many different types of events, expressed with different
terminologies and represented in a multitude of formats [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], particularly in large-scale systems composed
of heterogeneous components.
      </p>
      <p>This work was sponsored by the Austrian Science Fund (FWF) and netidee
SCIENCE under grant P30437-N31, and by the Austrian Research Promotion Agency
(FFG) under grant 877389 (OBARIS). The authors thank the funders for their
generous support.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>In this paper, we propose SLOGERT (Semantic LOG ExtRaction
Templating), a workflow for automated knowledge graph construction from unstructured,
heterogeneous, and potentially fragmented log sources. SLOGERT combines
extraction techniques that leverage particular characteristics of log data into a
modular and extensible processing framework. In particular, we propose a
workflow that combines log parsing and event template learning, natural language
annotation, keyword extraction, automatic generation of RDF graph modelling
patterns, and linking and enrichment to extract and integrate the evidence-based
knowledge contained in logs. By making log data amenable to semantic analysis,
the workflow fills an important gap and opens up a wealth of data sources for
knowledge graph building.</p>
    </sec>
    <sec id="sec-2">
      <title>Building knowledge graphs from log files</title>
      <p>
        In this section, we introduce the SLOGERT architecture, components, and
implementation (project website: https://w3id.org/sepses/index.php/slogert/). The
resulting workflow, illustrated in Figure 1, expects
unstructured log files as input and consists of five phases:
1. Template and Parameter Extraction. Log files typically consist of structured
elements (e.g., time stamp, device id, facility, message severity) and an
unstructured free-text message. To extract log templates from such raw
unstructured logs, we use LogPAI [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to identify constant strings and variable
values in the free-text message content. This results in two files, i.e., (i) a list
of log templates discovered in the log file, marking the position of variable
parts (parameters), and (ii) the content of the logs, with each log line linked
to one of the log template ids and the extracted instance parameters as
an ordered list. At the end of this stage, we have templates and extracted
(variable) parameters, but their semantic meaning is as yet undefined.
2. Semantic Annotation receives the log templates and the instances with the
extracted parameters as input. This phase consists of two sub-phases:
(a) Semantic template annotation initiates the parameter type detection by
first selecting a set of log lines for each template and then applying
rule-based Named Entity Recognition (NER) techniques. Specifically,
we use the TokensRegex framework from Stanford CoreNLP
(https://nlp.stanford.edu/pubs/tokensregex-tr-2014.pdf) to define
sequences of tokens and map them to semantic objects. As log messages
often do not follow the grammatical rules of natural language expressions
(e.g., URLs, identifiers), we additionally apply standard regex patterns
to the complete message. For each detection pattern, we define a type
and a property from a log vocabulary to use for the detected entities.
      </p>
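      <p>The first two phases above can be illustrated with a minimal sketch. This is
an illustrative stand-in, not the actual LogPAI or TokensRegex code: it marks token
positions that vary across structurally similar messages as &lt;*&gt; and then types
the extracted parameters with plain regular expressions; the rule names
(log:User, log:IPv4, log:Port) are invented placeholders, not the SLOGERT
vocabulary.</p>
      <p>
```python
import re

def mine_template(messages):
    """Phase 1 (sketch): align messages with equal token counts and mark
    positions whose token varies across messages as the wildcard <*>."""
    rows = [m.split() for m in messages]
    template, var_idx = [], []
    for i, col in enumerate(zip(*rows)):
        if len(set(col)) == 1:
            template.append(col[0])
        else:
            template.append("<*>")
            var_idx.append(i)
    params = [[r[i] for i in var_idx] for r in rows]
    return " ".join(template), params

# Phase 2a (sketch): rule order matters -- the first matching rule wins.
RULES = [
    ("log:IPv4", re.compile(r"\d{1,3}(\.\d{1,3}){3}")),
    ("log:Port", re.compile(r"\d{2,5}")),
    ("log:User", re.compile(r"[A-Za-z][\w.-]*")),
]

def type_params(params):
    """Assign each extracted parameter the first fully matching type."""
    return [next((t for t, rx in RULES if rx.fullmatch(p)), "log:Unknown")
            for p in params]

msgs = [
    "Accepted password for jhalley from 185.81.215.145 port 52410 ssh2",
    "Accepted password for alice from 10.0.0.7 port 51022 ssh2",
]
template, params = mine_template(msgs)
print(template)                 # Accepted password for <*> from <*> port <*> ssh2
print(type_params(params[0]))   # ['log:User', 'log:IPv4', 'log:Port']
```
      </p>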
      <p>Figure 1 (excerpt): a log line from auth.log, "Client02 Mar 9 12:10:50
Client02 sshd[2124]: Accepted password for jhalley from 185.81.215.145 port 52410
ssh2", is processed as follows. A1 – Template &amp; Parameter Extraction yields the
timestamp "Mar 9 12:10:50", source "Client02", process "sshd[2124]", the event
template "Accepted password for &lt;*&gt; from &lt;*&gt; port &lt;*&gt; ssh&lt;*&gt;",
and the parameters ['jhalley', '185.81.215.145', '52410', '2']. A2 – Semantic
Annotation produces an OTTR log event template (lxid:LogEventTemplate_548db...)
composed of sub-templates such as lxid:OttrTemplate_unix,
lxid:OttrTemplate_userPassword, lxid:OttrTemplate_ip, and lxid:OttrTemplate_port,
together with an stOTTR log instance annotation that binds the extracted values
(e.g., lid:User_jhalley, lid:IPv4_185.81.215.145, svid:Port_52410). A3 – RDFization
expands these into RDF, e.g., logid:Event_33d97... of type log:Event with time
"2020-03-10T00:10:50", pname "sshd", msg "Accepted password...", hasUser
logid:User_jhalley, and hasAddress logid:Address_52410. A4 – Background KG linking
connects logid:User_jhalley via ownedBy to logid:Person_JaneHalley, a foaf:Person
with name "Jane Halley". A5 – KG Integration merges the resulting graphs.</p>
      <p>
        To generate a consistent representation over heterogeneous log files, we
extended an existing log vocabulary (https://w3id.org/sepses/vocab/log/core) and
mapped it to the Common Event Expression (CEE) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] taxonomy, as shown in Figure 2. Once we
have identified each parameter of a template, we generate Reasonable
Ontology Templates (OTTRs) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
(b) Semantic instance annotation receives a set of annotated templates from
the semantic template annotation process. Based on these annotated
templates, we transform all log line instances into stOTTR, a custom
file format similar to Turtle, which references the OTTR templates. In
addition, we apply the CoreNLP annotation features to extract
keywords from each log message to provide additional context.
3. RDFization generates a knowledge graph for each log file based on the OTTR
templates and stOTTR instance files generated in the extraction component.
To this end, we integrate Lutra (https://ottr.xyz/#Lutra), the reference
implementation of the OTTR language, to expand all instances into regular
RDF graphs.
4. Background Knowledge Graph (KG) linking contextualizes entities that
appear in a log file with background knowledge. We distinguish local
background knowledge (e.g., employees, servers, installed software, and
documents) and external background knowledge (e.g., publicly available
cybersecurity information from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
5. Knowledge Graph Integration combines the KGs generated from the
previously isolated log files and sources into a single, linked representation. In
our prototype, the generated KGs and the background knowledge share the
same vocabulary and can therefore easily be merged.
      </p>
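      <p>The OTTR expansion performed in the RDFization phase can be approximated as
simple slot substitution. The sketch below is a hypothetical stand-in for Lutra, with
an invented namespace, template, and property names; real OTTR templates are
considerably more expressive (typed parameters, nesting, defaults).</p>
      <p>
```python
# Hypothetical stand-in for OTTR expansion: a "template" is a list of triple
# patterns with ?slot placeholders, and expansion substitutes the instance
# arguments into those slots. All names below are invented for illustration.
EX = "http://example.org/"

EVENT_TEMPLATE = [
    ("?id", EX + "type",    EX + "Event"),
    ("?id", EX + "msg",     "?message"),
    ("?id", EX + "hasUser", "?user"),
    ("?id", EX + "hasIP",   "?ip"),
]

def expand(template, args):
    """Replace every ?slot occurring in a triple with its bound argument."""
    return [tuple(args.get(term, term) for term in triple) for triple in template]

triples = expand(EVENT_TEMPLATE, {
    "?id":      EX + "Event_33d97",
    "?message": "Accepted password...",
    "?user":    EX + "User_jhalley",
    "?ip":      EX + "IPv4_185.81.215.145",
})
for s, p, o in triples:
    print(s, p, o)
```
      </p>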
    </sec>
    <sec id="sec-3">
      <title>Example</title>
      <p>To illustrate the proposed approach, we simulated user behavior in a virtual
lab network on the Azure platform according to scripted scenarios and collected
various log files from each host (e.g., auth, sys, vsftpd). We then generated an
integrated log graph by processing the log files with our prototype.</p>
      <p>Listing 1 illustrates how to query the integrated graph, which contains events
from different sources and hosts, for log events that have a connected IP and
port number, and how to enrich the result with the services that typically
run on those ports. This background information comes from a referenced
ontology on services and ports (https://w3id.org/sepses/id/slogert/port-services).
Figure 3 provides an excerpt of the query results.</p>
      <p>
PREFIX ... # Prefixes omitted for the sake of brevity
SELECT ?time ?event ?sourceLabel ?ipLabel ?portNumber ?serviceName
WHERE {
  ?event log:time ?time ; log:hasSource ?source ; log:hasPort ?port ; log:hasIP ?ip .
  ?source log:hasSourceType ?sourceType .
  ?sourceType rdfs:label ?sourceLabel .
  ?ip log:ipv4 ?ipLabel .
  ?port log:port ?portNumber ; log:linkedPortService ?linkedPort .
  ?portProtocolCombination svid:hasPort ?linkedPort ; svid:hasService ?service .
  ?service rdfs:label ?serviceName .
} ORDER BY ASC(?time)
Listing 1: SPARQL query to retrieve log events with ports and their standard services</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>This paper introduced SLOGERT, a flexible framework for automated knowledge
graph construction from unstructured log data. In our own research, we will first
apply the proposed approach in the context of semantic security analytics, but
we see more general potential for the approach to drive adoption of Semantic
Web technologies in domains where log data needs to be analyzed.</p>
      <p>The results of the current prototype are promising; we next plan to study
the quality of the automatically generated log graphs. This includes evaluating
the correct detection of parameters in log lines, as well as the correct entity
detection and linking, also on unknown log sources. The completeness and quality
of the extracted keywords could also be evaluated in user studies and inform
extensions of the applied method. Furthermore, we will focus on graph
management for template evolution and incremental updating of log knowledge graphs
in future work. Finally, we plan to compare analyst workflows in commercial
log management tools with our solution to highlight the advantages of semantic
graphs in log analysis, and to identify potential for improvement and synergies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kiesling</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ekelhart</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kurniawan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ekaputra</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>The SEPSES Knowledge Graph: An Integrated Resource for Cybersecurity</article-title>
          .
          <source>In: The Semantic Web – ISWC 2019</source>
          , pp.
          <fpage>198</fpage>
          –
          <lpage>214</lpage>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. MITRE:
          <article-title>About Common Event Expression</article-title>
          , https://cee.mitre.org
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Oliner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganapathi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Advances and challenges in log analysis</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>55</volume>
          (
          <issue>2</issue>
          ),
          <fpage>55</fpage>
          –
          <lpage>61</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Skjæveland</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lupp</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karlsen</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forssell</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Practical ontology pattern instantiation, discovery, and maintenance with reasonable ontology templates</article-title>
          .
          <source>In: The Semantic Web – ISWC 2018</source>
          . pp.
          <fpage>477</fpage>
          –
          <lpage>494</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lyu</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          :
          <article-title>Tools and benchmarks for automated log parsing</article-title>
          .
          <source>In: 41st Int. Conf. on Software Engineering: Software Engineering in Practice</source>
          . pp.
          <fpage>121</fpage>
          –
          <lpage>130</lpage>
          . IEEE Press (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>