=Paper= {{Paper |id=Vol-1097/STIDS2013_T06 |storemode=property |title=Managing Semantic Big Data for Intelligence |pdfUrl=https://ceur-ws.org/Vol-1097/STIDS2013_T06_Boury-Brisset.pdf |volume=Vol-1097 |dblpUrl=https://dblp.org/rec/conf/stids/Boury-Brisset13 }} ==Managing Semantic Big Data for Intelligence== https://ceur-ws.org/Vol-1097/STIDS2013_T06_Boury-Brisset.pdf
          Managing Semantic Big Data for Intelligence
                                                   Anne-Claire Boury-Brisset
                                            Defence Research and Development Canada
                                                         Québec, Canada
                                            anne-claire.boury-brisset@drdc-rddc.gc.ca


Abstract: All-source intelligence production involves the collection and analysis of intelligence data provided in various formats (raw data from sensors, imagery, text from human reports, etc.) and distributed across heterogeneous data stores. Advances in sensing technologies, the acquisition of new sensors, and the use of mobile devices result in an overwhelming amount of sensed data, which increases the challenge of transforming these raw data into useful, actionable intelligence in a timely manner. Leveraging recent advances in data integration, Semantic Web and Big Data technologies, we are adapting key concepts of unified dataspaces and semantic enrichment for the design and implementation of an R&D intelligence data integration platform, MIDIS (Multi-Intelligence Data Integration Services). The development of this scalable data integration platform rests on the layered dataspace approach, makes use of recent Big Data technologies, and leverages ontological models and semantic-based analysis services developed for various purposes as part of the semantic layer.

Keywords: intelligence, data integration, knowledge extraction, ontology, Big Data

I.    INTRODUCTION

Advances in sensing technologies, the acquisition of new sensors, and the use of mobile devices result in an overwhelming amount of sensed data, which increases the challenge of transforming these raw data into useful, actionable intelligence in a timely manner. Consequently, intelligence operators and analysts have to deal with ever-increasing amounts of ISR data and information from various sources (SIGINT, IMINT, GEOINT, HUMINT, OSINT, etc.), produced in multiple disparate media formats (raw data sets from sensors, e.g., video, images, sound files, as well as human reports and open source text), and distributed across different systems and data stores.

As part of a research project conducted within the Intelligence and Information Section at Defence Research and Development Canada (DRDC) – Valcartier, we are investigating advanced concepts, techniques and technologies in order to provide enhanced capabilities for the management and integration of large-scale heterogeneous information sources and intelligence products made available to intelligence operators and officers in support of the production of intelligence and sense-making activities.

Our ultimate goals are to:

   • Provide timely and relevant information to the analyst through intuitive search and discovery mechanisms;
   • Provide a framework facilitating the integration of heterogeneous unstructured and structured data, enabling Hard/Soft fusion and preparing for various analytics exploitation.

This paper describes ongoing research for the design and implementation of a prototype for scalable Multi-Intelligence Data Integration Services (MIDIS) in support of these objectives, based on a flexible data integration approach that makes use of Semantic Web and Big Data technologies. The paper is organized as follows. In the next section, we present recent work addressing multi-intelligence data integration, followed by a short introduction to Big Data challenges. Section IV describes the proposed architecture for large-scale intelligence data integration and analysis and details the main components of the resulting architecture. Section V provides details about the implementation using Big Data technologies. Section VI provides some conclusions and directions for future work.

II.   MULTI-INTELLIGENCE DATA INTEGRATION

Intelligence is about data management and processing: 1) data collection from various sources, 2) data analysis for the production of intelligence and 3) dissemination of intelligence products. Intelligence data management nowadays presents the following characteristics:

   • Increase of sensor data volume (terabytes to exabytes);
   • Heterogeneity: multiple data formats and standards, mix of structured and unstructured data;
   • Need to quickly acquire and process intelligence information;
   • Agility is required to be able to incorporate new data sources;
   • Support to data exploitation: each piece of data represents some part of a situation, and intelligence data contain entities that must be understood and correlated.

Data integration aims at combining data that reside at distributed, autonomous, and heterogeneous data sources into a




                                                 STIDS 2013 Proceedings Page 41
single consistent view of the data [7]. Traditional approaches propose either centralized or federated data integration. The centralized approach requires heavy pre-processing through extract, transform, load (ETL) processes, while the federated approach can suffer from performance problems and complex transformation issues. These approaches have been extensively detailed and challenged in the literature, and they were recently presented by Singleton [19] as part of a research work in the military domain.

As an alternative to these approaches to cope with large-scale heterogeneous data management, Franklin, Halevy and colleagues [11] proposed the concept of dataspaces as a new abstraction for information management. It promotes a flexible co-existence approach for the incorporation of heterogeneous data into a dataspace, and a description of the concepts of the domain at a higher level of abstraction. Integration in terms of schema harmonization is realized in a pay-as-you-go approach [12].

Looking for a flexible data integration solution to deal with the ever-increasing heterogeneous data sources in the intelligence domain and information fusion, S. Yoakum-Stover proposed a framework to implement this scheme [20, 21]. Based on that approach, D. Salmen and colleagues [16] described their implementation. It rests on the definition of a data integration framework (DRIF), also called Data Description Framework (DDF) in previous papers, based on a unified data integration model. The idea is to define a simple data representation scheme to encapsulate every piece of data from heterogeneous sources into a unified representation. The elementary constructs are signs, terms, concepts, predicates and statements, the latter being conceptually similar to the Semantic Web Resource Description Framework (RDF) triple composed of subject, predicate, and object.

Based on this unified scheme, the dataspace is organized into several layers, namely:

   • Segment 0 contains the external data sources and systems from which relevant data are extracted;
   • Segment 1 (unstructured data) represents the data store for artefacts;
   • Segment 2 (structured data) is the universal store for data structured according to the unified representation scheme;
   • Segment 3 (data models) contains the representation of data models and ontologies to facilitate the mapping and integration of heterogeneous data.

The concepts underlying the unified dataspace have been implemented as part of the US Army's Distributed Common Ground System (DCGS-A) Cloud initiative [17]. Moreover, to address semantic heterogeneity, B. Smith and colleagues [17] propose a strategy for the integration of diverse data through semantic enhancement, by adding a semantic layer to the data (explicitly represented in segment 3).

Leveraging this approach, we are adapting the underlying concepts for the design and implementation of an R&D intelligence data integration platform, MIDIS (Multi-Intelligence Data Integration Services), to meet our requirements in support of intelligence. In previous research, our team has developed several intelligence support tools in support of collation and intelligence production, and knowledge-based systems on top of military domain ontologies to meet various analysis requirements [15]. Some of the relevant components from these tools have been incorporated as services as part of a SOA-based Intelligence Science and Technology Integration platform (ISTIP) in development.

The data access component had to be further developed in this platform to provide the ability to dynamically ingest, integrate and manage data from various intelligence sources. Consequently, MIDIS aims at enriching the data access component of the ISTIP platform to provide the set of services needed to ingest the multiple intelligence data formats available, transform them into a unified model, and make these data accessible, searchable and exploitable (e.g. data mining) in support of intelligence analysis.

The design and development of MIDIS as a scalable data integration platform rests on the layered dataspace approach and makes use of Big Data technologies. Moreover, we leverage ontological models and semantic-based analysis services developed for various purposes as part of the semantic layer within the architecture described in section IV.

III.     BIG DATA CHALLENGES

Considering the huge amount of data produced every day in both the commercial and the defense areas, the Big Data paradigm promotes novel approaches and technologies for data capture, storage and analytics to deal with “massive volume of unstructured and structured that cannot be managed and processed with traditional databases and software approaches” [3].

Big Data were initially characterized according to 3 Vs, namely: 1) Volume or scalability: ability to manage increasing volumes of data, for storage and analysis; 2) Variety: heterogeneity of data types, data formats, semantic interpretation; 3) Velocity: timeliness or rate at which the data arrive and time in which they must be acted upon. Additional Vs are sometimes added to denote the Veracity of data, as well as the Value that can be extracted from Big Data.

The problem of information overload is not new, but it is amplified in the new information era. Big Data challenges encompass most data management processes, i.e. data capture, curation, storage, search, sharing, analysis, and visualization. In our research work, we are interested in Big Data solutions for on-the-fly integration of heterogeneous data from various sources and effective search among heterogeneous, possibly inconsistent data sets, while managing data granularity and consistency. Some of these will be discussed later in the paper.

IV.       ARCHITECTURE COMPONENTS

The implementation of the unified dataspace approach points toward Big Data technological solutions, as they provide scalability, elasticity, replication, fault-tolerance, and parallel processing. Next, we present the proposed global architecture



for intelligence data integration, its main components (data ingestion process, ontology support and semantic enrichment, search and analytics), and interactions with other reasoning modules.

A. Global architecture: from Collection to Analysis

Figure 1 represents the high-level architecture and data flow, from data collected from heterogeneous data stores, through their ingestion into the dataspace, to intelligence analysis by specialized reasoning services. The key components include:

   • Data ingestion from heterogeneous source formats and integration into the unified dataspace segments;
   • Ontology-based semantic enrichment;
   • Data querying and analytics;
   • Interactions with external reasoning modules.

Figure 1: Intelligence data integration and analysis framework

B. Data ingestion

The system ingests intelligence data from representative sources provided in heterogeneous formats, in order to illustrate the integration of a variety of intelligence data as used by intelligence analysts to conduct multi-intelligence all-source analysis. A subset of the considered data sources in this context includes:

   • Structured data coming from intelligence or operational databases, including track data;
   • Intelligence reports;
   • Imagery database;
   • Data from a Content Management System;
   • Internet open source (e.g. Twitter).

The data ingestion pipeline is applied to structured and unstructured data (cf. Fig. 2) as follows. Figure 2 illustrates the data flow and transformation process from external data sources, and shows explicitly how data pieces move to different segments of the unified dataspace (UDS).

Figure 2: UDS layered architecture and data flow (adapted from Yoakum-Stover, 2012 [22])

1) Structured data

The ingestion pipeline for structured data processes various structured data sources (RDB, CSV, XML, RDF format) in order to populate the UDS in segments 1, 2 and 3. The approach makes use of an XML configuration file generic enough to process each data schema provided (e.g. a WSDL web service provides the XML schema to be processed). Data files are then parsed to extract the data of interest, which are loaded into the UDS according to its constructs (concepts, predicates, statements): the statements and a reference to the source go into segment 2, the source model into segment 3, and the imported data source itself into segment 1.

2) Unstructured data: annotation and extraction

Unstructured data (e.g. intelligence reports, documents) are processed according to a text analysis pipeline using semantic annotation and knowledge extraction services supported by domain ontologies. Documents are analyzed and semantically annotated using concept instances from domain ontologies (named entities, people, location, …). Then, knowledge in the form of statements (e.g. X is_located_at Loc) is extracted using pattern matching rules. These processes use the popular GATE platform (General Architecture for Text Engineering) [4] as the underlying natural language processing component. Documents and their annotations are stored in segment 1, while extracted facts and metadata that provide meta-information about the documents are stored in segment 2 according to the unified model (structured data). Metadata of interest include data provenance, uncertainty, temporal and spatial information.
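For illustration only, the structured-data pipeline of subsection B.1 can be sketched in a few lines of Python. This is not the MIDIS implementation: the `Statement` class, the `TRACK_CONFIG` dictionary (a stand-in for the XML configuration file), the column names and the predicate names are all hypothetical, chosen only to show how rows of a tabular source might be turned into concept/predicate/statement constructs that keep a reference to their source.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """Hypothetical UDS-like statement: subject, predicate, object, plus source reference."""
    subject: str
    predicate: str
    obj: str
    source: str


# Hypothetical stand-in for the per-format XML configuration file:
# it names the concept each row describes and maps columns to predicates.
TRACK_CONFIG = {
    "source": "track_db",
    "concept": "Track",
    "id_column": "track_id",
    "predicates": {
        "lat": "has_latitude",
        "lon": "has_longitude",
        "time": "observed_at",
    },
}


def ingest_csv(text, config):
    """Parse CSV text and emit one statement per non-empty mapped column."""
    statements = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"{config['concept']}:{row[config['id_column']]}"
        # Typing statement: links the instance to its concept.
        statements.append(Statement(subject, "is_a", config["concept"], config["source"]))
        for column, predicate in config["predicates"].items():
            value = row.get(column)
            if value:  # skip missing or empty cells
                statements.append(Statement(subject, predicate, value, config["source"]))
    return statements


SAMPLE = "track_id,lat,lon,time\nT1,46.8,-71.2,2013-05-01T10:00Z\nT2,47.0,-70.9,\n"
statements = ingest_csv(SAMPLE, TRACK_CONFIG)
for s in statements:
    print(s)
```

Keeping the mapping in a declarative configuration rather than in code is what lets a new source be added without changing the parser, which matches the agility requirement stated in section II.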




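The extraction step for unstructured data (annotation against ontology instances, then pattern-matching rules that yield statements such as X is_located_at Loc) can be illustrated with a minimal sketch. It deliberately does not use GATE: a plain regular expression and two small hard-coded sets stand in for the annotation and rule components, and the names `PEOPLE`, `LOCATIONS`, `ExtractedStatement` and the fixed confidence value are assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExtractedStatement:
    """Hypothetical extracted fact with the metadata kept alongside it."""
    subject: str
    predicate: str
    obj: str
    metadata: dict = field(default_factory=dict)


# Hypothetical gazetteers standing in for instances of domain ontology concepts.
PEOPLE = {"Alpha", "Bravo"}
LOCATIONS = {"Kandahar", "Quebec"}

# A single toy rule: "<word> was seen in|is located at|is in <word>".
PATTERN = re.compile(r"\b(\w+) (?:was seen in|is located at|is in) (\w+)\b")


def extract(doc_id, text, confidence=0.8):
    """Return is_located_at statements whose arguments are known instances."""
    facts = []
    for match in PATTERN.finditer(text):
        who, where = match.group(1), match.group(2)
        # Keep the match only if both arguments are annotated instances.
        if who in PEOPLE and where in LOCATIONS:
            facts.append(ExtractedStatement(
                who, "is_located_at", where,
                {"provenance": doc_id, "confidence": confidence, "span": match.span()},
            ))
    return facts


report = "Alpha was seen in Kandahar. Charlie is in Quebec. Bravo is located at Quebec."
facts = extract("report-001", report)
for fact in facts:
    print(fact.subject, fact.predicate, fact.obj, fact.metadata["provenance"])
```

In terms of the dataspace layout, the report text would belong to segment 1 and the returned statements, with their provenance and uncertainty metadata, to segment 2.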
In the military intelligence context, imagery data sources (images, videos) are currently managed using metadata according to standard agreements (e.g. STANAG 4559) to facilitate information sharing (e.g. coalition operations). The next step in our architecture will be to adapt and enrich the data ingestion process for this type of source, possibly including automated information extraction.

C. Ontology and semantic enrichment

The proposed integration approach rests on the exploitation of domain ontologies to facilitate the harmonization of data models in a flexible and incremental manner.

1) Ontology engineering

Ontologies describe flexible and extensible conceptual models that explicitly represent the concepts in a domain of interest and the relationships that exist between them. Ontologies have been considered as an enabler for information integration and have also been exploited in support of information management or reasoning to meet different needs:

   • To provide a standardized vocabulary and a taxonomy of the concepts in the domain of interest and facilitate information sharing;
   • To support text analysis and semantic annotation;
   • To perform federated semantic searches;
   • To perform automated reasoning on top of the ontology and business rules;
   • As a knowledge base (instances, relationships) to capture information about the domain.

In the military domain, ontologies have been developed over the last decade to meet various requirements: ontologies in support of command and control [13], low-level and high-level information fusion, in particular situation and threat assessment [1], or intelligence analysis [18, 2].

At DRDC, domain ontologies have been developed and exploited in order to fulfill command and control as well as intelligence requirements in different specific application contexts, namely:

   • Maritime domain ontology in support of threat analysis and anomaly detection.
   • Situation awareness ontologies to support knowledge management and knowledge mapping applications.
   • Ontologies related to terrorism and Improvised Explosive Devices (IED) for ontology-based semantic annotation of texts in support of intelligence collation.

In the evolving military context, such as counterinsurgency and counter-terrorism, cyber-warfare, and civil-military operations, the human terrain is a key component. The National Geospatial-Intelligence Agency (NGA) has undertaken the development of human geography data standards and models that define top-level constructs and a set of sub-models encompassing topics of interest such as religion, language, demographics, ethnicity, groups, and culture, among others. The key high-level concepts are Feature, to represent temporally persistent real-world phenomena; Event, to represent instantaneous or short-duration real-world phenomena; Actor, to represent an intentional entity that acts or has the capability of acting as a participant in an event (individuals, groups); and Information, to collect non-geometric properties of other entity types. Based on these models, we have developed an ontology of human geography to formally represent the entities present in these models, thus enabling automated reasoning upon them. These models provide knowledge to support applications such as the Intelligence Preparation of the Operational Environment, terrain analysis, and social network analysis that require a formal representation of the human terrain elements.

In some of our previously developed ontological models, concepts are derived from the hierarchical structure of the JC3IEDM (Joint Consultation, Command and Control Information Exchange Data Model) and its successor, the MIP Information Model, revisited and represented as a UML model. The model decomposes battlespace entities along Objects and Action/Event high-level concepts. Consequently, key high-level concepts contained in such ontologies comprise: individuals, groups and organizations, events that occur and activities that are conducted in the area of operations, their location, the characterization of the reported information, etc. Ontologies also formally represent the relationships that may exist between these entities. The spatio-temporal dimension inevitably associated with these concepts has to be modelled accordingly.

Domain ontologies are developed incrementally by adapting recognized multi-stage development methodologies, leveraging military models and doctrine documents as much as possible. Such development approaches promote a modular, layered approach to ontology construction, built on top of foundational or upper ontologies (e.g. SUMO, BFO, DOLCE, etc.) that represent generic concepts, which can be further extended to represent more specific concepts in the domain of interest according to a hierarchical taxonomic structure.

In the intelligence domain, the set of concepts of interest is derived from a thorough analysis of key processes and data sources, e.g. collation and analysis phases, in order to capture the essential entities in the ontological model. While elements of such knowledge are captured in some existing models, it is of interest to develop the corresponding ontological models and integrate them on top of some upper-level ontologies. The high-level concept taxonomy of our ontologies and the existing upper ontologies mentioned above present similarities in their high-level decomposition. BFO (Basic Formal Ontology) [13, 18] as well as the UCore Semantic Layer are models that we are leveraging to benefit from prior modeling efforts. We are revisiting and integrating them as part of this work.

Moreover, domain ontologies are being extended as new data sources or applications require additional concepts to be considered, and as the domain evolves (e.g. human terrain, cyber). As mentioned in [18], rigorous management and governance principles have to be applied to ensure consistency and non-redundancy.
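The layered, taxonomic organization discussed in this subsection can be made concrete with a small sketch. The top-level names Feature, Event, Actor and Information come from the human geography model cited above; the `Entity` root, the lower-level concepts, and both helper functions are hypothetical and serve only to show subsumption along a hierarchy and a naive redundancy check of the kind a governance process might run.

```python
# Hypothetical layered taxonomy: child concept -> parent concept.
SUBCLASS_OF = {
    "Feature": "Entity",
    "Event": "Entity",
    "Actor": "Entity",
    "Information": "Entity",
    "Individual": "Actor",
    "Group": "Actor",
    "Organization": "Group",
    "Attack": "Event",
}


def ancestors(concept, taxonomy=SUBCLASS_OF):
    """Return the chain of parents from the concept up to the root."""
    chain = []
    seen = {concept}
    while concept in taxonomy:
        concept = taxonomy[concept]
        if concept in seen:
            raise ValueError(f"cycle detected at {concept}")
        seen.add(concept)
        chain.append(concept)
    return chain


def is_a(concept, candidate, taxonomy=SUBCLASS_OF):
    """True if concept equals candidate or is subsumed by it."""
    return concept == candidate or candidate in ancestors(concept, taxonomy)


def redundant_labels(labels):
    """Group concepts whose labels collide after case and whitespace normalization."""
    by_label = {}
    for concept, label in labels.items():
        by_label.setdefault(label.strip().lower(), []).append(concept)
    return {label: sorted(names) for label, names in by_label.items() if len(names) > 1}


print(ancestors("Organization"))
print(is_a("Attack", "Event"), is_a("Attack", "Actor"))
print(redundant_labels({"Group": "group", "ArmedGroup": "Group ", "Event": "event"}))
```

A real ontology would rely on an OWL reasoner rather than hand-written traversal; the sketch only shows why a shared upper layer makes concepts from different domain ontologies comparable.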




Domain ontologies are developed using the OWL language, based on Description Logic, because of its popularity, its interoperability (which facilitates the reuse of ontology parts), and its balance of expressiveness and tractability for representing domain knowledge. Consistency checking tools are used to ensure that the developed ontologies are free of inconsistencies.

2) Semantic Enrichment

Semantic Enrichment (SE) [17] is a process for horizontal data integration based on the use of ontologies to integrate and semantically enhance data models. The enhancement is accomplished by annotating (tagging) the models with the terms of the ontology(ies), thus linking together the various resources in a semantically coherent way.

According to the layered organizational structure of data in the unified dataspace, the suite of ontologies and source data models are part of segment 3. Mappings between terms of the ontologies and labels in the data models are explicitly defined at this level too, so that data models are harmonized using the semantic layer.

Consequently, using this extra semantic layer, additional semantic power (inferencing) can be exploited by query engines or reasoners (e.g. exploiting “same_as” relations between terms linked by the same concept in the reference ontology).

To complete the semantic enrichment approach, which consists of semantically linking data, unstructured documents are also processed by exploiting the terms and structure of ontologies.

D. Data search and analytics

As mentioned above, this work leverages and extends previous research we have conducted in support of intelligence, e.g. the provision of information management and exploitation services to support the analyst's activities: semantic search engines, filtering, notification/alert services, etc.

The focus in the present research is to provide scalable solutions for large-scale data management and analysis. Consequently, we are investigating various techniques and solutions that fulfill analysts' increasing needs in terms of:

   • Analytics from large data sets: data mining, data/document clustering, data correlation among various data sets, etc.
   • Efficient search and retrieval within unstructured and structured data sets.

Multi-intelligence data are ingested into the dataspace segments 1 and 2 as presented above. Consequently, efficient indexing and search techniques and tools have to be proposed both for data in segment 1 (unstructured world) and in segment 2 (structured data). While analytics tools benefit from Big Data technologies (batch distributed processing), the required search tools have to deliver results in real time. Some techniques are discussed in section V.

E. Interface with intelligence reasoning modules

While MIDIS first aims at integrating intelligence data from heterogeneous data sources for further retrieval and exploitation, it is part of a comprehensive architecture (ISTIP) for the analysis and production of intelligence. Thus, interfaces to facilitate data flow/transformation between the UDS and reasoning components are required (cf. Fig. 1). Consequently, we provide mechanisms and services to export data through a transformation process into appropriate formats to/from existing intelligence analysis modules.

   • Intelligence reasoning services make use of various rich data formats required as input by their engine (e.g. rule-based reasoning and/or case-based reasoning), e.g. propositions, situation model, spatial feature, hypotheses structures.
   • Data can also be exported as RDF into a graph representation to be used by various reasoning services, e.g. social network analysis algorithms.

Conversely, data produced by the various reasoning modules can be persisted in the dataspace. They are ingested back as new data in the UDS through the appropriate transformation process, and thus made discoverable for subsequent processing.

V.     TECHNOLOGICAL ASPECTS

The implementation of our multi-intelligence data integration system leverages emerging Big Data and SOA technologies.

A. Big Data Technologies

To cope with the processing of ultra-large scale data sets, Big Data technologies exploit distributed storage and processing. The open source Apache Hadoop Framework [5] allows for the distributed processing of large data sets across clusters of computers using simple programming models. It provides several components, including the MapReduce distributed data-processing model, the Hadoop Distributed File System (HDFS), and the HBase [6] distributed table store. These main components and emerging tools are being exploited for the implementation of our integration architecture (Cloudera's platform).

1) Data ingestion

Data ingestion benefits from Hadoop MapReduce distributed processing for large data sets. As presented above, structured data ingestion is done by using an XML configuration file for each data format. Data files are then parsed via MapReduce and loaded into the UDS.

Artefact data are stored in HDFS in segment 1, structured data are stored in HBase in segment 2, and data models in segment 3 are stored in HBase as well.

Knowledge extraction from textual documents using semantic text analysis services was not initially implemented




using parallel processing. We are considering adapting them to the Hadoop environment to benefit
from distributed processing of large document corpora, and are also looking at alternative
approaches such as those proposed in Lin and Dyer’s book [8]. Additional envisioned services for
extracting value from textual intelligence report datasets include cross-document co-referencing
in HDFS.

   2) Indexing / Query
    For users (or services) to retrieve relevant information from the HBase UDS in near
real-time, we aim at providing efficient indexing and query solutions.

    First, among out-of-the-box query tools, the Hive query engine has demonstrated poor
performance. We are experimenting with the recent Cloudera Impala query engine; performance is
improved because it supports direct queries on HBase indexes and does not use MapReduce.

    Moreover, several kinds of input data to the UDS will take the form of RDF triples (metadata
extracted from text, imagery data tagging, data extracted from content management systems,
etc.). Conceptually, UDS segment 2 can be considered an HBase quad store in which the fourth
element added to the triple refers to the source (named graph). We are looking at techniques to
perform efficient queries to retrieve RDF data in this context (e.g. extraction of graphs for
social network analysis).

    One interesting approach is provided by Rya [14], which introduces storage methods, indexing
schemes, and query processing techniques that scale to billions of RDF triples across multiple
nodes, while providing fast and easy access to the data through conventional query mechanisms
such as SPARQL. Rya stores triples by indexing them across three different tables corresponding
to permutations of the triple pattern, i.e. (Subject, Predicate, Object), (Predicate, Object,
Subject), and (Object, Subject, Predicate). We are experimenting with this approach, and are
exploiting OpenRDF Sesame (SPARQL) for HBase [10].

    Preliminary tests are being done with various data sources, as well as with the LUBM
benchmark dataset [9], to assess the performance and compare it with other approaches.

   3) Analytics
    While intelligence analysis requires specialized reasoning tools and human intervention,
Big Data analytics may reveal interesting insights from the analysis of large data sets (e.g.
predictive/trend analysis) by using appropriate techniques such as data mining. Apache Mahout is
one of the first distributed open source machine-learning frameworks built on top of Hadoop. It
is a candidate for data clustering, classification, collaborative filtering, recommendation, or
profiling that we are considering in order to demonstrate the added value obtained from data
using Big Data analytics.

B. SOA
    Service Oriented Architecture (SOA) has emerged as the predominant paradigm for the building
of flexible and scalable architectures in net-centric environments. SOA is an architectural
discipline that relies upon the exposure of a collection of loosely-coupled, distributed
services which communicate and interoperate via agreed standards across the network. Its main
benefits follow directly from the principles of service orientation: services are loosely
coupled, autonomous, discoverable, composable and reusable. Consequently, SOA principles offer
an appropriate approach to data integration. The services can be composed into higher-level
applications to support agile business processes. By augmenting the data services layer, and
incorporating integration services as described above, the data integration environment will
facilitate data access and discovery, integration of data from diverse sources, and handling of
large volumes of data.

    The envisioned set of services complements the SOA-based Intelligence Science and Technology
Integration platform (ISTIP) in development at DRDC Valcartier. This platform already
incorporates a set of data representation schemes and relevant services in support of various
intelligence analysis tasks and sense-making activities: the analysis of textual documents
(semantic annotation of text based on domain ontologies, and automated extraction of facts from
documents based on pattern matching rules), as well as multiple reasoners (rule-based reasoner,
case-based reasoner, multiple hypotheses situation analysis) [15]. Our contribution will augment
the platform with additional intelligence data services, using flexible and efficient
representation schemes. This will facilitate the linking of data among the various sources, in
order to make sense of the large amount of data made available to analysts, and provide improved
situational awareness.

VI. CONCLUSIONS

    In this paper, we have presented the ongoing work that we are conducting for the development
of a scalable and flexible intelligence data integration and analysis platform. As part of this
initiative, we leverage our previous R&D work using semantic technologies, in particular the
suite of ontologies and services that are part of our ISTIP platform. Moreover, we are
leveraging a proposed integration approach [22] and adapting it to our needs. We are currently
developing data integration components by experimenting with recent Big Data technologies to
address scalability and performance.

    Big Data technologies represent a shift in terms of programming approach, and their promise
is generating increasing interest within the data/information management community. However, the
proposed solutions are still immature, and initial experiments show that they require
incremental development and testing stages to improve performance. In our military intelligence
context, Big Data performance is critical if these technologies are to be used in tactical
environments.

    While we aim at providing a comprehensive data management and exploitation platform, further
research is required to deal with entity resolution, disambiguation, data
cleaning, etc. in this context. Recent research proposed in the Big Data world should provide
relevant insight.

    A data integration platform can be viewed as a prerequisite to multi-source information
fusion. Work within the hard/soft information fusion community addresses similar challenges,
and we looked at them from an architecture perspective. The management of data uncertainty
should be considered beyond simple metadata when integrating intelligence data from
heterogeneous sources.

    We are also investigating approaches to the integration and exploitation of internet open
sources in support of intelligence analysis, in particular from social media (e.g. Twitter).

REFERENCES

[1]  Boury-Brisset, A.-C. Ontology-based approach for Information Fusion, in Proceedings of the
     6th International Conference on Information Fusion, Cairns, 8-11 July, Australia,
     pp. 522-529, 2003.
[2]  V. Dragos, Developing a core ontology to improve military intelligence analysis, in
     International Journal of Knowledge-based and Intelligence Engineering Systems, 17,
     pp. 29-36, IOS Press, 2013.
[3]  Gartner, Hype cycle for Big Data, Gartner research report, 2010, also published in 2012.
[4]  GATE, General Architecture for Text Engineering, http://gate.ac.uk/index.html.
[5]  Hadoop. http://hadoop.apache.org/.
[6]  HBase. http://hbase.apache.org/.
[7]  M. Lenzerini, Data integration from a theoretical perspective, in Proceedings of the
     twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2002.
[8]  Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. Morgan and
     Claypool Publishers, 2013.
[9]  Guo, Yuanbo, Pan, Zhengxiang and Heflin, Jeff. LUBM: A Benchmark for OWL Knowledge Base
     Systems. Web Semantics. 3(2) July 2005.
[10] OpenRDF. http://www.openrdf.org/.
[11] M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: A new abstraction for
     information management. SIGMOD Record, 34(4):27-33, December 2005.
[12] S. Jeffery, M. Franklin, and A. Halevy. Pay-as-you-go user feedback in dataspace systems.
     In Proc. of SIGMOD, 2008.
[13] B. Mandrick, Creating an extensible command and control ontology, in Int. Journal of
     Intelligent Defence Support Systems, Vol. 4, No. 3, 2011.
[14] R. Punnoose, A. Crainiceanu, and D. Rapp. Rya: a scalable RDF triple store for the clouds.
     In Proceedings of the 1st International Workshop on Cloud Intelligence (Cloud-I '12). ACM,
     New York, NY, USA, 2012.
[15] Roy, J. and Auger, A., The Multi-Intelligence Tools Suite - Supporting Research and
     Development in Information and Knowledge Exploitation, 16th International Command and
     Control Research and Technology Symposium (ICCRTS), "Collective C2 in Multinational
     Civil-Military Operations", Québec City, Canada, June 21-23, 2011.
[16] Salmen D., Malyuta T., Hansen A., Cronen S., and Smith B., Integration of Intelligence Data
     through Semantic Enhancement, in Proceedings of the 6th international conference on
     Semantic Technology for Intelligence, Defense, and Security (STIDS 2011), George Mason
     University, Fairfax, Virginia, November 2011.
[17] B. Smith, T. Malyuta, W. S. Mandrick, C. Fu, K. Parent, M. Patel, Horizontal Integration of
     Warfighter Intelligence Data. A Shared Semantic Resource for the Intelligence Community,
     Proceedings of the Conference on Semantic Technology in Intelligence, Defense and Security
     (STIDS), George Mason University, Fairfax, VA, October 23-25, 2012.
[18] B. Smith, T. Malyuta, D. Salmen, W. Mandrick, K. Parent, S. Bardhan, J. Johnson, Ontology
     for the Intelligence Analyst, CrossTalk: The Journal of Defense Software Engineering,
     November/December 2012, 18-25.
[19] J. Singleton, Data integration: charting a path forward to 2035, Air War College, research
     report, Feb. 2011.
[20] S. Yoakum-Stover and T. Malyuta, "Unified data integration for Situation Management," IEEE
     MILCOM 2008.
[21] Yoakum-Stover S., Malyuta T., Antunes N., A Data Integration Framework with Full Spectrum
     Fusion Capabilities, Sensor and Information Fusion Symposium, Las Vegas, 2009.
[22] S. Yoakum-Stover, A. Eick, Breaking the Data Barriers, DGI, London, 2012.
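
    The three-permutation triple indexing described for Rya [14] in the Indexing / Query
subsection can be illustrated with a small sketch. The Python below is not Rya's implementation
and does not use HBase: sorted in-memory lists stand in for the three tables, and the
TripleIndex class, the SEP delimiter and the row-key layout are assumptions introduced only for
illustration. It shows how keeping the same triples under the (Subject, Predicate, Object),
(Predicate, Object, Subject) and (Object, Subject, Predicate) orders lets a triple pattern with
bound terms be answered by a prefix scan over a single table.

```python
from bisect import bisect_left

SEP = "\x00"  # hypothetical delimiter; terms are assumed not to contain it

# Term positions (0=subject, 1=predicate, 2=object) in each table's row key.
ORDERS = {
    "spo": (0, 1, 2),
    "pos": (1, 2, 0),
    "osp": (2, 0, 1),
}


class TripleIndex:
    """Sorted in-memory lists standing in for three sorted key-value tables."""

    def __init__(self):
        self.tables = {name: [] for name in ORDERS}

    def add(self, s, p, o):
        """Store one triple under all three key orders."""
        triple = (s, p, o)
        for name, order in ORDERS.items():
            # Every term is followed by SEP, so prefixes end on term boundaries.
            key = "".join(triple[i] + SEP for i in order)
            table = self.tables[name]
            pos = bisect_left(table, key)
            if pos == len(table) or table[pos] != key:  # skip duplicates
                table.insert(pos, key)

    def match(self, s=None, p=None, o=None):
        """Answer a triple pattern (None = unbound) with one prefix scan.

        Returns the name of the table used and the matching (s, p, o) triples.
        """
        pattern = (s, p, o)
        n_bound = sum(term is not None for term in pattern)
        for name, order in ORDERS.items():
            prefix_terms = []
            for i in order:
                if pattern[i] is None:
                    break
                prefix_terms.append(pattern[i])
            # Usable only if all bound terms form the leading part of the key.
            if len(prefix_terms) == n_bound:
                return name, self._scan(name, order, prefix_terms)
        raise AssertionError("every pattern is covered by one of the orders")

    def _scan(self, name, order, prefix_terms):
        prefix = "".join(term + SEP for term in prefix_terms)
        table = self.tables[name]
        results = []
        # Keys sharing a prefix are contiguous in a sorted table.
        for key in table[bisect_left(table, prefix):]:
            if not key.startswith(prefix):
                break
            parts = key.split(SEP)
            triple = [None, None, None]
            for key_pos, term_pos in enumerate(order):
                triple[term_pos] = parts[key_pos]
            results.append(tuple(triple))
        return results


if __name__ == "__main__":
    index = TripleIndex()
    index.add("ex:alice", "ex:knows", "ex:bob")
    index.add("ex:bob", "ex:knows", "ex:carol")
    index.add("ex:alice", "ex:worksFor", "ex:acme")
    print(index.match(p="ex:knows"))              # served by the "pos" table
    print(index.match(s="ex:alice", o="ex:bob"))  # served by the "osp" table
```

    With these three orders, each combination of bound and unbound terms maps to one table: a
pattern with only the predicate bound is served by the (Predicate, Object, Subject) table, and
one with subject and object bound by the (Object, Subject, Predicate) table. A quad store as
discussed for UDS segment 2 would additionally carry the named graph, which this sketch omits.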



