Contributions to a Semantically Based Intelligence Analysis Enterprise Workflow System

Robert C. Schrag, Jon Pastor, Chris Long, Eric Peterson, Mark Cornwell, Lance A. Forbes, and Stephen Cannon

Abstract—We have contributed key elements of a semantically based intelligence analysis enterprise workflow architecture: a uniformly accessible semantic store conforming to an enterprise-wide ontology; a branching context representation to organize workflow components' analytical hypotheses; a logic programming-based, forward-chaining query language for components to access data from the store; and a software toolkit embracing all the foregoing to streamline the process of introducing additional legacy software components as semantically interoperable workflow building blocks.

We explain these contributions, focusing particularly on the toolkit. For certain widely used input/output formats—e.g., comma-separated value (CSV) files—a knowledgeable user can quickly "wrap" a newly installed component for workflow operation by providing a compact and entirely declarative specification that uses the query language to map specific relation arguments in the ontology to specific structural elements in the component's native input and output formats.

Our contributions are built to work with AllegroGraph, from Franz, Inc.

Index Terms—Intelligence analysis, enterprise workflow, hypothesis representation, branching contexts, semantic interoperability, declarative data transformation, software component wrapping

I. INTRODUCTION

We have contributed key elements of a semantically based intelligence analysis enterprise workflow architecture for Tangram, a multi-year, multi-contractor threat surveillance and alerting research and development program sponsored by the United States' Intelligence Advanced Research Projects Agency (IARPA). Tangram's objective has been to automate routine analysis workflows, so that these can be executed as standing processes, on a large scale.

To support the rapidly changing needs of an intelligence enterprise, a workflow authoring tool must be extremely flexible. The enterprise must be able to rearrange components (e.g., pattern matchers, classifiers, group detectors) in the same kind of way that a child rearranges Lego bricks, and it must be able to introduce new software into the enterprise rapidly. However, Lego bricks have a distinct advantage over legacy software components from different sources: they were all created to respect a common interface. One brute-force approach to integrating legacy components is to manually develop code that transforms data from one form (e.g., Java objects) to another (e.g., flat files); that requires O(n²) transforms. Tangram's approach reduces the required number of transforms to O(n), and our toolkit enables knowledgeable users to "wrap" legacy components with such transforms, making the components workflow-ready quickly.

To motivate our contributions, we present the (notional, simplified) two-component workflow in Fig. 1: a suspicion scorer hypothesizes potential terrorists, then a group detector clusters the hypothesized terrorists into hypothesized potential terrorist groups.

[Fig. 1: Suspicion Scoring Component → Group Detection Component]
Fig. 1 A notional intelligence analysis workflow

The workflow in Fig. 1 raises some enterprise-level architecture issues that our contributions address.

1) What are components' input and output data, how is data stored, and how do components access it? We have introduced a uniformly accessible semantic store conforming to an enterprise-wide ontology and a logic programming-based, forward-chaining query language for components to access data from the store.
Component specifications (see Issue 3 below) indicate what data is accessed in particular.

2) How are the hypotheses that analytical components produce distinguished from background data, and how are they communicated among components? As hypotheses, analytical components' outputs must not simply be mixed indiscriminately with more uniformly credible evidence data or with each other. Among other considerations, the broad body of evidence changes over time (leading to different hypotheses), and different components—or different (e.g., control) configurations thereof—can lead to different hypotheses even for the same inputs. We organize the content of the semantic store into distinct RDF graphs that we call "datasets," and (correlating datasets with contexts) represent the outputs of successively applied analytical components as branching contexts (that incrementally add information). Our component specifications and our query language thus include parameters for the datasets that are passed among or otherwise accessed by components. Besides these datasets for hypotheses, the store includes one or more background, or "evidence," datasets and, for convenience, some intermediate (i.e., not necessarily hypothetical) datasets that result from purely logical queries. This treatment of evidence and hypotheses, together with the above-mentioned query language, provides a practical, implemented solution to meet broad Tangram requirements outlined in [5].

3) How can legacy components with arbitrary input/output formats easily be made to interact with the data? The contributions above are integrated in a software toolkit to streamline the process of introducing additional legacy software components as semantically interoperable workflow building blocks. For certain widely used input/output formats—e.g., comma-separated value (CSV) files—a knowledgeable user can quickly wrap a newly installed component for workflow operation by providing a compact and entirely declarative specification that uses the query language to map specific relation arguments in the ontology to specific structural elements in the component's native input and output formats. The toolkit also provides some less fully automated interface options to address more general input/output situations.

Manuscript submitted August 19, 2009. This work was supported in part by the U.S. Government. All authors were with Global InfoTek, Inc., 1920 Association Dr, Suite 600, Reston, VA USA 20191, 703-652-1600 (e-mail: firstinitialLastname@globalinfotek.com). C. Long is now with SET Corp., Arlington, VA, 703-738-6214 (e-mail: clong@setcorp.com). L. A. Forbes is now with Solutions Made Simple, Inc., Reston, VA (e-mail: lforbes@sms-fed.com).

II. ARCHITECTURAL SCHEME OF A WORKFLOW COMPONENT

Fig. 2 presents our general scheme for wrapping legacy components.

[Fig. 2: a wrapped component comprises a query against the common semantic store, a transform from the common ontology to the native format, the native component itself, a transform from the native format back to the common ontology, and an assert to the common semantic store.]
Fig. 2 Component wrapping scheme

Fig. 2 schematizes a single wrapped component that executes processes to:
1) Retrieve input data, expressed in the enterprise's common ontology, from the central semantic store.
2) Format the input data for the legacy component.
3) Invoke the legacy component in its "native" (unwrapped) form.
4) Convert the legacy component's native-format outputs to the common ontology, as metadata-bearing hypotheses.
5) Assert the output hypotheses to the central store.

We implement the central semantic store using AllegroGraph from Franz, Inc. AllegroGraph is a "quad" store that includes, in addition to the "subject," "predicate," and "object" fields standard to RDF and common to triple stores, a "graph" field. We use this field to distinguish among the various datasets that are available as inputs or have been produced as outputs of workflow components.

We provide a knowledge base (KB) query language supporting a wrapped component's query and assertion processes and allowing users to define, for specific analytical purposes, KB query components (which involve no legacy process) that combine elements from one or more existing datasets into one or more output datasets. We implement legacy component wrappers and KB query components using the Prolog and Common Lisp interfaces to AllegroGraph.

Fig. 3 illustrates the meta-data classes (noted in bold) and attributes (with multi-valued attributes starred*) that support the representation of a dataset's context lineage. We take each workflow component's execution, noted in a ProcessExecution (PE) object, as the source of the statements in any output (hypothesis) dataset; lineage is manifested in the connections among datasets, process executions, and workflow executions (noted in WorkflowExecution objects).

[Fig. 3:]
WorkflowExecution
  hasProcessExecution*
ProcessExecution
  hasProcess (e.g., GDA)
  hasPEDatasetInput*
  hasPEDatasetOutput*
  hasPEControlInput*
ProcessExecutionDatasetInput
  hasParameterName (consistent with Process)
  hasInputDataset
ProcessExecutionDatasetOutput
  hasParameterName (consistent with Process)
  hasOutputDataset
ProcessExecutionControlInput
  hasParameterName
  hasValue
Fig. 3 Meta-data classes and attributes for hypothesis datasets

As noted in Section I, the interpretation of datasets as contexts is incremental along a dataset's lineage: in general, any statement that holds in a dataset that is upstream (workflow-wise) from a given dataset D created during a workflow also (implicitly) holds in D. The representation is thus space-efficient. We have not yet found it necessary to implement such transitivity of dataset contexts directly in the KB query language; our current workflow components use just background (evidence) datasets and datasets that their immediate workflow predecessors create.
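This incremental context interpretation can be sketched as follows. The Python below is illustrative only, not part of the toolkit: the store is modeled as a set of subject–predicate–object–graph quads, and a lineage table maps each dataset to its immediate workflow predecessor; the function and variable names are ours.

```python
# Illustrative sketch (not the toolkit's implementation) of the incremental
# context interpretation: a statement (implicitly) holds in dataset D if it
# was asserted in D or in any dataset upstream of D along the workflow lineage.

def holds_in(store, lineage, statement, dataset):
    """store: set of (s, p, o, graph) quads; lineage: dataset -> predecessor."""
    d = dataset
    while d is not None:
        if statement + (d,) in store:
            return True
        d = lineage.get(d)          # step to the immediate upstream dataset
    return False

# Lineage evidence -> links -> output, as in the use case of Section III:
store = {("alice", "rdf:type", "teo:Person", "evidence"),
         ("G0", "teo:orgMember", "alice", "output")}
lineage = {"links": "evidence", "output": "links"}
```

Here a Person statement asserted only in the evidence dataset also (implicitly) holds in the downstream output dataset, without being copied there—which is why the representation is space-efficient.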
III. USE CASE WORKFLOW

Fig. 4 presents a use case workflow including both a wrapped legacy component and a KB query component.

[Fig. 4: the Watchlist-Evidence Dataset Join Component reads ?watchlistGraph and ?evidenceGraph and writes ?linkGraph; the Group Detection Component reads ?linkGraph and writes ?outputGraph.]
Fig. 4 Use case workflow (see Section III)

In Fig. 4, datasets (graphs) are depicted by square-cornered boxes; workflow components are depicted by round-cornered boxes. Each component reads data from one or more input graphs and writes to one or more output graphs. Here, a dataset join KB query component is used to select from broader evidence (right) just information relevant to watchlisted terrorist suspects (left) for processing by a downstream legacy group detection component.

In our toolkit, the defining forms for workflow components are Lisp macro calls. Beyond providing one or more files containing such definitions, toolkit users need never interact directly with Lisp or with AllegroGraph, as we provide alternative interfaces.

IV. KB QUERY COMPONENTS AND QUERY LANGUAGE

The definition for the KB query component used in Fig. 4 appears below.

(defKB-query-component
  group-detection-watchlist-evidence-dataset-join-component
  ((and (q- ?Event !rdf:type !teo:TwoWayCommunicationEvent ?evidenceGraph)
        (q- ?Event !teo:sender ?sender ?evidenceGraph)
        (q- ?Event !teo:receiver ?receiver ?evidenceGraph)
        (q- ?sender !rdf:type !teo:Person ?evidenceGraph)
        (q- ?receiver !rdf:type !teo:Person ?evidenceGraph)
        (q- ?sender !rdf:type !teo:Person ?watchlistGraph)
        (q- ?receiver !rdf:type !teo:Person ?watchlistGraph)
        (a- ?Event !rdf:type !teo:TwoWayCommunicationEvent ?linkGraph)
        (a- ?Event !teo:deliberateActor ?sender ?linkGraph)
        (a- ?Event !teo:deliberateActor ?receiver ?linkGraph)
        (a-- ?sender !rdf:type !teo:Person ?linkGraph)
        (a-- ?receiver !rdf:type !teo:Person ?linkGraph))))

The above component selects events from one dataset (denoted by the logic variable ?evidenceGraph) whose participants also appear in another dataset (denoted by ?watchlistGraph) and asserts the links among them in an output dataset (denoted by ?linkGraph) for consumption by a group detection component. Note the following.

• This component performs a single KB query that implicitly conjoins (logically) the twelve top-level (q-, a-, and a--) forms.
• A q- conjunct succeeds iff a triple (in subject, predicate, object, graph, index—"spogi"—format) exists in the workflow KB. q- is included in the standard Franz Allegro Prolog interface to AllegroGraph.
• a- indicates that a triple is to be written to the specified output dataset. An a- conjunct always succeeds. a- and its duplicate-avoiding twin a-- (below) are our contributions that confer the KB query language's forward-chaining character.
• a-- indicates that a triple is to be written to the workflow KB iff it is not already present there. An a-- conjunct always succeeds.
• !rdf:type is an example of a shorthand that expands to http://www.w3.org/1999/02/22-rdf-syntax-ns#type—the atom type in the namespace for RDF. (!teo: refers to an application-specific ontology.)
• ?Event, ?sender, and other symbols beginning with ? are logic programming (AKA Prolog) variables. In the logic programming style we support, every logic variable becomes bound when the q- conjunct is matched in the KB.
• Prolog will backtrack to execute each conjunct in the KB query for every combination of variable bindings for which the preceding conjuncts succeed.
• The KB query language provides a variety of additional constructs (e.g., and, or, not) in which the usual expressions that appear as top-level conjuncts may be embedded—e.g.,
  (and (not (q- ?P !rdf:type !teo:Terrorist ?evidenceGraph))
       (or (q- ?P1 !rdf:type !teo:Terrorist ?evidenceGraph)
           (q- ?P2 !rdf:type !teo:Terrorist ?evidenceGraph))).
• While the repetition of entity type statements—e.g., (a-- ?sender !rdf:type !teo:Person ?linkGraph)—from the input graph is not strictly necessary given our context interpretation, the Tangram contractors agreed that it would be convenient to include such declarations uniformly in all datasets.

Below are the definitions for some utility KB query components that we provide with the toolkit distribution.

(defKB-query-component 2-input-dataset-union-component
  (DataUnionProcess)
  ((query (q- ?S ?P ?O ?sourceGraph1)
          (a- ?S ?P ?O ?destGraph))
   (query (q- ?S ?P ?O ?sourceGraph2)
          (a- ?S ?P ?O ?destGraph))))

(defKB-query-component 3-input-dataset-intersection-component
  (DataIntersectionProcess)
  ((query (q- ?S ?P ?O ?sourceGraph1)
          (q- ?S ?P ?O ?sourceGraph2)
          (q- ?S ?P ?O ?sourceGraph3)
          (a- ?S ?P ?O ?destGraph))))

(defKB-query-component dataset-de-duplication-component ()
  ((query (q- ?S ?P ?O ?sourceGraph)
          (a-- ?S ?P ?O ?destGraph))))

The (first) dataset union component writes everything it finds in either of its source graphs into its destination graph; the (second) intersection component writes anything it finds in all of its sources into the destination. A workflow author may choose to follow either of these up with the (third) dataset de-duplication component to remove duplicates; note that the author could achieve the same effect by using a-- rather than a- conjuncts in the union components' definitions.
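For exposition, the q-/a- semantics just described can be modeled compactly. The Python below is an illustrative sketch only—the toolkit itself is implemented via the Prolog and Common Lisp interfaces to AllegroGraph—with the store modeled as a plain set of subject–predicate–object–graph tuples and with function names of our own choosing.

```python
# Illustrative model of the KB query semantics (not the toolkit's
# implementation).  Variables are strings beginning with "?".

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def unify(pattern, quad, env):
    """Match a 4-term pattern against a quad, extending the bindings in env;
    return the extended bindings, or None on mismatch."""
    env = dict(env)
    for p, value in zip(pattern, quad):
        if is_var(p) and p in env:
            p = env[p]                      # variable already bound
        if is_var(p):
            env[p] = value                  # bind a fresh variable
        elif p != value:
            return None
    return env

def run_query(store, conjuncts, env=None):
    """Backtrack through the conjuncts: ("q-", s, p, o, g) enumerates matching
    quads; ("a-", ...) asserts and always succeeds.  With a set-valued store,
    a- and the duplicate-avoiding a-- coincide, so both are treated alike."""
    env = {} if env is None else env
    if not conjuncts:
        yield env
        return
    op, *pattern = conjuncts[0]
    if op == "q-":
        for quad in list(store):            # snapshot: assertions may grow it
            extended = unify(pattern, quad, env)
            if extended is not None:
                yield from run_query(store, conjuncts[1:], extended)
    else:                                   # "a-" or "a--"
        store.add(tuple(env.get(t, t) for t in pattern))
        yield from run_query(store, conjuncts[1:], env)

# The 2-input dataset union component, restated as two query/assert pairs:
store = {("e1", "type", "Event", "g1"), ("e2", "type", "Event", "g2")}
for query in [[("q-", "?S", "?P", "?O", "g1"), ("a-", "?S", "?P", "?O", "dest")],
              [("q-", "?S", "?P", "?O", "g2"), ("a-", "?S", "?P", "?O", "dest")]]:
    for _ in run_query(store, query):       # drive backtracking to completion
        pass
```

After the union runs, the destination graph holds a copy of each source statement, mirroring the forward-chaining behavior of a- conjuncts.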
Existing Tangram workflow and process infrastructure required that we specify the fixed (e.g., two-input) arities for the components above. This might not be the case in every workflow setting of interest (see Section VIII). Likewise, it might not be necessary to name (or permanently componentize) every query before it can be used.

V. WRAPPED LEGACY COMPONENTS

Toolkit users define wrappers for legacy/native components using the Lisp macro defWrapped-component, which affords a choice among three distinct interfaces. Non-Lisp-programming toolkit users will want to use one of the first two interfaces described below; Lisp-programming users are most likely to use the first or third.
1) Fully automatic: defWrapped-component writes a comma-separated value (CSV) or other delimited text file (to be consumed by the native component) for each input dataset and automatically reads a delimited text file (produced by the native component) for each output dataset. For native components with delimited text file-oriented input/output, the toolkit user need provide no additional wrapping code.
2) Semi-automatic: defWrapped-component automatically writes an ntriples file for each input dataset and automatically reads an ntriples file for each output dataset. The toolkit user provides additional (presumably non-Lisp), shell-callable wrapping code as necessary to mediate between these ntriples files and the native component.
3) Manual: The toolkit user provides, via an additional argument to defWrapped-component, custom Lisp code to implement the required native component interface. Here we assume that the Lisp programmer will interact directly with AllegroGraph to create suitable inputs for the native component.

In the sequel, we focus primarily on the fully automatic interface.

Consider the GDA group detection algorithm [3] from CMU's Auton Lab, which uses CSV input and output files as shown in Fig. 5. The group detector uses event-based linkages among individuals to infer groups of associating individuals. Each input line indicates evidence that a certain event involves a certain individual. Each output line indicates that a certain individual is hypothesized to belong to a certain group.

Native GDA Input:     Native GDA Output:
Ev-1194,In-10381      group,entity
Ev-709,In-15840       G0,In-10096
Ev-709,In-36232       G0,In-15840
Ev-38749,In-4938      G0,In-19354
Ev-38749,In-48834     G0,In-19540
Ev-34121,In-3007      G0,In-19625
Ev-34121,In-35214     G0,In-21371
Ev-65474,In-21371     G0,In-28719
Ev-65474,In-19354     G0,In-37201
Ev-23484,In-39017     G0,In-37733
Ev-23484,In-16809     G0,In-38634
…                     G0,In-47910
                      G1,In-1002
                      …
Fig. 5 CSV input/output files for the GDA group detection component

Below is a toolkit-based component definition that invokes the automatic CSV file interface to wrap GDA. The (completely declarative) definition specifies that GDA-component-TerroristGroup is an instance of the class GroupDetectionProcess (see [9]). The (keyword) argument :native-input-CSV-file-specs specifies the relation of the input CSV file (to be named "GDA-input-links.csv") to the input dataset (bound to the Prolog variable ?linkGraph).¹ Note that the separating character may be specified, using the :text-delimiter argument, and the presence of a header line via the :headerline argument. The argument :native-output-CSV-file-specs specifies the relation of the output CSV file (to be named "GDA-output-groups.csv") to the output dataset (bound to ?outputGraph). The remaining top-level arguments specify how to invoke the native component. Further explanation follows the definition.

(defWrapped-component GDA-component-TerroristGroup
  (GroupDetectionProcess)
  :native-input-CSV-file-specs
  (("GDA-input-links.csv"
    :query
    (query
      (q- ?E !teo:deliberateActor ?P ?linkGraph))
    :query-type select
    :headerline nil
    :text-delimiter ","
    :query-template (?E ?P)))
  :native-output-CSV-file-specs
  (("GDA-output-groups.csv"
    :query
    (query
      (a- ?G !teo:orgMember ?P ?outputGraph)
      (a-- ?G !rdf:type !teo:TerroristGroup ?outputGraph)
      (a-- ?P !rdf:type !teo:Terrorist ?outputGraph))
    :headerline t
    :CSV-template (?G ?P)
    :namespace-template
    ("http://anchor/teo#" "http://anchor/teo#")))
  :native-component-directory "GDA_DISTRIBUTION"
  :native-component-command-name "gda_applic"
  :native-component-command-arguments
  ("GDA-output-groups.csv" "GDA-input-links.csv"))

¹ The full interface supports any number of native input and of native output delimited text files and corresponding datasets/graphs.

Fig. 6 illustrates how the :native-input-CSV-file-specs argument is processed.

[Fig. 6:]
General Query Conjunct: (q- ?E !teo:deliberateActor ?P ?linkGraph)
Instantiated Query Conjunct: (q- !teo:Ev-1194 !teo:deliberateActor !teo:In-10381 ?linkGraph)
General Query Template: (?E ?P)
Instantiated Query Template: (!teo:Ev-1194 !teo:In-10381)
Native GDA Input File: Ev-1194,In-10381 / Ev-709,In-15840 / Ev-709,In-36232 / …
Fig. 6 Automatic CSV file input mechanism

First, we execute the input query against the input dataset (graph). At top right, Fig. 6 illustrates how the query's single (general) conjunct is first specifically instantiated, binding the conjunct's variables to values for which a triple exists in the input graph. The :query-template argument specifies how the query's bound variable values should be ordered in the CSV file. At bottom, Fig. 6 illustrates the intermediate step of instantiating the query template, based on the instantiated query conjunct. At left, Fig. 6 shows how we generate one CSV file line per query instantiation.² (Note that the RDF namespace, !teo:, is removed, as it is not useful to the native component.)

² This is per the value select specified for the :query-type argument, which indicates that duplicate links (useful to GDA) are to be retained in the input dataset. By instead using the (default) value select-distinct, the user may alternatively specify one line per unique query instantiation (thus removing duplicates).

Fig. 7 illustrates how the native component is (next) invoked by the workflow execution system. Execution takes place in a temporary directory specific to the given workflow and component instance.

[Fig. 7:]
Directory: $GU_CORE/GDA_DISTRIBUTION
Command-name: gda_applic
Command-arguments: GDA-output-groups.csv GDA-input-links.csv
Fig. 7 Automatic CSV file native component calling mechanism

Fig. 8 illustrates how the :native-output-CSV-file-specs argument is (next) processed. The process is here roughly the reverse of that in Fig. 6. At bottom, Fig. 8 illustrates how we first interpret each line of the output CSV file (at right) using the template specified (via the :CSV-template argument), instantiating the template and binding query variables. Again, the template indicates the order of each bound Prolog variable in each line of the CSV file. Note the final template instantiation step that inserts appropriate RDF namespaces (per the :namespace-template argument). At right, Fig. 8 illustrates how these bindings are used to instantiate each specified output assertion (query conjunct). Each assertion is executed to add a triple to the semantic store (with appropriate treatment of duplicates).

[Fig. 8:]
Native GDA Output File: group,entity / G0,In-10096 / G0,In-15840 / …
General CSV / Query Template: (?G ?P)
Instantiated CSV Template: (G0 In-10096)
Instantiated Query Template: (!teo:G0 !teo:In-10096)
Gen.: (a- ?G !teo:orgMember ?P ?outputGraph) → Inst.: (a- !teo:G0 !teo:orgMember !teo:In-10096 ?outputGraph)
Gen.: (a-- ?G !rdf:type !teo:TerroristGroup ?outputGraph) → Inst.: (a-- !teo:G0 !rdf:type !teo:TerroristGroup ?outputGraph)
Gen.: (a-- ?P !rdf:type !teo:Terrorist ?outputGraph) → Inst.: (a-- !teo:In-10096 !rdf:type !teo:Terrorist ?outputGraph)
Fig. 8 Automatic CSV file output mechanism

VI. CONCEIVED FULL AUTOMATION FOR COMPONENTS WITH XML INPUT/OUTPUT FILES

While delimited text input/output formats are quite prevalent, they are by no means the only structured formats of interest. We have also designed (not yet implemented) a similar, declaratively specified wrapping capability for components with XML file input/output. The general idea is to embed a similar query specification into the XML file where data is to be read or written. Another alternative, on the input side (only), would be integration of XPath and XQuery with logic programming. (See [1] for a recent survey.)
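In either direction, the template machinery of Figs. 6 and 8 amounts to ordering bound values into delimited lines and back, with namespaces stripped on output to the native component and re-attached on input from it. A minimal Python sketch (the function names and the dictionary representation of bindings are ours, not the toolkit's):

```python
import csv
import io

TEO = "http://anchor/teo#"   # the example namespace of Figs. 6 and 8

def bindings_to_csv(solutions, template, namespace=TEO):
    """Input direction (cf. Fig. 6): order each solution's bound values per
    the template and strip the namespace, one CSV line per instantiation."""
    out = io.StringIO()
    writer = csv.writer(out)
    for env in solutions:
        writer.writerow([env[var].replace(namespace, "") for var in template])
    return out.getvalue()

def csv_to_bindings(text, template, namespaces, headerline=True):
    """Output direction (cf. Fig. 8): bind each line's fields to the template
    variables, re-attaching a namespace to each field."""
    rows = list(csv.reader(io.StringIO(text)))
    if headerline:
        rows = rows[1:]                     # discard the header line
    return [dict(zip(template, (ns + field
                                for ns, field in zip(namespaces, row))))
            for row in rows]

# One solution of the input query, and two lines of GDA output:
line = bindings_to_csv([{"?E": TEO + "Ev-1194", "?P": TEO + "In-10381"}],
                       ["?E", "?P"])
bound = csv_to_bindings("group,entity\nG0,In-10096\n", ["?G", "?P"],
                        [TEO, TEO])
```

In the real toolkit these roles are played by the declarative :query-template, :CSV-template, :headerline, and :namespace-template arguments shown above.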
VII. THE WRAPPING PROCESS

The toolkit's comprehensive documentation (available from the first author) details the following steps included in the end-to-end process of wrapping and then deploying components.
1) Install the wrapping toolkit.
2) Install the native component so that it will be accessible to the wrapper.
3) Define any KB query component(s) needed to select appropriate data from any broader dataset(s).
4) Define the wrapper for the native component.
5) Test both KB query and wrapped native components to ensure effective operation. We have developed and applied a testing framework that includes component concurrency (i.e., re-entrance) testing.
6) Deploy the developed and tested components.

These steps may of course be undertaken by different classes of users. E.g., in a component wrapping team (of which an enterprise may have several), one member (the "installer") may be primarily responsible for software installations; another (the "developer") may be expert with the enterprise's ontology, workflows, and datasets, the KB query language, and the component defining forms; still another (the "tester") may primarily have testing responsibilities, and another (perhaps the "installer" again) deployment responsibilities. "Scripters" might write custom Lisp wrapping code or shell scripts or other command line-callable programs to perform data transformations not (yet) supported by toolkit (semi-)automation.

For each component to be wrapped, the wrapping team also should include, or at least have access to, a component "champion" who knows what enterprise function(s) the component must accomplish and understands how the component works well enough to address any wrapping issues (e.g., whether duplicate assertions are or are not appropriate, what native component control parameters are appropriate). The champion should bring one or more exemplary use cases (preferably expressed in terms of the enterprise's datasets and ontology) and should help the wrapping team realize the use case(s) in component (and workflow) definitions.³

³ Consider that a champion may also bring a new data source that may require extensions or other modifications to the enterprise ontology. Addressing such issues has been the responsibility of a different Tangram contractor.

Finally, the component wrapping team always should be able to present new requirements to the toolkit development team (who may serve multiple enterprises).
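The runtime cycle that a deployed wrapped component then performs—Section II's steps 1–5, using Fig. 7's temporary-directory calling convention—can be sketched as follows. This is an illustrative Python harness, not the toolkit's implementation: the file names, the use of stdout to capture the native output, and the stand-in POSIX sort command are our assumptions, not GDA's conventions.

```python
import csv
import subprocess
import tempfile
from pathlib import Path

def invoke_native(command, input_rows):
    """Write the native input CSV, run the native command in an
    execution-specific temporary directory (cf. Fig. 7), capture its stdout
    as the native output file, and read that output back as rows."""
    workdir = Path(tempfile.mkdtemp(prefix="wrapped-run-"))
    in_path = workdir / "native-input.csv"
    out_path = workdir / "native-output.csv"
    with in_path.open("w", newline="") as f:
        csv.writer(f).writerows(input_rows)
    with out_path.open("w") as out:         # run inside the temp directory
        subprocess.run(command + [in_path.name], cwd=workdir,
                       stdout=out, check=True)
    with out_path.open(newline="") as f:
        return list(csv.reader(f))

# Stand-in "native component": POSIX sort, which just orders the input lines.
rows = invoke_native(["sort"], [["b", "2"], ["a", "1"]])
```

In the toolkit, the transforms on either side of this invocation are what the declarative :native-input-CSV-file-specs and :native-output-CSV-file-specs arguments generate automatically.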
We developed the toolkit during roughly six months of concentrated effort, to serve both the broader Tangram community and ourselves. Starting with the use case presented in Section III, we developed first the KB query language and KB query components, then progressively more automatic interfaces with which we wrapped GDA (initially). We also have used (or assisted others to use) the toolkit to wrap the ORA group detection algorithm [2], suspicion scorers based on the Proximity [6] and NetKit [4] classifiers, and the pattern matchers LAW [8] and CADRE [7].

We have met the Tangram program's toolkit usability goals: as knowledgeable users, we can usually (for components with inputs/outputs amenable to the toolkit's fully automatic interface) complete Steps 3 and 4 of the above wrapping process within a single staff hour.

VIII. RELAXING THE CONTEXT MONOTONICITY ASSUMPTION

Implicit in the semantics of current Tangram workflow processing is the following monotonicity assumption: a component's output graph(s) only add(s), logically, to the information in its input graph(s), never delete(s) or retract(s). This is not entirely practical.

The need to manage potentially conflicting source information and analytic hypotheses is ubiquitous in an intelligence analysis enterprise. An analyst, surrounded with data and applicable tools or methods, may choose to pursue one line of reasoning at one time and another later, and different analysts may take different approaches and may build on each other's analyses or workflow products. Each such approach—a combination of data, tools, methods, and earlier hypotheses—represents a context for analytical reasoning. It is important within the enterprise for each analyst to understand the actual context of each piece of information that s/he might examine and exploit in further analysis—in which s/he may either extend an existing context or branch to create a new subcontext.

Different contexts may arise in workflow-supported analytical reasoning for different reasons, including:
• Differences in supporting data, from:
  o Conflicting original data sources.
  o Time-varying data conditions for a given source, such as:
    - Disbelief in something we earlier had belief in (perhaps because it had been supplied in error).
    - Belief in something we did not have belief in (perhaps because we had no data about it).
• Differences in supporting analytical hypotheses, from:
  o Analyst's conjecture, or "what-if" analysis (that may effect belief or disbelief in data as discussed above).
  o Differences in workflow components giving rise to different answers, when:
    - A given workflow function has alternative realizations in different components.
    - A given component has alternative configurations of control parameters.

We have commenced efforts to address these issues both formally and with appropriate workflow system infrastructure.

IX. CONTRIBUTIONS' RELEVANCE BEYOND TANGRAM

The use case workflow in Section III includes a generic "Group Detection Component." While we've noted (in Section V) that GDA-component-TerroristGroup is an instance of the class GroupDetectionProcess, we haven't said anything yet about how such a specific component instance is selected from among the available alternatives for such a general process class. Beyond enabling semantic interoperability of enterprise workflow components, IARPA's broader objectives in Tangram have included providing technology for characterizing, for a given generic workflow process, the likely performance of a given specific component with data inputs having certain characteristics, so that the workflow management system can select the component likely to perform best in any given circumstance. Our toolkit supports this objective by automating the formal description and registration of newly defined components in Tangram's process catalog [9].

It's worth noting that all of the toolkit's other heretofore-described capabilities remain applicable in the (perhaps more pragmatic) setting where users specify particular components for all workflows themselves.

REFERENCES

[1] Almendros-Jiménez, J. M., Becerra-Terón, A., Enciso-Baños, F. J.: Querying XML documents in logic programming, Theory Pract. Log. Program. 8, 3 (May 2008), 323–361.
[2] Carley, K. M., Dereno, M.: ORA—Organizational Risk Analyzer. Tech. rep. CMU-ISRI-06-113, Carnegie Mellon University, August 2006.
[3] Kubica, J., Moore, A., Schneider, J.: Tractable group detection on large link data sets, Third IEEE International Conference on Data Mining (ICDM-2003), pp. 573–576, 19–22 Nov. 2003.
[4] Macskassy, S. A., Provost, F.: NetKit-SRL: A Toolkit for Network Learning and Inference, in Proceedings of the NAACSOS Conference, June 2005.
[5] Murray, K., Harrison, I., Lowrance, J., Rodriguez, A., Thomere, J., Wolverton, M.: PHERL: an Emerging Representation Language for Patterns, Hypotheses, and Evidence, in Proceedings of the AAAI Workshop on Link Analysis, 2005.
[6] Neville, J., Jensen, D.: Dependency networks for relational data, in Proceedings of the 4th IEEE International Conference on Data Mining, 2004.
[7] Pioch, N., Hunter, D., Fournelle, C., Washburn, B., Moore, K., Jones, E., Bostwick, D., Kao, A., Graham, S., Allen, T., Dunn, M.: CADRE: continuous analysis and discovery from relational evidence, International Conference on Integration of Knowledge Intensive Multi-Agent Systems, pp. 555–561, 30 Sept.–4 Oct. 2003.
[8] Wolverton, M., Berry, P., Harrison, I., Lowrance, J., Morley, D., Rodriguez, A., Ruspini, E., Thomere, J.: LAW: A Workbench for Approximate Pattern Matching in Relational Data, in Proceedings of the Fifteenth Innovative Applications of Artificial Intelligence Conference (IAAI-03), 2003.
[9] Wolverton, M., Martin, D., Harrison, I., Thomere, J.: A Process Catalog for Workflow Generation, in The Semantic Web—7th International Semantic Web Conference, Springer, vol. 5318/2008, pp. 833–846, 2008.