MDORG: Annotation Assisted Rule Agents for Metadata Files

Hiba Khalid, Université Libre de Bruxelles, Brussels, Belgium, Hiba.Khalid@ulb.ac.be
Esteban Zimányi, Université Libre de Bruxelles, Brussels, Belgium, Esteban.Zimanyi@ulb.ac.be

ABSTRACT
Metadata files are often incomplete and of inadequate quality, with underlying issues such as low maintenance, lack of provenance, inaccurate tagging, limited or no annotation, and differing metadata standards. Such metadata is not generally viable for analysis, large-scale integration, data management, or quality metadata maintenance, and it presents challenges in data analysis and production tasks. The expense associated with metadata discovery, annotation, and management amounts to considerable time and effort. To overcome these issues, leverage can be borrowed from intelligent systems that improve quality and reduce the manual effort associated with metadata discovery, annotation, and management. Intelligent agents can aid in identifying information if they are provided relevant labels or are allowed to learn over time. Annotation for intelligent agents can thus improve tasks such as automated summaries, metadata management, cataloging, and feedback management. We experiment with and propose the utility of annotation for rule agents to facilitate the analysis and organization of metadata files.

KEYWORDS
rule-based agents, textual metadata, metadata, intelligent agents, metadata management, metadata representation, metadata categorization

1 INTRODUCTION
In simple words, metadata [23] defines data, the data collection and recording process, and details about what the data entails. The importance of metadata lies in its inherent quality of maintaining the information about data. Generally, there are three possibilities when dealing with metadata: (i) Case-A: there is no metadata available, and data profiling [24] is required to infer metadata. (ii) Case-B: there is little metadata available; in such cases, mostly some sort of metadata such as the creation date and publisher titles is present. (iii) Case-C: metadata is available but of inadequate quality, i.e., the metadata content is either incorrect or not recorded properly. In this case, the challenge is to extract meaning from the available metadata. There is no speculation or doubt when it comes to metadata usability [13]. The process, however, is non-deterministic and often very resource and time expensive. A shared understanding amongst the scientific community around publicly available metadata can yield several advantages: the more metadata is available, the easier it becomes to retrieve datasets, suggest and recommend datasets, and pre-analyze various data sources. A similar advantage is evident for data lakes [20] and data warehousing [9], where data privacy can be a prominent concern, and metadata can significantly improve data manipulation, data fetch requests, and data analysis [32].

In the context of attaining or producing metadata, the possibilities are endless but are both resource and time expensive. Thus, there is a need for systems that can help generate, understand, extract, and populate metadata. This task is not simplistic; it is multi-phase and requires attention to intricate details such as cross-platform data sharing, data rights, data protection management, and data policies. There is still significant potential to support and exercise the quality production of metadata and its management. We explore the opportunity of leveraging ML techniques to (i) support metadata systems and (ii) explore and categorize information in metadata files using rule agents. Based on policies, rule agents guide the process of identifying metadata types and organize the information at hand for better use.

1.1 Metadata Types
Metadata types include: (1) Descriptive Metadata [33]: describes a dataset or resource for discovery, profiling, and identification; for example, title, authors, keywords, abstract, and comments. (2) Structural Metadata [12]: defines the type, structure, and data relationships. This type of metadata highlights the form and organization of the data under consideration. It can include the order in which information is available as well as its original recording format. The most popular examples include the information schema [2] and the definition schema. (3) Administrative Metadata: as the name suggests, this type of metadata contains information that can help manage and update a source. It typically includes information such as the date of creation, published date, update dates, file type, logs, file definition, access management, and access rights. The most commonly used subsets of administrative metadata are (i) Rights management metadata, which deals with data rights, intellectual property, and sharing policies, and (ii) Preservation metadata [17], which includes the information used to archive resources, manage updates and quality, and preserve resources.
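For illustration, this taxonomy can be encoded as a small enumeration. The sketch below is ours, not part of the paper's system or of any metadata standard, and the field-to-type mapping is hypothetical:

```python
from enum import Enum

class MetadataType(Enum):
    """Metadata types from section 1.1; the two administrative
    subsets are modeled as types of their own for convenience."""
    DESCRIPTIVE = "descriptive"        # title, authors, keywords, abstract
    STRUCTURAL = "structural"          # type, structure, data relationships
    ADMINISTRATIVE = "administrative"  # creation date, logs, access rights
    RIGHTS = "rights"                  # administrative subset: data rights, sharing policies
    PRESERVATION = "preservation"      # administrative subset: archiving, quality management

# Hypothetical mapping of common metadata fields to their type.
FIELD_TYPES = {
    "title": MetadataType.DESCRIPTIVE,
    "keywords": MetadataType.DESCRIPTIVE,
    "information_schema": MetadataType.STRUCTURAL,
    "date_of_creation": MetadataType.ADMINISTRATIVE,
    "license": MetadataType.RIGHTS,
    "archive_policy": MetadataType.PRESERVATION,
}

for field, md_type in FIELD_TYPES.items():
    print(f"{field} -> {md_type.value}")
```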
In the context of this research paper, we discuss (i) the usability of rule agents in understanding metadata files, (ii) the definition of annotations on metadata files (for example, metadata file headers and metadata group headers), and (iii) observations of rule agents using annotated MD files to facilitate metadata file organization, which includes identifying metadata types, content, and dispersed or misplaced metadata. The paper is organized as follows: section 2 discusses the use of rule agents for different applications and use cases; section 3 discusses the working and design of intelligent agents; section 3.1 defines the need for rule agents and how they can be incorporated to support and possibly facilitate metadata organization and categorization. We then discuss the importance of using annotated files for rule agents in section 4. Section 5 provides insights on some of the challenges we encountered, and section 6 provides an overview and conclusion of our research paper. Our design and experiments include metadata files extracted from the Kaggle repository and Data.gov. The examples described in this paper are derived from the UK Road Safety dataset on Kaggle (https://www.kaggle.com/tsiaras/uk-road-safety-accidents-and-vehicles).

© Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 RELATED WORK
This section reviews the most relevant work in the domain of understanding metadata and the role of AI agents.

Intelligent Agents and Metadata: AIMMX [30] provides a library-based metadata model extractor for software repositories; it performs important tasks such as identifying associated resources, model extraction, and model names. Metadata-supportive systems are advancing and taking support from machine learning and AI to build better systems, such as IBM Redbooks [26], which among other functions uses metadata from imagery for specifications. The literature thereby gathers inspiration from rule and knowledge systems to address concerns of semi-automatic categorization and organization of metadata. Rules can have a unique impact on the type of system and available architectures; they affect not only individual agent architectures but multi-agent systems as well [5]. The applicability of rule agents extends to semantic web technologies and architectures [7], where a general architecture for rule agents has been devised and materialized using semantic web languages. Rule-based systems are diversely applicable and tend to facilitate systems in taking crucial decisions such as intrusion detection [15]. Rules for intrusion detection, like other rule systems, have to be pre-defined to identify and differentiate between incoming packets on a network. Testing these systems also requires a set of performance and evaluation criteria that have to be predefined to mark an agent's or system's performance. Another important use case for rule-based agents is Grid applications [18]. Rule-based agents perform adequately for high-scale and data-intensive solutions. Contrary to belief, these applications not only perform well but are highly advisable for autonomic applications, which allow rule agents to operate on a set of rules designed for autonomous systems [4]. The application and use of rule agents in different application systems are apparent and understandable; based on the evidence and performance of rule agents, they are deployed in various applications and systems.

Metadata Management and Profiling: Metadata management has taken many forms over time, and it is important to understand how metadata has evolved over the years in terms of applications and tools. In general, metadata [27] and metadata management [11] [16] are important components of data warehousing and large-scale data integration systems. Amongst the many challenges in the field of metadata, standards [28] [19] [8], data profiling [21], and format normalization are the most prominent concerns. Moreover, the need to maintain and provide reproducible metadata in terms of lineage and provenance is one of the most promising aspects of metadata in the semantic web [3].

3 INTELLIGENT AGENTS
Intelligent agents are capable of performing tasks that are instructed to them. The intuition behind introducing intelligent agents for metadata file understanding and categorization stems from the idea of eliminating or reducing the manual effort required in cleaning, processing, and organizing metadata files. We make use of rule agents to assist in the process of metadata tagging [25], metadata annotation, and metadata file organization. Rule agents, more commonly known as policy agents, function on a set of pre-defined policies to perform a task. However, this is not simplistic; metadata management and understanding pose many challenges for intelligent agents, such as (1) lack of metadata availability, (2) poorly orchestrated metadata, (3) lack of governance [14], (4) bad quality metadata, and (5) lack of metadata standards [28]. The rule agents' design aims to semi-automatically analyze the available metadata in files and perform actions such as re-arranging, assigning useful tags, designating metadata types, and organizing metadata for profiles. The rule sets for different groups of agents were distinct in nature, in order to observe how comprehensively agents dealt with different types of metadata and information available in the metadata files.

3.1 Rule Agents
Rule agents [22] are simple condition-action agents that operate on a provided environment by using the available rules and completing the assigned task. Each rule agent accesses the rules from the rule library and uses its allowed set of actions to analyze and categorize the files. The rule library comprises rule sets that contain multiple rules for a particular category; it contains many rules that correspond to different use cases in metadata file categorization. Each rule set can contain two or more policies. Rule agents use the available rules to organize and categorize information inside a metadata file. The rule library contains all rules that are accessible by the rule agent [6] (see table 1). Each rule set in table 1 comprises subset rules that satisfy the main aspect of that ruleset. For example, ruleset R_1002, i.e., "count columns", contains a total of 4 rules that (1) identify columns, (2) identify column names, (3) match keywords, and (4) count the total column names that appear in a metadata file. Similarly, other rulesets contain appropriate rules to identify and target the main purpose of the ruleset. Some rule sets do not contain many rules and are designed more simplistically; for example, rule set R_1018 contains only one rule, i.e., to search for the empty-line annotation. In our problem design, we work with three libraries: the rule library (contains rule sets), the file library (contains the metadata collection), and the agent library (contains rule agents). Metadata file organization and content identification are obtained as a result of policy application by rule agents on the collected metadata files.

Rule-based systems are designed for knowledge storage, knowledge manipulation, and informed feedback from the system. Rules are designed as part of the rule-based system: a rule-based system first gathers knowledge and then manipulates this knowledge to derive useful information, conclusions, or a set of actions. Rule-based systems are also called expert systems; they can understand, store, and derive inferences from available information. In the context of our research goals, we intend to identify different types of available metadata, comprehensively deduce metadata information, and categorize metadata content properly. Thus, we work with the most traditional type of rule-based system, called a deduction expert system (DES) [10]. The DES works with a domain-specific knowledge base and rules to produce (1) deductions and (2) actions based on choices. The choices can be curated or manually programmed into the expert system. In our case, we deal with both types of choices, which can render multiple actions. We group the types of agent actions as an 'Action Set' to centralize the different types of agent actions a rule agent can perform or attain.

RuleSet ID | Rule Identifier | Rule Description
R_1001 | Data Type Identifier | Identifies the type of data encountered as text, numeric, date type, etc.
R_1002 | Count Columns | Searches the file for the total number of columns; uses keywords and counts whether column names appear in textual descriptions.
R_1003 | Identify Attribute Names | Iterates to find attribute names; if available in the file, creates a list.
R_1004 | Generate Attribute Lists | Generates a list of attributes to be inserted into the metadata file.
R_1005 | Look Up: Rows | Iterates over textual metadata using headers and group headers to find 'total rows' in the dataset.
R_1006 | Look Up: Missing Values | Looks for null value indicators and identifies whether headers or metadata lack value counterparts.
R_1007 | Look Up: Date of Creation | Looks for the date of creation in a metadata file using group headers and metadata types.
R_1008 | Look Up: Update Dates | Looks for data updates in a metadata file using group headers, metadata types, and keywords.
R_1009 | Look Up: License Information | Looks for license information in a metadata file using group headers, metadata types, and keywords.
R_1010 | Look Up: Publisher | Looks for the publisher in a metadata file using group headers, metadata types, and keywords.
R_1011 | Look Up: Data Privacy | Looks for data privacy details in a metadata file using group headers, metadata types, and keywords.
R_1012 | Version Management | Looks for version policies or details in a metadata file using group headers, metadata types, and keywords.
R_1013 | Domain Identifier | Aims to derive a topic sentence or domain of the data using the metadata file, textual summary, and file names.
R_1014 | Search: Keywords | Searches for the word 'keyword' in a metadata file to collect keywords and save them.
R_1015 | Search: Comments | Identifies comments in two ways: (i) using headers and (ii) using annotation tags.
R_1016 | Search: Context CSV | Looks for CSVs that might contain metadata in primary downloads.
R_1017 | Search: Headers | Looks for annotated headers in a metadata file.
R_1018 | Search: Empty Lines | Looks for annotated empty lines in a metadata file.
R_1019 | Search: Group Headers | Looks for annotated group headers in a metadata file.
R_1020 | Search: Metadata Types | Looks for annotated metadata types in a metadata file.
R_1021 | Search: File Type | Looks for annotated file types in a metadata file.
R_1022 | Search: Archives | Looks for archives in a metadata file using header and keyword information.
R_1023 | Search: Data Summary | Looks for a data summary in a metadata file using header and keyword information.
R_1024 | Look Up: Hyperlinks | Looks for hyperlinks inside a metadata file and stores them.

Table 1: Sample policy descriptions for rule-based agents.
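The paper does not publish its implementation, so the following minimal Python sketch shows one plausible shape for the rule library and a condition-action rule agent. All class and function names are our own, and the two sample rules only echo the intent of rulesets R_1018 and R_1024 from table 1:

```python
import re
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rule:
    """A single condition-action rule: if the condition matches a line
    of a metadata file, the action returns a tag for that line."""
    rule_id: str
    condition: Callable[[str], bool]
    action: Callable[[str], str]

@dataclass
class RuleSet:
    ruleset_id: str  # e.g. "R_1018" from table 1
    purpose: str
    rules: list = field(default_factory=list)

class RuleLibrary:
    """Holds every rule set accessible to the rule agents."""
    def __init__(self):
        self.rulesets: dict[str, RuleSet] = {}
    def add(self, ruleset: RuleSet):
        self.rulesets[ruleset.ruleset_id] = ruleset

class RuleAgent:
    """Applies every accessible rule to every line of a metadata file."""
    def __init__(self, agent_id: str, library: RuleLibrary):
        self.agent_id = agent_id
        self.library = library
    def analyze(self, lines: list) -> list:
        tags = []
        for line in lines:
            for ruleset in self.library.rulesets.values():
                for rule in ruleset.rules:
                    if rule.condition(line):
                        tags.append((rule.rule_id, rule.action(line)))
        return tags

# Illustrative rules in the spirit of R_1018 (empty lines) and R_1024 (hyperlinks).
library = RuleLibrary()
library.add(RuleSet("R_1018", "Search: Empty Lines",
                    [Rule("R_1018.1", lambda l: not l.strip(),
                          lambda l: "empty line")]))
library.add(RuleSet("R_1024", "Look Up: Hyperlinks",
                    [Rule("R_1024.1", lambda l: bool(re.search(r"https?://", l)),
                          lambda l: "hyperlink")]))

agent = RuleAgent("RA_101", library)
print(agent.analyze(["Usage Information", "", "See https://data.gov"]))
```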
For an expert system to function end to end, it has to be intuitively designed according to the domain problem. However, the basic components of all expert systems comprise the following: (i) Knowledge Acquisition: the process of collecting knowledge on a domain, defining methods, boundaries, outliers, special cases, etc. (ii) Rule Base or Knowledge Base: a collection, list, or set of indicative rules used to render actions and choices. (iii) Inference Engine: the engine responsible for deducing or understanding the rules listed.

We designed a multi-agent system for metadata categorization, file understanding, and organization. However, in the context and scope of this research paper, we discuss the working and performance of 5 rule agents and omit the rest of the agents due to space limitations. Each agent is randomly assigned a metadata file from the mixed collection, and the agent continues to apply rules until the metadata file is completely analyzed (see figure 1). Once each agent has its file, a number of actions are performed by the agent to understand the file and its content. For example, figure 2 illustrates how agent RA_101 first identifies the prominent file features that are expected to be in the file, such as headers, group headers, empty lines, and metadata content. After this step, each agent takes the metadata contents identified and starts applying rules by accessing the rulesets from the rule library. Each rule application involves the use of allowed agent actions such as read, write, match, drop, delete, and annotation request. An agent requests an annotation if there is no associated rule pre-defined in the rule set, or if the item cannot be categorized, i.e., when the agent is indecisive about an annotation and its associated action.

Figure 1: The overall depiction of how agents interact with the metadata file collection and use percepts and actions to analyze metadata files.

Figure 2: The overall depiction of how agents interact with the metadata file collection to analyze metadata files.

Rule-based agents typically function on a set of pre-programmed policies or a set of rules [29]. These are basic instruction sets for an agent to use and act upon. In the case of metadata categorization, organization, and understanding, rule-based agents are promising. However, autonomy and self-reliance in decisions are difficult to achieve even for intelligent agents. Annotation is a technique that allows the rule agents to (i) make sense of the information and (ii) facilitate future autonomous actions. In our research, we use manual annotation and tagging to observe how intelligent agents perform on and categorize the metadata information provided to them. To facilitate and provide an understanding of the problem to rule agents, we established annotation criteria: figure 4 demonstrates the application of the established annotations, and table 2 highlights the annotations provided to rule agents for different types of metadata files.

For manual annotation, we group the basic contents of a metadata file that an intelligent agent can come across for a certain file type (e.g., text, CSV, TSV, JSON). After pre-processing the different file types into spreadsheets, the annotation process is carried on. After grouping, we add annotations that rule-based agents can use to categorize and organize content in metadata files. We refer to this phase as Anatomy Tagging, i.e., in this step we annotate sections that represent the anatomy of a metadata file; table 2 illustrates the anatomical annotation contents. Headers and group headers are annotated to support deterministic and scalable search through a metadata file. Empty lines, empty paragraphs, or empty chunks can indicate the end of a file, termination triggers, etc. To deal with this, we annotated empty lines in a metadata file and set actions for rule agents to delete and discard them, provided this does not change the information at hand. For instance, textual metadata files contain empty spaces and headers, such as a Description header followed by a bunch of text about the dataset. The agents need context and an understanding of what classifies as a header and what is the actual content or metadata value. For this purpose, we pre-process the metadata files and assign meaningful annotations so the rule agents can process the files and extract, categorize, and organize information more accurately. Incorporating the kind of metadata available in a file is thus critically important for the rule-based agents. We dealt with this challenge by annotating limited information in three broad metadata categories and their subcategories (see section 1.1).

In figure 4, there are five types of information for a rule agent to identify in this use case. Selected types are listed below:

3.1.1 Metadata Group Headers: the annotated file in figure 4 is assigned to a policy agent. The policy or rule-based agent utilizes the annotation to identify each group header and looks for data inside it. The group headers in figure 4 are 'Usage Information', 'Maintainers', and 'Updates'. Policies are designed to equip agents to provide accurate results.

3.1.2 Metadata Content: the second type of information a policy agent encounters is 'metadata'. This is regarded as data in our problem, or more precisely as a value, and it is not annotated. Thus, the policy agent identifies a group header, locates the data, and tags it accordingly, for example, 'data: Group Header_Name'. The first bulk of data, for instance, would be stored as 'data[License: Database: Open Database, Contents: Database Contents, Visibility: Public]: Usage Information'. This information is then available for all policies to explore and exploit. The data inside group headers is utilized to categorize the type of metadata or information available in the next phases. Most of the annotation in metadata files falls under the metadata content category, as most of the information besides headers, group headers, and empty lines is metadata content.

3.1.3 Empty Lines: the third type of information a policy agent encounters for the metadata file shown in figure 4 is empty lines. The empty lines are annotated as 'empty lines' for policy agents to quickly differentiate between data, empty lines, and empty values.

Figure 4: Annotation for different cases in a metadata file: the image indicates the annotation key, i.e., the process used to annotate different cases such as comments, empty lines, metadata types, etc.

Annotation Type | Annotation Description
Metadata File Headers | A metadata file header can include the file title, the dataset title, and other headings or sections in a metadata file, such as Keywords, etc.
Metadata Group Headers | A group header is a header that can represent one or more headers; a collection of textual description that can contain tables and column information.
Metadata Content | The actual value or content against metadata headings: the content inside metadata headers and group headers, and values against metadata types such as "created by".
Metadata Types | Identifies different types of metadata, such as descriptive, administrative, technical, etc.
Empty Lines | An empty line, group of empty lines, or empty space that contains no actual metadata.

Table 2: Annotation types and their descriptions for rule-based agents.
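To make the anatomy-tagging phase concrete, the sketch below attaches table 2 style tags to the lines of a small metadata fragment in the spirit of figure 4. The tag strings and matching heuristics are simplifying assumptions for illustration, not the system's actual annotation code:

```python
# A minimal anatomy-tagging sketch, assuming simple heuristics:
# known group headers are matched verbatim; blank lines become
# 'empty line'; everything else is treated as metadata content.
GROUP_HEADERS = {"Usage Information", "Maintainers", "Updates"}  # from figure 4

def anatomy_tag(lines):
    tagged, current_group = [], None
    for line in lines:
        text = line.strip()
        if not text:
            tagged.append((line, "empty line"))            # table 2: Empty Lines
        elif text in GROUP_HEADERS:
            current_group = text
            tagged.append((line, "metadata group header")) # table 2: Group Headers
        else:
            # table 2: Metadata Content, tagged as 'data: <group header>'
            tagged.append((line, f"data: {current_group or 'file header'}"))
    return tagged

sample = [
    "Usage Information",
    "License: Database: Open Database",
    "Visibility: Public",
    "",
    "Maintainers",
    "Publisher: Kaggle",
]
for line, tag in anatomy_tag(sample):
    print(f"{tag:28} | {line}")
```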
3.2 Annotation Example
Let us look at a simple use case from the collected and annotated metadata file repository: the metadata file acquired for the UK GOV Road Safety Data (2005–2015) from Kaggle (https://www.kaggle.com/tsiaras/uk-road-safety-accidents-and-vehicles/metadata). Figure 4 depicts a simple metadata file extracted from the Kaggle web page. The page does not provide a metadata file; in response, the agent adds a title inferred from other metadata files for the UK GOV Road Safety dataset. It is important to note that plenty more metadata is available on the Kaggle web-link, but it is unorganized and not marked or labeled as metadata. There are typically three use cases that we address, namely: (i) Case-I: no metadata is available; in this case, we use techniques such as exploratory data analysis (EDA) [31] and data profiling [1, 24] to derive meaning from the data itself. (ii) Case-II: metadata is available in high quality; most of the time, however, metadata is either unavailable or of poor quality. (iii) Case-III: metadata is available but either misplaced or of bad quality. This case is where most of this research paper focuses. Mostly, metadata files are messy and unprepared; the objective is to utilize rule agents to organize and make sense of the messy files and generate profiles. To achieve this, the agents must understand the kind of information they are dealing with, so we annotate examples from different types of metadata that rule agents can use to categorize and organize metadata files; figure 4 indicates the annotation strategy applied to a small file. We have worked with a dataset that does provide textual metadata, descriptive metadata, metadata tags, headers, and context, but it is misplaced or, more precisely, unorganized. Our goal was to identify this metadata using rule agents. We have designed an instruction set for metadata files to facilitate our rule agents in detecting, identifying, and categorizing metadata file contents. We annotate these instruction set categories and supply them to the rule-based agents (RBA). Table 2 indicates the annotations we perform the first time files are accessed; these files are annotated and then further processed by rule agents. Table 2 includes the main headers that might appear in a metadata file, and empty lines are identified to avoid repetitive cycles for AI agents. It is important to note that if there are sub-headings, or nested 'headers', we annotate them as a 'Metadata group header'. This helps the agent differentiate between file headers and group headers that can hold a chunk of metadata or information. The chunks of metadata are treated as data and are further analyzed to identify the types of metadata they contain; this is also an annotation category, identified as 'metadata types' in table 2. Once the annotation is complete, we then have to design rule sets that can be utilized by policy agents to render a decision.

The collection of files, annotation, metadata cleaning & organization, and preparation comprise the knowledge acquisition phase. Figure 3 illustrates the four main stages for a policy-based or rule-based agent. The second stage involves the Rule Base: a rule base is a collection of rule sets or policy sets, and these policy sets are designed to support agents in performing different actions. The rule sets are analyzed by an inference engine that makes sense of how each rule matters and applies to the case at hand. Finally, the action center is the set of actions an agent is permitted or allowed; a policy agent can thus perform a certain limited set of actions based on an inference derived by rule application on metadata files.

Figure 3: The overall system process from knowledge acquisition to the action center where agents perform different sets of actions.
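Read end to end, the four stages of figure 3 suggest a simple pipeline. The sketch below wires hypothetical stand-ins for the stages together to show the order of operations; none of the function bodies reflect the actual system:

```python
def knowledge_acquisition(raw_files):
    """Stage 1: collect, clean, and annotate metadata files (ground truth)."""
    return [f.strip() for f in raw_files if f.strip()]

def rule_base():
    """Stage 2: the collection of rule sets / policy sets."""
    return {"R_1017": "search annotated headers",
            "R_1018": "search annotated empty lines"}

def inference_engine(file_lines, rules):
    """Stage 3: decide which rules apply to the file at hand."""
    applicable = []
    for rule_id, description in rules.items():
        if "headers" in description and any(l.isupper() for l in file_lines):
            applicable.append(rule_id)
    return applicable or list(rules)  # fall back to trying every rule

def action_center(applicable_rules):
    """Stage 4: map inferred rules to permitted agent actions."""
    return [("match", rule_id) for rule_id in applicable_rules]

files = knowledge_acquisition(["TITLE", "publisher: kaggle", "  "])
print(action_center(inference_engine(files, rule_base())))
```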
4 EVALUATION
The role of annotated metadata files for rule agents was observed by conducting experiments that evaluated and compared the rule agents' ability to categorize and organize metadata files with and without annotations. Another objective of the experiments was to observe how the rule agents performed if rules were altered or manipulated. The experiments were conducted on a spreadsheet dataset that was gathered from Kaggle metadata files, Kaggle resource descriptions, and Data.gov metadata files from various categories. A total of 1123 metadata files were retrieved from both repositories and added as a "mixed collection" to the working repository for rule agents. Throughout the research paper, we refer to the mixed collection as a collection of metadata files from both Kaggle and Data.gov. As far as our research contribution is concerned, to the best of our knowledge, there are no substantial contributions that directly address the exact definition of this problem. However, there are related tools such as Metanome [24] that can be tested with the technique we have developed, to assess how their performance improves or deteriorates. With regard to the scope of this research paper, we emphasize that the experiments carried out are based on the usability of the technique and the meaningful observations made with regard to metadata management using rule agents. It would go beyond the scope of this research paper to compare and examine the improvement of existing tools by the technique we have developed. Nevertheless, our future work will concentrate on observing performance change and the applicability of our technique to other tools.

4.1 Evaluating Rule Agent Performance
To understand the role of annotated metadata files for metadata categorization and file organization, it is crucial to observe how agents behave with and without annotations. This experiment was performed with a set of five rule agents. In the first phase, these five rule agents were provided metadata files that were not annotated or labeled. The same set of agents was then analyzed and observed in a setting with annotated metadata files. Each agent was randomly assigned metadata files from the entire mixed collection of metadata files retrieved from Kaggle and Data.gov. Figure 5 illustrates the performance of each agent with and without annotations. For instance, AR1 represents rule agent_001 with annotated metadata files, and UR1 represents the performance of the same agent with unannotated metadata files. Each agent in both categories (annotated, unannotated files) was evaluated on four factors: (1) accurate identification of metadata headers, (2) correct identification of metadata group headers, (3) MD content or values against metadata types, and (4) accurate categorization of available metadata into metadata types.

Figure 5: The overall action performance of agents on annotated and unannotated metadata files, involving the use of content, annotations, and sets of action sequences (mixed collection).

File annotation requests were also examined for all agents, to understand how many times an expert user was prompted for a manual annotation request. This happened when the rule agent could not decide on the categorization of the metadata value at hand or on the occurrence of empty blocks, empty lines, etc. (see figure 6). The experiments in figure 6 highlight the frequency of manual annotation requests of 5 rule agents (case-I: agents with annotated files; case-II: 5 rule agents with unannotated metadata files). Each time an agent cannot categorize content, i.e., a data object, header, content, text block, etc., it sends a request for manual annotation to the user. This represents indecisiveness, as these agents are simple policy agents and are not reinforced agents. From this experiment, we expected to learn about the usability of adding annotations for metadata headers, group headers, content, and types. As a result, we observed that unannotated files required significantly more annotation requests than files that were pre-annotated.

Figure 6: The number of manual annotations requested by agents with annotated and unannotated files (mixed collection). The performance of agents depicted in the figure is based on a 75 percent annotation level of metadata files from the mixed collection.

4.2 Detecting Metadata Types and Analyzing Policy Frequency
The second experiment was to observe whether rule agents can independently detect metadata types with and without annotations. The objective was to alter policies for rule agents and observe whether this would affect the overall metadata type detection. Firstly, rule agents without annotations requested a manual annotation 80% of the time when a metadata type had to be detected. Figure 7 indicates the detectable metadata types (Descriptive (DMD), Administrative (AMD), Structural (SMD), Rights (RMD), Preservation (PMD)) across files (recorded in percentages). We observed that most of the metadata files did not contain exact information such as the resource link to the original data repository, update policies, copyrights, etc. In most cases, this metadata could nevertheless be retrieved; we deal with this and regard it as "misplaced metadata". Rule agents with annotated files were thus able to detect more metadata types and were intelligent enough to identify discrepancies or missing information in files.

Figure 7: The figure represents the percentage of files in which rule agents were able to detect different metadata types (mixed collection).

In regards to "misplaced metadata", we made a few observations on the collected files. Some of the most prominent observations were as follows: Lack of metadata support and updates: most websites, resources, and portals do not maintain a complete metadata update cycle. The information on the dataset description page and in the actual metadata file was observed to be incoherent. Lack of metadata download support: repositories and resources describe a resource on the main page or portal but disregard the facility to download this valuable metadata. For example, the repository pages at Kaggle and Data.gov both contain essential descriptive and rights metadata. This information, such as the resource description and keywords, is not readily available for download. It is also not properly tagged and is not automatically embedded or added to downloadable metadata files. In most cases, it has to be manually downloaded or added to the metadata collection.

Another important experiment was to observe how many times a particular rule was accessed and applied by the rule agents. This was observed by maintaining a Rule log in each cycle for all rule agents. Figure 8 depicts the frequency (recorded in percentages) of rules fired by each rule agent. The most useful aspect of this observation experiment was to understand the type and frequency of metadata content prevalent in the collected files. For example, the R_1006 firing frequency is quite high, indicating that the rule agents encountered null and empty values in the majority of metadata files. Similarly, the R_1022 firing frequency was quite low in comparison to other rules, indicating that the majority of the metadata files did not contain information relevant to data archives, metadata archives, or legacy support files and hyperlinks. Another aspect that can be observed from figure 8 is that not all metadata files contain relevant hyperlinks; as indicated, R_1024 was accessed by agents only a couple of times.

Table 1 includes a sample of 24 rule sets that are measured in figure 8, which demonstrates the rule firing frequency by rule agents on the mixed collection and visualizes how often each rule is fired and how many times it is utilized on average by each agent. Table 1 provides insight into the policies available to all rule agents for both annotated and unannotated resources. Each policy defined in table 1 is made available to rule agents based on their tuning. We refer to tuning, i.e., "a set of policies", as the profile setting available to a particular agent. In our research, we experiment with agents in different settings by increasing or decreasing the number of policies accessible by a particular agent. For example, R_1010 searches the file for the keyword "publisher", its synonyms, and approximate matches for this keyword group. This functionality allows a rule agent to scan headers, group headers, and text blocks to locate a valid publisher and its details in a metadata file.

Figure 8: An observation of the average rule firing frequency by rule agents on the mixed collection.

4.3 Agent Actions
The policy agents are assigned a set of actions that can be chosen to attain the next state and complete the assigned task (based on the available policies provided to each agent). The allowed set of actions for any rule agent in this experimental setup includes the following: Read: the agent scans and reads the file ingested into the system. Search: a search action that allows the agent to scan the file, gather insight, and look for common words, indicators, and headers. Write: allows the agent to write into a new metadata file. Match: an action where the agent matches its knowledge of annotations against the available tags in the metadata file, covering cases where there is a need to search for synonyms, etc. Drop: content is dropped or considered invaluable by the agent and added to the metadata file as 'supplementary information'. Delete: the agent decides to delete a piece of content such as consecutive delimiters or misspelled words. Compare: two or more headers, sub-headers, or content pieces are compared for duplicate data identification. Request Annotation: when an agent is indecisive about a set of sentences, such as comments (if not annotated), the agent requests a manual annotation from the user (this action is typically requested after the agent completes the read cycle and before the write operation).
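As an illustration of this action set, the following sketch enumerates the eight actions and raises an annotation request when no tag matches; the per-agent tallies mirror the kind of counts reported in table 5 below. The logger and matching logic are assumptions for this example, not the system's code:

```python
from collections import Counter
from enum import Enum, auto

class Action(Enum):
    READ = auto(); SEARCH = auto(); WRITE = auto(); MATCH = auto()
    DROP = auto(); DELETE = auto(); COMPARE = auto()
    REQUEST_ANNOTATION = auto()

class ActionLogger:
    """Tallies the actions an agent performs, as in table 5."""
    def __init__(self, agent_id):
        self.agent_id = agent_id
        self.counts = Counter()
    def perform(self, action):
        self.counts[action] += 1

def categorize(line, known_tags, log):
    """Try to match a line against known annotation tags; request a
    manual annotation when the agent is indecisive (no tag matches)."""
    log.perform(Action.READ)
    for tag in known_tags:
        log.perform(Action.MATCH)
        if tag.lower() in line.lower():
            log.perform(Action.WRITE)
            return tag
    log.perform(Action.REQUEST_ANNOTATION)
    return None

log = ActionLogger("RA_101")
for line in ["License: Open Database", "???unlabeled block???"]:
    categorize(line, ["License", "Publisher"], log)
print({a.name: n for a, n in log.counts.items()})
```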
In table 5, five rule agents are observed for their action behavior on the same metadata file. The experiment observes the role of policies and how they change the action initiation sequence for each agent. For example, RA_101 does not exclude the dropped and deleted items; thus, its read items equal its written items in the new metadata file, but with a different context. On the other hand, agents like RA_102 and RA_103 do not add these items to the total write count for newly generated metadata files. The difference in operations is based on the accessibility of policies and the action preferences assigned to each agent. All five agents were observed under different settings to understand how actions and operations differ based on agent action allowances. Thus, by changing policies, we were able to observe significant changes in the performance of agents on the same metadata files.

Actions | RA_101 | RA_102 | RA_103 | RA_104 | RA_105
Read | 1189 | 1189 | 1189 | 1189 | 1189
Search | 1458 | 1650 | 1687 | 1465 | 1093
Write | 1189 | 1192 | 1125 | 1143 | 1061
Match | 4567 | 4987 | 4976 | 3456 | 1148
Drop | 10 | 120 | 56 | 34 | 26
Delete | 0 | 3 | 8 | 12 | 8
Compare | 201 | 189 | 300 | 450 | 120
Annotation Req | 109 | 80 | 344 | 450 | 122

Table 5: An overview of average agent actions on a metadata file, illustrating the importance of different profile settings for rule agents and their impact on agent actions.

4.4 Agent Performance and Analysis
To obtain a better picture and understanding of agent performance, we designed three major categories of experimental observation. Firstly, we tested our technique and rule agents on the mixed collection, i.e., the collection of metadata files obtained from Kaggle and Data.gov; figure 9 illustrates the performance of each rule agent on this collection. The second experimental setting was to observe how agents performed with different levels of annotated datasets. We defined three annotation levels, starting at 25%, then 50%, and concluding with 75% annotated metadata files; table 3 provides an analysis of each agent's performance in accordance with the annotation percentage. By observing how the agents performed, we learned that annotated metadata files significantly improve agent performance, even in the most difficult cases such as metadata type identification. Most importantly, from this experiment we observed how each agent was affected by its primary policy selection (i.e., the setting for each agent in terms of allowed or accessible policies), as table 3 shows.

Annotation Percentage | RA_101 | RA_102 | RA_103 | RA_104 | RA_105
25% | 47% | 45% | 57% | 59% | 43%
50% | 65% | 78% | 69% | 75% | 84%
75% | 95% | 92% | 89% | 78% | 88%

Table 3: The change in behavior of rule agents at annotation levels of 25%, 50%, and 75%.

Figure 9: An observation of overall agent performance in terms of accuracy, precision, and recall.
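A hedged sketch of how the annotation-level comparison behind table 3 could be run: pre-annotate a fixed fraction of each file's lines, let a toy agent categorize the rest, and score against ground truth. The toy agent and scoring here are illustrative assumptions, not the experimental harness used in the paper:

```python
import random

def annotate_fraction(lines, ground_truth, fraction, seed=0):
    """Pre-annotate a given fraction of lines (25%, 50%, 75%, ...)."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(lines)), int(len(lines) * fraction))
    return {i: ground_truth[i] for i in chosen}

def toy_agent(lines, pre_annotations):
    """Trusts pre-annotations; falls back to a naive guess elsewhere."""
    out = {}
    for i, line in enumerate(lines):
        if i in pre_annotations:
            out[i] = pre_annotations[i]
        else:
            out[i] = "empty line" if not line.strip() else "metadata content"
    return out

def accuracy(predicted, ground_truth):
    hits = sum(predicted[i] == ground_truth[i] for i in ground_truth)
    return hits / len(ground_truth)

lines = ["Usage Information", "License: Open Database", "", "Maintainers"]
truth = {0: "group header", 1: "metadata content", 2: "empty line", 3: "group header"}
for level in (0.25, 0.50, 0.75):
    preds = toy_agent(lines, annotate_fraction(lines, truth, level))
    print(f"annotation level {level:.0%}: accuracy {accuracy(preds, truth):.0%}")
```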
4.5 Metadata Collection Analysis
It is critical to understand how the technique responds to different types of datasets. In our case, the aim was first to collect metadata files that can be analyzed by rule agents. In the context of our research problem, we did not limit the collection of metadata files to a certain domain; the collection thus comprised metadata files from various domains such as accidents, geospatial data, pharmacy, music, and movies, to name a few. The challenge was, however, to understand and find relevance (if it persists) in these metadata files, and secondly to identify the different constituents of metadata. Based on our collection, we worked with a total of 1123 metadata files: 563 files were obtained from the Kaggle repository (on numerous domains and topics), and 560 metadata files were retrieved from the Data.gov repository (likewise on numerous domains and topics). Table 4 depicts our overall observations and understanding of the metadata collection. We analyzed both collections separately on the following criteria:

4.5.1 Pre-Processing: this metric identifies the amount of pre-processing required on files before they can be fed to the rule agents for further categorization and analysis. Pre-processing is an important aspect, as most metadata files are raw and do not come processed with preambles, labels, and annotations. Moreover, the file structures and standards vary in the majority of cases. We observed that metadata files from Kaggle required more pre-processing in terms of missing values, incorrect values, and file standardization. Metadata from Kaggle also had many cases of "misplaced metadata", making pre-processing a necessary step in the process. On the other hand, metadata retrieved from Data.gov required only basic cleaning and file preparation and was already in structured formats.

4.5.2 Rule Applicability Rate: we also observed how many rules were applicable on a pre-processed metadata file from both collections. Table 4 identifies a relative difference in rule applicability between the two data collections. Due to their more header- and tag-oriented file structure, the metadata files from Data.gov were perceived to be more rule-applicable in comparison to the Kaggle collection.

4.5.3 Rule Failure Rate: the rule failure rate is described as a failed attempt of rule application by a designated rule agent. If an agent executes a rule and there is no profitable outcome, i.e., no conclusive agent action is performed and no task is completed, it is recorded as a failure. The files from Kaggle had an 11% failure rate, and the metadata collection from Data.gov had an 8% failure rate, as depicted in table 4.

4.5.4 Rule In-applicability Rate: this is defined in terms of the lack of metadata elements or metadata content inside metadata files. The unavailability of information in metadata files causes rule in-applicability: certain rules were never applied by agents because the corresponding elements, keywords, headers, or information content was not available. From the collection of agent policies, a total of 16% of rules were inapplicable in different phases on files retrieved from Kaggle, and a total of 5% of rules were inapplicable on metadata gathered from Data.gov (see table 4).

Dataset | Total Files | Pre-Processing | Rule Applicability Rate | Rule Failure Rate | Rule In-applicability Rate
Kaggle | 563 | 91% | 73% | 11% | 16%
Data.gov | 560 | 45% | 87% | 8% | 5%

Table 4: Dataset collection comparison and analysis.
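The three per-collection rates in table 4 can be computed directly from a rule log. Below is a minimal sketch under the assumption that each log entry records whether a rule application succeeded, failed, or was never applicable; the outcome labels are ours:

```python
from collections import Counter

def collection_rates(rule_log):
    """rule_log: list of (rule_id, outcome) pairs with outcome in
    {'applied', 'failed', 'inapplicable'}, one entry per rule attempt."""
    outcomes = Counter(outcome for _, outcome in rule_log)
    total = sum(outcomes.values())
    return {
        "rule_applicability_rate": outcomes["applied"] / total,
        "rule_failure_rate": outcomes["failed"] / total,
        "rule_inapplicability_rate": outcomes["inapplicable"] / total,
    }

# Toy log shaped like the Data.gov numbers in table 4 (87% / 8% / 5%).
log = ([("R_1017", "applied")] * 87
       + [("R_1006", "failed")] * 8
       + [("R_1022", "inapplicable")] * 5)
for name, value in collection_rates(log).items():
    print(f"{name}: {value:.0%}")
```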
5 CHALLENGES
In this section, we discuss some of the most difficult challenges we encountered; for some of them, we provide partial solutions and extend them in our future work. One of the most eminent challenges was Metadata collection: metadata files were available for some datasets, but most of the metadata was misplaced or had to be gathered from other resources. Thus, one file with all metadata was a rarity.

Metadata Pre-processing: In almost all of the files collected from the Kaggle repository and Data.gov, the metadata had to be cleaned and prepared before it could be utilized by our system. Pre-processing is a particular challenge, as all files are in different formats, the information does not strictly adhere to standards, and the files are populated differently for each dataset. Thus, metadata file pre-processing is one of the challenges that we currently deal with and address. Since most of the metadata files were raw, and the metadata was dispersed and misplaced, a pre-processing step had to be established.

Conflicting Labels and Label Cleaning: We address this by designing labels that avoid conflicts in the ground truth. The strategy we defined includes a growth pattern that allows labels to evolve and change over time if needed. We set this up by providing functions that allow the user to manipulate non-critical labels (the implementation details of this feature are beyond the scope of this research paper). For the context of the current research paper, the labels added are marked for each file, and a non-conflicting hierarchy of labels is enforced to avoid confusion for the rule agents. For example, if a file contains more than one main header, each must explicitly contain separate information and be worded differently: one file cannot have two main headers whose title says "Descriptive Metadata" or "Publisher". Such hierarchical control allows for information categorization and avoids conflicting labels. We have extended our work on dealing with conflicting labels and intend to improve it as we move along.

Nested Files and Data: In the context of our research, we only deal with non-nested, semi-structured files.

Failure Despite Annotations: We prepared the ground truth to deal with the most frequent and available information in metadata files by providing annotations. However, even with annotation practices, the rule agents were unable to provide a decisive result in cases such as identifying the most recent metadata file version. This was a failed use case for the rule agents: the annotation and content information was insufficient to draw the necessary conclusion.

Unreadable Text: The collection contained metadata files that had discrepancies in terms of incorrect text (text in English with incorrect spellings, etc.) and indiscernible text (text in another language). In both cases, the rule agents were unable to classify the text into any of the categories. Since we do not provide multi-lingual support, and many metadata files contain information in different languages such as French, Chinese, and Italian, this remains a challenge for our current research.

Incomprehensible Metadata Files: The collection comprised files whose metadata information, such as schemas and legacy links, was contained inside an image. This was a challenge and a limitation of our system, since we do not process image files; we had to exclude such files from our collection.

Inconclusive Column Names: This is a semantic challenge that is not part of our problem description, as we deal with metadata files, categorize the information inside them, and clean it for better reusability. Nonetheless, we identified that about 25% of the metadata files in the extended collection (before scrutiny; the final collection contained only 1123 files) had column names that did not represent the information accurately. Also, column names and column values were often conflicting: if the column name was, for example, "Publisher", it would be expected to contain the dataset publisher name, date, or resource link, but it contained a phone number or city name instead.

Empty Files: We address this issue partially in our current work by identifying blocks of empty data and empty files. This is a challenge more oriented towards understanding file structures and identifying different tables or contents inside a file, which is beyond the scope of our research; we only identify empty metadata files in our extended collection and disregard them from the main collection, i.e., the 1123 files.

Inconclusive Headers: This is a challenge that we address in our current research and are extending in our future work as well. The problem concerns naming conventions and identifying whether a title or text is to be considered a header or sub-header. Primarily, we deal with this through intuitive annotation techniques in our ground truth. We intend to extend this approach by developing a tree-based structure for titles, headers, and sub-headers. The tree would be designed to identify different levels of headers and text, removing the ambiguity of inconclusive headers and their hierarchy in metadata files.

Other Challenges: Another case was misplaced commas and delimiters; rule agents failed here because preparation was done manually and no rules directly addressed punctuation, delimiters, misplaced commas, etc. Another example of rule agent failure is the case of broken text paragraphs and empty columns: most of the time the rule agents were able to identify an empty row or column if it was labeled, but we did observe rule agents stuck in a cycle of interpreting empty rows in sheets. Finally, missing values were also an issue even with annotations, since rule agents work by example: the rule agents failed to identify whether the content or "metadata value" was placed against the correct metadata category.
6 CONCLUSION
In this paper, we have discussed the prospect of utilizing rule agents to facilitate metadata categorization based on two types of metadata files, i.e., annotated MD files and unannotated metadata files. To support and enhance the use of metadata in data management and business information systems, it is crucial to minimize the amount of time spent on pre-processing and categorization. We provide a policy-based system, operated by policy agents, to identify, observe, and categorize available metadata files for reusability and easy understanding. To achieve this, we explored the possibilities of changing policies and observing the behavior of the agents. By our observation, changes in the available and accessible policies affected an agent's ability to correctly categorize a piece of metadata in a file. We also observed that different files contain information that might be completely irrelevant based on the policies defined in our research. This method of experimentation helped us understand the core components of metadata files and how policy agents can best fit a metadata file categorization use case. The second experiment we conducted was to utilize manually annotated files for rule agents, to observe the improvement and ease in metadata file categorization and processing. As a result, we observed that annotated files are helpful to policy agents: the search space is more reasonable, and the policies can easily identify the type of each item in the file. We intend to take further steps to improve the readability and categorization of metadata files using rule agents. In our upcoming experiments, we intend to develop self-learning and service-based AI agents that can learn from their success and failure rates as well as from user feedback. We will focus on allowing agents to learn from a feedback network and on how this affects the user decision process: a feedback system that can identify how each policy contributed to metadata categorization.
ACKNOWLEDGMENT
The work of Hiba Khalid is supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).

REFERENCES
[1] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2017. Data profiling: A tutorial. In Proceedings of the 2017 ACM International Conference on Management of Data. 1747–1751.
[2] Carlo Batini, Maurizio Lenzerini, and Shamkant B. Navathe. 1986. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys (CSUR) 18, 4 (1986), 323–364.
[3] Rafael Berlanga, Oscar Romero, Alkis Simitsis, Victoria Nebot, Torben Bach Pedersen, Alberto Abelló, and María José Aramburu. 2012. Semantic web technologies for business intelligence. In Business Intelligence Applications and the Web: Models, Systems and Technologies. IGI Global, 310–339.
[4] Sanjay P. Bhat and Dennis S. Bernstein. 2000. Finite-time stability of continuous autonomous systems. SIAM Journal on Control and Optimization 38, 3 (2000), 751–766.
[5] Costin Bǎdicǎ, Lars Braubach, and Adrian Paschke. 2011. Rule-based distributed and agent systems. In International Workshop on Rules and Rule Markup Languages for the Semantic Web. Springer, 3–28.
[6] Serena H. Chen, Anthony J. Jakeman, and John P. Norton. 2008. Artificial intelligence techniques: an introduction to their use for modelling environmental systems. Mathematics and Computers in Simulation 78, 2-3 (2008), 379–400.
[7] Jens Dietrich, Alexander Kozlenkov, Michael Schroeder, and Gerd Wagner. 2003. Rule-based agents for the semantic web. Electronic Commerce Research and Applications 2, 4 (2003), 323–338.
[8] Erik Duval. 2001. Metadata standards: What, who & why. Journal of Universal Computer Science 7, 7 (2001), 591–601.
[9] Neil Foshay, Avinandan Mukherjee, and Andrew Taylor. 2007. Does data warehouse end-user metadata add value? Commun. ACM 50, 11 (2007), 70–77.
[10] Dov M. Gabbay. 1985. Theoretical foundations for non-monotonic reasoning in expert systems. In Logics and Models of Concurrent Systems. Springer, 439–457.
[11] S. Christopher Gladwin, Matthew M. England, Dustin M. Hendrickson, Zachary J. Mark, Vance T. Thornton, Jason K. Resch, and Dhanvi Gopala Krishna Kapila Lakshmana Harsha. 2009. Metadata management system for an information dispersed storage system. US Patent 7,574,579.
[12] Jane Greenberg. 2009. Metadata and digital information. In Encyclopedia of Library and Information Sciences. CRC Press, 3610–3623.
[13] IEEE. [n.d.]. IEEE LOM: IEEE Standard for Learning Object Metadata.
[14] William H. Inmon, Bonnie O'Neil, and Lowell Fryman. 2010. Business Metadata: Capturing Enterprise Knowledge. Morgan Kaufmann.
[15] S. Jha and Mahbub Hassan. 2002. Building agents for rule-based intrusion detection system. Computer Communications 25, 15 (2002), 1366–1373.
[16] Phokion G. Kolaitis. 2005. Schema mappings, data exchange, and metadata management. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 61–75.
[17] Brian F. Lavoie and Richard Gartner. 2005. Preservation metadata. OCLC.
[18] Zhen Li and Manish Parashar. 2004. Rudder: A rule-based multi-agent infrastructure for supporting autonomic grid applications. In International Conference on Autonomic Computing, 2004. Proceedings. IEEE, 278–279.
[19] Marilyn McClelland. 2003. Metadata standards for educational resources. Computer 36, 11 (2003), 107–109.
[20] Renée J. Miller. 2018. Open data integration. Proceedings of the VLDB Endowment 11, 12 (2018), 2130–2139.
[21] Felix Naumann. 2014. Data profiling revisited. ACM SIGMOD Record 42, 4 (2014), 40–49.
[22] Nils J. Nilsson. 1998. Artificial Intelligence: A New Synthesis. Morgan Kaufmann.
[23] NISO. 2004. Understanding Metadata. National Information Standards Organization, NISO Press.
[24] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015. Data profiling with Metanome. Proceedings of the VLDB Endowment 8, 12 (2015), 1860–1863.
[25] David Peto and Stef Lewandowski. 2015. Metadata tagging of moving and still image content. US Patent 8,935,204.
[26] IBM Redbooks. [n.d.]. http://www.redbooks.ibm.com/.
[27] Arun Sen. 2004. Metadata management: past, present and future. Decision Support Systems 37, 1 (2004), 151–173.
[28] John R. Smith and Peter Schirling. 2006. Metadata standards roundup. IEEE MultiMedia 13, 2 (2006), 84–88.
[29] Alessandra Toninelli, Jeffrey Bradshaw, Lalana Kagal, Rebecca Montanari, et al. 2005. Rule-based and ontology-based policies: Toward a hybrid approach to control agents in pervasive environments. In Proceedings of the Semantic Web and Policy Workshop.
[30] Jason Tsay, Alan Braz, Martin Hirzel, Avraham Shinnar, and Todd Mummert. 2020. AIMMX: Artificial Intelligence Model Metadata Extractor. In Proceedings of the 17th International Conference on Mining Software Repositories. 81–92.
[31] John W. Tukey. 1977. Exploratory Data Analysis. Addison-Wesley, Reading, MA.
[32] Jovan Varga, Oscar Romero, Torben Bach Pedersen, and Christian Thomsen. 2014. Towards next generation BI systems: the analytical metadata challenge. In International Conference on Data Warehousing and Knowledge Discovery. Springer, 89–101.
[33] Marcia Lei Zeng. 2008. Metadata. Neal-Schuman Publishers, Inc.