MDORG: Annotation Assisted Rule Agents for Metadata Files

Hiba Khalid, Université Libre de Bruxelles, Brussels, Belgium, Hiba.Khalid@ulb.ac.be
Esteban Zimányi, Université Libre de Bruxelles, Brussels, Belgium, Esteban.Zimanyi@ulb.ac.be

ABSTRACT
Metadata files are often incomplete and of inadequate quality, with underlying issues such as low maintenance, lack of provenance, inaccurate tagging, limited or no annotation, and differing metadata standards. Such metadata is not generally viable for analysis, large-scale integration, data management, or quality metadata maintenance, and it presents challenges in data analysis and production tasks. The expense associated with metadata discovery, annotation, and management amounts to considerable time and effort. To overcome these issues, leverage can be borrowed from intelligent systems that improve quality and reduce the manual effort associated with metadata discovery, annotation, and management. Intelligent agents can aid in identifying information if they are provided relevant labels or are allowed to learn over time. Annotation for intelligent agents can thus improve tasks such as automated summaries, metadata management, cataloging, and feedback management. We experiment with and propose the utility of annotation for rule agents to facilitate the analysis and organization of metadata files.

KEYWORDS
rule-based agents, textual metadata, metadata, intelligent agents, metadata management, metadata representation, metadata categorization

1 INTRODUCTION
In simple words, metadata [23] defines data, the data collection and recording process, and details about what the data entails. The importance of metadata lies in its inherent quality of maintaining the information about data. Generally, there are three possibilities when dealing with metadata: (i) Case-A: there is no metadata available, and data profiling [24] is required to infer metadata. (ii) Case-B: there is little metadata available; in such cases, mostly some sort of metadata such as the creation date and publisher titles is present. (iii) Case-C: metadata is available but of inadequate quality, i.e., the metadata content is either incorrect or not recorded properly. In this case, the challenge is to extract meaning from the available metadata. There is no speculation or doubt when it comes to metadata usability [13]. The process, however, is non-deterministic and often very resource and time expensive. A shared understanding amongst the scientific community around publicly available metadata can yield several advantages: the more metadata is available, the easier it becomes to retrieve datasets, suggest and recommend datasets, and pre-analyze various data sources. A similar advantage is evident for data lakes [20] and data warehousing [9], where data privacy can be a prominent concern, and metadata can significantly improve data manipulation, data fetch requests, and data analysis [32].

In the context of attaining or producing metadata, the possibilities are endless but are both resource and time expensive. Thus, there is a need for systems that can help generate, understand, extract, and populate metadata. This task is not simplistic; it is multi-phase and requires attention to intricate details such as cross-platform data sharing, data rights, data protection management, and data policies. There is still significant potential to support and exercise the quality production of metadata and its management. We explore the opportunity of leveraging ML techniques to (i) support metadata systems and (ii) explore and categorize information in metadata files using rule agents. Based on policies, rule agents guide the process of identifying metadata types and organize the information at hand for better use.

1.1 Metadata Types
Metadata types include: (1) Descriptive Metadata [33]: describes a dataset or resource for discovery, profiling, and identification; for example, title, authors, keywords, abstract, and comments. (2) Structural Metadata [12]: defines the type, structure, and data relationships. This type of metadata highlights the form and organization of the data under consideration. It can include the order in which information is available as well as its original recording format. The most popular examples include the information schema [2] and the definition schema. (3) Administrative Metadata: as the name suggests, this type of metadata contains information that can help manage and update a source. It typically includes information such as the date of creation, published date, update dates, file type, logs, file definition, access management, and access rights. The most commonly used subsets of administrative metadata are (i) Rights management metadata, which deals with data rights, intellectual property, and sharing policies, and (ii) Preservation metadata [17], which includes the information used to archive resources, manage updates and quality, and preserve resources.
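For illustration, this taxonomy can be encoded as a small enumeration. The sketch below is ours, not part of the paper's system or of any metadata standard, and the field-to-type mapping is hypothetical:

```python
from enum import Enum

class MetadataType(Enum):
    """Metadata types from section 1.1; the two administrative
    subsets are modeled as types of their own for convenience."""
    DESCRIPTIVE = "descriptive"        # title, authors, keywords, abstract
    STRUCTURAL = "structural"          # type, structure, data relationships
    ADMINISTRATIVE = "administrative"  # creation date, logs, access rights
    RIGHTS = "rights"                  # administrative subset: data rights, sharing policies
    PRESERVATION = "preservation"      # administrative subset: archiving, quality management

# Hypothetical mapping of common metadata fields to their type.
FIELD_TYPES = {
    "title": MetadataType.DESCRIPTIVE,
    "keywords": MetadataType.DESCRIPTIVE,
    "information_schema": MetadataType.STRUCTURAL,
    "date_of_creation": MetadataType.ADMINISTRATIVE,
    "license": MetadataType.RIGHTS,
    "archive_policy": MetadataType.PRESERVATION,
}

for field, md_type in FIELD_TYPES.items():
    print(f"{field} -> {md_type.value}")
```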
In the context of this research paper, we discuss (i) the usability of rule agents in understanding metadata files, (ii) the definition of annotations on metadata files (for example, metadata file headers and metadata group headers), and (iii) observations of rule agents using annotated MD files to facilitate metadata file organization, which includes identifying metadata types, content, and dispersed or misplaced metadata. The paper is organized as follows: section 2 discusses the use of rule agents for different applications and use cases; section 3 discusses the working and design of intelligent agents; section 3.1 defines the need for rule agents and how they can be incorporated to support and possibly facilitate metadata organization and categorization. We then discuss the importance of using annotated files for rule agents in section 4. Section 5 provides insights on some of the challenges we encountered, and section 6 provides an overview and conclusion of our research paper. Our design and experiments include metadata files extracted from the Kaggle repository and Data.gov. The examples described in this paper are derived from the UK Road Safety dataset on Kaggle (https://www.kaggle.com/tsiaras/uk-road-safety-accidents-and-vehicles).

© Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 RELATED WORK
This section reviews the most relevant work in the domain of understanding metadata and the role of AI agents.

Intelligent Agents and Metadata: AIMMX [30] provides a library-based metadata model extractor for software repositories; it performs important tasks such as identifying associated resources, model extraction, and model names. Metadata-supportive systems are advancing and taking support from machine learning and AI to build better systems, such as IBM Redbooks [26], which among other functions uses metadata from imagery for specifications. The literature thereby gathers inspiration from rule and knowledge systems to address concerns of semi-automatic categorization and organization of metadata. Rules can have a unique impact on the type of system and available architectures; they affect not only individual agent architectures but multi-agent systems as well [5]. The applicability of rule agents extends to semantic web technologies and architectures [7], where a general architecture for rule agents has been devised and materialized using semantic web languages. Rule-based systems are diversely applicable and tend to facilitate systems in taking crucial decisions such as intrusion detection [15]. Rules for intrusion detection, like other rule systems, have to be pre-defined to identify and differentiate between incoming packets on a network. Testing these systems also requires a set of performance and evaluation criteria that have to be predefined to mark an agent's or system's performance. Another important use case for rule-based agents is Grid applications [18]. Rule-based agents perform adequately for high-scale and data-intensive solutions. Contrary to belief, these applications not only perform well but are highly advisable for autonomic applications, which allow rule agents to operate on a set of rules designed for autonomous systems [4]. The application and use of rule agents in different application systems are apparent and understandable; based on the evidence and performance of rule agents, they are deployed in various applications and systems.

Metadata Management and Profiling: Metadata management has taken many forms over time, and it is important to understand how metadata has evolved over the years in terms of applications and tools. In general, metadata [27] and metadata management [11] [16] are important components of data warehousing and large-scale data integration systems. Amongst the many challenges in the field of metadata, standards [28] [19] [8], data profiling [21], and format normalization are the most prominent concerns. Moreover, the need to maintain and provide reproducible metadata in terms of lineage and provenance is one of the most promising aspects of metadata in the semantic web [3].

3 INTELLIGENT AGENTS
Intelligent agents are capable of performing tasks that are instructed to them. The intuition behind introducing intelligent agents for metadata file understanding and categorization stems from the idea of eliminating or reducing the manual effort required in cleaning, processing, and organizing metadata files. We make use of rule agents to assist in the process of metadata tagging [25], metadata annotation, and metadata file organization. Rule agents, more commonly known as policy agents, function on a set of pre-defined policies to perform a task. However, this is not simplistic; metadata management and understanding pose many challenges for intelligent agents, such as (1) lack of metadata availability, (2) poorly orchestrated metadata, (3) lack of governance [14], (4) bad quality metadata, and (5) lack of metadata standards [28]. The rule agents' design aims to semi-automatically analyze the available metadata in files and perform actions such as re-arranging, assigning useful tags, designating metadata types, and organizing metadata for profiles. The rule sets for different groups of agents were distinct in nature, in order to observe how comprehensively agents dealt with different types of metadata and information available in the metadata files.

3.1 Rule Agents
Rule agents [22] are simple condition-action agents that operate on a provided environment by using the available rules and completing the assigned task. Each rule agent accesses the rules from the rule library and uses its allowed set of actions to analyze and categorize the files. The rule library comprises rule sets that contain multiple rules for a particular category; it contains many rules that correspond to different use cases in metadata file categorization. Each rule set can contain two or more policies. Rule agents use the available rules to organize and categorize information inside a metadata file. The rule library contains all rules that are accessible by the rule agent [6] (see table 1). Each rule set in table 1 comprises subset rules that satisfy the main aspect of that ruleset. For example, ruleset R_1002, i.e., "count columns", contains a total of 4 rules that (1) identify columns, (2) identify column names, (3) match keywords, and (4) count the total column names that appear in a metadata file. Similarly, other rulesets contain appropriate rules to identify and target the main purpose of the ruleset. Some rule sets do not contain many rules and are designed more simplistically; for example, rule set R_1018 contains only one rule, i.e., to search for the empty-line annotation. In our problem design, we work with three libraries: the rule library (contains rule sets), the file library (contains the metadata collection), and the agent library (contains rule agents). Metadata file organization and content identification are obtained as a result of policy application by rule agents on the collected metadata files.

Rule-based systems are designed for knowledge storage, knowledge manipulation, and informed feedback from the system. Rules are designed as part of the rule-based system: a rule-based system first gathers knowledge and then manipulates this knowledge to derive useful information, conclusions, or a set of actions. Rule-based systems are also called expert systems; they can understand, store, and derive inferences from available information. In the context of our research goals, we intend to identify different types of available metadata, comprehensively deduce metadata information, and categorize metadata content properly. Thus, we work with the most traditional type of rule-based system, called a deduction expert system (DES) [10]. The DES works with a domain-specific knowledge base and rules to produce (1) deductions and (2) actions based on choices. The choices can be curated or manually programmed into the expert system. In our case, we deal with both types of choices, which can render multiple actions. We group the types of agent actions as an 'Action Set' to centralize the different types of agent actions a rule agent can perform or attain.

RuleSet ID | Rule Identifier | Rule Description
R_1001 | Data Type Identifier | Identifies the type of data encountered as text, numeric, date type, etc.
R_1002 | Count Columns | Searches the file for the total number of columns; uses keywords and counts whether column names appear in textual descriptions.
R_1003 | Identify Attribute Names | Iterates to find attribute names; if available in the file, creates a list.
R_1004 | Generate Attribute Lists | Generates a list of attributes to be inserted into the metadata file.
R_1005 | Look Up: Rows | Iterates over textual metadata using headers and group headers to find 'total rows' in the dataset.
R_1006 | Look Up: Missing Values | Looks for null value indicators and identifies whether headers or metadata lack value counterparts.
R_1007 | Look Up: Date of Creation | Looks for the date of creation in a metadata file using group headers and metadata types.
R_1008 | Look Up: Update Dates | Looks for data updates in a metadata file using group headers, metadata types, and keywords.
R_1009 | Look Up: License Information | Looks for license information in a metadata file using group headers, metadata types, and keywords.
R_1010 | Look Up: Publisher | Looks for the publisher in a metadata file using group headers, metadata types, and keywords.
R_1011 | Look Up: Data Privacy | Looks for data privacy details in a metadata file using group headers, metadata types, and keywords.
R_1012 | Version Management | Looks for version policies or details in a metadata file using group headers, metadata types, and keywords.
R_1013 | Domain Identifier | Aims to derive a topic sentence or domain of the data using the metadata file, textual summary, and file names.
R_1014 | Search: Keywords | Searches for the word 'keyword' in a metadata file to collect keywords and save them.
R_1015 | Search: Comments | Identifies comments in two ways: (i) using headers and (ii) using annotation tags.
R_1016 | Search: Context CSV | Looks for CSVs that might contain metadata in primary downloads.
R_1017 | Search: Headers | Looks for annotated headers in a metadata file.
R_1018 | Search: Empty Lines | Looks for annotated empty lines in a metadata file.
R_1019 | Search: Group Headers | Looks for annotated group headers in a metadata file.
R_1020 | Search: Metadata Types | Looks for annotated metadata types in a metadata file.
R_1021 | Search: File Type | Looks for annotated file types in a metadata file.
R_1022 | Search: Archives | Looks for archives in a metadata file using header and keyword information.
R_1023 | Search: Data Summary | Looks for a data summary in a metadata file using header and keyword information.
R_1024 | Look Up: Hyperlinks | Looks for hyperlinks inside a metadata file and stores them.

Table 1: Sample policy descriptions for rule-based agents.
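The paper does not publish its implementation, so the following minimal Python sketch shows one plausible shape for the rule library and a condition-action rule agent. All class and function names are our own, and the two sample rules only echo the intent of rulesets R_1018 and R_1024 from table 1:

```python
import re
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rule:
    """A single condition-action rule: if the condition matches a line
    of a metadata file, the action returns a tag for that line."""
    rule_id: str
    condition: Callable[[str], bool]
    action: Callable[[str], str]

@dataclass
class RuleSet:
    ruleset_id: str  # e.g. "R_1018" from table 1
    purpose: str
    rules: list = field(default_factory=list)

class RuleLibrary:
    """Holds every rule set accessible to the rule agents."""
    def __init__(self):
        self.rulesets: dict[str, RuleSet] = {}
    def add(self, ruleset: RuleSet):
        self.rulesets[ruleset.ruleset_id] = ruleset

class RuleAgent:
    """Applies every accessible rule to every line of a metadata file."""
    def __init__(self, agent_id: str, library: RuleLibrary):
        self.agent_id = agent_id
        self.library = library
    def analyze(self, lines: list) -> list:
        tags = []
        for line in lines:
            for ruleset in self.library.rulesets.values():
                for rule in ruleset.rules:
                    if rule.condition(line):
                        tags.append((rule.rule_id, rule.action(line)))
        return tags

# Illustrative rules in the spirit of R_1018 (empty lines) and R_1024 (hyperlinks).
library = RuleLibrary()
library.add(RuleSet("R_1018", "Search: Empty Lines",
                    [Rule("R_1018.1", lambda l: not l.strip(),
                          lambda l: "empty line")]))
library.add(RuleSet("R_1024", "Look Up: Hyperlinks",
                    [Rule("R_1024.1", lambda l: bool(re.search(r"https?://", l)),
                          lambda l: "hyperlink")]))

agent = RuleAgent("RA_101", library)
print(agent.analyze(["Usage Information", "", "See https://data.gov"]))
```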
For an expert system to function end to end, it has to be intuitively designed according to the domain problem. However, the basic components of all expert systems comprise the following: (i) Knowledge Acquisition: the process of collecting knowledge on a domain, defining methods, boundaries, outliers, special cases, etc. (ii) Rule Base or Knowledge Base: a collection, list, or set of indicative rules used to render actions and choices. (iii) Inference Engine: the engine responsible for deducing or understanding the rules listed.

We designed a multi-agent system for metadata categorization, file understanding, and organization. However, in the context and scope of this research paper, we discuss the working and performance of 5 rule agents and omit the rest of the agents due to space limitations. Each agent is randomly assigned a metadata file from the mixed collection, and the agent continues to apply rules until the metadata file is completely analyzed (see figure 1). Once each agent has its file, a number of actions are performed by the agent to understand the file and its content. For example, figure 2 illustrates how agent RA_101 first identifies the prominent file features that are expected to be in the file, such as headers, group headers, empty lines, and metadata content. After this step, each agent takes the metadata contents identified and starts applying rules by accessing the rulesets from the rule library. Each rule application involves the use of allowed agent actions such as read, write, match, drop, delete, and annotation request. An agent requests an annotation if there is no associated rule pre-defined in the rule set, or if the item cannot be categorized, i.e., when the agent is indecisive about an annotation and its associated action.

Figure 1: The overall depiction of how agents interact with the metadata file collection and use percepts and actions to analyze metadata files.

Figure 2: The overall depiction of how agents interact with the metadata file collection to analyze metadata files.

Rule-based agents typically function on a set of pre-programmed policies or a set of rules [29]. These are basic instruction sets for an agent to use and act upon. In the case of metadata categorization, organization, and understanding, rule-based agents are promising. However, autonomy and self-reliance in decisions are difficult to achieve even for intelligent agents. Annotation is a technique that allows the rule agents to (i) make sense of the information and (ii) facilitate future autonomous actions. In our research, we use manual annotation and tagging to observe how intelligent agents perform on and categorize the metadata information provided to them. To facilitate and provide an understanding of the problem to rule agents, we established annotation criteria: figure 4 demonstrates the application of the established annotations, and table 2 highlights the annotations provided to rule agents for different types of metadata files.

For manual annotation, we group the basic contents of a metadata file that an intelligent agent can come across for a certain file type (e.g., text, CSV, TSV, JSON). After pre-processing the different file types into spreadsheets, the annotation process is carried on. After grouping, we add annotations that rule-based agents can use to categorize and organize content in metadata files. We refer to this phase as Anatomy Tagging, i.e., in this step we annotate sections that represent the anatomy of a metadata file; table 2 illustrates the anatomical annotation contents. Headers and group headers are annotated to support deterministic and scalable search through a metadata file. Empty lines, empty paragraphs, or empty chunks can indicate the end of a file, termination triggers, etc. To deal with this, we annotated empty lines in a metadata file and set actions for rule agents to delete and discard them, provided this does not change the information at hand. For instance, textual metadata files contain empty spaces and headers, such as a Description header followed by a bunch of text about the dataset. The agents need context and an understanding of what classifies as a header and what is the actual content or metadata value. For this purpose, we pre-process the metadata files and assign meaningful annotations so the rule agents can process the files and extract, categorize, and organize information more accurately. Incorporating the kind of metadata available in a file is thus critically important for the rule-based agents. We dealt with this challenge by annotating limited information in three broad metadata categories and their subcategories (see section 1.1).

In figure 4, there are five types of information for a rule agent to identify in this use case. Selected types are listed below:

3.1.1 Metadata Group Headers: the annotated file in figure 4 is assigned to a policy agent. The policy or rule-based agent utilizes the annotation to identify each group header and looks for data inside it. The group headers in figure 4 are 'Usage Information', 'Maintainers', and 'Updates'. Policies are designed to equip agents to provide accurate results.

3.1.2 Metadata Content: the second type of information a policy agent encounters is 'metadata'. This is regarded as data in our problem, or more precisely as a value, and it is not annotated. Thus, the policy agent identifies a group header, locates the data, and tags it accordingly, for example, 'data: Group Header_Name'. The first bulk of data, for instance, would be stored as 'data[License: Database: Open Database, Contents: Database Contents, Visibility: Public]: Usage Information'. This information is then available for all policies to explore and exploit. The data inside group headers is utilized to categorize the type of metadata or information available in the next phases. Most of the annotation in metadata files falls under the metadata content category, as most of the information besides headers, group headers, and empty lines is metadata content.

3.1.3 Empty Lines: the third type of information a policy agent encounters for the metadata file shown in figure 4 is empty lines. The empty lines are annotated as 'empty lines' for policy agents to quickly differentiate between data, empty lines, and empty values.

Figure 4: Annotation for different cases in a metadata file: the image indicates the annotation key, i.e., the process used to annotate different cases such as comments, empty lines, metadata types, etc.

Annotation Type | Annotation Description
Metadata File Headers | A metadata file header can include the file title, the dataset title, and other headings or sections in a metadata file, such as Keywords, etc.
Metadata Group Headers | A group header is a header that can represent one or more headers; a collection of textual description that can contain tables and column information.
Metadata Content | The actual value or content against metadata headings: the content inside metadata headers and group headers, and values against metadata types such as "created by".
Metadata Types | Identifies different types of metadata, such as descriptive, administrative, technical, etc.
Empty Lines | An empty line, group of empty lines, or empty space that contains no actual metadata.

Table 2: Annotation types and their descriptions for rule-based agents.
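To make the anatomy-tagging phase concrete, the sketch below attaches table 2 style tags to the lines of a small metadata fragment in the spirit of figure 4. The tag strings and matching heuristics are simplifying assumptions for illustration, not the system's actual annotation code:

```python
# A minimal anatomy-tagging sketch, assuming simple heuristics:
# known group headers are matched verbatim; blank lines become
# 'empty line'; everything else is treated as metadata content.
GROUP_HEADERS = {"Usage Information", "Maintainers", "Updates"}  # from figure 4

def anatomy_tag(lines):
    tagged, current_group = [], None
    for line in lines:
        text = line.strip()
        if not text:
            tagged.append((line, "empty line"))            # table 2: Empty Lines
        elif text in GROUP_HEADERS:
            current_group = text
            tagged.append((line, "metadata group header")) # table 2: Group Headers
        else:
            # table 2: Metadata Content, tagged as 'data: <group header>'
            tagged.append((line, f"data: {current_group or 'file header'}"))
    return tagged

sample = [
    "Usage Information",
    "License: Database: Open Database",
    "Visibility: Public",
    "",
    "Maintainers",
    "Publisher: Kaggle",
]
for line, tag in anatomy_tag(sample):
    print(f"{tag:28} | {line}")
```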
3.2 Annotation Example
Let us look at a simple use case from the collected and annotated metadata file repository: the metadata file acquired for the UK GOV Road Safety Data (2005–2015) from Kaggle (https://www.kaggle.com/tsiaras/uk-road-safety-accidents-and-vehicles/metadata). Figure 4 depicts a simple metadata file extracted from the Kaggle web page. The page does not provide a metadata file; in response, the agent adds a title inferred from other metadata files for the UK GOV Road Safety dataset. It is important to note that plenty more metadata is available on the Kaggle web-link, but it is unorganized and not marked or labeled as metadata. There are typically three use cases that we address, namely: (i) Case-I: no metadata is available; in this case, we use techniques such as exploratory data analysis (EDA) [31] and data profiling [1, 24] to derive meaning from the data itself. (ii) Case-II: metadata is available in high quality; most of the time, however, metadata is either unavailable or of poor quality. (iii) Case-III: metadata is available but either misplaced or of bad quality. This case is where most of this research paper focuses. Mostly, metadata files are messy and unprepared; the objective is to utilize rule agents to organize and make sense of the messy files and generate profiles. To achieve this, the agents must understand the kind of information they are dealing with, so we annotate examples from different types of metadata that rule agents can use to categorize and organize metadata files; figure 4 indicates the annotation strategy applied to a small file. We have worked with a dataset that does provide textual metadata, descriptive metadata, metadata tags, headers, and context, but it is misplaced or, more precisely, unorganized. Our goal was to identify this metadata using rule agents. We have designed an instruction set for metadata files to facilitate our rule agents in detecting, identifying, and categorizing metadata file contents. We annotate these instruction set categories and supply them to the rule-based agents (RBA). Table 2 indicates the annotations we perform the first time files are accessed; these files are annotated and then further processed by rule agents. Table 2 includes the main headers that might appear in a metadata file, and empty lines are identified to avoid repetitive cycles for AI agents. It is important to note that if there are sub-headings, or nested 'headers', we annotate them as a 'Metadata group header'. This helps the agent differentiate between file headers and group headers that can hold a chunk of metadata or information. The chunks of metadata are treated as data and are further analyzed to identify the types of metadata they contain; this is also an annotation category, identified as 'metadata types' in table 2. Once the annotation is complete, we then have to design rule sets that can be utilized by policy agents to render a decision.

The collection of files, annotation, metadata cleaning & organization, and preparation comprise the knowledge acquisition phase. Figure 3 illustrates the four main stages for a policy-based or rule-based agent. The second stage involves the Rule Base: a rule base is a collection of rule sets or policy sets, and these policy sets are designed to support agents in performing different actions. The rule sets are analyzed by an inference engine that makes sense of how each rule matters and applies to the case at hand. Finally, the action center is the set of actions an agent is permitted or allowed; a policy agent can thus perform a certain limited set of actions based on an inference derived by rule application on metadata files.

Figure 3: The overall system process from knowledge acquisition to the action center where agents perform different sets of actions.
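Read end to end, the four stages of figure 3 suggest a simple pipeline. The sketch below wires hypothetical stand-ins for the stages together to show the order of operations; none of the function bodies reflect the actual system:

```python
def knowledge_acquisition(raw_files):
    """Stage 1: collect, clean, and annotate metadata files (ground truth)."""
    return [f.strip() for f in raw_files if f.strip()]

def rule_base():
    """Stage 2: the collection of rule sets / policy sets."""
    return {"R_1017": "search annotated headers",
            "R_1018": "search annotated empty lines"}

def inference_engine(file_lines, rules):
    """Stage 3: decide which rules apply to the file at hand."""
    applicable = []
    for rule_id, description in rules.items():
        if "headers" in description and any(l.isupper() for l in file_lines):
            applicable.append(rule_id)
    return applicable or list(rules)  # fall back to trying every rule

def action_center(applicable_rules):
    """Stage 4: map inferred rules to permitted agent actions."""
    return [("match", rule_id) for rule_id in applicable_rules]

files = knowledge_acquisition(["TITLE", "publisher: kaggle", "  "])
print(action_center(inference_engine(files, rule_base())))
```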
4 EVALUATION
The role of annotated metadata files for rule agents was observed by conducting experiments that evaluated and compared the rule agents' ability to categorize and organize metadata files with and without annotations. Another objective of the experiments was to observe how the rule agents performed if rules were altered or manipulated. The experiments were conducted on a spreadsheet dataset that was gathered from Kaggle metadata files, Kaggle resource descriptions, and Data.gov metadata files from various categories. A total of 1123 metadata files were retrieved from both repositories and added as a "mixed collection" to the working repository for rule agents. Throughout the research paper, we refer to the mixed collection as a collection of metadata files from both Kaggle and Data.gov. As far as our research contribution is concerned, to the best of our knowledge, there are no substantial contributions that directly address the exact definition of this problem. However, there are related tools such as Metanome [24] that can be tested with the technique we have developed, to assess how their performance improves or deteriorates. With regard to the scope of this research paper, we emphasize that the experiments carried out are based on the usability of the technique and the meaningful observations made with regard to metadata management using rule agents. It would go beyond the scope of this research paper to compare and examine the improvement of existing tools by the technique we have developed. Nevertheless, our future work will concentrate on observing performance change and the applicability of our technique to other tools.

4.1 Evaluating Rule Agent Performance
To understand the role of annotated metadata files for metadata categorization and file organization, it is crucial to observe how agents behave with and without annotations. This experiment was performed with a set of five rule agents. In the first phase, these five rule agents were provided metadata files that were not annotated or labeled. The same set of agents was then analyzed and observed in a setting with annotated metadata files. Each agent was randomly assigned metadata files from the entire mixed collection of metadata files retrieved from Kaggle and Data.gov. Figure 5 illustrates the performance of each agent with and without annotations. For instance, AR1 represents rule agent_001 with annotated metadata files, and UR1 represents the performance of the same agent with unannotated metadata files. Each agent in both categories (annotated, unannotated files) was evaluated on four factors: (1) accurate identification of metadata headers, (2) correct identification of metadata group headers, (3) MD content or values against metadata types, and (4) accurate categorization of available metadata into metadata types.

Figure 5: The overall action performance of agents on annotated and unannotated metadata files, involving the use of content, annotations, and sets of action sequences (mixed collection).

File annotation requests were also examined for all agents, to understand how many times an expert user was prompted for a manual annotation request. This happened when the rule agent could not decide on the categorization of the metadata value at hand or on the occurrence of empty blocks, empty lines, etc. (see figure 6). The experiments in figure 6 highlight the frequency of manual annotation requests of 5 rule agents (case-I: agents with annotated files; case-II: 5 rule agents with unannotated metadata files). Each time an agent cannot categorize content, i.e., a data object, header, content, text block, etc., it sends a request for manual annotation to the user. This represents indecisiveness, as these agents are simple policy agents and are not reinforced agents. From this experiment, we expected to learn about the usability of adding annotations for metadata headers, group headers, content, and types. As a result, we observed that unannotated files required significantly more annotation requests than files that were pre-annotated.

Figure 6: The number of manual annotations requested by agents with annotated and unannotated files (mixed collection). The performance of agents depicted in the figure is based on a 75 percent annotation level of metadata files from the mixed collection.

4.2 Detecting Metadata Types and Analyzing Policy Frequency
The second experiment was to observe whether rule agents can independently detect metadata types with and without annotations. The objective was to alter policies for rule agents and observe whether this would affect the overall metadata type detection. Firstly, rule agents without annotations requested a manual annotation 80% of the time when a metadata type had to be detected. Figure 7 indicates the detectable metadata types (Descriptive (DMD), Administrative (AMD), Structural (SMD), Rights (RMD), Preservation (PMD)) across files (recorded in percentages). We observed that most of the metadata files did not contain exact information such as the resource link to the original data repository, update policies, copyrights, etc. In most cases, this metadata could nevertheless be retrieved; we deal with this and regard it as "misplaced metadata". Rule agents with annotated files were thus able to detect more metadata types and were intelligent enough to identify discrepancies or missing information in files.

Figure 7: The figure represents the percentage of files in which rule agents were able to detect different metadata types (mixed collection).

In regards to "misplaced metadata", we made a few observations on the collected files. Some of the most prominent observations were as follows: Lack of metadata support and updates: most websites, resources, and portals do not maintain a complete metadata update cycle. The information on the dataset description page and in the actual metadata file was observed to be incoherent. Lack of metadata download support: repositories and resources describe a resource on the main page or portal but disregard the facility to download this valuable metadata. For example, the repository pages at Kaggle and Data.gov both contain essential descriptive and rights metadata. This information, such as the resource description and keywords, is not readily available for download. It is also not properly tagged and is not automatically embedded or added to downloadable metadata files. In most cases, it has to be manually downloaded or added to the metadata collection.

Another important experiment was to observe how many times a particular rule was accessed and applied by the rule agents. This was observed by maintaining a Rule log in each cycle for all rule agents. Figure 8 depicts the frequency (recorded in percentages) of rules fired by each rule agent. The most useful aspect of this observation experiment was to understand the type and frequency of metadata content prevalent in the collected files. For example, the R_1006 firing frequency is quite high, indicating that the rule agents encountered null and empty values in the majority of metadata files. Similarly, the R_1022 firing frequency was quite low in comparison to other rules, indicating that the majority of the metadata files did not contain information relevant to data archives, metadata archives, or legacy support files and hyperlinks. Another aspect that can be observed from figure 8 is that not all metadata files contain relevant hyperlinks; as indicated, R_1024 was accessed by agents only a couple of times.

Table 1 includes a sample of 24 rule sets that are measured in figure 8, which demonstrates the rule firing frequency by rule agents on the mixed collection and visualizes how often each rule is fired and how many times it is utilized on average by each agent. Table 1 provides insight into the policies available to all rule agents for both annotated and unannotated resources. Each policy defined in table 1 is made available to rule agents based on their tuning. We refer to tuning, i.e., "a set of policies", as the profile setting available to a particular agent. In our research, we experiment with agents in different settings by increasing or decreasing the number of policies accessible by a particular agent. For example, R_1010 searches the file for the keyword "publisher", its synonyms, and approximate matches for this keyword group. This functionality allows a rule agent to scan headers, group headers, and text blocks to locate a valid publisher and its details in a metadata file.

Figure 8: An observation of the average rule firing frequency by rule agents on the mixed collection.

4.3 Agent Actions
The policy agents are assigned a set of actions that can be chosen to attain the next state and complete the assigned task (based on the available policies provided to each agent). The allowed set of actions for any rule agent in this experimental setup includes the following: Read: the agent scans and reads the file ingested into the system. Search: a search action that allows the agent to scan the file, gather insight, and look for common words, indicators, and headers. Write: allows the agent to write into a new metadata file. Match: an action where the agent matches its knowledge of annotations against the available tags in the metadata file, covering cases where there is a need to search for synonyms, etc. Drop: content is dropped or considered invaluable by the agent and added to the metadata file as 'supplementary information'. Delete: the agent decides to delete a piece of content such as consecutive delimiters or misspelled words. Compare: two or more headers, sub-headers, or content pieces are compared for duplicate data identification. Request Annotation: when an agent is indecisive about a set of sentences, such as comments (if not annotated), the agent requests a manual annotation from the user (this action is typically requested after the agent completes the read cycle and before the write operation).
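As an illustration of this action set, the following sketch enumerates the eight actions and raises an annotation request when no tag matches; the per-agent tallies mirror the kind of counts reported in table 5 below. The logger and matching logic are assumptions for this example, not the system's code:

```python
from collections import Counter
from enum import Enum, auto

class Action(Enum):
    READ = auto(); SEARCH = auto(); WRITE = auto(); MATCH = auto()
    DROP = auto(); DELETE = auto(); COMPARE = auto()
    REQUEST_ANNOTATION = auto()

class ActionLogger:
    """Tallies the actions an agent performs, as in table 5."""
    def __init__(self, agent_id):
        self.agent_id = agent_id
        self.counts = Counter()
    def perform(self, action):
        self.counts[action] += 1

def categorize(line, known_tags, log):
    """Try to match a line against known annotation tags; request a
    manual annotation when the agent is indecisive (no tag matches)."""
    log.perform(Action.READ)
    for tag in known_tags:
        log.perform(Action.MATCH)
        if tag.lower() in line.lower():
            log.perform(Action.WRITE)
            return tag
    log.perform(Action.REQUEST_ANNOTATION)
    return None

log = ActionLogger("RA_101")
for line in ["License: Open Database", "???unlabeled block???"]:
    categorize(line, ["License", "Publisher"], log)
print({a.name: n for a, n in log.counts.items()})
```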
In table 5, five rule agents are observed for their action behavior on the same metadata file. The experiment observes the role of policies and how they change the action initiation sequence for each agent. For example, RA_101 does not exclude the dropped and deleted items; thus, its read items equal its written items in the new metadata file, but with a different context. On the other hand, agents like RA_102 and RA_103 do not add these items to the total write count for newly generated metadata files. The difference in operations is based on the accessibility of policies and the action preferences assigned to each agent. All five agents were observed under different settings to understand how actions and operations differ based on agent action allowances. Thus, by changing policies, we were able to observe significant changes in the performance of agents on the same metadata files.

Actions | RA_101 | RA_102 | RA_103 | RA_104 | RA_105
Read | 1189 | 1189 | 1189 | 1189 | 1189
Search | 1458 | 1650 | 1687 | 1465 | 1093
Write | 1189 | 1192 | 1125 | 1143 | 1061
Match | 4567 | 4987 | 4976 | 3456 | 1148
Drop | 10 | 120 | 56 | 34 | 26
Delete | 0 | 3 | 8 | 12 | 8
Compare | 201 | 189 | 300 | 450 | 120
Annotation Req | 109 | 80 | 344 | 450 | 122

Table 5: An overview of average agent actions on a metadata file, illustrating the importance of different profile settings for rule agents and their impact on agent actions.

4.4 Agent Performance and Analysis
To obtain a better picture and understanding of agent performance, we designed three major categories of experimental observation. Firstly, we tested our technique and rule agents on the mixed collection, i.e., the collection of metadata files obtained from Kaggle and Data.gov; figure 9 illustrates the performance of each rule agent on this collection. The second experimental setting was to observe how agents performed with different levels of annotated datasets. We defined three annotation levels, starting at 25%, then 50%, and concluding with 75% annotated metadata files; table 3 provides an analysis of each agent's performance in accordance with the annotation percentage. By observing how the agents performed, we learned that annotated metadata files significantly improve agent performance, even in the most difficult cases such as metadata type identification. Most importantly, from this experiment we observed how each agent was affected by its primary policy selection (i.e., the setting for each agent in terms of allowed or accessible policies), as table 3 shows.

Annotation Percentage | RA_101 | RA_102 | RA_103 | RA_104 | RA_105
25% | 47% | 45% | 57% | 59% | 43%
50% | 65% | 78% | 69% | 75% | 84%
75% | 95% | 92% | 89% | 78% | 88%

Table 3: The change in behavior of rule agents at annotation levels of 25%, 50%, and 75%.

Figure 9: An observation of overall agent performance in terms of accuracy, precision, and recall.
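A hedged sketch of how the annotation-level comparison behind table 3 could be run: pre-annotate a fixed fraction of each file's lines, let a toy agent categorize the rest, and score against ground truth. The toy agent and scoring here are illustrative assumptions, not the experimental harness used in the paper:

```python
import random

def annotate_fraction(lines, ground_truth, fraction, seed=0):
    """Pre-annotate a given fraction of lines (25%, 50%, 75%, ...)."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(lines)), int(len(lines) * fraction))
    return {i: ground_truth[i] for i in chosen}

def toy_agent(lines, pre_annotations):
    """Trusts pre-annotations; falls back to a naive guess elsewhere."""
    out = {}
    for i, line in enumerate(lines):
        if i in pre_annotations:
            out[i] = pre_annotations[i]
        else:
            out[i] = "empty line" if not line.strip() else "metadata content"
    return out

def accuracy(predicted, ground_truth):
    hits = sum(predicted[i] == ground_truth[i] for i in ground_truth)
    return hits / len(ground_truth)

lines = ["Usage Information", "License: Open Database", "", "Maintainers"]
truth = {0: "group header", 1: "metadata content", 2: "empty line", 3: "group header"}
for level in (0.25, 0.50, 0.75):
    preds = toy_agent(lines, annotate_fraction(lines, truth, level))
    print(f"annotation level {level:.0%}: accuracy {accuracy(preds, truth):.0%}")
```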
4.5 Metadata Collection Analysis
It is critical to understand how the technique responds to different types of datasets. In our case, the aim was first to collect metadata files that can be analyzed by rule agents. In the context of our research problem, we did not limit the collection of metadata files to a certain domain; the collection thus comprised metadata files from various domains such as accidents, geospatial data, pharmacy, music, and movies, to name a few. The challenge was, however, to understand and find relevance (if it persists) in these metadata files, and secondly to identify the different constituents of metadata. Based on our collection, we worked with a total of 1123 metadata files: 563 files were obtained from the Kaggle repository (on numerous domains and topics), and 560 metadata files were retrieved from the Data.gov repository (likewise on numerous domains and topics). Table 4 depicts our overall observations and understanding of the metadata collection. We analyzed both collections separately on the following criteria:

4.5.1 Pre-Processing: this metric identifies the amount of pre-processing required on files before they can be fed to the rule agents for further categorization and analysis. Pre-processing is an important aspect, as most metadata files are raw and do not come processed with preambles, labels, and annotations. Moreover, the file structures and standards vary in the majority of cases. We observed that metadata files from Kaggle required more pre-processing in terms of missing values, incorrect values, and file standardization. Metadata from Kaggle also had many cases of "misplaced metadata", making pre-processing a necessary step in the process. On the other hand, metadata retrieved from Data.gov required only basic cleaning and file preparation and was already in structured formats.

4.5.2 Rule Applicability Rate: we also observed how many rules were applicable on a pre-processed metadata file from both collections. Table 4 identifies a relative difference in rule applicability between the two data collections. Due to their more header- and tag-oriented file structure, the metadata files from Data.gov were perceived to be more rule-applicable in comparison to the Kaggle collection.

4.5.3 Rule Failure Rate: the rule failure rate is described as a failed attempt of rule application by a designated rule agent. If an agent executes a rule and there is no profitable outcome, i.e., no conclusive agent action is performed and no task is completed, it is recorded as a failure. The files from Kaggle had an 11% failure rate, and the metadata collection from Data.gov had an 8% failure rate, as depicted in table 4.

4.5.4 Rule In-applicability Rate: this is defined in terms of the lack of metadata elements or metadata content inside metadata files. The unavailability of information in metadata files causes rule in-applicability: certain rules were never applied by agents because the corresponding elements, keywords, headers, or information content was not available. From the collection of agent policies, a total of 16% of rules were inapplicable in different phases on files retrieved from Kaggle, and a total of 5% of rules were inapplicable on metadata gathered from Data.gov (see table 4).

Dataset | Total Files | Pre-Processing | Rule Applicability Rate | Rule Failure Rate | Rule In-applicability Rate
Kaggle | 563 | 91% | 73% | 11% | 16%
Data.gov | 560 | 45% | 87% | 8% | 5%

Table 4: Dataset collection comparison and analysis.
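The three per-collection rates in table 4 can be computed directly from a rule log. Below is a minimal sketch under the assumption that each log entry records whether a rule application succeeded, failed, or was never applicable; the outcome labels are ours:

```python
from collections import Counter

def collection_rates(rule_log):
    """rule_log: list of (rule_id, outcome) pairs with outcome in
    {'applied', 'failed', 'inapplicable'}, one entry per rule attempt."""
    outcomes = Counter(outcome for _, outcome in rule_log)
    total = sum(outcomes.values())
    return {
        "rule_applicability_rate": outcomes["applied"] / total,
        "rule_failure_rate": outcomes["failed"] / total,
        "rule_inapplicability_rate": outcomes["inapplicable"] / total,
    }

# Toy log shaped like the Data.gov numbers in table 4 (87% / 8% / 5%).
log = ([("R_1017", "applied")] * 87
       + [("R_1006", "failed")] * 8
       + [("R_1022", "inapplicable")] * 5)
for name, value in collection_rates(log).items():
    print(f"{name}: {value:.0%}")
```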
5 CHALLENGES
In this section, we discuss some of the most difficult challenges we encountered; for some of them, we provide partial solutions and extend them in our future work. One of the most eminent challenges was Metadata collection: metadata files were available for some datasets, but most of the metadata was misplaced or had to be gathered from other resources. Thus, one file with all metadata was a rarity.

Metadata Pre-processing: In almost all of the files collected from the Kaggle repository and Data.gov, the metadata had to be cleaned and prepared before it could be utilized by our system. Pre-processing is a particular challenge, as all files are in different formats, the information does not strictly adhere to standards, and the files are populated differently for each dataset. Thus, metadata file pre-processing is one of the challenges that we currently deal with and address. Since most of the metadata files were raw, and the metadata was dispersed and misplaced, a pre-processing step had to be established.

Conflicting Labels and Label Cleaning: We address this by designing labels that avoid conflicts in the ground truth. The strategy we defined includes a growth pattern that allows labels to evolve and change over time if needed. We set this up by providing functions that allow the user to manipulate non-critical labels (the implementation details of this feature are beyond the scope of this research paper). For the context of the current research paper, the labels added are marked for each file, and a non-conflicting hierarchy of labels is enforced to avoid confusion for the rule agents. For example, if a file contains more than one main header, each must explicitly contain separate information and be worded differently: one file cannot have two main headers whose title says "Descriptive Metadata" or "Publisher". Such hierarchical control allows for information categorization and avoids conflicting labels. We have extended our work on dealing with conflicting labels and intend to improve it as we move along.

Nested Files and Data: In the context of our research, we only deal with non-nested, semi-structured files.

Failure Despite Annotations: We prepared the ground truth to deal with the most frequent and available information in metadata files by providing annotations. However, even with annotation practices, the rule agents were unable to provide a decisive result in cases such as identifying the most recent metadata file version. This was a failed use case for the rule agents: the annotation and content information was insufficient to draw the necessary conclusion.

Unreadable Text: The collection contained metadata files that had discrepancies in terms of incorrect text (text in English with incorrect spellings, etc.) and indiscernible text (text in another language). In both cases, the rule agents were unable to classify the text into any of the categories. Since we do not provide multi-lingual support, and many metadata files contain information in different languages such as French, Chinese, and Italian, this remains a challenge for our current research.

Incomprehensible Metadata Files: The collection comprised files whose metadata information, such as schemas and legacy links, was contained inside an image. This was a challenge and a limitation of our system, since we do not process image files; we had to exclude such files from our collection.

Inconclusive Column Names: This is a semantic challenge that is not part of our problem description, as we deal with metadata files, categorize the information inside them, and clean it for better reusability. Nonetheless, we identified that about 25% of the metadata files in the extended collection (before scrutiny; the final collection contained only 1123 files) had column names that did not represent the information accurately. Also, column names and column values were often conflicting: if the column name was, for example, "Publisher", it would be expected to contain the dataset publisher name, date, or resource link, but it contained a phone number or city name instead.

Empty Files: We address this issue partially in our current work by identifying blocks of empty data and empty files. This is a challenge more oriented towards understanding file structures and identifying different tables or contents inside a file, which is beyond the scope of our research; we only identify empty metadata files in our extended collection and disregard them from the main collection, i.e., the 1123 files.

Inconclusive Headers: This is a challenge that we address in our current research and are extending in our future work as well. The problem concerns naming conventions and identifying whether a title or text is to be considered a header or sub-header. Primarily, we deal with this through intuitive annotation techniques in our ground truth. We intend to extend this approach by developing a tree-based structure for titles, headers, and sub-headers. The tree would be designed to identify different levels of headers and text, removing the ambiguity of inconclusive headers and their hierarchy in metadata files.

Other Challenges: Another case was misplaced commas and delimiters; rule agents failed here because preparation was done manually and no rules directly addressed punctuation, delimiters, misplaced commas, etc. Another example of rule agent failure is the case of broken text paragraphs and empty columns: most of the time the rule agents were able to identify an empty row or column if it was labeled, but we did observe rule agents stuck in a cycle of interpreting empty rows in sheets. Finally, missing values were also an issue even with annotations, since rule agents work by example: the rule agents failed to identify whether the content or "metadata value" was placed against the correct metadata category.
6 CONCLUSION
In this paper, we have discussed the prospect of utilizing rule agents to facilitate metadata categorization based on two types of metadata files, i.e., annotated MD files and unannotated metadata files. To support and enhance the use of metadata in data management and business information systems, it is crucial to minimize the amount of time spent on pre-processing and categorization. We provide a policy-based system, operated by policy agents, to identify, observe, and categorize available metadata files for reusability and easy understanding. To achieve this, we explored the possibilities of changing policies and observing the behavior of the agents. By our observation, changes in the available and accessible policies affected an agent's ability to correctly categorize a piece of metadata in a file. We also observed that different files contain information that might be completely irrelevant based on the policies defined in our research. This method of experimentation helped us understand the core components of metadata files and how policy agents can best fit a metadata file categorization use case. The second experiment we conducted was to utilize manually annotated files for rule agents, to observe the improvement and ease in metadata file categorization and processing. As a result, we observed that annotated files are helpful to policy agents: the search space is more reasonable, and the policies can easily identify the type of each item in the file. We intend to take further steps to improve the readability and categorization of metadata files using rule agents. In our upcoming experiments, we intend to develop self-learning and service-based AI agents that can learn from their success and failure rates as well as from user feedback. We will focus on allowing agents to learn from a feedback network and on how this affects the user decision process: a feedback system that can identify how each policy contributed to metadata categorization.
ACKNOWLEDGMENT
The work of Hiba Khalid is supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).

REFERENCES
[1] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2017. Data profiling: A tutorial. In Proceedings of the 2017 ACM International Conference on Management of Data. 1747–1751.
[2] Carlo Batini, Maurizio Lenzerini, and Shamkant B. Navathe. 1986. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys (CSUR) 18, 4 (1986), 323–364.
[3] Rafael Berlanga, Oscar Romero, Alkis Simitsis, Victoria Nebot, Torben Bach Pedersen, Alberto Abelló, and María José Aramburu. 2012. Semantic web technologies for business intelligence. In Business Intelligence Applications and the Web: Models, Systems and Technologies. IGI Global, 310–339.
[4] Sanjay P. Bhat and Dennis S. Bernstein. 2000. Finite-time stability of continuous autonomous systems. SIAM Journal on Control and Optimization 38, 3 (2000), 751–766.
[5] Costin Bǎdicǎ, Lars Braubach, and Adrian Paschke. 2011. Rule-based distributed and agent systems. In International Workshop on Rules and Rule Markup Languages for the Semantic Web. Springer, 3–28.
[6] Serena H. Chen, Anthony J. Jakeman, and John P. Norton. 2008. Artificial intelligence techniques: an introduction to their use for modelling environmental systems. Mathematics and Computers in Simulation 78, 2-3 (2008), 379–400.
[7] Jens Dietrich, Alexander Kozlenkov, Michael Schroeder, and Gerd Wagner. 2003. Rule-based agents for the semantic web. Electronic Commerce Research and Applications 2, 4 (2003), 323–338.
[8] Erik Duval. 2001. Metadata standards: What, who & why. Journal of Universal Computer Science 7, 7 (2001), 591–601.
[9] Neil Foshay, Avinandan Mukherjee, and Andrew Taylor. 2007. Does data warehouse end-user metadata add value? Commun. ACM 50, 11 (2007), 70–77.
[10] Dov M. Gabbay. 1985. Theoretical foundations for non-monotonic reasoning in expert systems. In Logics and Models of Concurrent Systems. Springer, 439–457.
[11] S. Christopher Gladwin, Matthew M. England, Dustin M. Hendrickson, Zachary J. Mark, Vance T. Thornton, Jason K. Resch, and Dhanvi Gopala Krishna Kapila Lakshmana Harsha. 2009. Metadata management system for an information dispersed storage system. US Patent 7,574,579.
[12] Jane Greenberg. 2009. Metadata and digital information. In Encyclopedia of Library and Information Sciences. CRC Press, 3610–3623.
[13] IEEE. [n.d.]. IEEE LOM: IEEE Standard for Learning Object Metadata.
[14] William H. Inmon, Bonnie O'Neil, and Lowell Fryman. 2010. Business Metadata: Capturing Enterprise Knowledge. Morgan Kaufmann.
[15] S. Jha and Mahbub Hassan. 2002. Building agents for rule-based intrusion detection system. Computer Communications 25, 15 (2002), 1366–1373.
[16] Phokion G. Kolaitis. 2005. Schema mappings, data exchange, and metadata management. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 61–75.
[17] Brian F. Lavoie and Richard Gartner. 2005. Preservation metadata. OCLC.
[18] Zhen Li and Manish Parashar. 2004. Rudder: A rule-based multi-agent infrastructure for supporting autonomic grid applications. In International Conference on Autonomic Computing, 2004. Proceedings. IEEE, 278–279.
[19] Marilyn McClelland. 2003. Metadata standards for educational resources. Computer 36, 11 (2003), 107–109.
[20] Renée J. Miller. 2018. Open data integration. Proceedings of the VLDB Endowment 11, 12 (2018), 2130–2139.
[21] Felix Naumann. 2014. Data profiling revisited. ACM SIGMOD Record 42, 4 (2014), 40–49.
[22] Nils J. Nilsson. 1998. Artificial Intelligence: A New Synthesis. Morgan Kaufmann.
[23] NISO. 2004. Understanding Metadata. National Information Standards Organization, NISO Press.
[24] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015. Data profiling with Metanome. Proceedings of the VLDB Endowment 8, 12 (2015), 1860–1863.
[25] David Peto and Stef Lewandowski. 2015. Metadata tagging of moving and still image content. US Patent 8,935,204.
[26] IBM Redbooks. [n.d.]. http://www.redbooks.ibm.com/.
[27] Arun Sen. 2004. Metadata management: past, present and future. Decision Support Systems 37, 1 (2004), 151–173.
[28] John R. Smith and Peter Schirling. 2006. Metadata standards roundup. IEEE MultiMedia 13, 2 (2006), 84–88.
[29] Alessandra Toninelli, Jeffrey Bradshaw, Lalana Kagal, Rebecca Montanari, et al. 2005. Rule-based and ontology-based policies: Toward a hybrid approach to control agents in pervasive environments. In Proceedings of the Semantic Web and Policy Workshop.
[30] Jason Tsay, Alan Braz, Martin Hirzel, Avraham Shinnar, and Todd Mummert. 2020. AIMMX: Artificial Intelligence Model Metadata Extractor. In Proceedings of the 17th International Conference on Mining Software Repositories. 81–92.
[31] John W. Tukey. 1977. Exploratory Data Analysis. Addison-Wesley, Reading, MA.
[32] Jovan Varga, Oscar Romero, Torben Bach Pedersen, and Christian Thomsen. 2014. Towards next generation BI systems: the analytical metadata challenge. In International Conference on Data Warehousing and Knowledge Discovery. Springer, 89–101.
[33] Marcia Lei Zeng. 2008. Metadata. Neal-Schuman Publishers, Inc.