Automated Similarity Modeling for Real-World Applications

Rotem Stram

Knowledge Management Group, German Research Center for Artificial Intelligence, Kaiserslautern, Germany
rotem.stram@dfki.de

Copyright © 2016 for this paper by its authors. Copying permitted for private and academic purposes. In Proceedings of the ICCBR 2016 Workshops, Atlanta, Georgia, United States of America.

Abstract. Many Case-Based Reasoning (CBR) applications rely on experts' opinions and input to design the knowledge base. Although these experts are an integral part of the modeling process, they are human and cannot always provide the amount of information that is needed. To that end, the idea behind this thesis research is to utilize the data's structure to extract relationships between knowledge entities where expert knowledge is not sufficient. The goal is to automatically model the similarity measure between cases and their attributes using methods such as information retrieval (IR), natural language processing (NLP), machine learning, graph theory, and social network analysis (SNA), with an emphasis on SNA, to extract contextual knowledge from a dataset.

Keywords: Similarity, Social Network Analysis, Graph Theory, Sensitivity Analysis, Machine Learning, Information Retrieval

1 Introduction

At the heart of CBR is the similarity measure between two cases. Several successful models have been used to compare cases, both locally and globally. The most notable are taxonomies and matrices for local similarities of attribute-value fields, and the weighted sum for global similarities of entire cases. What these methods have in common is that they rely on a pre-existing similarity value between two items or concepts, which is traditionally modeled by experts in the domain of the system.

Statistical methods have been used in the past, mainly in Textual CBR, to measure the proximity of two items. Methods such as TF-IDF and cosine similarity have been used successfully in the IR field, and augmented with additional information when used in CBR. These augmentations include abbreviations, synonyms, taxonomies, and ontologies as provided by experts [4].

Practical examples include SCOOBIE, where subject-predicate-object triplets in the form of RDF graphs are extracted from text with the help of a pre-existing domain-specific ontology, supplemented with linked open data such as DBpedia [7]. In KROSA, a pre-existing ontology was used to extract requirements from text; phrases and words were obtained using NLP methods and matched against items in the ontology [3]. These approaches adapt poorly, since extending the vocabulary requires considerable effort in both term acquisition and similarity modeling.

Other examples adapt better to a changing and expanding vocabulary. One such approach is that of Probst et al. [5], where attribute-value pairs are extracted from text: seed examples are used to train a model that classifies noun phrases in a semi-supervised manner. However, this work relies on texts with a predictable structure and does not describe how the extracted values relate to each other. In Bach et al. [1], terms are extracted from text with basic NLP techniques and are then assigned to classes by experts, who also model their similarity.
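As a point of reference for the statistical baseline mentioned above, the following is a minimal sketch of comparing textual case descriptions with TF-IDF and cosine similarity. The library choice (scikit-learn) and the example fault texts are illustrative assumptions and not part of the OMAHA setup.

# Minimal sketch: TF-IDF term weighting plus cosine similarity between
# free-text case descriptions. The texts are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

case_texts = [
    "hydraulic pump pressure low during taxi",
    "low pressure warning from hydraulic pump",
    "cabin temperature sensor reads intermittently",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(case_texts)   # sparse term-weight matrix

# Pairwise similarity between all case descriptions, values in [0, 1].
print(cosine_similarity(tfidf).round(2))

In practice such a purely statistical measure is only the starting point that the augmentations listed above (abbreviations, synonyms, taxonomies, ontologies) attempt to improve.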
When dealing with real-world data it is practically impossible to model all items that may occur, due to the usually large amount of data and the varying identities of the contributors. Expert input is an important tool for identifying the main concepts of a domain and understanding how they relate to each other, but it cannot possibly cover the entire range of actual attribute values and the relationships between those values.

In the scope of the OMAHA project for fault diagnosis in the aircraft domain [9], we are dealing with a large semi-structured data set of faults and their solutions. Information about the aircraft and the faulty system is given in the form of symbolic attributes, while the fault description is in natural language, as written by the technicians on site. We have been provided with a list of abbreviations and taxonomies of major concepts, and white- and black-lists will be added in the future. WordNet is being employed for synonyms, but it is too general for our purposes.

2 PhD Research Focus

The goal of the PhD thesis research is to automatically identify important concepts in the corpus and to model the relationships between them. A balance should be reached between expert knowledge and machine learning, taking both into account. Since an expert can assist in finding the main concepts of a corpus and the relationships between them, but cannot predict all possible values, both present and future, this process needs to be automated so that the system can evolve and develop without constant supervision. Methods on which the thesis may focus are IR, NLP, graph theory, SNA, and machine learning.

The current area of research is the definition of similarity for symbolic attributes with a potentially unlimited number of values. These values originate from the textual representation of cases, are extracted using IR and NLP methods, and are then sorted into the different attributes of a case. With this in mind, cases and values can be seen as nodes in a social network that are connected to one another. This allows the utilization of SNA methods to model this interaction and to extract information about it. Using SNA to model similarity is rooted in the idea that nodes that share a similar environment become similar [2]. The knowledge gained can then be used as a starting point for machine learning methods to optimize global similarity.

The OMAHA project is reaching its final stages, but there is still much to be done within the scope of automatic acquisition of similarity. The measures should be applicable to a wide range of scenarios and not be specific to the given project. The research is still at a very early stage, and a focus has not been decided yet.

3 Current Progress

A few concrete steps have been taken to model similarity within the OMAHA project, and in general. The following gives a short description of each.

3.1 Global Similarity with Sensitivity Analysis

A new method of global similarity assessment has been developed, based on a sensitivity analysis of case attributes with respect to the corresponding diagnosis; a paper on the matter has been accepted for publication [8]. The idea is that different attributes may be more important for distinguishing between different diagnoses. For instance, when deciding whether a case belongs to a specific diagnosis, it may be most beneficial to look at the value of attribute a; if the case does not belong to it, then attribute b may play a key role in deciding whether it belongs to a different diagnosis, and so on.
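To illustrate how such diagnosis-dependent weights could be used at retrieval time, the following is a minimal sketch of a weighted-sum global similarity driven by a relevance matrix. The attribute names, weights, and local similarity values are invented for illustration and are not taken from [8].

# Sketch of a diagnosis-dependent weighted sum: each diagnosis has its
# own row of attribute weights, so global similarity is computed per
# candidate diagnosis. All values are illustrative.
relevance_matrix = {
    "pump_failure": {"system": 0.6, "phase": 0.1, "symptom": 0.3},
    "sensor_drift": {"system": 0.2, "phase": 0.3, "symptom": 0.5},
}

def global_similarity(local_sims, diagnosis):
    """Weighted sum of local similarities under one diagnosis."""
    weights = relevance_matrix[diagnosis]
    total = sum(weights.values())
    return sum(w * local_sims[a] for a, w in weights.items()) / total

# Local similarities between a query and one stored case (already computed).
local_sims = {"system": 0.9, "phase": 0.4, "symptom": 0.7}

for diagnosis in relevance_matrix:
    print(diagnosis, round(global_similarity(local_sims, diagnosis), 3))

The same query-case pair can thus receive a different global similarity under each candidate diagnosis, which mirrors the intuition about attributes a and b above.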
The sensitivity analysis combines a statistical analysis of attribute values and their association with each diagnosis with a supervised learning stage in which weights are learned. During this stage a set of retrievals is performed and their outcomes are analyzed; the weights are then updated according to their contribution to the retrieval of irrelevant cases. The result is a weight matrix that gives each attribute a weight under each diagnosis. This method builds on the work of Richter and Wess [6] and extends it to include any type of attribute.

3.2 Natural Language Processing

This is a means to an end, mainly to obtain values for the attributes. Since the fault description texts we were given were written by experts in the field, who were presumably in a hurry and are not native English speakers, POS tagging yielded no coherent results. Instead, patterns had to be created manually with the help of regular expressions and phrase extraction. This part is dataset-specific.

3.3 Similarity Assessment with SNA

After concepts were extracted with the help of NLP, the dataset was regarded as a bipartite graph connecting concepts with diagnoses. Using one-mode projection, a weight between each pair of attribute values was calculated and regarded as the similarity value. This method is still being tested and will be extended in the future. Since we are dealing with a potentially unlimited (but not infinite) number of concepts per attribute, it is very likely that application domains have a long tail of concepts that are used in very few cases. SNA makes it possible to include those concepts as part of the whole system, creating an asymmetrical similarity measure that describes the connection of each value to every other value. An illustrative sketch of one such projection is given after the reference list.

References

1. Bach, K., Althoff, K.-D., Newo, R., Stahl, A.: A case-based reasoning approach for providing machine diagnosis from service reports. In: International Conference on Case-Based Reasoning, pp. 363-377. Springer Berlin Heidelberg (2011)
2. Borgatti, S.P., Mehra, A., Brass, D.J., Labianca, G.: Network analysis in the social sciences. Science 323(5916), 892-895 (2009)
3. Daramola, O., Stålhane, T., Omoronyia, I., Sindre, G.: Using ontologies and machine learning for hazard identification and safety analysis. In: Managing Requirements Knowledge, pp. 117-141. Springer Berlin Heidelberg (2013)
4. Lenz, M.: Textual CBR and information retrieval - a comparison. In: Proceedings of the 6th German Workshop on CBR (1998)
5. Probst, K., Ghani, R., Krema, M., Fano, A.E., Liu, Y.: Semi-supervised learning of attribute-value pairs from product descriptions. In: IJCAI, vol. 7, pp. 2838-2843 (2007)
6. Richter, M.M., Wess, S.: Similarity, uncertainty and case-based reasoning in PATDEX. In: Automated Reasoning, pp. 249-265. Springer Netherlands (1991)
7. Roth-Berghofer, T., Adrian, B., Dengel, A.: Case acquisition from text: ontology-based information extraction with SCOOBIE for myCBR. In: International Conference on Case-Based Reasoning, pp. 451-464. Springer Berlin Heidelberg (2010)
8. Stram, R., Reuss, P., Althoff, K.-D., Henkel, W., Fischer, D.: Relevance matrix generation using sensitivity analysis in a case-based reasoning environment. In: ICCBR (accepted, 2016)
9. German Aerospace Center (DLR): LuFo-Projekt OMAHA gestartet. http://www.dlr.de/lk/desktopdefault.aspx/tabid-4472/15942_read-45359/ (last accessed July 25, 2016)
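As a closing illustration of the one-mode projection described in Sect. 3.3, the following is a minimal sketch in which two concepts are linked through the diagnoses they share, and the weight is normalized by the source concept's degree, which yields an asymmetrical measure. The weighting scheme and the toy data are assumptions made for illustration; they are not the exact projection used in the project.

# Sketch of a one-mode projection of the concept-diagnosis bipartite
# graph onto concepts. Two concepts are related if they co-occur under
# the same diagnoses; dividing by the source concept's degree makes the
# measure asymmetrical. Data and weighting are illustrative only.
from collections import defaultdict

# (concept, diagnosis) edges of the bipartite graph.
edges = [
    ("low pressure", "pump_failure"),
    ("pump noise",   "pump_failure"),
    ("low pressure", "valve_leak"),
    ("temp reading", "sensor_drift"),
]

diagnoses_of = defaultdict(set)
for concept, diagnosis in edges:
    diagnoses_of[concept].add(diagnosis)

def similarity(a, b):
    """Share of a's diagnoses that are also linked to b (asymmetric)."""
    if not diagnoses_of[a]:
        return 0.0
    return len(diagnoses_of[a] & diagnoses_of[b]) / len(diagnoses_of[a])

print(similarity("low pressure", "pump noise"))   # 0.5
print(similarity("pump noise", "low pressure"))   # 1.0

With this toy data the measure is visibly asymmetrical: "pump noise" appears only under a diagnosis it shares with "low pressure", while "low pressure" also appears elsewhere, so the two directions receive different values.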