=Paper=
{{Paper
|id=Vol-2958/paper5
|storemode=property
|title=Automated Assistance for Data Modelers combining Natural Language Processing and Data Modeling Heuristics: A Prototype Demonstration
|pdfUrl=https://ceur-ws.org/Vol-2958/paper5.pdf
|volume=Vol-2958
|authors=Benjamin Ternes,Kristina Rosenthal,Stefan Strecker
|dblpUrl=https://dblp.org/rec/conf/er/TernesRS21
}}
==Automated Assistance for Data Modelers combining Natural Language Processing and Data Modeling Heuristics: A Prototype Demonstration==
Benjamin Ternes1 , Kristina Rosenthal1 , and Stefan Strecker1
Enterprise Modelling Research Group, University of Hagen, Hagen, Germany,
{benjamin.ternes,kristina.rosenthal,stefan.strecker}@fernuni-hagen.de
Abstract. Identifiers of model elements convey semantics of concep-
tual models essential to interpretation by human viewers. Prior research
shows that devising meaningful identifiers for model elements challenges
data modelers from early learning stages to advanced levels of mod-
eling expertise, constituting one of the most common difficulties data
modelers face. We demonstrate the Automated Assistant, an integrated
modeling tool support component combining natural language process-
ing techniques and data modeling heuristics to provide data modelers
with modeling-time feedback on identifying and signifying entity types,
relationship types, and attributes with meaningful identifiers. Different
from other approaches to automating assistance for data modelers, the
Automated Assistant implementation does not rely on fixed reference
solutions for modeling tasks as it processes (m)any natural language de-
scriptions of modeling tasks. We report on the current state of prototype
development, discuss the Automated Assistant implementation and out-
line future work.
Keywords: Conceptual Modeling · Data Modeling · Modeling Tool ·
Natural Language Processing · Process-oriented Feedback
1 Introduction
Model element identifiers (labels), e. g., for entity types, relationship types and
attributes in an Entity-Relationship (ER) diagram, carry and convey semantics
important to sensible interpretation of conceptual data models by human view-
ers, semantics transcending the formal semantics of model elements [12]. Devising
meaningful, appropriate, and expedient identifiers is a prerequisite for compre-
hensible and usable conceptual models [7]. Empirical research shows that data
modelers oftentimes face difficulties devising meaningful identifiers for model
elements from early learning stages to advanced levels of modeling expertise, de-
scribed as one of the most common difficulties beginning learners face [1,11].
(Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).)
To
support learners of data modeling in devising meaningful identifiers for model
elements, we design and implement the Automated Assistant, a research pro-
totype that provides modeling-time feedback to data modelers on their choice
of model element identifiers based on the natural language description of the
respective modeling universe of discourse (cf. [14]).
The primary use case of the Automated Assistant is a learning context in
which learners of data modeling are asked to create an ER diagram based on
a textual description of a modeling task—a common learning scenario in many
settings. Research, design and prototype implementation of the Automated As-
sistant combine and build on research on data modeling heuristics for identifier
formulation from natural language descriptions [2,15] and on research on Natu-
ral Language Processing (NLP) [3,4]. Different from prior work (e. g., [5,8,9,15]),
the Automated Assistant does not require the time-consuming ex ante construc-
tion of (multiple) reference solutions to a specific modeling task description but
works on (m)any natural language descriptions of a modeling task (as long as
they comply with English grammar).
The Automated Assistant prototype described in the present work substantially
extends an earlier prototype implementation (cf. [14]). Specifically, we extend
the number of implemented data modeling heuristics for better detection of
compound nouns signifying entity types as well as for detecting associated
attributes, and we enhance the earlier prototype implementation with an
identification component for the data types of attributes. Moreover, the
prototype implementation now presents generated candidates for element
identifiers to the modeler with color highlighting.
2 The Automated Assistant Prototype Overview
The Automated Assistant prototype implementation builds on TOOL, a model-
ing tool web application [13]. TOOL provides a graphical data modeling editor
which implements a variant of the Entity-Relationship Model [10] to create ER
diagrams. The Automated Assistant hooks into the graphical editor to assist
data modelers at modeling-time with suggestions on signifying element identi-
fiers. The Automated Assistant implementation utilizes the Stanford CoreNLP
toolkit [6] which provides a pipeline framework for deriving linguistic annota-
tions from natural language text input, in combination with heuristic rules for
data modeling that we adapt and refine for the in-browser ER modeling vari-
ant. Moreover, the Automated Assistant implements widely accepted naming
conventions for ER modeling (e. g., [2]).
In a nutshell, the prototype implementation of the Automated Assistant es-
sentially applies a two-step procedure: First, starting from a (1) (natural lan-
guage) modeling task description, a list of potential candidates for meaningful
identifiers for entity types, associated attributes, and relationship types is gener-
ated. Then, based on the results of step (1), feedback (2) is provided to modelers
at modeling-time in the graphical modeling editor on their choice of model el-
ement identifiers and data types—potential candidates for meaningful entity
[Fig. 1 (diagram): the Stanford CoreNLP pipeline (adapted for our application context) runs tokenization (tokenize), sentence splitting (ssplit), part-of-speech tagging (pos), morphological analysis (lemma; e.g., gender, dcoref), named entity recognition (ner), syntactic parsing (parse), and other annotators on the natural language description of a modeling task, yielding annotated text (e.g., amod, NN); 37 heuristics rules and 3 naming conventions, each with a rule head (if) and a rule tail (then), are then executed on the linguistic annotations to generate lists of candidates.]
Fig. 1. Overview of the adapted annotation pipeline to generate a list of candidates
for entity types, associated attributes, and relationship types (exemplary linguistic
annotations generated using the Stanford CoreNLP toolkit at https://corenlp.run).
type identifiers as well as corresponding attributes, and data types are color
highlighted in the textual description displayed to the modelers. To generate a
list of potential candidates based on the natural language description (cf. step 1),
the pipeline of the CoreNLP toolkit is adapted to the primary use case of the
Automated Assistant (see Fig. 1) and enhanced in several design iterations by
revising the applied heuristic rules that classify patterns of English-language
statements in order to identify candidate identifiers of model elements (e. g., [2]).
The identification of potential candidates mainly relies on part-of-speech tagging
and statistical dependency parsing to identify words which modify nominal
phrases (e. g., adjectival modifiers) and to establish grammatical relations
between words (e. g., typed dependencies). The Automated Assis-
tant utilizes multiple NLP techniques (built into the CoreNLP implementation)
for the analysis of modeling task descriptions in a specific order: tokenization,
sentence splitting, part-of-speech tagging, morphological analysis, and syntactic
parsing (see Fig. 1). The subsequent identification of candidates for identifiers
for entity types, associated attributes, and relationship types builds on linguis-
tic annotations returned from CoreNLP as a semantic graph consisting of lists
of tuples. Starting from these linguistic annotations, the revised prototype im-
plementation applies 37 heuristic rules (13 on relationship types, 13 on entity
types, and 11 on associated attributes) and 3 naming conventions for formulat-
ing model element identifiers to match them with the syntactical functions of
the annotation pipeline. The heuristic rules and naming conventions are recon-
structed, adapted and refined from a comprehensive review of prior work (the
adapted annotation pipeline is described in further detail in [14]).
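As a rough illustration of how such heuristic rules map linguistic annotations to identifier candidates, the following Python sketch applies two simplified rules to pre-annotated (token, part-of-speech tag) pairs. The rules, tag handling, and example sentence are hypothetical simplifications; the actual prototype executes 37 rules against CoreNLP's semantic graph of typed-dependency tuples.

```python
# Simplified sketch of rule-based candidate generation: rule heads (if)
# match patterns in the linguistic annotations, rule tails (then) emit
# identifier candidates. Plain (token, POS tag) pairs stand in for
# CoreNLP's richer semantic graph output.

def rule_entity_candidates(tagged):
    # If a token is a common noun (NN/NNS), propose its capitalized form
    # as an entity type identifier candidate.
    return [tok.capitalize() for tok, pos in tagged if pos in ("NN", "NNS")]

def rule_relationship_candidates(tagged):
    # If a token is a verb (VB*), propose it as a relationship type candidate.
    return [tok.lower() for tok, pos in tagged if pos.startswith("VB")]

# Hypothetical pre-annotated sentence: "A consultant masters a language."
tagged = [("A", "DT"), ("consultant", "NN"), ("masters", "VBZ"),
          ("a", "DT"), ("language", "NN"), (".", ".")]

entities = rule_entity_candidates(tagged)         # ['Consultant', 'Language']
relations = rule_relationship_candidates(tagged)  # ['masters']
```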
From a modeler’s perspective, the Automated Assistant provides feedback
(cf. step 2) whenever a modeler devises an identifier for a model element on
the modeling canvas of the graphical modeling editor (see Fig. 2). The provided
feedback aims at supporting modelers in developing an understanding of how to
identify and signify model elements meaningfully, and, in particular, to encourage
modelers to rethink their choices for model element identifiers based on sensible
auto-generated suggestions of the Automated Assistant. Hence, the Automated
Assistant provides feedback in three categories: (a) ‘Great’ for positive feedback,
e. g., if a label matches an entity type of the generated candidate list (in green);
(b) ‘Reconsider’ for a label that is not mentioned in the modeling task description
or that is not identified as a possible candidate for an entity type, attribute, or
relationship type (problem-oriented feedback; in orange); (c) ‘Convention’ if the
input does not follow the implemented naming conventions, e. g., if an entity
type label begins with a lowercase letter, or a spelling error is detected (neutral
feedback; in yellow). As a major usability enhancement to an earlier version of the
Automatic Assistant (cf. [14]), we have further extended the integrated modeling
tool support to color highlight possible entity types and associated attributes in
the textual description, e. g., when there is uncertainty or when the modeler
wants to compare them to the chosen model element identifiers on the modeling
canvas (see Fig. 2, right side). Compared to earlier prototype implementations,
the Automated Assistant now provides more accurate feedback on data types of
attributes based on named entity recognition. A short video demonstrator of the
Automated Assistant is available at: https://video.fernuni-hagen.de/Play/896
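The three feedback categories can be sketched as a simple classification. The function name and candidate list below are hypothetical, and the prototype implements several naming conventions beyond the single uppercase-initial check shown here.

```python
# Sketch of the three feedback categories ('Great', 'Reconsider',
# 'Convention'). The candidate list would come from the NLP pipeline;
# only one naming convention is checked here for brevity.

def feedback(label, entity_candidates):
    if label in entity_candidates:
        return "Great"        # matches a generated candidate (green)
    if not label[:1].isupper():
        return "Convention"   # violates a naming convention (yellow)
    return "Reconsider"       # not found among the candidates (orange)

candidates = ["Consultant", "ProgrammingLanguage"]
feedback("Consultant", candidates)   # -> 'Great'
feedback("consultant", candidates)   # -> 'Convention'
feedback("Developer", candidates)    # -> 'Reconsider'
```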
Fig. 2. Feedback on the modeler’s choice of model element identifiers (‘consultant’,
‘ProgrammingLanguage’, etc.) provided by the Automated Assistant (highlighted in
color below the ER diagram and on the right side in the task description), based on
its processing of the natural language description of the task shown on the right.
3 Discussion and Outlook
The Automated Assistant provides modeling-time feedback to modelers on their
choice of model element identifiers based on natural language processing of mod-
eling task descriptions to assist in overcoming one of the most common (learning)
difficulties data modelers face, i. e., devising meaningful and expedient identi-
fiers for model elements [1,10]. Evaluation of the feedback generated by the
Automated Assistant from three modeling task descriptions of increasing length
and complexity suggests that the generated feedback is similar to human (e. g.,
instructor’s) advice, and helpful to learners of data modeling.
Compared to earlier prototype implementations (see [14]), precision and re-
call for five demonstration cases, i. e., modeling tasks used in teaching concep-
tual data modeling, improved in recognizing entity type and relationship type
identifiers from the text descriptions by an average of 0.045, with the largest in-
crease in attribute identification (entity types: precision 0.88 (from 0.84), recall
0.97 (from 0.94); attributes: precision 0.86 (from 0.79), recall 0.79 (from
0.71); relationship types: precision 0.84 (from 0.84), recall 0.74 (from 0.69)).
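These values follow the standard definitions of precision and recall over the sets of generated and reference identifiers, which can be sketched as follows (the identifier sets and counts are illustrative, not taken from the evaluation):

```python
# Precision/recall over identifier sets, as used to compare generated
# candidates against a reference solution. The identifier sets below are
# hypothetical examples, not data from the evaluation.

def precision_recall(generated, reference):
    """Precision: share of generated identifiers appearing in the reference;
    recall: share of reference identifiers that were generated."""
    generated, reference = set(generated), set(reference)
    true_positives = len(generated & reference)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

p, r = precision_recall(
    ["Consultant", "Project", "Language", "Invoice"],
    ["Consultant", "Project", "Language", "Customer", "Contract"])
# p == 0.75 (3 of 4 generated are correct), r == 0.6 (3 of 5 found)
```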
The newly added identification of data types achieves a precision of 1.00 and a
recall of 0.66. In terms of precision and recall, the current prototype implemen-
tation performs on par or better than related approaches (e. g., [5,9]), especially
with respect to finding candidates for entity type identifiers, with prior research
demonstrating values for recall from 0.92 [9] to 0.95 [8]. Closer inspection of
the precision and recall results shows, as expected, the heuristic nature of the
rules applied to recognize relationship types, entity types, attributes, and corre-
sponding data types, and of the NLP techniques we use. Still, compound nouns with
three (and more) words such as ’social security number’ are, for example, not
always properly recognized—a limitation we have already addressed by revising
the implemented data modeling heuristics for entity types and attributes but
which needs further work. For example, further prototype evaluation indicates
that the identification of relationships inherent in compound nouns should be
improved, which poses a particular challenge as there are many ways to combine
nouns (see [5]). Moreover, we are currently implementing heuristic rules for identifying
generalization hierarchies in textual descriptions to provide data modelers with
automated assistance in modeling generalization relationships. Future work on
the Automated Assistant will also address the identification and recognition of
synonyms in the textual description by utilizing lexical resources such as
WordNet or word embedding approaches.
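The compound-noun limitation discussed above can be illustrated with a naive grouping rule over part-of-speech tags. This is a deliberate simplification (the prototype works on CoreNLP's typed dependencies rather than tag adjacency), and it shows how a three-word compound like 'social security number' is only partially captured when adjectival modifiers are ignored.

```python
# Naive compound-noun detection: group runs of adjacent noun tokens
# (NN/NNS) into multi-word identifier candidates. The prototype instead
# uses CoreNLP's typed dependencies (e.g., 'compound'); this sketch only
# demonstrates why such naive grouping falls short.

def compound_nouns(tagged):
    compounds, current = [], []
    for tok, pos in tagged:
        if pos in ("NN", "NNS"):
            current.append(tok)
        else:
            if len(current) > 1:
                compounds.append(" ".join(current))
            current = []
    if len(current) > 1:
        compounds.append(" ".join(current))
    return compounds

tagged = [("Each", "DT"), ("employee", "NN"), ("has", "VBZ"), ("a", "DT"),
          ("social", "JJ"), ("security", "NN"), ("number", "NN"), (".", ".")]
compound_nouns(tagged)  # -> ['security number']
```

Note that 'social' (an adjectival modifier, tagged JJ) is dropped by this naive rule, which is exactly the kind of case the revised heuristics for entity types and attributes need to handle.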
We are currently planning for further systematic evaluation by empirical
studies investigating how well the feedback provided by the Automated Assis-
tant supports learners of data modeling in devising meaningful and appropriate
identifiers for model elements. Building on prior mixed-methods research into
data modeling processes [10,11,13], modelers will be observed from complemen-
tary perspectives, e. g., based on the think-aloud method and surveying modelers
on their perception of the feedback as provided by the Automated Assistant. Sub-
sequently, we are planning to apply the Automated Assistant in an introductory
course on data modeling with 200+ students per semester to collect feedback on
the support provided by the Automated Assistant.
References
1. Batra, D., Hoffer, J.A., Bostrom, R.P.: Comparing representations with relational
and EER models. Communications of the ACM 33(2), 126–139 (1990)
2. Chen, P.P.: English Sentence Structure and Entity-Relationship Diagrams. Infor-
mation Sciences 29, 127–149 (1983)
3. Chowdhury, G.: Natural language processing. Annual review of Information science
and technology 37, 51–89 (2003)
4. Liddy, E.D.: Enhanced Text Retrieval Using Natural Language Processing. Bulletin
of the American Society for Information Science and Technology 24(4), 14–16
(2005)
5. Lucassen, G., Robeer, M., Dalpiaz, F., van der Werf, J.M.E., Brinkkemper, S.: Ex-
tracting conceptual models from user stories with Visual Narrator. Requirements
Engineering 22(3), 339–358 (2017)
6. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.:
The Stanford CoreNLP Natural Language Processing Toolkit. In: 52nd Annual
Meeting of the Association for Computational Linguistics: System Demonstrations.
pp. 55–60. Baltimore, Maryland (2014)
7. Mendling, J., Reijers, H.A., Recker, J.: Activity labeling in process modeling: Em-
pirical insights and recommendations. Information Systems 35(4), 467–482 (2010)
8. Omar, N., Hanna, P., Mc Kevitt, P.: Heuristics-based entity-relationship modelling
through natural language processing. In: 15th Artificial Intelligence and Cognitive
Science Conference. pp. 302–313. Dublin, Ireland (2004)
9. Omer, M., Wilson, D.: Implementing a Database from a Requirement Specification.
International Journal of Computer and Information Engineering 9(1), 33–41 (2015)
10. Rosenthal, K., Strecker, S.: Toward a taxonomy of modeling difficulties: A multi-
modal study on individual modeling processes. In: 40th International Conference on
Information Systems, ICIS 2019, Munich, Germany, December 15–18, 2019 (2019)
11. Rosenthal, K., Strecker, S., Pastor, O.: Modeling difficulties in data modeling:
Similarities and differences between experienced and non-experienced modelers. In:
39th International Conference on Conceptual Modeling, ER 2020, Vienna, Austria,
November 3–6, 2020. LNCS, vol 12400. pp. 501–511 (2020)
12. Speaks, J.: Theories of Meaning. In: Zalta, E.N. (ed.) The Stanford Encyclopedia
of Philosophy. Metaphysics Research Lab, Stanford University, winter 2019 edn.
(2019)
13. Strecker, S., Rosenthal, K., Ternes, B.: Studying Conceptual Modeling Processes:
A Modeling Tool, Research Observatory, and Multimodal Observation Setup. In:
Gimpel, H., Krämer, J., Neumann, D., Pfeiffer, J., Seifert, S., Teubner, T., Veit, D.,
Weidlich, A. (eds.) Market Engineering – Insights from Two Decades of Research,
pp. 99–111. Springer, Cham (2021)
14. Ternes, B., Rosenthal, K., Strecker, S.: Automated Assistance for Data Modelers:
A Heuristics-based Natural Language Processing Approach. In: 29th European
Conference on Information Systems, ECIS 2021, Marrakech, Morocco, June 15–17,
2021 (2021)
15. Tjoa, A.M., Berger, L.: Transformation of requirement specifications expressed in
natural language into an EER model. In: Elmasri, R., Kouramajian, V., Thal-
heim, B. (eds.) Proceedings of the 12th International Conference on the Entity-
Relationship Approach, ER 1993, Arlington, Texas, USA, December 15–17, 1993.
LNCS, vol 823. pp. 206–217 (1994)