<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Easy-to-use interfaces for supporting the semantic annotation of web tables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sara Bonfitto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Perlasca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Mesiti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science (University of Milan)</institution>
          ,
          <addr-line>via Celoria 18, 20133 Milan (MI)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the last few years, many approaches have been proposed for the semantic annotation of Web tables according to the concepts of a domain ontology and for the semantic description of the relationships existing among the identified concepts. However, these approaches are probabilistic and are not always able to identify the correct semantic annotation because of the heterogeneity of the table contents, the possible presence of mistakes, and the lack of standardization. User intervention is thus required for checking the proposed annotations, correcting mistakes, and possibly providing new ones. In this paper, we propose different easy-to-use graphical facilities for supporting the user in this activity when dealing with web tables presenting a complex structure and syntactic and semantic mistakes. Different semantic annotation techniques can be integrated into the web application, which produces results according to the data structures that are discussed in the paper. A usability analysis was conducted to assess the quality of the provided graphical tools.</p>
      </abstract>
      <kwd-group>
        <kwd>Table Understanding</kwd>
        <kwd>GUIs for Web tables</kwd>
        <kwd>Graphical representation of semantic description</kwd>
        <kwd>Usability analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>the predicted annotations. For what concerns CTI, the proposed interfaces allow showing errors occurring in columns (i.e. values that do not adhere to the column type), identifying more than one annotation for the same column, and annotating the string components with different ontology properties. For what concerns RD, we consider the possibility of identifying a semantic description (in the same spirit of [18, 19, 20, 21]) and propose graphical tools for completing the semantic description and changing the concepts and properties that were automatically determined. A usability test has been conducted on the proposed visual interfaces with good appreciation from our volunteers.</p>
      <p>By means of the data structures that our interfaces rely on, our web application can integrate different CTI and RD approaches. In the examples presented in the paper we refer to the CTI approach developed in [14] and the RD approach developed in [21]. However, other approaches can be easily integrated.</p>
      <p>In the remainder, Section 2 introduces the data structures for tables, types, ontology, and semantic description that are exploited by our interfaces. Section 3 shows the interfaces developed in the context of CTI. Section 4 shows the interfaces developed for the result of RD and for their modification. Section 5 shows the usability test and the obtained results. Finally, concluding remarks are reported in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>A web table T is a triple ⟨CN, R, τ⟩, where CN denotes the list of column names [cn<sub>1</sub>, ..., cn<sub>j</sub>, ..., cn<sub>m</sub>] (when a name is not available, the symbol ? is used to denote its absence), and R = {r<sub>1</sub>, ..., r<sub>n</sub>} is the set of table rows (each row r<sub>i</sub>, 1 ≤ i ≤ n, is a list of values r<sub>i</sub> = [v<sub>i,1</sub>, ..., v<sub>i,j</sub>, ..., v<sub>i,m</sub>], one value for each column identified in the column schema). τ is a partial function that represents the type/annotation associated with each value and column name of T. The type annotation can be: a basic/domain-specific type, a mixed/union type, or the property p of a concept c of an Ontology O (denoted ⟨c, p⟩). Basic types include integer, Boolean, decimal, date, whereas domain-specific types can be, for example, Social Security Number (SSN), VAT, currency, email, province, zip code. Mixed types are record types associated with a set of patterns for extracting the record components from strings, whereas union types represent the occurrence of different types of values in a column.</p>
      <p>A domain Ontology O contains a set of concepts C = {c<sub>1</sub>, ..., c<sub>n</sub>} and relationships ℛ = {(c<sub>1</sub>, r, c<sub>2</sub>) | c<sub>1</sub>, c<sub>2</sub> ∈ C, r ∈ RN}, where RN is the set of relation names. Concepts can be organized in an inheritance hierarchy: c<sub>1</sub> ⊑ c<sub>2</sub> denotes that c<sub>1</sub> is a sub-concept of c<sub>2</sub>. Each concept can have associated basic properties taken from a set P = {(p<sub>1</sub>, t<sub>1</sub>), ..., (p<sub>k</sub>, t<sub>k</sub>)}, where t<sub>i</sub> is the basic type of the values of property name p<sub>i</sub>; the properties of a concept c include those specifically defined for c and those inherited from ancestors of c.</p>
      <p>A semantic description for a table T is a graph SD representing the mapping between the columns of T and the "meta-instances" of the concepts in O. We talk about meta-instances instead of concepts of O because T can contain different instances of the same concept, and we need to discriminate them. Formally, a semantic description for a table T = ⟨CN, R, τ⟩ is a graph SD = (V<sub>I</sub>, V<sub>T</sub>, E<sub>I</sub>, E<sub>T</sub>), where: V<sub>I</sub> is a set of nodes representing meta-instances of the concepts in O; c<sup>i</sup> ∈ V<sub>I</sub> denotes a vertex corresponding to the i-th occurrence of the concept c; V<sub>T</sub> is a set of nodes corresponding to the columns in T (|V<sub>T</sub>| ≤ |CN|); E<sub>I</sub> ⊆ V<sub>I</sub> × RN × V<sub>I</sub> represents the relationships among concepts in O; E<sub>T</sub> ⊆ V<sub>I</sub> × P × V<sub>T</sub> denotes the properties associated with the columns of T.</p>
      <sec id="sec-2-1">
        <title>3. Web Table Visualization with Column Type Annotation</title>
        <p>Once one of the CTI approaches is applied, a table T = ⟨CN, R, τ⟩ is generated and annotated with a set of types and possibly with pairs ⟨c, p⟩ of the Ontology O. In some cases, table columns that uniquely identify an instance of the concept (e.g. SSN of a person) can be discriminated from those that are simple characteristics of the identified concept (e.g. gender of a person). More than a single type can be used for annotating the values of a single column. Whenever the frequency of cells of a given type is above a given threshold, the type is added to the union type identified for the column. Conversely, if the frequency is below the threshold, the cell is considered an error. Moreover, CTI approaches can also extract annotations for sub-components of strings, thus a record type can be extracted according to a given pattern (introduced as mixed type in Section 2).</p>
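        <p>The threshold rule described above can be illustrated with a minimal Python sketch. It is not the CTI implementation referred to in the paper: the function name <monospace>infer_column_type</monospace>, the default threshold of 0.2, and the choice to accept a type whose frequency is exactly equal to the threshold are assumptions made only for illustration.</p>
        <preformat>
```python
from collections import Counter


def infer_column_type(cell_types, threshold=0.2):
    """Return (column_type, error_rows) for a list of per-cell type labels.

    cell_types: one predicted type label per cell (None for a missing value).
    threshold: minimum relative frequency for a type to join the union
    (hypothetical default; the paper does not state a value).
    """
    present = [t for t in cell_types if t is not None]
    counts = Counter(present)
    total = len(present)
    # Types that are frequent enough become components of the column type.
    accepted = sorted(t for t, n in counts.items() if n / total >= threshold)
    # Cells whose type was not accepted are flagged as errors.
    errors = [
        i for i, t in enumerate(cell_types)
        if t is not None and t not in accepted
    ]
    if not accepted:
        column_type = None
    elif len(accepted) == 1:
        column_type = accepted[0]
    else:
        column_type = "union(" + ", ".join(accepted) + ")"
    return column_type, errors
```
</preformat>
        <p>Under these assumptions, a column with five SSN cells, four VAT cells and one integer cell is typed union(SSN, VAT) with a threshold of 0.2, and the integer cell is reported as an error.</p>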
        <p>Sometimes CTI approaches are not able to identify the pair ⟨c, p⟩ for all the table columns. Columns that did not receive the annotation are named unmatched and need to be properly handled by our graphical interfaces.</p>
        <p>In this section, the interfaces for showing T with the type annotations and for updating them are introduced.</p>
        <sec id="sec-2-1-1">
          <title>3.1. Main interface</title>
          <p>Figure 1 shows the main interface we have developed for showing a table T representing information about invoices for the payment of taxes. The invoice can be related to a person or a company (with the relations holder and owner). For each holder or owner, the associated address can be the residential address (in the case of people) or the headquarters (in the case of a company). The invoice contains taxes that the associated person or company should pay along with the penalty. Each cell and column is associated with the predicted type annotation. The first line reports the column schema and it is followed by a drop-down menu containing the inferred type annotations for each column. If more than one annotation is reported in a single column, this means that each annotation is a member of a union type, which is represented using different background colors for distinguishing their instances, such as in the second column of the table in Figure 1 where the occurrences of SSN and VAT are distinguished using two different shades of green. Similarly, the presence of mixed types is represented using different text colors for each component of the pattern. If a column presents a single annotation, the background remains white; if a value is missing, the cell background color is yellow. The usage of different colors can help the users in the process of checking the type predictions and the empty values (in some cases a value should be provided) and performing error corrections. A cell is considered an error when its type annotation is not compliant with the ones identified for the column; in this case, the cell background color and the column type background are marked in red. Facilities are provided for showing only rows presenting values in a column of a given type (for easily checking and correcting them). When the number of rows contained in a document is high, it can be difficult to detect all the red cells; for this reason, we provide an error panel, on the right part of the screen, that summarizes the issues that need to be solved. An example of the panel is shown in Figure 2. When the user clicks the check button on one of the tabs in the panel, only the rows presenting the error are shown in the main interface. Once the errors are corrected, the corresponding panel tab is removed.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>3.2. Modification of type annotations</title>
          <p>Since the CTI approach can produce false positives or negatives (e.g. a ZIP code that has been labeled as Integer), specific GUIs have been developed for supporting the user in modifying the predicted annotations and easily applying the modification to the entire column or a subset of cells. In the modification process, the user should be supported in the specification of pairs ⟨c, p⟩ instead of basic or domain-specific types that can be obtained through the CTI approach. Indeed, users can easily identify the domain concept to be associated with the column and thus improve the semantic description of the column. The user is also supported in the modification of the value of a cell when it contains errors. Consider for example the red cell in the date of birth column in Figure 1. It contains a value not compliant with the column type since the separators of the date are missing. In this case, the user can fix the mistake by directly editing the cell.</p>
          <p>For performing bulk modifications on the values, a specific interface has been developed. For each column, the interface groups the occurrences of the identical strings of the column, each followed by the number of occurrences. Similar strings are clustered together relying on the edit distance and then reported together in the interface. In this way, it is easier to visually detect the errors and correct them with the aim of obtaining a homogeneous representation of the same kind of information. The user can edit the single value, and the proposed modification is applied to all the occurrences (note that when corrections lead to a value already present in the column, the two rows are collapsed).</p>
          <p>The interface in Figure 3 was developed for easily changing the type of an entire column or a subset of its values when column annotations are not correctly identified. The interface can be activated on a single cell, which becomes the current target of the modification, or on the entire column. Through this interface: i) the inferred data type can be modified into a new type, or into a ⟨c, p⟩ pair of the domain ontology; ii) a mixed type can be created or modified. The interface is organized into 5 areas. In (1) the hierarchy of concepts available in O is reported along with basic types (collected under the button General). The user can select one of the available concepts, and the corresponding properties are reported in (2) (when the General button is pressed, the basic types are reported). In (3), when the interface is activated on a target cell, a single type is reported (the value type); otherwise, the components of the union types specified for the column are reported. In this way, the type can be changed for each component of the union type. In (4), the target value or the column name is reported and highlighted with the current type for the column. The user can remove the current labeling (by clicking on the x button on the top left corner of the string) and apply a new ⟨c, p⟩ pair. In (5) values of the same type present in the column are reported and the user can select those to which the type modification should be applied (all is the default behaviour). The user can also decide to select the “text” checkbox reported in (6) to unify undesired union types (e.g. decimal and integer) and to treat the whole column as an instance of a single type. Then, the user can select the new type to be assigned to all values.</p>
          <p>Example 1. Two errors occur in the ZIP code column in Figure 1. Through the interface in Figure 3, the user highlights Type_2 and substitutes the errors with Address.ZIP. Moreover, Type_1 can be modified into Address.ZIP leading to a single column type. □</p>
          <p>The possibility of modifying types according to the concepts contained in the domain ontology can also introduce some issues that need to be properly managed.</p>
          <p>Example 2. Consider the column Name/Company in Figure 1 that the CTI approach has typed union(mixed_1, text, company), where the structure associated with mixed_1 is rec(name, surname). The value Danielle Gray Greeen is of type text and can be changed with the mixed type mixed_2 whose structure is rec(Person.name, Person.surname). So, a more complex type than the one expected is generated. □</p>
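          <p>The grouping of identical and similar strings used for bulk modifications in this section can be sketched in Python as follows. This is only an illustrative sketch: the greedy clustering around the most frequent string, the <monospace>max_distance</monospace> parameter and all function names are assumptions, since the paper states only that similar strings are clustered relying on the edit distance.</p>
          <preformat>
```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cluster_column(values, max_distance=2):
    """Group identical strings with their counts, then cluster similar ones."""
    counts = Counter(values)
    clusters = []  # each cluster is a list of (string, occurrences)
    # Visit the most frequent strings first so that they seed the clusters.
    for value, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for cluster in clusters:
            seed = cluster[0][0]
            if max_distance >= edit_distance(value, seed):
                cluster.append((value, n))
                break
        else:
            clusters.append([(value, n)])
    return clusters


def replace_value(values, old, new):
    """Apply one correction to every occurrence of old in the column."""
    return [new if v == old else v for v in values]
```
</preformat>
          <p>For a hypothetical column containing Milan three times, Rome twice and Milann once, the misspelled Milann is reported in the same cluster as Milan; after <monospace>replace_value</monospace> rewrites it, the two entries collapse into a single one with four occurrences.</p>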
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p>To face this issue, a re-writing system based on rules [24] has been developed for the simplification of the type expression after the modifications applied by the user. The re-writing rules express correspondences between simple types and ⟨c, p⟩ pairs of the domain ontology occurring in the same table column. Once the re-writing rules are applied, the union-type components presenting the same structure are compacted. The union type is finally transformed into a simple type when a single component is identified. In the previous example, the application of the re-writing system leads to the type union(mixed_1, company), where the record type associated with mixed_1 is rec(Person.name, Person.surname).</p>
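        <p>The simplification steps above can be sketched in Python under some assumptions: a record type is modelled as a tuple of field types, a union as a list of components, and the re-writing rules as a plain dictionary from simple types to ⟨c, p⟩ pairs written as Concept.property. The rule system of [24] is not reproduced here, and the function names are hypothetical.</p>
        <preformat>
```python
def rewrite(component, rules):
    """Rewrite a simple type, or each field of a record type, using the rules."""
    if isinstance(component, tuple):  # record type: rec(f1, ..., fk)
        return tuple(rules.get(field, field) for field in component)
    return rules.get(component, component)


def simplify_union(components, rules):
    """Apply the rules, merge components with the same structure, and
    collapse the union when a single component remains."""
    merged = []
    for component in components:
        rewritten = rewrite(component, rules)
        if rewritten not in merged:  # compact identical structures
            merged.append(rewritten)
    return merged[0] if len(merged) == 1 else merged


# Assumed correspondences for the Name/Company column of Example 2.
rules = {"name": "Person.name", "surname": "Person.surname"}
column = [("name", "surname"), ("Person.name", "Person.surname"), "company"]
simplified = simplify_union(column, rules)
```
</preformat>
        <p>Here <monospace>simplified</monospace> contains two components, the record rec(Person.name, Person.surname) and company, which mirrors the type union(mixed_1, company) obtained in the example.</p>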
        <sec id="sec-2-2-1">
          <title>3.3. Identification of a mixed type</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <p>Even if different approaches for the extraction of concepts from texts have been proposed [25], the identification of the sub-components of a mixed type is quite hard to handle automatically, especially when errors and variability in the pattern occur. Our interfaces support the user in the specification and modification of mixed types.</p>
        <p>Example 3. Consider the column address in Figure 1 and suppose the ML algorithm was able to identify the type mixed_1 for some of its values. The others are marked as type text and we can see that they follow two specific patterns. These patterns can be manually detected on a single instance and applied to all the others. □</p>
        <p>The interface for the identification of mixed types is
similar to the one presented in Figure 3 but it works on
specific cell values that are reported in (4). Once the user
has selected the property of a concept (in this case the
municipality of an Address), he/she can highlight the
part of the string of such a type. This behaviour applied
to all the components will lead to the situation reported
in the top part of Figure 4. In this way, we identify the
terminal and non-terminal symbols that form our pattern.
The non-labeled items are considered terminal symbols,
while the labeled items are exploited for the generation of
the pattern. Note that the void symbol can be applied for
skipping variable parts of the string. Once the labeling is
complete, the user can check if the generated pattern can
be applied to other strings occurring in the same column
that adhere to the same pattern (the instances in (5) that
follow the pattern are highlighted). When the user tries to
apply the labeling to other strings, the interface in Figure
5 is shown. The top part of the figure reports the labeled
string, whereas the left panel reports strings that do not
present the same pattern and the right panel contains
the strings that have been re-written according to the
identified pattern. The user can check the correctness of
the applied pattern in the right panel and move to the
left one the strings that were erroneously annotated. Moreover, he/she
can take note of the strings in the left panel because
they require the specification of different patterns or the
identification of different types.</p>
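        <p>The generation of a pattern from one labeled string and its application to the other strings of the column can be sketched as follows. This is an assumption-laden illustration, not the tool's implementation: labeled items become lazy wildcard groups, non-labeled items become literal terminal symbols, and the void symbol skips a variable part. The address strings are invented for the example; only the property names streetName, streetNumber and municipality come from the running example.</p>
        <preformat>
```python
import re

VOID = "void"  # marks a variable part of the string to be skipped


def build_pattern(segments):
    """segments: list of (label, text); label None marks a terminal symbol."""
    parts, labels = [], []
    for label, text in segments:
        if label is None:
            parts.append(re.escape(text))  # terminal: must match literally
        elif label == VOID:
            parts.append("(?:.*?)")        # skipped, not captured
        else:
            parts.append("(.+?)")          # labeled component
            labels.append(label)
    return re.compile("".join(parts)), labels


def apply_pattern(pattern, labels, strings):
    """Split strings into those re-written by the pattern and the rest."""
    matched, unmatched = {}, []
    for s in strings:
        m = pattern.fullmatch(s)
        if m:
            matched[s] = dict(zip(labels, m.groups()))
        else:
            unmatched.append(s)
    return matched, unmatched


# Labeling of one (invented) cell value, as done by the user in area (4).
segments = [
    ("Address.streetName", "Via Roma"),
    (None, ", "),
    ("Address.streetNumber", "12"),
    (None, " - "),
    ("Address.municipality", "Milano"),
]
pattern, labels = build_pattern(segments)
matched, unmatched = apply_pattern(
    pattern, labels, ["Corso Italia, 5 - Torino", "Piazza Duomo 3 Milano"]
)
```
</preformat>
        <p>In this sketch the first string follows the pattern and is split into its three components, while the second one ends up in the list of strings that need a different pattern, which corresponds to the two panels of Figure 5. Lazy wildcards are a simplification: when terminal symbols are ambiguous (e.g. a single blank), a real implementation would need type-aware matching of each component.</p>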
      </sec>
      <sec id="sec-2-4">
        <p>In our example, the pattern can be applied only to two strings. For the remaining two strings of type text, the pattern in the bottom part of Figure 4 should be specified on one of them and applied to the other.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Relation discovery</title>
      <sec id="sec-3-1">
        <p>As a result of the RD task, a graph SD can be generated. Its vertices correspond to the concepts that occur in the table or to concepts induced by the presence of relationships with the table concepts. Moreover, graph edges are predicted by the adopted ML algorithm. Besides that, nodes representing the table columns are also included in the graph and are associated with the concepts by means of the corresponding properties.</p>
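        <p>A minimal in-memory representation of such a graph could look like the following Python sketch. The class and method names are hypothetical and the layout (four sets, with a meta-instance encoded as a pair of concept name and occurrence number) is an assumption made for illustration, not the data structure used by the web application.</p>
        <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class SemanticDescription:
    instance_nodes: set = field(default_factory=set)  # (concept, occurrence)
    column_nodes: set = field(default_factory=set)    # column names
    relation_edges: set = field(default_factory=set)  # (instance, relation, instance)
    property_edges: set = field(default_factory=set)  # (instance, property, column)

    def add_instance(self, concept):
        """Create the next meta-instance of a concept and return it."""
        occurrence = 1 + sum(1 for c, _ in self.instance_nodes if c == concept)
        node = (concept, occurrence)
        self.instance_nodes.add(node)
        return node

    def annotate_column(self, instance, prop, column):
        """Attach a column to a meta-instance through one of its properties."""
        self.column_nodes.add(column)
        self.property_edges.add((instance, prop, column))

    def unmatched(self, column_names):
        """Columns that have no incoming property edge yet."""
        return [c for c in column_names if c not in self.column_nodes]


sd = SemanticDescription()
person = sd.add_instance("Person")
company = sd.add_instance("Company")
# A union-type column receives one incoming edge per component.
sd.annotate_column(person, "SSN", "SSN/VAT")
sd.annotate_column(company, "VAT", "SSN/VAT")
```
</preformat>
        <p>With this encoding, calling <monospace>sd.unmatched(["SSN/VAT", "tax"])</monospace> returns only the tax column, i.e. the column that still has to be associated with a property of the ontology.</p>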
        <p>Starting from SD, a graphical representation can be devised and reported in the main canvas of Figure 6 for being checked and approved by the user. The green nodes (representing meta-instances) are laid out in the top part of the canvas, whereas light blue nodes (representing the table columns) are in the lower part of the canvas. Edges between instance nodes (i.e. E<sub>I</sub>) and edges between instance nodes and terminal nodes (i.e. E<sub>T</sub>) are represented in the same way (labeled arrows) because their meaning is easily understandable from the context. The label on the edges is the relation/property name. For each light blue node in the graph, a single incoming edge is present if the column has a single basic type (e.g. the zip column). Multiple incoming edges can be present when the light blue node represents a mixed or union-type column. For example, the Address column is of type mixed and three incoming edges are present (for representing the properties streetName, streetNumber, and municipality). Moreover, the SSN/VAT column is an example of a column of type union and two incoming edges are present (one representing the SSN property of the instance node Person<sup>1</sup> and the other representing the VAT identifier of the instance node Company<sup>1</sup>). We have decided to maintain this simplified representation for keeping the illustration simple. Isolated nodes (i.e. terminal nodes without incoming edges) are not included in our graph representation.</p>
        <p>The left panel (1) contains buttons corresponding to the table columns. We exploit a double representation of the table columns (buttons in the left panel and light blue nodes in the central panel) because they are used for checking the correctness of the semantic values associated with each column and for adding missing annotations to the unmatched columns. Moreover, graphical edges are used for verifying the connections among the components and modifying/adding new ones.</p>
        <p>The buttons in the left panel can be colored in two ways: green, i.e. the associated column has already been included in SD; pink, i.e. the associated column is not yet included in SD. By clicking on the arrow positioned on the left side of the button, it is possible to show the data type associated with that column (single type or union of types). By right-clicking on the button itself, it is possible to specify a new pair ⟨c, p⟩ of the domain ontology for each data type of the column.</p>
        <p>In the remainder, we discuss the operations that can
be invoked on the two parts of the interface.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Visual operations on table columns</title>
          <p>The following operations can be invoked on the table
columns reported in the left panel for the correction of
errors or in the definition of new nodes:</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <list list-type="order">
          <list-item><p>Association of properties. It allows the specification of properties for unmatched columns.</p></list-item>
          <list-item><p>Modification of properties. It allows changing the current association of properties for a column.</p></list-item>
          <list-item><p>Removal of properties. It removes the semantic concept associated with the column.</p></list-item>
        </list>
        <p>Operation 1 can be invoked on unmatched columns (i.e. pink buttons) and used for including them in SD in two steps. First, the properties of the ontology concepts that represent the column content are identified. Then, the instance nodes in SD (or new nodes that need to be added to SD), to which the properties can be associated, must be defined.</p>
        <p>Example 4. Consider the unmatched tax column in Figure 6. When Operation 1 is invoked on it, an interface similar to the one in Figure 3 is shown. In this case, the property taxes associated with the concept Invoice on the left bar is used for annotating the entire column. At the end of the operation, since no node in SD represents an instance of Invoice, the node Invoice<sup>1</sup> is introduced in SD together with the node v<sub>tax</sub> for representing the table column tax. The edge (Invoice<sup>1</sup>, taxes, v<sub>tax</sub>) is included in E<sub>T</sub>. □</p>
      </sec>
      <sec id="sec-3-3">
        <p>Whenever the chosen concept is already present in SD, an interface is shown to the user for deciding if the identified property should be associated with one of the meta-nodes in SD or a new one should be included. In this way, it is possible to distinguish the presence of different instances of the same concept.</p>
        <p>Example 5. Consider the situation of the previous example, and suppose that the table column penalty is now semantically annotated with Invoice.penalty. Since node Invoice<sup>1</sup> is already included in SD, a panel is shown to the user for deciding if the property should be associated with Invoice<sup>1</sup> or a new meta-instance should be created. □</p>
      </sec>
      <sec id="sec-3-4">
        <p>Regardless of the number of instances, after the introduction of a new concept, the system identifies the relations existing between the newly inserted element and the other concepts in the ontology. If a single relation is present, it is automatically added to SD.</p>
        <p>Operation 2 is used for changing the semantic annotation already associated with a column (i.e. it is invoked on a green button). Besides changing the semantic annotation, this operation also allows changing the instance node to which the properties are associated (if needed). By invoking Operation 3 on a green button, the existing semantic annotation is removed along with the corresponding nodes in the graphical representation.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.2. Visual Operations on SD</title>
        <p>Table 1 shows the operations that can be executed on the graphical representation of SD. Some of them (1 and 2) can be invoked on light blue nodes and produce the same effect as the corresponding operations that can be applied to the buttons on the left sidebar. Operation 3 can be invoked on an instance node and allows the introduction of a new link with another instance node. The inserted links must be coherent with the domain ontology so that, for each pair of nodes, only existing relations in the correct direction can be added.</p>
        <p>Example 6. Consider the unmatched columns tax and penalty that have been associated with the instance node Invoice<sup>1</sup>. This instance node should be linked with its holder or its owner (that can be a person or a company). These bindings are realized by means of the interface in Figure 7. The interface shows the lists of relation names (ingoing and outgoing) that can be exploited for the nodes of this concept by taking into account the instance nodes in the current SD and the constraints of the domain ontology. The user can select the correct relation and insert it in the graphical representation of SD, which is updated accordingly. In this case, the interface is used two times for including two links (relations holder and owner) for representing the relation with the person and the company. □</p>
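        <p>The coherence constraint on inserted links discussed in this section can be sketched in Python as follows. The ontology triples used here (Invoice holder Person, Invoice owner Company) are a simplified assumption about the running example, inheritance among concepts is ignored, and the function names are hypothetical.</p>
        <preformat>
```python
def candidate_links(relations, node, instance_nodes):
    """List the (source, relation, target) links the ontology allows for node.

    relations: set of (concept1, relation_name, concept2) triples.
    node and instance_nodes: meta-instances encoded as (concept, occurrence).
    """
    concept = node[0]
    links = []
    for other in sorted(instance_nodes):
        if other == node:
            continue
        for c1, r, c2 in sorted(relations):
            if (c1, c2) == (concept, other[0]):
                links.append((node, r, other))  # outgoing relation
            if (c1, c2) == (other[0], concept):
                links.append((other, r, node))  # ingoing relation
    return links


def add_link(relation_edges, relations, source, relation, target):
    """Insert a link only if the ontology defines it in this direction."""
    if (source[0], relation, target[0]) not in relations:
        raise ValueError("relation not allowed by the ontology")
    relation_edges.add((source, relation, target))


# Assumed fragment of the domain ontology.
relations = {("Invoice", "holder", "Person"), ("Invoice", "owner", "Company")}
nodes = {("Invoice", 1), ("Person", 1), ("Company", 1)}
options = candidate_links(relations, ("Invoice", 1), nodes)
```
</preformat>
        <p>Under these assumptions, <monospace>options</monospace> lists one owner link towards the company node and one holder link towards the person node, while <monospace>add_link</monospace> rejects the same relations when they are requested in the opposite direction.</p>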
      </sec>
      <sec id="sec-3-6">
        <p>Operation 4 of Table 1 can be invoked on a node of the graphical representation of SD with the aim of modifying the name of the relation between two nodes or one of the nodes connected by the link. Finally, Operation 5 allows the deletion of an edge occurring in SD.</p>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Task</th><th>Goal</th><th>Time</th><th>Success</th><th>Failure</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>mixed type</td><td>7 min</td><td>Definition of a mixed type and application of the labeling to other strings through the “apply” function</td><td>Lack of the pattern definition or application of the new procedure every time</td></tr>
              <tr><td>2</td><td>errors</td><td>5 min</td><td>Detection and correction of the errors on values/types through the error panel</td><td>The errors are not corrected and the error panel is not exploited</td></tr>
              <tr><td>3</td><td>bulk editing</td><td>3 min</td><td>Rows are updated in a single operation</td><td>Rows are updated one at a time</td></tr>
              <tr><td>4</td><td>management of unmatched columns</td><td>7 min</td><td>The user specifies a concept and a property for each unmatched column</td><td>One or more columns have not been associated with a concept and property of the domain ontology</td></tr>
              <tr><td>5</td><td>connections among concepts</td><td>5 min</td><td>The user identifies the correctness of the existing links, adds the missing ones and modifies the wrong ones</td><td>The final SD is not complete or the links are wrong</td></tr>
              <tr><td>6</td><td>new instance</td><td>4 min</td><td>The user is able to define a new instance for a concept</td><td>The user uses the same instance for multiple occurrences</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Figure 8 shows the final  that can be realized by with Excel. Most of the students are currently attending
means of our tool that describes the diferent kinds of data their bachelor’s degree, therefore they have only a high
that can be extracted from our running example. This school diploma. Users have an average knowledge of
difsemantic description can then be used for translating the ferent operative systems and use a computer or a laptop
table content as RDF triples. mostly for working or studying.</p>
        <p>Table 2 reports the tasks that we have identified for
checking the main functionalities of our system. Each
5. Experimental results task requires the processing of a spreadsheet that is
specifically created for the purpose of the task and whose
conWe organized a usability test of the Web application. The tent can be easily understood also by non-expert of the
aim of this test is to evaluate if the users can smoothly domain. Specifically, two spreadsheets have been
deinteract with the application and use the provided tools, signed for pointing out the issues that each task was
what level of knowledge in computer science is needed, intended to address. Even if these spreadsheets
correand check the existence of critical aspects that should be spond to real documents of our domain, their content has
ifxed or improvements to be applied. This test is com- been anonymized for preserving user privacy. For each
posed of three parts: first, the user watches a video that task, Table 2 reports the main goal, the time required for
introduces him/her to the problem and shows the system completing the task and when the task can be considered
usage. Then, some tasks are assigned to be carried out successfully completed or completely a failure. Tasks 1,
on specific files. Finally, the user fills out a questionnaire 2, and 3 are used for evaluating the usability of interfaces
about his experience with the system, containing: ) per- developed for CTI, while the remaining tasks are used
sonal information (age, gender, level of instruction) and for evaluating the interfaces in handling the result of RD.
technical abilities (computer skills in general, knowledge Almost all the volunteers (70%) were able to complete
of operating systems, skills in the use of spreadsheets, ...); the assigned tasks within the specified time limits. The
and, ) users’ opinions about the assigned tasks and their others would have been able to complete the task with
adcomplexity ) users’ opinions about the functionalities ditional time. A good fraction (70% of the users) thought
of the proposed tool. The questions are rated using a Lik- that the assigned tasks were easy and enough intuitive.
ert scale (from “strongly disagree” to “strongly agree”). For task 1, most of the individuals (85%) were able to</p>
        <p>We selected 20 participants, 12 males and 8 females, specify a mixed type through the interface. All of them
60% of them were between 21 and 23 years old, 20% be- used the “apply” button to label all the mixed types in
tween 24 and 26 years old and the remaining ones were a single column. The main reason for the failure of this
more than 26. Most of the users were recruited among task was the choice of the wrong interface (they selected
personnel and students of the department of computer the interface for the modification of column type instead
science of the University of Milan and therefore they have of the one for modifying the cell type).
good technical skills. However, they are not involved in For task 2, 75% of the individuals used the error panel
this project and they have little knowledge of the domain. and the general impression about its usefulness is very
Only a small part of the participants (50%) feels confident positive (from partially to strongly agree). The users that
did not exploit the error panel tried to increase the table page size to identify the errors. In these cases, the identification and correction of the errors required more time. Concerning this task, only 28% of the users had trouble distinguishing errors occurring on the data type (e.g. the component of a union type was not identified by the CTI approach used) from errors occurring on the data (e.g. a date written without separators).</p>
        <p>For task 3, most of the individuals (86%) were able to use the bulk editing functionality and all of them thought it speeds up the editing process. The remaining users did not notice the error occurring within the data (usually an additional letter in the name of a city) and corrected it by editing the data type.</p>
        <p>For task 4 and task 5, 90% of the individuals completed the job in just 10 minutes. Only 10% of them had some trouble remembering the procedures to complete task 5.</p>
        <p>For task 6, 20% of the users needed to watch the training videos again to complete the assignment correctly. The additional time required for watching the videos was not counted in the total task completion time. The fact that all users, possibly after watching the training video again, completed the tasks correctly highlights a possible difficulty for a novice user in learning the various procedures rather than in applying them.</p>
        <p>We tested user satisfaction in using the developed interfaces to support users in solving both CTI and RD issues. For the interfaces developed for CTI, 95% of the users agreed that the application is easy-to-use and intuitive and 85% of them declared that they did not have problems during the error correction process. The greatest difficulties were related to the understanding of the specific domain; most of the users did not know the meaning of the concepts of the domain ontology and tried to identify the most suitable one. Moreover, the application provides a lot of functionalities and the user needs time to gain confidence in the system.</p>
        <p>For the interfaces developed for RD, 85% of the users agreed that the interfaces are easy-to-use and intuitive whereas the remaining ones expressed a neutral position. The greatest uncertainties concerned the application of the operations on the graphical representation of . In particular, a few users had difficulty in recognizing or applying operations such as the insertion of new concepts or the insertion or removal of links between concepts. Half of the users declared that they had to remove an edge because it was not correct. Only 16% of the users could not connect all the graph nodes because they did not have enough knowledge about the domain (e.g. they did not know that a company can be the invoice holder).</p>
        <p>In conclusion, the usability test suggests that although some aspects of the application could be improved, for example by adding contextual help possibly supported by short videos, the overall opinion is that the system is intuitive and easy to use.</p>
        <p>6. Concluding remarks</p>
        <p>In this paper we have discussed different user interfaces that can be exploited after the application of CTI and RD approaches for correcting the automatically predicted annotations and thus improving the semantic description of web tables. The developed interfaces allow, in many cases, the specification of a single modification and its propagation to the other values in the same column that follow the same type or the same pattern. Once the semantic description has been generated and validated, it can be exploited for the translation of the table content into a KG representation, thus obtaining a meaningful representation of the table content.</p>
        <p>The problem of supporting the user in the interpretation of table content was initially addressed in Karma [18]. Our approach substantially extends this work by considering a more sophisticated data model both for the CTI and RD interfaces. Our semantic description allows the management of tuples of different types that need a different knowledge representation. Moreover, the interfaces for the semantic labeling of columns and for extracting patterns for strings are new contributions of this paper.</p>
        <p>A usability test has been run on the graphical interfaces for assessing their ease of use. Our results show that almost all users believe the application is easy-to-use and intuitive. Some more efforts should be devoted to improving the interfaces for handling the semantic description and for showing the results of the modifications on the table data. We are currently working on providing further facilities for supporting the user in this activity.</p>
        <p>The work discussed in this paper can be extended in several directions. Even if we have focused on developing graphical interfaces for supporting the CTI and RD tasks, entity linking approaches [26] can also be used for table understanding and specific interfaces can be included in our system for their management. Moreover, once the semantic description is obtained, it can be exploited for the creation of KGs reporting the table content [21]. Specific interfaces can also be developed for supporting the user in obtaining this result, for the management of duplicates, and for fusing together alternative representations of the same entity. We would also like to collect the user modifications on the automatically generated annotations provided by CTI and RD approaches and use them for tuning the underlying approaches. Finally, we would like to use the proposed interfaces for the construction of biological knowledge graphs [27].</p>
        <p>Acknowledgments</p>
        <p>This research was supported by the “National Center for Gene Therapy and Drugs based on RNA Technology”, PNRR-NextGenerationEU program [G43C22001320007].</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><given-names>M. J.</given-names> <surname>Cafarella</surname></string-name>, et al., <article-title>Webtables: Exploring the power of tables on the web</article-title>, <source>Proc. VLDB</source> <volume>1</volume> (<year>2008</year>) <fpage>538</fpage>-<lpage>549</lpage>. doi:10.14778/1453856.1453916.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><given-names>S.</given-names> <surname>Bonfitto</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Casiraghi</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Mesiti</surname></string-name>, <article-title>Table understanding approaches for extracting knowledge from heterogeneous tables</article-title>, <source>WIREs Data Mining Knowl. Discov.</source> <volume>11</volume> (<year>2021</year>). doi:10.1002/widm.1407.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>S.</given-names> <surname>Kandel</surname></string-name>, et al., <article-title>Wrangler: Interactive visual specification of data transformation scripts</article-title>, in: <source>ACM Human Factors in Computing Systems (CHI)</source>, <year>2011</year>, pp. <fpage>3363</fpage>-<lpage>3372</lpage>. doi:10.1145/1978942.1979444.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Trifacta, <article-title>Wrangler</article-title>, <year>2020</year>. www.trifacta.com/.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Google, <article-title>Openrefine: A free, open source, powerful tool for working with messy data</article-title>, <year>2020</year>. https://openrefine.org/.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] <string-name><given-names>I.</given-names> <surname>Valera</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Ghahramani</surname></string-name>, <article-title>Automatic discovery of the statistical types of variables in a dataset</article-title>, in: <source>Proc. of Machine Learning Research</source>, volume <volume>70</volume>, <year>2017</year>, pp. <fpage>3521</fpage>-<lpage>3529</lpage>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] <string-name><given-names>T.</given-names> <surname>Ceritli</surname></string-name>, <string-name><given-names>C. K. I.</given-names> <surname>Williams</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Geddes</surname></string-name>, <article-title>ptype: probabilistic type inference</article-title>, <source>Data Mining and Knowledge Discovery</source> <volume>34</volume> (<year>2020</year>) <fpage>870</fpage>-<lpage>904</lpage>. doi:10.1007/s10618-020-00680-1.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] <string-name><given-names>Y.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Abdelhédi</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Darmont</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Ravat</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Teste</surname></string-name>, Automatic machine learning-based olap measure [...] and Knowledge Discovery, Springer, <year>2022</year>, pp. <fpage>173</fpage>-<lpage>188</lpage>. doi:10.1007/978-3-031-12670-3_15.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] <string-name><given-names>M.</given-names> <surname>Pham</surname></string-name>, et al., <article-title>Semantic labeling: A domain-independent approach</article-title>, in: <source>The Semantic Web Conference</source>, Springer, Germany, <year>2016</year>, pp. <fpage>446</fpage>-<lpage>462</lpage>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] <string-name><given-names>N.</given-names> <surname>Rümmele</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Tyshetskiy</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Collins</surname></string-name>, Evaluating [...] <source>Workshop on Linked Data on the Web</source>, volume <volume>2073</volume> of CEUR, Lyon, France, <year>2018</year>, pp. <fpage>30</fpage>-<lpage>40</lpage>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Jimenez-Ruiz</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Horrocks</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Sutton</surname></string-name>, [...] column type prediction, in: <source>Proc. of AAAI Conf. on Artificial Intelligence</source>, volume <volume>33</volume>, <year>2019</year>, pp. <fpage>29</fpage>-<lpage>36</lpage>. doi:10.1609/aaai.v33i01.330129.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] <string-name><given-names>M.</given-names> <surname>Hulsebos</surname></string-name>, et al., <article-title>Sherlock: A deep learning approach to semantic data type detection</article-title>, in: SIGKDD [...]ing, ACM, <year>2019</year>, pp. <fpage>1500</fpage>-<lpage>1508</lpage>.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] <string-name><given-names>D.</given-names> <surname>Zhang</surname></string-name>, et al., <article-title>Sato: Contextual semantic type detection in tables</article-title>, <source>Proc. VLDB</source> <volume>13</volume> (<year>2020</year>) <fpage>1835</fpage>-<lpage>1848</lpage>. doi:10.14778/3407790.3407793.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] <string-name><given-names>S.</given-names> <surname>Bonfitto</surname></string-name>, et al., <article-title>Semi-automatic column type inference for CSV table understanding</article-title>, in: Proc. of 47th [...]tice of Computer Science, SOFSEM, volume <volume>12607</volume> of LNCS, Springer, Bolzano, Italy, <year>2021</year>, pp. <fpage>535</fpage>-<lpage>549</lpage>. doi:10.1007/978-3-030-67731-2_39.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] <string-name><given-names>G.</given-names> <surname>Limaye</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Sarawagi</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Chakrabarti</surname></string-name>, <article-title>Annotating and searching web tables using entities, types and relationships</article-title>, <source>Proc. VLDB</source> <volume>3</volume> (<year>2010</year>) <fpage>1338</fpage>-<lpage>1347</lpage>. doi:10.14778/1920841.1921005.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] <string-name><given-names>P.</given-names> <surname>Venetis</surname></string-name>, et al., <article-title>Recovering semantics of tables on the web</article-title>, <source>Proc. VLDB</source> <volume>4</volume> (<year>2011</year>) <fpage>528</fpage>-<lpage>538</lpage>. doi:10.14778/2002938.2002939.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] <string-name><given-names>V.</given-names> <surname>Mulwad</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Finin</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Joshi</surname></string-name>, <article-title>Semantic message passing for generating linked data from tables</article-title>, in: <source>The Semantic Web Conference</source>, Springer, Berlin, Heidelberg, <year>2013</year>, pp. <fpage>363</fpage>-<lpage>378</lpage>.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] <string-name><given-names>M.</given-names> <surname>Taheriyan</surname></string-name>, <string-name><given-names>C. A.</given-names> <surname>Knoblock</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Szekely</surname></string-name>, <string-name><given-names>J. L.</given-names> <surname>Ambite</surname></string-name>, <article-title>Learning the semantics of structured data sources</article-title>, <source>Journal of Web Semantics</source> <volume>37-38</volume> (<year>2016</year>) <fpage>152</fpage>-<lpage>169</lpage>. doi:10.1016/j.websem.2015.12.003.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] <string-name><given-names>G.</given-names> <surname>Futia</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Vetrò</surname></string-name>, <string-name><given-names>J. C.</given-names> <surname>De Martin</surname></string-name>, <article-title>Semi: A semantic modeling machine to build knowledge graphs with graph neural networks</article-title>, <source>SoftwareX</source> <volume>12</volume> (<year>2020</year>) <fpage>100516</fpage>.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] <string-name><given-names>B.</given-names> <surname>Vu</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Knoblock</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Pujara</surname></string-name>, <article-title>Learning semantic models of data sources using probabilistic graphical models</article-title>, in: <source>The WWW Conf.</source>, ACM, <year>2019</year>, pp. <fpage>1944</fpage>-<lpage>1953</lpage>. doi:10.1145/3308558.3313711.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] <string-name><given-names>S.</given-names> <surname>Bonfitto</surname></string-name>, et al., <article-title>A semantic approach for constructing knowledge graphs extracted from tables</article-title>, Tech. Rep., Dept. Computer Science, Uni. of Milano, <year>2023</year>.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] <string-name><given-names>A.</given-names> <surname>Kumar</surname></string-name>, et al., Link prediction techniques, ap- [...] A: Statistical Mechanics and Its Applications <volume>553</volume> (<year>2020</year>). doi:10.1016/j.physa.2020.124289.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] <string-name><given-names>M.</given-names> <surname>Schlichtkrull</surname></string-name>, et al., <article-title>Modeling relational data with graph convolutional networks</article-title>, <year>2017</year>. doi:10.48550/ARXIV.1703.06103.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] <string-name><given-names>N.</given-names> <surname>Dershowitz</surname></string-name>, <string-name><given-names>D. A.</given-names> <surname>Plaisted</surname></string-name>, Chapter 9 - [...] North-Holland, <year>2001</year>, pp. <fpage>535</fpage>-<lpage>610</lpage>. doi:10.1016/B978-044450813-3/50011-4.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] <string-name><given-names>F.</given-names> <surname>Gutierrez</surname></string-name>, et al., A hybrid ontology-based in- [...]mation Science <volume>42</volume> (<year>2016</year>) <fpage>798</fpage>-<lpage>820</lpage>. doi:10.1177/0165551515610989.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Balog</surname></string-name>, <article-title>Web table extraction, retrieval, and augmentation: A survey</article-title>, <source>ACM Trans. Intell. Syst. Technol.</source> <volume>11</volume> (<year>2020</year>). doi:10.1145/3372117.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] <string-name><given-names>J.</given-names> <surname>Gliozzo</surname></string-name>, et al. (Alex Patak, Antonio Puertas-Gallardo, Alberto Paccanaro, Giorgio Valentini, Elena Casiraghi), <article-title>Heterogeneous data integration methods for patient similarity networks</article-title>, <source>Briefings in Bioinformatics</source> <volume>23</volume> (<year>2022</year>). doi:10.1093/bib/bbac207.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>