Easy-to-use interfaces for supporting the semantic annotation of web tables

Sara Bonfitto

Paolo Perlasca

Marco Mesiti

Department of Computer Science, University of Milan, via Celoria 18, 20133 Milan (MI), Italy

In the last few years, many approaches have been proposed for the semantic annotation of Web tables according to the concepts of a domain ontology and for the semantic description of the relationships existing among the identified concepts. However, these approaches are probabilistic and are not always able to identify the correct semantic annotation because of the heterogeneity of the table contents, the possible presence of mistakes, and the lack of standardization. User intervention is thus required for checking the proposed annotations, correcting mistakes, and possibly providing new ones. In this paper, we propose different easy-to-use graphical facilities for supporting the user in this activity when dealing with web tables presenting a complex structure and syntactic and semantic mistakes. Different semantic annotation techniques can be integrated into the web application, provided that they produce results according to the data structures discussed in the paper. A usability analysis was conducted to assess the quality of the provided graphical tools.

Keywords: Table understanding, GUIs for Web tables, Graphical representation of semantic description, Usability analysis

1. Introduction

the predicted annotations. As far as CTI is concerned, the proposed interfaces show the errors occurring in columns (i.e. values that do not adhere to the column type), identify more than one annotation for the same column, and annotate string components with different ontology properties. As far as RD is concerned, we consider the possibility of identifying a semantic description (in the same spirit of [18, 19, 20, 21]) and propose graphical tools for completing the semantic description and changing the concepts and properties that were automatically determined. A usability test has been conducted on the proposed visual interfaces, with good appreciation from our volunteers.

By means of the data structures that our interfaces rely on, our web application can integrate different CTI and RD approaches. In the examples presented in the paper we refer to the CTI approach developed in [14] and the RD approach developed in [21]. However, other approaches can be easily integrated.

In the remainder, Section 2 introduces the data structures for tables, types, ontology, and semantic description that are exploited by our interfaces. Section 3 shows the interfaces developed in the context of CTI. Section 4 shows the interfaces developed for the result of RD and for their modification. Section 5 shows the usability test and the obtained results. Finally, concluding remarks are reported in Section 6.

2. Preliminaries

A web table $T$ is a triple $\langle A_T, R_T, \gamma_T \rangle$, where $A_T$ denotes the list of column names $[a_1, \ldots, a_j, \ldots, a_n]$ (when a name is not available, the symbol ? is used to denote its absence), and $R_T = \{r_1, \ldots, r_m\}$ is the set of table rows (each row $r_i$, $1 \le i \le m$, is a list of values $r_i = [v_{i,1}, \ldots, v_{i,j}, \ldots, v_{i,n}]$, one value for each column identified in the column schema). $\gamma_T$ is a partial function that represents the type/annotation associated with each value and column name of $T$. The type annotation can be: a basic/domain-specific type, a mixed/union type, or the property of a concept of an Ontology $\mathcal{O}$ (denoted $\langle c, p \rangle$). Basic types include integer, Boolean, decimal, and date, whereas domain-specific types can be, for example, Social Security Number (SSN), VAT, currency, email, province, and zip code. Mixed types are record types associated with a set of patterns for extracting the record components from strings, whereas union types represent the occurrence of different types of values in a column.

A domain Ontology $\mathcal{O}$ contains a set of concepts $\mathcal{C} = \{c_1, \ldots, c_k\}$ and relationships $\mathcal{R} = \{(c_1, l, c_2) \mid c_1, c_2 \in \mathcal{C}, l \in \mathcal{L}\}$, where $\mathcal{L}$ is the set of relation names. Concepts can be organized in an inheritance hierarchy: $c_1 \sqsubseteq c_2$ denotes that $c_1$ is a sub-concept of $c_2$. Each concept $c$ can have associated basic properties taken from a set $\mathcal{P} = \{(p_1, t_1), \ldots, (p_h, t_h)\}$, where $t_i$ is the basic type of the values of property name $p_i$; the properties of a concept $c$ include those specifically defined for $c$ and those inherited from the ancestors of $c$.

A semantic description for a table $T$ is a graph representing the mapping between the columns of $T$ and the "meta-instances" of the concepts in $\mathcal{O}$. We talk about meta-instances instead of concepts of $\mathcal{O}$ because $T$ can contain different instances of the same concept, and we need to discriminate them. Formally, a semantic description for a table $T = \langle A_T, R_T, \gamma_T \rangle$ is a graph $SD = (V_I, V_T, E_I, E_T)$, where: $V_I$ is a set of nodes representing meta-instances of the concepts in $\mathcal{O}$; $c^i \in V_I$ denotes a vertex corresponding to the $i$-th occurrence of the concept $c$; $V_T$ is a set of nodes corresponding to the columns in $A_T$ ($|V_T| \le |A_T|$); $E_I \subseteq V_I \times \mathcal{L} \times V_I$ represents the relationships among the concepts in $V_I$; $E_T \subseteq V_I \times \mathcal{P} \times V_T$ denotes the properties associated with the columns of $T$.

3. Web Table Visualization with Column Type Annotation

Once one of the CTI approaches is applied, an annotated table $T$ is generated: its columns and values are associated with a set of types and possibly with concept-property pairs $\langle c, p \rangle$ of the domain ontology. In some cases, table columns that uniquely identify an instance of the concept (e.g. SSN of a person) can be discriminated from those that are simple characteristics of the identified concept (e.g. gender of a person). More than a single type can be used for annotating the values of a single column. Whenever the frequency of cells of a given type is above a given threshold, the type is added to the union type identified for the column. Conversely, if the frequency is below the threshold, the cell is considered an error. Moreover, CTI approaches can also extract annotations for sub-components of strings, thus a record type can be extracted according to a given pattern (introduced as mixed type in Section 2).
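The threshold rule described above can be illustrated with a minimal Python sketch. The function name, the default threshold of 0.2, and the choice of treating a frequency exactly equal to the threshold as "above" are assumptions made for illustration; they are not taken from the CTI approach of [14].

```python
from collections import Counter


def classify_column(cell_types, threshold=0.2):
    """Split the types observed in a column into union members and error types.

    cell_types: type names predicted for the non-empty cells of one column.
    threshold:  hypothetical minimum relative frequency for a type to be
                added to the union type of the column.
    """
    total = len(cell_types)
    if total == 0:
        return [], []
    counts = Counter(cell_types)
    # Types that are frequent enough become components of the union type.
    union_members = [t for t, n in counts.items() if n / total >= threshold]
    # Cells of the remaining types are reported as errors.
    error_types = [t for t, n in counts.items() if n / total < threshold]
    return union_members, error_types


# Hypothetical column: 6 SSN values, 3 VAT values, 1 value recognized as text.
column = ["SSN"] * 6 + ["VAT"] * 3 + ["text"]
members, errors = classify_column(column, threshold=0.2)
```

With this input the column would be typed as a union of SSN and VAT, and the single text cell would be flagged as an error to be shown in red.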

Sometimes CTI approaches are not able to identify the pair $\langle c, p \rangle$ for all the table columns. Columns that did not receive an annotation are called unmatched and need to be properly handled by our graphical interfaces.

In this section, we introduce the interfaces for showing the table with its type annotations and for updating them.

3.1. Main interface

Figure 1 shows the main interface we have developed for showing a table representing information about invoices for the payment of taxes. The invoice can be related to a person or a company (with the relations holder and owner). For each holder or owner, the associated address can be the residential address (in the case of people) or the headquarters (in the case of a company). The invoice contains the taxes that the associated person or company should pay along with the penalty. Each cell and column is associated with the predicted type annotation. The first line reports the column schema and is followed by a drop-down menu containing the inferred type annotations for each column. If more than one annotation is reported in a single column, each annotation is a member of a union type, which is represented using different background colors for distinguishing their instances, as in the second column of the table in Figure 1, where the occurrences of SSN and VAT are distinguished using two different shades of green. Similarly, the presence of mixed types is represented using different text colors for each component of the pattern. If a column presents a single annotation, the background remains white; if a value is missing, the cell background color is yellow. The usage of different colors can help the users in checking the type predictions and the empty values (in some cases a value should be provided) and in performing error corrections. A cell is considered an error when its type annotation is not compliant with the ones identified for the column; in this case, the cell background and the column type background are marked in red. Facilities are provided for showing only the rows presenting values of a given type in a column (to easily check and correct them).

When the number of rows contained in a document is high, it can be difficult to detect all the red cells; for this reason, we provide an error panel, on the right part of the screen, that summarizes the issues that need to be solved. An example of the panel is shown in Figure 2. When the user clicks the check button on one of the tabs in the panel, only the rows presenting the error are shown in the main interface. Once the errors are corrected, the corresponding panel tab is removed.

3.2. Modification of type annotations

Since the CTI approach can produce false positives or negatives (e.g. a ZIP code that has been labeled as Integer), specific GUIs have been developed for supporting the user in modifying the predicted annotations and easily applying the modification to the entire column or to a subset of cells. In the modification process, the user should be supported in the specification of concept-property pairs $\langle c, p \rangle$ instead of the basic or domain-specific types that can be obtained through the CTI approach. Indeed, users can easily identify the domain concept to be associated with the column and thus improve the semantic description of the column.

The user is also supported in the modification of the value of a cell when it contains errors. Consider for example the red cell in the date of birth column in Figure 1. It contains a value not compliant with the column type since the separators of the date are missing. In this case, the user can fix the mistake by directly editing the cell.

For performing bulk modifications on the values, a specific interface has been developed. For each column, the interface groups the occurrences of identical strings and reports each string followed by its number of occurrences. Similar strings are clustered together relying on the edit distance and then reported together in the interface. In this way, it is easier to visually detect the errors and correct them with the aim of obtaining a homogeneous representation of the same kind of information. The user can edit a single value, and the proposed modification is applied to all its occurrences (note that when corrections lead to a value already present in the column, the two rows are collapsed).
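The grouping and clustering of column values for bulk editing can be sketched as follows. This is only an illustration under stated assumptions: the text says that similar strings are clustered relying on the edit distance, but the greedy strategy, the `max_distance` parameter, and all function names below are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cluster_values(values, max_distance=2):
    """Group identical strings, then greedily cluster similar ones.

    Each cluster is a list of (string, occurrences) pairs; its first element
    is the most frequent string and acts as the cluster representative.
    """
    counts = Counter(values)
    clusters = []
    for value, occurrences in counts.most_common():
        for cluster in clusters:
            if edit_distance(value, cluster[0][0]) <= max_distance:
                cluster.append((value, occurrences))
                break
        else:
            clusters.append([(value, occurrences)])
    return clusters


def bulk_replace(values, old, new):
    """Apply one correction to all the occurrences of a string."""
    return [new if v == old else v for v in values]


# Hypothetical content of a municipality column.
cities = ["Milano", "Milano", "Milanoo", "Roma", "Milano", "Rma"]
clusters = cluster_values(cities)
fixed = bulk_replace(cities, "Milanoo", "Milano")
```

After the correction, the misspelled string no longer appears as a separate group: its occurrence is counted together with the value it was corrected to, which mirrors the collapsing of rows mentioned above.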

The interface in Figure 3 was developed for easily changing the type of an entire column or of a subset of its values when column annotations are not correctly identified. The interface can be activated on a single cell, which becomes the current target of the modification, or on the entire column. Through this interface: i) the inferred data type can be modified into a new type, or into a $\langle c, p \rangle$ pair of the domain ontology; ii) a mixed type can be created or modified. The interface is organized into 5 areas. In (1) the hierarchy of concepts available in the ontology is reported along with the basic types (collected under the button General). The user can select one of the available concepts, and the corresponding properties are reported in (2) (when the General button is pressed, the basic types are reported). In (3), when the interface is activated on a target cell, a single type is reported (the value type); otherwise, the components of the union type specified for the column are reported. In this way, the type can be changed for each component of the union type. Area (4) reports the target value or the column name, highlighted with the current type for the column. The user can remove the current labeling (by clicking on the x button on the top left corner of the string) and apply a new $\langle c, p \rangle$ pair. In (5) the values of the same type present in the column are reported and the user can select those to which the type modification should be applied (all is the default behaviour). The user can also decide to select the "text" checkbox reported in (6) to unify undesired union types (e.g. decimal and integer) and to treat the whole column as an instance of a single type. Then, the user can select the new type to be assigned to all values.

Example 1. Two errors occur in the ZIP code column in Figure 1. Through the interface in Figure 3, the user highlights Type_2 and substitutes the errors with Address.ZIP. Moreover, Type_1 can be modified into Address.ZIP, leading to a single column type. □

The possibility of modifying types according to the concepts contained in the domain ontology can also introduce some issues that need to be properly managed.

Example 2. Consider the column Name/Company in Figure 1 that the CTI approach has typed union(mixed_1, text, company), where the structure associated with mixed_1 is rec(name, surname). The value Danielle Gray Greeen is of type text and can be changed to the mixed type mixed_2, whose structure is rec(Person.name, Person.surname). So, a more complex type than the one expected is generated. □

To face this issue, a re-writing system based on rules [24] has been developed for the simplification of the type expression after the modifications applied by the user. The re-writing rules express the correspondence between simple types and $\langle c, p \rangle$ pairs of the domain ontology occurring in the same table column. Once the re-writing rules are applied, the union-type components presenting the same structure are compacted. The union type is finally transformed into a simple type when a single component is identified. In the previous example, the application of the re-writing system leads to the type union(mixed_1, company), where the record type associated with mixed_1 is rec(Person.name, Person.surname).
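The effect of the simplification on the running example can be reproduced with a small Python sketch. It is an illustrative simplification written for this example, not the rule system of [24]: the representation of types as strings and tuples, the `rules` dictionary, and the function name are all assumptions.

```python
def simplify_union(components, records, rules):
    """Simplify a union type after the user's modifications.

    components: ordered names of the union components, e.g. ["mixed_1", "company"].
    records:    structure of each record (mixed) type, as a tuple of field types.
    rules:      hypothetical re-writing rules mapping a simple type to the
                ontology pair it corresponds to in this column.
    Returns the simplified type expression and the re-written record structures.
    """
    kept = []        # components that survive the compaction, in order
    rewritten = {}   # record name -> re-written structure
    seen = set()     # structures already represented in the union
    for name in components:
        if name in records:
            structure = tuple(rules.get(f, f) for f in records[name])
        else:
            structure = rules.get(name, name)
        if structure in seen:
            continue  # same structure as an earlier component: compact it
        seen.add(structure)
        if name in records:
            rewritten[name] = structure
            kept.append(name)
        else:
            kept.append(structure)
    if len(kept) == 1:
        return kept[0], rewritten  # a single component becomes a simple type
    return "union(" + ", ".join(kept) + ")", rewritten


# Column Name/Company after the user has re-typed the only text value as mixed_2.
expression, structures = simplify_union(
    components=["mixed_1", "mixed_2", "company"],
    records={
        "mixed_1": ("name", "surname"),
        "mixed_2": ("Person.name", "Person.surname"),
    },
    rules={"name": "Person.name", "surname": "Person.surname"},
)
```

Here mixed_1 is re-written to rec(Person.name, Person.surname), which makes mixed_2 redundant, so the two components are compacted into one.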

3.3. Identification of a mixed type

Even if different approaches for the extraction of concepts from texts have been proposed [25], the identification of the sub-components of a mixed type is quite hard to handle automatically, especially when errors and variability in the pattern occur. Our interfaces support the user in the specification and modification of mixed types.

Example 3. Consider the column address in Figure 1 and suppose the ML algorithm was able to identify the type mixed_1 for some of its values. The others are marked as type text and we can see that they follow two specific patterns. These patterns can be manually detected on a single instance and applied to all the others. □

The interface for the identification of mixed types is similar to the one presented in Figure 3 but it works on specific cell values that are reported in (4). Once the user has selected the property of a concept (in this case the municipality of an Address), they can highlight the part of the string of that type. Repeating this step for all the components leads to the situation reported in the top part of Figure 4. In this way, we identify the terminal and non-terminal symbols that form our pattern. The non-labeled items are considered terminal symbols, while the labeled items are exploited for the generation of the pattern. Note that the void symbol can be applied for skipping variable parts of the string. Once the labeling is complete, the user can check if the generated pattern can be applied to other strings occurring in the same column that adhere to the same pattern (the instances in (5) that follow the pattern are highlighted). When the user tries to apply the labeling to other strings, the interface in Figure 5 is shown. The top part of the figure reports the labeled string, whereas the left panel reports the strings that do not present the same pattern and the right panel contains the strings that have been re-written according to the identified pattern. The user can check the correctness of the applied pattern in the right panel and move the erroneously annotated strings to the left one. Moreover, the user can take note of the strings in the left panel because they require the specification of different patterns or the identification of different types.
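One possible way to turn a labeled string into a reusable pattern is sketched below with Python regular expressions. The paper does not specify the pattern language, so the use of regular expressions, the `"void"` label, the rule that a labeled component made only of digits matches digits, and the sample addresses are all assumptions made for illustration.

```python
import re


def build_pattern(segments):
    """Build a regular expression from a labeled string.

    segments: (text, label) pairs in string order. A label of None marks a
    terminal symbol (matched literally), "void" marks a variable part to be
    skipped, and any other label is the property assigned by the user.
    """
    parts = []
    for text, label in segments:
        if label is None:
            parts.append(re.escape(text))
        elif label == "void":
            parts.append(r".*?")
        elif text.isdigit():
            parts.append(f"(?P<{label}>\\d+)")
        else:
            parts.append(f"(?P<{label}>.+?)")
    return re.compile("^" + "".join(parts) + "$")


def apply_pattern(pattern, values):
    """Split values into those re-written by the pattern and the others."""
    matched, unmatched = {}, []
    for value in values:
        found = pattern.match(value)
        if found:
            matched[value] = found.groupdict()
        else:
            unmatched.append(value)
    return matched, unmatched


# Labeling performed by the user on one (hypothetical) address string.
labeled = [
    ("Via Celoria", "streetName"),
    (" ", None),
    ("18", "streetNumber"),
    (", ", None),
    ("Milano", "municipality"),
]
pattern = build_pattern(labeled)
matched, unmatched = apply_pattern(
    pattern, ["Via Roma 5, Torino", "Piazza Duomo, 20121 Milano"]
)
```

The `matched` dictionary plays the role of the right panel (strings re-written according to the pattern), while `unmatched` corresponds to the left panel, whose strings need a different pattern or a different type.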

In our example, the pattern can be applied only to two strings. For the remaining two strings of type text, the pattern in the bottom part of Figure 4 should be specified on one of them and applied to the other.

4. Relation discovery

As a result of the RD task, a graph can be generated. Its vertices correspond to the concepts that occur in the table or to concepts induced by the presence of relationships with the table concepts. Moreover, graph edges are predicted by the adopted ML algorithm. Besides that, nodes representing the table columns are also included in the graph and are associated with the concepts by means of the corresponding properties.
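The graph produced by the RD task can be represented with a simple data structure that keeps meta-instance nodes, column nodes, relationship edges, and property edges separate. The following Python sketch is only an illustration: the class, its method names, and the encoding of a meta-instance as a (concept, index) pair are assumptions, not the data structures of the actual web application.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticDescription:
    """Graph with meta-instance nodes, column nodes, and two kinds of edges."""
    instance_nodes: set = field(default_factory=set)  # e.g. ("Invoice", 1)
    column_nodes: set = field(default_factory=set)    # e.g. "tax"
    relation_edges: set = field(default_factory=set)  # (instance, relation, instance)
    property_edges: set = field(default_factory=set)  # (instance, property, column)

    def add_instance(self, concept):
        """Create the next meta-instance of a concept and return its node."""
        index = 1 + sum(1 for c, _ in self.instance_nodes if c == concept)
        node = (concept, index)
        self.instance_nodes.add(node)
        return node

    def add_property(self, instance, prop, column):
        """Link a table column to a meta-instance through a property."""
        if instance not in self.instance_nodes:
            raise ValueError(f"unknown instance node: {instance}")
        self.column_nodes.add(column)
        self.property_edges.add((instance, prop, column))

    def add_relation(self, source, relation, target):
        """Link two meta-instances through a relation of the ontology."""
        if source not in self.instance_nodes or target not in self.instance_nodes:
            raise ValueError("both endpoints must be instance nodes")
        self.relation_edges.add((source, relation, target))

    def unmatched(self, columns):
        """Columns of the table that are not yet part of the graph."""
        return [c for c in columns if c not in self.column_nodes]


sd = SemanticDescription()
invoice = sd.add_instance("Invoice")
person = sd.add_instance("Person")
sd.add_property(invoice, "taxes", "tax")
sd.add_relation(invoice, "holder", person)
```

Numbering the meta-instances of a concept makes it possible to distinguish several occurrences of the same concept in one table, and `unmatched` returns the columns that still need an annotation.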

Starting from SD, a graphical representation can be devised and reported in the main canvas of Figure 6 to be checked and approved by the user. The green nodes (representing meta-instances) are laid out in the top part of the canvas, whereas the light blue nodes (representing the table columns) are in the lower part of the canvas. Edges between instance nodes and edges between instance nodes and terminal nodes are represented in the same way (labeled arrows) because their meaning is easily understandable from the context. The label on an edge is the relation/property name. For each light blue node in the graph, a single incoming edge is present if the column has a single basic type (e.g. the zip column). Multiple incoming edges can be present when the light blue node represents a mixed or union-type column. For example, the Address column is of type mixed and three incoming edges are present (representing the properties streetName, streetNumber, and municipality). Moreover, the SSN/VAT column is an example of a column of type union and two incoming edges are present (one representing the SSN property of the instance node $Person^1$ and the other representing the VAT identifier of the instance node $Company^1$). We have decided to maintain this simplified representation to keep the illustration simple. Isolated nodes (i.e. terminal nodes without incoming edges) are not included in our graph representation.

The left panel (1) contains buttons corresponding to the table columns. We exploit a double representation of the table columns (buttons in the left panel and light blue nodes in the central panel) because they are used for checking the correctness of the semantic values associated with each column and for adding missing annotations to the unmatched columns. Moreover, graphical edges are used for verifying the connections among the components and for modifying them or adding new ones.

The buttons in the left panel can be colored in two ways: green, i.e. the associated column has already been included in SD; pink, i.e. the associated column is not yet included in SD. By clicking on the arrow positioned on the left side of the button, it is possible to show the data type associated with that column (single type or union of types). By right-clicking on the button itself, it is possible to specify a new pair $\langle c, p \rangle$ of the domain ontology for each data type of the column.

In the remainder, we discuss the operations that can be invoked on the two parts of the interface.

4.1. Visual operations on table columns

The following operations can be invoked on the table columns reported in the left panel for the correction of errors or in the definition of new nodes:

1. Association of properties. It allows the specification of properties for unmatched columns.
2. Modification of properties. It allows changing the current association of properties for a column.
3. Removal of properties. It removes the semantic concept associated with the column.

Operation 1 can be invoked on unmatched columns (i.e. pink buttons) and used for including them in SD in two steps. First, the properties of the ontology concepts that represent the column content are identified. Then, the instance nodes in SD (or new nodes that need to be added to SD), to which the properties can be associated, must be defined.

Example 4. Consider the unmatched tax column in Figure 6. When Operation 1 is invoked on it, an interface similar to the one in Figure 3 is shown. In this case, the property taxes associated with the concept Invoice on the left bar is used for annotating the entire column. At the end of the operation, since no node in SD represents an instance of Invoice, the node $Invoice^1$ is introduced in SD together with the node representing the table column tax. The edge ($Invoice^1$, taxes, tax) is included in SD. □

Whenever the chosen concept is already present in SD, an interface is shown to the user for deciding if the identified property should be associated with one of the meta-nodes in SD or a new one should be included. In this way, it is possible to distinguish the presence of different instances of the same concept.

Example 5. Consider the situation of the previous example, and suppose that the table column penalty is now semantically annotated with Invoice.penalty. Since the node $Invoice^1$ is already included in SD, a panel is shown to the user for deciding if the property should be associated with $Invoice^1$ or a new meta-instance should be created. □

Regardless of the number of instances, after the introduction of a new concept, the system identifies the relations existing between the newly inserted element and the other concepts in the ontology. If a single relation is present, it is automatically added to SD.

Operation 2 is used for changing the semantic annotation already associated with a column (i.e. it is invoked on a green button). Besides changing the semantic annotation, this operation also allows changing the instance node to which the properties are associated (if needed). By invoking Operation 3 on a green button, the existing semantic annotation is removed along with the corresponding nodes in the graphical representation.

4.2. Visual Operations on SD

Table 1 shows the operations that can be executed on the graphical representation of SD. Some of them (1 and 2) can be invoked on light blue nodes and produce the same effect as the corresponding operations that can be applied to the buttons on the left sidebar. Operation 3 can be invoked on an instance node and allows the introduction of a new link with another instance node. The inserted links must be coherent with the domain ontology so that, for each pair of nodes, only existing relations in the correct direction can be added.

Example 6. Consider the unmatched columns tax and penalty that have been associated with the instance node $Invoice^1$. This instance node should be linked with its holder or its owner (that can be a person or a company). These bindings are realized by means of the interface in Figure 7. The interface shows the lists of relation names (ingoing and outgoing) that can be exploited for the nodes of this concept by taking into account the instance nodes in the current SD and the constraints of the domain ontology. The user can select the correct relation and insert it in the graphical representation of SD, which is updated accordingly. In this case, the interface is used two times for including two links (relations holder and owner) representing the relations with the person and the company. □
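The coherence check behind the lists of ingoing and outgoing relations can be sketched as follows. The sketch assumes that the ontology relationships are available as (source concept, relation name, target concept) triples and that instance nodes are (concept, index) pairs; the inheritance hierarchy is ignored for brevity, and the four triples used in the example are hypothetical.

```python
def candidate_relations(node, instance_nodes, ontology_relations):
    """List the links allowed by the ontology between `node` and the other
    instance nodes, separating outgoing from ingoing ones."""
    concept = node[0]
    outgoing, ingoing = [], []
    for other in sorted(instance_nodes):
        if other == node:
            continue
        for source, relation, target in sorted(ontology_relations):
            if source == concept and target == other[0]:
                outgoing.append((node, relation, other))
            if target == concept and source == other[0]:
                ingoing.append((other, relation, node))
    return outgoing, ingoing


def auto_link(node, instance_nodes, ontology_relations):
    """Return the only admissible link, if exactly one exists, else None."""
    outgoing, ingoing = candidate_relations(node, instance_nodes, ontology_relations)
    candidates = outgoing + ingoing
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical fragment of the invoice ontology.
ontology = {
    ("Invoice", "holder", "Person"),
    ("Invoice", "holder", "Company"),
    ("Invoice", "owner", "Person"),
    ("Invoice", "owner", "Company"),
}
invoice, person, company = ("Invoice", 1), ("Person", 1), ("Company", 1)
nodes = {invoice, person, company}
outgoing, ingoing = candidate_relations(invoice, nodes, ontology)
```

Because the triples are directed, a link such as holder can be proposed only from an Invoice node towards a Person or Company node, never in the opposite direction. When several candidates exist, as here, the choice is left to the user, and `auto_link` returns a link only when it is unambiguous.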

Operation 4 of Table 1 can be invoked on a node of the graphical representation of SD with the aim of modifying the name of the relation between two nodes or one of the nodes connected by the link. Finally, Operation 5 allows the deletion of an edge occurring in SD.

Table 2: tasks of the usability test (goal, time, success and failure conditions).

Task 1. Goal: mixed type. Time: 7 min.
  Success: definition of a mixed type and application of the labeling to other strings through the "apply" function.
  Failure: lack of the pattern definition, or application of the new procedure every time.

Task 2. Goal: errors. Time: 5 min.
  Success: detection and correction of the errors on values/types through the error panel.
  Failure: the errors are not corrected and the error panel is not exploited.

Task 3. Goal: bulk editing. Time: 3 min.
  Success: rows are updated in a single operation.
  Failure: rows are updated one at a time.

Task 4. Goal: management of unmatched columns. Time: 7 min.
  Success: the user specifies a concept and a property for each unmatched column.
  Failure: one or more columns have not been associated with a concept and property of the domain ontology.

Task 5. Goal: connections among concepts. Time: 5 min.
  Success: the user identifies the correctness of the existing links, adds the missing ones and modifies the wrong ones.
  Failure: the final SD is not complete or the links are wrong.

Task 6. Goal: new instance. Time: 4 min.
  Success: the user is able to define a new instance for a concept.
  Failure: the user uses the same instance for multiple occurrences.

Figure 8 shows the final SD that can be realized by means of our tool and that describes the different kinds of data that can be extracted from our running example. This semantic description can then be used for translating the table content into RDF triples.

5. Experimental results

We organized a usability test of the Web application. The aim of this test is to evaluate if the users can smoothly interact with the application and use the provided tools, to assess what level of knowledge in computer science is needed, and to check the existence of critical aspects that should be fixed or improvements to be applied. This test is composed of three parts: first, the user watches a video that introduces the problem and shows the system usage. Then, some tasks are assigned to be carried out on specific files. Finally, the user fills out a questionnaire about their experience with the system, containing: i) personal information (age, gender, level of instruction) and technical abilities (computer skills in general, knowledge of operating systems, skills in the use of spreadsheets, ...); ii) users' opinions about the assigned tasks and their complexity; and iii) users' opinions about the functionalities of the proposed tool. The questions are rated using a Likert scale (from "strongly disagree" to "strongly agree").

We selected 20 participants, 12 males and 8 females; 60% of them were between 21 and 23 years old, 20% between 24 and 26 years old, and the remaining ones were older than 26. Most of the users were recruited among personnel and students of the department of computer science of the University of Milan and therefore they have good technical skills. However, they are not involved in this project and they have little knowledge of the domain. Only half of the participants (50%) feel confident with Excel. Most of the students are currently attending their bachelor's degree, therefore they have only a high school diploma. Users have an average knowledge of different operating systems and use a computer or a laptop mostly for working or studying.

Table 2 reports the tasks that we have identified for checking the main functionalities of our system. Each task requires the processing of a spreadsheet that is specifically created for the purpose of the task and whose content can be easily understood also by non-experts of the domain. Specifically, two spreadsheets have been designed for pointing out the issues that each task was intended to address. Even if these spreadsheets correspond to real documents of our domain, their content has been anonymized for preserving user privacy. For each task, Table 2 reports the main goal, the time required for completing the task, and when the task can be considered successfully completed or a complete failure. Tasks 1, 2, and 3 are used for evaluating the usability of the interfaces developed for CTI, while the remaining tasks are used for evaluating the interfaces in handling the result of RD.

Most of the volunteers (70%) were able to complete the assigned tasks within the specified time limits. The others would have been able to complete the tasks with additional time. A good fraction (70% of the users) thought that the assigned tasks were easy and intuitive enough.

For task 1, most of the individuals (85%) were able to specify a mixed type through the interface. All of them used the "apply" button to label all the mixed types in a single column. The main reason for the failure of this task was the choice of the wrong interface (they selected the interface for the modification of the column type instead of the one for modifying the cell type).

For task 2, 75% of the individuals used the error panel and the general impression about its usefulness is very positive (from partially to strongly agree). The users that did not exploit the error panel tried to increase the size of the table pagination to identify the errors. In these cases, the identification and correction of the errors required more time. Concerning this task, only 28% of users had trouble in distinguishing errors occurring on the data type (i.e. the component of a union type was not identified by the used CTI approach) from errors occurring on the data (i.e. a date is written without separators).

For task 3, most of the individuals (86%) were able to use the bulk editing functionality and all of them thought it speeds up the editing process. The remaining part did not notice the error occurring within the data (usually an additional letter in the name of a city) and they corrected it by editing the data type.

For task 4 and task 5, 90% of the individuals completed the job in just 10 minutes. Only 10% of them had some trouble remembering the procedures to complete task 5.

For task 6, 20% of the users needed to watch the training videos again to complete the assignment correctly. The additional time required for watching the videos was not counted in the total time of the task completion. The fact that all users, possibly after watching the training video again, have completed the tasks correctly highlights a possible difficulty for a novice user in learning the various procedures rather than in applying them.

We tested user satisfaction in using the developed interfaces to support users in solving both CTI and RD issues. For the interfaces developed for CTI, 95% of the users agreed that the application is easy-to-use and intuitive and 85% of them declared that they did not have problems during the error correction process. The greatest difficulties were related to the understanding of the specific domain; most of the users did not know the meaning of the concepts of the domain ontology and tried to identify the most suitable one. Moreover, the application provides a lot of functionalities and the user needs time to gain confidence in the system.

For the interfaces developed for RD, 85% of the users agreed that the interfaces are easy-to-use and intuitive, whereas the remaining ones expressed a neutral position. The greatest uncertainties concerned the application of the operations on the graphical representation of SD. In particular, a few users had difficulty in recognizing or applying operations such as the insertion of new concepts or the insertion or removal of links between concepts. Half of the users declared that they had to remove an edge because it was not correct. Only 16% of the users could not connect all the graph nodes because they did not have enough knowledge about the domain (e.g. they did not know that a company can be the invoice holder).

In conclusion, the usability test suggests that although some aspects of the application could be improved, for example by adding contextual help possibly supported by short videos, the overall opinion is that the system is intuitive and easy to use.

6. Concluding remarks

In this paper we have discussed different user interfaces that can be exploited after the application of CTI and RD approaches for correcting the automatically predicted annotations and thus improving the semantic description of web tables. The developed interfaces allow, in many cases, the specification of a single modification and its propagation to other values in the same column that follow the same type or the same pattern. Once the semantic description has been generated and validated, it can be exploited for the translation of the table content into a KG representation, thus obtaining a meaningful representation of the table content.

The problem of supporting the user in the interpretation of table content was initially faced in Karma [18]. Our approach deeply extends this work by considering a more sophisticated data model both for the CTI and RD interfaces. Our semantic description allows the management of tuples of different types that need a different knowledge representation. Moreover, interfaces for the semantic labeling of columns and for extracting patterns for strings are new contributions of this paper.

A usability test has been run on the graphical interfaces for assessing their ease of use. Our results show that almost all users believe the application is easy-to-use and intuitive. Some more efforts should be devoted to improving the interfaces for handling the semantic description and for showing the results of the modifications on the table data. We are currently working on providing further facilities for supporting the user in this activity.

The work discussed in this paper can be extended in several directions. Even if we have focused on developing graphical interfaces for supporting the CTI and RD tasks, entity linking approaches [26] can also be used for table understanding and specific interfaces can be included in our system for their management. Moreover, once the semantic description is obtained, it can be exploited for the creation of KGs reporting the table content [21]. Specific interfaces can also be developed for supporting the user in obtaining this result, for the management of duplication, and for fusing together alternative representations of the same entity. We would also like to collect the user modifications on the automatically generated annotations provided by CTI and RD approaches and use them for tuning the underlying approaches. Finally, we would like to use the proposed interfaces for the construction of biological knowledge graphs [27].

Acknowledgments

This research was supported by the "National Center for Gene Therapy and Drugs based on RNA Technology", PNRR-NextGenerationEU program [G43C22001320007].

References

[1] M. J. Cafarella, et al., Webtables: Exploring the power of tables on the web, Proc. VLDB 1 (2008) 538-549. doi:10.14778/1453856.1453916.
[2] Bonfitto, E. Casiraghi, Mesiti, Table understanding approaches for extracting knowledge from heterogeneous tables, WIREs Data Mining Knowl. Discov. 11 (2021). doi:10.1002/widm.1407.
[3] Kandel, et al., Wrangler: Interactive visual specification of data transformation scripts, in: ACM Human Factors in Computing Systems (CHI), 2011, p. 3363-3372. doi:10.1145/1978942.1979444.
[4] Trifacta, Wrangler, 2020. www.trifacta.com/.
[5] Google, Openrefine: A free, open source, powerful tool for working with messy data, 2020. https://openrefine.org/.
[6] I. Valera, Z. Ghahramani, Automatic discovery of the statistical types of variables in a dataset, in: Proc. of Machine Learning Research, volume 70, 2017, pp. 3521-3529.
[7] Ceritli, C. K. I. Williams, Geddes, ptype: probabilistic type inference, Data Mining and Knowledge Discovery 34 (2020) 870-904. doi:10.1007/s10618-020-00680-1.
[8] Yang, Abdelhédi, Darmont, Ravat, Teste, Automatic machine learning-based olap measure [...] and Knowledge Discovery, Springer, 2022, pp. 173-188. doi:10.1007/978-3-031-12670-3_15.
[9] Pham, et al., Semantic labeling: A domain-independent approach, in: The Semantic Web Conference, Springer, Germany, 2016, pp. 446-462.
[10] Rümmele, Tyshetskiy, A. Collins, Evaluating [...] Workshop on Linked Data on the Web, volume 2073 CEUR, Lyon, France, 2018, pp. 30-40.
[11] Chen, Jimenez-Ruiz, Horrocks, Sutton, [...] column type prediction, in: Proc. of AAAI Conf. on Artificial Intelligence, volume 33, 2019, pp. 29-36. doi:10.1609/aaai.v33i01.330129.
[12] Hulsebos, et al., Sherlock: A deep learning approach to semantic data type detection, in: SIGKDD [...]ing, ACM, 2019, p. 1500-1508.
[13] Zhang, et al., Sato: Contextual semantic type detection in tables, Proc. VLDB 13 (2020) 1835-1848. doi:10.14778/3407790.3407793.
[14] Bonfitto, et al., Semi-automatic column type inference for CSV table understanding, in: Proc. of 47th [...]tice of Computer Science, SOFSEM, volume 12607 LNCS, Springer, Bolzano, Italy, 2021, pp. 535-549. doi:10.1007/978-3-030-67731-2_39.
[15] Limaye, Sarawagi, Chakrabarti, Annotating and searching web tables using entities, types and relationships, Proc. VLDB 3 (2010) 1338-1347. doi:10.14778/1920841.1921005.
[16] Venetis, et al., Recovering semantics of tables on the web, Proc. VLDB 4 (2011) 528-538. doi:10.14778/2002938.2002939.
[17] Mulwad, Finin, Joshi, Semantic message passing for generating linked data from tables, in: The Semantic Web Conference, Springer, Berlin, Heidelberg, 2013, pp. 363-378.
[18] Taheriyan, C. A. Knoblock, Szekely, J. L. Ambite, Learning the semantics of structured data sources, Journal of Web Semantics 37-38 (2016) 152-169. doi:10.1016/j.websem.2015.12.003.
[19] Futia, Vetrò, J. C. De Martin, Semi: A semantic modeling machine to build knowledge graphs with graph neural networks, SoftwareX 12 (2020) 100516.
[20] Vu, Knoblock, Pujara, Learning semantic models of data sources using probabilistic graphical models, in: The WWW Conf., ACM, 2019, p. 1944-1953. doi:10.1145/3308558.3313711.
[21] Bonfitto, et al., A semantic approach for constructing knowledge graphs extracted from tables, Tech. Rep, Dept. Computer Science, Uni. of Milano, 2023.
[22] Kumar, et al., Link prediction techniques, ap[...] A: Statistical Mechanics and Its Applications 553 (2020). doi:10.1016/j.physa.2020.124289.
[23] Schlichtkrull, et al., Modeling relational data with graph convolutional networks, 2017. doi:10.48550/ARXIV.1703.06103.
[24] Dershowitz, D. A. Plaisted, Chapter 9 - [...] North-Holland, 2001, pp. 535-610. doi:10.1016/B978-044450813-3/50011-4.
[25] Gutierrez, et al., A hybrid ontology-based in[...]mation Science 42 (2016) 798-820. doi:10.1177/0165551515610989.
[26] Zhang, K. Balog, Web table extraction, retrieval, and augmentation: A survey, ACM Trans. Intell. Syst. Technol. 11 (2020). doi:10.1145/3372117.
[27] Gliozzo, et al., Heterogeneous data integration methods for patient similarity networks, Briefings in Bioinformatics 23 (2022). doi:10.1093/bib/bbac207.