Appendix: WHiSe 2020 Diary

Introduction / 10:00 - 10:20

Chair: Alessandro Adamou (@anticitizen7x)

Session 1: Linked Data and Libraries / 10:20 - 12:00

Chair: Albert Meroño-Peñuela (@albertmeronyo)

Presentation: Minna Tamper, Petri Leskinen, Jouni Tuominen and Eero Hyvönen. "Modeling and Publishing Finnish Person Names as a Linked Open Data Ontology".

Albert HISCO is a similar effort to AMMO. Are these two linked in any way?

Max First, thank you, this was a nice talk and an interesting research effort. My question/idea: this may be overkill, but if there is linking to professions, etc., maybe it can be connected to general dictionaries using, e.g., OntoLex-Lemon instead of, or in addition to, specific ontologies.

Alessandro I was thinking the same in relation to the fact that some derivations are cross-cultural: for example, being related to priests is highly present in Greek (the prefix papa-).

Max Yes, exactly. It could also be interesting to compare cross-linguistically.

Antoine (Only for reference, or if there's a lot of time for questions; it's not crucial!) I'm wondering if the authors have looked at how libraries handle and publish names, e.g. the Library of Congress Name Authority File (and the MADS format it uses). I guess this great work is much richer in detail, but maybe there's room for interoperability work in the future. On another topic, would it be possible to use DCAT for describing the dataset itself? This work is probably worth publishing on some portals that use DCAT.

Minna Thank you for the ideas. We will take a look at MADS and DCAT; describing the dataset this way might indeed be a good addition.

Antoine Cool. Note that in fact MADS is not used a lot in a SW context, even though it has an RDF form. Maybe there are things in BIBFRAME, as an alternative.

Presentation: Fabian Hoppe, Tabea Tietz, Danilo Dessì, Nils Meyer, Mirjam Sprau, Mehwish Alam and Harald Sack. "The Challenges of German Archival Document Categorization on Insufficient Labeled Data".

Albert Curious about how open-ended the annotated categories are, and whether vocabularies were reused? Also, do you think the performance might be related to the training of word2vec with German Wikipedia?

Harald The categories have been provided by the archivists with respect to the available archival data; they organized them in a kind of hierarchical schema. (It seems usual for German archivists always to create new schemata depending on the current topic to be processed...) Unfortunately, labels can occur multiple times on different branches of this hierarchy. For your second question: yes, the model trained on German Wikipedia is not the best for the task, since we are dealing with a historical subject and the language and topics used are 100 years old (and bound to a specific region in Germany). We plan to make use of historical newspaper archives for a better-suited model.

Enrico This is very interesting! I am curious about how you join text and categories in the preparation phase for the embeddings. I would assume categories are somehow more important than the raw text. How did you combine the two, considering the algorithm wants a sequence of text as input?

Harald We are trying out different variants to come up with a representation for the categories, ranging from a simple embedding of the category name, to aggregations of the embeddings of a category's members, to taking longer descriptions of the categories into account.

Enrico Exciting project! I am curious about the policy for publishing people's contributions and findings. The location of new archaeological findings is quite a sensitive topic, as there are countries with a huge amount of heritage that is difficult to maintain or monitor and can be subject to smuggling (e.g. Mexico, Italy). Any discussion on that in your project?

Eero Yes, lots of discussion. At the moment the feeling is that exact coordinate info will be published. It seems that professional archaeologists in Finland trust amateurs more than those in some other countries, where only fuzzified data is published.

Albert Really cool project. I'm curious about the specific Linked Data features that were useful to users in the evaluation? E.g. reasoning, entity linking, etc.?

Eero In FindSampo Reporter, not much data linking is visible to the end-users.

More important at the moment is the integration of different systems, such as GIS systems, with the mobile system, and guiding the user to provide the data using harmonized terminology. In FindSampo Portal we are now focusing more on data linking, semantic faceted search, data analysis/visualization, and recommender systems.

sent us. We're always eager to receive more material that fits user needs :-)

Albert A third one: on your massive vocabulary reuse, do you think the engineering of EDM adjusted well to standard ontology engineering practices? Or were there new/singular practices you had to implement to fit the domain?

Antoine Excellent/tricky question, I hope I'm going to answer it right. In fact we have not followed the regular "formal" ontology engineering methods. It was not by choice (my first steps in SW were about such methodologies!). It's just that it would have been too hard to follow a method exactly and write down all the documentation. But in the end a lot of the best practices we've followed (and the documentation we've written) can be related to what is presented in these methodologies.

Jan Martin Which quality criteria did you use for source selection? And regarding vocabulary sources?

Antoine This was in one of my slides:
Availability and access: open license, published as linked data.
Granularity, size and coverage: multilingual data, with a rather generic scope. But datasets that are too generic or too large can create too much ambiguity for the simple processes we have (e.g., enrichment).
Quality: intrinsic aspects like correctness of representation.
Connectivity: good data sources are well-connected internally and externally to other datasets.

Jan Martin How did you measure the "correctness of representation"?

Enrico On a similar angle, how do you deal with different and multiple (or even conflicting) perspectives on the same object/artwork? E.g. conflicting attribution statements (I am thinking of Wikidata and its notion of truthy statements).

Antoine EDM uses a pattern from the OAI-ORE model whereby information from different sources is carried by different "proxies". We also re-use the Web Annotation model, which is a somewhat more intuitive way to represent annotations (but then it works better for individual data elements, not parts of graphs). We would have liked to use named graphs, but this would have required SW tools that are too advanced for us (our technical base is not an RDF quad store!).

Enrico Interested in the Linked Open Usable Data (LOUD) concept: what parts of Linked Data do you consider not usable? ... useful?

Antoine Very useful: URIs, links, lightweight data models that can be re-used, maybe pattern languages like ShEx/SHACL (though we couldn't use them yet). Not so useful and in fact often deceiving: OWL axioms and reasoning.
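Editorial illustration: the proxy pattern Antoine mentions in his answer on conflicting perspectives can be sketched in plain Python rather than RDF. This is only a rough sketch under stated assumptions, not Europeana's implementation: the class, the function names, the identifiers and the sample values are all hypothetical. Only the two relations `ore:proxyFor` and `ore:proxyIn` come from the OAI-ORE model. The point it shows is that each source's statements stay attached to that source's own proxy, so conflicting views of one object can coexist and be compared.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Proxy:
    """Hypothetical stand-in for an ore:Proxy: one source's view of one object."""
    proxy_for: str  # the described object (ore:proxyFor)
    proxy_in: str   # the aggregation of the contributing source (ore:proxyIn)
    statements: dict = field(default_factory=dict)  # property -> set of values


def views_by_source(proxies, obj, prop):
    """Collect each source's values for `prop` about `obj`, kept separate per source."""
    views = defaultdict(set)
    for p in proxies:
        if p.proxy_for == obj:
            views[p.proxy_in] |= set(p.statements.get(prop, ()))
    return dict(views)


def is_contested(proxies, obj, prop):
    """True when at least two sources give different, non-empty value sets."""
    views = views_by_source(proxies, obj, prop)
    distinct = {frozenset(values) for values in views.values() if values}
    return len(distinct) > 1


# Invented sample data: two sources disagree on the creator of the same painting.
proxies = [
    Proxy("ex:painting1", "ex:aggregation/museumA",
          {"dc:creator": {"Artist X"}, "dc:title": {"Still Life"}}),
    Proxy("ex:painting1", "ex:aggregation/museumB",
          {"dc:creator": {"Workshop of Artist X"}, "dc:title": {"Still Life"}}),
]

for source, values in views_by_source(proxies, "ex:painting1", "dc:creator").items():
    print(source, sorted(values))
```

Because nothing is merged, neither attribution overwrites the other; a consumer can decide later which source to prefer, or show both.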

Break / 13:15 - 14:00

Albert For this experiment in the Zeeland province we matched 270k-310k newborns in marriage certificates to brides/grooms, and 205k-244k parents of brides/grooms in marriage certificates to their own marriage certificate (depending on the Levenshtein distance). Scale is not really an issue due to our use of HDT and efficient data structures for computing Levenshtein distances. The major challenge is the variability of person names, since people used to have many given names that sometimes changed among certificates.

Enrico Very interesting approach to entity linking. Are you considering applying a similar strategy to other entity linking problems?

Albert Yes, definitely. The source code is very generic, and we are working towards making all parameters dataset-agnostic. The idea is to have a dataset-independent framework for entities in knowledge graphs that need to be linked using string similarity at large scale.

Enrico What about other entities, e.g. places: are they easily aligned to places of today, or are there challenges in doing such linking?

Albert Interesting answer pointing at AMCO/gemeentegeschiedenis for temporal placenames, and HISCO for historical occupations.

Enrico Very interesting! A common notion in linguistics is that meaning is contextual. How, in your opinion, does this affect the quality or usability of a sentiment lexicon?

Rachele Given the results of our application to the Medea of Seneca, we think that the lexicon could be useful. But we certainly need to improve the coverage.

Enrico How do you plan to evaluate the quality of the automatically generated silver standard?

Rachele We chose only derivational and semantic relations that were not ambiguous, so as to have a high-quality silver standard. For example, there are two in- prefixes, but we used only the one expressing negation because the other can have different meanings. Details on the evaluation can be found in the LREC 2020 paper.

Enrico Very interesting presentation! I particularly liked the case studies that capture the variety of aspects related to KG evolution. I understand the paper is about the challenges, but are you already thinking about strategies for making static KGs incorporate "some" of these dynamics?

Albert Really cool; I'd love to know more about further thoughts on using: (a) the typography (e.g. in your apfelstrudel example); and (b) semantic linking or language models to understand that the meaning of "bomb" is very far away from what's usual in recipe foods?