Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 Advancing on the Linked Open University Context: a Cuban linked Open University Yoan Antonio López Rodrı́guez,1[0000−0001−5615−375X] University of Informatics Sciences, Cuba. yalopez@uci.cu Abstract. Linked Open Universities apply Linked Data to publish in- formation about universities. In 2010, the Open University in the UK was launched as the first initiative to expose public information from the university in an accessible, open, integrated, and Web-based format. Since then, universities around the world have been joining that initia- tive by deploying their own linked open data platforms. However, during the publication of their data using the linked data principles, universities face challenges such as the lack of a unified, well-accepted vocabulary; the need of coping with the heterogeneity of datasets; the high cost to host the existing SPARQL endpoints; the performance shortcomings in federated queries over current SPARQL endpoints and the incomplete- ness of datasets. The aim of this Ph.D. research proposal is to advance on the Linked Open University context with a proposal for the University of Informatics Sciences from Cuba addressing some of these challenges. Keywords: Knowledge graph · linked data · linked open university · ontologies 1 Problem Statement Linked Open Universities apply Linked Data to publish information about uni- versities [3,5,4]. Traditionally, universities produce large amounts of data, much of which should be publicly available [3,5]. In this context, sharing and reusing knowledge is a challenge where Linked Open Data (LOD) can play an important role[3,5]. In 2010, the Open University in the UK was launched as the first ini- tiative to expose public information from the university in an accessible, open, integrated, and Web-based format[3]. Since then, universities around the world have been joining that initiative by deploying their own LOD platforms[6,8,12]. The process of generating linked datasets in universities consists of the fol- lowing stages[10]: i) raw data collection, ii) defining the vocabulary model based on reusing existing ontologies and extend them when it is needed, iii) extracting and generating RDF datasets according to the defined vocabulary, iv) achieving interlinking among datasets internally and externally, v) storing the outcome datasets and exposing them via SPARQL endpoints, vi) exploiting datasets by developing applications and services on top, and, vii) providing optimization and quality. In this sense, during the process, institutions have faced several issues. Some of those issues are the following (without the intention of being exhaustive)[10,8,1,13,7]: Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 49 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 – Lack of a unified, well-accepted vocabulary that satisfies all universities’ re- quirements. Vocabularies (or ontologies, more strictly), define the concepts and relations (also referred to as “terms”) used to describe and represent an area of concern. A standard set of vocabularies provides a unified access to data consumers [5]. That is why former works about linked open universi- ties agree with reusing existing vocabularies as much as possible. However, despite the fact that there are a lot of vocabularies/ontologies to describe entities in this context, there is no a standard vocabulary/ontology (or group of them) that meets the requirements of all universities. – The need of coping with the heterogeneity of datasets. Open university repos- itories look like a sea full of wealth resources, and the problem is how we can reach the needed resources easily. Description of dataset metadata is a crucial step to cope with the heterogeneity of datasets. – The high cost of the existing SPARQL endpoint interfaces. Due to the high expressive power of SPARQL, the processing of many kinds of queries on SPARQL endpoints is very expensive in terms of server CPU time and mem- ory consumption. Sometimes that consumption is unpredictable and thus, SPARQL endpoints suffer from frequent downtime. – Performance shortcomings of federated queries over current SPARQL end- points. A 2013 survey revealed that the majority of public SPARQL end- points had an uptime of less than 95%. This unavailability increases in fed- erated queries where we query many knowledge graphs. – Incomplete data is a problem that could complicate data querying and re- trieving if it is not considered. As of the previous problem statement, the aim of this Ph.D. research pro- posal is to advance on the linked open university context (LOU context) with a proposal for the University of Informatics Sciences from Cuba, applying novel techniques developed by the Semantic Web community and our lab to address previous challenges. The importance of this problem statement is described as follows in Section 2. 2 Importance Universities use LOD platforms to publish and access data about staff profiles, courses, scholarly publications, and open educational resources[3]. Implementing LOD can be summarized as using the web both as a channel to access data and, as a platform for the representation and integration of data. Having university linked datasets facilitates effective reuse of data and increases the opportunities to build more effective, integrated, and innovative applications based on these datasets[6]. It is also a trend revealing the contributions and achievements of the universities and making them visible on the Web as LOD as it is a means to measure the university’s reputation and standing among other international universities and institutions[10,9]. Advancing on the LOU context means resolving some existing issues with the objective of improving current and future LOD platforms. Aiming at a unified 50 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 vocabulary (or group of them) entails to get better current semantic modeling processes in order to select the better vocabularies/ontologies and extend them if it is needed. Current knowledge globalization makes the standardized access to resources further relevant in the LOU context than in any other one. Besides the data itself, it is also significant to describe datasets in order to increase their discoverability and interoperability on the web via associated metadata. Related work about semantic modeling in the LOU context is described in Section 3. On the other hand, RDF datasets access and query in the LOU context are extremely important for both, internal and external clients. A SPARQL endpoint is the common interface to triple live queryable and its shortcomings are quite well known[13,7]. To cope with the shortcomings of SPARQL endpoints, the Semantic Web Community has created some technologies such as Triple Pattern Fragment [13]. To advance on the LOU context this proposal takes into account these technologies along with common interfaces to query RDF triples. Related work about dataset´s access and query in the LOU context is described in Section 3. Regarding incomplete data, the open word assumption of the Semantic Web assumes incomplete information by default, and thus, the possibility of finding missing facts contributes to dataset optimization. Related work about knowledge graph completion in the LOU context is described as follows in Section 3. 3 Related work On the semantic modeling process, according to[3], the process to choose the terms is based on the following process: i) identify the concept to be expressed; ii) search for a widespread existing vocabulary to be used; iii) if found, use it, otherwise iv) search for a less-known vocabulary to reuse; v) if not found, create a new term. Moreover, according to [8] this process includes: i) listing common data categories about university information; and ii) getting a summary of useful vocabularies for describing university datasets. When it is necessary to create new ontologies, methodologies such as Methontology can be taken into consideration. Despite there has been a recent trend towards using the same vocabularies, in the beginning, each platform used its own vocabularies/ontologies. For instance, despite the fact that a course is the same in all universities, we can find it modeled both as a Teach: Course1 as well as an Aiiso: Course2 . With the aim of getting closer to a unified model, our proposal aims to use the most popular vocabularies/ontologies of the state of the art, and thus, we define the first research question and hypothesis related to the semantic modeling process in Section 4. Among the dataset metadata vocabularies, some notable ones are VoID3 used by most university platforms to describe datasets[3,6,12], the DataCube 1 http://linkedscience.org/teach/ns#Course 2 http://purl.org/vocab/aiiso/schema#Course 3 https://www.w3.org/TR/void/ 51 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 vocabulary4 to describe datasets of multidimensional data, and the DCAT5 , a feasible way to standardize metadata for catalogs, datasets and data services. In this research proposal, we define the second research question and hypothesis related to dataset metadata vocabularies in Section 4. In regard to datasets access and query, the SPARQL language6 is the W3C standard to express declarative queries over collections of RDF triples. There are three common interfaces to RDF triples: Data dumps, SPARQL endpoints and Linked Data documents [13]. To cope with the shortcomings of SPARQL endpoints, the Semantic Web Community has created some technologies, such as Triple Pattern Fragments (TPFs), which divides the query processing between clients and servers and allows to restrict the kinds of queries the client can send to the server[13]. Linked Connections are an example of a customized query in- terface for the consumption of open data in the Transportation area[2]. Linked Connections implement HTTP content negotiation7 [1]. Both, Linked Connec- tions and TPFs use vocabularies such as Hydra8 and Tree9 that contribute to the automation of the client-server communication and facilitate the federated queries[1]. On the LOU platforms, to the best of our knowledge, customized query interfaces of triples have not been implemented yet, publication and con- sumption are mostly solved only via SPARQL endpoints. We define the third research question and hypothesis related to datasets access and query in Section 4. Regarding incomplete data, given a Knowledge Graph G = (E, R, T ), where E and R denote the set of entities and relations and T = (h, r, t)|h, tE, rR is the set of triplets (facts), the task of Knowledge Graph Completion (KGC) in- volves inferring missing facts based on the known facts[11]. Much research work has been devoted to KGC. A common approach to carry out KGC has been knowledge graph embeddings. Most knowledge graph embedding models only use structure information in observed triple facts, nevertheless, advanced tech- niques use other information besides facts such as entity types, relation paths, textual descriptions, and logical rules [14]. To advance on the LOU context, our proposal aims at defining a knowledge graph embedding model able to use other information besides facts on the output university datasets. We define the fourth research question and hypothesis related to knowledge graph completion in Section 4. 4 https://www.w3.org/TR/vocab-data-cube/ 5 http://www.w3.org/TR/vocab-dcat-2/ 6 http://www.w3.org/tr/sparql11-query/ 7 http://tools.ietf.org/html/rfc7231#section-5.3 8 http://www.hydra-cg.com/spec/latest/core/ 9 https://treecg.github.io/specification/ 52 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 4 Research questions and hypotheses Regarding the lack of a unified, well-accepted vocabulary that satisfies all uni- versities’ requirements, we define the first research question and hypothesis as follow: Q1: How to contribute to having a unified, well-accepted vocabulary in the LOU context? H1: If we use the most used vocabularies/ontologies or extend them in our LOD platform, we contribute to having a unified, well-accepted vocabulary that satisfies all universities’ requirements. Regarding dataset metadata vocabularies, we consider that advancing on the LOU context means to advance on the automation of the client-server commu- nication, therefore we need a metadata vocabulary to allow client applications to get resources easily. In this sense, we define the second research question and hypothesis related to dataset metadata vocabularies as follows: Q2: How to support client applications getting resources easily in the LOU context? H2: Using the DCAT vocabulary to describe dataset metadata, client appli- cations are able to easily get the resources in the LOU context. With regard to datasets access and query, to advance on the LOU context means to incorporate novel approaches and technologies developed by the Se- mantic Web Community along with common interfaces. Query interfaces cus- tomized according to the client applications’ needs can contribute to the automa- tion of the client-server communication. We define the third research question and hypothesis related to datasets access and query as follows: Q3: Can query interfaces customized according to the client applications’ needs along with common interfaces improve datasets access and query in the LOU context? H3: Query interfaces customized according to the client applications’ needs along with common interfaces can improve the datasets access and query in the LOU context. Regarding the incompleteness of the datasets, we define the fourth research question and hypothesis related to incomplete data in the LOU context as fol- lows: Q4: What kinds of information besides facts in the knowledge graph em- bedding models produce an acceptable trade-off in the LOU context to predict missing facts? H4: Integrating textual descriptions and logical rules besides facts in the knowledge graph embedding models produces an acceptable trade-off in the LOU context to predict missing facts. Considerations about how to test these previous hypotheses are described as follows in Section 5. 53 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 5 Evaluation With the aim of finding out the most popular vocabularies/ontologies in the LOU context and testing the hypothesis H1, we carried out a state-of-the-art study. As outcomes, we found out: i) AIISO ontology10 is applied in most works while other vocabularies such as Teach11 and Courseware include crucial terms about university courses; ii) most works followed the principle of reusing existing ontologies as much as possible, however, all of them had to extend these vocab- ularies to fulfill their requirements; iii) general ontologies such as Dublin Core Metadata Element Set, FOAF, W3C Basic Geo Vocabulary, and W3C Ontology for Media Resources had a common usage along with the above vocabularies. Unlike existing LOD university platforms that have been using the VoID vocabulary to describe their datasets, in our proposal, the DCAT vocabulary is promoted for advancing in the LOU context because of its broader coverage than other vocabularies like VoID. The explanation of the DCAT vocabulary usage (H2 evaluation) is shown below along with the customized query interface. In order to evaluate the hypothesis H3, since all the state-of-the-art LOU platforms include a course dataset, we developed a customized query interface for the course dataset called coursesld server12 . We do not aim at replacing common triple interfaces, but using our proposal as a complementary query interface. University course data are a special segment of open data at the university given the public nature of the data[12,10]. With the increase of distance learning, there are a lot of course recommender applications that assist clients to find suitable courses. Our LOD university platform publishes on its homepage the DCAT-catalog file where course recommender applications can find datasets about courses. By accessing the dataset subject property, client applications can distinguish one dataset among many other ones (as in this case, teach: Course is the wanted subject). Additionally, by getting to the dataset distribution, clients know where to retrieve the dataset (URL via distribution accessURL property) and how to retrieve it (file format via distribution mediaType property, e.g., in this case, one of the four options: JSON-LD, Turtle, N-quads, CSV). In order to support the automation of the client-server communication a Hydra API Documentation was also defined, which can be obtained via catalog-dataset-conformsTo property. Normally, course recommender applications ask for courses about one specific topic that start on a certain date. That is why, coursesld server interface offers the possibility of getting courses related to that topic sorted by the start date via URL templates. To deal with memory requirements and advancing on the data reuse, the interface splits the answer into fragments with no more than 100 courses per fragment. Course recommender applications are able to request the server each fragment through the Tree/Hydra vocabulary without human 10 http://purl.org/vocab/aiiso/schema# 11 http://linkedscience.org/teach/ns# 12 https://github.com/yalopez84/coursesld server 54 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 intervention. To complete the hypothesis H3 evaluation, a client application was developed called coursesld client13 . In regard to the evaluation of the graph completion model (H4 evaluation), a controlled experiment will be carried out. The tests will be first on at least two general knowledge graphs: DBpedia and Freebase, and then, on specific LOU datasets. Currently, the model is being developed. 6 Results Besides the vocabularies/ontologies found in the state-of-the-art study, we put into consideration our semantic modeling process, which has benefited from pre- vious works[3,8], and promotes the usage of the Linked Open Vocabularies Ini- tiative (LOV14 ) to look for terms/vocabularies. Regarding H1 evaluation, we consider that it is true. If we look at the semantic model of our platform, we see AIISO vocabulary to represent the university structure and Tech vocabulary to describe the university course according to the tendency in the state of the art. Each time a platform is developed following this principle, such a university contributes to having a unified vocabulary or group of vocabularies in the LOU context. Regarding the coursesld server interface (H3 evaluation), its effectiveness is appreciated as an additional service for the exploitation of linked datasets. Since most developers are familiar with APIs with descriptions more than SPARQL endpoints over the RDF format, customized interfaces like the one presented in this work can serve to pave that issue. At the same time, we confirm the hypothesis H2 as the DCAT vocabulary allows to advance on the customized interfaces, federated queries, automation of client-server communication, and therefore the DCAT vocabulary allows to advance in the LOU context. 7 Reflection and future work Future work aims at continuing with this research. Some of the previous hy- potheses need further evaluation in order to reach the goal of this proposal: to advance on the LOU context. Given the wide range of topics at the university (research, academic programs, software production, internal places, energy and water consumption, etc.), adding new datasets to the LOD platform allows us to have a more complete knowledge graph. Afterward, we plan to extend the solution to other universities and evaluate the research contributions beyond the LOU context. Acknowledgments This research has been partially sponsored by VLIR-UOS Network University Cooperation Programme - Cuba. Special thanks go to: 13 https://github.com/yalopez84/coursesld client 14 https://lov.linkeddata.es/dataset/lov/terms 55 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 – Erik Mannens. Ghent University. Research interests: Data Science. He re- ceived his Ph.D. degree in Computer Science Engineering (2011) at UGent. – Pieter Colpaert Ghent University. Research interests: Linked Open Data, Transport. He received his Ph.D. degree in Computer Science Engineering (2017) at UGent. – Hector R. González. University of Informatics Science. Research interests: Machine Learning. He received his Ph.D. degree in Computer Science (2019) at University of Havana. – Yusniel Hidalgo Delgado. University of Informatics Science. Research inter- ests: Semantic Web. References 1. Colpaert, P.: Publishing transport data for maximum reuse. PhD Thesis, Ghent University (2017) 2. Colpaert, P., Llaves, A., Verborgh, R., Corcho, O., Mannens, E., Van de Walle, R.: Intermodal public transit routing using liked connections. In: International Semantic Web Conference: Posters and Demos. pp. 1–5 (2015) 3. Daga, E., dAquin, M., Adamou, A., Brown, S.: The open university linked data–data. open. ac. uk. Semantic Web 7(2), 183–191 (2016), publisher: IOS Press 4. dAquin, M.: Putting Linked Data to Use in a Large Higher-Education Organisa- tion. In: ILD@ ESWC. pp. 9–21. Citeseer (2012) 5. dAquin, M., Dietze, S.: Open education: A growing, high impact area for linked open data. ERCIM News,(96) (2014) 6. Keßler, C., Kauppinen, T.: Linked open data university of münster–infrastructure and applications. In: Extended Semantic Web Conference. pp. 447–451. Springer (2012) 7. Khan, H.: Towards more intelligent SPARQL querying interfaces. In: International Semantic Web Conference (2019) 8. Ma, Y., Xu, B., Bai, Y., Li, Z.: Building linked open university data: Tsinghua university open data as a showcase. In: Joint International Semantic Technology Conference. pp. 385–393. Springer (2011) 9. Meymandpour, R., Davis, J.G.: Ranking universities using linked open data. In: LDOW (2013) 10. Nahhas, S., Bamasag, O., Khemakhem, M., Bajnaid, N.: Added values of linked data in education: A survey and roadmap. Computers 7(3), 45 (2018), publisher: Multidisciplinary Digital Publishing Institute 11. Sun, Z., Vashishth, S., Sanyal, S., Talukdar, P., Yang, Y.: A re-evaluation of knowl- edge graph completion methods. arXiv preprint arXiv:1911.03903 (2019) 12. Szász, B., Fleiner, R., Micsik, A.: A case study on Linked Data for University Courses. In: OTM Confederated International Conferences” On the Move to Mean- ingful Internet Systems”. pp. 265–276. Springer (2016) 13. Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., Colpaert, P.: Triple Pattern Fragments: a low- cost knowledge graph interface for the Web. Journal of Web Semantics 37, 184–206 (2016), publisher: Elsevier 14. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engi- neering 29(12), 2724–2743 (2017), publisher: IEEE 56