Discovering Educational Resources on the Web for Technology Enhanced Learning Applications Matteo Lombardi School of Information and Communication Technology, Griffith University, 170 Kessels Road, Nathan, QLD, 4111 Australia matteo.lombardi@griffithuni.edu.au Abstract. There are many works in TEL literature aiming to support learners and educators in their educational tasks. A current trend in this field is to develop ex- pert systems and agents able to retrieve the most appropriate educational material. Such systems are great when handling pools of resources specifically designed for education, where the material is associated to educational data about the charac- teristics of the resource itself and its context of usage. Instead, many Web re- sources potentially suitable for education are not considered in their reasoning, because of the lack of such educational information. This PhD project aims to propose a solution for exploiting the Internet as an enormous repository of edu- cational material, through an efficient discovering of Web resources that are suit- able for education. Then, educational features of the resources shall be deduced and associated to them, building an overall concept map depicting the knowledge about educational material available on the Web. Current and novel TEL tech- nologies are expected to benefit from the proposed research that will provide to them i) a novel huge amount of educational resources, and ii) a reusable repre- sentation of the educational knowledge coming from educators and institutions from all over the world. Keywords: Web Crawling, Technology Enhanced Learning, Educational Resources, Concept Map 1 Introduction and motivations The Web is widely recognized as a source of information for many different applica- tions, including educational ones. There are many websites and online platforms which offer educational material covering a huge range of topics. In this scenario, learners and educators have the chance to access a large number of Web resources anywhere and anytime, with many benefits for them. Unsurprisingly, the Internet is now one of the most popular sources of information for the retrieval of teaching resources [16]. Another benefit for learners and educators is the increasing number of online systems designed for sharing educational resources. A current trend in education is offering Massive Open Online Courses (MOOC), namely online courses open to users from all over the world. Coursera1 , developed by Stanford University, is one of the most popular 1 https://www.coursera.org/ 17 MOOC platforms [19]. At the moment, Coursera hosts more than 2,000 courses from around 140 universities. The Internet is also plenty of platforms for sharing resources. Among the most popular there are SlideShare2 , YouTube3 , and Wikipedia4 . The increasing trend of sharing educational resources on the Web has attracted sev- eral contributions from the research community. In fact, the research community in Technology Enhanced Learning (TEL) has produced contributions about the use of technology for the improvement of both learning and teaching processes [16]. Many of those works are proposals on the recommendation of educational resources [16, 27, 7, 23]. Since a substantial part of TEL contributions are towards the recommendation of Learning Objects (LO) [28] hosted on online Learning Object Repositories (LOR), it is clear that experts in TEL consider the Internet as a source of educational mate- rial. Therefore, it could be interesting to use the entire Web as a repository of teaching material. However, the retrieval of Web resources presents some critical issues, mostly due to the fact that the Internet is an enormous and unorganized space. Search engines like Google are able to extract thousands of Web pages exploiting a user query, but it is not possible to filter the results according to a desired purpose (e.g. extract only material suitable for education). Even the most acknowledged proposals for crawling the Web do not assure to retrieve Web pages appropriate for specific purposes such as teaching. Indeed, approaches like the focused crawling [8] and the semantic crawling [17] are remarkable in extracting Web resources according to a set of topics or terms of interests, but the purpose of the resources is not taken into account. There are several contributions that aim to structure the content of the Web, in par- ticular for TEL applications [1, 21, 31]. Regarding to that, one of the most popular ap- proach is to provide additional information to Web resources using metadata for LO. A number of standards have been proposed for annotating LO metadata, such as the IEEE Learning Object Metadata (LOM)5 , Dublin Core (DC)6 , and ADL SCORM7 . Nevertheless, annotating metadata is an additional task that has to be performed mostly manually by LO authors. Such extra effort required is one of the drawbacks of com- posing, sharing and reuse LOs [26]. Recently, Linked Data (LD) [13, 4] is emerged has the new standard for contextualizing learning material [30, 32, 20, 10, 9], with the aim to improve [14] the previously discussed LO. However, other LO issues such as the diversity of metadata standards and their lack in representing some educational aspects are still present also in LD. The first object of this doctoral project is to design and propose a new approach for Web crawling tailored to the educational field, able to fulfil the aforementioned gap performing an accurate extraction of resources suitable for teaching and learning, without any restriction on topics or terms. Then, the sharing and reuse of educational Web resources is expected to be promoted proposing an automatic extraction of edu- 2 http://www.slideshare.net/ 3 http://www.youtube.com/ 4 http://www.wikipedia.org/ 5 IEEE 1484.12.1-2002, IEEE standard for Learning Object Metadata 6 http://dublincore.org/documents/dces/ 7 http://www.adlnet.gov/scorm/scorm-2004-4th/ 18 cational features, in order to represent two groups of data: the characteristics proper of the resources, and the context where the resource has been used (including other related resources). Also in [12], the educational features of a resource are divided in the same two groups. Once the resources are equipped with contextual data, this PhD is expected to be concluded proposing a representation of the overall knowledge about educational Web resources in a Concept Map [6], a popular solution in TEL literature [22, 16]. Those findings are expected to be beneficial to researchers in Technology Enhanced Learning and Information Retrieval, towards a more effective support to students and educators in their educational tasks. 2 Related works The first step of this research regards the identification and extraction of Web resources that are potentially useful for educational usages. Some important contributions in lit- erature presented in this section are about the definition of educational characteristics of resources. Learning Object (LO) metadata standards are widely recognized by the research community as correct ways for representing educational information about a resource. Therefore, this study exploits the popular LO metadata standards IEEE Learn- ing Object Metadata (LOM) and Dublin Core (DC), together with related works with the goal of discovering which educational characteristics are important when describing an educational resource. Many works in TEL literature are focused on the feature selection and extraction processes. An interesting contribution in this scope [21] is the proposal of an auto- matic building of LOs using unstructured Web resources manually filtered by humans. In that work, the author selected a subset of LOM features that are important to be deduced for describing the educational characteristics of the resource itself. Another related work is the proposal of a framework for crawling Web resources and extracting their educational metadata [1]. In that study, focused crawling is used for restricting the mining to a domain deduced from a query given in input, and then metadata extraction is performed. The focused crawling approach can be used only when there are topics in input, so it cannot be applied to my research. About the extraction of educational metadata from Web resources, the authors suggest to represent the resources as vectors of terms, following a Vector Space Modeling (VSM) representation. Each term is then weighted according to its significance for the topic, so that the similarity among Web pages can be computed. A framework for analyzing textual resources gathered from the English version of Wikipedia is presented in [31]. In such work, the most impor- tant pieces of information are i) the article name, and ii) the text of the token used by Wikipedia (namely, the exact text used in the article for referring to another page). A number of interesting approaches for developing Web crawling algorithms have been presented by the research community. Among them, the focused crawling ap- proach [8] is defined as a selective seeking of Web pages that are relevant to a pre- defined set of topics, deduced from the analysis of documents [8] or Web pages [2] selected by the user, or from a set of terms [3]. Another suggestion is to estimate the relevance of a Web page before to visit it [25]. An interesting refinement to this ap- 19 proach is the Self-Adaptive Semantic Focused (SASF) crawler [15], where a module for learning patterns of pages of interest is involved for improving the filtering of novel Web pages. Although this approach is promising, SASF does not show a substantial im- provement when compared to other Web crawlers. Another popular crawling approach is the semantic crawling [17], that aims to discover Web pages exploiting an ontology of terms connected by semantic relationships. Such ontology represents the knowledge of interest and it can be defined directly by users or using textual documents. A recent contribution applies such approach for discovering Web resources about environment and forecasting [29], using topic directories such as the Open Directory Project8 for retrieving a set of words of interest for the specific domain. Although Web crawling is a very popular topic, specially among researchers in In- formation Retrieval, there is still a gap in current approaches because they are tailored on topics, terms and domains, but not on the context of usage of Web resources. There- fore, the new crawling approach proposed in this work is expected to fulfil the current gap in the state-of-the-art and to unveil currently unclassified Web resources for educa- tion, overcoming the limit of topic and domain specificity. The Web itself contains many known sources of educational material with semantic information. For instance, there are repositories of LOs such as MERLOT9 [5], Con- nexions10 and ARIADNE11 that are very popular among TEL users and researchers. Several proposals for improving the retrieval of LOs [23] and for comparing the perfor- mance of systems based on them [24] have been presented. Open Educational Resources (OER) are another example of resources enriched with semantic data, in this case LD. An example of an institution that uses OER is presented in [10]. In that paper, the au- thor depicts the Open University’s Linked Data platform12 , an open-access system that aims to expose the public information of such university through LD. Among other information, learning materials described as OER are shared. This is now very com- mon among universities and institutions, and there are even common platforms where OER are made publicly available13 . LOs and OER sources are expected to contain an important number of educational resources. Hence, they will be explored during the crawling process, exploiting the associated metadata and LD for the successive feature extraction. 3 Research questions This doctoral project is articulated in three main steps, each of them with a research question to be fulfilled. The first one is the following: RQ1: What are the features that describe an educational resource? 8 http://www.dmoz.org/ 9 http://www.merlot.org/ 10 http://cnx.org/ 11 http://www.ariadne-eu.org/ 12 http://data.open.ac.uk 13 https://www.oercommons.org/ 20 The review of popular metadata standards for annotating LO and the literature coming from the TEL research community has been fundamental for defining which features a resource suitable for education should have. Some of them like title and URL can be extracted from any Web resource, whether or not appropriate for education. Resources coming from MOOCs are among the only ones for which their educational features can be easily deduced, because such platforms present a standard structure that facilitates the feature extraction. Other sources of information that offer contextual data of re- sources are the repositories of LOs and OER already mentioned in the previous section of this contribution. So far, this research has exploited metadata standards, linked data dictionaries and the works presented in the literature review for proposing a comprehen- sive set of features, which are the most popular features used by educational platforms for describing their resources. Such set of features is the answer to this research ques- tion. Unfortunately, the majority of the remaining Web sites on the Internet are not structured for offering educational data. Then, deducing features very specific for edu- cation (like prerequisites, educational level and difficulty) from resources coming from other platforms than MOOCs, LOs and OER repositories is a foreseen challenge for the research activity. Nevertheless, my proposal aims to determine if a resource coming from the Web is appropriate to be used as educational resource. Therefore, the second main research question to be addresses during the PhD program is: RQ2: How to perform an efficient extraction of Web resources suitable for education? For addressing the previously mentioned gap in literature about Web crawling, this re- search proposes to exploit existent MOOCs, LORs and OER platforms as sources of i) a huge amount of resources, and ii) precious teaching knowledge. In this scope, the PhD includes the development of a classifier able to compare resources coming from the Web with others hosted on the aforementioned educational platforms, namely trustwor- thy educational material. The proposed system shall classify a resource coming from the Web either as suitable for education or as not appropriate for education, consider- ing its similarity to other resources already used in an educational context (i.e., already classified as suitable for education). At this stage, only MOOCs, LOs an OER reposi- tories are considered because they host resources developed by human instructors and used in real courses. The choice of considering only such platforms in this phase allows to build a highly reliable set of educational resources in a short time. Up until now, this research has extracted more than 20,000 resources from Coursera creating a dataset of educational resources called DAJEE (DAtaset of Joint Educational Entities) [18], all of them described by the list of features discovered during the first step of this research. Ta- ble 1 shows the entities currently included in DAJEE. It is necessary to specify that the crawling of Coursera has been performed in August 2015, when preview of the course content was still possible. Moreover, the data has been extracted without infringing the copyright of the content, fully respecting the terms and conditions of Coursera. Once the classifier is enough accurate in recognizing educational Web resources, an even bigger number of educational resources coming from other Web sites like Wikipedia, SlideShare and YouTube are expected to be included in the DAJEE dataset. Then, such new entries will be involved in the classification process as trusted edu- cational resources. In this way, the system is expected to learn many other structures 21 Table 1. Summary of educational entities of Coursera included in DAJEE. Entity Number of crawled instances University 99 Instructor 484 Course 407 Lesson 2,365 Concept 8,716 Resource 22,663 Transcript 14,327 (video resources only) of educational resources and to improve the discovery of Web resources suitable for teaching and learning. For concluding the PhD project, the third and last main research question to be satisfied is: RQ3: How to represent the discovered knowledge in a data structure compatible with current technologies in TEL? As anticipated in the previous section, building a Concept Map is one of the most popu- lar ways for representing knowledge in education. Many proposals in TEL literature are about expert systems and agents using concept maps for their reasoning [23, 16]. Thus, the proposal of an overall concept map of the Web resources discovered as appropriate to be used in education is expected to be beneficial for already existent and novel re- searches. However, a concept map is useful when the elements are connected by some kind of semantic relationships. One of the last challenges of this research is the defi- nition of educational relationships among the resources, starting from the structure of the MOOCs where resources are delivered, and the semantic information found on LOs and OER repositories [11]. Then, the relationships discovered so far shall be exploited for discovering connections also among novel Web resources classified as educational. 4 Potential applications The overall goal of this PhD project is to propose a solution for i) extracting educational resources from the Web, and ii) representing the discovered knowledge as a concept map where resources are entities connected by educational relationships. Despite the number of interesting proposals in TEL for supporting the retrieval of learning material, at the moment, generic search engines like Google are still preferred by the users when looking for resources that are suitable for their educational tasks [5]. In this scope, the contribution of this doctoral project would be a very important step towards offering better educational applications both to students and to educators. Researchers in TEL are expected to benefit from a crawler tailored to education, which could be the first component of a novel educational-oriented search engine. In order to achieve the overall goal of this research, the definition of the features that characterize educational resources is fundamental. Among many proposals and 22 standards for the representation of educational traits of digital material, this research provides the definition of a comprehensive structure compatible with current systems and useful for future contributions in TEL. Other applications may exploit the overall concept map for helping educators in delivering their courses. Indeed, the educational knowledge contained into it has a huge potential that can support the building of a course from scratch, e.g., suggesting the most popular concepts and resources for teaching a subject. Even the refinement of already existent courses would be easier, exploiting the semantic relationships among resources used by other colleagues in their courses. Novel and up-to-date resources may be added to a current pool after a certain amount of time, or at the beginning of the academic term. In this way, this research shall also foster collaborations among institutions and educators from all over the world. References 1. Atkinson, J., Gonzalez, A., Munoz, M., Astudillo, H.: Web metadata extraction and semantic indexing for learning objects extraction. Applied Intelligence 41(2), 649–664 (2014) 2. Batsakis, S., Petrakis, E.G., Milios, E.: Improving the performance of focused web crawlers. Data & Knowledge Engineering 68(10), 1001–1013 (2009) 3. Bedi, P., Thukral, A., Banati, H.: Focused crawling of tagged web resources using ontology. Computers & Electrical Engineering 39(2), 613–628 (2013) 4. Bizer, C., Heath, T., Berners-Lee, T.: Linked data-the story so far (2009) 5. Brent, I., Gibbs, G.R., Gruszczynska, A.K.: Obstacles to creating and finding open educa- tional resources: the case of research methods in the social sciences. Journal of Interactive Media in Education 2012(1), Art–5 (2012) 6. Cañas, A.J., Novak, J.D.: Concept mapping using cmap tools to enhance meaningful learn- ing. In: Knowledge Cartography, pp. 25–46. Springer (2008) 7. Casali, A., Gerling, V., Deco, C., Bender, C.: A recommender system for learning objects personalized retrieval. Educational Recommender Systems and Technologies: Practices and Challenges, Hershey, PA: Information Science Reference pp. 182–210 (2012) 8. Chakrabarti, S., Van den Berg, M., Dom, B.: Focused crawling: a new approach to topic- specific web resource discovery. Computer Networks 31(11), 1623–1640 (1999) 9. d’Aquin, M.: Linked data for open and distance learning. Commonwealth of Learning Re- ports (2012) 10. d’Aquin, M.: Putting linked data to use in a large higher-education organisation. Interacting with Linked Data (ILD 2012) p. 9 (2012) 11. De Medio, C., Gasparetti, F., Limongelli, C., Lombardi, M., Marani, A., Sciarrone, F., Tem- perini, M.: Discovering prerequisite relationships among learning objects: a coursera-driven approach. In: Proceedings of the 15th International Conference on Web-Based Learning. Springer (2016) 12. De Medio, C., Gasparetti, F., Limongelli, C., Lombardi, M., Marani, A., Sciarrone, F., Tem- perini, M.: Towards a characterization of educational material: an analysis of coursera re- sources. In: Proceedings of the 1st International Symposium on Emerging Technologies for Education. Springer (2016) 13. Dietze, S., Sanchez-Alonso, S., Ebner, H., Qing Yu, H., Giordano, D., Marenzi, I., Pereira Nunes, B.: Interlinking educational resources and the web of data: A survey of chal- lenges and approaches. Program 47(1), 60–91 (2013) 23 14. Dietze, S., Yu, H.Q., Giordano, D., Kaldoudi, E., Dovrolis, N., Taibi, D.: Linked education: interlinking educational resources and the web of data. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing. pp. 366–371. ACM (2012) 15. Dong, H., Hussain, F.K.: Self-adaptive semantic focused crawler for mining services infor- mation discovery. Industrial Informatics, IEEE Transactions on 10(2), 1616–1626 (2014) 16. Drachsler, H., Verbert, K., Santos, O.C., Manouselis, N.: Panorama of recommender systems to support learning. In: Recommender systems handbook, pp. 421–451. Springer (2015) 17. Ehrig, M., Maedche, A.: Ontology-focused crawling of web documents. In: Proceedings of the 2003 ACM symposium on Applied computing. pp. 1174–1178. ACM (2003) 18. Estivill-Castro, V., Limongelli, C., Lombardi, M., Marani, A.: Dajee: A dataset of joint edu- cational entities for information retrieval in technology enhanced learning. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. pp. 681–684. ACM (2016) 19. Kay, J., Reimann, P., Diebold, E., Kummerfeld, B.: Moocs: So many learners, so much po- tential. Technology 52(1), 49–67 (2013) 20. Keßler, C., d’Aquin, M., Dietze, S.: Linked data for science and education. Semantic Web 4(1), 1–2 (2013) 21. Krieger, K.: Creating learning material from web resources. In: The Semantic Web. Latest Advances and New Domains, pp. 721–730. Springer (2015) 22. Limongelli, C., Lombardi, M., Marani, A., Sciarrone, F.: A teacher model to speed up the process of building courses. In: Human-Computer Interaction. Applications and Services, pp. 434–443. Springer (2013) 23. Limongelli, C., Lombardi, M., Marani, A., Sciarrone, F., Temperini, M.: A recommendation module to help teachers build courses through the moodle learning management system. New Review of Hypermedia and Multimedia 22(1–2), 58–82 (2015) 24. Lombardi, M., Marani, A.: A comparative framework to evaluate recommender systems in technology enhanced learning: a case study. In: Advances in Artificial Intelligence and Its Applications, pp. 155–170. Springer (2015) 25. Meusel, R., Mika, P., Blanco, R.: Focused crawling for structured data. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Man- agement. pp. 1039–1048. ACM (2014) 26. Ochoa, X., Duval, E.: Quantitative analysis of learning object repositories. Learning Tech- nologies, IEEE Transactions on 2(3), 226–238 (2009) 27. Sergis, S., Sampson, D.: Learning object recommendations for teachers based on elicited ict competence profiles. Learning Technologies, IEEE Transactions on (2015) 28. Sosteric, M., Hesemeier, S.: When is a learning object not an object: A first step towards a theory of learning objects. The International Review of Research in Open and Distance Learning 3(2) (2002) 29. Tsikrika, T., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I.: Focussed crawling of environ- mental web resources based on the combination of multimedia evidence. Multimedia Tools and Applications pp. 1–25 (2015) 30. Vega-Gorgojo, G., Asensio-Pérez, J.I., Gómez-Sánchez, E., Bote-Lorenzo, M.L., Munoz- Cristobal, J.A., Ruiz-Calleja, A.: A review of linked data proposals in the learning domain. Journal of Universal Computer Science 21(2), 326–364 (2015) 31. Wojtinnek, P.R., Pulman, S., Völker, J.: Building semantic networks from plain text and wikipedia with application to semantic relatedness and noun compound paraphrasing. Inter- national Journal of Semantic Computing 6(01), 67–91 (2012) 32. Zablith, F.: Interconnecting and enriching higher education programs using linked data. In: Proceedings of the 24th International Conference on World Wide Web Companion. pp. 711– 716. International World Wide Web Conferences Steering Committee (2015) 24