Hybrid techniques for knowledge-based NLP
Knowledge graphs meet machine learning and all their friends

Jose Manuel Gomez-Perez, Expert System, Madrid, Spain, jmgomez@expertsystem.com
Ronald Denaux, Expert System, Madrid, Spain, rdenaux@expertsystem.com
Daniel Vila, Recogn AI, Madrid, Spain, daniel@recogn.ai
Carlos Badenes, Universidad Politecnica de Madrid, Madrid, Spain, cbadenes@fi.upm.es

K-CAP2017 Workshops and Tutorials Proceedings, 2017
©2017 Copyright held by the owner/author(s).

ABSTRACT
Many different artificial intelligence techniques can be used to explore and exploit the large document corpora that are available inside organizations and on the Web. While natural language is symbolic in nature and the first approaches in the field were based on symbolic and rule-based methods, like ontologies, semantic networks and knowledge bases, the most widely used methods are currently based on statistical approaches, including linear methods, such as support vector machines or probabilistic topic models, and non-linear ones such as neural networks. Each of these two main schools of thought in natural language processing, knowledge-based and statistical, has its limitations and strengths, and there is an increasing trend that seeks to combine them in complementary ways to get the best of both worlds. This tutorial will cover the foundations and modern practical applications of knowledge-based and statistical methods, techniques and models and their combination for exploiting large document corpora. The tutorial will first focus on the foundations of many of the techniques that can be used for this purpose, including knowledge graphs, word embeddings, neural network methods, and probabilistic topic models, and will then show how these techniques are being effectively combined in practical applications, including commercial projects where the instructors currently participate.

KEYWORDS
Knowledge graphs, Hybrid natural language processing, embeddings, vecsigrafo, topic models

1 MOTIVATION
For several decades, semantic systems were predominantly developed around knowledge graphs at different degrees of expressivity. Through the explicit representation of knowledge in well-formed, logically sound ways, knowledge graphs provide knowledge-based text analytics with rich, expressive and actionable descriptions of the domain of interest and support logical explanations of reasoning outcomes. On the downside, knowledge graphs can be costly to produce since they require a considerable amount of human effort to manually encode knowledge in the required formats. Additionally, such knowledge representations can sometimes be excessively rigid and brittle in the face of different natural language processing applications, such as question answering.

In parallel, the last decade has witnessed a shift towards statistical methods for text understanding due to the increasing availability of raw data and cheaper computing power. Such methods have proved to be powerful and convenient in many linguistic tasks. In particular, recent results in the field of distributional semantics have shown promising ways to learn language models from text, encoding the meaning of each word in the corpus as a vector in dense, low-dimensional spaces. Among their applications, word embeddings have proved to be useful in term similarity, analogy and relatedness, as well as in many downstream tasks in natural language processing.

Aimed towards Semantic Web researchers and practitioners, this tutorial elaborates on the idea introduced in [1] and shows how it is possible to bridge the gap between knowledge-based and statistical approaches to further knowledge-based natural language processing. Following a practical and hands-on approach, the tutorial tries to address a number of fundamental questions to achieve this goal, including: How can Machine Learning techniques be used to complement the knowledge already captured explicitly in knowledge graphs, extending and curating them in cost-efficient and practical ways? What are the main building blocks and techniques enabling such a hybrid approach to natural language processing? How can structured and statistical knowledge representations be seamlessly integrated? How can the quality of the resulting hybrid representations be inspected and evaluated? And how can this improve the overall quality and coverage of our knowledge graphs?

2 DESCRIPTION OF THE TUTORIAL
This half-day tutorial provides plenty of practical content, real-life examples and applications, and exercises. We offer an interactive session where both instructors and participants can engage in rich discussions on the topic. The agenda addresses the following points.
• Probabilistic topic models and topic-based semantic similarity.
• Creating a language model through word embeddings.
• Extending word embeddings with structured knowledge.
• Creating knowledge graph embeddings.
• Building a vecsigrafo - bringing knowledge from text into knowledge graphs.
• Evaluating vecsigrafos beyond visual inspection and intrinsic methods.
• Applications in cross-lingual natural language processing.
• Putting it all together in a real-life system.
• Beyond text understanding: Cross-modal extensions.

3 MATERIALS
The tutorial follows a highly practical approach. The teaching material consists fundamentally of Jupyter notebooks that participants can install locally through Docker images with all the necessary software to run the examples and exercises on their own machines. The materials of the K-CAP 2017 tutorial can be found on GitLab.¹

4 AUDIENCE
This tutorial seeks to be of special value for members of the Semantic Web community, although it is also useful for related communities, e.g. Machine Learning and Computational Linguistics. We welcome researchers and practitioners from both industry and academia, as well as other participants with an interest in hybrid approaches to knowledge-based natural language processing.

5 PRESENTERS
The tutorial is offered by the following instructors.

Jose Manuel Gomez-Perez works at the intersection of several areas of Artificial Intelligence, including Natural Language Processing, Knowledge Discovery, Representation and Reasoning. His long-term vision is to enable machines to understand text in a way similar to how humans read, bridging the gap between both through semantically rich knowledge representations and user interfaces. At Expert System, Jose Manuel leads the Research Lab in Madrid, where he focuses on the combination of structured knowledge graphs and probabilistic methods. Before Expert System, he worked at iSOCO, one of the first European companies to deliver semantic and natural language processing solutions on the Web. He consults for companies like Coca-Cola or ING. Also active as an entrepreneur, he co-founded a startup and advised another. An ACM member and Marie Curie fellow, Jose Manuel holds a Ph.D. in Computer Science and AI from UPM and regularly publishes in top scientific conferences and journals. His views on AI and applications have appeared in magazines like Nature and Scientific American. In 2015, he was the program chair of the International Conference on Knowledge Capture (K-CAP).

Ronald Denaux is a senior researcher at Expert System. Ronald obtained his MSc in Computer Science from the Technical University Eindhoven, The Netherlands. After a couple of years working in industry as a software developer for a large IT company in The Netherlands, Ronald decided to go back to academia. He obtained a PhD, again in Computer Science, from the University of Leeds, UK. Ronald's research interests have revolved around making semantic web technologies more usable for end users, which has required research into (and resulted in various research publications in) the areas of Ontology Authoring and Reasoning, Natural Language Interfaces, Dialogue Systems, Intelligent User Interfaces and User Modelling. Besides research, Ronald also participates in knowledge transfer and product development.

Daniel Vila is co-founder of recogn.ai, a Madrid-based startup and spin-off from UPM, building next generation solutions for text analytics and content management using AI methods. Daniel holds a PhD in Artificial Intelligence from Universidad Politécnica de Madrid (2016), where he worked at the Ontology Engineering Group and developed the solution supporting a large knowledge graph combining NLP and semantic technologies: the datos.bne.es data service from the National Library of Spain.

Carlos Badenes: After more than 8 years working in the M2M world, Carlos began researching text mining within the context of the Semantic Web. Since then, he has moved more deeply into the study of topic modeling techniques to analyze large collections of documents, incorporating semantic resources and working on multilingual domains. He currently works as an associate researcher at the Ontology Engineering Group while doing a PhD at UPM.

Oscar Corcho: Oscar Corcho is Full Professor at Departamento de Inteligencia Artificial, UPM, and belongs to the Ontology Engineering Group. His research is focused on Semantic e-Science and Real World Internet, although he also works in more general areas of Semantic Web and Ontological Engineering. He has participated in numerous EU and Spanish R&D projects as well as privately-funded projects like ICPS (International Classification of Patient Safety), funded by the World Health Organisation, and HALO, funded by Vulcan Inc. Previously, he worked as a Marie Curie research fellow at the University of Manchester, and was a research manager at iSOCO. He holds a PhD in Computer Science and AI from UPM. He was awarded the Third National Award by the Spanish Ministry of Education in 2001. He has published several books, among which "Ontological Engineering" can be highlighted, as it is used as a reference book in a good number of university lectures worldwide, and more than 100 papers in journals, conferences and workshops. He usually participates in the organization or in the program committees of relevant international conferences and workshops.

ACKNOWLEDGMENTS
Partially funded by the EU H2020 project DANTE (700367) and the national Spanish project GRESLADIX (20160805).

REFERENCES
[1] Ronald Denaux and Jose Manuel Gomez-Perez. 2017. Towards a vecsigrafo: Portable semantics in knowledge-based text analytics. In Proceedings of the 2017 workshop on Hybrid Statistical Semantic Understanding and Emerging Semantic (HSSUES '17). Held in conjunction with the 16th Intl. Semantic Web Conference, CEUR Workshop Proceedings.

¹ https://gitlab.com/rdenaux/kcap17-tutorial