Exploiting Latent Information in Databases via Database Embedding: technology, applications, ethics (Invited Talk)

Oded Shmueli
Technion – Israel Institute of Technology
Haifa, Israel
oshmu@cs.technion.ac.il

We are witnessing the emergence of AI-powered database systems, which embed AI ideas and techniques in query processors, concurrency controllers, and more. We aim at improving relational querying, as well as other functionalities, by introducing another layer of data, word vectors, into traditional database systems. Word vectors originate in Natural Language Processing (NLP), where they are used to represent words in a language. In NLP, there are a number of methods for obtaining word vectors from text; we use a variation of one of these methods, word2vec.

The idea in a nutshell is as follows: we produce text from a relation (or a view thereof) and then use this text to generate a model, i.e., a set of vectors, for all terms in the database. Once the model is available, we can formulate Cognitive Intelligence (CI) queries. These queries may be realized by SQL queries, enhanced by User Defined Functions (UDFs) that take advantage of the model to formulate conditions that were previously not practically expressible in SQL. The process of vector construction differs from that in NLP. It reflects the characteristics of relations, with integrity constraints and named columns that contain various data types: strings, dates, numeric values, images, and more. We call this process db2vec.

Limiting information disclosure is also an important consideration. Especially within an organization, there is a need to share information. However, it would be desirable that this sharing enable productive work while hiding information that is not essential for that work. To this end, we introduce degrees of disclosure. Here, some information in the database is encrypted, some is simply not supplied, while additional information is intentionally supplied in the form of a model.
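As a rough illustration of the textify-then-query idea described above, the following minimal sketch turns the rows of a toy relation into token "sentences", builds a vector per term, and exposes a similarity function to SQL as a UDF. It is not the db2vec procedure from the talk: plain row co-occurrence counts stand in for word2vec training, and the relation, the column names, the `textify`, `build_model`, and `similarity` helpers, and the 0.9 threshold are all hypothetical choices made for this example.

```python
import math
import sqlite3
from collections import Counter, defaultdict

# Hypothetical toy relation, invented for this sketch.
COLUMNS = ("customer", "item", "category")
ROWS = [
    ("alice", "laptop", "electronics"),
    ("bob", "laptop", "electronics"),
    ("carol", "novel", "books"),
    ("dave", "novel", "books"),
    ("erin", "tablet", "electronics"),
]


def textify(rows, columns):
    """Turn each row into a 'sentence' of column-qualified tokens."""
    return [[f"{col}:{val}" for col, val in zip(columns, row)] for row in rows]


def build_model(sentences):
    """Stand-in for word2vec (an assumption of this sketch): a token's
    vector is the count of every other token sharing a row with it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence:
            for other in sentence:
                if other != token:
                    model[token][other] += 1
    return model


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


model = build_model(textify(ROWS, COLUMNS))


def similarity(a, b):
    """UDF body: similarity of two database terms under the model."""
    return cosine(model.get(a, Counter()), model.get(b, Counter()))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, category TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", ROWS)
conn.create_function("similarity", 2, similarity)  # register the UDF

# A CI-style query: customers whose vector is close to alice's.
similar_to_alice = conn.execute(
    "SELECT DISTINCT customer FROM sales "
    "WHERE customer <> 'alice' "
    "AND similarity('customer:' || customer, 'customer:alice') > 0.9 "
    "ORDER BY customer"
).fetchall()
print(similar_to_alice)
```

The condition in the `WHERE` clause depends on the model rather than on stored values alone, which is the kind of predicate the UDF layer is meant to make expressible. A real deployment would presumably train vectors with a word2vec variant and handle non-string column types, neither of which this sketch attempts.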
There are a number of options for model generation: basing the model on the textification of a single relation or of multiple relations, incorporating external text sources (e.g., Wikipedia), and incorporating externally produced models, either standalone or as building material for constructing a local model.

There are many application areas that may benefit from our approach: Commerce, Finance, HR, Science, and more. Whereas there are generic UDFs, some application areas require developing specialized UDFs. One example is a food database application in which a record has a list of ingredients, in decreasing order of importance.

A model reflects the textual and vector sources used to produce it. As decisions may be based on queries using the model, the production of models brings to the forefront issues of fairness and ethics. An important issue is the specifics of the data and text sources, their weighting in producing the model, and whether they are biased in some way.

© 2020 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference, March 30-April 2, 2020 on CEUR-WS.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC BY 4.0.