Capturing meaning: Toward an abstract Wikipedia

Denny Vrandečić
Google
vrandecic@google.com

Abstract. Semantic Web languages allow us to express ontologies and knowledge bases in a way meant to be particularly amenable to the Web. Ontologies formalize the shared understanding of a domain. But the most expressive and widespread languages that we know of are human natural languages, and the largest knowledge base we have is the wealth of text written in human languages. This paper looks for a path to bridge the gap between knowledge representation languages such as OWL and human natural languages such as English. We propose a project that simultaneously exposes that gap, allows collaboration on closing it, makes progress widely visible, and is highly attractive and valuable in its own right: a Wikipedia written in an abstract language to be rendered into any natural language on request. This would make current Wikipedia editors about 100x more productive, and increase the content of Wikipedia by 10x. For billions of users this will unlock knowledge they currently do not have access to.

Keywords: Semantics · Multilingual · Abstract language.

1 Wikipedia today

Wikipedia is one of the most important sources of knowledge today. Following its vision to allow everyone to share in the sum of all knowledge, it aims to create comprehensive and current encyclopedias anyone can read and contribute to. In order to read Wikipedia, it has to be available in the reader’s language.

Wikipedia is available in 300 languages. But content is unevenly distributed. First, English has 5.6 million articles and 60 languages have over 100,000 articles – but many languages are tiny: Zulu has 901 articles, Samoan 797, and Cree 131. Second, topic coverage has a surprisingly low overlap. English and German are the two most active Wikipedias. English has 5.6 million articles, German 2.1 million – but only 1.1 million of the German topics are also available in English. Only 100,000 topics are common to the top ten most active Wikipedias. Third, the comprehensiveness of individual articles differs between languages: whereas the English Wikipedia has a single line on the Port of Călăraşi, the Romanian Wikipedia offers several paragraphs, including a picture. “Local” Wikipedias often have information that is missing from others.

There are currently Wikipedia articles on 17.9 million topics. Having them available in all 300 languages of Wikipedia would lead to more than 5.3 billion articles. This dwarfs the 50 million articles that actually exist – less than one percent of that goal. And this does not take into consideration the effort required to maintain these articles: there are about 69,000 active contributors. 31,000 are active on the English Wikipedia, followed by German with 5,500. Only eleven Wikipedias have a thousand or more active editors. More than half of the Wikipedias have fewer than ten active editors. Completing and maintaining an encyclopedia with only ten volunteers is ambitious.

The underlying issue is that the current size of the problem is the number of topics multiplied by the number of languages. In the following we suggest a solution that reduces it so that the size of the problem is essentially the number of topics added to the number of languages.
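Using the figures above, the two problem sizes compare roughly as follows (an illustrative back-of-the-envelope calculation only):

    # Rough scale of the problem, using the figures cited in this section.
    topics, languages = 17_900_000, 300

    articles_to_write = topics * languages   # every topic in every language: ~5.37 billion articles
    artifacts_to_write = topics + languages  # each topic described once, plus one rendering component per language

    print(f"{articles_to_write:,} vs. {artifacts_to_write:,}")
    # -> 5,370,000,000 vs. 17,900,300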
2 Recognizing the formal language gap

Take the first two paragraphs of a Wikipedia article, extract the content into an RDF graph, randomize the order of the triples, generate a natural language text from the graph, and compare the result with the original text.1 This round-tripping will expose the gap between natural language and RDF.

Let us take a sentence from the Wikipedia article about the Rosetta Stone:2

“It was the first Ancient Egyptian bilingual text recovered in modern times, and it aroused widespread public interest with its potential to decipher this previously untranslated hieroglyphic language.”

Just to note one issue (among many): the sentence states that Ancient Egyptian is hieroglyphic. This is not knowledge about the Rosetta Stone, but about Ancient Egyptian, and it has been included in the article on the Rosetta Stone deliberately to emphasize the challenge of translating the language.

We think that a naïve RDF representation of the declarative knowledge expressed in the text is not the best choice to meet the challenge posed here, but that we need an alternative representation that is capable of capturing the narrative, redundancy, and organization of the text more closely.

1 The people extracting the RDF and generating the text should ideally be different.
2 https://en.wikipedia.org/wiki/Rosetta_Stone

3 An abstract Wikipedia

The question of which data to assemble in which order from a large declarative knowledge base is a hard one. We avoid it by leaving this task to the contributors. We sketch a solution consisting of three main components (see Fig. 1):

Content. The largest component is the Content: each individual article is represented by knowledge that captures the content of an article in an abstract, language-independent way. This captures paragraphs and sentences with their content, redundancies in the text, relevant knowledge about related topics, etc.

Fig. 1. The three components of the abstract Wikipedia are Renderers (one per language, e.g. English, German, Chinese), Content (one per article, e.g. San Francisco, Oxygen, Marie Curie, Music), and Constructors (a single set). Additionally, Wikidata is used as background knowledge, providing lexical and ontological knowledge.

Constructors. The smallest component: this defines the ‘language’ used for the Content. If the Content is a series of function calls, these are the function definitions. If the Content consists of frame instantiations, these define the frames and slots. If the Content is akin to the ABox, then the Constructors are the TBox.

Renderers. The Renderers translate Content to natural language. They define for each Constructor how to represent it in natural language, using Wikidata as ontological and lexicographical background knowledge. The hope is that a Renderer requires only a few contributors – maybe ten or fewer – in order to achieve interesting coverage. Then even a small group would be capable of creating a comprehensive and up-to-date encyclopedia in their language.

This glosses over many issues such as agreement, saliency, how Renderers interact over Constructors in order to achieve high readability and proper anaphora creation, or different languages requiring different knowledge in the Content.3

3 An example: in English the sentence “She is an actress.” requires specifying the gender, whereas in Turkish it is unnecessary as there is no grammatical gender: “O bir oyuncu.” Knowledge in the Content may be marked as core or supplemental.
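How the Content and the Constructors are concretely realized is deliberately left open. As a purely illustrative sketch – with hypothetical names throughout, assuming the function-call framing mentioned above – the following shows one Constructor instantiation and an English Renderer for it, including how an optional slot could be dropped when a language lacks the resources to express it:

    # Hypothetical sketch only; none of these names come from the paper.
    # Content: one instantiation of an invented Constructor 'first_of_its_kind',
    # covering part of the Rosetta Stone sentence quoted in Section 2. Slot values
    # here stand in for references to Wikidata items or further Constructor calls.
    rosetta_stone_content = {
        "constructor": "first_of_its_kind",
        "subject": "item:RosettaStone",           # placeholder, not a real Wikidata identifier
        "kind": "item:BilingualText",
        "language_of_text": "item:AncientEgyptian",
        "time_qualifier": "item:ModernRecovery",
        "optional": ["time_qualifier"],           # may be dropped (graceful degradation)
    }

    def render_english(content: dict, lexicon: dict) -> str:
        """A hypothetical English Renderer for 'first_of_its_kind': core slots must be
        lexicalized, optional slots are silently dropped if the lexicon lacks them."""
        subject = lexicon[content["subject"]]             # core slot: a missing label is an error
        language = lexicon[content["language_of_text"]]   # core slot
        kind = lexicon[content["kind"]]                   # core slot
        sentence = f"{subject} was the first {language} {kind}"
        qualifier = lexicon.get(content.get("time_qualifier"))
        if qualifier:                                     # optional slot: include only if renderable
            sentence += f" {qualifier}"
        return sentence + "."

    english_lexicon = {
        "item:RosettaStone": "The Rosetta Stone",
        "item:AncientEgyptian": "Ancient Egyptian",
        "item:BilingualText": "bilingual text",
        "item:ModernRecovery": "recovered in modern times",
    }
    print(render_english(rosetta_stone_content, english_lexicon))
    # -> The Rosetta Stone was the first Ancient Egyptian bilingual text recovered in modern times.

A German or Chinese Renderer would implement the same Constructor with its own word order, agreement rules, and lexicon, while the Content stays unchanged; a lexicon without an entry for the optional slot would simply yield the shorter sentence.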
4 Desiderata

The main feature of Wikipedia is that anyone can contribute to it. This must remain true of the abstract Wikipedia and its components.

Content must be easy to create, refine, and change. It will constitute the largest part, followed by lexical knowledge and, far behind, the Renderers and Constructors. If trade-offs between the components are needed, the number of contributors each component requires should be taken into account.

The set of Constructors has to be under the control of, and be extensible by, the community. The Constructors and their individual slots, whether these slots are required or optional, etc., have to be editable and maintainable. The system must be able to deal with the Constructors evolving.

The Renderers can only scale to hundreds of languages if they are created by the community. But not everyone will need to be able to write Renderers: it could be a task which can only be done by contributors who dedicate the necessary time. Such a separation of concerns exists today already: contributors have diverse skillsets and time commitments, from template developers and bot operators to casual contributors using the WYSIWYG interface.

Lexical knowledge must be easy to contribute. The Renderers will require large amounts of lexical knowledge. Fortunately, Wikidata has recently launched support for capturing and maintaining lexical knowledge.4

4 see https://www.wikidata.org/wiki/Wikidata:Lexicographical_data

The system must support graceful degradation. Each language will grow at its own speed: some languages will have complete Renderers, others will be stale and incomplete. It is important that a missing lexicalization does not block a whole article. A sentence that renders in English as “In 2013, the secretary of state was the first foreign official to visit the country since the revolution.” could degrade to “The minister visited the country in 2013.” Parts of the Content could be marked as optional, or as requiring other parts of the Content to be rendered, and could thus be dropped in case the language resources prove insufficient.

5 Unique advantages

There are several unique advantages that make the project more feasible:

– we only aim at a single genre of text, encyclopedias – not poems, fiction, etc.
– the exact text is less important as long as it contains the necessary content.
– where we rely on the exact text, we can have ways to quote verbatim.
– we can start with simple sentences and iterate. We won’t achieve the expressivity of natural language, and yet may achieve very valuable goals.
– we do not need to understand and parse natural language, merely generate it, which is widely considered a much simpler task.
– the baseline is low. The first priority is languages that currently have a small number of articles, many of which are out of date and incomplete.

6 Challenge

The abstract Wikipedia is a two-pronged approach: we want to push for new goals in knowledge representation, and we want to create an extremely valuable resource. We present a clearly defined, visible, and attractive, yet achievable challenge. It allows us to collaborate and iterate on a system that captures more knowledge than current KR systems, and to watch the system evolve and grow.

This paper offers no concrete solution. It introduces the challenge of an abstract Wikipedia with three questions: What knowledge representation is suitable? How are individual articles represented? How do the renderers work?
This is a call for developers and the research communities to answer the challenge of the abstract Wikipedia and demonstrate how the current state of the art in natural language generation, knowledge representation, and collaborative systems can work together to create a novel system that will enable everyone to share in the sum of all knowledge.