The Revision History of Estonian Wordnet Neeme Kahusk and Kadri Vider Institute of Computer Science University of Tartu J. Liivi 2, 50409 Tartu, Estonia Abstract. Estonian Wordnet started in 1998 and since then at least once a year a new version has been released. The paper gives an overview of the main indicators for every year, as well as highlighting some prob- lems that have been solved or as yet have to be solved. 1 Introduction There is an old legend about the city of Tallinn, the capital of Estonia, that is situated on the northern coast of the country between the sea and a lake called Ülemiste. Once upon a time, a legend tells, there lived an old gray man, the gremlin of the lake. He used to ask the guards of the city whether the city was finished. Nobody was allowed to answer “yes, it is finished,” as then the gremlin would have opened the gates of the waters and flushed the city down into the sea. This old legend gives us a hint why creators of large and complicated language resources such as dictionaries, lexicons and thesauri, not to mention wordnets, tend to answer about the state of their product “well, under construction,” instead of “yes, it is completed”. Whatever the reason, the users of such resources are not willing to wait forever for a completed resource. Half a loaf is better than no bread. Thus, there is a pressure from the users for a (at least intermediate) release of the resource. Should a grant be involved, the creators would be under financial pressure too. Here we are going to present the twenty year long history of Estonian Word- net (EstWN), and try to answer the question whether it is ready or not. The history of EstWN started back in 1998 when the Estonian team joined the EuroWordNet-2 project.1 Back in that time, the only available database fol- lowing principles of WordNet structure was Princeton WordNet. The EuroWord- Net project followed the same principles, but additionally had the possibility to link each synset in different languages to a central InterLingualIndex (ILI) that was then based on Princeton Wordnet version 1.5. The EuroWordNet database and software are described in the first place by Dı́ez-Orzas, Forest, and Louw (1996). 1 For overview of EuroWordNet project, see https://www.illc.uva.nl/ EuroWordNet/ In construction of EstWN we used the expand method as a starting point. Base concepts from English were translated into Estonian as a first basis for a monolingual extension. The extensions have been compiled manually from Estonian monolingual dictionaries2 (first of all Explanatory Dictionary of the Estonian Language) and other monolingual resources like terminological collec- tions. In this sense, EstWN applies a hybrid method including both expand and monolingual techniques. EstWN includes nouns, verbs, adjectives and adverbs, and some multiword units among them. The EstWN data is available under a CC-BY-SA license and can be accessed via WordTies3 and TEKsaurus query tool4 . For details, see Pedersen et al. 2013. 2 Tools and Work Flow For EuroWordNet project a new tool was developed. It was made by Novell (member of Groupwise software) and used the Flaim database format. It was a desktop tool: there were neither shared files nor a central server. For exporting and importing data, a plain text format based on GEDCOM was used. This file structure was originally intended for genealogical use (see e.g. Eastman 2014) and dates back to the 1980s, but the basic ideas have lived on in the formats like yaml or pug. The file is structured hierarchically, consisting of level numbers, fields, and field values. A wordnet of a particular language consists of synsets. At every synset, there were descriptions of synset members, internal relations between synsets, equal relations (the links to ILI) between synsets in different languages, and properties or property values. The file syntax used to export a wordnet is described by Louw (1998). As no client-server model was used, each collaborator worked individually and had to export her results via this export text file. The Estonian team did not use any version control software back in these days, and in order to maintain conflict-free development, certain strict rules were applied and some helper scripts were used. Each member was working on a specific topic, and added only synsets and literals strictly connected to the topic that was assigned to him. However, it is not possible to work in complete isolation and there were frequent discussions on some concept or word. It was not always possible to reach common ground, but most of the arguments were settled (Kerner 2008). The individual files from each team member were sent to the project manager, who carefully imported each version individually and checked manually whether there were no overlaps in the vocabulary or synsets. There were always some lost and found synsets, relations or synset members after each manual upgrade. To decrease possibilities for such mistakes, some strict rules and some bash scripts 2 https://www.eki.ee/EN/dictionaries 3 http://wordties.cst.dk/wordties-estwn/ 4 http://www.cl.ut.ee/ressursid/teksaurus/index.php?lang=en were implemented in fresh data collected from team members. It was a laborious effort, and as we see in section 3, some inconsistencies still have slipped in. When a certain stage in development of EstWN was completed, or when it was time to report some results, the individual files were merged, checked for duplicates and inconsistencies, provided with a version number and released. 3 The Curse of Character Encoding There are 27 letters in the Estonian alphabet. Besides the letters found in US- ASCII, there are a, o, u with umlauts (ä, ö, ü), o with a tilde (õ) and s and z with a caron (š, ž). See EVS 1999 for details about Estonian alphabet and recommended locale settings.5 Although best attempts were made to use only one single code page for EstWN, some encoding bugs have sneaked in. Since the Polaris tool was too old to support any Unicode encodings, so the work files had to be in some extended ASCII code page. As the number of lexicographers grew along with the number of computers used, it was practically impossible to keep track of the code pages actually in use. In fact, this problem only emerged around 2010 when we started to convert the export files to UTF-8. The code pages that were in use are rather similar, there is no problem with the letters with umlauts. The first problems rose with Õ/õ (letter O with the diacritic mark tilde) that shares common code point with Ő/ő (letter O with double acute accent) in Windows 1252 and Windows 1250, respectively. These characters can be hard to separate, depending on the font design. Characters Š/š (letter S with a caron) and Ž/ž (letter Z with a caron) have different coding points in these code pages. If one lexicographer used Windows 1252 and another Windows 1250, after merging of the two files it could happen that synsets con- taining Š/š and/or Ž/ž were duplicated, one of them having nice literals and the other garbage. That led to synsets that differed in literals when really they should have been the same . The “umlauts” and “carons” were not the only causes of trouble. In the pro- cess of creating the lexicon, lexicographers at times copied parts of text from other sources. The copied parts of text occasionally included typographic char- acters (quoting marks, em and/or en dash), mathematical symbols (◦ degree symbol), foreign characters from other languages (å, ø). Finally we included only those characters that are in the extended Estonian alphabet. By doing this we lost some information, mostly foreign characters and typographic symbols, but the conversion was possible without any encoding errors. 5 27 is official number of characters in Estonian alphabet. This number does not include some letters found in US-ASCII: c, q, w, x, y. With all these included we will get the Estonian extended alphabet that contains 32 letters. If we will omit all “foreign” letters, we will get the number 23. This number includes the umlauts and õ; in addition to earlier mentioned letters, “carons” and f are excluded too. 4 The History of the EstWN The numeration of EstWN releases has been very simple: the number of each subsequent version is incremented by one. The last number is 73, but this com- mon ground does not mean, that we have exactly 73 releases of EstWN, and that all of them are numbered in this manner. There are 69 versions of EstWN at the Gitlab code repository6 that is hosted at the Center of Estonian Language Resources (CELR)7 . Most of them have version numbers, but two of them are identified by date. Those two we denoted as derivations of kb53 in Table 1. Table 1. Some indicators for EstWN versions, one from each year Synsets Synsets Synsets ILI without without ILI with many Version Year Synsets Senses Relations relations relations relations hyperonyms kb01 1998 1,570 3,277 0 1,426 1,570 244 0 kb13 1998 3,726 6,626 5,584 4,000 1,068 259 115 kb29 1999 8,578 15,553 17,832 9,985 134 1 201 kb35 2000 9,347 16,849 19,634 10,819 154 1 245 kb39 2001 9,798 17,493 20,683 11,329 151 1 277 kb42 2002 10,439 18,192 22,017 11,558 162 1 279 kb47 2003 10,892 18,851 22,982 12,044 161 2 284 kb48 2004 10,893 18,855 22,984 12,044 161 3 284 kb51 2005 15,387 33,517 23,620 12,084 4,426 4,477 22 kb53 2006 15,386 33,514 24,802 16,522 4,025 4,073 58 kb53-1 2007 15,952 34,227 26,218 18,358 4,025 4,088 91 kb53-2 2008 17,544 36,404 30,836 20,786 3,967 4,089 243 kb55 2009 26,781 48,292 55,260 35,217 3,541 2,167 811 kb59 2010 42,341 72,579 99,014 58,787 1,730 5,360 1,267 kb62 2011 49,514 82,568 133,062 73,397 836 2,141 1,700 kb65 2012 56,929 93,914 163,785 84,146 518 1,804 1,717 kb69 2013 65,518 107,544 203,070 96,205 60 1,258 102 kb70 2014 67,676 110,881 210,424 98,710 47 1,231 14 kb72 2015 74,720 120,911 239,719 108,202 5 9 100 kb73 2016 77,878 125,646 248,996 112,283 0 0 70 For each new version of EstWN the summary of statistical measures has been made. For the present overview we made an excerpt of the table, including the very first version (kb01 released on June 5, 1998), and thereafter the last version from each year. The date of the “last” version of the year varies a lot, from August to mid-December. The number of versions per year varies also. There were 13 releases in 1998, in 2006–2008 there was only one release per year. 6 https://gitlab.keeleressursid.ee/nemee/Estwn 7 https://keeleressursid.ee/en/ The capacity of a wordnet can be measured by various indicators. The most prominent ones are the number of synsets, and the number of lexical items. A more precise indicator is number of synsets by part-of-speech. As a wordnet is also about semantic (or lexical, or lexico-semantic) relations, the number of relations is informative as well. In order to link to other lexicons, external refer- ences are important. In EuroWordNet project the equal relations or ILI relations served this purpose . For EuroWordNet, the number of the Base Concepts (see Vossen et al. 1998) served as one of the indicators. It is not listed in this paper, but according to our raw data the Base Concepts were included in EstWN kb14 (first release of 1999) already, and their number has remained greater than 2,300 since then. Following the running statistics has influenced the process of extending Es- tWN. This can give insight to features, in which the data is missing. One of the aims to complement EstWN has been adding semantic relations, the decreasing number of synsets without semantic relations is a good indicator of this process. The last row in Table 1 entitled “Synsets with many hyperonyms” lists the number of synsets that have more than one hyperonym. Although more than one hyperonym per synset is a common practice in wordnets, their existence may cause anomalies in the wordnet hierarchy, as Lohk 2015 has shown. Better overview about the dynamics of the number of synsets by part of speech is given in Figure 1. The chart shows the number of noun, verb, adjective and adverb synsets cumulatively. Instead of a linear growth there is a S-shaped curve, which has a plateu at the beginning of the Millenium and a sharp burst in 2010. According to Figure 1 there is a increase in number of synsets from 2004 to 2005, but at the same time the number of synsets without relations, both internal and ILI, has changes too. This may be the co-effect of automatic addition of new synsets via derivations. For some derivations it was possible to make semantic relations in the same manner, but not for all of them. The method and problems are described by Kahusk, Kerner, and Vider (2010). As for kb73, currently the the last version, the number of synsets without relations is zero. EstWN kb73 will probably remain the last version to be linked to Princeton WordNet ver. 1.5, the next version will be linked to to 3.0 at least, and perhaps we will use a new numbering scheme as well. Kb 73 is also browsable by the current version of TEKsaurus tool (Kahusk and Vider 2005). A slightly older version of EstWN has been included into EstNLTK toolkit (Orasmaa et al. 2016) along with the Eurown module for Python (Kahusk 2010). As EstWN is compiled according to Estonian language, the meaning of words in Estonian is considered instead of translating synsets from English wordnet. This practice has lead to a rather small number of eq synonym relations in EstWN (see Figure 2). There was a rather broad spectrum of equal relations used in EuroWordNet project, 15 equal relation types (besides eq synonym) are used in EstWN. That explains the relatively high number of other ILI relations on Figure 2. Fig. 1. The number of synsets by part of speech in Estonian wordnet versions and the number of synsets without ILI relations. The most numerous of the other ILI relations are eq has hyperonym and eq near synonym. Both of them are relations that are used to cope different granularity and coverage of synsets in Estonian and many other languages. Fig. 2. The number of eq synonym and other ILI relations in different versions of Estonian Wordnet. 5 Saving for Future There are five versions of EstWN (see table 2) described in the CELR language resources metadata registry META-SHARE8 . They have been viewed more than 600 times, more than 330 views imputed to the oldest version. As for downloads, the number is 113, the latest version has been downloaded most often. 8 https://metashare.ut.ee Table 2. EstWN versions in META-SHARE Version DOI Views Downloads 65 https://doi.org/10.15155/1-00-0000-0000-0000-00084L 333 37 69 https://doi.org/10.15155/1-00-0000-0000-0000-00072L 95 13 71 https://doi.org/10.15155/1-00-0000-0000-0000-00087L 73 10 72 https://doi.org/10.15155/1-00-0000-0000-0000-000E9L 61 7 73 https://doi.org/10.15155/1-00-0000-0000-0000-0011DL 89 46 The metadata from our repository is harvested by CLARIN9 to the Virtual Language Observatory (VLO)10 , and META-SHARE11 . As for the last registry, search for “wordnet” gives 74 answers12 , if Estonian is specified as the language, 7 of them will remain in results. In VLO ’s case, the search for “wordnet” gives 106 answers, after limiting language to Estonian, 10 of the results remain. Although the numbers are not very large, it may take for a while to find out the last version. Probably version 73 will not remain last version of the EstWN, and every next version will add some more confusion. We may assume, that generally the user would be interested in finding the last version of the wordnet, but it may not always be so. A specific version of the wordnet or several (more than one) legacy versions may be needed in order to replicate a particular experiment or to develop a resource. If we poured all the more than 60 legacy versions of EstWN into our repository, then it would overflow not only our metadata registry, but CLARIN VLO and www.meta- share.org too. An elegant solution for this problem would involve an elaborated CMDI profile with references to older and newer versions of the resource.13 However, an upgrade of harvesting sites would be required as well. This would probably work for VLO, but not for www.meta-share.org or other harvesting services (our registry is harvested by Estonian E-Repository too).14 As a compromise, we are going to describe metadata of all the legacy versions as single collection of resources and upload it to our repository. 6 Acknowledgements Development of EstWN is supported by the National Programme for Estonian Language Technology during 2006–2017.15 9 https://www.clarin.eu/ 10 https://vlo.clarin.eu/ 11 http://www.meta-share.org 12 The search site redirects to http://metashare.ilsp.gr:8080/ 13 https://www.clarin.eu/cmdi1.2 14 https://www.e-varamu.ee 15 https://www.keeletehnoloogia.ee/en?set_language=en References Dı́ez-Orzas, Pedro, Philippe Forest, and Michael Louw (1996). High-level Archi- tecture of the EuroWordNet Database. A Novell ConceptNet-based semantic network. Tech. rep. url: https://www.illc.uva.nl/EuroWordNet/docs. html. Eastman, Dick (2014). “GEDCOM Explained”. In: Eastman’s Online Geneal- ogy Newsletter. url: https : / / blog . eogn . com / 2014 / 05 / 24 / gedcom - explained/. EVS (1999). Requirements on information technology in estonian language and cultural environment. Unofficial final draft. url: http : / / www . eki . ee / itstandard/2000/contents.html. Kahusk, Neeme (2010). “Eurown: an eurowordnet module for python”. In: Prin- ciples, Construction and Application of Multilingual Wordnets. Proceeding of the 5th Global Wordnet Conference: The 5th International Conference of the Global WordNet Association (GWC-2010), pp. 360–364. Kahusk, Neeme, Kadri Kerner, and Kadri Vider (2010). “Enriching Estonian WordNet with Derivations and Semantic Relations”. In: Frontiers in Artifi- cial Intelligence and Applications 219.Human Language Technologies—The Baltic Perspective. Ed. by Inguna Skadiņa and Andreijs Vasiljevs, pp. 195– 200. issn: 0922-6389. doi: 10.3233/978-1-60750-641-6-195. url: http: //doi.org/10.3233/978-1-60750-641-6-195. Kahusk, Neeme and Kadri Vider (2005). “TEKsaurus—the Estonian WordNet online”. In: The Second Baltic Conference on Human Language Technologies: proceedings. Ed. by Margit Langemets and Priit Penjam. Tallinn: Institute of Cybernetics. Kerner, Kadri (2008). “Proposing Some Methods of Improving Word Sense Dis- ambiguation for Estonian Language”. In: Proceedings of the Fourth Global WordNet Conference: Proceedings of the Fourth Global WordNet Conference. Hungary, Szeged: University of Szeged, Department of Informatics, pp. 229– 239. Lohk, Ahti (2015). “A System of Test Patterns to Check and Validate Wordnet- type Dictionaries”. PhD thesis. Tallinn University of Technology. url: http: //digi.lib.ttu.ee/i/?3161. Louw, Michael (1998). Polaris User’s Guide. The EuroWordNet Database Editor. Tech. rep. url: https://www.illc.uva.nl/EuroWordNet/docs.html. Orasmaa, Siim et al. (2016). “EstNLTK—NLP Toolkit for Estonian”. In: Pro- ceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Ed. by Nicoletta Calzolari et al. Paris, France: Eu- ropean Language Resources Association (ELRA). isbn: 978-2-9517408-9-1. Pedersen, S. Bolette et al. (2013). “Nordic and Baltic wordnets aligned and com- pared through “WordTies””. In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013). Ed. by Stephan Oepen, Kristin Hagen, and Janne Bondi Johannesse. Linköping University Electronic Press, pp. 147–162. url: http://www.ep.liu.se/ecp/085/ecp13085.pdf. Vossen, Piek et al. (1998). Set of Common Base Concepts in EuroWordNet-2. Tech. rep. url: https://www.illc.uva.nl/EuroWordNet/docs.html.