Conflations and duplications in Wikidata items: causes, detection, solutions, and issues

Camillo Carlo Pellizzari di San Girolamo
Scuola Normale Superiore, Pisa PI 56126, Italy

Abstract
This paper analyzes the problems of incorrect disambiguation of entities in Wikidata items, both in general and with a specific focus on items regarding humans. The problem of incorrect disambiguation is categorized into two types, i.e. conflations and duplications. The paper subsequently treats the causes of conflations and duplications, the methods available for detecting them, the solutions applicable to them, and the issues that constitute an obstacle to the aforementioned solutions; three proposals are finally made to mitigate these issues.

Keywords: Wikidata, entity management, authority control

1. Introduction
Wikidata (WD) is a knowledge base containing, as of September 2023, more than 106 M entities2. The disambiguation of these entities is, or at least should be, obtained through authority control, a key part of cataloguing [1]. The aim of authority control is to create entries that coincide exactly with the described entities; the incorrect disambiguation of an entry can result in conflations (i.e. several entities being described in the same entry) and duplications (i.e. several entries describing the same entity). The concept of authority control can be applied to the entries created by librarians (authority records) as well as to the entries created by WD users (WD items) [2]. This paper deals with the problem of conflations and duplications in WD items, both in general and with a specific focus on the 10+ M items regarding humans3 (this estimate includes only individual real humans, excluding both groups of humans and fictional humans). The choice of limiting this research to humans is motivated by the standardized structure of these items, which usually contain the same core of data, i.e.
name(s), birth/death dates and places, occupation(s); this makes the solution of the above problems easier than in the case of other types of items, e.g. organizations, places, or concepts.

Wikidata’23: Wikidata workshop at ISWC 2023
EMAIL: camillo.pellizzaridisangirolamo@sns.it
ORCID: 0000-0003-2699-1693
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2 Cf. https://www.wikidata.org/wiki/Special:Statistics.
3 https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q5.

However, it should be noted that conflations and duplications also significantly affect other types of WD items. Discussions have arisen regarding e.g. duplications of geographical places4 and conflations of buildings and the organizations having their seat inside them5; these issues also affect the conceptual items composing the WD ontology [3]. The following sections analyze the most frequent causes of conflations and duplications in WD items, the ways of detecting these problems, the methods usually adopted in order to solve them, and the issues that presently affect this process; three proposals are finally made to mitigate these issues.

2. Causes
The causes of conflations and duplications are analyzed on three levels: 1. general causes; 2. causes specifically applicable to humans; 3. how they concretely originate in WD.

2.1. General causes
Conflations and duplications are both caused mainly by a non-bijective relationship between an entity and its name. In other words, if an entity has only one name and this name is used only by this entity, there is no risk of incorrect disambiguation.
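This bijectivity criterion can be illustrated with a small sketch (the entities and names below are purely hypothetical):

```python
# Toy illustration: incorrect disambiguation becomes possible exactly when
# the relationship between entities and names is not bijective.
entity_names = {
    "person_A": {"John Brown"},
    "person_B": {"John Brown"},                   # shares a name with person_A
    "person_C": {"Lev Tolstoy", "Leo Tolstoy"},   # uses several names
}

def disambiguation_risks(entity_names):
    """Return names at risk of conflation and entities at risk of duplication."""
    name_to_entities = {}
    for entity, names in entity_names.items():
        for name in names:
            name_to_entities.setdefault(name, set()).add(entity)
    # A name used by several entities can lead to a conflation;
    # an entity using several names can lead to a duplication.
    shared_names = {n for n, es in name_to_entities.items() if len(es) > 1}
    multi_named = {e for e, ns in entity_names.items() if len(ns) > 1}
    return shared_names, multi_named

conflation_risk, duplication_risk = disambiguation_risks(entity_names)
print(conflation_risk)   # names used by more than one entity
print(duplication_risk)  # entities using more than one name
```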
However, incorrect disambiguation is typically a problem when an entity has many names (polyonymy) and when many entities use the same name (homonymy): a conflation happens when two entities using the same name are wrongly treated as just one entity, whilst a duplication happens when two names used by the same entity are wrongly treated as two different entities.

2.2. Causes for entities regarding humans
Applying the previous description to the case of entities regarding humans, the risks of incorrect disambiguation can be described as follows. Conflations usually affect homonyms, i.e. persons using the same name. Homonymies affect both very common name–surname combinations in a certain language (e.g. John Brown, Hans Meyer, Jean Martin, Mario Rossi etc.) and rarer ones, which can be more difficult to spot. Conflations are more probable when the homonymous persons lived in the same period and worked in the same field of activity6. Duplications usually affect persons with many names. These names can have multiple causes: they can be different forms of the same name, with different degrees of completeness (i.e. including/excluding the second surname and/or the second name, or with the second name represented by its initial)7; or they can be transliterations of the original name in different scripts8;

4 Cf. https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/08#Dealing_with_our_second_planet and https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/07#Another_cebwiki_flood?.
5 Cf. https://www.wikidata.org/wiki/Wikidata:WikiProject_Performing_arts/Data_structure/Data_modelling_issues#Items_confounding_architectural_structures_and_organizations.
6 E.g. https://www.wikidata.org/wiki/Q57906651 and https://www.wikidata.org/wiki/Q118183242, two geologists active in the same period and teaching in the same university.
7 E.g. https://www.wikidata.org/wiki/Q111010598 (see aliases in Italian).
8 E.g. https://www.wikidata.org/wiki/Q304890.
or they can be translations/adaptations of the original name in different languages9. Graphical variants (especially for premodern persons) can also generate multiple names.

2.3. Causes in WD
WD is edited both by humans and by bots. As of July 2023, most WD edits have been made by bot accounts (55%), the others by humans (45%), either through semi-automatic tools or manually. Considering item creations as well, bots (60%) prevail over humans (40%)10. Thus, it is relevant to consider how bot edits, semi-automated edits and manual edits can differently contribute to originating conflations and duplications, and to whom the responsibility for these incorrect disambiguations should be attributed in each case.

2.3.1. Causes in bot editing
Bot accounts (i.e. accounts possessing a bot flag) must comply with the bot policy11; according to it, each bot task has to be approved by the community and the bot “must stay within reasonable bounds of their approved tasks”; as of 1 September 2023, 360 WD accounts have the bot flag. Duplication is an issue frequently discussed by the users commenting on the requests for approval of new bot tasks, with the string duplicat currently occurring in more than one hundred request pages12: if the task involves importing new items, users commonly ask bot operators to demonstrate that they are taking every possible measure to minimize the risk of creating duplicate items. In 2018 a proposal to add to the bot policy a maximum duplicate rate for bots creating new items was not approved by the WD community13. On the contrary, conflation is almost never discussed (the string conflat currently occurs in only three request pages)14. Bots are also often used to add data to WD items using the external identifiers they contain as sources (of course, imported data should be released under a CC0 license15); e.g. if item X contains the ID Ψ, the bot can copy data from Ψ to X.
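A minimal sketch of such a copy step is given below; the data structures and names are hypothetical simplifications, not an actual bot framework (a real bot would use the Wikibase API and attach references to each imported statement):

```python
def copy_from_external(item, external_db, property_id):
    """Copy statements from the external entry whose ID is stored in the item.

    `item` and `external_db` are simplified dictionaries used for illustration.
    """
    external_id = item["claims"].get(property_id)  # the ID Ψ stored in item X
    if external_id is None:
        return item
    entry = external_db.get(external_id, {})
    for prop, value in entry.items():
        # Only fill gaps: never overwrite data already present in the item.
        item["claims"].setdefault(prop, value)
    return item

item_x = {"claims": {"P214": "12345"}}              # item X containing the ID Ψ
external = {"12345": {"P569": "1900", "P570": "1980"}}  # hypothetical entry Ψ
copy_from_external(item_x, external, "P214")
print(item_x["claims"])  # birth/death dates imported from Ψ into X
```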
However, if item X and ID Ψ are mostly about the same entity but ID Ψ also contains non-pertinent data (i.e. it is conflated), the bot importing data from Ψ to X will reproduce in X the same conflation affecting Ψ. In this case, the responsibility lies mainly with the compilers of Ψ, although it can be argued that Ψ should not have been matched to X in the first place because it was conflated; this argument is valid unless Ψ became conflated after having been matched to X.

2.3.2. Causes in semi-automated editing
A variety of semi-automated tools can be used, without a bot account, to add statements to existing items and to create new items massively. Among the tools listed in the Bot requests page16, the most used are QuickStatements17 (QS) and OpenRefine18 (OR). According to the statistics of the tool Wikidata Navel Gazer19 (WNG), based on edit tags, as of 1 August 2023 a total of 390 581 099 edits have been made through QS, 16 439 295 edits through OR, 1 628 745 edits through Harvest Templates20, and 413 223 edits through WikibaseJS-cli21. QS is used to perform a broad range of edits in WD, whilst OR is used mainly for the reconciliation of external databases with WD. The batches of edits performed through semi-automated tools, unlike bot tasks, are not subject to preliminary approval by the WD community.

9 E.g. https://www.wikidata.org/wiki/Q9438.
10 https://wikidata.wikiscan.org/.
11 https://www.wikidata.org/wiki/Wikidata:Bots.
12 Cf. https://www.wikidata.org/w/index.php?title=Special:Search&prefix=Wikidata:Requests+for+permissions/Bot/&search=duplicat*.
13 https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/04#What_duplicate_rates_should_we_tolerate?.
14 Cf. https://www.wikidata.org/w/index.php?title=Special:Search&prefix=Wikidata:Requests+for+permissions/Bot/&search=conflat*.
15 Cf. https://www.wikidata.org/wiki/Wikidata:Data_Import_Guide.
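For illustration, such a batch can be generated programmatically in the plain-text command format accepted by QS (the person and descriptions below are invented; a careful uploader would first check that no item for the same person already exists, to avoid creating duplicates):

```python
def quickstatements_batch(people):
    """Build QuickStatements commands creating one item per person.

    Hypothetical helper producing tab-separated QS commands: CREATE makes a
    new item, and LAST refers to the item created by the preceding CREATE.
    """
    lines = []
    for person in people:
        lines.append("CREATE")
        lines.append(f'LAST\tLen\t"{person["label"]}"')        # English label
        lines.append(f'LAST\tDen\t"{person["description"]}"')  # English description
        lines.append("LAST\tP31\tQ5")                          # instance of: human
    return "\n".join(lines)

batch = quickstatements_batch(
    [{"label": "John Smith", "description": "fictional example geologist"}]
)
print(batch)
```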
However, batches containing a significant percentage of mistakes can be reverted through the tool EditGroups22; usually reverts are decided through a community discussion23. Through OR the entries of an external database are reconciled with WD items, i.e. either matched with existing items regarding the same entity or created as new items. A wrong reconciliation can originate conflations (when the entry is not imported into the existing item representing its same entity, but into another item) and duplications (when the entry is not imported into the existing item representing its same entity, but as a new item instead). These wrong matches can happen in two distinct scenarios. In the first scenario, the matches by OR are based on a third ID (i.e. if entry A and WD item X both contain the ID Ψ, entry A is matched with item X), so the responsibility for mistakes does not lie with the uploader, who merely executes the upload. In such cases, the blame lies either with the compilers of A (who have added to it the non-pertinent ID Ψ) or with the compilers of X (for the same reason) or with the compilers of Ψ (who have conflated in it two distinct entities). In the second scenario, OR suggests possible WD items matching the entries of the external database and the uploader has to manually review these suggestions. In this case, obviously, the mistakes are caused by the inaccuracy of the uploader in reviewing the matches proposed by OR. The best possible reconciliation should also pay attention to the mistakes potentially affecting the external database. In particular, the uploader should always consider the possibility that the external database contains conflated and duplicate entries. Conflated entries should not be matched with WD, whilst duplicate entries should either be all matched with the same existing item or, if no item exists for the entity, be used to create one new item and all be matched to it.

2.3.3. Causes in manual editing
When editing manually, users can generate new conflations and duplications for the reasons explained in the previous paragraphs; in this case the responsibility for the incorrect disambiguation falls entirely on WD’s community.

3. Detection
In general, the detection of cases of incorrect disambiguation is based on the following reasoning: one item containing multiple values for certain statements that are expected to have one single value could be a conflation, whilst two items whose data show a high degree of similarity could be duplicates.

16 Cf. https://www.wikidata.org/wiki/Wikidata:Bot_requests.
17 https://quickstatements.toolforge.org/.
18 https://openrefine.org/.
19 https://bambots.brucemyers.com/NavelGazer.php?property=P-5.
20 https://pltools.toolforge.org/harvesttemplates/index.html.
21 https://github.com/maxlath/wikibase-cli.
22 https://editgroups.toolforge.org/.
23 Cf. https://www.wikidata.org/wiki/Wikidata:Edit_groups; community discussions are automatically categorized in https://www.wikidata.org/wiki/Category:Edit_group_discussions.

3.1. Use of property constraints
Property constraints24, introduced in 2015 [4], can be applied to properties in order to clarify how they should be used. Constraint violations, which can be checked both through SPARQL queries and through bot-updated reports25, help in discovering potentially problematic statements. Constraint violations, mainly of external-identifier properties, can also be used to detect conflations and duplications; these incorrect disambiguations can occur both in WD and in the external database. Given a database P containing two entries A and B, the following reasoning can be applied:
• a single-value constraint violation (SVCV) in WD (i.e. one WD item X containing both IDs A and B) could be either a conflation in WD or a duplication in P;
• a unique-value constraint violation (UVCV) in WD (i.e.
two WD items X and Y both containing ID A) could be a duplication in WD, a conflation in WD, or a conflation in P.

3.2. Use of SPARQL queries
Whilst the detection method based on constraint violations affecting identifier statements can be applied to all kinds of items, some other detection methods, based on SPARQL queries, are available specifically for items regarding humans. Considering that humans can only be born (and can only die) in one place and at one moment, if a WD item X contains e.g. two birth dates or places, it could be a conflation. However, only truly different values should be counted (i.e. if two values are just the same value with different precisions, like 1933 and 1st March 1933, or Brooklyn and New York, they should be counted as one, and the most precise should be ranked as preferred26), and sometimes for the same human different sources support different values27. Then, considering that different humans are rarely born (and/or die) in the same place and at the same moment, if two WD items X and Y contain the same birth date and have the same label (or similar labels) in a given language, they could be duplicates. In this case, too, exceptions exist, and they should be marked with the property “different from” (P1889) in order to avoid incorrect merges, which would conflate two different humans.

3.3. Statistics on conflations and duplications
No existing tool can be used to obtain statistics about conflations and duplications, since, as said above, there is no method that allows discovering them with full certainty. So the number of currently existing conflations and duplications is unknown (the same conclusion is reached by [5]). However, considering the sum of SVCV and UVCV for a given external-identifier property, it is possible to deduce the total number of conflations and duplications affecting either WD or the considered external database. For the 5 most used external-identifier properties as of 30 August 202328, the following list gives the property ID, the English label, the number of items containing the property (“items”), and the sum of SVCV and UVCV (“disambiguation issues”):
• P698 (PubMed ID): 32 037 827 items, 58 018 disambiguation issues29;
• P356 (DOI): 29 596 978 items, 39 934 disambiguation issues30;
• P3083 (SIMBAD ID): 8 076 124 items, 12 013 disambiguation issues31;
• P2671 (Google Knowledge Graph ID): 7 373 446 items, 27 236 disambiguation issues32;
• P932 (PMCID): 6 577 473 items, 697 disambiguation issues33.
The following list gives the same data for the 6 most used external-identifier properties whose subject type constraint allows usage in items regarding humans (P2671, being already present in the previous list, is not repeated in this one):
• P646 (Freebase ID): 4 417 648 items, 18 953 disambiguation issues34;
• P214 (VIAF ID): 3 224 729 items, 70 566 disambiguation issues35;
• P7859 (WorldCat Identities ID (superseded)): 1 893 721 items, 28 124 disambiguation issues36;
• P227 (GND ID): 1 779 731 items, 14 360 disambiguation issues37;
• P496 (ORCID ID): 1 788 906 items, 2 708 disambiguation issues38.
As noted above, for these disambiguation issues it is impossible to disentangle automatically those depending on WD and those depending on each external database.

24 https://www.wikidata.org/wiki/Help:Property_constraints_portal.
25 https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations (these reports also include deprecated statements, which are ignored by constraints in the visualization of items and in SPARQL queries).
26 Cf. https://www.wikidata.org/wiki/Help:Ranking.
27 Cf. e.g. the different birth/death dates of https://www.wikidata.org/wiki/Q1698718.

4. Solutions
This section first considers how conflations and duplications are solved in single cases (i.e.
splitting or merging items) and then how the WD community generally tackles these problems, collecting relevant WikiProjects and guidelines.

4.1. Splitting items
The standard procedure for solving a conflation, after its detection, is splitting the conflated item into two items (or more, if necessary). Splitting an item consists of moving non-pertinent data from the conflated item either to an existing item (or items) or to a new item (or items). Non-pertinent data can be found in all the parts of the item: 1. labels, descriptions, and aliases; 2. statements (including identifiers); 3. sitelinks; 4. incoming links from other items (which are not, strictly speaking, a part of the item itself).

28 Cf. https://www.wikidata.org/w/index.php?title=Template:Number_of_main_statements_by_property&oldid=1964859732.
29 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P698&oldid=1965269162.
30 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P356&oldid=1965284365.
31 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P3083&oldid=1965220883.
32 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P2671&oldid=1965225590.
33 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P932&oldid=1965263192.
34 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P646&oldid=1965269477.
35 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P214&oldid=1965292956.
36 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P7859&oldid=1965167996.
37 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P227&oldid=1965291214.
38 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P496&oldid=1965275301.
All these parts should be checked manually in order to completely solve the conflation. Since the procedure is manual, no statistics are available about item splits.

4.2. Merging items
The standard procedure for solving a duplication, after its detection, is merging the duplicate items into one item (merges always happen between two items; in order to merge more than two items, a series of merges has to be performed39). Merges are performed mainly through the gadget Merge.js40, which can be enabled by logged-in users in their Preferences; the special page MergeItems41 can also be used. Once triggered, the merge procedure is completely automatic: one item (usually the newer item, but using Merge.js the choice is always made by the user) is redirected to the other and all its data are automatically transferred into the other. The merges made through Merge.js are tagged with the tag “gadget-merge”42 and can be monitored through the special page RecentChanges43.

4.2.1. Non-mergeable duplicates
The following issues can hinder the merge of detected duplicate items:
• the two WD items can both contain a sitelink to a Wikimedia project having, mistakenly, two different articles for the same entity. The user can personally merge the articles (if they have a good knowledge of the language in which they are written, and enough time) or they can mark them as needing to be merged. Until the merge is performed, the two WD items cannot be merged and they are usually marked with the statement “instance of” (P31) “Wikimedia duplicate page” (Q17362920); as of 24 July 2023, there are 10 538 items with the P31=Q17362920 statement44;
• the two WD items can both contain a sitelink to a Wikimedia project having, with a valid reason, two different articles for the same entity (e.g. the two articles are written in two variants of the same language).
As of now, these two items can never be merged; they are marked with the statement “instance of” (P31) “Wikimedia permanent duplicate item” (Q21286738) and are interlinked through the property P2959 (“permanent duplicated item”); as of 24 July 2023, there are 6 768 items with the P31=Q21286738 statement45 and 15 091 items using P295946.

4.2.2. Statistics on merges
Thanks to the standardized nature of merge edits (in comparison with edits made in order to split items), some statistics are available about them through WNG: according to it, as of 1 July 2023, 3 568 671 merges have been made by 52 435 distinct users. In June 2023, 27 561 merges were made by 2 284 distinct users47; in July 2023, similarly, 27 392 merges were made by 2 256 distinct users48. Statistics about merges can also be obtained by counting the number of redirected WD items [5]. Statistics about redirected items (i.e. merged duplicates) can be obtained from the tool Wikiscan49: as of 1 September 2023, WD has 109 531 235 items50 and 4 112 295 redirected items51. Considering redirected items, 1 115 958 have been created by humans (specifically, 1 088 584 by registered users52 and 27 374 by IPs53) and 2 996 337 by bots54.

39 Cf. https://phabricator.wikimedia.org/T336192.
40 https://www.wikidata.org/wiki/MediaWiki:Gadget-Merge.js.
41 https://www.wikidata.org/wiki/Special:MergeItems.
42 Cf. https://www.wikidata.org/wiki/Special:Tags.
43 https://www.wikidata.org/w/index.php?title=Special:RecentChanges&tagfilter=gadget-merge.
44 https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q17362920.
45 https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q21286738.
46 https://www.wikidata.org/w/index.php?title=Template:Number_of_main_statements_by_property&oldid=1939837531.
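These figures are internally consistent, as a quick arithmetic check confirms (all numbers are those reported by the Wikiscan statistics cited above):

```python
# Redirected (i.e. merged) items by creator type, per Wikiscan (1 Sept. 2023).
registered, ips, bots = 1_088_584, 27_374, 2_996_337

humans = registered + ips         # redirected items created by human accounts
total = humans + bots             # all redirected items

assert humans == 1_115_958        # matches the human-created count above
assert total == 4_112_295         # matches the total redirected-item count

# Overall, bots created nearly three times as many merged duplicates as humans.
print(f"{bots / humans:.2f}")
```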
So, analyzing the total of merged duplicates, it seems that bots have created nearly three times as many merged duplicates as humans. However, taking into consideration the data available by year, bots created more merged duplicates than humans in each year from 2012 to 2019 (596 021 merged duplicates created by humans, 2 741 326 by bots), whilst the contrary is true in each year from 2020 to 2023 (519 937 merged duplicates created by humans, 255 011 by bots). So, from 2020 onwards humans (editing both with semi-automated tools and manually) are responsible for the creation of about two thirds of merged duplicates.

4.3. Approaches in the WD community
The WD community has created specific WikiProjects55 in order to deal with the problem of incorrect disambiguations: WikiProject Duplicates56 was founded in 2016, whilst WikiProject Conflation57 was founded in 2023. WikiProjects are meant to coordinate users involved in tackling a certain issue, or curating a certain group of items. Problematic disambiguations are also frequently discussed in thematic WikiProjects, as stated in [6]. Relevant guidelines about incorrect disambiguations include the practical ones, about splitting items58 and merging items59, and the theoretical ones, about conflations60 and duplications61. Best practices have also been collected in a subpage of WikiProject Duplicates62.

5. Issues and possible approaches
This section proposes some measures which could have positive effects in mitigating the issues of incorrect disambiguation in Wikidata, with a specific focus on their prevention and on the efficiency of the procedures for solving them.

5.1. Prevention

47 https://web.archive.org/web/20230724151558/https://bambots.brucemyers.com/NavelGazer.php?property=P-5.
48 https://web.archive.org/web/20230831104225/https://bambots.brucemyers.com/NavelGazer.php?property=P-5.
49 https://wikidata.wikiscan.org/.
50 https://wikidata.wikiscan.org/?menu=tables&submenu=creation&filter=creation_noredir.
51 https://wikidata.wikiscan.org/?menu=tables&submenu=creation&filter=creation_redir.
52 https://wikidata.wikiscan.org/?menu=tables&submenu=creation&filter=creation_redir&type=user.
53 https://wikidata.wikiscan.org/?menu=tables&submenu=creation&filter=creation_redir&type=ip.
54 https://wikidata.wikiscan.org/?menu=tables&submenu=creation&filter=creation_redir&type=bot.
55 Cf. https://www.wikidata.org/wiki/Wikidata:WikiProjects.
56 https://www.wikidata.org/wiki/Wikidata:WikiProject_Duplicates.
57 https://www.wikidata.org/wiki/Wikidata:WikiProject_Conflation.
58 https://www.wikidata.org/wiki/Help:Split_an_item.
59 https://www.wikidata.org/wiki/Help:Merge.
60 https://www.wikidata.org/wiki/Help:Conflation (see also, for humans, https://www.wikidata.org/wiki/Help:Conflation_of_two_people).
61 https://www.wikidata.org/wiki/Help:Deduplication.
62 https://www.wikidata.org/wiki/Wikidata:WikiProject_Duplicates/VIAF_members.

With respect to the prevention of these issues, it has been noted that bot tasks need to be approved before running, and the users discussing them usually require the issue of duplication to be tackled thoroughly. On the contrary, semi-automated batches run through OR, QS, or other tools are not subject to any approval procedure, but can only be criticized and possibly undone afterwards. The presence, in batch editing, of a significant percentage of incorrect disambiguations can be considered disruptive editing, and thus can fall under the blocking policy63 and can be a valid reason for undoing the whole batch, but there is no policy specifically dealing with conflations and duplications. Approving a policy containing precise standards of quality for semi-automated batches, including norms regarding incorrect disambiguations, could have positive effects both in encouraging the users running the batches to care more about data quality and in providing a clear reference point for judging whether the mistakes affecting a batch are serious enough to justify undoing it.
This policy could allow users who have run problematic batches to choose between undoing them and fixing their mistakes within a reasonably brief span of time.

5.2. Data round-tripping
With respect to the detection of these issues, constraint violations and SPARQL queries are already an effective means of finding a significant number of disambiguation issues. However, the lists of items obtained through these means mix disambiguation issues affecting WD items and disambiguation issues affecting the external databases to which WD items link. Whilst the former, once solved, disappear from the lists, the latter cannot be solved in WD, but only in the external database itself, and so can remain in the lists for a long time, wasting the time of the users stumbling upon them. This is a major issue in the workflow of users interested in dealing with disambiguation issues in WD. Data round-tripping64, i.e. the synchronization of WD data with the external database’s data (implying also the correction of mistakes on each side), surely benefits the quality of both WD items and the external database’s entries [7]. Thus, each external database should be interested in receiving mistake reports from, among others, WD users. Some databases effectively provide a contact method (web form, e-mail, phone number etc.; each WD property can indicate through P10923 the error-report method used by the database) and, when contacted, solve the reported mistakes on a regular basis, but others never answer reports (or explicitly refuse them65), and a few do not even provide any contact method. National authority files, whose IDs are widely used in Wikidata, are also affected by these issues, as shown in [8], and often lack effective ways of reporting mistakes. Until the mistake is solved in the external database, WD items have to keep conflated and duplicate external IDs.
Conflated IDs are usually ranked as deprecated and qualified with P2241 (“reason for deprecated rank”) Q14946528 (“conflation”) – this qualifier has 10 341 occurrences, as of 25 July 202366 –, whilst duplicate IDs do not usually receive any specific marker, since in most cases all these IDs are equally valid. The presence of these problematic IDs in WD items, as said, has the main negative effect of flooding the lists of constraint violations with a significant number of false positives (i.e. incorrect disambiguations which cannot be solved in WD, but need to be solved elsewhere). The creation of a tool simplifying both the reporting of mistakes in external databases for WD users and the management of such reports by the curators of these databases could mitigate this issue67.

63 https://www.wikidata.org/wiki/Wikidata:Blocking_policy.
64 Cf. https://www.wikidata.org/wiki/Wikidata:Data_round-tripping.
65 Cf. https://web.archive.org/web/20230621075514/https://www.wikidata.org/wiki/User:CaféBuzz/BNF.
66 https://w.wiki/77bn.

5.3. Management of conflations
Disambiguation issues, after being detected, follow different paths of solution: splitting a conflated item is a fully manual operation, requiring the user to evaluate to which item each datum needs to be assigned, whilst merging two duplicate items is an automated operation requiring just a few seconds. Although, obviously, disentangling two mixed entities necessarily requires human judgement in order to decide the exact boundaries of each entity, it is nonetheless evident that the present procedure for splitting items has some drawbacks: it is very time-consuming, and there is a high risk that the user forgets to check some parts of the conflation, thus not solving it completely. These issues can also discourage less experienced users from trying to solve conflations when they find them.
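The bookkeeping that a complete split must cover can be sketched as follows (a simplified, hypothetical item structure, not the actual Wikibase data model; the judgement of where each datum belongs necessarily remains with the user):

```python
# The four parts of an item that a complete split must check.
SPLIT_CHECKLIST = ("terms", "claims", "sitelinks", "incoming_links")

def split_item(conflated, belongs_to_new):
    """Partition a conflated item into the kept item and a new item.

    `belongs_to_new(part, datum)` encodes the user's (human) judgement
    about which entity each datum describes.
    """
    kept, new = {}, {}
    for part in SPLIT_CHECKLIST:
        data = conflated.get(part, [])
        kept[part] = [d for d in data if not belongs_to_new(part, d)]
        new[part] = [d for d in data if belongs_to_new(part, d)]
    return kept, new

# Toy example: two birth dates betray a conflation; move the second one out.
conflated = {"terms": [], "claims": [("P569", "1900"), ("P569", "1950")],
             "sitelinks": [], "incoming_links": []}
kept, new = split_item(conflated, lambda part, d: d == ("P569", "1950"))
print(kept["claims"], new["claims"])
```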
Moreover, item splits are presently impossible to monitor, since they are not performed through a gadget (which could assign them a specific tag). A related issue regards the solution of conflations deriving from incorrect merges68: the merge, after having been proven wrong, has to be undone separately in both the involved items, and then all incoming links have to be checked (and corrected, whenever necessary) manually. Introducing a gadget designed to help users solve conflations could solve the above problems. The gadget should present the user with an overview of the two items they are managing, distinguishing their four parts (labels, descriptions, and aliases; statements; sitelinks; incoming links), and provide a simple interface for moving these parts from one item to the other. It could integrate two already existing gadgets, Move69 (used for moving sitelinks) and moveClaim70 (used for moving statements). This new gadget would facilitate the whole process; in this way it would both encourage users to solve conflations71 and make some statistics about item splits available.

6. Related work
The issue of entity identification and disambiguation has been discussed in the semantic web literature. The surveyed literature is mainly concerned with the development of automatic (or semi-automatic) methods and tools for the disambiguation of entities. The problem of preventing database contributors from adding new conflations and duplications, which is typical of user-generated databases such as Wikidata, appears not to be the subject of dedicated publications. Among the surveyed publications, the following are of particular interest. [9] provides a survey of techniques and tools used in entity management systems for the semantic web. [10] deals with the issue of author disambiguation in bibliographic databases. [11] describes the clustering process used by VIAF and how it deals with ambiguities.
[12] proposes a method for identifying duplicate entries for people and companies in a given dataset. [13] describes the deduplication procedure used in the database ScholarlyData72. 67 Cf. https://phabricator.wikimedia.org/T312718. 68 See https://phabricator.wikimedia.org/T237262. 69 https://www.wikidata.org/wiki/MediaWiki:Gadget-Move.js. 70 https://www.wikidata.org/wiki/MediaWiki:Gadget-moveClaim.js. 71 Cf. https://www.wikidata.org/wiki/Wikidata_talk:Ontology_issues_prioritization. 72 http://www.scholarlydata.org/. 7. Conclusions Disambiguation issues are among the problems affecting the data quality of WD items: two entities having the same name can be conflated into one item, while one entity with two names can be duplicated into two items. These issues are generated in WD items by bots, by humans using semi-automated tools and by humans editing manually. Statistics about item merges (i.e. including only duplicates which have already been detected and merged) suggest that until 2019 most duplicates were created by bots, whilst from 2020 most have been created by humans. Conflations and duplications are detected through constraint violations and SPARQL queries: if an item contains two values for the same datum, it could be a conflation; if two items contain the same value for the same datum, it could be a duplication (or one of the two items is conflated). The considered datum is typically an external identifier; for this reason, the detected disambiguation issue can lie either in the WD items or in the considered external database. The issues affecting WD items can be solved directly by WD users, whilst the issues affecting external databases need to be solved by those databases. Solving a conflation implies splitting the conflated item, an operation which is performed manually, checking each piece of the item. By contrast, the solution of a duplication, i.e.
merging the duplicate items, is a fully automated operation, which can also be monitored through statistics. As outlined, three main proposals are advanced in order to mitigate the problem of incorrect disambiguation, both improving its prevention and facilitating its solution. Firstly, in order to reduce the number of mistakes introduced by semi-automated batches, it is proposed to introduce precise quality standards for these batches, including norms regarding incorrect disambiguations. Secondly, a more efficient data round-tripping procedure is needed in order to facilitate the detection and solution of incorrect disambiguations: a dedicated tool could simplify the communication between WD users and the curators of external databases, encouraging the latter to receive and efficiently answer the mistake reports coming from the former. Thirdly, a new gadget helping users solve conflations could make item splits faster and help users perform them without overlooking parts of the conflation, besides making it possible to obtain statistics that are not currently available. 8. References [1] R. A. Wiederhold, G. F. Reeve, Authority Control Today: Principles, Practices, and Trends, Cataloging & Classification Quarterly 59.2-3 (2021) 129–158. doi:10.1080/01639374.2021.1881009. [2] C. Bianchini, L. Sardo, Wikidata: a new perspective towards universal bibliographic control, JLIS.it 13.1 (2022) 293–311. doi:10.4403/jlis.it-12725. [3] D. Ammalainen, Wikidata Ontology Issues. Suggestions for prioritisation based on the perceived frequency of occurrence and the severity of impact on data re-use, 2023. URL: https://commons.wikimedia.org/wiki/File:Wikidata_ontology_issues_%E2%80%94_suggestions_for_prioritisation_2023.pdf. [4] D. Vrandečić, L. Pintscher, M. Krötzsch, Wikidata: The Making Of, in: WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023, Association for Computing Machinery, New York, NY, 2023, pp. 615–624.
doi:10.1145/3543873.3585579. [5] K. Shenoy, F. Ilievski, D. Garijo, D. Schwabe, P. Szekely, A study of the quality of Wikidata, Journal of Web Semantics 72 (2022). doi:10.1016/j.websem.2021.100679. [6] T. Kanke, Knowledge curation work in Wikidata WikiProject discussions, Library Hi Tech 39.1 (2021) 64–79. doi:10.1108/LHT-04-2019-0087. [7] S. Fauconnier, Data Roundtripping: a new frontier for GLAM-Wiki collaborations, 2019. URL: https://diff.wikimedia.org/2019/12/13/data-roundtripping-a-new-frontier-for-glam-wiki-collaborations/. [8] C. Bianchini, S. Bargioni, C. C. Pellizzari di San Girolamo, Beyond VIAF: Wikidata as a Complementary Tool for Authority Control in Libraries, Information Technology and Libraries 40.2 (2021). doi:10.6017/ital.v40i2.12959. [9] A. Morris, Y. Velegrakis, P. Bouquet, Entity Identification on the Semantic Web, in: Proceedings of the 5th Workshop on Semantic Web Applications and Perspectives (SWAP2008), Rome, Italy, December 15-17, 2008. [10] C. Hedeler, B. Parsia, B. Mathiak, Using the semantic web for author disambiguation - are we there yet?, in: ISWC 2014 Posters & Demonstrations Track, pp. 449–452. [11] T. B. Hickey, J. A. Toves, Managing Ambiguity In VIAF, D-Lib Magazine 20.7/8 (2014). doi:10.1045/july2014-hickey. [12] M. Holub, O. Proksa, M. Bieliková, Detecting Identical Entities in the Semantic Web Data, in: SOFSEM 2015: Theory and Practice of Computer Science. 41st International Conference on Current Trends in Theory and Practice of Computer Science, Pec pod Sněžkou, Czech Republic, January 24-29, 2015, Proceedings. [13] Z. Zhang, A. G. Nuzzolese, A. L. Gentile, Entity deduplication on ScholarlyData, in: The Semantic Web: 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28–June 1, 2017, Proceedings, Part I, pp. 85–100.