Automatic Tag Recommendation for the UN Humanitarian Data Exchange

Ghadeer Abuoda (a), Chad Hendrix (b) and Stuart Campo (b)

(a) College of Science and Engineering, Hamad Bin Khalifa University, Qatar
(b) United Nations Office for the Coordination of Humanitarian Affairs (OCHA), Centre for Humanitarian Data, Netherlands

Abstract
We have recently seen rapid growth in the number of data portals and dataset repositories made available on the Web. While these repositories have been critical for advancing research, much work remains to improve the discovery of appropriate datasets and relevant sources. Search engines, the primary tools for dataset discovery, are mainly keyword-based over the published metadata of the datasets, whether within dataset repositories or over the Web. However, in most cases, the available metadata may not encompass the essential information the user needs to decide whether a dataset fits a given task. Therefore, data publishers should annotate their datasets with informative metadata when they add them to a dataset repository. Tags are a particular form of metadata that data publishers use to describe their view of how a dataset should be categorized. An interesting problem is how to automate the process of recommending tags to data publishers when they add new data to a dataset repository. In this paper, we develop an approach for automatic tag recommendation for dataset repositories. We investigate how to exploit the features of the dataset and the tagging history in the repository to build an effective tag recommendation model. We further demonstrate the integration of the model in the Humanitarian Data Exchange, a real-world dataset repository in the social and humanitarian domains.

Keywords
Dataset Repository, Dataset Tagging, Keyword Search, Tag Recommendation

BIRDS 2021: Bridging the Gap between Information Science, Information Retrieval and Data Science, March 19, 2021, online
gabuoda@hbku.edu.qa (G. Abuoda); hendrix@un.org (C. Hendrix); campo2@un.org (S. Campo)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Nowadays, many dataset repositories and data portals are created by different organizations to facilitate the sharing and distribution of datasets. Online platforms such as CKAN,1 Quandl,2 Kaggle, and Microsoft Azure Marketplace3 are examples of dataset repositories that host datasets for data-driven research in a wide range of domains. The data in these repositories is usually tabular (e.g., CSV files), and the goal of the repositories is to enable data scientists to find, access, integrate, and analyze combinations of datasets based on their needs. The first step in this process is to find the datasets relevant to a task, which is an information retrieval problem. Currently, dataset repositories use search engines that were mainly developed for unstructured textual documents. To improve retrieval quality, dataset repositories typically allow data publishers to add metadata with their datasets, i.e., structured information about the data [1]. The search engines rely on this metadata, in addition to the content of the datasets, to guide users toward relevant datasets. Thus, high-quality metadata plays an important role in enabling users to find datasets relevant to their needs [2, 3].

1 https://ckan.org/
2 https://www.quandl.com/
3 https://azuremarketplace.microsoft.com/en-us/marketplace/
One type of metadata that data publishers often use to annotate and label their datasets is tags [4]. In particular, publishers can assign freely chosen keywords to datasets with the purpose of referencing these datasets later on with the help of the assigned tags. A dataset publisher can define tags that describe a dataset as a whole or that emphasize a certain topic relevant only to that dataset. A fundamental issue that underlies the effectiveness of user-defined tags is their quality and relevance [5]. On the one hand, these tags represent a more flexible way of describing content than a fixed taxonomy with a controlled vocabulary, which means that tags should be freely chosen by data publishers. On the other hand, tags should be correctly formed and spelled, relevant to the content and its terms, and neither repetitive nor ambiguous. To balance these conflicting goals, a tag recommendation method can assist data publishers in the tagging process and improve the quality of the available metadata about their datasets [5]. Good tag recommendation can benefit not only search, but also other services that rely on tags, such as content recommendation and categorization.

In this paper, we present a tag recommendation model and show how we applied it effectively to improve information retrieval for datasets in the Humanitarian Data Exchange (HDX) platform, in service of the data scientists who use this platform. The main idea of our model is to analyze the metadata and tagging history associated with existing datasets to find candidate tags. We propose a way of integrating the model in the dataset upload page, which encourages data publishers to attach informative tags to their datasets when they first upload them. Automatic tag recommendation raises user confidence when interacting with the platform: (i) dataset publishers feel more confident that they are not guessing how to tag; the HDX platform makes them feel supported, and (ii) users who come to the HDX platform looking for information have more confidence because they have a more accurate picture of the datasets.

As an example of our model's recommendations on the HDX platform, consider the dataset Nigeria: 2018 Education Secondary Data Review (SDR)4 published by the Nigeria Education in Emergencies Working Group. The dataset contains assessment reports for humanitarian missions in the education domain. The dataset is currently tagged with only two tags, "education needs" and "assessment". Our model recommends additional tags and ranks them based on similarity to the dataset, with the top three tags being "Nigeria complex emergency", "education", and "education cluster". This gives the dataset publisher meaningful tagging options and higher confidence when tagging their dataset.

4 https://data.humdata.org/dataset/nigeria-2018-education-secondary-data-review-sdr

2. Related Work

In dataset repositories, tags are a source for enriching taxonomies over the evolving and dynamic content published in these repositories [6]. Moreover, many metadata standards that rely on tagging techniques were developed to aid researchers in sharing research artifacts (data, code, publications) [7]. Additionally, dataset-centered search engines rely extensively on metadata generally and tags specifically for dataset discovery and retrieval. For instance, the Google dataset search engine [3] crawls the web for datasets and collects the associated metadata.
These standards and tools are effective only if the metadata and tags are largely correct and maintained. However, in practice, most datasets have incomplete or non-existent metadata [8]. Therefore, there is a need for work like ours to automate the creation of metadata. Tag recommendation services have a direct benefit to IR services such as search [9] and query expansion [10]. There are many well-studied approaches for tag recommendation, such as content-based methods, collaborative filtering methods, and hybrid methods [5]. Regardless of the type of tag recommendation method, the challenge is always to find the set of tags that best describes a given resource.

Text analysis has long been recognized as a useful technique for extracting informative tags for web resources. In this approach, each resource (in our case a dataset) is represented as a document through a vector of all word occurrences weighted by term frequency-inverse document frequency (TF-IDF) or by statistical topic modeling techniques. Various tag recommendation techniques have been proposed that rely on different representations of the resources and on computing the similarity between resources, in addition to mining the historical occurrence of tags [11, 12, 13, 14]. Next, we present how we use text analysis techniques to recommend tags in HDX.

3. Tagging on the Humanitarian Data Exchange (HDX)

The Humanitarian Data Exchange (HDX)5 is an open platform for sharing data across crises and organizations. The HDX platform is managed by the Centre for Humanitarian Data of the United Nations Office for the Coordination of Humanitarian Affairs (OCHA). The platform hosts more than 17,000 datasets shared by hundreds of organizations covering humanitarian crises around the world. The goal of the HDX platform is to make humanitarian data easy to find and use for analysis. The platform has a search-engine interface that allows users to search datasets via keywords or a faceted search on features such as location, organization, and license. The returned datasets are presented in a structure-aware fashion, exposing attributes of the datasets (number of downloads, tags, dataset owner, format, etc.) and enabling users to explore quick charts of the datasets or develop their own visualizations.

A keyword search relying on user-generated metadata is the most common way to find a specific dataset on the platform. One crucial factor in the quality of the search results on the HDX platform is the quality and richness of the metadata, mainly the tags provided by dataset publishers. On the HDX platform, a tag usually refers to a concept (e.g., health, education, camps), a specific crisis (e.g., Syria, Darfur), the type of the crisis (e.g., earthquake, hurricane), and/or the organization that collected the dataset (e.g., UNICEF, Education Above All). At the time of this work, there was no defined list of tags, and data publishers could use free text to tag datasets. The HDX technical team reported that, in a particular sample of 19,171 search queries, only 8,114 resulted in actual downloads of HDX datasets. One possible interpretation of this gap between search requests and dataset downloads is that users may not be satisfied with the search results or could not find the information they expected.
Since tags play an important role in search quality, we propose a method for improving the tagging process on the HDX platform with the goal of improving search quality and user engagement.

5 https://data.humdata.org/

4. Our Proposed Tag Recommender

Our model takes as input the set of tagged datasets in the repository and an input target dataset 𝑑. The model should provide a list of the top 𝑘 candidate tags, sorted according to their relevance to dataset 𝑑. In this work, we investigate recommending tags that are relevant to the target dataset 𝑑 by utilizing various types of information: (i) previously assigned tags in the repository, (ii) terms extracted from textual features of the datasets in the repository (e.g., title, description, etc.), and (iii) terms extracted from the target dataset.

Developing our model on the HDX platform required us to address several challenges. First, the amount of metadata available varies widely between datasets. In some datasets, the metadata can contain around 1,000 different terms, while other datasets can barely reach 40 terms. Thus, our approach needs to enrich the metadata of datasets that have only a few terms by choosing the important words in the datasets' content. Second, data publishers use numbers, special characters, and hyperlinks in the descriptions of their datasets. This content affects the ability to match with predefined tags and to define similarity in any approach (e.g., "Syria crisis-2011" is different from "Syria crisis"). Third, data publishers sometimes provide the description of their datasets in PDF files, not free text. In some cases, the attached PDF file describes the project in which the dataset was collected, not the dataset itself. Fourth, the tags used for HDX may refer to the same concept with different terms (e.g., education vs. learning; sex/age rate vs. demographics; displaced people location vs. displacement and shelter). Moreover, the valid list of tags contains more specific concepts (e.g., education in emergencies, education facilities). Fifth, data publishers use acronyms as tags. They use different acronyms to refer to the same concept (e.g., using both '3W' and '3Ws' to refer to a 'who-is-doing-what-where' dataset). Alternatively, they may use acronyms in a way that changes the meaning and makes finding a match in the valid tag list even harder (e.g., using 'pin' to mean 'people in need'). Finally, the tags can be variations on the same term (e.g., refugee vs. refugees).

To address these challenges, we developed a model that analyzes the metadata of a dataset through different phases using off-the-shelf text processing techniques. The main phases of our recommendation model are depicted in Fig. 1 and summarized in the rest of this section.

Figure 1: Phases of the HDX Tag Recommender

Acquiring Metadata Using the HDX API,6 we extract the metadata collection associated with a group of datasets of interest; for example, we may be interested in the education domain and thus focus on datasets annotated with the "education" tag. The metadata elements extracted for each dataset are: the title of the dataset, the tags assigned to the dataset by the data publisher, the organization that provided the dataset, the source of the dataset (if different from the organization), the URL (which enables us to crawl the content of public datasets and enrich the metadata with information from the dataset header), the countries mentioned in the metadata object, whether the dataset has geodata, and the free-form note describing the dataset. The output of this phase is a record of terms extracted from the HDX metadata for each dataset.

6 https://github.com/OCHA-DAP/hdx-python-api
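As an illustration of this acquisition step, here is a minimal Python sketch using the hdx-python-api library. It is indicative rather than the exact code of our pipeline: import paths and method names differ across library versions, and the extracted fields are only a subset of the metadata elements listed above.

```python
# Minimal sketch of metadata acquisition from HDX (indicative only;
# hdx-python-api import paths and signatures vary across versions).
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset

Configuration.create(hdx_site="prod",
                     user_agent="hdx-tag-recommender-example",
                     hdx_read_only=True)

# Fetch datasets of interest, e.g., those matching "education".
datasets = Dataset.search_in_hdx("education")

records = []
for ds in datasets:
    # Dataset objects expose the underlying CKAN metadata like a dict.
    records.append({
        "title": ds.get("title", ""),
        "notes": ds.get("notes", ""),  # free-form description
        "organization": (ds.get("organization") or {}).get("title", ""),
        "tags": [t["name"] for t in ds.get("tags", [])],
    })
```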
Preprocessing and Cleaning This phase includes tokenization, stemming (e.g., "refugees" and "refugee" become the same token), and removing numbers, special characters, links, stopwords, and non-English terms.

Candidate Tag Extraction The set of terms extracted from the metadata of the dataset collection is our vocabulary. We extract candidate tags from this vocabulary; a candidate tag can be an individual term or a pair of co-occurring terms. An important step in our work was to evaluate different methods for defining candidate tags (results in the next section). We evaluated: (1) scoring each vocabulary term based on Term Frequency (TF) and using the terms with the top TF scores as candidate tags, (2) combining TF with Inverse Document Frequency and using the top TF-IDF terms, and (3) extracting frequent co-occurring terms from the vocabulary using N-grams, which helps decide which N terms can be chunked together to form a single tag.

Tag Expansion Our metadata acquisition step extracts the set of previously used tags in the repository. In the tag expansion phase, we enrich these tags by adding related terms from WordNet7 (e.g., "teaching", "pedagogy", and "didactics" are added to the "education" tag). We also consider enriching the tags with similar terms based on the word2vec model [15].

Computing Similarity The model identifies a set of candidate tags from the vocabulary. It also uses a similar process to identify a set of candidate tags for the target dataset 𝑑. We need a similarity measure between the tags in the two sets. We explore different representation techniques such as vector encoding, TF, TF-IDF, and distributed representations (i.e., word2vec). We compute the cosine similarity between the representations of the candidate tags from the vocabulary and those from the target dataset.

Tags Recommendation We now reach the key step in our approach: recommending tags for the target dataset. Our model ranks the candidate tags by their similarity to the target dataset 𝑑 and recommends the top 𝑘 candidate tags.

Setting Thresholds We need to compute TF and TF-IDF thresholds to guide the creation of the vocabulary. These thresholds define the cut-off points that determine which terms to eliminate from the vocabulary. To determine the thresholds in our model, we test different cut-off values and observe their effect on vocabulary size. There is typically a cut-off value where going higher leads to a significant reduction in vocabulary size, and this is the cut-off value we use.

7 https://wordnet.princeton.edu/
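To make these phases concrete, we include small Python sketches under stated assumptions; they are simplified stand-ins for our pipeline, not its exact code. The preprocessing and cleaning phase might look as follows with NLTK (the non-English filter is omitted for brevity):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_tokens(text):
    """Lowercase, drop links/numbers/special characters and stopwords,
    then stem so that, e.g., 'refugees' and 'refugee' become one token."""
    text = re.sub(r"https?://\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # remove numbers/symbols
    return [STEMMER.stem(t) for t in text.lower().split()
            if t not in STOPWORDS and len(t) > 2]

print(clean_tokens("Refugees in the Syria crisis-2011: https://example.org"))
# -> ['refugee', 'syria', 'crisi']
```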
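The three candidate-extraction variants can be approximated with scikit-learn's vectorizers; the toy corpus below stands in for the cleaned metadata records:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus: one "document" of cleaned metadata terms per dataset.
corpus = [
    "education cluster assessment nigeria emergency",
    "education facilities nigeria school enrollment",
    "displacement shelter syria crisis assessment",
]

def top_candidates(vectorizer, docs, k=5):
    """Score vocabulary terms over the collection and return the top-k."""
    matrix = vectorizer.fit_transform(docs)
    scores = np.asarray(matrix.sum(axis=0)).ravel()
    terms = vectorizer.get_feature_names_out()
    return [terms[i] for i in scores.argsort()[::-1][:k]]

print(top_candidates(CountVectorizer(), corpus))                    # (1) top TF
print(top_candidates(TfidfVectorizer(), corpus))                    # (2) top TF-IDF
print(top_candidates(TfidfVectorizer(ngram_range=(1, 2)), corpus))  # (3) + bigrams
```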
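Tag expansion with WordNet can be sketched via NLTK's interface (the related terms returned depend on the installed WordNet version):

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def expand_tag(tag):
    """Collect WordNet lemma names across all senses of a tag."""
    related = set()
    for synset in wn.synsets(tag):
        for lemma in synset.lemmas():
            related.add(lemma.name().replace("_", " ").lower())
    related.discard(tag)
    return related

print(expand_tag("education"))
# includes, e.g., 'teaching', 'pedagogy', 'didactics', 'instruction'
```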
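The similarity and recommendation phases reduce to ranking candidate tags by cosine similarity against the target dataset; here is a sketch with TF-IDF vectors (a word2vec representation would slot in the same way, and the candidate and target strings are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend_tags(candidate_tags, target_text, k=3):
    """Rank candidate tags by cosine similarity to the target dataset's
    metadata text and return the top-k."""
    vectorizer = TfidfVectorizer()
    # Fit on the candidates plus the target so both share one vector space.
    matrix = vectorizer.fit_transform(candidate_tags + [target_text])
    scores = cosine_similarity(matrix[:-1], matrix[-1]).ravel()
    ranked = sorted(zip(candidate_tags, scores), key=lambda p: -p[1])
    return ranked[:k]

candidates = ["education cluster", "nigeria complex emergency",
              "displacement shelter", "education"]
target = "secondary data review of education needs assessment in nigeria"
print(recommend_tags(candidates, target))
```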
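Finally, threshold setting amounts to sweeping cut-off values and watching the vocabulary size; a sketch with an invented term-frequency table:

```python
from collections import Counter

def vocabulary_sizes(term_frequencies, cutoffs):
    """How many vocabulary terms survive each TF cut-off."""
    return {c: sum(1 for f in term_frequencies.values() if f >= c)
            for c in cutoffs}

# Invented term-frequency table for a tiny vocabulary.
tf = Counter({"education": 120, "assessment": 80, "nigeria": 60,
              "school": 25, "enrollment": 12, "typoterm": 2})

print(vocabulary_sizes(tf, cutoffs=[1, 5, 10, 20, 50]))
# {1: 6, 5: 5, 10: 5, 20: 4, 50: 3}; choose the cut-off just before
# the vocabulary size drops sharply.
```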
5. Experimental Evaluation

Datasets and Tag Selection We used a subset of HDX datasets to develop and evaluate our model. There were around 3,000 private and public datasets annotated with the tag "education". We sampled 80% of these datasets to build the vocabulary of the model, while the remaining 20% were used for evaluating the recommended tags.

Evaluation Our model recommends 𝑘 tags for each dataset (we set 𝑘 in the range 3-5). Our evaluation metric is the percentage of these tags that are already used in tagging the dataset; this is a recall metric [16]. The vocabulary consists of around 1,800 terms. Term frequency and document frequency vary widely, and the thresholds TF=20 and DF=30% worked best. Fig. 2 shows the recall of the different methods of building the vocabulary. Using term frequency to identify candidate tags achieves around 30% recall. Adding N-grams (N=2) boosts recall by around 20 percentage points. Using word2vec is not effective, even when combined with N-grams. Thus, we recommend using TF-IDF and N-grams, but not the more complex word2vec.

Figure 2: Recall for Different Keyword Extraction Methods (tf, tfidf, tf+ngram, tfidf+ngram, tf+word2vec, tfidf+word2vec, tfidf+ngram+word2vec; recall values range from 31.4% to 62.2%)
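As a sketch of this evaluation metric (with invented data; in our setup the per-dataset scores are averaged over the held-out 20%):

```python
def tag_hit_rate(recommended, existing):
    """Fraction of recommended tags already used on the dataset,
    i.e., the metric described above [16]."""
    existing = {t.lower() for t in existing}
    hits = sum(1 for t in recommended if t.lower() in existing)
    return hits / len(recommended) if recommended else 0.0

# Invented (recommended, existing) tag pairs for two held-out datasets.
evaluation_set = [
    (["education", "assessment", "nigeria"], ["education", "education cluster"]),
    (["displacement", "shelter", "syria"], ["displacement", "shelter"]),
]
scores = [tag_hit_rate(rec, gold) for rec, gold in evaluation_set]
print(sum(scores) / len(scores))  # 0.5 = (1/3 + 2/3) / 2
```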
6. Conclusion

We presented an approach to automatically recommend tags for datasets on the HDX platform. The effectiveness of our model lies in using the existing metadata in the dataset repository, in addition to the textual features of a dataset, to recommend informative tags. Our goal is for better tags to lead to better search results and user engagement on the HDX platform.

References

[1] S. Khalsa, P. Cotroneo, M. Wu, A survey of current practice of data search services, 2018.
[2] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L.-D. Ibáñez, E. Kacprzak, P. Groth, Dataset search: a survey, The VLDB Journal 29 (2020).
[3] N. Noy, M. Burgess, D. Brickley, Google dataset search: Building a search engine for datasets in an open web ecosystem, 2019.
[4] P. Rafferty, Tagging, KO Knowledge Organization 45 (2018).
[5] F. M. Belém, J. M. Almeida, M. A. Gonçalves, A survey on tag recommendation methods, Journal of the Association for Information Science and Technology (2017).
[6] F. Nargesian, K. Q. Pu, E. Zhu, B. Ghadiri Bashardoost, R. J. Miller, Organizing data lakes for navigation, in: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020.
[7] P. Rocca-Serra, A. Gonzalez-Beltran, L. Ohno-Machado, G. Alter, The Data Tags Suite (DATS) model for discovering data access and use requirements, GigaScience (2020).
[8] A. F. Tygel, Semantic tags for open data portals: Metadata enhancements for searchable open data, Federal University of Rio de Janeiro (2016).
[9] M.-H. Hsu, H.-H. Chen, Efficient and effective prediction of social tags to enhance web search, Journal of the American Society for Information Science and Technology (2011).
[10] V. Oliveira, G. Gomes, F. Belém, W. Brandao, J. Almeida, N. Ziviani, M. Gonçalves, Automatic query expansion based on tag recommendation, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012.
[11] B. Hong, Y. Kim, S. H. Lee, An efficient tag recommendation method using topic modeling approaches, in: Proceedings of the International Conference on Research in Adaptive and Convergent Systems, 2017.
[12] R. Krestel, P. Fankhauser, W. Nejdl, Latent Dirichlet allocation for tag recommendation, in: Proceedings of the Third ACM Conference on Recommender Systems, 2009.
[13] W. Huang, S. Kataria, C. Caragea, P. Mitra, C. L. Giles, L. Rokach, Recommending citations: translating papers into references, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012.
[14] B. Sigurbjörnsson, R. van Zwol, Flickr tag recommendation based on collective knowledge, in: Proceedings of the 17th International Conference on World Wide Web, 2008.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013.
[16] K. M. Ting, Precision and Recall, 2010.