Interactive Multimedia Visualization for Exploring and
 Fixing a Multi-Dimensional Metadata Base of Popular
                       Musics

                         Maroua Tikat, Marco Winckler, and Michel Buffa

          University Côte d’Azur, SPARKS/wimmics team, Inria, CNRS, I3S, France
       [maroua.tikat, winckler, michel.buffa]@univ-cotedazur.fr


         Abstract. In this position paper we discuss the use of information visualization
         techniques as a mean to find and characterize inconsistencies in music datasets.
         This idea is supported by empirical findings in a previous work dedicated to the
         visualization of a large dataset of music metadata called WASABI. During the
         development process of a visualization technique for exploring the multitude of
         multimedia attributes in the WASABI dataset (which includes lyrics, chords, au-
         dio, graphics describing sound analysis, etc.), we found visual patterns suggesting
         data inconsistencies (ex. ambiguities, inaccuracies, missing data, conflicts, etc.),
         which might have occurred during the integration from diverse sources. Tradition-
         ally, information visualization techniques are used to understand the data corpus
         and identify causal relationships, trends, patterns of data concentrations. Nev-
         ertheless, our findings suggest that information visualization techniques can be
         used to inspect data quality and highlight the parts of the datasets that need to
         be corrected/improved. Furthermore, we suggest that information visualization
         could be used as an entry point for repairing the dataset. More specifically, our
         aim is to use information visualization techniques to: communicate data quality
         issues to users, compare the outcomes of methods (such as crowdsourcing, matrix
         vectorization, graph reasoning, among other) used to fix the dataset, and observe
         the evolution of problem solving during the maintenance of the dataset.

         Keywords: music dataset, multimedia data, visualization techniques, data qual-
         ity, multivariate data


1    Introduction
This position paper presents our findings and ideas raised whilst creating a knowledge
graph for music datasets. Working on a knowledge graph makes it possible to inter-
link resources with rich semantic relationships, which helps increase knowledge. If we
take for example a music dataset, it would offer a large scope of metadata describing
music, including multimedia dimensions that count textual (e.g. title, lyrics), graphical
(e.g. curves), and audio (e.g., the sound produced), as well as metadata qualifying the
works (e.g., date and time of recording, authors, performers...) and uses (e.g. song cov-
ers, classification for a specific use,...). All these resources are given URIs that can be
referenced to documents describing them and can be shared across datasets. In order to


Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International
(CC BY 4.0).
explore music datasets, we have proposed in a previous work [6] the use of informa-
tion visualization techniques. Nonetheless, during the visualization process we found
some patterns that suggest inconsistencies in the dataset. As a matter of fact, the related
work section shows that all the music datasets containing a large volume of data have
been built from several data sources, either from public sources on the web, or from au-
dio analysis or lyrics analysis, which will inevitably generate data incoherence. These
observations led us to consider a new perspective for using information visualization
techniques as a tool for assessing data quality, comparing the results of fixing methods
and communicating data inconsistencies to the user. In the rest of this position paper
we discuss the underlying background for characterizing the problem, especially on the
field of music datasets, and we envisage a method aimed at helping to fix data quality
problems.


2      Background and related work

2.1      Music datasets

Musical contents can be described in many multimedia dimensions (ex. lyrics, chords,
sounds, metadata...). Nonetheless, most of music datasets are specialized on a few types
of attributes or on a specific musical genre. In order to have a more complete description
of songs, some new datasets have been created by aggregating multiple data sources:

    – The WASABI dataset [6]: available at https://github.com/micbuffa/WasabiDataset,
      it contains over 2 MM commercial songs issued from multiple sources. It features
      a rich set of cultural metadata on songs, albums and artists, and also contains meta-
      data extracted from NLP analysis of lyrics and MIR (Music Information Retrieval)
      analysis of song audio contents. An Ontology describing the WASABI dataset is
      also publicly available 1 .
    – The Million Song Dataset [5]: this one includes over 1 MM contemporary popular
      songs, with metadata extracted mainly from audio content analysis and lyric-related
      metadata in the form of ”bag of words”.
    – MusicBrainz: a free online music database, available under an open licence, which
      collects music metadata (ex. artits, albums, labels, etc) and makes it available to the
      public. There is no information about lyrics but a companion web site (i.e. Acous-
      ticBrainz) offers MIR audio analysis of a subset of the songs (crowd sourced).
    – MusicWeb [2] and its latest version MusicLynx [3]: allow the exploration of a graph
      of artist similarities, built by linking several free public data sources.


2.2      Visualization techniques used in the musical field

Music datasets are often huge, containing hundreds of thousands, or millions of songs.
In this context, information visualization techniques might be a suitable alternative for
exploring music contents, identify patterns, trends, and correlations. According to a
recent survey done on visualizations for music related data [8], musical collections
 1
     https://github.com/micbuffa/WasabiDataset/blob/master/ontology/wsb-1.0.ttl
                 Fig. 1: Visualization of artists by their MusicBrainz id


(albums, playlists, music archives) are mostly visualized through the means of map vi-
sualizations, on the other hand musical works are represented by different and special
techniques such as glyph, some of which cannot be grouped explicitly. Musicians are
viewed most often through graphs and timelines, while 3D rendering is used for instru-
ments. Charts can be used for all data types listed. Sunbursts, Node-Link Trees, Bubble
Charts and Treemaps are used to explore music rankings using interactive visualization
[9]. They all allow to represent music data content (artist’s name, tracks’ position,...)
and music genre. As far as audio analysis is concerned, several approaches have been
considered to determine the structure of a song [14] [11] and classify them (intro,
theme, verse, chorus, solo and outro) based on different musical dimensions such as
melody, harmony, rhythm, and timbre. We cite the following visualization techniques:
recurrence plots, arc diagrams, chroma features and rythmograms.


2.3   Types of data quality defects found using visualizations


Whilst inconsistencies are not frequent in the WASABI dataset, we were able to spot
some of them during the exploration process using a visualization tool called MGEx-
plorer. Figure 1 illustrates an example of duplicated entry for artists found during the
exploration process. Hereafter we present a short list of the types of data quality defects
we have found. We classify the problems in two categories, as follows:
    1) Intrinsic defects encompass cases of duplicated data, missing data, disambigua-
tion of artists with similar names, wrong formats for dates, conflicts between values of
the same property collected from different data sources, broken links, etc.
    2) Scenario problems related are specific to the application domain, such as: i) the
same song with more than one producer or performer, according to the source; ii) a
big difference between the number of songs produced in one single year, compared to
others years (is it normal - related to some real life events? Is it an error during data
collection?); iii ) songs classified in multiples genres, which might be the expression
of errors in the classification of divergence of opinions among how songs could be
classified.
2.4    Methods to assess data quality, to complete and to fix the dataset
Some of the methods we found in the literature allows both the detection and the cor-
rection of dataset inconsistencies, as follows:

    – Sourcing knowledge [1]: a solution that takes advantage of the development of the
      web and online communities to help fill blanks in a dataset. It can also be used to
      detect and fix outliers in a dataset.
    – Support Vector Machines [13] and matrix factorization: methods that have been
      used to predict entities’ types.
    – association rule mining [7] and neural network : can be used to predict relations
      based on chains of other relations,
    – Retrospective evaluation: an approach that allows people to assess a value by saying
      whether it is true or false.
    – Reasoning: inference techniques are highly used in the semantic web community
      to check for errors in knowledge graphs, by adding rules and restrictions on the
      ontology, provided that the latter is rich.
    – Interquartile range and kernel density estimation: these methods are used to correct
      numerical outliers in a dataset.
    – DeFacto [10]: one of the few methods to correct a dataset, it assigns confidence
      scores to statements, based on their occurences in different web pages.
    – Fact validation through consensus measurement [12]: an automatic approach using
      knowledge graph interlinks to detect erroneous numerical values. It exploits the
      links between identical resources and apply different matching functions between
      the properties of individual sources. Facts in a knowledge graph are assumed to be
      wrong if several other sources have a consensus for a contradictory fact.
    – Graph embedding TransE [4]: a method that allows link prediction on large databases
      by interpreting relationships as translations operating on the low-dimensional em-
      beddings of the entities.

Whilst several methods for detecting problems with data quality exist, our research
questions are focused on how to communicate these problems to the users. For example,
it is a duplicated record and not a homonyms artists. In the example given by Figure 1,
the disambiguation can be made by allowing users to explore other attributes associated
to the artists.


3     Our approach
As we shall see, detecting and solving data quality problems might require a decision-
making processing involving users. For example, different algorithms might provide di-
vergent analysis for music genre and the correct classification will depend on the users’
needs. For that, we propose to put users in the loop for detecting data quality problems
and for deciding how the problems must be solved. Moreover, rather than propose com-
pletion/repairing methods (e.g., link prediction, duplication detection, missing URL,
etc.), our approach allows users to apply existing methods and visualize the outcomes
in terms of repairs proposed; ultimately users should be able to check if the corrections
                                 Fig. 2: Methodology


fulfill their needs (or not). Figure 2 presents a view at glance of our approach for de-
tecting and solving quality problems whilst exploring music dataset using visualization
techniques. Our approach encompasses the following premises:

 – Allow continuous integration of data from several sources, including results of au-
   dio and song lyrics analysis; In addition to standard music data, we will be adding
   emotion analysis, structure detection (verse, chorus, etc.), topic analysis, etc.;
 – Use information visualization techniques for multivariate data (such as MGEx-
   plorer and Parallel Coordinates) to represent all sort of data;
 – Identify the problems with data quality during the exploration process using vi-
   sualization techniques; Figure 1 shows an example of a problem detected in the
   WASABI dataset using MGExplorer, which is a duplicate artist.
 – Solutions for detecting problems can be: manual, automatic or both;
 – Visualization techniques can accommodate the results of data correction adding an
   explanation to the context of use;
 – Compare the results of methods for detecting problems, allowing the users to decide
   about the outcomes of the corrections proposed. Validation of data correction/com-
   pletion can be done quantitatively, by running against some measures (i.e number
   of missing properties, number of artists that require disambiguation) or qualita-
   tively (i.e present a before/after visualization, and do user testing with real users
   to evaluate the refined data compared to what we had previously and evaluate the
   tools themselves).
4    Discussion and future work
This paper presented a preliminary work towards the visualization of music datasets.
It includes a discussion about problems and potential solutions for integration methods
for fixing data quality problems along the visualization process. As we have discussed,
these data quality problems arose by continuous data integration (ex. new songs, new
analysis of old songs, conflicts detected over time...). So that, we claim that data quality
of music datasets is part of a continuous curation process rather than a definite state of
the dataset. For that, we propose a methodology aiming at visualizing music datasets,
highlighting problems found during the exploration process, as well as allowing to com-
pare solutions to fix those problems. The next steps should be to deal with integration
of corrective methods for fixing problems and support the versioning of data (a main
requirement for comparison of the methods’ outcomes). For this purpose, MGExplorer
will be used as a proof of concept to illustrate the feasibility of the methodology pro-
posed. We are currently working on running prototype that could be used to test our
hypothesis with end users. Finally, despite the fact that our work is focused on the mu-
sic dataset WABASI, we suggest that the solution might be suitable to other types of
datasets to be explored in the future.

References
 1. M. Acosta and al. Crowdsourcing linked data quality assessment. In International semantic
    web conference, pages 260–276, 2013.
 2. A. Allik, M. Mora-Mcginity, G. Fazekas, and M. Sandler. Musicweb: an open linked seman-
    tic platform for music metadata. In Proc. 15th International Semantic Web Conf., 2016.
 3. A. Allik, F. Thalmann, and M. Sandler. Musiclynx: Exploring music through artist similarity
    graphs. In Companion Proceedings of the The Web Conference, 2018.
 4. N. U. ”Antoine Bordes and al”. Translating embeddings for modeling multi-relational data.
    pages 2787–2795, 2013.
 5. T. ”Bertin-Mahieux and al”. The million song dataset. pages 591–596, 2011.
 6. M. BUFFA and al. The wasabi dataset: cultural, lyrics and audio analysis metadata about 2
    million popular commercially released songs. 2021.
 7. U. G. ”Jochen Hipp and G. Nakhaeizadeh”. Algorithms for association rule mining – a
    general survey and comparison. In ACM SIGKDD Explorations Newsletter, 2000.
 8. R. KHULUSI, J. KUSNICK, C. MEINECKE, and al. A survey on visualizations for musical
    data. In Computer Graphics Forum, pages 82–110, 2020.
 9. C. F. Leandro Guedes. Exploring music rankings with interactive visualization leandro. In
    Proceedings of the Symposium on Applied Computing, pages 214–219, 2017.
10. J. Lehmann and al. Defacto – deep fact validation. In International semantic web conference.
    Springer, Berlin, Heidelberg, pages 312–327, 2012.
11. J. Paulus, M. Müller, and A. Klapuri. Audio-based music structure analysis. In Proc. of the
    11th International Society for Music Information Retrieval Conf., pages 625–636, 2010.
12. E. M. Shuangyan Liu, Mathieu d’Aquin. Towards linked data fact validation through mea-
    suring consensus. In CEUR Workshop Proceedings, 2015.
13. J. Sleeman and T. Finin. Type prediction for efficient coreference resolution in heterogeneous
    semantic graphs. In IEEE 7th International Conference on Semantic Computing, pages 78–
    85, 2013.
14. H. H. Wu and J. P. Bello. Audio-based music visualization for music structure analysis. In
    Proceedings of the 7th Sound and Music Computing Conference, 2010.