<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Exploiting ontologies for deep learning: a case for sentiment mining</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Stephan</forename><surname>Raaijmakers</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Data Science Department</orgName>
								<orgName type="institution">TNO</orgName>
								<address>
									<addrLine>Anna van Buerenplein 1, The Hague</addrLine>
									<postCode>2595 DA</postCode>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Christopher</forename><surname>Brewster</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Data Science Department</orgName>
								<orgName type="institution">TNO</orgName>
								<address>
									<addrLine>Anna van Buerenplein 1, The Hague</addrLine>
									<postCode>2595 DA</postCode>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Exploiting ontologies for deep learning: a case for sentiment mining</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">EFCDF290A8EE63D121C75CF30043D4CB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:03+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present a practical method for explaining deep learning-based text mining with ontology-based information. Our approach uses the recently proposed OntoSenticNet ontology for sentiment mining, and consists of a composite deep learning classifier for sentiment mining, endowed with an ontology-driven attention module. The attention module analyzes the attention the neural network pays to semantic labels assigned to bigrams in input texts.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction and approach</head><p>Deep learning continues to achieve state-of-the-art performance in a variety of domains, such as image analysis and text mining. Despite this success, deep learning models remain elusive: it is hard to understand what knowledge they represent and how they reach their decisions (see <ref type="bibr" target="#b2">[3]</ref> for discussion). The field of explainable AI is gaining traction, and promising results have been reported with attention-based models <ref type="bibr" target="#b3">[4]</ref> and latent-space analysis <ref type="bibr" target="#b6">[7]</ref>. The link between ontologies and deep learning is actively being explored: for instance, <ref type="bibr" target="#b5">[6]</ref> addresses the extraction of OWL information from raw text with deep learning, and <ref type="bibr" target="#b1">[2]</ref> applies deep learning to ontology extraction. Our approach attempts to leverage the semantic information in ontologies to explain deep text mining, using neural attention and word embeddings. Ontologies usually contain structured, encyclopedic knowledge, arranged in a semantic, conceptual structure. One such ontology is the recently proposed sentiment ontology OntoSenticNet <ref type="bibr" target="#b0">[1]</ref>, an extension of the SenticNet ontology. SenticNet (Figure <ref type="figure" target="#fig_0">1(a)</ref>) links entities via an intermediate concept level (consisting of semantic categories and relations) to an affective level describing sentiment-based associations, like sadness or joy. OntoSenticNet uses SenticNet to derive affective associations for words and phrases. It is automatically compiled from affective analyses performed with WordNet-Affect, Open Mind Common Sense and GECKA. Figure <ref type="figure" target="#fig_0">1</ref>(b) lists the OntoSenticNet entry for "wrong food". 
The primitiveURI nodes contain the affective labels associated with the multi-word expression "wrong food". The semantics nodes express associations with other NamedIndividuals (expressions), based on corpus-based evidence such as collocations, and on the static knowledge contained in SenticNet. We embed the ontology information directly into the sentiment analysis process, combining it with non-ontological information such as textual features. Taking advantage of the attention a neural network pays to the extra ontology-based information allows us to decompose its decisions semantically. We start (Figure <ref type="figure" target="#fig_0">1</ref>(c)) by generating vector representations of our input data, using 100-dimensional GloVe vectors <ref type="bibr" target="#b4">[5]</ref>, derived from 6 billion words of a 2014 English fragment of Wikipedia. Every document is represented as the sum of the GloVe vectors of its constituent words, normalized by document length. Subsequently, we chunk every document into bigrams, and perform a beam search over the semantically labeled bigrams in the OntoSenticNet ontology. As semantic labels for bigrams, we use the primitiveURI labels; every combination in OntoSenticNet generates a unique label. To cater for bigrams without overt affective labels, we randomly took 5,000 bigrams from a BBC news corpus<ref type="foot" target="#foot_0">1</ref>, and labeled these bigrams as 'bbc'. This approach yields, for every dataset we use, a unique set of semantic labels. Restricting our use of OntoSenticNet to bigrams allows us to look for contextual matches rather than word-based matches, without running into sparsity: OntoSenticNet contains 22,935 bigram expressions, and only 3,104 expressions longer than 2 words. The majority of OntoSenticNet entries consists of unigrams (26,912 entries). 
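As a minimal sketch of this representation step, the length-normalized GloVe document vector and the bigram chunking can be written as follows; the `glove` table here is a hypothetical stand-in for the real 100-dimensional embeddings:

```python
import numpy as np

# Hypothetical embedding table: word -> 100-dimensional vector
# (stands in for the pretrained GloVe vectors).
glove = {w: np.random.rand(100) for w in "the prime rib was wrong food".split()}
DIM = 100

def doc_vector(tokens, embeddings, dim=DIM):
    """Sum the embedding vectors of a document's words, normalized by document length."""
    vec = np.zeros(dim)
    for tok in tokens:
        vec += embeddings.get(tok, np.zeros(dim))
    return vec / max(len(tokens), 1)

def bigrams(tokens):
    """Chunk a document into consecutive bigrams."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

doc = "the prime rib was wrong food".split()
v = doc_vector(doc, glove)
print(v.shape)           # (100,)
print(bigrams(doc)[:2])  # [('the', 'prime'), ('prime', 'rib')]
```

Out-of-vocabulary words simply contribute a zero vector here; the actual handling of unknown words is not specified in the text.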
The beam search operation attempts to retrieve, for any combination (100 in total) of the 10 most similar words per word in the bigram, an existing bigram from OntoSenticNet. For example, 'bad dinner' is not in OntoSenticNet, but one of its GloVe expansions ('wrong food') is. Once such a hit is found, the beam search stops for the given input bigram, the semantic labels are picked up from OntoSenticNet, and search proceeds with the next bigram in the document. The relation between an OntoSenticNet bigram and its labels is stored as an entry in a dictionary. The attested semantic labels for every bigram in a document are counted, and for every document, a count vector (whose length is the total number of labels attested in the corpus) is generated and stored. After processing a labeled text corpus in this manner, every document in the corpus is represented by two vectors: a GloVe-based vector, and a count vector describing the counts of the semantic labels that apply to the bigrams in the document. Subsequently, we train a neural network (Figure <ref type="figure" target="#fig_0">1</ref>(d)) on these joint representations of labeled documents. The network has two branches, each equipped with a separate input layer. The first branch processes the ontology label vectors and computes attention scores (probabilities) for the various labels in the vectors. These attention scores indicate the importance ('attention') the network pays to the ontology labels. They are merged with the GloVe vectors by concatenation, and this derived representation is used by the second branch to learn the labeling of documents with sentiment labels. The attention probabilities are optimized end-to-end during this process (they are part of the overall weight optimization problem the network solves). Once learning is complete, the attention scores the trained network computes for every test document are extracted from the network, and an image is generated that displays the scores. 
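The beam search over GloVe expansions and the resulting label counts can be sketched as follows; the `neighbours` table and the single OntoSenticNet entry are illustrative stand-ins for the real resources, and including each word among its own expansions is an assumption:

```python
from collections import Counter
from itertools import product

# Hypothetical stand-ins: an OntoSenticNet lookup (bigram -> primitiveURI
# labels) and a GloVe nearest-neighbour table.
onto_bigrams = {("wrong", "food"): ["disgust#anger"]}
neighbours = {
    "bad":    ["wrong", "poor", "terrible"],
    "dinner": ["food", "meal", "lunch"],
}

def expand(word, k=10):
    """Return the word plus up to k of its GloVe neighbours."""
    return [word] + neighbours.get(word, [])[:k]

def label_bigram(bigram, ontology, k=10):
    """Beam search: try combinations of neighbours of the two words and stop
    at the first combination that is a known OntoSenticNet bigram."""
    for w1, w2 in product(expand(bigram[0], k), expand(bigram[1], k)):
        if (w1, w2) in ontology:
            return ontology[(w1, w2)]
    return ["bbc"]  # fallback label for bigrams without affective coverage

def label_counts(doc_bigrams, ontology):
    """Semantic-label count vector (as a Counter) for one document."""
    counts = Counter()
    for bg in doc_bigrams:
        counts.update(label_bigram(bg, ontology))
    return counts

print(label_bigram(("bad", "dinner"), onto_bigrams))  # ['disgust#anger']
```

In the paper's setup the 'bbc' fallback labels come from the separately sampled BBC bigrams; here the fallback is collapsed into the search for brevity.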
We applied our system to a variety of sentiment labeling datasets: a set of UCI datasets<ref type="foot" target="#foot_1">2</ref> comprising Yelp, Amazon product and IMDB movie reviews. In addition, we trained and tested on a subjectivity dataset<ref type="foot" target="#foot_2">3</ref>.</p></div>
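The two-branch network described above can be sketched in the Keras functional API; the layer sizes, label count, and binary sentiment output are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from tensorflow.keras import layers, Model

n_labels, glove_dim = 50, 100  # assumed dimensions for illustration

onto_in  = layers.Input(shape=(n_labels,), name="onto_labels")
glove_in = layers.Input(shape=(glove_dim,), name="glove_doc")

# Branch 1: attention probabilities over the ontology labels.
attention = layers.Dense(n_labels, activation="softmax", name="attention")(onto_in)

# Merge the attention scores with the GloVe document vector by concatenation;
# branch 2 learns the sentiment labeling from the merged representation.
merged = layers.Concatenate()([attention, glove_in])
hidden = layers.Dense(64, activation="relu")(merged)
output = layers.Dense(1, activation="sigmoid", name="sentiment")(hidden)

model = Model(inputs=[onto_in, glove_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")

# After training, the attention scores for a test document can be read out
# via a sub-model that stops at the attention layer.
attn_model = Model(inputs=onto_in, outputs=attention)
scores = attn_model.predict(np.ones((1, n_labels)), verbose=0)
print(scores.shape)  # (1, 50)
```

Because the attention layer sits on the path to the sentiment output, its weights are optimized end-to-end with the rest of the network, as the text describes.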
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Results</head><p>Some illustrative results are listed in Figure <ref type="figure" target="#fig_1">2</ref>. For the complex emotion expressed in the sentence The only thing I did like was the prime rib and the dessert section, the OntoSenticNet labels anger, sadness, disgust and surprise score relatively high. The sentences We'd definitely go back here again and Will go back next trip out both score high for the joint label joy#surprise. The negative sentiment of ...least think to refill my water before I struggle to wave you over for 10 minutes is significantly underpinned by disgust and anger labels. The attention probabilities extracted from our classifier may thus serve to decompose monadic sentiment labels into much richer and more varied descriptions, enhancing the explainability of monadic sentiment labeling. The explanatory advantages of our system will be assessed in future work by submitting the generated analyses to human evaluators in a task-based evaluation setting, and by displaying the underlying words and phrases the model uses for sentiment decomposition. Our code will be shared at https://github.com/stephanraaijmakers/deeptext.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. SenticNet, OntoSenticNet, our processing pipeline and model architecture.</figDesc><graphic coords="2,185.65,284.72,129.45,126.52" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Sample attention-based analyses.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://mlg.ucd.ie/datasets/bbc.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.cs.cornell.edu/people/pabo/movie-review-data/</note>
		</body>
		<back>

			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The research reported in this paper has been carried out within the Research Programme Applied AI of TNO.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">OntoSenticNet: A commonsense ontology for sentiment analysis</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dragoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cambria</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Computational Intelligence Magazine</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Deep learning for ontology reasoning</title>
		<author>
			<persName><forename type="first">P</forename><surname>Hohenecker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lukasiewicz</surname></persName>
		</author>
		<idno>CoRR abs/1705.10342</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">The mythos of model interpretability</title>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">C</forename><surname>Lipton</surname></persName>
		</author>
		<idno>CoRR abs/1606.03490</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Towards better analysis of machine learning models: A visual analytics perspective</title>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Visual Informatics</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="48" to="56" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Glove: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">EMNLP</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="1532" to="1543" />
			<date type="published" when="2014-01">01 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Ontology learning in the deep</title>
		<author>
			<persName><forename type="first">G</forename><surname>Petrucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ghidini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rospocher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">20th International Conference on Knowledge Engineering and Knowledge Management - Volume 10024</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag New York, Inc</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="480" to="495" />
		</imprint>
	</monogr>
	<note>EKAW 2016</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Investigating the interpretability of hidden layers in deep text mining</title>
		<author>
			<persName><forename type="first">S</forename><surname>Raaijmakers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sappelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kraaij</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SEMANTICS</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="177" to="180" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
