[CL-Aff Shared Task] Squared English Word: A Method of Generating Glyph to Use Super Characters for Sentiment Analysis

Baohua Sun, Lin Yang, Catherine Chi, Wenhan Zhang, and Michael Lin

Gyrfalcon Technology Inc., 1900 McCarthy Blvd Suite 208, Milpitas, CA, 95035, US
baohua.sun@gyrfalcontech.com

Abstract. The Super Characters method addresses sentiment analysis problems by first converting the input text into images and then applying 2D-CNN models to classify the sentiment. It achieves state-of-the-art performance on many benchmark datasets. However, it is not as straightforward to apply to Latin languages as to Asian languages. Because the 2D-CNN model is designed to recognize two-dimensional images, it performs better when the inputs are in the form of glyphs. In this paper, we propose SEW (Squared English Word), a method that generates a squared glyph for each English word by drawing a Super Characters image of each word at the alphabet level, combines the squared glyphs into a whole Super Characters image at the sentence level, and then applies a CNN model to classify the sentiment of the sentence. We applied the SEW method to the Wikipedia dataset and obtained a 2.1% accuracy gain over the original Super Characters method. In this CL-Aff shared task on the HappyDB dataset, we applied Super Characters with the SEW method and obtained 86.9% accuracy for agency classification and 85.8% for social classification on a validation set built from an 80%:20% random split of the given labeled dataset.

Keywords: Super Characters · Squared English Word · Text Classification.

1 Introduction

The need to classify sentiment arises in many different problems in customer-related marketing fields. Super Characters [8] is a two-step method for sentiment analysis. It first converts text into images; it then feeds the images into CNN models to classify the sentiment.
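The first step of the two-step pipeline can be pictured as a layout computation: each character of the input text is assigned a cell on a square canvas, and the resulting image is fed to a 2D-CNN. The following is a minimal sketch of that layout step; the canvas size, grid dimensions, and function name are our own assumptions for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch of the Super Characters layout step: assign each
# character of the input text a cell on a fixed grid over a square canvas.
# canvas=224 and grid=8 are assumed values, chosen only for illustration.

def super_characters_layout(text, canvas=224, grid=8):
    """Return (char, x, y) draw positions for text on a square canvas."""
    cell = canvas // grid                 # side length of one character cell
    positions = []
    for i, ch in enumerate(text[: grid * grid]):   # truncate overflow text
        row, col = divmod(i, grid)
        positions.append((ch, col * cell, row * cell))
    return positions

layout = super_characters_layout("Last month my son got his first trophy")
print(layout[0])   # first character placed at the top-left cell
```

An actual implementation would then render each character at its `(x, y)` position with a drawing library and pass the image to the CNN classifier.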
Sentiment classification performance on large text contents from customer online comments shows that the Super Characters method is superior to other existing methods, including fastText [7], EmbedNet, OnehotNet, and linear models [10]. For example, on the JD binary dataset [10], which collects Chinese shopping reviews evenly split into positive and negative, the Super Characters method obtained an accuracy of 92.20% while the best existing method obtained 91.28%. On the Rakuten binary dataset [10], which collects Japanese shopping reviews evenly split into positive and negative, Super Characters obtained 94.85% accuracy, compared to 94.55% for the best existing method. On yet another dataset, 11st Binary [10], which collects Korean shopping reviews evenly split into positive and negative, the Super Characters method achieved 87.6% accuracy compared to 86.89% for the best existing method.

Fig. 1. Demonstrations of the raw Super Characters method and the Squared English Word method. We use the same example input to illustrate our idea. The raw text input sentence is: "Last month my son got his first trophy in the tennis match and i was very happy and he was very excited to see me his trophy and i took him out for dinner and spend the evening happily with him." (a) Raw Super Characters method for an English sentence at the alphabet level. (b) Raw Super Characters method for an English sentence at the alphabet level, changing lines to avoid breaking words. (c) Squared English Word (SEW) method with 6x6 words per image. (d) Squared English Word method with attention on the first four words. (e) Squared English Word method using both the happy moment text and profile information: age 36, country India (IND), married (m), male (m). (f) Squared English Word method using the happy moment text and attended profile: age 36, country India (IND), married (m), male (m).
The Super Characters method also shows that a model pretrained on a larger dataset helps improve accuracy when the CNN model is finetuned on a smaller dataset. Compared with a Super Characters model trained from scratch, the finetuned one improves accuracy from 95.7% to 97.8% on the well-known Chinese dataset of Fudan Corpus.

However, there are a few challenges in using the Super Characters method for Latin-language inputs. First, the Super Characters method can be directly applied to Asian languages with glyph characters, such as Chinese, Japanese, and Korean, but not in such a straightforward fashion to Latin languages such as English. This is because the CNN model connected to the Super Characters images is designed to recognize two-dimensional images, and works best when the inputs are glyphs in a square form. Languages like Chinese build their language systems upon logograms, which are symbols or characters that represent a phrase or word. If we directly apply the Super Characters method to sentences in the English language, the resulting image is shown in Figure 1a. Alternatively, as shown in Figure 1b, we can change lines to avoid splitting a word between two lines; in either case, it is harder for the CNN model to recognize the content.

In addition, attention models have succeeded in various fields [2, 9]. How to employ the idea of attention is another challenge for Super Characters, because the second step of the Super Characters method feeds the image of the text to a 2D-CNN model for classifying sentiment, and it is difficult to add an attention architecture to a CNN model in which the Super Characters images are processed in parallel.

This paper borrows ideas from both Super Characters and attention. For the first challenge, we convert each English word into a glyph, such that each word only occupies the pixels within a designated square area. The resulting algorithm is named Squared English Word (SEW), as shown in Figure 1c.
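The core of the SEW idea, packing the letters of one word into a single square area so that the word reads as one glyph, can be sketched as a per-word layout computation. The function below is our own illustrative sketch; the square size, the near-square letter grid, and the function name are assumptions, not the paper's exact drawing routine.

```python
import math

# Hypothetical sketch of the SEW word-to-glyph step: arrange the letters
# of one English word in a near-square grid inside a designated square
# area, so the whole word occupies a single glyph cell. square=32 is an
# assumed glyph side length, used only for illustration.

def word_glyph_layout(word, square=32):
    """Return (letter, x, y) positions inside one square glyph area."""
    n = max(1, math.ceil(math.sqrt(len(word))))   # letters per row/column
    cell = square // n                            # side of one letter cell
    return [(ch, (i % n) * cell, (i // n) * cell)
            for i, ch in enumerate(word)]

# "happy" (5 letters) fits a 3x3 letter grid inside its glyph square.
print(word_glyph_layout("happy"))
```

Because every letter keeps its own cell, the mapping is one-to-one: the original word can be read back from the glyph, which matches the information-loss requirement stated in Section 2.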
For the second challenge, we add an attention scheme to the first step of the Super Characters method, i.e. during the generation of the Super Characters image. In the original Super Characters method, all the text drawn on the image is given the same size, and hence the same degree of attention when fed into the CNN model. We add the attention scheme by allocating larger spaces to important words, e.g. those at the beginning of each sentence. SEW with attention is shown in Figure 1d. We describe the details of how to generate these images in the next section.

The CL-AFF Shared Task [6] is part of the Affective Content Analysis workshop at AAAI 2019. It builds upon the HappyDB dataset [1], which contains 10,560 samples of happy moments. Each sample is a text sentence describing a happy moment in English, and each sample has two binary classification labels: Agency? (Yes|No) and Social? (Yes|No). In this paper, we apply SEW and SEW with attention to this dataset to classify the input texts.

2 Squared English Word method

The original Super Characters method works well if the characters of a language are glyphs, and Asian characters in Chinese, Japanese, and Korean are written in a square form. In this work, we extend the original idea of Super Characters [8] by preprocessing each English word into a squared glyph, just like an Asian character. To avoid information loss, the preprocessing should be a one-to-one mapping, i.e. each original English word can be recovered from its converted squared glyph. For the text classification task, we propose the SEW method for English sentence input as described in Algorithm 1.
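The attention scheme above amounts to assigning a larger glyph square to the words deemed important. A minimal sketch of that assignment, assuming the "first four words" rule from Figure 1d and assumed square sizes, is:

```python
# Hypothetical sketch of SEW with attention: the first n_attend words of
# a sentence get a larger glyph square than the rest. n_attend=4 follows
# the Figure 1d example; big=64 and small=32 are assumed pixel sizes.

def attention_sizes(words, n_attend=4, big=64, small=32):
    """Return (word, square_side) pairs, enlarging the attended words."""
    return [(w, big if i < n_attend else small)
            for i, w in enumerate(words)]

sizes = attention_sizes("Last month my son got his first trophy".split())
print(sizes[0])   # an attended word with the larger square
print(sizes[5])   # an unattended word with the smaller square
```

Other attention policies (e.g. enlarging profile fields, as in Figure 1f) would only change which indices receive the larger square.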
Algorithm 1: SEW image generation
Input: text input: a string of English words
Output: Sentiment Classification Result
Initialization: start a blank image and set the font to draw Super Characters with; set a cut-length of the words; set counter = 0; set current word = the first word in the text input; set the current word location, which is a square area, and get the current word area as the area of pixels for the current word location;
while not at end of the input text and counter
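The tail of Algorithm 1 is cut off above, so the following is only a sketch of the sentence-level loop in the spirit of the initialization shown, under our own assumptions: a 6x6 word grid as in Figure 1c, a cut-length equal to the grid capacity, and an assumed canvas size. It computes the square area for each word glyph rather than rendering actual pixels.

```python
# Hypothetical sketch of the sentence-level SEW loop: walk the words of
# the input, placing each word's glyph square on a grid until the text
# ends or the cut-length is reached. grid=6 follows the 6x6-words-per-
# image example of Figure 1c; canvas=224 and cut_length=36 are assumed.

def sew_sentence_layout(text, cut_length=36, canvas=224, grid=6):
    """Return (word, x, y, square_side) placements for a SEW image."""
    words = text.split()
    cell = canvas // grid             # side of one word's square area
    placements = []
    counter = 0
    while counter < len(words) and counter < cut_length:
        row, col = divmod(counter, grid)
        placements.append((words[counter], col * cell, row * cell, cell))
        counter += 1
    return placements

print(sew_sentence_layout("Last month my son got his first trophy")[:2])
```

Each placement would then be filled in by drawing the word's letters inside its square (the word-to-glyph step), after which the finished image is fed to the CNN for classification.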