<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Affective Content Classification using Convolutional Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <email>daniel.claeser@fkie.fraunhofer.de</email>
          <aff>Fraunhofer FKIE, Fraunhoferstrasse, Wachtberg, Germany</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We present a three-layer convolutional neural network for the classification of the two binary target variables 'Social' and 'Agency' in the HappyDB corpus, exploiting the lexical density of a closed domain and a high degree of regularity in linguistic patterns. Incorporating demographic information is shown to improve classification accuracy. Custom embeddings learned from additional unlabeled data perform competitively with established pre-trained models based on much more comprehensive general training corpora. The top-performing model achieves accuracies of 0.90 for the 'Social' and 0.875 for the 'Agency' variable.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Unsupervised Learning</kwd>
        <kwd>GloVe</kwd>
        <kwd>FastText</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The CL-Aff Shared Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], held as part of the Affective Content Analysis
workshop at AAAI 2019, invited participants to analyze and classify the contents
of HappyDB [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a corpus of 100,000 'Happy Moments'. Subtask 1 consisted of
classifying contents with respect to two binary variables, 'Agency' and 'Social',
with 'Agency' indicating whether the author of a happy moment was in control
of events and 'Social' indicating whether additional people were explicitly or
implicitly involved. In addition, an open-ended second subtask invited participants
to share insights from the corpus with respect to 'ingredients of happiness'.
      </p>
      <p>To the best of the author's knowledge, no similar shared task or challenge
has previously been proposed, and while there has been extensive research on
sentiment and affect analysis, the task at hand is very specific, with a scope
limited to pre-classified data describing 'happy moments'. It could therefore
not be approached with established techniques for sentiment or polarity
analysis. It was instead treated as a classification task aiming to detect
semantic ('Social' variable) and syntactic ('Agency' variable) patterns, with
both implicit and explicit concepts present in the data.</p>
      <p>
        In recent years, embedding-based deep learning techniques have gained
momentum and superseded conventional machine learning techniques in a broad
range of linguistic tasks, currently constituting the absolute majority of
publications at the four major venues of computational linguistics [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The use of neural networks employing the technique of vector
embeddings seemed a natural choice.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>Embeddings were expected to extend the language model to abstract
concepts beyond the lexical surface structure.</p>
      <p>
        A comprehensive description of the dataset provided along with informative
basic statistics can be found in the original HappyDB paper ([
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). The following
section describes additional insights into the data structure that proved
relevant for the classification approach and its performance.
      </p>
      <sec id="sec-2-1">
        <title>Analyzed subsets</title>
        <p>It was quickly noted that 95.2% of the provided happy moments were
tagged as submitted from just two countries, the United States (8,378 or
79.3%) and India (1,674 or 15.9%), while the remaining 508 happy moments were
distributed among 69 other countries. In light of this uneven distribution and
the resulting difficulty of obtaining statistically significant insights from
such small samples, only the subsets from these two countries were considered
for further evaluation and additional classification experiments.</p>
      </sec>
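<p>The reported shares follow directly from the stated counts (the labeled-corpus total of 10,560 also appears in the co-occurrence table in Section 3); a quick arithmetic check:</p>

```python
# Counts of labeled happy moments by country tag, as reported in the text.
counts = {"USA": 8378, "IND": 1674, "other (69 countries)": 508}
total = sum(counts.values())  # 10,560 labeled moments

shares = {c: round(100 * n / total, 1) for c, n in counts.items()}
print(total)   # 10560
print(shares)  # {'USA': 79.3, 'IND': 15.9, 'other (69 countries)': 4.8}
print(round(shares["USA"] + shares["IND"], 1))  # 95.2
```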
      <sec id="sec-2-2">
        <title>Duplicates</title>
        <p>While the authors of HappyDB took basic cleaning and quality
assurance measures with respect to misspellings and the removal of
non-informative entries, the corpus contains a considerable proportion of
duplicates.</p>
        <p>The corpus contains 1,674 entries with the country tag 'IND', and a
manual inspection of those moments revealed a high number of duplicates.
Removing exact literal duplicates made the subset 391 entries lighter, leaving
1,283 entries. Removing punctuation to further catch small variations in
otherwise identical utterances, like the college example in Table 1, left
1,246 unique entries, reducing the number of available examples for training
and evaluation of the classifier by more than 25%.</p>
        <table-wrap id="table1">
          <label>Table 1.</label>
          <caption>
            <p>The most common duplicate entries among moments with country tag 'IND'.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Occurrences</th><th>Duplicate</th></tr>
            </thead>
            <tbody>
              <tr><td>126</td><td>i went to temple</td></tr>
              <tr><td>100</td><td>i went to shopping</td></tr>
              <tr><td>15</td><td>i went to college.</td></tr>
              <tr><td>15</td><td>i went to college</td></tr>
              <tr><td>13</td><td>the day with my wife</td></tr>
              <tr><td>12</td><td>my boy friend love feeling</td></tr>
              <tr><td>10</td><td>when i am getting ready to [...]*</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The seven most common duplicate entries alone make up 391 (23.3%) of
all moments with country tag 'IND'. Note that the entry "i went to college"
occurs with and without a full stop 15 times each. Additionally, the majority
of these duplicates were submitted along with contradicting demographic
information. While sentences like "i went to college" might indeed have been
submitted by multiple participants, more distinct duplicates like the
irregular pattern second to the bottom, or complex utterances like the example
at the bottom (shortened from originally "when i am getting ready to go to my
office my parents send off with cute smile and say have a nice day and take
care"), were almost certainly submitted multiple times by the same worker.
Even the cleaned-up subset still contains several very similar complex
utterances. Undeniably, the presence of such a high proportion of duplicates
in one category has a considerable distorting effect on the training and
evaluation of a classifier.</p>
        <p>The situation was far less critical for the 'USA' subset of the
corpus, with 208 duplicates amounting to less than 2.5% of its entries. The
overall duplicate ratio over the entire corpus was 6.2%.</p>
        <p>Only the cleaned-up versions of the 'USA' and 'IND' subsets were
considered for further analysis and for training the classifiers.</p>
      </sec>
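<p>The two-stage duplicate removal described above (exact literal duplicates first, then duplicates differing only in punctuation) can be sketched as follows; the normalization details are an assumption, not the exact procedure used:</p>

```python
import string

def dedup(moments):
    """Remove exact duplicates, then duplicates that differ only in punctuation."""
    # Stage 1: exact literal duplicates (dict preserves first-occurrence order).
    exact_unique = list(dict.fromkeys(moments))
    # Stage 2: strip punctuation so 'i went to college.' matches 'i went to college'.
    table = str.maketrans("", "", string.punctuation)
    seen, unique = set(), []
    for m in exact_unique:
        key = m.translate(table).strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique

sample = ["i went to college.", "i went to college",
          "i went to temple", "i went to temple"]
print(len(dedup(sample)))  # 2
```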
      <sec id="sec-2-3">
        <title>Lexical, syntactic and idiomatic properties</title>
        <p>The material provided by participants from the US and India
differed in several linguistic dimensions. The exact linguistic background of
individual authors remained unclear, as both countries are polyglot; however,
it seems reasonable to assume that the majority of US participants are native
speakers of English or highly fluent in the language. The vast majority of
authors submitting from India are, in contrast, assumed to use English as a
second language, with a more diverse linguistic background than the US
participants. Taking a descriptive rather than prescriptive point of view, it
is not of particular interest whether particular patterns in the Indian subset
might be considered correct or appropriate by native and proficient speakers
of English, as long as they are distinct and reproducible enough for a
classifier to learn. The intuition that patterns in this subset might be
distinct enough for the classifier to benefit from learning them separately
was proven correct experimentally.</p>
        <p>American and Indian submissions differed considerably with respect
to syntactic patterns: While statements from US authors contained 13.52 tokens
per sentence on average with a standard deviation of 6.78, Indian statements
contained 12.71 tokens on average with a considerably higher standard
deviation of 10.59, caused e.g. by a larger proportion of particularly long
statements. While the authors were originally instructed to state complete
sentences, the level of compliance varied between the two groups, with e.g. US
authors starting 8.4% of sentences with a gerund form compared to 5.7% of
Indian authors. Tables 2 and 3 show the most common sentence-initial trigrams
in the two groups, demonstrating that US authors use a considerably higher
share of idiomatic expressions such as "i got to" and framing expressions such
as "an event that [made me happy]" and "i was happy", marked in bold. The
Indian statements might in that light tentatively be characterized as more
straightforward. Additional differences involve Indian authors substituting
simple and progressive present for simple past more often than US authors, and
a higher rate of omission of particles such as prepositions. Indian statements
were lexically more dense, with a type-token ratio of 9.67 compared to 8.00 in
US statements.</p>
        <table-wrap id="table2">
          <label>Table 2.</label>
          <caption>
            <p>The most common sentence-initial trigrams in the 'USA' subset
(relative and cumulated shares in percent).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Occurrence</th><th>Trigram</th><th>Relative</th><th>Cumulated</th></tr>
            </thead>
            <tbody>
              <tr><td>216</td><td><bold>i was happy</bold></td><td>2.65</td><td>2.65</td></tr>
              <tr><td>206</td><td>i went to</td><td>2.52</td><td>5.17</td></tr>
              <tr><td>188</td><td>i was able</td><td>2.30</td><td>7.47</td></tr>
              <tr><td>178</td><td><bold>i got to</bold></td><td>2.18</td><td>9.65</td></tr>
              <tr><td>177</td><td>i got a</td><td>2.17</td><td>11.82</td></tr>
              <tr><td>144</td><td>i had a</td><td>1.76</td><td>13.59</td></tr>
              <tr><td>92</td><td>i bought a</td><td>1.13</td><td>14.71</td></tr>
              <tr><td>71</td><td>i received a</td><td>0.87</td><td>15.58</td></tr>
              <tr><td>71</td><td>i found out</td><td>0.87</td><td>16.45</td></tr>
              <tr><td>67</td><td><bold>an event that</bold></td><td>0.82</td><td>17.28</td></tr>
              <tr><td>56</td><td>i made a</td><td>0.69</td><td>17.96</td></tr>
              <tr><td>51</td><td>i watched a</td><td>0.62</td><td>18.59</td></tr>
              <tr><td>43</td><td>i went on</td><td>0.53</td><td>19.11</td></tr>
              <tr><td>43</td><td>i found a</td><td>0.53</td><td>19.64</td></tr>
              <tr><td>42</td><td>i ate a</td><td>0.51</td><td>20.15</td></tr>
              <tr><td>40</td><td>i went out</td><td>0.49</td><td>20.64</td></tr>
              <tr><td>34</td><td>it made me</td><td>0.42</td><td>21.06</td></tr>
              <tr><td>33</td><td>my wife and</td><td>0.40</td><td>21.47</td></tr>
              <tr><td>30</td><td>my husband and</td><td>0.37</td><td>21.83</td></tr>
              <tr><td>30</td><td>i took my</td><td>0.37</td><td>22.20</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Participants in the crowdsourcing process creating the HappyDB
corpus were explicitly asked to state moments that made them happy in single
full sentences. While not all participants strictly complied with those
instructions, the overwhelming majority of statements are in the form of full
declarative sentences. Syntax in the corpus can thus be regarded as fixed and
discarded as a distinct piece of information in the classification process.</p>
        <table-wrap id="table3">
          <label>Table 3.</label>
          <caption>
            <p>The most common sentence-initial trigrams in the 'IND' subset
(relative and cumulated shares in percent).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Occurrence</th><th>Trigram</th><th>Relative</th><th>Cumulated</th></tr>
            </thead>
            <tbody>
              <tr><td>53</td><td>i went to</td><td>4.31</td><td>4.31</td></tr>
              <tr><td>20</td><td>i got a</td><td>1.63</td><td>5.93</td></tr>
              <tr><td>20</td><td>i bought a</td><td>1.63</td><td>7.56</td></tr>
              <tr><td>15</td><td>i went for</td><td>1.22</td><td>8.78</td></tr>
              <tr><td>14</td><td>my happiest moment</td><td>1.14</td><td>9.92</td></tr>
              <tr><td>10</td><td>yesterday i went</td><td>0.81</td><td>10.73</td></tr>
              <tr><td>10</td><td>i met my</td><td>0.81</td><td>11.54</td></tr>
              <tr><td>9</td><td>me and my</td><td>0.73</td><td>12.28</td></tr>
              <tr><td>9</td><td>i was very</td><td>0.73</td><td>13.01</td></tr>
              <tr><td>9</td><td>in the past</td><td>0.73</td><td>13.74</td></tr>
              <tr><td>8</td><td>when i am</td><td>0.65</td><td>14.39</td></tr>
              <tr><td>8</td><td>last month i</td><td>0.65</td><td>15.04</td></tr>
              <tr><td>8</td><td>i got my</td><td>0.65</td><td>15.69</td></tr>
              <tr><td>7</td><td>my best friend</td><td>0.57</td><td>16.26</td></tr>
              <tr><td>7</td><td>i purchased a</td><td>0.57</td><td>16.83</td></tr>
              <tr><td>7</td><td>i had a</td><td>0.57</td><td>17.40</td></tr>
              <tr><td>6</td><td>we bought a</td><td>0.49</td><td>17.89</td></tr>
              <tr><td>6</td><td>the day i</td><td>0.49</td><td>18.37</td></tr>
              <tr><td>6</td><td>bought a new</td><td>0.49</td><td>18.86</td></tr>
              <tr><td>5</td><td>we went to</td><td>0.41</td><td>19.27</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
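<p>The sentence-initial trigram statistics of the kind shown above can be reproduced with a few lines; the mini-corpus and whitespace tokenization here are illustrative assumptions:</p>

```python
from collections import Counter

def initial_trigrams(sentences):
    """Count the first three tokens of each lower-cased sentence."""
    counts = Counter()
    for s in sentences:
        tokens = s.lower().split()
        if len(tokens) >= 3:
            counts[" ".join(tokens[:3])] += 1
    return counts

corpus = ["I went to the park.", "I went to a concert.",
          "I was happy about the news.", "My wife and I cooked."]
counts = initial_trigrams(corpus)
total = sum(counts.values())
for trigram, n in counts.most_common(2):
    print(f"{n}  {trigram}  relative {100 * n / total:.2f}%")
```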
    </sec>
    <sec id="sec-3">
      <title>Experiments and results</title>
      <sec id="sec-3-1">
        <title>Basic considerations and setup</title>
        <p>Given the almost uniform syntactic structure of the corpus with
respect to declarative sentences, a convolutional neural network was
determined to be a more appropriate architecture than a time-step based
approach: Considering syntax more or less fixed relieves the classifier of the
effort of interpreting the complete input as a sequence and allows it to focus
on detecting the presence or absence of features relating to agency or social
participation in the utterance. Two binary classifiers were trained to address
each variable separately. A large search space of configurations was explored,
yielding the following configuration with the best performance in terms of
accuracy: two convolutional layers with 128 filters each with a step size of
5, and a dense layer with 128 units. Applying dropout of 10 and 20% yielded
slight but statistically insignificant improvements. Batch sizes were iterated
in steps of 8, 16, 32, 64 and multiples of 64 up to 1024, with medium batch
sizes of around 384 performing best in the vast majority of configurations.</p>
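<p>The layer stack can be sketched as a single forward pass. This is a minimal NumPy illustration, not the training code: the step size of 5 is interpreted as the convolution window width, and ReLU activations with global max pooling are assumptions the paper does not specify.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """Valid 1-D convolution with ReLU: x (seq_len, d_in), w (width, d_in, d_out)."""
    width, _, d_out = w.shape
    steps = x.shape[0] - width + 1
    out = np.empty((steps, d_out))
    for i in range(steps):
        # Contract the window and input-channel axes against the filter bank.
        out[i] = np.tensordot(x[i:i + width], w, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)

seq_len, emb_dim = 20, 300                      # one embedded happy moment
x = rng.normal(size=(seq_len, emb_dim))

w1 = rng.normal(size=(5, emb_dim, 128)) * 0.01  # conv layer 1: 128 filters, width 5
w2 = rng.normal(size=(5, 128, 128)) * 0.01      # conv layer 2: 128 filters, width 5
wd = rng.normal(size=(128, 128)) * 0.01         # dense layer with 128 units
wo = rng.normal(size=(128, 1)) * 0.1            # binary output ('Agency' or 'Social')

h = conv1d(conv1d(x, w1), w2)
h = h.max(axis=0)                               # global max pooling (assumption)
h = np.maximum(h @ wd, 0.0)
p = 1.0 / (1.0 + np.exp(-(h @ wo)))             # sigmoid probability of 'yes'
print(p.shape)  # (1,)
```

Dropout is omitted because it is inactive at inference time.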
        <p>Table 4 shows overall results for the best-performing
configurations with the architecture described above.</p>
        <p>As higher-dimensional embeddings consistently outperformed
low-dimensional models, only the 300-dimensional models were considered for
further experiments.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Pre-trained and customized embeddings</title>
        <p>
          Three major groups of pre-trained embeddings were used for the initialization
layer of the neural network: FastText by Facebook AI [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], GloVe by Stanford
University [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and custom FastText embeddings trained on the joint set of labeled
and unlabeled HappyDB data provided by the task's authors.
        </p>
        <p>To assess the degree to which the supplied labeled and unlabeled
HappyDB data were able to reflect syntactic and semantic relations of the
domain, in comparison to the broader knowledge of pre-trained embeddings as
distributed by the authors of FastText and GloVe, FastText embeddings of
different dimensionality and with both available approaches, CBOW and
SkipGram, were trained and evaluated as displayed in Table 4.</p>
        <sec id="sec-3-2-1">
          <title>Constructing two binary classifiers</title>
          <p>Based on the aforementioned considerations, one binary classifier
was constructed for each dependent variable, 'Agency' and 'Social', each with
the target values 'yes' or 'no' as labeled in the training data.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Training classifiers on four classes</title>
          <p>Table 5 shows the uneven distribution of the two variables and
their co-occurrences in the corpus, illuminating some basic connections in
agreement with the psychological findings quoted by the authors of HappyDB: A
majority of 73.8% of happy moments involves active participation or control by
the author. Within these moments, an absolute majority of 54.4% involves no
other people than the acting authors themselves. In turn, within the 26.2% of
moments with no active participation of the author, the probability is 74.9%
that other people are involved, reflecting the intuition that in most
instances something, or somebody, needs to cause the happiness after all. This
connection raised interest in the performance of a classifier considering each
combination of the two variables a distinct class, thus forming the four
classes "Agency no, social no", "Agency yes, social no", "Agency no, social
yes" and "Agency yes, social yes". While there is apparently a strong
conditional probability of "Social: yes" given "Agency: no", the significantly
lowered number of samples per class was expected to cause a drop in
performance, especially with only 693 samples for the "Agency no, social no"
class, an assumption that was confirmed by the experimental results.</p>
          <table-wrap id="table5">
            <label>Table 5.</label>
            <caption>
              <p>Co-occurrence of the 'Agency' and 'Social' variables in the corpus.</p>
            </caption>
            <table>
              <thead>
                <tr><th /><th>Social no</th><th>Social yes</th><th>Sum</th></tr>
              </thead>
              <tbody>
                <tr><td>Agency no</td><td>693</td><td>2071</td><td>2764</td></tr>
                <tr><td>Agency yes</td><td>4242</td><td>3554</td><td>7796</td></tr>
                <tr><td>Sum</td><td>4935</td><td>5625</td><td>10560</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>The results of the three top-performing high-dimensional
configurations instantly affirmed those expectations and ended interest in
further experiments: Combining the two variables into four categories
decreased performance, even when evaluating only one variable per category,
well below the results achievable in the binary setting.</p>
        </sec>
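<p>The conditional structure discussed above follows directly from the co-occurrence counts in Table 5:</p>

```python
# Co-occurrence counts of (Agency, Social) from Table 5.
counts = {("no", "no"): 693, ("no", "yes"): 2071,
          ("yes", "no"): 4242, ("yes", "yes"): 3554}
total = sum(counts.values())                        # 10,560 labeled moments

agency_yes = counts[("yes", "no")] + counts[("yes", "yes")]
p_agency = agency_yes / total                       # author in control
p_alone_given_agency = counts[("yes", "no")] / agency_yes
p_social_given_no_agency = counts[("no", "yes")] / (
    counts[("no", "no")] + counts[("no", "yes")])

print(round(p_agency, 3))                  # 0.738
print(round(p_alone_given_agency, 3))      # 0.544
print(round(p_social_given_no_agency, 3))  # 0.749
```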
      </sec>
      <sec id="sec-3-3">
        <title>Training separate classifiers by countries</title>
        <p>The presence of the aforementioned distinct syntactic and lexical
characteristics in the two largest groups by country raised the question of
whether classification performance would benefit from training separate
classifiers for each group. Since only the 'USA' and 'IND' subsets contained
more than 1,000 samples, the exploration was limited to those subsets. Three
separate classifiers were trained: one each for 'IND' and 'USA' with 1,246
samples (the number of available samples for 'IND', chosen to obtain a
balanced setting), and one with 1,246 samples split between the two countries
in proportion to the original full training corpus.</p>
        <table-wrap id="table6">
          <caption>
            <p>Accuracy of country-specific versus mixed training for the 'Agency' variable.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Embedding</th><th>Acc USA</th><th>Acc IND</th><th>Acc Mixed</th></tr>
            </thead>
            <tbody>
              <tr><td>GloVe840</td><td>0.849</td><td>0.857</td><td>0.844</td></tr>
              <tr><td>GloVe6b</td><td>0.849</td><td>0.854</td><td>0.850</td></tr>
              <tr><td>FastText Crawl</td><td>0.852</td><td>0.859</td><td>0.856</td></tr>
              <tr><td>FastText Crawl Subword</td><td>0.852</td><td>0.863</td><td>0.843</td></tr>
              <tr><td>FastText Wiki-News</td><td>0.857</td><td>0.857</td><td>0.845</td></tr>
              <tr><td>FastText Wiki-News Subword</td><td>0.842</td><td>0.845</td><td>0.829</td></tr>
              <tr><td>FastText Wikipedia</td><td>0.850</td><td>0.848</td><td>0.855</td></tr>
              <tr><td>FastText, HappyDB, CBOW</td><td>0.856</td><td>0.859</td><td>0.854</td></tr>
              <tr><td>FastText, HappyDB, Skip</td><td>0.860</td><td>0.857</td><td>0.842</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The results show a modest but statistically significant (confidence
level 0.95) improvement for both language groups, with the moments submitted
under country code 'IND' benefitting considerably more. We suggest this might
be an effect of the more compact syntax patterns described above.</p>
        <table-wrap id="table7">
          <caption>
            <p>Accuracy of country-specific versus mixed training for the 'Social' variable.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Embedding</th><th>Acc USA</th><th>Acc IND</th><th>Acc Mixed</th></tr>
            </thead>
            <tbody>
              <tr><td>GloVe840</td><td>0.895</td><td>0.866</td><td>0.838</td></tr>
              <tr><td>GloVe6b</td><td>0.880</td><td>0.859</td><td>0.821</td></tr>
              <tr><td>FastText Crawl</td><td>0.894</td><td>0.863</td><td>0.830</td></tr>
              <tr><td>FastText Crawl Subword</td><td>0.867</td><td>0.855</td><td>0.820</td></tr>
              <tr><td>FastText Wiki-News</td><td>0.896</td><td>0.859</td><td>0.824</td></tr>
              <tr><td>FastText Wiki-News Subword</td><td>0.843</td><td>0.847</td><td>0.823</td></tr>
              <tr><td>FastText Wikipedia</td><td>0.880</td><td>0.851</td><td>0.823</td></tr>
              <tr><td>FastText HappyDB, CBOW</td><td>0.890</td><td>0.868</td><td>0.832</td></tr>
              <tr><td>FastText HappyDB, Skip</td><td>0.887</td><td>0.864</td><td>0.829</td></tr>
            </tbody>
          </table>
        </table-wrap>
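<p>The summary claim that the separate per-country classifiers beat the mixed classifier on average can be checked from the 'Social' accuracies reported above:</p>

```python
# Accuracy columns (USA, IND, Mixed) for the 'Social' variable, one row per embedding.
rows = [
    (0.895, 0.866, 0.838), (0.880, 0.859, 0.821), (0.894, 0.863, 0.830),
    (0.867, 0.855, 0.820), (0.896, 0.859, 0.824), (0.843, 0.847, 0.823),
    (0.880, 0.851, 0.823), (0.890, 0.868, 0.832), (0.887, 0.864, 0.829),
]
# Column-wise means over all nine embedding configurations.
usa, ind, mixed = (sum(col) / len(col) for col in zip(*rows))
print(f"USA {usa:.3f}  IND {ind:.3f}  Mixed {mixed:.3f}")
# USA 0.881  IND 0.859  Mixed 0.827
```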
        <p>The picture is even clearer for the 'Social' variable. For both
variables, the separate classifiers achieve better performance than their
combined average. However, whether this advantage would persist with larger
training sets has not been investigated.</p>
        <sec id="sec-3-3-1">
          <title>Classification by concepts</title>
          <p>The authors of HappyDB report successful efforts to categorize
the corpus with a set of crowd-sourced category labels. Additionally, they
identified a set of concepts, or topics, of happy moments in a seemingly
rather intuitive and subjective way. To apply a limited test of replicability
to this set of topics, a classifier with the aforementioned architecture was
trained on a subset of the corpus consisting of happy moments labeled with
exactly one concept, limited to concepts with more than 1,000 labeled
examples: career (1,280), entertainment (1,135), family (1,259) and food
(1,007).</p>
        </sec>
        <table-wrap id="table4">
          <label>Table 4.</label>
          <caption>
            <p>Accuracy by embedding model and dimensionality.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Embedding</th><th>Dimensions</th><th>Accuracy</th></tr>
            </thead>
            <tbody>
              <tr><td>GloVe6b</td><td>300</td><td>0.908</td></tr>
              <tr><td>GloVe840</td><td>300</td><td>0.913</td></tr>
              <tr><td>FastText, Wiki-News</td><td>300</td><td>0.912</td></tr>
              <tr><td>FastText, Wiki-News Subword</td><td>300</td><td>0.884</td></tr>
              <tr><td>FastText, Crawl</td><td>300</td><td>0.916</td></tr>
              <tr><td>FastText, Crawl Subword</td><td>300</td><td>0.899</td></tr>
              <tr><td>FastText, Wikipedia</td><td>300</td><td>0.908</td></tr>
              <tr><td>FastText, HappyDB, Skip</td><td>300</td><td>0.915</td></tr>
              <tr><td>FastText, HappyDB, CBOW</td><td>300</td><td>0.892</td></tr>
              <tr><td>GloVe, Twitter</td><td>200</td><td>0.910</td></tr>
              <tr><td>GloVe6B</td><td>200</td><td>0.901</td></tr>
              <tr><td>FastText, HappyDB, Skip</td><td>200</td><td>0.914</td></tr>
              <tr><td>FastText, HappyDB, CBOW</td><td>200</td><td>0.893</td></tr>
              <tr><td>FastText, HappyDB, Skip</td><td>100</td><td>0.911</td></tr>
              <tr><td>FastText, HappyDB, CBOW</td><td>100</td><td>0.889</td></tr>
              <tr><td>GloVe, Twitter</td><td>100</td><td>0.901</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We introduced a rather simplistic architecture to classify the
HappyDB contents with respect to the two binary variables 'Agency' and
'Social'. HappyDB proved to be a high-quality linguistic resource with a high
degree of replicability in terms of machine learning and classification, as
shown by the experimental results both for the target variables defined by the
Shared Task and for the ability to reproduce the concepts introduced by the
HappyDB authors. We observe that while embeddings trained only on HappyDB,
without any external world knowledge supplied, cannot outperform established
general-purpose embeddings such as FastText and GloVe trained on Wikipedia and
crawled web content to a statistically significant degree, they are almost
competitive despite utilizing a database of fewer than 20,000 types, as
opposed to up to 2 million types in the pre-trained embeddings. We observe no
particular benefit for social media embeddings, in accordance with the
assumption that most statements were given in a rather formal register, as
intended by the corpus' authors. Classification appears to benefit from taking
the linguistic backgrounds of different groups of authors into account, and we
recommend cleaning the corpus of remaining duplicates to avoid distortions.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>I would like to express my gratitude to Fahrettin Gokgoz and Albert Pritzkau
of Fraunhofer FKIE and Maria Jabari of University of Bonn for their expertise
and insights supporting the system design and dataset analysis.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Jaidka</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mumick</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chhaya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The CL-A Happiness Shared Task: Results and Key Insights</article-title>
          .
          <source>In: Proceedings of the 2nd Workshop on Affective Content Analysis @ AAAI (AffCon 2019)</source>
          .
          <source>Hawaii</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evensen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golshan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopatenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments</article-title>
          .
          <source>In: Proceedings of LREC 2018</source>
          .
          European Language Resources Association (ELRA), Miyazaki, Japan (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          (pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ) (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hazarika</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Recent trends in deep learning based natural language processing</article-title>
          .
          <source>IEEE Computational Intelligence Magazine</source>
          ,
          <volume>13</volume>
          (
          <issue>3</issue>
          ),
          <fpage>55</fpage>
          -
          <lpage>75</lpage>
          . (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>