<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sentiment Extraction from Financial Public Disclosure Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ali Caner Turkmen</string-name>
          <email>caner.turkmen@boun.edu.tr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bogazici University, Department of Computer Engineering</institution>
          ,
          <addr-line>Bebek, Istanbul, Turkey 34342</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We address the problem of extracting sentiment from financial public disclosure documents, and explore its effects on daily price movements. We take a collection of public disclosure forms submitted by four companies in the Turkish stock market. Using simple classification algorithms, we point to a significant correlation between the content of disclosure texts and the next day's price direction. We discuss the relationship between learned term weights and sentiment by comparing them to a translation of a well-known financial sentiment lexicon.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Using sentiment in financial news to guide investment decisions is a recent field
of interest. Efficient processing of the newswire, financial commentary, social
media and regulatory disclosure documents has been explored with success for
forecasting prices over the short and long terms.</p>
      <p>
        The seminal works of Tetlock [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] draw the first links between the
psychosocial aspects of language in financial news and market outcomes. However, shortly
thereafter, it was noted that the implied sentiment of words in financial texts can
differ significantly from that in generic corpora. Loughran and McDonald [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
introduced the first financial sentiment lexicon learned via statistical methods,
one which was verified recently [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and which this paper reuses.
      </p>
      <p>However, sentiment lexica for financial texts are generally available only in
English. A method for building lexica for newly encountered corpora, contexts and
languages, while taking financial implications into account, is yet to be described.
This is the main question addressed in this paper, where we attempt to trace
the relationship between terms used in mandatory public announcements by
publicly traded companies in Turkey and the market outcomes of the next day.
With the end goal of building a financial sentiment vocabulary and outlining a
methodology for doing so, we take a step towards processing financial news to
guide accurate investment decisions.</p>
      <p>Note that throughout this paper, we use the term "sentiment" liberally. That
is, we do not necessarily point to the psychosocial aspect of terms as is usually done
in natural language processing, but instead focus on the statistical relationships
between documents and market outcomes.</p>
      <p>In this light, we first investigate whether the presence of public disclosure filings
has a statistically significant relationship with returns. We then try to associate
content with meaningful financial signals using several common machine
learning methods. Finally, we investigate the interpretability of the learned vocabularies,
and compare them with a financial sentiment lexicon for English. In doing so, we
explore several methods for learning both interpretable and statistically significant
sentiment vocabularies, in the context of a developing stock exchange.</p>
      <p>In the next section, we introduce the data set. In Section 3, we present the
methodology and results, before concluding in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>Data Set</title>
      <p>In this work, we aim to recover meaningful financial indicators from public
disclosure filings by companies traded in the Turkish stock market. Public disclosure
forms are mandatory announcements made through a central Public
Disclosure Platform by companies traded in Istanbul's stock exchange, Borsa Istanbul
(BIST). Among others, companies are required to report changes in ownership
structure, appointment of management, disclosure of financial reports and
comments on public news and rumors. In this regard, these documents are akin to
Securities and Exchange Commission (SEC) filings in the US.</p>
      <p>We focus exclusively on "Special Announcements" made by companies,
leaving out filings such as periodically circulated financial reports. We gather
announcement texts of four randomly selected companies in BIST, among
constituents of the XU030 index of highest market capitalization stocks. We
exclude banks since their announcements are mainly related to their market making
activities. The selected stocks' ticker codes, names and industries are given in
Table 1.</p>
      <p>We take filings made during the period between November 2012 and
February 2015, totaling 551 trading days. For each company, we merge public
announcements on a given day into a single document. We then label each of the
documents based on the price increase/decrease of the next trading day after
the announcements, with 1 for an increase, and 0 otherwise. This is due to the
observation that most public announcements are filed near the close of trading
hours, and would most likely impact the next day's outcome.</p>
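      <p>The merging and labeling step described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name, the (day index, text) input layout, and the use of close-to-close price changes to define an "increase" are all assumptions made for the example.</p>
      <preformat>
```python
from collections import defaultdict


def build_labeled_documents(announcements, closes):
    """Merge same-day announcements and label them by the next day's move.

    announcements: list of (trading_day_index, text) pairs (hypothetical layout).
    closes: list of closing prices, one per trading day, in order.
    Returns (merged_text, label) pairs, with label 1 if the close rises on the
    following trading day and 0 otherwise.
    """
    by_day = defaultdict(list)
    for day, text in announcements:
        by_day[day].append(text)

    documents = []
    for day in sorted(by_day):
        # The last trading day has no "next day", so it cannot be labeled.
        if day + 1 >= len(closes):
            continue
        label = 1 if closes[day + 1] > closes[day] else 0
        documents.append((" ".join(by_day[day]), label))
    return documents


example = build_labeled_documents(
    [(0, "board meeting"), (0, "dividend decision"), (1, "tender result")],
    [10.0, 10.5, 10.2],
)
print(example)
```
</preformat>
      <p>In this toy input, the two announcements of day 0 become one document, and each document inherits the direction of the following day's price change.</p>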
      <p>
        We exclude a widely used set of function words in the Turkish language [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
and work with a count-based term-document matrix built with a bag-of-words
representation. Effectively, we formulate the question as a binary document
classification problem of the kind often encountered in natural language processing.
      </p>
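      <p>As an illustration of the representation, the sketch below builds a count-based term-document matrix after removing stop words. The whitespace tokenizer, the lowercasing step and the three-word stop list are placeholders chosen for the example; the study relies on the function-word list of [5], which is not reproduced here.</p>
      <preformat>
```python
from collections import Counter

# Placeholder stop words; the paper uses the Turkish function-word list of [5].
STOP_WORDS = {"ve", "ile", "bu"}


def term_document_matrix(documents, stop_words=STOP_WORDS):
    """Return (vocabulary, rows) where rows[d][j] counts vocabulary[j] in document d."""
    counts = []
    for text in documents:
        # Naive whitespace tokenization (an assumption of this sketch).
        tokens = [t for t in text.lower().split() if t not in stop_words]
        counts.append(Counter(tokens))
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


vocab, matrix = term_document_matrix(["Kar ve temettu", "temettu temettu bu"])
print(vocab)
print(matrix)
```
</preformat>
      <p>Each row of the matrix is one merged daily document and each column one vocabulary term, which is the input format expected by the classifiers discussed in the next section.</p>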
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p><bold>Sentiment Extraction</bold></p>
      <p>Before moving on to the classification problem, it is an interesting exercise to
investigate whether the mere appearance of a filing correlates with the next day's price.
We take the distribution of log returns for the entire period, as well as that of
days after filings. For each stock, we provide histograms and p-values for a
two-sample t-test in Figure 1. We observe that the appearance of an announcement
is not consistently followed by higher or lower returns for any of the stocks in
question.</p>
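      <p>The comparison of return distributions can be sketched as below. The text does not state which variant of the two-sample t-test is used, so this example assumes Welch's unequal-variance statistic, and it approximates the p-value with a normal distribution, which is only reasonable for large samples. The price series and the choice of post-filing days are made up for illustration.</p>
      <preformat>
```python
import math
from statistics import NormalDist, mean, variance


def log_returns(prices):
    """Daily log returns r_t = log(P_t / P_{t-1})."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic with a large-sample (normal) p-value."""
    va, vb = variance(sample_a), variance(sample_b)
    se = math.sqrt(va / len(sample_a) + vb / len(sample_b))
    t = (mean(sample_a) - mean(sample_b)) / se
    # Normal approximation to the t distribution (assumption of this sketch).
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


# Toy prices; in the study the series would cover the 551 trading days.
all_days = log_returns([100.0, 101.0, 100.5, 102.0, 101.0, 101.5, 103.0])
after_filing = [all_days[1], all_days[3], all_days[5]]
t_stat, p_value = welch_t(after_filing, all_days)
print(round(t_stat, 3), round(p_value, 3))
```
</preformat>
      <p>A large p-value in such a test is consistent with the observation above that the mere presence of a filing does not shift next-day returns in a consistent direction.</p>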
      <p>
        For each stock, we utilize three common "shallow" machine learning
algorithms to solve the classification problem, making use of scikit-learn [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Finally,
we combine all documents and labels in the data to investigate the existence of a
common lexicon that is indicative of price movements, independent of individual
stocks.
      </p>
      <p>For brevity, we do not discuss the models in detail. However, we note the
following implementation choices:</p>
      <list list-type="bullet">
        <list-item>
          <p>For Logistic Regression, we use L2 regularization, and set the penalty
coefficient to 1. That is, the optimization objective is left as a simple sum of
the cross-entropy loss and an L2 (squared 2-norm) penalty on the parameter vector.</p>
        </list-item>
        <list-item>
          <p>We use the Multinomial Naive Bayes model with Laplace smoothing, see
[<xref ref-type="bibr" rid="ref7">7</xref>, p. 82].</p>
        </list-item>
        <list-item>
          <p>The Support Vector Machine (SVM) [<xref ref-type="bibr" rid="ref8">8</xref>] is used with a linear kernel.</p>
        </list-item>
      </list>
      <p>[Table 2. Columns: EREGL, KCHOL, PETKM, THYAO, ALL. Rows: Buy and Hold, Buy on News, followed by Logistic Regression, Multinomial NB and SVM, each listed twice. Cell values are not recoverable from the source.]</p>
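      <p>To make the Logistic Regression objective concrete, the following sketch evaluates the sum of the cross-entropy loss and an L2 penalty with coefficient 1 for a given parameter vector. The function name and the exact scaling of the penalty are assumptions of this example; libraries differ in how they scale the two terms (scikit-learn, for instance, exposes an inverse regularization strength C), so the numbers are illustrative only.</p>
      <preformat>
```python
import math


def regularized_log_loss(weights, bias, rows, labels, penalty=1.0):
    """Sum of cross-entropy losses plus penalty times the squared 2-norm of weights."""
    loss = 0.0
    for x, y in zip(rows, labels):
        z = bias + sum(w * v for w, v in zip(weights, x))
        # log(1 + exp(-z)) and log(1 + exp(z)) in a numerically stable form.
        shared = math.log1p(math.exp(-abs(z)))
        log1p_exp_neg = max(-z, 0.0) + shared
        log1p_exp_pos = max(z, 0.0) + shared
        # y = 1 contributes -log(sigmoid(z)); y = 0 contributes -log(1 - sigmoid(z)).
        loss += y * log1p_exp_neg + (1 - y) * log1p_exp_pos
    return loss + penalty * sum(w * w for w in weights)


# With all parameters at zero, every document is assigned probability 0.5.
objective = regularized_log_loss([0.0, 0.0], 0.0, [[1, 0], [0, 2]], [1, 0])
print(objective)
```
</preformat>
      <p>Minimizing this quantity over the weights trades off fit against the size of the parameter vector, which is the role the penalty coefficient plays in the setting described above.</p>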
      <p>We operate on limited data, so we take several precautions to prevent
overfitting. First, we do not perform hyperparameter optimization and leave
hyperparameters at their most commonly used values as given above. Second, we refrain from
using deeper models and work only with simpler ones that are easy to interpret.
Interpretability is also a key requirement for easily extracting a lexicon. Finally,
we perform 10-fold cross-validation for each experimental setting.</p>
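      <p>A 10-fold split can be sketched as below. The text does not say whether documents are shuffled or stratified before splitting, so this example simply uses contiguous folds of near-equal size; the function name is hypothetical.</p>
      <preformat>
```python
def k_fold_indices(n_documents, k=10):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    base, extra = divmod(n_documents, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional document.
        size = base + (1 if extra > fold else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_documents))
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
print([len(test) for _, test in folds])
```
</preformat>
      <p>Each document appears in exactly one test fold, so every reported score is computed on data the corresponding model has not seen during fitting.</p>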
      <p>We report results in terms of classification accuracy and precision [<xref ref-type="bibr" rid="ref7">7</xref>, p. 182].
We choose these two metrics due to their direct interpretation in the context
of stock market prediction. Assume the learned models were used to build an
"expert system", or "strategy", that is triggered by the content of disclosure
forms based on the model at hand. Disregarding trading costs, precision would
correspond to the percentage of profitable buys in that specific stock. Accuracy,
on the other hand, can be interpreted as the percentage of profitable positions
if both sides of the trade (long and short) were allowed. We compare our results
to two base cases. First, we report the precision of a strategy that would buy
every day, i.e. buy and hold. We also report the precision of a trading algorithm
that would buy on every day after a news item was announced.</p>
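      <p>The trading interpretation of the two metrics can be illustrated with a small example. The outcome and signal vectors below are invented, and the "always buy" vector stands in for the buy-on-news base case under the assumption that the outcomes listed are those of days following an announcement.</p>
      <preformat>
```python
def precision_and_accuracy(y_true, y_pred):
    """Precision: share of predicted buys (1) that were followed by an increase.
    Accuracy: share of all predictions, long or short, that were correct."""
    buys = [t for t, p in zip(y_true, y_pred) if p == 1]
    precision = sum(buys) / len(buys) if buys else float("nan")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return precision, correct / len(y_true)


outcomes = [1, 0, 1, 1, 0, 0, 1, 0]       # 1 = price rose on the next day
model_signal = [1, 0, 0, 1, 1, 0, 1, 0]   # 1 = model says buy
always_buy = [1] * len(outcomes)           # base case: buy after every news item

print(precision_and_accuracy(outcomes, model_signal))
print(precision_and_accuracy(outcomes, always_buy))
```
</preformat>
      <p>For a strategy that always buys, precision reduces to the share of up days in the sample, which is why the base cases are reported in terms of precision.</p>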
      <p>We present our results in Table 2. Our models outperform the base case for
individual stocks. With all news items combined, we find that support vector
machines are able to yield improved accuracy, although not to a significant
degree. We can then reasonably hypothesize that the most discriminative terms
are powerful in the context of their individual companies or industries, but that
a generic lexicon cannot be recovered using simple models.</p>
      <p><bold>Interpreting Model Parameters</bold></p>
      <p>
        In this section, we explore the relationship between the term weights learned by
one of our models and the sentiment associated with each term in financial contexts.
For this purpose, we first translate the negative terms lexicon given by Loughran
and McDonald [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] into Turkish, which we make available online<sup>1</sup>.
      </p>
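      <p>On the combined data set, we fit a Multinomial Naive Bayes model and
estimate the log odds of a term appearing in a document followed by a decline
in price. Having estimated <inline-formula><tex-math>\hat{p}(w_i \mid c)</tex-math></inline-formula>, where <inline-formula><tex-math>w_i</tex-math></inline-formula> denotes a single word and <inline-formula><tex-math>c</tex-math></inline-formula> the
next day's outcome, we calculate
<disp-formula><tex-math>l(w_i) = \log \frac{\hat{p}(w_i \mid c = 0)}{\hat{p}(w_i \mid c = 1)}</tex-math></disp-formula></p>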
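      <p>The log-odds score can be computed directly from class-conditional term counts. The sketch below assumes add-one (Laplace) smoothing over a shared vocabulary, in line with the Multinomial Naive Bayes setting described earlier; the function name and the toy tokens are illustrative and not taken from the data set.</p>
      <preformat>
```python
import math
from collections import Counter


def term_log_odds(documents, labels, alpha=1.0):
    """l(w) = log p(w | c=0) - log p(w | c=1) under a smoothed multinomial model."""
    counts = {0: Counter(), 1: Counter()}
    for tokens, label in zip(documents, labels):
        counts[label].update(tokens)
    vocabulary = set(counts[0]) | set(counts[1])
    # Laplace smoothing adds alpha to every term count in each class.
    totals = {c: sum(counts[c].values()) + alpha * len(vocabulary) for c in (0, 1)}
    return {
        w: math.log((counts[0][w] + alpha) / totals[0])
        - math.log((counts[1][w] + alpha) / totals[1])
        for w in vocabulary
    }


# Toy tokenized documents; label 0 = decline on the next day, 1 = increase.
docs = [["ceza", "dava"], ["ceza"], ["temettu", "dava"], ["temettu"]]
scores = term_log_odds(docs, [0, 0, 1, 1])
print(sorted(scores.items(), key=lambda kv: kv[1]))
```
</preformat>
      <p>A positive score marks a term that is relatively more frequent in documents followed by a decline, a negative score one that is more frequent before an increase, and a score near zero a term that does not discriminate between the two outcomes.</p>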
      <p>
        We then match the lexicon of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to the terms used in disclosure texts. We
find that for 57% of the 293 terms matched between the two vocabularies, the
log-odds measure is greater than 0, i.e. it implies a decline in price. Although the
majority of terms agree in their psychosocial aspects and market implications,
this is only a weak correlation.
      </p>
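      <p>The matching step amounts to intersecting the translated lexicon with the learned vocabulary and counting positive scores. The following sketch shows that computation on invented scores; the function name, the lexicon entries and the values are hypothetical and do not reproduce the 57% figure.</p>
      <preformat>
```python
def negative_agreement(log_odds, negative_lexicon):
    """Return (number of matched terms, share of them with log odds above 0)."""
    matched = [w for w in negative_lexicon if w in log_odds]
    agreeing = [w for w in matched if log_odds[w] > 0]
    share = len(agreeing) / len(matched) if matched else float("nan")
    return len(matched), share


toy_scores = {"ceza": 0.8, "kriz": -0.2, "dava": 0.1, "temettu": -0.9}
toy_lexicon = ["ceza", "kriz", "dava", "iflas"]  # "iflas" is absent from the scores
n_matched, share_negative = negative_agreement(toy_scores, toy_lexicon)
print(n_matched, share_negative)
```
</preformat>
      <p>A share close to one half, as reported above, means that membership in the negative lexicon carries little information about the sign of the learned score.</p>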
      <p>One possible explanation is that, almost surely, some of the semantics were
lost during translation, leading to added lexical ambiguity. One may also argue
that the market may be "selling the fact", in that the expectation of negative
news may have been priced in prior to the announcement. Combined with our
previous argument that a statistically significant signal can be isolated in
financial news, this leads to the conclusion that the apparent negative meanings of terms
do not necessarily lead to negative outcomes, and that there may be other terms
that are "bearish" but do not "sound" negative.</p>
      <p>
        Upon inspecting the vocabulary of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the one extracted by the simple rule
above, we observe that some of the disagreement is indeed due to lexical
ambiguity. However, some terms appear to have a negative bearing but
a strong positive correlation with price. The inverse also exists, where the term is
neutral despite having a strong negative implication. We give examples in Table
3. Note the appearance of words like "vote" or "retired".
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this work, we provide early evidence of the relationship between mandatory
public disclosure documents and daily market outcomes in the Turkish stock
market. We show that with "shallow" machine learning models, and within the
context of a few randomly selected stocks, one can isolate a signal for the next
day's trade direction. Under more detailed analysis of the lexicon learned by
these models, we find that only a fraction of negative-sounding financial terms
are in fact followed by declines in price.</p>
      <p>There are several next steps to follow this work. The first is to advance
the unigram representation of this work to a more relevant language model,
especially seeing that many financial terms in English translate to noun phrases
in Turkish and vice versa. Second, we will expand the data and implement models
capable of representing highly nonlinear relationships, in order to capture themes
and higher level representations more predictive of market outcomes. Finally, we
will focus on extracting information from such models in order to build a full
financial sentiment lexicon in Turkish, and propose a methodology for doing so
independently of language. Such an exercise will entail generalizing these models
over a much wider set of stocks and news sources.</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption>
          <p>Example terms (column headings are not recoverable from the source).</p>
        </caption>
        <table>
          <tbody>
            <tr><td>fault</td><td>annulment</td><td>payment</td></tr>
            <tr><td>penalty</td><td>dissident</td><td>retired</td></tr>
            <tr><td>crisis</td><td>diminish</td><td>article<sup>2</sup></td></tr>
            <tr><td>stagnate</td><td>fraudulent</td><td>vote</td></tr>
            <tr><td>dangerous</td><td>inquiry</td><td>temporary</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p><sup>1</sup> github.com/canerturkmen/tr nneg lexicon</p>
      <p><sup>2</sup> as in legal text</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>I thank Taylan Cemgil for his invaluable guidance, and the Central Securities
Depository of Turkey (MKK) for making the data available.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Tetlock</surname>
            ,
            <given-names>P.C.</given-names>
          </string-name>
          :
          <article-title>Giving content to investor sentiment: The role of media in the stock market</article-title>
          .
          <source>The Journal of Finance</source>
          <volume>62</volume>
          (
          <year>2007</year>
          )
          <fpage>1139</fpage>
          –
          <lpage>1168</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Tetlock</surname>
            ,
            <given-names>P.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saar-Tsechansky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macskassy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>More than words: Quantifying language to measure firms' fundamentals</article-title>
          .
          <source>The Journal of Finance</source>
          <volume>63</volume>
          (
          <year>2008</year>
          )
          <fpage>1437</fpage>
          –
          <lpage>1467</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Loughran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks</article-title>
          .
          <source>The Journal of Finance</source>
          <volume>66</volume>
          (
          <year>2011</year>
          )
          <fpage>35</fpage>
          –
          <lpage>65</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Heston</surname>
            ,
            <given-names>S.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sinha</surname>
            ,
            <given-names>N.R.</given-names>
          </string-name>
          :
          <article-title>News versus Sentiment: Comparing Textual Processing Approaches for Predicting Stock Returns</article-title>
          .
          <source>Robert H. Smith School Research Paper</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Amasyali</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davletov</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torayew</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciftci</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Text2arff: Automatic feature extraction software for Turkish texts</article-title>
          .
          <source>In: 2010 IEEE 18th Signal Processing and Communications Applications Conference (SIU)</source>
          ,
          <publisher-name>IEEE</publisher-name>
          (
          <year>2010</year>
          )
          <fpage>629</fpage>
          –
          <lpage>632</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          –
          <lpage>2830</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <source>Machine Learning: A Probabilistic Perspective</source>
          .
          <publisher-name>The MIT Press</publisher-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine Learning</source>
          <volume>20</volume>
          (
          <year>1995</year>
          )
          <fpage>273</fpage>
          –
          <lpage>297</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>