<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Authorship Attribution. Computers and
the Humanities</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automated IQ Estimation from Writing Samples</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Austin Hendrix</string-name>
          <email>austin.hendrix@louisville.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Yampolskiy</string-name>
          <email>roman.yampolskiy@louisville.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Louisville</institution>
          ,
          <addr-line>Louisville, KY 40208</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>28</volume>
      <issue>2</issue>
      <fpage>3</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>The primary focus of this research is to introduce a method of measuring an individual's IQ by analyzing the vocabulary in said individual's writing. In this paper, we show that the ratio of SAT words in a dataset of writing samples is roughly normally distributed, though with an obvious left skew. We go on to show a method that can be used to calculate an individual's IQ with this ratio and provide samples with measured accuracy. The conclusion suggests ways to increase accuracy in order to further develop the research along with applications of doing so.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Stylometry is the statistical analysis of differences in literary style between authors (Franking, 1988). Since as early as 1880, stylometry has been used as a method of authorship identification for disputed texts. With the development of computers and automation techniques, stylometric analysis has become easier. An early example of software-defined stylometry was used to identify the author of the disputed papers among the “Federalist Papers” (<xref ref-type="bibr" rid="ref9">Tweedie, Singh, Holmes; 1996</xref>). This work demonstrated that automated stylometric analysis is, at least in this application, able to draw conclusions about the authorship of these papers similar to those of previous work on the subject. In recent years, stylometry has taken on a broad range of applications. For example, it has been used in the identification of chat bots (<xref ref-type="bibr" rid="ref1">Ali, Hindi, Yampolskiy; 2011</xref>). Further research showed that the stylometric approach becomes more difficult when a chat bot changes its behavior over time (<xref ref-type="bibr" rid="ref2">Ali, Schaeffer, Yampolskiy; 2012</xref>). In addition, it has been demonstrated that stylometric author identification can be applied to a single author who writes in multiple languages <xref ref-type="bibr" rid="ref3">(Ali, Yampolskiy, 2014)</xref>. This is significant because it shows that certain writing trends are independent of the author’s language and are therefore likely stronger candidates for comparing authors who write in different languages.
      </p>
      <p>
        Because no single scientific measurement is currently accepted for quantifying intelligence, many different measures have been used. Intelligence tests are a common way to determine an individual’s intelligence relative to others. These tests have attracted many negative and controversial opinions, yet experts still agree on their overall usefulness (Snyderman, Rothman; 1987). Further studies have shown that a standard intelligence test provides the best single, reliable predictor of academic aptitude <xref ref-type="bibr" rid="ref4">(Bullerdieck, 1985)</xref>. One popular standard intelligence test measures an individual’s Intelligence Quotient (IQ). The assumption behind this system of measurement is that if a large sample of IQs is plotted, the distribution will be normal. It has been shown that there are issues with the structure and quality of the standard IQ test (Lawler, 1977). Still, the IQ test can be a useful way to compare the intelligence of individuals. For this paper, we act under the assumption that an individual’s IQ score relates directly to their true intelligence level.
      </p>
      <p>This preliminary research project explores whether an individual’s IQ can be determined using software-defined stylometry. The novelty of this process is that it is not centered on author identification. Instead, stylometry is used to determine the relative writing quality of a known author. The process involves analyzing an attribute of a known author’s writing to determine that author’s IQ. Multiple attributes of writing are potential candidates for this application. For the beginning of this research, we focus on the individual’s vocabulary. Other research has discussed further attributes with possible merit. These attributes include, but are not limited to, word length, syllables, sentence length, and distribution of parts of speech (Holmes, 1994).</p>
    </sec>
    <sec id="sec-2">
      <title>Collegiate Word Ratio</title>
      <p>To determine an individual’s IQ based on their vocabulary, a quantitative measure of the quality of that vocabulary is necessary. For the purposes of this project, we define a “Collegiate Word” as a word the SAT considers part of strong vocabulary usage.<sup>1</sup> The Collegiate Word Ratio (CWR), which we refer to throughout this paper, is therefore defined as:</p>
      <disp-formula>
        <tex-math><![CDATA[\text{Collegiate Word Ratio} = \frac{\text{Collegiate Word Count}}{\text{Total Word Count}}]]></tex-math>
      </disp-formula>
      <p>The CWR of each sample is measured by software and then compared with the rest of the samples, by means of a distribution, to determine its relative quality. Pseudo-code for calculating the CWR of a sample is shown in Figure 1.</p>
      <fig id="fig1">
        <label>Figure 1</label>
        <caption><p>Pseudo-code for calculating the CWR of a sample.</p></caption>
        <preformat>
CollegiateWordCount = 0
for SampleWord in Sample:
    for CollegiateWord in CollegiateWordList:
        if SampleWord == CollegiateWord:
            CollegiateWordCount++
CollegiateWordRatio = (CollegiateWordCount / SampleWordCount)
        </preformat>
      </fig>
      <p>Now that we have clearly defined a process for calculating the CWR of a sample, we need to execute this software on a large dataset. An ideal dataset would consist of writing samples by many randomly selected individuals, with each sample representing that individual’s average writing ability. As such a dataset was not available to the authors of this project, another source had to be found.</p>
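      <p>As a minimal sketch, the CWR calculation described in this section could be written in Python as follows. The function name <monospace>collegiate_word_ratio</monospace>, the regular-expression tokenizer, the case-insensitive comparison, and the two-word placeholder list are assumptions made for illustration only; the paper does not state how samples were tokenized, and the placeholder list is not the word list used in the project.</p>
      <preformat>
```python
"""Sketch: compute the Collegiate Word Ratio (CWR) of a text sample."""
import re


def collegiate_word_ratio(sample, collegiate_words):
    """Return collegiate word count divided by total word count.

    Returns 0.0 for a sample that contains no words.
    """
    # Assumption: a word is a run of letters or apostrophes, compared
    # case-insensitively. The paper does not specify its tokenizer.
    words = re.findall(r"[a-z']+", sample.lower())
    if not words:
        return 0.0
    # A set gives constant-time membership tests instead of a nested loop.
    vocabulary = {word.lower() for word in collegiate_words}
    collegiate_count = sum(1 for word in words if word in vocabulary)
    return collegiate_count / len(words)


if __name__ == "__main__":
    # Hypothetical two-word list, used only to exercise the function.
    word_list = ["ubiquitous", "ephemeral"]
    text = "Ubiquitous screens make attention ephemeral and scarce."
    print(collegiate_word_ratio(text, word_list))
```
      </preformat>
      <p>The set lookup yields the same count as the nested loop of the pseudo-code, provided the word list contains no duplicate entries.</p>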
      <p>The Common Crawl is a corpus containing raw web page data, extracted metadata, and text extractions.<sup>2</sup> The text extractions from this corpus contain the raw text taken directly from websites. We assume that each text extraction was written by a human and is likely representative of that individual’s average writing. To help increase the accuracy of results under this assumption, only samples with more than 100 words were used. After collecting a large number of samples from the Common Crawl corpus, each sample’s CWR was stored and plotted as a distribution (Figure 2). The distribution is fairly normal, though with a slight skew to the left. This suggests that, over a large number of samples, the distribution of CWR resembles the distribution of IQs.</p>
      <fig id="fig2">
        <label>Figure 2</label>
        <caption><p>Distribution of CWR over the Common Crawl samples. Horizontal axis: Collegiate Word Ratio, from 0 to 0.2.</p></caption>
      </fig>
      <p><sup>1</sup> The full list of words used for this project can be found at www.freevocabulary.com.</p>
      <p><sup>2</sup> https://aws.amazon.com/public-datasets/common-crawl/</p>
    </sec>
    <sec id="sec-4">
      <title>Determining IQ from CWR</title>
      <p>We have shown that the distribution of CWR is fairly normal. We now demonstrate the process of using CWR to calculate an individual’s IQ. A graph showing the two distributions overlaid is given in Figure 3.</p>
      <p>The IQ curve shown is the ideal expected IQ distribution. It is perfectly normal with a mean of 100. The CWR distribution, though skewed slightly left, follows the IQ distribution very closely over the second and third standard deviations above the mean. For the purposes of this analysis, we assume that a CWR in this region maps onto its corresponding IQ. This assumption introduces a certain amount of error when calculating IQ from CWR. Nevertheless, the distributions are close enough that the process should give a good estimate of an individual’s IQ.</p>
      <p>To transfer between the two curves, we need to know the standard deviation and mean of both distributions. For the IQ curve, these are fixed values: the mean IQ of all individuals is taken to be 100 and the standard deviation of all IQ values is taken to be 15. For our dataset, the mean CWR is 0.074759005 and the standard deviation is 0.031552108.</p>
      <p>Using these values and an individual data point’s CWR, a corresponding IQ score can be calculated. The first step is to find the z-score of the data point:</p>
      <disp-formula>
        <tex-math><![CDATA[\text{Z-Score} = \frac{\text{CWR Data Point} - \text{CWR Mean}}{\text{CWR Standard Deviation}}]]></tex-math>
      </disp-formula>
      <p>This z-score represents the number of standard deviations, positive or negative, by which the data point lies away from the mean. Since we know the standard deviation and mean of all IQ scores, the corresponding IQ can be calculated as follows:</p>
      <disp-formula>
        <tex-math><![CDATA[\text{Corresponding IQ} = (\text{Z-Score} \times \text{IQ Standard Deviation}) + \text{IQ Mean}]]></tex-math>
      </disp-formula>
    </sec>
    <sec id="sec-10">
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Data collected from the test samples.</p></caption>
        <table>
          <thead>
            <tr>
              <th>Sample CWR</th>
              <th>Sample Word Length</th>
              <th>Sample Collegiate Word Count</th>
              <th>Expected IQ</th>
              <th>Measured IQ</th>
              <th>% Error</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>0.1250</td><td>752</td><td>94</td><td>153</td><td>123.88</td><td>19.03</td></tr>
            <tr><td>0.1238</td><td>412</td><td>51</td><td>130</td><td>123.31</td><td>5.15</td></tr>
            <tr><td>0.1618</td><td>136</td><td>22</td><td>141</td><td>141.36</td><td>0.26</td></tr>
            <tr><td>0.1321</td><td>3279</td><td>433</td><td>129</td><td>127.24</td><td>1.36</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-10-1">
        <title>Testing IQ Estimation Software</title>
        <p>Now that a writing sample can be used to determine the IQ of an individual from their CWR, we must determine whether that IQ is accurate. The process is straightforward in principle, though difficult to carry out. To say reliably that CWR can be used to calculate an individual’s IQ, we must find multiple individuals with a known IQ and obtain writing that is their own. The pseudo-code for the software used to map the CWR of a sample onto a corresponding IQ is shown in Figure 4.</p>
        <fig id="fig4">
          <label>Figure 4</label>
          <caption><p>Pseudo-code for mapping the CWR of a sample onto a corresponding IQ.</p></caption>
          <preformat>
Sample_Z_Score = (CWR_Sample - CWR_Mean) / CWR_Standard_Deviation
Sample_IQ = (Sample_Z_Score * IQ_Standard_Deviation) + IQ_Mean
          </preformat>
        </fig>
        <p>Using social media contacts, we located several individuals willing to give their IQ and a sample of their writing for the purposes of testing our software. It should be noted that there is no external verification that these individuals gave an accurate IQ, but these samples are a good starting point for testing the reliability of this software. The data collected from these samples is shown in Table 1. Despite the large error in the first sample, the accuracy of the remaining samples provides support for this approach to calculating IQ.</p>
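        <p>The mapping from CWR to IQ can be sketched in Python as shown below. The mean and standard deviation constants are the values reported in this paper, and the rows are the word counts and self-reported IQs from Table 1; the names <monospace>cwr_to_iq</monospace> and <monospace>percent_error</monospace> are hypothetical and chosen only for this illustration, which is not the authors’ software.</p>
        <preformat>
```python
"""Sketch: map a Collegiate Word Ratio (CWR) onto the IQ scale via a z-score."""

# Values reported in the paper for the dataset and for the IQ scale.
CWR_MEAN = 0.074759005
CWR_STD = 0.031552108
IQ_MEAN = 100.0
IQ_STD = 15.0


def cwr_to_iq(cwr):
    """Return the IQ lying as many standard deviations from its mean as cwr."""
    z_score = (cwr - CWR_MEAN) / CWR_STD
    return z_score * IQ_STD + IQ_MEAN


def percent_error(expected, measured):
    """Return the absolute error of measured as a percentage of expected."""
    return abs(expected - measured) / expected * 100.0


# (sample word length, collegiate word count, self-reported IQ) from Table 1.
TABLE_1 = [(752, 94, 153), (412, 51, 130), (136, 22, 141), (3279, 433, 129)]

if __name__ == "__main__":
    for total_words, collegiate_words, expected_iq in TABLE_1:
        cwr = collegiate_words / total_words
        measured_iq = cwr_to_iq(cwr)
        error = percent_error(expected_iq, measured_iq)
        print(f"CWR={cwr:.4f}  IQ={measured_iq:.2f}  error={error:.2f}%")
```
        </preformat>
        <p>Computing the CWR from the raw counts, rather than from the four-decimal CWR values, gives the Measured IQ column of Table 1 to within rounding; the percent errors can differ in the last digit depending on whether the IQ is rounded before the error is taken.</p>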
      </sec>
      <sec id="sec-10-2">
        <title>Conclusions and Future Work</title>
        <p>Though our first sample produced a result with a moderate error, there still seems to be merit in looking further into this methodology. It should be noted that the samples used were approximately 2 standard deviations above the mean. Further sampling should include data from both ends of the curve. There may not ultimately be a cause-and-effect relationship between intelligence level and vocabulary usage, but this research does indicate that the two are correlated. The normality of the distribution of CWR may be significant in other applications and should be noted regardless of the final merits of this approach to calculating intelligence.</p>
        <p>This research paper is intended to be purely preliminary and simply to introduce the concept, and one possible implementation, of using an individual’s vocabulary to determine IQ levels. To further develop this research, the authors suggest that a larger dataset be used to create a more accurate distribution. In addition, a more reliable dataset is necessary to test the accuracy of these methods. For the strongest possible results, self-reported IQ scores should not be used. Ideally, the next stage of research will include an IQ test along with a specific writing prompt on which to run our software. Lastly, there is likely merit in exploring the other attributes of writing mentioned in the introduction to this paper. It is possible that one or all of these attributes may provide a better avenue for calculating an individual’s intelligence level.</p>
        <p>
          The ability to analyze the intelligence of individuals is a very useful tool. Previous research has shown that numerous factors influence whether an intellectually gifted child will ultimately lead a successful life (<xref ref-type="bibr" rid="ref8">Tomlinson-Keasey, Little; 1990</xref>). Earlier identification of these children, through application of this research, has the potential to allow them to be guided down a positive path that leads to their personal success. In addition, this research could play a role in evaluating the abilities of persons currently prominent in the political and scientific realms. Nevertheless, further research must be done in this area of study before anything truly conclusive can be said.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Ali</surname>, <given-names>Nawaf</given-names></string-name>;
          <string-name><surname>Hindi</surname>, <given-names>Musa</given-names></string-name>;
          <string-name><surname>Yampolskiy</surname>, <given-names>Roman</given-names></string-name>.
          (<year>2011</year>).
          <article-title>Evaluation of authorship attribution software on a Chat bot corpus</article-title>.
          <source>2011 XXIII International Symposium on Information, Communication and Automation Technologies (ICAT)</source>, IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Ali</surname>, <given-names>Nawaf</given-names></string-name>;
          <string-name><surname>Schaeffer</surname>, <given-names>Derek</given-names></string-name>;
          <string-name><surname>Yampolskiy</surname>, <given-names>Roman</given-names></string-name>.
          (<year>2012</year>).
          <article-title>Linguistic Profiling and Behavioral Drift in Chat Bots</article-title>.
          <source>MAICS</source>,
          <fpage>27</fpage>-<lpage>30</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Ali</surname>, <given-names>Nawaf</given-names></string-name>;
          <string-name><surname>Yampolskiy</surname>, <given-names>Roman</given-names></string-name>.
          (<year>2014</year>).
          <article-title>BLN-Gram-TFITF as a Language Independent Feature for Authorship Identification and Paragraph Similarity</article-title>.
          <source>9th Cyber and Information Science Research Conference</source>, Oak Ridge, Tennessee.
        </mixed-citation>
      </ref>
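      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Bullerdieck</surname>, <given-names>K. Kelly McK.</given-names></string-name>
          (<year>1985</year>).
          <article-title>Considerations in Defining the Gifted</article-title>.
          http://journals.sagepub.com/doi/abs/10.1177/107621758500800607
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name><surname>Franking</surname>, <given-names>Holly</given-names></string-name>.
          (<year>1988</year>).
          <source>Stylometry: A statistical method for determining authorship, textual integrity, and chronology</source>.
          University of Kansas, ProQuest Dissertations Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Holmes</surname>, <given-names>D.</given-names></string-name>
          (<year>1994</year>).
          <article-title>Authorship Attribution</article-title>.
          <source>Computers and the Humanities</source>,
          <volume>28</volume>(<issue>2</issue>).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><surname>Lawler</surname>, <given-names>James</given-names></string-name>.
          (<year>1977</year>).
          <article-title>IQ: Biological Fact or Methodological Construct?</article-title>
          <source>Science &amp; Society</source>, vol.
          <volume>41</volume>, no.
          <issue>2</issue>, pp.
          <fpage>208</fpage>-<lpage>218</lpage>.
          www.jstor.org/stable/40402014.
        </mixed-citation>
      </ref>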
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Snyderman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rothman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>1987</year>
          ).
          <article-title>Survey of expert opinion on intelligence and aptitude testing</article-title>
          .
          <source>American Psychologist</source>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ),
          <fpage>137</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Tomlinson-Keasey</surname>, <given-names>Carol</given-names></string-name>;
          <string-name><surname>Little</surname>, <given-names>Todd D.</given-names></string-name>
          (<year>1990</year>).
          <article-title>Predicting educational attainment, occupational achievement, intellectual skill, and personal adjustment among gifted men and women</article-title>.
          <source>Journal of Educational Psychology</source>, vol.
          <volume>82</volume>(<issue>3</issue>), pp.
          <fpage>442</fpage>-<lpage>455</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Tweedie</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Singh</surname>, <given-names>S.</given-names></string-name>, &amp;
          <string-name><surname>Holmes</surname>, <given-names>D.</given-names></string-name>
          (<year>1996</year>).
          <article-title>Neural Network Applications in Stylometry: The "Federalist Papers"</article-title>.
          <source>Computers and the Humanities</source>,
          <volume>30</volume>(<issue>1</issue>),
          <fpage>1</fpage>-<lpage>10</lpage>.
          Retrieved from http://www.jstor.org/stable/30204514
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>