<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontology-Based Approach to Academic Style Marker Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viacheslav Lanin</string-name>
          <email>vlanin@hse.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofia Philipson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Perm, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The article describes an ontology-based approach to the systematization and search of academic English style markers. The designed ontology is divided into two levels: the first level provides information about linguistic terms, and the second consists of style markers derived by experts in linguistics. We propose generating lexical-semantic templates from the ontology, with the help of Domain Specific Language (DSL) technology, to identify the listed markers in a text. Currently, a JAPE (Java Annotation Patterns Engine) template for the GATE text processing system is available.</p>
      </abstract>
      <kwd-group>
        <kwd>Style marker</kwd>
        <kwd>Scientific paper</kwd>
        <kwd>Ontology</kwd>
        <kwd>DSL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The publication of research results in scientific papers is the most significant
performance indicator for scholars and research staff. Papers written in English
notably extend the audience, but scholars who are not native speakers
usually face difficulties with the strict style and language requirements of
written academic English. There is a huge variety of methodological materials on
written academic English, as well as specialized educational courses. An analysis of the
literature has shown that the suggested recommendations are not systematized and sometimes even
contain obvious internal contradictions. It should also be noted that many publications
have their own stylistic “publishing traditions”, which need to be taken into
account when preparing materials. At present, text studies are carried out
with the use of computer technology, which enables the processing of huge corpora.
Corpus data provide empirical material that can serve as the foundation for creating
reference language patterns, studying language regularities, and describing
linguistic phenomena typical of a particular language area, i.e. deriving style
markers. The statistics collected by annotating a corpus in accordance with the style
markers show how frequently the criteria of academic English occur
and what role they play in style estimation. This will help to determine the style quality
level of a paper and then to formulate recommendations for improvement. Style markers in this
paper are understood as the main features of academic English in the linguistic sense.</p>
      <p>
        The main purpose of this project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is the extraction of style markers and the
interrelations between them, and the design of a reference model of academic English style.
Investigating the hierarchical relations between style elements is also crucial, as it
helps to determine their frequency of occurrence in English scientific texts and to describe
the usage patterns of these elements in text fragments of different levels.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Existing Solutions</title>
      <p>One of the actively developing branches of theoretical and applied stylistics is the
comprehensive analysis of scientific papers written in English, carried out by processing
large text corpora of a particular science and by conducting comparative stylistic studies.
Comparative analysis of the quality of English academic style in texts by authors for whom
English is a foreign language poses the greatest challenge for corpus linguistics
research and for the development of corpus analysis software. It is worth noting
that the majority of research on written English is performed by native-speaker scholars and
is either declarative in character or based on a limited amount of data. This is a problem
because it makes it impossible to describe the English of a particular subject area with
certainty, to derive its key features, and to study usage patterns. The use of computer
technologies greatly simplifies the statistical processing of corpora in linguistic research.
Systematic quantitative research on written scientific language with the use of
software makes it possible to process statistically large scientific corpora of almost
every domain, to find existing regularities, and to identify and
systematize the main attributes of scientific language.</p>
      <p>
        At present there are a great number of tools for corpus processing. The most
widespread of them are AntConc, WordSmith Tools, GATE Developer, Sketch Engine
and CQPweb. There are also specialized solutions for analysing the style of academic papers, for
example the Fapas project (Full Automatic Paper Analysis System) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        There are also projects devoted to creating ontologies that
describe the linguistic domain. One of them is GOLD, the General
Ontology for Linguistic Description [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It describes the foundations of linguistics, including
the most fundamental categories and the relations between them. The ontology is linked
to SUMO (Suggested Upper Merged Ontology) and is organized into four main domains:
expressions, grammar, data constructs, and metaconcepts.
      </p>
      <p>The expressions category covers the physically accessible aspects of language. Its
basis was taken from SUMO, and new concepts such as WrittenLinguisticExpressions and
SpokenLinguisticExpressions were added to the concept LinguisticExpression.</p>
      <p>The grammar category includes the abstract properties and relations of language, the
domain that is of primary interest to linguists. This means that anything expressed by a
grammatical system is represented by the concept GrammaticalCategory.</p>
      <p>Data constructs are constructs used by linguists to analyze language data,
such as paradigms, lexicons and feature structures. Metaconcepts are the most basic
concepts of linguistic analysis, including language itself. There are many ways in
which language can be viewed, and without a working concept of language an
ontology cannot be used to describe and compare data from all of the world’s languages.
Language was defined as the set of data associated with a common grammatical
pattern. All in all, the ontology tries to describe all the aspects of natural language
that can be applied to all languages.</p>
      <p>Another example of an ontology also comes from the linguistic field, but it concentrates on
computational linguistics. It is built on the basis of a scholarly
knowledge ontology, and for this reason its concepts are divided into five
“whole-part” hierarchies, which are connected to each other by associative relations.
The subject matter of computational linguistics comprises the properties and
systems of linguistic units, the operations connected with their functioning in the process of
communication, and the application processes that respond to a defined request.</p>
    </sec>
    <sec id="sec-3">
      <title>Theoretical approach background</title>
      <p>The theoretical foundation of the system described in this paper consists of a list of
style markers selected from reference and study materials, Internet
resources about academic writing, and scientific papers on this topic. All markers
from this list can be divided into three main groups: lexical, grammar,
and syntactic markers.</p>
      <p>Lexical markers include three types of features:
• specific words and terminology (high frequency of terminology; use of
abstract semantic verbs, desemanticized verbs, and intensifying adverbs; low frequency of
personal pronouns such as you, he, she, etc.);</p>
      <p>• words corresponding to specific word-formation patterns (nouns with the -or
suffix, commonly used in terminology; abstract nouns derived with the suffixes -ment,
-ness, -tion, etc.);</p>
      <p>• words of a specific part of speech (high frequency of nouns, low frequency of
pronouns).</p>
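      <p>As an illustration of how the word-formation markers listed above could be checked automatically, the following minimal Python sketch counts tokens that end in the listed suffixes. The tokenization, the minimum stem length, and the function name count_suffix_markers are assumptions made for this example. The sketch performs no part-of-speech check, so it over-generates (it cannot tell a noun in -or from an unrelated word with the same ending), and it is not the annotation pipeline used in the project.</p>
      <preformat>
```python
import re
from collections import Counter

# Suffixes named in the text; the minimum stem length is an illustrative assumption.
SUFFIXES = ("or", "ment", "ness", "tion")
MIN_STEM = 3


def count_suffix_markers(text):
    """Count tokens that end in one of the listed suffixes (no POS check)."""
    counts = Counter()
    for token in re.findall(r"[A-Za-z]+", text.lower()):
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM:
                counts[suffix] += 1
                break
    return counts


sample = "The development of the detector improves the evaluation of awareness."
print(count_suffix_markers(sample))
```
      </preformat>
      <p>Dividing such counts by the total number of tokens would give the relative frequencies that the statistical analysis of a corpus relies on.</p>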
      <p>Two types of features fall into the grammar markers category:
• wide use of verbs in the passive voice;
• presumed prevalence of verbs in the present tense.</p>
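      <p>The passive-voice marker can be approximated with a simple surface pattern. The Python sketch below treats a form of "be" followed by a word ending in -ed, or by one of a few irregular participles, as a passive construction. The word lists, the regular expression, and the name find_passives are illustrative assumptions. A real annotator would rely on part-of-speech tags; this heuristic misses most irregular participles and produces false positives, for example with adjectives ending in -ed.</p>
      <preformat>
```python
import re

# Illustrative, incomplete word lists (assumptions, not taken from the project).
BE_FORMS = r"(?:am|is|are|was|were|be|been|being)"
IRREGULAR = r"(?:shown|given|taken|written|known|made|found|done)"

# A form of "be", an optional -ly adverb, then a participle-like word.
PASSIVE = re.compile(
    rf"\b{BE_FORMS}\s+(?:\w+ly\s+)?(?:\w+ed|{IRREGULAR})\b",
    re.IGNORECASE,
)


def find_passives(text):
    """Return surface strings that look like passive constructions."""
    return [match.group(0) for match in PASSIVE.finditer(text)]


sample = "The corpus was annotated and the results are shown in the table. We analyse them."
print(find_passives(sample))
```
      </preformat>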
      <p>Syntactic markers can also be classified into two types:
• features described by syntactic structures (simple, complex or compound
sentence structure; prepositive and postpositive attributes with most nouns; possible
prevalence of prepositive attributes in technical texts);</p>
      <p>• specific conjunctions, linking expressions, etc. (subordinating and correlative
conjunctions; archaisms such as thereby, therewith, hereby; prepositional phrases; means of
logical expression).</p>
      <p>Most of these features can be annotated automatically using lexical-syntactic
patterns, although absolute accuracy cannot be guaranteed, which is why expert control
and means of manually correcting annotations are highly desirable in the system
implementation. Flexibility of the system components is also important for development
and for further testing and debugging, owing to the specific nature of academic style feature
tagging and of natural language processing in general.</p>
      <p>Currently our system annotates text based on all of the described style markers
with the exception of terminology and sentence structure. Although some components
are still being tested, the most recent annotation sets provide enough information to
analyze academic writing and to study some of the features in more depth.</p>
      <p>At present, style markers are represented as a disparate data set. A need has
emerged to systematize the markers; moreover, the method of systematization should
allow for extension and adaptation, since language is a dynamic and
constantly developing system.</p>
    </sec>
    <sec id="sec-4">
      <title>Academic style marker ontology</title>
      <p>This study describes an ontology as a way of organizing and systematizing the
disparate data set. The ontology reveals the dependencies
between entities, which take the form of style markers, and indicates any interconnections
between them. Thus, a huge variety of different style markers turns into a controlled
system, which can then be used as part of a larger project focused on improving the
quality of text annotation.</p>
      <p>The ontology is based on the main definitions, or basic aspects, of academic English
derived by experts: Nominalization, Personal Stance, Verb,
Adverb, Attributes, and Cohesiveness. After new classes were added, the resulting
class hierarchy consisted of 37 classes and subclasses. As already mentioned, the ontology
has two levels: the level of linguistic terms, which includes such classes as
PartOfSpeech, PartOfWord, GrammarStructure, and Atributes, and the level of style
marker concepts such as ComplexConjuctors, PrepositiveAttributes,
DesemanticisedVerbs, etc.</p>
      <p>Different properties are used to identify relations between entities. The main
relation is inheritance, which is used for generalization and specialization, but
there are also properties such as hasSuffix (the relation between the classes Noun and Suffix),
depend/influence (between verbs and nouns/pronouns), etc.</p>
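      <p>The two kinds of relations described above, inheritance and properties such as hasSuffix, can be pictured with a small in-memory model. The Python sketch below is only an illustration: the class names Noun, Suffix, PartOfSpeech and PartOfWord come from the text, but the particular hierarchy (placing Noun under PartOfSpeech and Suffix under PartOfWord), the AbstractNoun class, and the helper names are assumptions, not the structure of the OWL ontology itself.</p>
      <preformat>
```python
# Subclass -> superclass links (inheritance); the hierarchy shown is assumed.
SUBCLASS_OF = {
    "Noun": "PartOfSpeech",
    "Verb": "PartOfSpeech",
    "Suffix": "PartOfWord",
    "AbstractNoun": "Noun",  # hypothetical style-marker class
}

# Properties stored as (subject, property, object) triples.
PROPERTIES = [
    ("Noun", "hasSuffix", "Suffix"),
]


def ancestors(cls):
    """Return all superclasses of cls, nearest first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def properties_of(cls):
    """Return properties declared on cls or inherited from its superclasses."""
    lineage = [cls] + ancestors(cls)
    return [(s, p, o) for (s, p, o) in PROPERTIES if s in lineage]


print(ancestors("AbstractNoun"))
print(properties_of("AbstractNoun"))
```
      </preformat>
      <p>In this sketch a marker class inherits hasSuffix from Noun, which is the kind of dependency that a pattern generator could later read when building an identification rule.</p>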
      <p>All in all, the developed ontology collects all the derived style markers and reveals
the relations between them, which makes working with style markers simpler.</p>
    </sec>
    <sec id="sec-5">
      <title>Pattern generation on the basis of DSL-technologies</title>
      <p>The ontology is needed not only for systematizing style markers but also as the
foundation for generating lexical-semantic patterns. The rule generation architecture is
shown in Fig. 2.</p>
      <p>[Fig. 2. Rule generation architecture: Protégé ontology editor, Validator, Generator, generation rules.]</p>
      <p>The Protégé ontology editor is used for describing the ontology and representing it in the
OWL format. The Validator is the component that checks the correctness of
users’ models: while designing a model, the user can make mistakes or create
models that do not satisfy the constraints of the ontology. The Generator is the
component responsible for code generation in the target language. It
transforms users’ models into a textual representation in the description language of
lexical-semantic patterns and generates files in the formats of computational
linguistics systems, for example JAPE. To improve interoperability, the system
allows users to define the transformation rules themselves. At
this level of the metamodel it is crucial to create a text template for every language element,
according to which code generation is performed. A text template includes
a static part, which does not depend on a particular model, and a dynamic part, which
makes it possible to refer to the attribute values of different DSL constructions.</p>
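      <p>The split of a text template into a part that does not depend on the model and a dynamic part can be sketched as follows. In this Python illustration the fixed part is a JAPE rule skeleton, and the dynamic part is a set of placeholders filled from the attributes of one marker description. The skeleton, the placeholder names, the marker dictionary, and the function generate_jape_rule are assumptions made for the example; they do not reproduce the templates or the DSL of the described system, and the generated rule has not been checked against a GATE installation.</p>
      <preformat>
```python
from string import Template

# Fixed part: a JAPE rule skeleton that does not depend on a particular model.
# Dynamic part: the $placeholders, filled from the attributes of a marker.
JAPE_TEMPLATE = Template("""\
Phase: $phase
Input: Token
Options: control = appelt

Rule: $rule_name
(
  {Token.category == "$pos_tag", Token.string ==~ ".*$suffix"}
):match
-->
:match.StyleMarker = {kind = "$marker"}
""")


def generate_jape_rule(marker):
    """Fill the template with the attribute values of one marker description."""
    return JAPE_TEMPLATE.substitute(
        phase="StyleMarkers",
        rule_name=marker["name"],
        pos_tag=marker["pos_tag"],
        suffix=marker["suffix"],
        marker=marker["name"],
    )


# Hypothetical marker description, as it might be read from the ontology.
abstract_noun = {"name": "AbstractNounMent", "pos_tag": "NN", "suffix": "ment"}
print(generate_jape_rule(abstract_noun))
```
      </preformat>
      <p>Keeping the skeleton separate from the marker attributes means that a new marker needs only a new description, while support for another target format would need only a different skeleton.</p>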
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>The current version of the designed ontology consists of 37 concepts and 8 types of relations.
Standard tools and software applications are used in designing the ontology,
which simplifies its development and maintenance. The
described approach is extensible: to add a new marker, the
user needs only to add its description, and the identification rule will be generated
automatically. Moreover, the linguistic level described in the ontology
makes it possible to describe related domains.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgment</title>
      <p>The article was prepared within the framework of the Academic Fund Program
at the National Research University Higher School of Economics (HSE) in 2017 (grant
№ 17-05-0020) and by the Russian Academic Excellence Project "5-100".</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Borovikova</surname>
            <given-names>O.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zagorul'ko</surname>
            <given-names>Yu.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zagorul'ko</surname>
            <given-names>G.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kononenko</surname>
            <given-names>I.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sokolova</surname>
            <given-names>E.G.</given-names>
          </string-name>
          <article-title>Razrabotka portala znanij po komp'yuternoj lingvistike // Trudy 11 nacionalnoj konferencii po iskusstvennomu intellektu s mezhdunarodnym uchastiem KII-</article-title>
          <year>2008</year>
          . - M.: LENAND,
          <year>2008</year>
          . -V.3. -p.
          <fpage>380</fpage>
          -
          <lpage>388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Strinyuk</surname>
            <given-names>S. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shuchalova</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lanin</surname>
            <given-names>V</given-names>
          </string-name>
          .
          <article-title>Academic Papers Evaluation Software</article-title>
          ,
          <source>in: Application of Information and Communication Technologies (AICT)</source>
          ,
          <year>2015</year>
          9th International Conference on,
          <fpage>14</fpage>
          -
          <lpage>16</lpage>
          Oct.
          <year>2015</year>
          .
          Rostov-on-Don: IEEE,
          <year>2015</year>
          . p.
          <fpage>506</fpage>
          -
          <lpage>510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Scholz</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conrad</surname>
            <given-names>S</given-names>
          </string-name>
          .
          <source>Style Analysis of Academic Writing // Natural Language Processing and Information Systems: 16th International Conference on Applications of Natural</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cunningham</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            <given-names>K.</given-names>
          </string-name>
          , et al.
          <article-title>Developing Language Processing Components with GATE Version 7</article-title>
          . The University of Sheffield.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Farrar</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langendoen</surname>
            <given-names>D.</given-names>
          </string-name>
          <article-title>A linguistic ontology for the Semantic Web</article-title>
          .
          <year>2003</year>
          . GLOT International.
          <volume>7</volume>
          (
          <issue>3</issue>
          ), p.
          <fpage>97</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>