<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Building the knowledge base of the question-answer system based on the syntagmatic analysis of the text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A Zarubin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A Koval</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Bonch-Bruevich Saint Petersburg State University of Telecommunications</institution>
          ,
          <addr-line>Saint Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>342</fpage>
      <lpage>347</lpage>
      <abstract>
        <p>Question-answer (QA) systems are systems that can take questions in a natural language and respond to them. In most cases, the principles of building question-answer systems are used in the development of decision support systems. The mechanism of syntagmatic patterns is used when processing open-ended questions and when extracting answers to them from semi-structured resources. This article describes the application of the mechanisms of syntagmatic patterns in the construction of various types of QA systems and expert systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. A structure of the KB of the QA system</title>
      <p>
        The knowledge base (KB) of our question-answer system has a tree-like structure. Semantic networks
are currently actively used in the construction of a KB [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Formally, the structure of the KB looks like
this:
      </p>
      <p>KB  SP, TD, R ,
where SP  SP1, SP2 , , SPn  is a set of syntagmatic patterns;</p>
      <sec id="sec-1-1">
        <title>TD  TD1, TD2 , , TDn  – is a set of text data (KB content);</title>
        <sec id="sec-1-1-1">
          <title>R  R SP , RTD  – is a set of relations of KB:</title>
        </sec>
        <sec id="sec-1-1-2">
          <title>RSP  R1SP , R2SP , , RnSP  – is a set of relations between the internal nodes of the KB tree;</title>
          <p>RTD  R1TD , R2TD , , RnTD – is a set of relations between the internal and terminal nodes of the KB
tree.</p>
          <p>The internal nodes of the KB tree contain a syntagmatic pattern as a label. Terminal nodes contain
text information. The answer to the question will be extracted from this textual information. An
example of the KB structure of our question-answer system is shown in Figure 1.</p>
          <p>More general syntagmatic patterns are located closer to the root element of the tree. More precise
syntagmatic patterns are located closer to the terminal nodes of the tree. Thus, this knowledge base
structure of our question-answer system allows us to find the necessary terminal nodes on the user's
question (if the answer to this question is in the knowledge base).</p>
        </sec>
      </sec>
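<p>For illustration, the tree-like KB structure described above can be sketched as a simple recursive data type, where internal nodes carry a syntagmatic-pattern label and terminal nodes carry text. The class and field names here are illustrative, not taken from the authors' implementation:</p>

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class KBNode:
    """A node of the knowledge-base tree."""
    pattern: Optional[str] = None         # syntagmatic-pattern label (internal nodes)
    text: Optional[str] = None            # text data (terminal nodes)
    children: List["KBNode"] = field(default_factory=list)

    def is_terminal(self) -> bool:
        # Terminal nodes are exactly the nodes without children.
        return not self.children

# A toy KB: a general pattern near the root, a more precise one below it,
# and a terminal node holding the text from which an answer is extracted.
kb = KBNode(pattern="<NOUN> is <NOUN>", children=[
    KBNode(pattern="a syntagma is <NOUN>", children=[
        KBNode(text="A syntagma is a unit of syntactic analysis."),
    ]),
])
```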
    </sec>
    <sec id="sec-2">
      <title>3. Learning of the QA system KB</title>
      <p>A modified fuzzy C-Means (FCM) clustering algorithm is used for learning (building the tree of)
the KB of our QA system. Each document must be represented as an index for the FCM algorithm.
Indexing a document consists of the following steps:
1. Load the document.
2. Remove stop words (words that carry no semantic value: prepositions, particles, etc.).
3. Perform stemming using the Porter algorithm (extracting the stem of each word).
4. Calculate the frequency of occurrence of each word in the document.
5. Represent the index of the document as the expression:
Id = {⟨wid, fid⟩ | i = 1, …, n},
where wid is the i-th word of the document d;
fid is the frequency of occurrence of the i-th word in the document d;
n is the number of words in the document d.</p>
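<p>The indexing steps above can be sketched as follows. To keep the example dependency-free, a crude suffix-stripping function stands in for the full Porter stemmer, and the stop-word list is an illustrative subset:</p>

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is"}  # illustrative subset

def crude_stem(word: str) -> str:
    """Very rough stand-in for the Porter stemmer: strip common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_document(text: str) -> dict:
    """Build the document index {stem: frequency} following steps 1-5."""
    words = [w.lower().strip(".,!?") for w in text.split()]   # step 1: load/tokenize
    words = [w for w in words if w and w not in STOP_WORDS]   # step 2: stop words
    stems = [crude_stem(w) for w in words]                    # step 3: stemming
    return dict(Counter(stems))                               # steps 4-5: frequencies

idx = index_document("The system answers questions in a natural language.")
```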
      <p>The modified FCM clustering algorithm is based on minimizing the objective function:</p>
      <p>FFCM = Σ(i=1..D) Σ(j=1..C) uij^m · ||Ii − Icj||², 1 ≤ m &lt; ∞,
where D is the number of document indexes to be clustered;
C is the number of clusters;
m is any real number greater than 1;
uij is the degree to which the document index Ii belongs to the cluster j;
Ii is the i-th document index;
Icj is the center of the j-th cluster;
||Ii − Icj|| is the normalized distance between the index of the document and the center of the
cluster.</p>
      <p>The FCM algorithm consists of the following steps:
1. Initialize the matrix U = ||uij|| of membership degrees of the document indexes in the clusters.
2. Calculate the cluster centers:
Icj = (Σ(i=1..D) uij^m · Ii) / (Σ(i=1..D) uij^m).
3. Form a new membership matrix:
uij = 1 / Σ(k=1..C) (||Ii − Icj|| / ||Ii − Ick||)^(2/(m−1)).
4. Calculate the value of the objective function and compare it with the value at the previous
iteration. Clustering is complete if the difference does not exceed the threshold value; otherwise,
go to the second step of the algorithm.</p>
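<p>A minimal NumPy sketch of the four FCM steps is given below. The random initialization, tolerance, and the standard membership-update exponent 2/(m−1) are assumptions of this sketch, not details from the paper:</p>

```python
import numpy as np

def fcm(X, n_clusters, m=2.0, tol=1e-4, max_iter=100, seed=0):
    """Fuzzy C-Means over document indexes X (one index per row)."""
    rng = np.random.default_rng(seed)
    D = X.shape[0]
    # Step 1: initialize the membership matrix U (each row sums to 1).
    U = rng.random((D, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    prev_obj = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Step 2: cluster centers as membership-weighted means.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)          # avoid division by zero
        # Step 3: new membership matrix u_ij = 1 / sum_k (d_ij/d_ik)^(2/(m-1)).
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)
        # Step 4: stop when the objective change falls below the threshold.
        obj = ((U ** m) * dist ** 2).sum()
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return U, centers

# Two well-separated pairs of toy document indexes.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
U, centers = fcm(X, n_clusters=2)
labels = U.argmax(axis=1)
```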
      <p>The knowledge base is learned in the process of hierarchical clustering. First, the entire set of
document indexes I0 is clustered. After the algorithm is executed, clusters are formed, each of which
contains a subset of the documents of the original set: I1 ∪ I2 ∪ … ∪ In = I0. A new partition is then
performed for each cluster obtained. The splitting continues as long as the value D / 2 is greater than
or equal to 2.</p>
      <p>Thus, a tree is constructed whose internal nodes contain indexes of cluster centers and whose
terminal nodes contain text data. The labels of the internal nodes then have to be formed as
syntagmatic patterns based on the contents of those nodes.</p>
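<p>The recursive splitting with its stopping rule can be sketched as follows. A trivial list-halving function stands in for the FCM-based split here, purely for illustration:</p>

```python
def build_tree(indexes, split):
    """Recursively split a set of document indexes until a cluster holds
    fewer than 4 documents (i.e. until D / 2 < 2); `split` is any
    two-way clustering function."""
    if len(indexes) / 2 < 2:
        return {"docs": indexes}            # terminal node: text data
    left, right = split(indexes)
    return {"children": [build_tree(left, split), build_tree(right, split)]}

# Stand-in splitter: halve the list (a real system would cluster with FCM).
halve = lambda xs: (xs[: len(xs) // 2], xs[len(xs) // 2:])

tree = build_tree(list(range(8)), halve)
```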
      <sec id="sec-2-1">
        <title>4. An algorithm for constructing syntagmatic patterns</title>
        <p>
          There are currently many approaches to the analysis of texts in natural language [
          <xref ref-type="bibr" rid="ref1 ref11 ref12 ref13 ref14 ref15">1, 11, 12, 13, 14, 15</xref>
          ].
Such approaches are based on statistical and/or linguistic methods for analyzing natural-language
texts. Natural-language text analysis methods are also used in the development of QA systems [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref5 ref6 ref7">1, 2, 3, 5, 6, 7</xref>
          ].
The analysis of natural-language texts consists of the following steps:
1. Graphematic analysis: the selection of the structural elements of the text (sentences, names,
dates, etc.).
2. Morphological analysis: the determination of the morphological features of the words of a
sentence (part of speech, gender, etc.).
3. Parsing: the selection of the syntactic units of a sentence (subject, predicate, etc.).
4. Semantic analysis: the determination of the meaning of a sentence.
        </p>
        <p>To form syntagmatic patterns, it is sufficient to use the first two steps of natural-language text
analysis: graphematic and morphological analysis.</p>
        <p>The algorithm for constructing syntagmatic patterns can be written in pseudocode as follows:</p>
        <p>For each terminal node of the knowledge base:
    For each document from the current terminal node:
        Split the text of the document into sentences.
        For each sentence of the document:
            Split the sentence into words.
            For each word of the sentence:
                Get its morphological features.
For each unprocessed internal node, except for the root node:
    Until the root node is reached or the parent of the current node is the root node:
        For each word of the index of the parent of the current node:
            Find all the documents in which the word appears.
            For each document:
                Select the syntagmatic units.
        Delete duplicate syntagmatic units.
        For each selected syntagmatic unit:
            Find similar syntagmatic units.
            Form candidate syntagmatic patterns.
        Solve the optimal coverage problem.
        Write the result to the node label.
        Mark the node as processed.
        Go to the parent node.</p>
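<p>A much-simplified sketch of the "select units, merge similar ones, form candidates" part of this algorithm is shown below. Adjacent part-of-speech pairs stand in for full syntagmatic units, and the toy tagger is an assumption of the example, not the authors' morphological analyzer:</p>

```python
from collections import Counter

# Toy word -> POS mapping; a real system would use morphological analysis.
TOY_POS = {"system": "NOUN", "answers": "VERB", "question": "NOUN",
           "user": "NOUN", "asks": "VERB"}

def syntagmatic_units(sentence):
    """Split a sentence into adjacent (POS, POS) pairs -- a crude stand-in
    for syntagmatic units built from morphological features."""
    tags = [TOY_POS.get(w.lower().strip("."), "X") for w in sentence.split()]
    return list(zip(tags, tags[1:]))

def pattern_candidates(sentences):
    """Count identical units across sentences; units seen more than once
    become candidates for syntagmatic patterns (duplicates collapse)."""
    counts = Counter(u for s in sentences for u in syntagmatic_units(s))
    return [unit for unit, n in counts.most_common() if n > 1]

cands = pattern_candidates(["The system answers the question.",
                            "The user asks the system."])
```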
        <p>Thus, as a result of this algorithm, the internal nodes of the knowledge base tree are marked with
syntagmatic patterns.</p>
        <p>5. The search for the answer to the question in the KB</p>
        <p>The learned KB of our QA system allows us to find answers to users' requests. First, the required
terminal node of the knowledge base has to be found. The internal node labels are used to find the
most relevant terminal node; each internal node of the knowledge base is marked with a syntagmatic
pattern.</p>
        <p>In pseudocode, the algorithm for finding the most relevant terminal node of the knowledge base
tree can be written as follows:</p>
        <p>Extract the syntagmatic units from the question received at the system input.
Set the position to the root node of the knowledge base tree.
For each child node of the current node:
    Count the syntagmatic patterns of the child node that correspond
    to the syntagmatic units of the question.
Set the position to the child node with the highest match.
Continue if the selected child node is not terminal; otherwise, return the terminal node.</p>
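<p>The greedy descent above can be sketched as follows. Matching is reduced to plain word overlap between the question's units and a node's pattern label, which is a simplification of the syntagmatic matching described in the paper; node and key names are illustrative:</p>

```python
def match_score(pattern: str, question_units: set) -> int:
    """Number of question units that occur in the node's pattern label."""
    return len(question_units & set(pattern.lower().split()))

def find_terminal(node: dict, question_units: set) -> dict:
    """Greedy descent: at each level pick the child whose pattern label
    matches the most question units; stop at a terminal node."""
    while node.get("children"):
        node = max(node["children"],
                   key=lambda c: match_score(c.get("pattern", ""), question_units))
    return node

# Toy KB tree with pattern-labeled internal nodes and text-bearing leaves.
kb = {"pattern": "", "children": [
    {"pattern": "fuzzy clustering algorithm", "children": [
        {"text": "FCM minimizes a weighted sum of squared distances."}]},
    {"pattern": "syntagmatic pattern construction", "children": [
        {"text": "Patterns are built from morphological features."}]},
]}

answer_node = find_terminal(kb, {"fuzzy", "clustering"})
```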
        <p>After the terminal node is found, the most relevant sentence has to be found in its text documents.
The answer to the question is the most relevant sentence.</p>
        <p>In pseudocode, the algorithm for finding the most relevant sentence in the text documents of the
terminal node can be written as follows:</p>
        <p>For each document of the terminal node:
    For each sentence of the current document:
        Count the syntagmatic patterns of the sentence that correspond
        to the syntagmatic units of the question.
    Choose the sentence with the best match.
Select the document with the highest match.</p>
        <p>Thus, the two algorithms presented above make it possible to find the most relevant answer to an
incoming question.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Experiments</title>
      <p>
        The materials of the Sberbank Data Science Contest [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] were used as the data for the experiments. These
materials contain 50,365 entries of the form "paragraph, question, answer." The answer to the question
is always an exact text substring of the paragraph, up to punctuation and letter case.
Each paragraph contains several sentences.
      </p>
      <p>Two question-answer systems were used to conduct the experiment. Their knowledge bases were
learned using the modified FCM algorithm. The first knowledge base contained sets of "term-frequency"
pairs as the labels of internal nodes. The second knowledge base contained syntagmatic patterns as the
labels of internal nodes.</p>
      <p>A proximity measure was used to find the most relevant document and/or sentence in the first
knowledge base. The proximity measure is the square of the Euclidean distance:</p>
      <p>Dist(Ici, Iq) = Σ(w=1..W) (fwc − fwq)²,
where Ici is the index of the i-th document c of the terminal node;
Iq is the index of the received question;
W is the number of words in the index Ici;
fwc, fwq are the frequencies of occurrence of the word w in the indexes Ici and Iq.</p>
      <p>The most relevant document and/or sentence is the document and/or sentence with the minimum
proximity measure.</p>
      <p>During the experiment, 50,365 questions were submitted to both question-answer systems. A
sentence from the paragraph was given as the answer to each question. The result was considered
successful if the reference answer was a substring of the found sentence. The results of the
experiments are presented in Table 1.</p>
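<p>The squared-Euclidean proximity measure over term-frequency indexes can be sketched as a short function; the dictionary-based index representation here is an assumption of the example:</p>

```python
def dist_sq(index_c: dict, index_q: dict) -> float:
    """Squared Euclidean distance between two term-frequency indexes,
    summed over the words of the document index."""
    return sum((f - index_q.get(w, 0)) ** 2 for w, f in index_c.items())

# A toy document index and two candidate question indexes.
doc = {"network": 3, "semantic": 2, "node": 1}
q1 = {"semantic": 2, "network": 3}   # close to the document
q2 = {"router": 1}                    # shares no terms with it
```

The most relevant document is the one minimizing `dist_sq`, so here `q1` would be matched to `doc` rather than `q2`.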
    </sec>
    <sec id="sec-4">
      <title>7. Conclusion</title>
      <p>Thus, the developed syntagmatic approach to the development of question-answer systems is effective.
This approach can be used to develop the following types of software systems:
- a system for automating the process of interaction with customers, based on the analysis of
the knowledge base and corporate correspondence;
- a decision support system based on the analysis of the knowledge base and use cases;
- a system for verifying the information flows of an enterprise to ensure information security;
- a system for automating the work of the technical support service, based on the analysis of
the knowledge base and use cases.</p>
      <p>In the future, we plan to extend the developed approach to find answers to questions posed in an
implicit form.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Jurafsky, D., Martin, J.H. Speech and Language Processing. Available at: https://web.stanford.edu/~jurafsky/slp3/28.pdf (accessed: 03.05.2018)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Berant, J., Chou, A., Frostig, R., Liang, P. Semantic parsing on Freebase from question-answer pairs // Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2013. pp. 1533-1544.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Bordes, A., Chopra, S., Weston, J. Question answering with subgraph embeddings. Available at: https://arxiv.org/pdf/1406.3676.pdf (accessed: 03.05.2018)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Epstein, E.A., Schor, M.I., Iyer, B., Lally, A., Brown, E.W., Cwiklik, J. Making Watson fast // IBM Journal of Research and Development. 2012. Vol. 56(3.4). pp. 15-19.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A.A., Lally, A., Murdock, J.W., Nyberg, E., Prager, J., et al. Building Watson: An overview of the DeepQA project // AI Magazine. 2010. Vol. 31(3). pp. 59-79.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Gallagher, S., Zadrozhny, W., Shalaby, W., Avadhani, A. Watsonsim: Overview of a question answering engine. Available at: https://arxiv.org/pdf/1412.0879.pdf (accessed: 29.04.2018)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Ferrucci, D., Lally, A. UIMA: An architectural approach to unstructured information processing in the corporate research environment // Nat. Lang. Eng. 2004. Vol. 10(3-4). pp. 327-348.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J. Freebase: a collaboratively created graph database for structuring human knowledge // Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. 2008. pp. 1247-1250.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Chen, D., Manning, C.D. A fast and accurate dependency parser using neural networks // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. pp. 740-750.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Zarubin, A., Koval, A., Filippov, A., Moshkin, V. Application of syntagmatic patterns to evaluate answers to open-ended questions // Proceedings of the 2017 Communications in Computer and Information Science (CITDS). 2017. pp. 150-162.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Zarubin, A.A., Koval, A.R., Moshkin, V.S., Filippov, A.A. Construction of the problem area ontology based on the syntagmatic analysis of external wiki-resources. Available at: http://ceur-ws.org/Vol-1903/paper26.pdf (accessed: 04.05.2018)</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Boyarskiy, K.K., Kanevskiy, Ye.A. Semantic and syntactic parser SemSin // Nauchno-tekhnicheskiy vestnik informatsionnykh tekhnologiy, mekhaniki i optiki. 2015. Vol. 5. pp. 869-876.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Artemov, M.A., Vladimirov, A.N., Seleznev, K.E. Review of Russian NLP systems. Available at: http://www.vestnik.vsu.ru/pdf/analiz/2013/02/2013-02-31.pdf (accessed: 03.11.2017)</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Automatic text processing. Available at: http://aot.ru (accessed: 29.04.2018)</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Lally, A., Prager, J.M., McCord, M.C., Boguraev, B., Patwardhan, S., Fan, J., Fodor, P., Chu-Carroll, J. Question analysis: How Watson reads a clue // IBM Journal of Research and Development. 2012. Vol. 56(3.4). pp. 2-14.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Sberbank Data Science Contest. Available at: https://contest.sdsj.ru (accessed: 04.05.2018)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>