Research and design of IT-based Kazakh terminology
                                recognition system construction techniques
                                Muheyat Niyazbek1,2 and Kuenssaule Talp3, ∗

                                1 College of Computer Science and Technology, Xinjiang University, 830017, Urumqi, China
                                2 Xinjiang Key Laboratory of Multilingual Information Technology, 830017, Urumqi, China
                                3 College of Chinese Medicine of Xinjiang Medical University, Urumqi, 830011, Xinjiang, China


                                                  Abstract
                                                  The design and implementation of terminology identification systems in the IT domain is
                                                  one of the most significant measures for using terminological resources in this field
                                                  more efficiently. This article describes the research and design of an IT-based
                                                  terminology recognition system for the Kazakh language. The system combines Conditional
                                                  Random Fields (CRF) with manual correction, and analyzes the patterns of term formation
                                                  and the related term recognition methods on the basis of the characteristics of the
                                                  technical terms themselves in the IT domain.

                                                  Keywords
                                                  information technology field, terminology recognition, system design


                                1. Introduction

                                With the expansion of applications for processing Chinese language information, the
                                demand for term retrieval across fields and languages is becoming increasingly
                                urgent. In particular, using the computer as a tool to build a platform for
                                identifying IT terminology in the Kazakh language is increasingly important for
                                national language informatization, including Kazakh natural language information
                                processing, Kazakh linguistics studies, information security retrieval, machine
                                translation, corpus construction, and IT term repositories [1]. A term is a
                                linguistic unit representing the primary and fundamental notions of a particular
                                academic field; it represents the field's central knowledge and facilitates rapid
                                access to specialized knowledge, so automatic term retrieval has naturally become a
                                research hotspot for related professionals. Automatic term acquisition is a major
                                research task in the information processing domain,


                                ICCIC 2024: International Conference on Computer and Intelligent Control, June 29–30, 2024, Kuala Lumpur,
                                Malaysia
                                * Corresponding author.

                                   muheyatn@xju.edu.cn (M. Niyazbek); kuenssauletalp@163.com (K. Talp)
                                   0009-0002-2051-6103 (M. Niyazbek); 0009-0007-1507-0844 (K. Talp)
                                           © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073
and it is widely used in lexicography, ontology construction, machine translation, and
other fields. Terminology extraction is one of the pivotal technologies for constructing
and extending large-scale ontologies in automatic or semi-automatic ways. In the past few
years, the importance of term identification methods has been recognized and a great deal
of research has been carried out. The widely used term extraction methods are primarily
classified into statistics-based approaches, machine learning methods, linguistic
rule-based methods, and hybrid methods that combine them. The system presented in this
article is designed by combining linguistic rules with Conditional Random Fields (CRF)
and manual correction. Through the design and implementation of a system for identifying
Kazakh terminology in the IT domain, we hope to contribute to the excavation, inheritance,
and innovation of national culture, to national scientific, technological, and
educational work, and to the security, stability, and prosperity of the community.


2. System design

The framework is built on two resources: an electronic corpus of texts collected from
various Kazakh-language websites and from IT textbooks for primary and secondary schools,
and a processed ("cooked") corpus with completed word extraction, affix extraction, and
lexical annotation. The processed corpus is obtained by lexical analysis of the original
corpus using the linguistic corpus tools now in use in multilingual IT laboratories.
After the processed corpus is input into the rule-based system, it is further refined
using a term dictionary and a rule-based term collocation system to produce the final
list of candidate terms and the annotated term corpus [2-4]. The candidate term
annotation corpus is then corrected manually to generate the training corpus. The
detailed flow of the process is shown in Figure 1.


Figure 1: Workflow diagram.
2.1 The Design of Database Structure
For convenience, the system is designed with a client/server (C/S) architecture, and
Microsoft SQL Server 2005 is used as the backend data server for corpus storage. The
tables in the database include the corpus table, permission table, daily newspaper
category table, periodical category table, textbook category table, relations table, and
so on. Information is recorded according to the format of these tables. For example, the
corpus table includes number, title, save path, date of entry, and number of paragraphs;
the user table includes user name, user password, and user level; and the management
table includes the administrator name, the corresponding password, and the corresponding
permissions.
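
The table layout described above can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 module instead of Microsoft SQL Server 2005, and every table name, column name, and helper function is hypothetical, derived only from the field lists given in the text.

```python
import sqlite3

# Hypothetical schema: names are illustrative, taken from the field lists in the text.
SCHEMA = """
CREATE TABLE corpus (
    number          INTEGER PRIMARY KEY,  -- corpus entry number
    title           TEXT NOT NULL,
    save_path       TEXT NOT NULL,        -- where the corpus file is stored
    entry_date      TEXT NOT NULL,        -- date of entry, ISO format assumed
    paragraph_count INTEGER NOT NULL
);
CREATE TABLE users (
    user_name     TEXT PRIMARY KEY,
    password_hash TEXT NOT NULL,          -- assumption: a hash is stored, not the raw password
    user_level    TEXT NOT NULL
);
"""


def create_database(path=":memory:"):
    """Create the sketch database and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_corpus_entry(conn, title, save_path, entry_date, paragraph_count):
    """Insert one corpus record and return its assigned number."""
    cursor = conn.execute(
        "INSERT INTO corpus (title, save_path, entry_date, paragraph_count) "
        "VALUES (?, ?, ?, ?)",
        (title, save_path, entry_date, paragraph_count),
    )
    conn.commit()
    return cursor.lastrowid
```

The category and relations tables mentioned in the text are omitted here because their fields are not listed.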

2.2 Corpus Classification
Text categorization is the process of assigning a given text to one or more categories
according to its features, such as content and attributes, under a given classification
system. The research topic therefore involves many issues in natural language
understanding and pattern recognition, such as how to understand and classify the content
of a text; a successful text categorization system is not only a natural language
processing system but also a typical pattern recognition system. So far, text
categorization for the Kazakh language is still largely done manually. Manual
classification limits the speed of Kazakh text classification and makes it difficult to
meet the needs of social development, so research and development of a fast and accurate
Kazakh text classification method that can replace manual labor is necessary.

2.3 Training corpus preprocessing
Fully manual annotation of a Kazakh corpus is inefficient and cannot achieve a high
accuracy rate, so a dictionary-based matching method is more appropriate. First, we
collect a considerable amount of Kazakh text from the website TIANSHAN.com. Next, we
compile the common specialized terms in the IT field by searching Kazakh dictionaries.
Finally, we write a program that finds the dictionary terms appearing in the corpus, and
we annotate them manually.
   BIO annotation is a sequence labeling method for marking entities or lexemes in text.
Each word is labeled B, I, or O: B means that the word is at the beginning of an entity
or specialized term, I means that the word is inside an entity or specialized term, and O
means that the word is not part of an entity or specialized term. Each specialized term
in the Kazakh corpus is marked using BIO annotation.
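
The dictionary matching and BIO labeling described above can be sketched as follows. This is a minimal illustration, not the authors' program: the function name, the longest-match strategy, and the English stand-in tokens are assumptions, and the matching is done on exact surface forms, so inflected Kazakh words would need stemming first.

```python
def bio_annotate(tokens, term_dictionary):
    """Label tokens with B/I/O using longest-match lookup against a term dictionary.

    tokens: list of word strings for one sentence.
    term_dictionary: iterable of terms, each a sequence of one or more words.
    """
    terms = {tuple(term) for term in term_dictionary}
    max_len = max((len(term) for term in terms), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest span first so multi-word terms win over their prefixes.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in terms:
                matched = n
                break
        if matched:
            labels[i] = "B"                      # first word of the term
            for j in range(i + 1, i + matched):
                labels[j] = "I"                  # remaining words of the term
            i += matched
        else:
            i += 1                               # word stays labeled "O"
    return labels


# English stand-ins for Kazakh tokens, for illustration only.
sentence = ["the", "operating", "system", "loads", "a", "driver"]
dictionary = [("operating", "system"), ("driver",)]
print(bio_annotate(sentence, dictionary))
# ['O', 'B', 'I', 'O', 'O', 'B']
```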

2.4 Kazakh terminology recognition
Kazakh terminology recognition requires corpus acquisition and labeling. First, we obtain
the commonly used terms in the IT field by looking them up in the Kazakh dictionary. We
then obtain part of the Kazakh corpus from TIANSHAN.com, find the specialized terms in
the acquired corpus by string matching, and finally perform manual BIO annotation. Unlike
the Chinese model, the Kazakh model adopts a BLSTM-CNN-CRF architecture, which has one
additional CNN layer; the CNN layer can capture local features in the input sequences,
which improves the accuracy of the model. Once training is complete, when a raw Kazakh
corpus is fed into the model, the information technology terminology in that corpus can
be recognized.

2.5 User Management Module
The user management module is mainly responsible for managing the registered users of the
entire corpus resource management platform and for setting and enforcing each user's
permissions. It also handles adding and deleting users and changing their passwords.
There are three levels of users in this system: system administrators, editors, and
ordinary users. The system administrator has full control over all resources, whereas
other users can use the language resources of the corpus only according to their
permissions.
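
The three user levels can be sketched as a simple permission check. This is a hypothetical illustration: the text states only that administrators have full control and that other users are limited by their permissions, so the action names and the permission sets for editors and ordinary users below are assumptions.

```python
from enum import Enum


class UserLevel(Enum):
    ADMINISTRATOR = "system administrator"
    EDITOR = "editor"
    ORDINARY = "ordinary user"


# Hypothetical permission sets; the paper does not list the actual permissions.
PERMISSIONS = {
    UserLevel.EDITOR: {"read_corpus", "edit_corpus"},
    UserLevel.ORDINARY: {"read_corpus"},
}


def is_allowed(level, action):
    """Return True if a user at the given level may perform the action."""
    if level is UserLevel.ADMINISTRATOR:
        return True  # full control over all resources, as stated in the text
    return action in PERMISSIONS.get(level, set())
```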


3. System Functional Structures

In terms of system functionality, term recognition in the Kazakh IT field is treated as a
sequence labeling problem and is handled with conditional random fields: quantified
features of term distribution are used as training features, and the model is trained
with term feature templates using a Conditional Random Fields (CRF) toolkit.
    The system consists of two subsystems: term annotation corpus and CRF pattern
recognition. The term annotation corpus subsystem comprises the preprocessing, training
corpus creation, term recognition, term extraction, and delimitation rule sections, among
others. The CRF pattern recognition subsystem comprises the model parameter, feature
selection, and feature template selection sections. The functional structure of the
system is shown in Figure 2.
Figure 2: System Functional Structures.

3.1 Generation of the training corpus
The linguistic data stored in the Information Technology Lexical Corpus arise from actual
language use and are the basic material through which computers carry linguistic
knowledge; the raw corpus must be processed before it becomes usable material. Taking the
processed corpus from the system as input, terms are retrieved from the given text
according to grammatical rules and then corrected further to generate the training
corpus. A term can be a single word or a phrase. Terms in the Kazakh information
technology domain have various structures: some consist of one word or of two words
joined together, and some contain supplementary components or are nested, following
patterns such as noun+noun, adjective+noun, noun+verb, and so on. The entire system is
organized into sections such as term extraction, training corpus generation, term
recognition, and exiting the system. The training corpus generation section contains
modules for opening an XML file, opening a terminology file, annotating terms in the XML
file, saving the annotation file, and so on, which can be used for further related
operations as needed, such as accessing the XML annotation file in the thesaurus [5-7].
The interface also provides navigation options such as previous paragraph and next
paragraph, each with its own operating procedure. The detailed operation interface of the
module for generating the training corpus is shown in Figure 3.
                    Figure 3: Interface for generating training corpus.

3.2 Term extraction
Because terms may be single-word or multi-word, and because terminology takes different
forms in different languages, such as noun + noun, adjective + noun, noun + verb, etc.,
the extraction rules are defined on the basis of the characteristics of the language and
the compositional structure of its terms. This module mainly handles the information
involved in term extraction. After entering the page, the interface is divided into left
and right panes: the left pane supports operations such as opening a document,
extraction, saving, exiting, and term statistics, and the right pane shows the extracted
terms and the number of extracted items. The detailed operation interface of the system's
terminology extraction architecture is shown in Figure 4 below.


Figure 4: Architecture diagram for automatic term extraction.
3.3 Term Recognition
The module contains three sections: training, testing, and analysis, each of which leads
to its own operation interface. In the training section, users can choose options such as
adding a corpus, feature extraction, and model training, and carry out the corresponding
operations under each option. The testing section includes test corpus, term recognition,
result saving, and quick testing. The analysis section displays the number of correctly
identified terms, the number of incorrectly recognized terms, the number of terms labeled
as terms by the system, the number of undecided terms, the accuracy, the recall, the
F-value, and so on.
The term recognition method is based on pre-selection, i.e., candidate terms are selected
first. Although Kazakh is an agglutinative language, the lexical properties of
information technology terms show a certain regularity. Through analysis and observation,
a list of lexical rules for information technology terms is prepared; the rules are then
matched against text that has been labeled with lexical properties, and the matching
words or phrases are extracted as candidate terms. The detailed operation interface of
the term recognition system is shown in Figure 5 below.


Figure 5: System term recognition interface.
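
The pre-selection step in Section 3.3, in which lexical rules are matched against text labeled with lexical properties, can be sketched as follows. This is a minimal illustration under stated assumptions: the tag names (N, ADJ, V), the rule list, and the function name are hypothetical, and only the three structures named in the text are encoded.

```python
# Hypothetical rule list built from the structures named in the text:
# noun+noun, adjective+noun, noun+verb. Tag names are illustrative.
TERM_PATTERNS = [("N", "N"), ("ADJ", "N"), ("N", "V")]


def candidate_terms(tagged_tokens, patterns=TERM_PATTERNS):
    """Return (start, end, words) for every span whose tag sequence matches a rule.

    tagged_tokens: list of (word, tag) pairs for one sentence.
    """
    tags = [tag for _, tag in tagged_tokens]
    candidates = []
    for start in range(len(tagged_tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            if tuple(tags[start:end]) == pattern:
                words = tuple(word for word, _ in tagged_tokens[start:end])
                candidates.append((start, end, words))
    return candidates


# English stand-ins for Kazakh tokens, for illustration only.
tagged = [("network", "N"), ("protocol", "N"), ("is", "V"), ("fast", "ADJ")]
print(candidate_terms(tagged))
# [(0, 2, ('network', 'protocol')), (1, 3, ('protocol', 'is'))]
```

The second candidate in the example is not a real term, which illustrates why pre-selected candidates still need to be confirmed in a later step.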

3.4 Feature Template Recognition
Different languages have distinct grammatical principles and their own unique
characteristics. Because terms are highly normative, identifying and incorporating them
is a lengthy process, so the development of terminology dictionaries in many languages
often lags behind the emergence of new terms. Computer-assisted identification of
candidate terms, followed by confirmation by experts, would therefore be greatly
beneficial. The feature templates identified on the basis of the composition of IT terms
in the Kazakh language can be divided into the following categories: stem, affix, lexical
properties, and terminological annotation information for the left and right words.
Examples are the current word (CWord), the first word on the right (RWord), the lexical
properties of the first word on the left (LPos), the affix of the current word (CAffix),
and the IT term annotation of the first word on the right (RIT). The interpenetration
between terms and ordinary words is reflected in the fact that a term is itself a word, a
term can be generalized into a common word, and an ordinary word can be specialized into
a term. The same word can be used in different ways: it may be a term in one passage and
a common word in another.
    Kazakh is an agglutinative language: words in Kazakh texts are formed by attaching
morphemes to roots, and can be segmented into morphemes and lexemes. Kazakh differs from
Chinese in that its text is word-based, and in this respect it is the same as English;
however, Kazakh is agglutinative and rich in contextual information, and the morphology
of Kazakh words is richer than that of English. Based on the characteristics of Kazakh
terminology in the IT field, this paper defines the feature space as:

Table 1
Term recognition feature space
 No   Feature   Significance
  1   LWord     First word on the left
  2   CWord     Current word
  3   RWord     First word on the right
  4   LLPos     Lexical properties of the second word on the left
  5   LPos      Lexical properties of the first word on the left
  6   CPos      Lexical properties of the current word
  7   RPos      Lexical properties of the first word on the right
  8   RRPos     Lexical properties of the second word on the right
   From Table 1, suitable features are selected to construct two representative composite
feature templates. Each feature function takes its values from the context of the current
word; the function values are combined to form the condition of the feature, and the
label of the word gives the feature its outcome, thus extracting the feature:
   Template 1: [RRPos, RRTE, RWord, RAffix, RPos, RTE, CPos, CTE, CWord, CAffix, LWord,
LAffix, LPos, LTE]. This template examines the impact of one word to the left and two
words to the right of the candidate word on the experimental outcomes.
   Template 2: [RRPos, RRTE, RWord, RAffix, RPos, RTE, CPos, CTE, CWord, CAffix]. This
template examines the effect of the first and second words to the right of the candidate
word on the experimental outcomes.
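
How a template turns a window of context into features for one word can be sketched as follows. This is an illustration under stated assumptions, not the authors' CRF toolkit configuration: the token representation (a dict with word, affix, pos, and te fields), the reading of TE as the candidate term annotation, the padding value, and the function names are all hypothetical.

```python
# Feature names copied from the two templates in the text.
TEMPLATE_1 = ["RRPos", "RRTE", "RWord", "RAffix", "RPos", "RTE", "CPos", "CTE",
              "CWord", "CAffix", "LWord", "LAffix", "LPos", "LTE"]
TEMPLATE_2 = TEMPLATE_1[:10]  # same list without the left-context features

# Position prefix -> offset from the current word.
OFFSETS = {"LL": -2, "RR": 2, "L": -1, "R": 1, "C": 0}
PADDING = "<PAD>"  # assumed value for positions outside the sentence


def split_feature_name(name):
    """Split e.g. 'RRPos' into (offset 2, field 'pos')."""
    # Two-letter prefixes are checked first so 'RRPos' is not read as 'R' + 'RPos'.
    for prefix in ("LL", "RR", "L", "R", "C"):
        if name.startswith(prefix):
            return OFFSETS[prefix], name[len(prefix):].lower()
    raise ValueError(f"unknown feature name: {name}")


def extract_features(sentence, index, template):
    """Build the feature dict for sentence[index] according to a template.

    sentence: list of dicts with keys 'word', 'affix', 'pos', 'te' (assumed layout).
    """
    features = {}
    for name in template:
        offset, field = split_feature_name(name)
        position = index + offset
        if 0 <= position < len(sentence):
            features[name] = sentence[position][field]
        else:
            features[name] = PADDING
    return features
```

Calling extract_features once per word with TEMPLATE_1 or TEMPLATE_2 would give the two feature sets whose effect on recognition the text proposes to compare.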
3.5 Experimental data
This article uses the following evaluation metrics: the accuracy rate of term recognition
and the error rate. They are defined as follows: accuracy rate (P) = number of terms
correctly recognized by the system / total number of terms recognized by the system ×
100%; error rate (E) = 1 - accuracy rate.
The system was evaluated with open tests using annotated training corpora of different
sizes. The test results are shown in Table 2.

Table 2
Results of the term recognition test
      Corpus size      Test Type       P(%)     R(%)      E(%)      L(%)       F-Value(%)
   1200 sentences      open-ended      80.15    79.76     20.53     21.30          79.54
   2900 sentences      open-ended      79.27    78.97     17.79     18.11          78.03
   4900 sentences      open-ended      80.01    79.33     16.05     17.73          79.06
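
The metric definitions in Section 3.5 can be written out as a short calculation. In this sketch the accuracy rate and error rate follow the definitions given in the text; recall and F-value use the standard definitions, which the text does not spell out, so they are an assumption, as are the function and argument names. The reported values in Table 2 may have been computed differently.

```python
def evaluate(correct, recognized, gold):
    """Compute term recognition metrics as percentages.

    correct: number of terms correctly recognized by the system.
    recognized: total number of terms recognized by the system.
    gold: total number of terms in the reference annotation (assumed input).
    """
    precision = 100.0 * correct / recognized if recognized else 0.0
    error = 100.0 - precision                       # E = 1 - P, as defined in the text
    recall = 100.0 * correct / gold if gold else 0.0  # standard definition (assumption)
    if precision + recall:
        f_value = 2 * precision * recall / (precision + recall)  # standard F1 (assumption)
    else:
        f_value = 0.0
    return {"P": precision, "E": error, "R": recall, "F": f_value}
```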

4. Conclusion

Building a terminology recognition system is a large-scale project with a long project
period and a large data requirement. The development of the terminology recognition
system is still in its initial stage and has a long way to go. At present, only the
collection of raw data and the organization of basic information on Kazakh-language
terminology in the IT field have been completed. Sustained effort is needed to refine the
methods of corpus tool processing and analysis and to continuously improve the system in
order to meet the various needs of Kazakh-language information research.


References

[1] Q. Wang, C. Zhang, H. Ding, “Formation of Academic Discourse System Based on
    Terminology,” China Terminology, vol. 26, no. 1, pp. 63-67, 2024.
[2] J. Jiang, X. Qi, “Design and Realization of Online Query System of Chinese-English
    Terminology of Chinese Medicine,” China Terminology, vol. 24, no. 4, pp. 92-96, 2022.
[3] Z. Li, Y. Zhong, H. Wang, J. Liu, Y. Sun, “Research on Domain Term Extraction Method
    Based on Deep Learning and Statistical Information,” Frontiers of Data and Computing,
    vol. 4, no. 2, pp. 87-98, 2022.
[4] Y. Jia, D. Zhu, “Medical Named Entity Recognition Based on Deep Learning,” Computer
    Systems & Applications, vol. 31, no. 9, pp. 70-81, 2022.
[5] X. Wang, H. Tao, “Research on Chinese Named Entity Recognition based on Deep
    Learning,” Journal of Chengdu University of Information Technology, vol. 35, no. 3,
    pp. 264-270, 2020.
[6] J. Jiang, X. Qi, “Design and Realization of Online Query System of Chinese-English
     Terminology of Chinese Medicine,” China Terminology, vol. 24, no. 2, pp. 92-96, 2022.
[7] X. Wei, “The Name and Reality of China Terminology: Some Thoughts on the
     Identification of Its Problem Domain,” China Terminology, vol. 23, no. 2, pp. 3-10,
     2021.
[8] M. Conrado, T. Pardo, S. Rezende, “A Machine Learning Approach to Automatic Term
     Extraction using a Rich Feature Set,” NAACL Student Research Workshop, 2013.
[9] G. Lample, M. Ballesteros, S. Subramanian, “Neural Architectures for Named Entity
     Recognition,” arXiv, 2016.
[10] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep
     Bidirectional Transformers for Language Understanding,” In Proceedings of the 2019
     Conference of the North American Chapter of the Association for Computational
     Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.
     4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics, 2019.
[11] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer,
     “Deep Contextualized Word Representations,” In Proceedings of the 2018 Conference
     of the North American Chapter of the Association for Computational Linguistics:
     Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, New
     Orleans, Louisiana. Association for Computational Linguistics, 2018.
[12] K. Takeuchi, N. Collier, “Analysis of Machine Learning Model for Technical Term
     Extraction in Biological Science Papers,” IPSJ SIG Notes, 2002.
[13] N. Chatterjee, N. Kaushik, “Automatic Extraction of Agriculture Terms from Domain
     Text: A Survey of Tools and Techniques,” 2020.
[14] T. Yang, K. Hu, “Study on clinical terminology extraction of traditional Chinese
     medicine based on internal aggregation and boundary degree of freedom of character
     strings,” IEEE International Conference on Bioinformatics & Biomedicine, 2017.
[15] C. Xu, et al., “Chinese patent terminology extraction,” Computer Engineering and
     Design, 2013.