=Paper=
{{Paper
|id=Vol-2362/paper10
|storemode=property
|title=Development of System for Auto-Tagging Articles, Based on Neural Network
|pdfUrl=https://ceur-ws.org/Vol-2362/paper10.pdf
|volume=Vol-2362
|authors=Pavlo Mukalov, Oleksandr Zelinskyi, Roman Levkovych, Petro Tarnavskyi, Anastasiia Pylyp, Nataliya Shakhovska
|dblpUrl=https://dblp.org/rec/conf/colins/MukalovZLTPS19
}}
==Development of System for Auto-Tagging Articles, Based on Neural Network==
Pavlo Mukalov1 [0000-0002-0808-5809], Oleksandr Zelinskyi2 [0000-0003-1247-7511], Roman Levkovych2 [0000-0001-9393-714X], Petro Tarnavskyi2 [0000-0003-3265-8168], Anastasiia Pylyp1 [0000-0002-0222-4687], Nataliya Shakhovska1 [0000-0002-6875-8534]

1 Lviv Polytechnic National University, Lviv 79013, Ukraine
2 Ivan Franko National University, Lviv 79000, Ukraine

pmykalov@gmail.com, sashko.zel2000@gmail.com, rlevkovych098@gmail.com, petro28062000@gmai.com, anastasiia.pylyp@gmail.com, nataliya.b.shakhovska@lpnu.ua

Abstract. The paper describes the possibilities of natural language processing in data classification. In the last decade, AI technologies have become widespread and easy to implement and use. One of the most promising technologies in the AI field is natural language processing. Such technologies will become a central part of future life because they save a lot of time. In addition, the paper shows a complete article-tagging cycle using neural networks, from data acquisition to tag storing.

Keywords: auto-tagging, language processing, neural network with LSTM layers, multilayered system.

1 Introduction

Natural language processing (NLP) is a part of computer science and artificial intelligence concerned with the interactions between computers and human (natural) language. The main tasks of NLP are data extraction, speech synthesis, language generation, speech recognition, machine translation, information retrieval and many others. That is why NLP is used in many spheres of life, from auto-completion on the iPhone to marketing and advertising.

Many modern resources provide an analysis of trends over the past few years, while few focus on predicting the popularity of technologies today. The basis of the project presented in this paper is the processing of articles from popular web forums using neural networks. The project currently focuses on processing articles and anticipating trends in the IT industry. That is why this project can be useful when people choose a stack of technologies for a new project.

2 State of the art

Today, the basic solution to the problem of named entity recognition is a combination of gazetteers, basic rules and Conditional Random Fields (CRF). CRF is one of the classic machine learning algorithms. Such a set of algorithms was used, for example, as a baseline in the WNUT2015 competition [1]. Most of the participants used CRF, as well as classic feed-forward neural networks (FFNN) and Markov algorithms. In addition to analysing the texts, many participants also used vector representations of words (word embeddings) built with the word2vec and GloVe algorithms.

In [2], the authors propose an algorithm for automatically constructing gazetteers based on WordNet and Wikipedia by identifying the type of entity while moving up through the hierarchy of hypernyms. The method shows rather weak results for such types of named entities as person and organization, but it works better for geographic locations. In addition, it is limited to the data available in the mentioned systems. As far as we know, this method has not received further development.

Today, rule-based systems are considered rather primitive, suitable only for automating the extraction of information that is already quite well structured.
The main disadvantage of rule-based systems is their limited scope: each new knowledge domain requires the development of its own set of rules capable of taking into account the specificity of texts in that area, which requires a large amount of human resources. At the same time, the performance of more automated systems based on machine learning algorithms has increased enough to compete with the best rule-based systems. In [3], the authors showed that it is possible to develop a rule-based system comparable in quality with machine learning algorithms if one first spends about 8 person-weeks developing rules for a specific subject area.

Speaking of machine learning algorithms, it is worthwhile to separate them into two groups. The first is supervised learning, where the algorithm is trained on a sufficient number of manually pre-labeled examples. The second is unsupervised learning, where the algorithm learns to recognize entities using only the information contained in the processed data and some previously known heuristics. Supervised algorithms have a disadvantage akin to rule-based systems: their training requires a rather time-consuming process of preparing training data.

Among supervised machine learning algorithms, most of the classical methods reduce the task of recognizing entities to sequence labeling and the subsequent element-by-element classification. A typical example is the CRF mentioned above: it is one of the most popular models for finding named entities, assigning tags based on attributes of the current word while taking into account both the previous and the subsequent words in the text. This algorithm forms the basis of a number of popular sequence taggers [4-5].

The next group of sequence-labeling algorithms is based on maximum entropy [6]. They predict the label of a sequence element based on the probabilities of occurrence of certain attributes of a word and its predecessors. A related group is Markov models, which treat text markup as a Markov process, where the states are the required classes and the label probabilities of the current element are determined by the previous state of the process.

More sophisticated sequence classification algorithms rely on complex neural network models such as LSTM, which has gained popularity in working with text data due to its ability to take into account the history of the sequences passed through it. Examples of using such models can be found in [7]. The use of bidirectional LSTM networks makes it possible to take into account the attributes of both previous and subsequent words in a sentence when assigning an entity tag. In [8], the authors compare the performance of unidirectional and bidirectional LSTM with a CRF at the network output, which accounts for the tags of adjacent words and improves quality.

Unlike supervised algorithms, unsupervised algorithms often identify entities in the text by searching for similar words in a document, attempting to group named entities based on context. An example of this approach is [9, 10], in which the authors use Word2vec to generate clusters of words with similar contexts. This approach shows better results than the classical CRF for languages with a small volume of labeled corpora.

A C-LSTM neural network for text classification is used in [11].
The authors combine a convolutional neural network (CNN) and a recurrent neural network (RNN) to extract sequences with higher-level phrase representations. In [12, 13] a method of text classification based on a Big Data approach and pattern recognition is proposed. However, all these approaches mostly address text classification only.

The purpose of this paper is to design a system that looks for articles, auto-tags them and presents the results through a RESTful subsystem.

3 Main part

The system architecture and the main methods for text analysis are proposed in this section.

3.1 System architecture

The proposed system consists of three main parts:

1. Data providing – obtaining open data from open web resources using their APIs.
2. Data processing – processing of articles using neural networks.
3. Calculation of statistical indicators.

The architecture of the system is represented in Fig. 1.

Fig. 1. The system architecture

First, the Data Providers scrape the text of articles from open web resources using their APIs and send it to the database. Then the Tag Classifier obtains data from the database, processes the text and writes the results back to the same database. The Information Representor retrieves data from the database and displays it on a web site.

Azure Functions are used to run the data providers periodically without explicitly managing their invocation, Html Agility Pack is used for web scraping, Python 3.6.6 and Keras are used for article tagging, SQLAlchemy and Entity Framework are used for working with the database, and React JS is used to display data on the site.

3.2 Data Providers

In order to determine the popularity of certain tags, we need to find what percentage of articles carries a given tag. The more articles are processed, the more precise this statistic is. Manually processing articles from different sites would take a lot of time, so an optimal solution is to parse an article by its link. In order to find links to articles, we parse the site tabs "Newer", "Latest", etc.

A good tool for parsing web pages is HtmlAgilityPack. With this tool, our own parsers were created. Using these parsers, we obtain links to new articles; another parser processes each received link, after which all the necessary information about the article is transferred to the database.

The data parsers are designed so that periodic diagnostics can be run on them. These diagnostics check whether the site markup for which a parser was written has changed. If the markup has changed, the diagnostics inform the developer, who should then update the parser for this site.

In order for our parsers to work automatically, we decided to use Azure Functions, a convenient solution for running functions such as our parsers in the cloud: you write only the code for the problem locally and do not worry about how to run it. Azure Functions can be used with different languages, such as C#, F#, Node.js, Java, or PHP. The Azure Function automatically starts every 24 hours and adds new articles to the database, which improves the accuracy of our statistics.
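To illustrate the provider logic described above, including the markup-change diagnostic, here is a minimal sketch in Python using requests and BeautifulSoup instead of the C#/HtmlAgilityPack stack the authors used; the forum URL, the CSS selector and the local SQLite store are hypothetical placeholders, not details from the paper.

<pre>
# Hypothetical sketch of a data provider in Python (requests + BeautifulSoup) instead
# of the C#/HtmlAgilityPack stack used in the paper; URLs, selectors and the local
# SQLite store are placeholders for illustration only.
import sqlite3
import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://example-tech-forum.com/latest"   # placeholder "Latest" tab

def fetch_article_links(listing_url):
    """Collect links to new articles from the listing page."""
    soup = BeautifulSoup(requests.get(listing_url, timeout=10).text, "html.parser")
    return [a["href"] for a in soup.select("a.article-link")]   # placeholder selector

def fetch_article(url):
    """Extract the title and body text of a single article."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    title = soup.find("h1")
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    return (title.get_text(strip=True) if title else ""), body

def run_provider(db_path="articles.db"):
    links = fetch_article_links(LISTING_URL)
    if not links:
        # Diagnostic described in the paper: an empty result suggests the site
        # markup has changed and the parser needs to be updated by a developer.
        print("Warning: no article links found, check the parser")
        return
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS articles"
                 " (url TEXT PRIMARY KEY, title TEXT, body TEXT)")
    for url in links:
        title, body = fetch_article(url)
        conn.execute("INSERT OR IGNORE INTO articles VALUES (?, ?, ?)",
                     (url, title, body))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    run_provider()   # in the real system this runs as a timer-triggered Azure Function
</pre>

In the real system the same role is played by the scheduled Azure Function, which writes new articles into the shared database every 24 hours.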
3.3 Text pre-processing

Before training the models, texts are processed according to the following principles:

- multi-line texts are combined into one line;
- texts are cleared of all characters that are not letters, digits, space characters or some special characters;
- each token is subjected to morphological analysis and reduced to its normal form (if possible);
- for normalized tokens, a part-of-speech mark is added;
- the service parts of speech (conjunctions, prepositions and pronouns) are removed.

3.4 Text processing

To begin with, we decided to use a neural network with an LSTM layer to classify articles by tag, because it allows us to configure how many previous sentences influence the current output of the network. The text can be split by "." and fed to the neural network sentence by sentence. Each sentence is transformed and passed through the LSTM (Long Short-Term Memory) network; the layer remembers the sentence and influences the output of the neural network in the future.

Word2vec can be used for text processing. This approach is presented in the form of two variations of a neural network architecture containing a single hidden layer. Relying on the distributional hypothesis (linguistic units with similar distributions have similar meanings), the model learns to match words with the contexts of their use. Training takes place without a teacher, using only unlabeled texts, and produces at the output a vector of a given dimension for any word encountered during learning. The resulting vectors reflect the closeness of these words: similar words have closer vectors and vice versa. The positive characteristics of this model are the low sparsity of the final vectors, the ability to set their dimension, and the speed of operation (compared to more complex models that give a similar level of quality). The main disadvantage is the impossibility of interpreting the values of individual vector coordinates. To obtain a vector representation of a whole text, it is necessary to combine the vector representations of individual words, which is usually done by taking the mean value of the vectors.

That is why we propose to use Paragraph2vec. This model has an architecture similar to word2vec, with the only difference that, in addition to the context words, the model also takes into account the containing document, learning its vector representation as well during training. As a result, paragraph2vec is able to return vectors of whole texts whose quality is comparable to the word vectors produced by word2vec. At the same time, for previously unseen documents a vector can be generated based on the words included in the document. Thus, using paragraph2vec, one can obtain vector representations of texts without any additional actions (Fig. 2).

Fig. 2. The LSTM representation

The network remembers the previous sentences, which allows it to capture linguistic dependencies between words in different sentences. Besides, the LSTM layer lets us configure how long it remembers the previous input. The most effective setting is to remember 5-6 sentences, since most paragraphs in articles consist of about that many sentences. As for the architecture of the neural network: there is no static input size, because it is defined dynamically from the training sample, and every input neuron corresponds to a word known to the neural network.
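As an illustration of the pipeline described above, the following is a minimal Keras sketch of an LSTM tagger with simple pre-processing; the cleaning rules, layer sizes, tag set and toy training texts are assumptions for demonstration, not the authors' exact configuration.

<pre>
# Illustrative sketch of the LSTM tagging pipeline; cleaning rules, layer sizes,
# tags and the toy training texts are assumptions, not the authors' exact setup.
import re
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def clean_text(text):
    """Join lines and keep only letters, digits and spaces (lemmatization and
    stop-word removal from Section 3.3 are omitted in this sketch)."""
    text = " ".join(text.splitlines())
    return re.sub(r"[^A-Za-z0-9 ]+", " ", text).lower()

texts = [clean_text("Python pandas: a DataFrame tutorial."),
         clean_text("Building a REST API with Java and Spring Boot.")]
labels = np.array([0, 1])            # 0 = "python", 1 = "java" (hypothetical tags)

tokenizer = Tokenizer()              # the vocabulary is built from the training sample,
tokenizer.fit_on_texts(texts)        # so the input dimension is defined dynamically
x = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=50)

model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=100),
    LSTM(64),                        # keeps context from previously seen tokens
    Dense(2, activation="softmax"),  # one output per tag
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, labels, epochs=100, batch_size=30)   # training setup as in Section 4
</pre>

In the real system the vocabulary and the number of output tags grow with the training sample, as noted above.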
3.5 Data processor responsibilities and design

The Data Processor is an independent hostable service responsible for organizing and processing data in the database. This part works only with the database and has no external dependencies within the project. For this part of the system, we used Keras [14-15] for the neural network, SQLAlchemy for interaction with the database, and Pandas [16-18] for loading the sample from a .csv file. For deploying the subsystem to Azure we used a Docker container.

Logical entities:

1. Classifier – provides a generalized wrapper for the RNN created with Keras, with methods for training and configuring it.
2. TextClassifier – inherits Classifier and extends it with the ability to prepare the training sample (text to sequences).
3. Interactor – an implementation of the repository design pattern that provides a set of SQL queries created with SQLAlchemy as the ORM.
4. MSSQLInteractor – inherits Interactor and connects to an MSSQL server (the connection can be configured in config.py).

Design layers:

1. Classifying – includes all business logic of article auto-tagging.
2. Data interaction – includes the logic of storing intermediate data and interacting with the database.
3. Entry – includes warming up the Classifier and the service logic.

4 Results

As for the training sample, we created a sample consisting of 200 articles about programming and tagged them. Table 1 presents the training parameters. Both methods, Word2vec and Paragraph2vec, were compared. Basically, to predict a word, Word2vec uses its surrounding words as predictors; Paragraph2vec additionally uses the id of the paragraph in which the word resides as a predictor. After the algorithm finishes, it has learned an embedding for each word and an embedding for each paragraph.

Table 1. The parameters of training.

Model          | Architecture | Dimension | Min frequency | Epochs
Paragraph2vec  | PV-DM        | 200       | 3             | 5
Paragraph2vec  | PV-DBOW      | 200       | 3             | 5
Word2vec       | skip-gram    | 300       | 3             | 5
Word2vec       | CBOW         | 300       | 3             | 5

Paragraph Vector is capable of constructing representations of input sequences of variable length. Unlike some of the previous approaches, it is general and applicable to texts of any length: sentences, paragraphs and documents. It does not require task-specific tuning of the word weighting function, nor does it rely on parse trees.

After 100 epochs of training with batch size 30 we obtained the following results:

Loss: 0.0262
Accuracy: 0.9806
Validation loss: 0.0649
Validation accuracy: 0.9641

Testing on articles tagged by humans (rather than automatically) gave the following result:

Fig. 3. Comparison of human and system tags

As can be seen, even with such a small training sample the neural network starts to understand the content of articles written by humans. The bigger the sample, the more accurate the classifier's results. For training the neural model, we tried different configurations and chose the most effective one. The model started to understand the essence of an article and tag it like a human would. The main goal is to train on a bigger sample and increase the number of tags the AI model knows.

After analysing the first batches of articles, we found that the most popular back-end languages are Java and Python, while on the front end the first place is taken by JavaScript. The other statistics cannot yet be considered objectively correct due to the small sample size.
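For reference, the four configurations in Table 1 can be reproduced, for example, with gensim (a library the paper does not explicitly name); the toy corpus below merely stands in for the 200-article sample, and the repetition is only there so that words survive the minimum-frequency threshold of 3.

<pre>
# Sketch of the four Table 1 configurations using gensim (a library the paper does
# not name explicitly); the toy corpus below stands in for the 200-article sample.
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Repeated so that every word passes the min-frequency threshold of 3 from Table 1.
docs = [["java", "spring", "rest"], ["python", "keras", "lstm"]] * 3
tagged = [TaggedDocument(words, [i]) for i, words in enumerate(docs)]

# Paragraph2vec, PV-DM: dimension 200, min frequency 3, 5 epochs (dm=1 selects PV-DM).
pv_dm = Doc2Vec(tagged, vector_size=200, min_count=3, epochs=5, dm=1)

# Paragraph2vec, PV-DBOW: dm=0 selects the PV-DBOW architecture.
pv_dbow = Doc2Vec(tagged, vector_size=200, min_count=3, epochs=5, dm=0)

# Word2vec, skip-gram: dimension 300, min frequency 3, 5 epochs (sg=1 selects skip-gram).
w2v_sg = Word2Vec(docs, vector_size=300, min_count=3, epochs=5, sg=1)

# Word2vec, CBOW: sg=0 selects the CBOW architecture.
w2v_cbow = Word2Vec(docs, vector_size=300, min_count=3, epochs=5, sg=0)

# For an unseen document, Paragraph2vec can infer a vector from its words alone.
new_vector = pv_dm.infer_vector(["python", "keras", "rest"])
</pre>

The PV-DM and PV-DBOW variants differ in whether the paragraph vector is combined with the context words to predict the target word (PV-DM) or used on its own to predict words sampled from the paragraph (PV-DBOW).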
5 Conclusion

In this paper, we introduced our approach to solving the article auto-tagging problem. NLP was considered the best option that suits our requirements. A method for determining the type of entity when extracting information from texts, based on calculating the semantic proximity of vectors obtained using neural network language models, was proposed and experimentally investigated. The method has the advantage of a low labour cost of text corpus preparation in comparison with traditional supervised learning methods and rule-based methods. The experiment also showed the advantage of using word2vec model vectors without TF-IDF or SIF weighting schemes in conditions of the limited vocabulary of texts from the knowledge base, automatically generated from professional standards [20].

Taking everything into consideration, a system consisting of four main parts (data providers, database, tag classifier and information representor) was built. The system automatically crawls certain web sites, saves the data in the database, and then the neural network processes the new information and stores the results.

Finally, the system has not yet been integrated into a single application and tested end to end; each part of the system was checked separately and works correctly. The proposed system can also be used for authorship recognition [21-22] and for data imputation in user profiles [23].

References

1. Baldwin, T., de Marneffe, M. C., Han, B., Kim, Y. B., Ritter, A., Xu, W.: Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In: Proceedings of the Workshop on Noisy User-generated Text, 126-135 (2015)
2. Toral, A., Munoz, R.: A proposal to automatically build and maintain gazetteers for Named Entity Recognition by using Wikipedia. In: Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources (2006)
3. Chiticariu, L., Krishnamurthy, R., Li, Y., Reiss, F., Vaithyanathan, S.: Domain adaptation of rule-based annotators for named-entity recognition tasks. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 1002-1012 (2010)
4. Lehman, Jill Fain: Adaptive parsing: self-extending natural language interfaces. Springer Science & Business Media, vol. 161 (2012)
5. Finkel, J. R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 363-370 (2005)
6. Toutanova, K., Manning, C. D.: Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, vol. 13, 63-70 (2000)
7. Chiu, J. P., Nichols, E.: Named entity recognition with bidirectional LSTM-CNNs. In: Transactions of the Association for Computational Linguistics, vol. 4, 357-370 (2016)
8. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
9. Shakhovska, N., Shvorob, I.: The method for detecting plagiarism in a collection of documents. In: 2015 Xth International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), 142-145 (2015)
10. Shvorob, I.: New Approach for Saving Semistructured Medical Data. In: Advances in Intelligent Systems and Computing, 29-40 (2017)
11. Zhou, C., Sun, C., Liu, Z., Lau, F.: A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630 (2015)
12. Shakhovska, N. B., Noha, R. Y.: Methods and tools for text analysis of publications to study the functioning of scientific schools. In: Journal of Automation and Information Sciences, vol. 47(12) (2015)
13. Shakhovska, N., Vovk, O., Hasko, R., Kryvenchuk, Y.: The method of big data processing for distance educational system. In: Conference on Computer Science and Information Technologies, 461-473 (2017)
14. Gulli, A., Pal, S.: Deep Learning with Keras: Implementing deep learning models and neural networks with the power of Python. ISBN 978-1787128422 (2017)
15. Keras Documentation. At: https://keras.io (2019)
16. Taieb, D.: Data Analysis with Python. Packt Publishing. ISBN 9781789958195 (2018)
17. Soares, F. M., Nunes, R.: Neural Network Programming with Python. ISBN 978-1784398217 (2019)
18. Sarkar, D.: Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data. ISBN 978-1-4842-2388-8 (2018)
19. Srivastava, P.: How to create a poet / writer using Deep Learning (Text Generation using Python)? At: https://www.analyticsvidhya.com/blog/2018/03/text-generation-using-python-nlp/ (2018)
20. Chapman, N. P.: LR Parsing: Theory and Practice. Cambridge University Press. ISBN 0-521-30413-X (1987)
21. Vysotska, V., Kanishcheva, O., Hlavcheva, Y.: Authorship Identification of the Scientific Text in Ukrainian with Using the Lingvometry Methods. In: 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), vol. 2, 34-38 (2018)
22. Lytvyn, V., Vysotska, V., Burov, Y., Bobyk, I., Ohirko, O.: The Linguometric Approach for Co-authoring Author's Style Definition. In: 2018 IEEE 4th International Symposium on Wireless Systems within the International Conferences on Intelligent Data Acquisition and Advanced Computing Systems (IDAACS-SWS), 29-34 (2018)
23. Fedushko, S., Syerov, Yu., Korzh, R.: Validation of the user accounts personal data of online academic community. In: IEEE XIIIth International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science, 863-866 (2016)