<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Research on Text Classification Model Based on Self-Attention Mechanism and Multi-Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaolin Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yue Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weipeng Tai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Engineering Research Institute in Anhui University of Technology</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science and Technology in Anhui University of Technology</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>244</fpage>
      <lpage>256</lpage>
      <abstract>
        <p>Traditional deep-learning text classification algorithms are built on RNNs, CNNs and their variants. When the input sequence is too long, an RNN is prone to exploding and vanishing gradients, so it cannot extract the long-distance dependencies of the text; a CNN focuses on the local features of a sentence rather than on its global structure. To address these problems, this paper proposes SMNN (Self-attention and Multiple Neural Networks Unit based Text Classification), a text classification model that integrates the self-attention mechanism with multiple neural networks. The model uses a word embedding model based on the self-attention mechanism, which focuses on the important parts of the text, to generate a global representation of the text; it uses a CNN with different convolution kernel sizes and k-max pooling to extract local semantic features at multiple granularities; and it uses a BiLSTM with skip connections and pooling layers to extract long-distance dependencies and obtain the global representation of the text. It then fuses the global and local features and, lastly, classifies the text with a softmax classifier. Experimental results on the text dataset show that SMNN achieves better classification accuracy than traditional text classification models.</p>
      </abstract>
      <kwd-group>
        <kwd>SMNN</kwd>
        <kwd>BiLSTM</kwd>
        <kwd>CNN</kwd>
        <kwd>text classification</kwd>
        <kwd>semantic features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In early text classification, statistical learning models dominated[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], such as
Support Vector Machine (SVM), Naive Bayes and k-nearest neighbor algorithms. These algorithms have
obvious shortcomings: they do not consider the context or the structure of a sentence and they ignore
the relationships between words, which leads to low learning ability, poor generalization and
inaccurate predictions. In addition, statistical learning methods depend on feature engineering, which
consumes a lot of time. Deep learning was proposed in response to these limitations.
      </p>
      <p>
        Hinton[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed the concept of deep learning in 2006 to address the poor representation and
generalization ability of earlier machine learning algorithms. Since then, deep learning has
continued to develop thanks to its powerful feature selection and information extraction capabilities.
Text classification algorithms based on deep learning, such as convolutional neural networks, recurrent
neural networks and their improved variants, gradually began to replace traditional machine learning
methods. With the transformer model based on the attention mechanism proposed by Google,
deep learning has made remarkable achievements in text classification.
      </p>
      <p>In this paper, a word embedding model based on the self-attention mechanism is used to obtain the
global representation of the text; a CNN with different convolution kernel sizes and k-max pooling
extracts local semantic features of the text at multiple granularities; and a BiLSTM with skip
connections and a pooling layer obtains the global semantic features of the text. The local and global
features are then fused, and the text is classified by a softmax
classifier. This structure captures the local and global features of a sentence at the same time and
improves the effect of text classification.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>Deep learning methods for text classification have developed rapidly since 2013 and have
gradually replaced traditional machine learning methods. Many researchers have
proposed deep learning models for text classification.
Compared with traditional machine learning models, deep learning-based models can effectively address
the high dimensionality and sparsity of text feature vectors. The main deep
learning models under study can be divided into RNN-based models, CNN-based
models, transformer-based and attention-based models, and models based on pre-trained language
models (PLM).</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. CNN network</title>
      <p>
        CNN-based neural networks have achieved great success in the image domain, so
researchers proposed applying CNNs to text classification
tasks. The earliest CNN model used for text classification is the DCNN model proposed by
Kalchbrenner[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which alternates wide convolution (convolution with "same" padding) and dynamic k-max
pooling to build sentence feature maps that capture both the local and the global features of a sentence. Compared
with DCNN, the textCNN proposed by Kim[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a simpler network model. It uses single-layer
convolution and max pooling to extract and represent sentence features, and it can extract
feature information from each word in the text. Building on DCNN, Johnson[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
proposed the DPCNN model, which improves performance to a certain extent by increasing the network
depth. Conneau[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] also made further
improvements to DCNN and proposed the VDCNN network model. Its innovation lies in applying
convolution and pooling with small steps to character-level vectors. Classification performance
improves, but as the network grows deeper, the time complexity of training increases and the
semantic loss caused by the deeper network becomes particularly obvious.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2. RNN network</title>
      <p>
        The RNN text classification model is based on the deep learning network structure proposed by Jordan[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
which can better learn text features, understand text meaning, and capture the global features of
sentences. An RNN learns the features of a sentence sequence through recursive
calculation. However, RNN training has its own defects: the recursive calculation suffers from
vanishing and exploding gradients. To address these problems, Zhang[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and others applied the long short-term memory
network (LSTM), an improvement on the RNN structure, to text classification
based on the sentence state. Performance improved greatly, and the problems above were
effectively alleviated. However, although a sentence carries
contextual information in both directions, LSTM only considers one-way information (only the
text preceding a position, not the text following it), so the
resulting deviation in semantic understanding affects the accuracy of the model.
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.3. Self-attention network and others</title>
      <p>
        The attention mechanism has achieved major breakthroughs in
deep learning in recent years. In both natural language processing and image processing, many
researchers are working on combining the attention mechanism with neural networks, including for text
classification. The core of the attention mechanism is to obtain and learn the most useful
information under limited resources. In essence, the attention mechanism maps a query onto a
series of key-value pairs: it computes the similarity between the query and each key
to obtain weights, and it improves the accuracy of text classification by assigning different
weights to different parts of the text. The attention mechanism was first applied to machine
translation by Bahdanau[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and achieved very good results.
      </p>
      <p>
        Since then, Luong[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] defined global attention and local attention in machine translation.
Vaswani[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a sentence representation method based on the self-attention mechanism for
machine translation, which greatly improved translation quality. Because of the excellent
performance of the self-attention mechanism in translation, many scholars have considered
combining it with deep learning models. For example, Jia Hongyu[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
proposed RCNN_A, a text classification model that combines the self-attention mechanism with a recurrent
neural network; Xinqiang[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed combining the self-attention mechanism with BiLSTM to capture
the local and global features of sentences, which improved the classification effect of the network
model. The self-attention mechanism has advantages and disadvantages. The
advantage is that it can capture the relevance within a text without other information. The disadvantage is that
it cannot capture the order (timing) information of a sentence. Therefore, positional
encoding needs to be added to the self-attention mechanism to better capture sentence features.
      </p>
      <p>
        Early pre-trained language models, such as the commonly used Word2vec
and GloVe[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ][
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], were usually used only to produce word embeddings and not for the text classification task itself. The use of pre-trained language models for text classification begins
with the transformer model proposed by Google. Many pre-trained language models based on
the transformer not only produce word embeddings but also act on downstream tasks,
such as the OpenAI GPT model proposed by Radford[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and the BERT model proposed by Devlin[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
By learning contextual representations through a bidirectional network structure, BERT has achieved
excellent results in natural language processing tasks in domains such as
e-commerce[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], medicine[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and finance. Subsequently, many researchers further improved the BERT model, for example the
BERT-wwm model[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], the ALBERT model[21], and the Bert-CNN[22] and BiLSTM-CRF[23] models
proposed by Shi Zhenjie and Dong Zhaowei.
      </p>
    </sec>
    <sec id="sec-6">
      <title>3. SMNN network</title>
      <p>The SMNN model proposed in this paper obtains global and local semantic
feature vectors through the fusion of word embedding and the self-attention mechanism, and
obtains the text classification result through a multi-model fusion mechanism. The network structure
of the SMNN model is shown in Figure 1. It consists of an input layer, an embedding layer, a
CNN layer, a BiLSTM layer and a fusion output layer.</p>
      <p>Figure 1: Network structure of the SMNN model. Diagram labels: input, Word Embedding, tanh, Self-attention, four convolution branches (Conv, 256) each followed by k-max pooling (K-M-p), BLSTM, concatenate.</p>
    </sec>
    <sec id="sec-7">
      <title>3.1. Embedding layer</title>
      <p>Word embedding is a standard pre-trained language model that generates word vectors. In the input
layer, each word in the sentence is represented by a one-hot vector. The input is then passed to the
word embedding, which generates the distributed word vectors of the sentence and thereby a dynamic
encoding of the words. A traditional word vector representation cannot take
into account the degree of mutual attention between the current word and the words in
other positions, which weakens the representation. Unlike traditional text
classification models, SMNN uses a word embedding mechanism based
on self-attention to obtain the feature representation vector of each word in the sentence.
First, the word embedding generates the distributed word vectors; next, the tanh activation
function is applied to speed up computation and training; then
the self-attention mechanism computes the degree of association between each word and the other
words in the text; finally, the representation M is output. Concretely, the sentence
vector containing contextual information generated in the embedding layer is processed by
tanh to obtain H, and <inline-formula><tex-math>h_t^{*}</tex-math></inline-formula> denotes the word vector corresponding to the t-th word
of H. Because each word influences the global semantics to a different degree, the self-attention
mechanism assigns a different weight to each word; the weight determines the influence of the current word
on the semantics of the text and reflects the importance of the word to the sentence. The
representation M of the global feature vector of the text is obtained as follows:</p>
      <p>
        <disp-formula><label>(1)</label><tex-math>u_t = \tanh\left(W h_t^{*} + b\right)</tex-math></disp-formula>
        <disp-formula><label>(2)</label><tex-math>\alpha_t = \frac{\exp\left(u_t^{\top} u_w\right)}{\sum_{t} \exp\left(u_t^{\top} u_w\right)}</tex-math></disp-formula>
        <disp-formula><label>(3)</label><tex-math>M = \sum_{t} \alpha_t h_t^{*}</tex-math></disp-formula>
      </p>
      <p>Here <inline-formula><tex-math>W</tex-math></inline-formula> is a trainable parameter matrix, <inline-formula><tex-math>b</tex-math></inline-formula> is the bias term, <inline-formula><tex-math>u_t^{\top}</tex-math></inline-formula> is the
transpose of <inline-formula><tex-math>u_t</tex-math></inline-formula>, <inline-formula><tex-math>u_w</tex-math></inline-formula> is a randomly initialized context vector of the model, and <inline-formula><tex-math>\alpha_t</tex-math></inline-formula> is the normalized weight of
the word at the t-th position of the input sequence.</p>
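      <p>The attention computation of Eqs. (1) to (3) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name attention_pool, the sizes n and d, and the random initialisation of W and u_w are all hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np

def attention_pool(H, W, b, u_w):
    """Attention-weighted sum of word vectors, following Eqs. (1)-(3).

    H   : (n, d) word vectors after tanh, one row per word (h_t*)
    W   : (d, d) trainable projection (randomly initialised here)
    b   : (d,)   bias term
    u_w : (d,)   context vector (randomly initialised here)
    """
    U = np.tanh(H @ W.T + b)                 # Eq. (1): u_t = tanh(W h_t* + b)
    scores = U @ u_w                         # u_t^T u_w for every position t
    scores = scores - scores.max()           # stabilises exp; softmax is unchanged
    alpha = np.exp(scores) / np.exp(scores).sum()   # Eq. (2): normalised weights
    M = alpha @ H                            # Eq. (3): M = sum_t alpha_t h_t*
    return alpha, M

# Hypothetical sizes: a 5-word sentence with 8-dimensional word vectors.
rng = np.random.default_rng(0)
n, d = 5, 8
H = np.tanh(rng.normal(size=(n, d)))
alpha, M = attention_pool(H, rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d))
```
]]></preformat>
      <p>The weights alpha sum to one, so M is a convex combination of the word vectors; words whose projection is most similar to the context vector contribute most.</p>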
    </sec>
    <sec id="sec-8">
      <title>3.2. CNN layer</title>
      <p>A convolutional neural network (CNN) is a good local feature extractor: it captures the local
features of data by controlling the size of the convolution kernel. Kim et al. therefore proposed the
textCNN structure, which takes the word vectors generated by the embedding layer as input. As shown
in Figure 2, unlike the traditional textCNN model, SMNN takes the feature matrix
<inline-formula><tex-math>M \in \mathbb{R}^{n \times d}</tex-math></inline-formula> generated by the attention mechanism
as input, and then applies one-dimensional convolution kernels with window sizes 1, 2, 3 and 4,
which convolve vertically over M
to obtain feature maps of the text at different granularities. SMNN applies the SELU activation function
to the feature maps to improve the classification effect and to prevent the dead neurons
that ReLU can cause. In addition, to prevent the feature loss caused by ordinary max pooling,
the feature maps generated by the different kernel sizes are reduced with
k-max pooling, which selects the k largest features (k=2, meaning that the two most
important features are kept). The resulting feature representation contains rich local feature
information of the text, and the final output vector is the local feature representation of the CNN
layer. The feature map is <inline-formula><tex-math>C = [c_1, c_2, \dots, c_{n-h+1}]</tex-math></inline-formula>, and each
element <inline-formula><tex-math>c_i</tex-math></inline-formula> of the feature map is computed as follows:</p>
      <p>
        <disp-formula><label>(4)</label><tex-math>c_i = f\left(W \ast M_{i:i+h-1} + b\right)</tex-math></disp-formula>
      </p>
      <p>Here f is the activation function (SELU), <inline-formula><tex-math>W \in \mathbb{R}^{h \times d}</tex-math></inline-formula> is the convolution kernel, <inline-formula><tex-math>\ast</tex-math></inline-formula>
denotes the product operation on the elements of the matrices, <inline-formula><tex-math>b</tex-math></inline-formula> is the bias term, <inline-formula><tex-math>h \in \{1, 2, 3, 4\}</tex-math></inline-formula>
is the convolution kernel size, and <inline-formula><tex-math>M_{i:i+h-1}</tex-math></inline-formula> denotes rows i to i+h-1
of M, where i ranges over [1, n-h+1].</p>
      <p>Figure 2: CNN layer. Diagram labels: the word vector matrix M generated by the embedding layer; the size of the convolution kernel; feature maps are generated through different convolution kernels, filtered through max pooling, and finally spliced into the fusion output.</p>
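      <p>A minimal NumPy sketch of Eq. (4) followed by k-max pooling is given below. It makes several assumptions that the text does not state: a single filter per kernel size (the paper uses 256), an order-preserving k-max pooling, random filter values, and the hypothetical helper names conv_feature_map and k_max_pool.</p>
      <preformat><![CDATA[
```python
import numpy as np

SELU_ALPHA = 1.6732632423543772   # standard SELU constants
SELU_SCALE = 1.0507009873554805

def selu(x):
    """SELU activation; the exponential is only evaluated on the non-positive part."""
    x = np.asarray(x, dtype=float)
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))

def conv_feature_map(M, W, b):
    """Eq. (4): c_i = f(W * M[i:i+h-1] + b) for each window of h consecutive rows."""
    n, h = M.shape[0], W.shape[0]
    windows = [np.sum(W * M[i:i + h]) + b for i in range(n - h + 1)]
    return selu(np.array(windows))

def k_max_pool(c, k=2):
    """Keep the k largest values of c in their original order (assumed variant)."""
    keep = np.sort(np.argsort(c)[-k:])
    return c[keep]

# Hypothetical input: n = 6 positions, d = 4 features, one filter per kernel size.
rng = np.random.default_rng(0)
n, d = 6, 4
M = rng.normal(size=(n, d))
local_features = []
for h in (1, 2, 3, 4):                       # kernel sizes named in this section
    W = rng.normal(size=(h, d))
    c = conv_feature_map(M, W, b=0.0)
    local_features.append(k_max_pool(c, k=2))
local = np.concatenate(local_features)       # 4 kernel sizes x k = 2 values
```
]]></preformat>
      <p>With k=2 each filter contributes two values instead of one, which is how k-max pooling retains more of the feature map than ordinary max pooling.</p>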
    </sec>
    <sec id="sec-9">
      <title>3.3. BiLSTM layer</title>
      <p>Although LSTM improves on RNN, it can only consider one-way text information
of the sentence. Since text is made up of context on both sides, considering only the text
on the left without the text on the right causes information loss
and affects the understanding of the text. For this reason, the improved model BiLSTM, based on the long
short-term memory network, was proposed to address the inability of LSTM to consider the global
structure of sentences. The SMNN model therefore adopts a bidirectional LSTM
structure to capture the context information of each word. The bidirectional LSTM obtains the
hidden-state outputs <inline-formula><tex-math>\overrightarrow{h_t}</tex-math></inline-formula> and <inline-formula><tex-math>\overleftarrow{h_t}</tex-math></inline-formula> at time t through a forward LSTM and a reverse LSTM, and then
splices the hidden-state output of the forward LSTM with the hidden-state output of the
reverse LSTM to obtain the final hidden-state output <inline-formula><tex-math>h_t</tex-math></inline-formula> of the bidirectional LSTM, that is,
the vector representation of the word at time t. Taking the forward long short-term memory network as an example,
the LSTM calculation process is as follows:</p>
      <p>
        <disp-formula><label>(5)</label><tex-math>i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)</tex-math></disp-formula>
        <disp-formula><label>(6)</label><tex-math>o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)</tex-math></disp-formula>
        <disp-formula><label>(7)</label><tex-math>f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)</tex-math></disp-formula>
        <disp-formula><label>(8)</label><tex-math>c_t = f_t \ast c_{t-1} + i_t \ast \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right)</tex-math></disp-formula>
        <disp-formula><label>(9)</label><tex-math>h_t = o_t \ast \tanh\left(c_t\right)</tex-math></disp-formula>
      </p>
      <p>Here <inline-formula><tex-math>i_t</tex-math></inline-formula> is the input gate, <inline-formula><tex-math>o_t</tex-math></inline-formula> is the output gate, <inline-formula><tex-math>f_t</tex-math></inline-formula> is the forget gate, and
<inline-formula><tex-math>x_t</tex-math></inline-formula> is the word representation vector at the t-th position of the word vectors generated by the
pre-trained language model. <inline-formula><tex-math>\sigma</tex-math></inline-formula> is the sigmoid activation function, W is the weight matrix
involved in the operation, b is the offset, and
<inline-formula><tex-math>c_t</tex-math></inline-formula> is the output of the hidden layer; finally,
the output of the hidden layer and the output of the output gate jointly determine the output element <inline-formula><tex-math>h_t</tex-math></inline-formula>.</p>
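      <p>The gate equations (5) to (9) and the forward/reverse splicing can be illustrated with a small NumPy sketch. It is written under assumptions: the dictionary keys "i", "o", "f" and "c", the helper names, the small uniform initialisation and the sizes are hypothetical, and a single layer is shown rather than the two layers used in the experiments.</p>
      <preformat><![CDATA[
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(rng, d, hidden):
    """One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t]."""
    W = {g: rng.uniform(-0.1, 0.1, size=(hidden, hidden + d)) for g in "iofc"}
    b = {g: np.zeros(hidden) for g in "iofc"}
    return W, b

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([h_prev, x_t])                          # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])                         # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])                         # output gate
    f_t = sigmoid(W["f"] @ z + b["f"])                         # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ z + b["c"])    # Eq. (8)
    h_t = o_t * np.tanh(c_t)                                   # Eq. (9)
    return h_t, c_t

def run_lstm(X, W, b, hidden):
    """Run one direction over the rows of X and return all hidden states."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x_t in X:
        h, c = lstm_step(x_t, h, c, W, b)
        outputs.append(h)
    return np.stack(outputs)

def bilstm(X, params_fwd, params_bwd, hidden):
    """Splice the forward and reverse hidden states at every time step."""
    W_f, b_f = params_fwd
    W_b, b_b = params_bwd
    forward = run_lstm(X, W_f, b_f, hidden)
    backward = run_lstm(X[::-1], W_b, b_b, hidden)[::-1]       # re-align to time order
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
n, d, hidden = 5, 8, 4                                         # hypothetical sizes
X = rng.normal(size=(n, d))
H = bilstm(X, init_params(rng, d, hidden), init_params(rng, d, hidden), hidden)
```
]]></preformat>
      <p>Each row of H has 2 x hidden entries: the first half summarises the text to the left of a word and the second half the text to its right.</p>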
      <p>Although the long short-term memory network alleviates the vanishing and exploding gradient
problems of RNN to a certain extent and captures the context information of a sequence
well, its forget gate means that some memory is lost during
propagation, i.e. some important data is dropped, which affects
the experimental results. To prevent semantic loss during training, the SMNN model
uses the word embedding model based on the attention mechanism to generate the word
vector et; the vector et output by the embedding layer is then fed into the BiLSTM, which outputs the feature
vector slt; finally, the output feature vectors are spliced and the spliced result is passed through
max pooling. On the one hand, this stacking increases the depth of the network and
helps improve the performance and training efficiency of the model. On the other hand, it helps
capture text features and sentence structure in depth. The skip-connected structure helps avoid exploding and
vanishing gradients during model training, as shown in Figure 3.</p>
      <p>Figure 3: BiLSTM layer with skip connections (diagram label: concatenate).</p>
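      <p>One possible reading of the skip connection described above is that the embedding output et is spliced with the BiLSTM output slt at every position before max pooling over the sequence. The NumPy sketch below illustrates that reading only; the function name, the random stand-in inputs and the sizes are assumptions, and the authors' exact wiring may differ.</p>
      <preformat><![CDATA[
```python
import numpy as np

def skip_concat_max_pool(E, S):
    """Splice embedding outputs with BiLSTM outputs, then max-pool over positions.

    E : (n, d)        embedding-layer outputs, one row per word (et)
    S : (n, 2*hidden) BiLSTM outputs, one row per word (slt)
    """
    joined = np.concatenate([E, S], axis=1)   # skip connection: [et ; slt] per position
    return joined.max(axis=0)                 # max pooling over the n positions

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))                   # stand-in for the embedding output
S = np.tanh(rng.normal(size=(5, 8)))          # stand-in for the BiLSTM output
pooled = skip_concat_max_pool(E, S)
```
]]></preformat>
      <p>Because the embedding output bypasses the recurrent layer, information that the forget gate discards is still available to the pooling step.</p>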
    </sec>
    <sec id="sec-10">
      <title>3.4. Fusion output layer</title>
      <p>The global semantic feature representation generated by the self-attention mechanism
in the embedding layer is passed through a max pooling
layer to obtain the global feature representation F1. F1, the feature representation obtained by the
CNN layer and the feature representation generated by
the BiLSTM layer are then spliced along the feature dimension to form the global feature representation F,
which carries rich text features. To speed up training, the tanh activation
function is applied, and dropout (random deactivation) is introduced so that a certain proportion of the
neural units are discarded from the network. Its purpose is to improve the
generalization ability of the model and prevent overfitting during training. The result is then fed into a
fully connected layer and finally into softmax to obtain the final prediction probability P:</p>
      <p>
        <disp-formula><label>(10)</label><tex-math>P = \mathrm{softmax}\left(W F + b\right)</tex-math></disp-formula>
      </p>
      <p>where <inline-formula><tex-math>W</tex-math></inline-formula> is the trainable weight and <inline-formula><tex-math>b</tex-math></inline-formula> is the bias term. The cross-entropy
loss function is used as the objective function for text classification training:</p>
      <p>
        <disp-formula><label>(11)</label><tex-math>H(p, q) = -\sum_{i \in N} \sum_{j \in C} q_{ij} \log p_{ij}</tex-math></disp-formula>
      </p>
      <p>Here N is the number of training samples, C is the number of categories, p is the predicted
probability, and q is the true label
of the sample in one-hot encoding.</p>
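      <p>The softmax output of Eq. (10) and the cross-entropy loss of Eq. (11) can be sketched as follows. This is an illustrative sketch, not the training code of the paper: the sizes, the random fused features F and the labels are made up, samples are stored as rows (so W F + b is written F @ W + b), and a small constant is added inside the logarithm for numerical safety.</p>
      <preformat><![CDATA[
```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(P, Q, eps=1e-12):
    """Eq. (11): minus the sum over samples and classes of q * log(p)."""
    return -np.sum(Q * np.log(P + eps))

rng = np.random.default_rng(0)
N, m, C = 4, 6, 10                        # samples, fused feature size, classes (hypothetical)
F = np.tanh(rng.normal(size=(N, m)))      # stand-in for the fused features after tanh
W = rng.normal(size=(m, C))               # trainable weight of the fully connected layer
b = np.zeros(C)                           # bias term
P = softmax(F @ W + b)                    # Eq. (10), one row of probabilities per sample
labels = np.array([0, 3, 9, 3])           # made-up true classes
Q = np.eye(C)[labels]                     # one-hot encoding of the true labels
loss = cross_entropy(P, Q)
```
]]></preformat>
      <p>Because Q is one-hot, only the log-probability of the true class of each sample contributes to the loss.</p>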
    </sec>
    <sec id="sec-11">
      <title>4. Analysis of results</title>
    </sec>
    <sec id="sec-12">
      <title>4.1. Experimental dataset</title>
      <p>To verify the performance of the SMNN network model in text classification,
experiments are carried out on two public datasets, and the relevant information of the two datasets is
shown in Table 1. In the table, length is the average length of the data, class is the
number of categories, train is the number of samples in the training set, dev is the number
of samples in the validation set, and test is the number of samples in the test set.</p>
      <p>THUNews_Title dataset: THUCNews is generated by filtering the historical data of the
Sina News RSS subscription channel from 2005 to 2011, with a total of about 740,000 news documents. The dataset consists of short
texts for multi-class text classification. This article extracts 10 categories from the THUNews dataset for
training: finance, realty, stocks, education, science, society, politics, sports, game and
entertainment. The extracted data contains 200,000 samples in total: 180,000 in the training set and
10,000 each in the validation set and the test set.</p>
    </sec>
    <sec id="sec-13">
      <title>4.2. Experimental parameter settings</title>
      <p>
        Weight initialization: The model weights need to be initialized before training. The SMNN
model is initialized by random sampling from the uniform distribution on [-1, 1].
      </p>
      <p>Training hyperparameters: SMNN uses word embedding combined with the self-attention mechanism
as its embedding layer. The embedded word vectors and the output word vectors
both have dimension 300, and the parameters of the embedding layer are updated as the word vectors are generated.
Four one-dimensional convolution kernels of different sizes are used in the CNN layer for vertical
convolution. The kernel sizes are 3, 4, 5 and 6, and the number of kernels of each size
is set to 256. The number of hidden units in the BiLSTM is set to 128 and the number of hidden layers
to 2. Dropout with a rate of 0.2 is introduced to prevent overfitting.
The mini-batch size during training is 128, and the
model parameters are optimized using the Adam optimizer with a learning rate of 0.003. The
relevant parameters of the SMNN model are shown in Table 2.</p>
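      <p>For reference, the settings listed in this section can be gathered into a single configuration object. The sketch below is only a convenience illustration: the class name SMNNConfig, the field names and the helper init_uniform are hypothetical, while the values are the ones stated in the text (k = 2 comes from Section 3.2).</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class SMNNConfig:
    """Hyperparameters as stated in the text; the field names are illustrative."""
    embedding_dim: int = 300
    kernel_sizes: tuple = (3, 4, 5, 6)
    num_filters: int = 256          # filters per kernel size
    k_max: int = 2                  # k of k-max pooling (Section 3.2)
    lstm_hidden_units: int = 128
    lstm_layers: int = 2
    dropout: float = 0.2
    batch_size: int = 128
    learning_rate: float = 0.003    # used with the Adam optimizer

def init_uniform(shape, rng):
    """Sample initial weights from the uniform distribution on [-1, 1)."""
    return rng.uniform(-1.0, 1.0, size=shape)

cfg = SMNNConfig()
rng = np.random.default_rng(0)
# Example: one bank of convolution filters for the smallest kernel size.
conv_weights = init_uniform((cfg.num_filters, cfg.kernel_sizes[0], cfg.embedding_dim), rng)
```
]]></preformat>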
    </sec>
    <sec id="sec-14">
      <title>4.3. Experimental results and analysis</title>
    </sec>
    <sec id="sec-15">
      <title>4.3.1. Comparative Test</title>
      <p>The SMNN model is compared with widely used classification models, mainly
TextCNN, TextRNN, TextRCNN, DPCNN, Att-BiLSTM and Transformer.</p>
      <p>• TextCNN: Kim proposed using multiple convolution kernels of different sizes for vertical
convolution in a CNN to extract the n-gram features of the text, processing the data
with the ReLU activation function to speed up training, extracting the most important
features of the text with max pooling, and finally classifying with softmax.
• TextRNN: Liu et al. proposed a recurrent neural network structure for text classification. This
structure uses the hidden-layer output at the last time step of the LSTM as
the feature representation of the global text semantics, and finally classifies it with a softmax classifier.
• TextRCNN: A network structure proposed by Lai et al. for text classification. It
draws on the structure of TextRNN; unlike TextRNN,
max pooling is added after the feature representation of the text is learned by the
recurrent neural network, in order to extract the salient features of the text.
• DPCNN: Johnson et al. proposed a pyramid-like network structure for text classification,
which improves performance by increasing the depth of the network.
• Att-BiLSTM: Zhou et al. proposed capturing the global semantic features of sentences by
combining the attention mechanism with a bidirectional long short-term memory network,
using the attention mechanism to assign different weights to different words in order to capture
the important semantic information of the text.
• Transformer: The Transformer was proposed by Vaswani et al. for machine translation and
consists of an encoder and a decoder. In the text classification task, the encoder is used to obtain
long-distance features of the text.</p>
    </sec>
    <sec id="sec-16">
      <title>4.3.2. Model comparison analysis</title>
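      <p>This paper uses the accuracy (acc), precision (Precision), recall (Recall) and F1 value
(F1-score) of the model to evaluate the proposed SMNN model. The four
indicators are calculated as follows:</p>
      <p>
        <disp-formula><label>(12)</label><tex-math>\mathrm{acc} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}</tex-math></disp-formula>
        <disp-formula><label>(13)</label><tex-math>\mathrm{Precision} = \frac{TP}{TP + FP}</tex-math></disp-formula>
        <disp-formula><label>(14)</label><tex-math>\mathrm{Recall} = \frac{TP}{TP + FN}</tex-math></disp-formula>
        <disp-formula><label>(15)</label><tex-math>F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}</tex-math></disp-formula>
      </p>
      <p>Here TP is the number of samples predicted correctly for a category, FP is the number of
samples predicted wrongly as that category, TP+FP is the total number of samples actually assigned to the category, TP+FN is the total
number of samples that should be assigned to it, and the comprehensive index F1
combines the precision and the recall.</p>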
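      <p>The indicators above can be computed per category from the predicted and true labels, as in the NumPy sketch below. The function name per_class_metrics and the toy labels are made up for illustration; how the paper averages the per-category values into the overall figures is not stated, so no averaging is shown.</p>
      <preformat><![CDATA[
```python
import numpy as np

def per_class_metrics(y_true, y_pred, num_classes):
    """Return overall accuracy and a {class: (precision, recall, f1)} dictionary."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    result = {}
    for c in range(num_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))   # predicted c, truly c
        fp = int(np.sum((y_pred == c) & (y_true != c)))   # predicted c, truly another class
        fn = int(np.sum((y_pred != c) & (y_true == c)))   # truly c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        result[c] = (precision, recall, f1)
    accuracy = float(np.mean(y_true == y_pred))           # correct predictions / all samples
    return accuracy, result

# Toy example with three categories (labels are made up).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, metrics = per_class_metrics(y_true, y_pred, num_classes=3)
```
]]></preformat>
      <p>In the toy example, category 1 has two true positives, one false positive and no false negatives, giving a precision of 2/3, a recall of 1 and an F1 of 0.8.</p>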
      <p>The classification results of this experiment are shown in Figure 4. The comparative analysis
of the results shows that the SMNN model classifies sports and
education best, with F1 values above 95%. Realty and game are the next most effective categories,
with F1 scores above 92%. They are followed by entertainment, politics and society, which
have F1 values between 91% and 92%. Finance, stocks and science are classified less effectively, with F1
values ranging from 86% to 89%. Overall, the SMNN model performs well across the categories,
indicating that it classifies texts accurately.</p>
      <p>Figure 4: Per-category classification results of the SMNN model. Vertical axis: percentage; horizontal axis (class): finance, realty, stocks, education, science, society, politic, sports, game, entertainment.</p>
      <p>As shown in Figure 5, the SMNN model performs very well on the Sina news dataset.
Its accuracy, precision, recall and F1 are 91.51%, 91.54%, 91.51% and 91.51% respectively. Among the
six baseline models, RCNN gives the best results. Compared with RCNN,
SMNN improves accuracy, precision, recall
and F1 by 1.05%, 1.15%, 1.14% and 1.12% respectively. The comparison of accuracy, precision, recall and F1
shows that extracting text features with both a convolutional neural network and a
recurrent neural network improves the performance of the model. Compared with
traditional text classification models, the classification effect of the SMNN model is significantly
improved.</p>
      <p>Figure 6 compares the validation accuracy of RCNN, Att-BiLSTM and SMNN during
training. The abscissa is the number of training iterations (the unit, called an epoch here, is 100 batches, and
each batch contains 128 samples), and the ordinate is the accuracy on the validation set. The figure shows how
the validation accuracy changes as more training data is seen. RCNN and
Att-BiLSTM, the best-performing baselines, are plotted together with SMNN,
and the results show that SMNN outperforms both RCNN and
Att-BiLSTM on the validation set.</p>
    </sec>
    <sec id="sec-17">
      <title>4.3.3. Ablation experiment</title>
      <p>To verify the influence of the different modules of the SMNN model on classification
performance and to further demonstrate the effectiveness of the model, an ablation experiment was
designed. Taking SMNN as the original model, three groups of experiments were set up for
comparison. The first group, Word2vec-Attention, feeds the feature representation output by the
attention-based word embedding model directly into the fusion output layer for text classification;
that is, it removes the CNN layer and the BiLSTM layer from the original model. The second group,
Word2vec-Attention-CNN, passes the feature representation output by the embedding layer to the CNN
layer, extracts text features through convolution kernels of different sizes, and then feeds them to
the fusion output layer for text classification. The third group, Word2vec-Attention-BLSTM, passes
the feature representation generated by the attention mechanism and Word2vec to the BLSTM layer,
obtains the global feature representation of the text through the long short-term memory network,
and then classifies it through the fusion output layer. The results of the ablation experiment are
listed in Table 3.</p>
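      <p>The three ablation groups and the full model differ only in which feature-extraction branches sit between the attention-based embedding and the fusion output layer. The following sketch makes that layout explicit. The Variant class, the flag names, and the stage labels are hypothetical and serve only to illustrate the experimental design; they do not reproduce the authors' code.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    use_cnn: bool     # local-feature branch (convolution kernels of several sizes)
    use_bilstm: bool  # global-feature branch (BiLSTM)


# Group names follow the text; the flag layout is an illustrative assumption.
VARIANTS = [
    Variant("Word2vec-Attention", use_cnn=False, use_bilstm=False),
    Variant("Word2vec-Attention-CNN", use_cnn=True, use_bilstm=False),
    Variant("Word2vec-Attention-BLSTM", use_cnn=False, use_bilstm=True),
    Variant("SMNN", use_cnn=True, use_bilstm=True),
]


def pipeline(variant):
    """List the stages a variant runs, in order."""
    stages = ["attention_word_embedding"]
    branches = []
    if variant.use_cnn:
        branches.append("cnn")
    if variant.use_bilstm:
        branches.append("bilstm")
    if branches:
        # Branches run side by side on the same embedding output.
        stages.append("+".join(branches))
    stages.append("fusion_output_layer")
    return stages


if __name__ == "__main__":
    for v in VARIANTS:
        print(v.name, pipeline(v))
```
      </preformat>
      <p>Laying the groups out this way shows that each ablated model removes exactly one or both branches while keeping the embedding and the output layer fixed, so differences in the results can be attributed to the removed branch.</p>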
      <p>Table 3 shows that the classification performance of the Word2vec-Attention,
Word2vec-Attention-CNN, and Word2vec-Attention-BLSTM models is far lower than that of the fused
model. Figure 7 compares the accuracy of the models during training. It can be seen from the figure
that the SMNN model converges faster and reaches higher accuracy than the ablated models. The SMNN
model can fuse the global semantic features of the text with the local semantic features at multiple
granularities, and therefore has stronger semantic capture and information extraction
capabilities.</p>
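      <p>The fusion step can be illustrated with a minimal sketch in which the global and local feature vectors are concatenated and passed through a linear layer followed by softmax. Concatenation as the fusion operator, the function names, and the toy weights are assumptions made for this illustration; only the use of fused features and a softmax classifier comes from the text.</p>
      <preformat>
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(global_feat, local_feat, weights, bias):
    """Concatenate the two feature vectors, then apply a linear layer and softmax.

    Concatenation is an assumed fusion operator for this sketch.
    """
    fused = list(global_feat) + list(local_feat)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


if __name__ == "__main__":
    # Toy values: 2 global features, 2 local features, 2 classes.
    probs = fuse_and_classify(
        [1.0, 0.0],
        [0.0, 2.0],
        weights=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
        bias=[0.0, 0.0],
    )
    print(probs)
```
      </preformat>
      <p>In this toy setting the second class draws only on a local feature and the first only on a global one, which shows how the classifier can weigh both kinds of evidence once they share one fused vector.</p>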
    </sec>
    <sec id="sec-18">
      <title>5. Conclusion</title>
      <p>The experimental results show that the SMNN text classification model has certain advantages
over the traditional text classification models. The word embedding model based on the
self-attention mechanism obtains a global representation of the text; the CNN extracts the local
semantic features of the text at multiple granularities through different convolution kernel sizes;
and the BiLSTM with skip connections and a pooling layer obtains the global semantic features of the
text. The SMNN model has stronger feature extraction capabilities than the attention-based word
embedding model alone (Word2vec-Attention), a CNN alone, a BiLSTM alone, and models that combine the
attention-based word embedding model with only a CNN or only a BiLSTM. With sufficient global and
local features of the text, it achieves better classification results in each category, and the
self-attention mechanism can improve the performance of the model when it is combined with other
neural network models. In future research, we will explore how to combine the self-attention
mechanism with other deep network structures for text classification.</p>
    </sec>
  </body>
</article>