<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CVTE SLU: a Hybrid System for Command Understanding Task Oriented to the Music Field</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuquan Le</string-name>
          <email>leyuquan@yeah.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xian Li⋆</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Suixue Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peng Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haiqian Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guanyu Jiang</string-name>
          <email>jiangguanyug@cvte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Guangzhou Shiyuan Electronic Technology Co., Ltd.</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The dialogue system lies at the core of many natural language processing applications, and thus has received much attention. The 2018 China Conference on Knowledge Graph and Semantic Computing (CCKS) set up an evaluation competition for the command understanding task oriented to the music field. In this task, we focus on two sub-tasks: (1) intent identification in the music field, and (2) slot filling in the music field. We propose a hybrid system based on rules, Multi-fastText and CRF (conditional random field) methods for this task. The experimental results show that the F1 score of intent identification is 0.867, the F1 score of slot filling is 0.780, and the accuracy on no-slot utterances is 0.977. The overall score is 1.312, which proves the effectiveness of our system.</p>
      </abstract>
      <kwd-group>
        <kwd>Intent Identification</kwd>
        <kwd>Slot Filling</kwd>
        <kwd>Conditional Random Field</kwd>
        <kwd>Multi-fastText</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>⋆ denotes the corresponding author.</p>
      <p>Given a user utterance in the music domain, a dialogue system must first
decide whether the utterance expresses a musical intent; this task is called
intent identification. The slots (such as artist and song names) mentioned in
the utterance then need to be extracted. This task is called slot filling.
In this paper, we take text classification methods to deal with the first task
and named entity recognition (NER) approaches to handle the second task.</p>
      <p>
        Text classification is an important task in natural language processing with
many applications [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The key problem in text classification is feature
representation, which is commonly based on the bag-of-words model. Several feature
selection approaches, including frequency and term frequency-inverse document
frequency (TF-IDF), are applied to select more informative features. Owing to the success of
word embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], recent neural network methods [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have been applied to
text classification, obtaining attractive performance. We propose Multi-fastText,
based on fastText [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], to handle music-domain intent identification. Most
existing approaches to NER are based on machine learning methods, including
Hidden Markov Models (HMM) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Support Vector Machines (SVM) and
Conditional Random Fields (CRF) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In recent years, neural network methods
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have been successful in NER, achieving competitive performance.
      </p>
      <p>Although many methods have achieved competitive performance in text
classification and named entity recognition respectively, applying them to the
current task raises several problems: music-domain intent identification and
slot filling must be considered jointly, and the spoken dialogue text in a
specific scenario (the music scenario in this paper) is irregular. In this
study, we develop a hybrid system and participate in the 2018 CCKS challenge.
The proposed hybrid system is based on three main component methods (rules,
CRF and Multi-fastText). To summarize, the main contributions of this paper
are: (1) We develop a hybrid system, which jointly considers the performance
of the music-domain intent identification and slot filling tasks. (2) For the
spoken dialogue text generated in the particular music scene, we have explored
some favorable rules (including some external dictionary resources). (3) We
experiment on the CCKS-2018 task 2 datasets and the results prove the
effectiveness of our system.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <sec id="sec-2-1">
        <title>FastText</title>
        <p>
          FastText [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is a library for learning word embeddings and for text
classification. Its architecture is similar to the CBOW model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], where the
middle word is replaced by a label, and the softmax function computes the
probability distribution over the predefined classes.
        </p>
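      <p>As a minimal illustration of the fastText-style classifier just described (averaged word vectors fed through a linear layer and a softmax over the predefined classes), the following sketch uses made-up toy embeddings and weights; it is not the fastText library itself:</p>

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fasttext_like_predict(words, embeddings, weights, labels):
    """Average the word vectors of the utterance, then apply a linear
    layer + softmax over the predefined classes (fastText-style)."""
    dim = len(next(iter(embeddings.values())))
    avg = [0.0] * dim
    for word in words:
        vec = embeddings.get(word, [0.0] * dim)  # unknown words -> zeros
        avg = [a + v / len(words) for a, v in zip(avg, vec)]
    scores = [sum(wi * ai for wi, ai in zip(weights[label], avg))
              for label in labels]
    probs = softmax(scores)
    return labels[probs.index(max(probs))]

# Toy 2-d embeddings and per-class weights, made up for illustration.
emb = {"播放": [1.0, 0.0], "冰雨": [0.8, 0.2], "天气": [0.0, 1.0]}
w = {"music": [1.0, -1.0], "other": [-1.0, 1.0]}
label = fasttext_like_predict(["播放", "冰雨"], emb, w, ["music", "other"])
```
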
      </sec>
      <sec id="sec-2-2">
        <title>CRF</title>
        <p>
          Conditional Random Field (CRF) is a kind of discriminative undirected
probabilistic graphical model, often used for labeling or parsing sequential
data. In particular, it has been shown to be useful in POS tagging, shallow parsing
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and named entity recognition [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>We assume that the random variable sequences X and Y are of the same
length, and write x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) for a
generic input sequence and label sequence, respectively. A CRF on (X, Y) is
specified by a vector f of local features and a corresponding weight vector λ.
The CRF global feature vector is given by F(y, x) = Σ_i f(y, x, i), where x is
the input sequence, y is the label sequence and i ranges over the input
positions. The conditional probability distribution defined by the CRF is then
p_λ(Y|X) = exp(λ·F(Y, X)) / Σ_y exp(λ·F(y, X)). For training examples
{(x_t, y_t)}, t = 1, ..., N, the goal is to maximize the log-likelihood
L_λ = Σ_t log p_λ(y_t|x_t).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Model</title>
      <p>Figure 1 shows the overall architecture of our system for the command
understanding task oriented to the music field. In order to increase the
diversity of data for named entity recognition, we used some external data
(the NLPCC 2018 task 2 datasets). Pre-processing includes character
segmentation and part-of-speech tagging (we treat each character as an
independent unit for POS). The datasets come from real user utterances in a
dialogue system: each sample is a segment of the user's discourse consisting
of three sentences. Our goal is to determine whether the last sentence carries
a musical intent and to perform slot filling on it. Through statistics, we
found that in a large number of samples the last sentence is a short one.
Therefore, we must make reasonable use of the preceding information. We cut
the datasets into three parts, containing respectively the first, the second,
and the third sentence of each sample. We designed the Multi-fastText method
to better mine the user's multi-round dialogue information. Multi-fastText
works as follows: (1) If fastText3 determines that the third sentence carries
a musical intent, then the third sentence is classified as musical intent;
(2) Otherwise, if fastText1 and fastText2 judge that the first and second
sentences both carry musical intents, the third sentence is also classified
as musical intent; (3) In all other cases, the third sentence carries no
musical intent. Post-processing includes some rules, detailed as follows:</p>
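      <p>The Multi-fastText decision rule above can be sketched in a few lines. This is a minimal illustration of the combination logic only, with a made-up function name, assuming the three per-sentence fastText classifiers have already produced boolean predictions:</p>

```python
def multi_fasttext_intent(first_is_music, second_is_music, third_is_music):
    """Combine the boolean outputs of three per-sentence classifiers
    (fastText1/fastText2/fastText3) into the final intent decision
    for the third (current) sentence."""
    if third_is_music:                       # rule (1): trust the current sentence
        return True
    if first_is_music and second_is_music:   # rule (2): fall back on context
        return True
    return False                             # rule (3): no musical intent
```
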
      <p>Rule 1: Full sentence matching rule (FSMR): (a) If the entire sentence
contains only one artist name, it is labeled with the artist label.
Specifically, we crawled 28712 artist names from the Internet as an external
artist dictionary resource. (b) If the entire sentence contains only one song
name and the song name is not in the ambiguous song dictionary, it is labeled
with the song label. Specifically, we crawled 172511 song names from the
Internet as an external song dictionary resource. However, there are some
ambiguous song names in the song dictionary, such as "点歌" ("request a song")
and "一首歌" ("a song"). Therefore, we also maintain an external ambiguous
song dictionary.</p>
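      <p>A minimal sketch of FSMR, under the assumption that the crawled dictionaries are available as plain collections of strings (the function name and data layout are ours, not the paper's):</p>

```python
def full_sentence_match(sentence, artist_dict, song_dict, ambiguous_songs):
    """FSMR sketch: label the utterance when exactly one known entity
    occurs in it; ambiguous song names are skipped."""
    artists = [a for a in artist_dict if a in sentence]
    if len(artists) == 1:
        return ("artist", artists[0])
    songs = [s for s in song_dict
             if s in sentence and s not in ambiguous_songs]
    if len(songs) == 1:
        return ("song", songs[0])
    return None  # rule does not apply
```
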
      <p>Rule 2: Entity re-identification rule (ERIR): Specifically, we consider
two entity types: "artist" and "song". Using the song dictionary, if the
model-based result is a substring of certain entities in the dictionary (and
the entity is a substring of the sentence), we correct the result to the
shortest entity that meets these requirements. For example, if the model-based
result is "路口" while the song dictionary includes "下一个路口" and the
utterance contains "下一个路口", the final result is revised to "下一个路口"
by the entity re-identification rule. Conversely, if some entities in the
dictionary (where the entity is a substring of the sentence) are substrings of
the model-based result, the result is corrected to the longest entity that
meets these requirements. For example, if the model-based result is
"刘德华冰雨" while the song dictionary includes "冰雨", the final result is
revised to "冰雨". The external artist dictionary is used in the same way.</p>
      <p>[Table 1: F1E results for the system variants CRF+Multi-fastText,
CRF+Multi-fastText+FSMR and CRF+Multi-fastText+FSMR+ERIR.]</p>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Datasets and Implementation</title>
        <p>We trained 300-dimensional character embeddings (used as pre-trained
embeddings for Multi-fastText) with the word2vec tool
(https://code.google.com/archive/p/word2vec/), using the entire 2018 Chinese
Wikipedia dump (https://dumps.wikimedia.org/zhwiki/) as the training corpus.
We merged the NLPCC 2018 task 4 datasets
(http://tcci.ccf.org.cn/conference/2018/taskdata.php#) with the CCKS datasets
as the training set for the CRF model (implemented with sklearn-crfsuite,
http://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html), and used
PyNLPIR (https://pypi.org/project/PyNLPIR/0.4.1/) for part-of-speech
tagging.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Experiment result and analysis</title>
        <p>From Table 1, we can see that our system achieves competitive
performance. The best result is an F1E of 0.780, an F1I of 0.867, an Acc of
0.977, and an overall score of 1.312. Specifically, we find that the rules are
effective: without rules, F1E plateaus at 0.753, while CRF + rules achieves an
F1E of 0.780. However, adding ERIR on top of FSMR yields only a very weak
performance improvement (only 0.009). We speculate that there are several
reasons for this: (1) the resources used as external dictionaries were
obtained through web crawlers, and some entries may not have been fully
cleaned, thus introducing noise; (2) we only selected two entity types
("artist" and "song"), whose total number of entries is huge, for the rule
actions, but in fact the task contains many more entity types.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we proposed a hybrid system, named CVTE SLU, for the
command understanding task oriented to the music field. Experiments on the
2018 CCKS corpus prove the effectiveness of our system. For future work, we
will focus on two aspects: first, maintaining the external dictionary
resources to make their quality more reliable; second, applying the rule
method proposed in this paper to all entity categories.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Levin</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Pieraccini</surname>
          </string-name>
          ,
          <article-title>User modeling for spoken dialogue system evaluation</article-title>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Yao</surname>
          </string-name>
          , “
          <article-title>Acv-tree: A new method for sentence similarity modeling</article-title>
          ,” in
          <source>IJCAI</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4137</fpage>
          -
          <lpage>4143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ducharme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Jauvin</surname>
          </string-name>
          , “
          <article-title>A neural probabilistic language model</article-title>
          ,”
          <source>Journal of machine learning research</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>Feb</issue>
          , pp.
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave, P. Bojanowski, and T. Mikolov, “
          <article-title>Bag of tricks for efficient text classification</article-title>
          ,” in
          <source>EACL</source>
          ,
          <year>2017</year>
          , p.
          <fpage>427</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Morwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jahan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Chopra</surname>
          </string-name>
          , “
          <article-title>Named entity recognition using hidden markov model (HMM)</article-title>
          ,”
          <source>IJNLC</source>
          , vol.
          <volume>1</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>23</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-J.</given-names>
            <surname>Chen</surname>
          </string-name>
          , “
          <article-title>Automatic recognition of chinese organization name based on cascaded conditional random fields</article-title>
          ,”
          <source>Acta Electronica Sinica</source>
          , vol.
          <volume>34</volume>
          , no.
          <issue>5</issue>
          , p.
          <fpage>804</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          and E. Hovy, “
          <article-title>End-to-end sequence labeling via bi-directional lstm-cnns-crf</article-title>
          ,”
          <source>arXiv preprint arXiv:1603.01354</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          , “
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,” in
          <source>Advances in neural information processing systems</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>F.</given-names>
            <surname>Sha</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          , “
          <article-title>Shallow parsing with conditional random fields</article-title>
          ,” in
          <source>NAACL</source>
          ,
          <year>2003</year>
          , pp.
          <fpage>134</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>