<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Toward Latent Knowledge Extraction Based on the Correlation of Heterogeneous Text Data Related to Space System Development</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yasushi Ueda</string-name>
          <email>ueda.yasushi@jaxa.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naoko Okubo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Masafumi Katahira</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toshiyuki Amagasa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Japan Aerospace Exploration Agency</institution>
          ,
          <addr-line>Tsukuba, Ibaraki</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Tsukuba</institution>
          ,
          <addr-line>Tsukuba, Ibaraki</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>2786</fpage>
      <lpage>2792</lpage>
      <abstract>
        <p>This paper highlights the importance of carefully selecting appropriate NLP tasks or techniques to derive value from past documents and improve the requirements engineering process. As a case study, we report our experience of introducing NLP techniques to find missing requirements by using heterogeneous documents. Word similarity is one way to determine the relevance between two documents; however, the results of the proposed scheme in finding meaningfully related pairs of documents, together with further investigation, show that word similarity cannot solve our problem. In our experiments, a convolutional neural network (CNN) model estimated the relevance better than the other models we tried.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Recently, deep learning has led to remarkable improvements in natural language processing (NLP) research
[You17]. As a result, not surprisingly, many industries have been encouraged to introduce NLP to improve their
internal engineering processes using various types of document resources that contain potential value [Fal13].
However, careful selection of appropriate NLP tasks or techniques is essential to derive value for one's own problem.</p>
      <sec id="sec-1-1">
        <title>Features</title>
        <p>Tables 1 and 2: features of the two document types (average number of words and quantity).</p>
        <p>In this paper, we report our experience of introducing NLP techniques to find missing requirements
by using heterogeneous documents maintained at JAXA (Japan Aerospace Exploration Agency). It is vital to
make the best use of past documents because of the characteristics of space systems: the products are not
mass-produced, and their development life cycle is very long, e.g., 20 years. Here, anomaly reports are used
to uncover insufficient requirements by searching for their relevance to the development documents, thereby
contributing to preventing previously experienced anomalies. At JAXA, requirements for a software component
are organized and technically documented as a design document, which is reviewed carefully. Hence, the
design document is considered an appropriate target for confirming the adequacy of the requirements, from
the viewpoints of both its contents and the importance of the process phase gate. More precisely, if we can
present past anomalies that are related to a design document with high precision, we think that users may
recognize potential errors in the design. It may also contribute to training novice designers. From these
perspectives, we regard finding valuable anomaly reports, which provide unknown, latent, or forgotten knowledge
about the target software component, as equivalent to estimating the strength of the correlation between
anomaly reports and design documents.</p>
        <p>We first explain the features of the two target document types, namely past anomaly reports and design documents.
Tables 1 and 2 show the features of each. Let us look at the characteristics of the anomaly reports.
They are created by various workers, stored in a web-based system, and used to manage the status of
anomalies and, more importantly, to prevent the recurrence of previously experienced anomalies. In general, therefore,
the descriptions tend to be specific, but they often contain ambiguity introduced by human authors, e.g., spelling variations
and abbreviations. Also, the reports are collected across a wide range of products (launch vehicles, spacecraft,
ground systems, etc.) and causes (deterioration, incorrect operation, logical errors, etc.). Consequently, we
observe a significant discrepancy between the anomaly reports and the design documents in various aspects.
The design documents, on the other hand, are generated during the development process and referred to in technical
reviews through phased project planning.<sup>1</sup> As described before, the review of design documents is an
important phase gate for ensuring requirement quality, so finding related past anomaly reports is beneficial.</p>
        <p>In our first attempt, it was difficult to find the relevance between the two document types with simple keyword matching
(Section 2). The underlying assumption of keyword matching was that the relevance of anomaly reports to
development documents can be explained by similarity, especially the similarity of keywords. This result
highlights the importance of selecting NLP tasks or techniques that match the characteristics of one's
problem and of the language resources.</p>
        <p>The contribution of this paper is twofold. First, we propose a simple scheme for associating heterogeneous documents
(development documents and anomaly reports) using word embedding and a convolutional neural network (CNN),
and the results show that the scheme is capable of finding meaningfully related pairs of
document portions (Sections 3 and 4). Second, we highlight the importance of selecting the appropriate NLP task,
since word similarity is not able to solve our problem (Section 5).</p>
        <p><sup>1</sup> Actual samples of the design documents and anomaly reports are given in Appendix A.</p>
        <p>The goal of this work is to find related anomaly reports for a design document, thereby contributing to preventing
previously experienced anomalies. This section introduces two preliminary experiments: one uses a
keyword matching technique, and the other is based on document similarity comparison. Although both apply some
statistical processing to the accumulated documents, neither is a supervised learning approach. They
can therefore be applied at reasonable cost, but their performance was insufficient.</p>
        <p>First, we adopted a simple keyword matching approach to detect valuable correlations between anomaly
reports and design documents. Keyword matching was executed using five keywords extracted from the target
document. Okapi BM25 [Rob09], which quantifies the importance of a word in a document based on the frequency
of its appearance, was used to select the keywords. Three examiners evaluated the results for twenty
design documents. For each design document, the top twenty anomaly reports ranked by keyword-matching similarity
were evaluated in terms of their correlation to the design document. The resulting precision was 8.25%,
which is insufficient. This revealed that the approach is unlikely to achieve the goal.</p>
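        <p>The keyword selection and matching steps described above can be sketched roughly as follows. This is a minimal Python illustration, not the implementation used in the experiment: the function names, the whitespace tokenization of the toy corpus, the BM25 parameters <italic>k1</italic> = 1.2 and <italic>b</italic> = 0.75, and the smoothed IDF variant are all assumptions.</p>
        <preformat>
```python
import math
from collections import Counter


def bm25_keywords(target, corpus, top_k=5, k1=1.2, b=0.75):
    """Return the top_k terms of `target` ranked by their BM25 term weight.

    `target` is a list of tokens and `corpus` is a list of token lists.
    k1 and b are common defaults, not values taken from the paper.
    """
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    term_freq = Counter(target)
    length_norm = k1 * (1 - b + b * len(target) / avg_len)

    weights = {}
    for term, tf in term_freq.items():
        df = doc_freq[term]
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        weights[term] = idf * tf * (k1 + 1) / (tf + length_norm)
    # Sort by descending weight; ties are broken alphabetically.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:top_k]


def keyword_match_score(keywords, report):
    """Count how many of the selected keywords occur in an anomaly report."""
    report_terms = set(report)
    return sum(1 for kw in keywords if kw in report_terms)


# Hypothetical toy corpus; the real documents are Japanese and tokenized.
corpus = [
    "star tracker outputs attitude quaternion".split(),
    "reaction wheel controls attitude".split(),
    "ground system logs attitude telemetry".split(),
]
keywords = bm25_keywords(corpus[0], corpus, top_k=2)
matches = [keyword_match_score(keywords, report) for report in corpus[1:]]
```
        </preformat>
        <p>In this toy corpus the word "attitude" occurs in every document, so it receives a low weight and is not selected as a keyword.</p>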
        <p>Two issues contributed to this low precision. One is the homograph problem: for example,
"wheel" sometimes means a part of a ground vehicle and in other situations means a reaction wheel, which
is used to control the attitude of a satellite. As a result, anomaly reports with different contexts are
likely to be retrieved with high similarity. The other is the spelling variation problem. Although "Star Tracker,"
"star tracker," "STT," and "Star Tracker" written in Japanese all have exactly the same meaning, keyword matching
cannot treat them as the same. Preparing a dictionary for name collation may be a solution, but its cost
would not be reasonable.</p>
        <sec id="sec-1-1-1">
          <title>Sentence Similarity Approach Using Word Embedding</title>
          <p>As the second preliminary experiment, we estimated the correlation using the similarity of two documents derived
from a word embedding technique. The cosine similarity of the document feature vectors is used to score the
correlation, and the document feature vector <italic>d</italic> is simply defined as follows.</p>
          <p>
            <disp-formula>
              <label>(1)</label>
              <tex-math>d = \frac{1}{n} \sum_{i=1}^{n} v_i</tex-math>
            </disp-formula>
          </p>
          <p>Here, <italic>v</italic><sub><italic>i</italic></sub> is the embedding vector of the <italic>i</italic>-th word that appears in the document, and <italic>n</italic> is the number of words. The well-known
word2vec method [Tom13] was adopted to obtain the word embedding vectors. All the anomaly reports
and the Wikipedia topics related to space system development were used for training. The training corpus contained 760,337
documents with a vocabulary of 101,991 words; the other parameters are listed in Table 3.</p>
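          <p>The scoring based on Equation (1) can be illustrated with a short Python sketch. It assumes that word vectors are already available in a dictionary (here called <italic>embeddings</italic>, a hypothetical name) and that unknown words are simply skipped; the paper does not state how unknown words were handled.</p>
          <preformat>
```python
import math


def document_vector(words, embeddings):
    """Average the embedding vectors of the words in a document (Eq. 1).

    Words missing from `embeddings` are skipped, which is an assumption
    of this sketch.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words in document")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_reports(design_doc, reports, embeddings, top_k=20):
    """Return indices of the top_k anomaly reports most similar to design_doc."""
    query = document_vector(design_doc, embeddings)
    scored = [
        (cosine_similarity(query, document_vector(r, embeddings)), idx)
        for idx, r in enumerate(reports)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]


# Hypothetical two-dimensional embeddings, for illustration only.
toy_embeddings = {
    "star": [1.0, 0.0],
    "tracker": [1.0, 0.0],
    "wheel": [0.0, 1.0],
    "motor": [0.0, 1.0],
}
ranking = rank_reports(
    ["star", "tracker"], [["wheel", "motor"], ["star"]], toy_embeddings
)
```
          </preformat>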
          <p>The evaluation was conducted using the same design documents and examiners as in Subsection 2.1. The
precision was 19.3%, which is better than the result of the keyword matching approach. Although this performance
was still not sufficient, the two problems described in the previous subsection were mitigated.</p>
          <p>First, 44.0% of the top five results of the keyword matching approach were misguided reports caused by the
homograph problem, whereas the corresponding error rate of this approach was 10.0%. We attribute
the improvement to the document feature vector, which reflects the whole document rather than only the
keywords. The basic ability of word embedding to represent semantically similar words as similar
vectors would also help reduce the adverse effect of the spelling variation problem.</p>
          <p>Additionally, the number of good correlations varies widely depending on the query
design document. Figure 1 shows the number of good anomaly reports among the five most similar reports
for each design document. Looking at the details, the design documents with high precision have two properties.
The first is that the content describes a single topic. For example, design document 17,
which contains only the definition of the reference time in the module, is such a case. In contrast, design document
20 concerns porting a function that had been implemented in hardware to software, and it therefore contains
words related to both hardware and software. The second is that the document concerns a device with a distinctive character.
One example is the star tracker, an optical sensor that captures designated stars to estimate the pose
of the satellite. Design document 19 has both properties.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Correlation Estimation Model</title>
      <p>Based on the results and considerations of the preliminary experiments, we used a CNN to estimate the correlation
between a design document and an anomaly report. The effectiveness of CNNs for document classification
tasks has been reported [Kim14][Ye15]. Figure 2 shows our model.</p>
      <sec id="sec-2-1">
        <title>Dataset</title>
        <p>We prepared a dataset to train and evaluate our model. Each record consists of a design document and an anomaly
report, and is annotated by three examiners who have similar expertise. They judged whether checking the
anomaly report is useful for preventing design flaws in the design document. Items judged
valuable by two or more examiners were treated as positive.</p>
        <p>This subsection describes the details of the CNN architecture. We formulated correlation estimation as a
classification task that decides whether a document pair is valuable. The architecture was designed to be simple and to have few
parameters. Complex, large models are thought to be difficult to apply to our problem because
training them would require a large-scale dataset, which ours is not. As described in
Subsection 3.1, it was prepared from scratch with limited resources.</p>
        <p>The input to the CNN is the list of words contained in a design document and an anomaly report. Each document
was divided into words by MeCab [Tak04], a popular morphological analysis tool for Japanese text.</p>
        <p>A commonly used word embedding layer was applied at the input, since it has been shown that there is no obvious
relationship between the quality of word vectors and performance on a downstream task [Gla16]. The length of the input
word list is defined as <italic>L</italic>, and <italic>x</italic><sub><italic>i</italic></sub> denotes the one-hot vector representation of the <italic>i</italic>-th word. A vocabulary file was
prepared in advance to create the one-hot vectors. Through the embedding layer, the <italic>i</italic>-th one-hot vector becomes a
word embedding vector <inline-formula><tex-math>v_i \in \mathbb{R}^{N_{word}}</tex-math></inline-formula>, where <inline-formula><tex-math>N_{word}</tex-math></inline-formula> is a predefined dimension of the word embedding vector. When the
input word list is shorter than the defined length, which is <italic>m</italic> + 1 for the design document in Figure 2,
zero-filled vectors are placed in the vacant positions.</p>
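        <p>The input preparation described above might look like the following Python sketch. Because multiplying a one-hot vector by the embedding matrix selects a single row, the sketch uses an index lookup instead of building one-hot vectors. The function names, the reserved padding id 0, the truncation of overlong inputs, and the skipping of unknown words are assumptions.</p>
        <preformat>
```python
def build_vocabulary(documents):
    """Map each distinct word to an integer id; id 0 is reserved for padding."""
    vocab = {}
    for doc in documents:
        for word in doc:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def embed_and_pad(words, vocab, embedding_table, length):
    """Look up each word's embedding row and zero-fill up to `length` rows.

    Truncating long inputs and skipping unknown words are assumptions of
    this sketch, not details given in the paper.
    """
    dim = len(embedding_table[0])
    ids = [vocab[w] for w in words if w in vocab][:length]
    rows = [list(embedding_table[i]) for i in ids]
    # Fill the vacant positions with zero vectors.
    rows.extend([0.0] * dim for _ in range(length - len(rows)))
    return rows


# Hypothetical vocabulary and a tiny embedding table (row 0 is the padding row).
vocab = build_vocabulary([["star", "tracker"], ["wheel"]])
table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
padded = embed_and_pad(["tracker", "star"], vocab, table, length=4)
```
        </preformat>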
        <p>The convolution layer has filters of several sizes. Each filter is an element of <inline-formula><tex-math>\mathbb{R}^{N_{word} \times L_w}</tex-math></inline-formula>,
where <inline-formula><tex-math>L_w</tex-math></inline-formula> is the window size of the filter. The filters learn to extract the word sequence patterns that are
essential for classification. The max pooling layer takes the strongest signal from the output of the convolution
layer. Although the position of the word sequence pattern is lost by this operation, we considered
this acceptable for classifying the patterns, because position varies with the writer, the category of the article,
and so on. The fully connected layer then learns combinations of the pooled signals and judges whether an
input document pair is valuable. The softmax layer normalizes and outputs two signals: one means the
pair is valuable, and the other means it is not.</p>
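        <p>The forward pass through the layers described above can be sketched in plain Python as follows. This is only an illustration of the convolution, max pooling, fully connected, and softmax steps. The ReLU activation, the sharing of filters between the two documents, and the concatenation of their pooled features are assumptions, since the wiring in Figure 2 is not described in the text.</p>
        <preformat>
```python
import math


def conv_max_pool(sequence, filt, bias=0.0):
    """Slide one filter over the word sequence and keep its strongest response.

    `sequence` is a list of word vectors; `filt` is a list of L_w weight
    vectors, one per position in the window. ReLU is an assumed activation.
    """
    window = len(filt)
    best = 0.0  # a ReLU output is never below zero
    for start in range(len(sequence) - window + 1):
        total = bias
        for offset in range(window):
            word_vec = sequence[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], word_vec))
        best = max(best, total)
    return best


def softmax(logits):
    """Normalize logits into probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pair(design_seq, report_seq, filters, fc_weights, fc_bias):
    """Return [P(valuable), P(not valuable)] for one document pair.

    Applying the same filters to both documents and concatenating the
    pooled features is an assumption of this sketch.
    """
    features = [conv_max_pool(design_seq, f) for f in filters]
    features += [conv_max_pool(report_seq, f) for f in filters]
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(fc_weights, fc_bias)
    ]
    return softmax(logits)


# Hypothetical toy inputs: 2-dimensional word vectors and one filter of window 2.
design_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
report_seq = [[0.0, 0.0], [0.0, 0.0]]
toy_filter = [[1.0, 0.0], [0.0, 1.0]]
probs = classify_pair(
    design_seq, report_seq, [toy_filter],
    fc_weights=[[1.0, 0.0], [0.0, 1.0]], fc_bias=[0.0, 0.0],
)
```
        </preformat>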
        <p>Although this architecture does not directly compare the features of the two documents to judge their correlation, we
expected the relation to be acquired during the learning phase. In other words, each design document in the
training dataset has both related and unrelated anomaly reports, so the CNN will learn to extract
features from them. In addition, from our viewpoint, the functional similarity of words is not considered
equal to the value of the correlation in this case. This is why the architecture was adopted. Although
there may be cases in which valuable correlated items have a high functional similarity, this is not always
true. We discuss this in Section 5 with an experiment.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>To evaluate our model, we conducted a comparative experiment between the model's estimates and
the judgments of the examiners.</p>
      <p>The dataset described in Subsection 3.1 consists of 5,000 pairs: 10 anomaly reports for each of 500
design documents. The design documents were selected from technical documents for about four different Earth
orbiters. The technical domain is the attitude and orbit control system of a satellite. There are 882 positive
and 4,118 negative samples.</p>
      <sec id="sec-3-1">
        <title>Training Parameters</title>
        <p>The parameters for training the CNN are shown in Table 4. In the training step, 882 negative samples were
randomly picked from the population of negative samples to avoid incorrect learning due to the imbalance between
positives and negatives. For the validation set, 15 positive and 15 negative cases were also selected at random. Thus,
1,734 samples with equal numbers of positives and negatives were used in the training phase, and 30
samples were used to validate the generalization performance.</p>
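        <p>The balancing and hold-out procedure described above can be sketched as follows. The function name, the fixed random seed, and the representation of samples as integers are assumptions made for illustration.</p>
        <preformat>
```python
import random


def balanced_split(positives, negatives, n_validation=15, seed=0):
    """Undersample negatives to match positives, then hold out a validation set.

    Returns (train, validation) lists of (sample, label) tuples. The seed and
    the tuple layout are choices of this sketch, not details from the paper.
    """
    rng = random.Random(seed)
    kept_negatives = rng.sample(negatives, len(positives))
    pos = positives[:]
    rng.shuffle(pos)

    validation = [(s, 1) for s in pos[:n_validation]]
    validation += [(s, 0) for s in kept_negatives[:n_validation]]
    train = [(s, 1) for s in pos[n_validation:]]
    train += [(s, 0) for s in kept_negatives[n_validation:]]
    rng.shuffle(train)
    return train, validation


# Stand-ins for the 882 positive and 4,118 negative pairs of the dataset.
positives = list(range(882))
negatives = list(range(882, 5000))
train, validation = balanced_split(positives, negatives)
```
        </preformat>
        <p>With 882 positives, this yields 867 positive and 867 negative training samples (1,734 in total) and 30 validation samples, matching the counts given above.</p>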
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>Forty pairs of design documents and anomaly reports were prepared to evaluate the model. The
set consists of 4 design documents with 10 anomaly reports each. Both the design documents and the anomaly
reports were completely separate from the training data. The examiner was also different from the people who
annotated the training set. There were 17 positive samples and 23 negative ones.</p>
        <p>Accuracy, precision, recall, F-measure, and mean average precision at 5 (MAP@5) were measured to evaluate
the model performance. The scores are averages over 10 training and evaluation trials. When measuring the
MAP@5 score, the results for each of the 4 design documents are first collected separately. The 10 items are then
sorted by the softmax layer output value so that the results are arranged like a list of recommendations.</p>
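        <p>The MAP@5 computation outlined above can be sketched as follows. Several conventions exist for normalizing average precision at a cutoff. This sketch divides by the smaller of the cutoff and the number of valuable items, which is an assumption, since the variant used in the experiment is not stated. The function names and the toy scores are hypothetical.</p>
        <preformat>
```python
def average_precision_at_k(scores, labels, k=5):
    """AP@k for one design document.

    `scores` are the softmax outputs for the "valuable" class and `labels`
    are 1 for valuable pairs and 0 otherwise. Normalizing by
    min(k, number of valuable items) is an assumed convention.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_relevant = sum(labels)
    if n_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order[:k], start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, n_relevant)


def mean_average_precision_at_k(per_document, k=5):
    """Average AP@k over (scores, labels) pairs, one pair per design document."""
    values = [average_precision_at_k(s, l, k) for s, l in per_document]
    return sum(values) / len(values)


# Hypothetical scores and labels for two design documents.
doc_a = ([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 0, 1, 0, 0, 0])
doc_b = ([0.9, 0.1], [0, 1])
map_at_5 = mean_average_precision_at_k([doc_a, doc_b], k=5)
```
        </preformat>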
        <p>Table 5 shows the results, and Figure 3 shows their distributions as a boxplot. When only pairs
whose softmax output value exceeds 0.9 are treated as the model's recommendations, the precision
becomes 89.0%.</p>
        <p>The recall is slightly worse than the other scores: the accuracy on positive samples alone is 64.7%,
whereas that on negative samples is 87.0%. From these results, we can say that the model is useful for filtering out
anomaly reports that are not useful. However, if our model is to be used as part of a recommender system,
it is desirable that it recommend anomaly records from different perspectives while avoiding recommending similar
ones. We recognize that recommendation diversification is an important issue and will address it as part of
our future work.</p>
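        <p>As a rough consistency check on the per-class figures above, the following sketch derives the usual rates from confusion-matrix counts. The counts are an assumption: 64.7% of 17 positives and 87.0% of 23 negatives round to 11 and 20 correct classifications, but the reported values are averages over 10 trials, so no single run necessarily produced exactly these counts.</p>
        <preformat>
```python
def summarize(tp, fn, tn, fp):
    """Derive classification rates from confusion-matrix counts."""
    recall = tp / (tp + fn)             # accuracy on positive samples
    negative_accuracy = tn / (tn + fp)  # accuracy on negative samples
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, negative_accuracy, precision, f_measure


# Assumed counts: 11 of 17 positives and 20 of 23 negatives classified correctly.
recall, negative_accuracy, precision, f_measure = summarize(tp=11, fn=6, tn=20, fp=3)
```
        </preformat>
        <p>With these assumed counts, the recall rounds to 64.7%, the accuracy on negatives to 87.0%, and the F-measure to 71.0%, which agrees with the F-measure reported in the Conclusion.</p>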
        <p>The proposed model performed better than the statistical approaches described in Section 2. The adverse
effects of the homograph and spelling variation problems also decreased. The homograph problem is slightly
alleviated by using whole documents in the estimation. The embedding layer learns during training to treat spelling variations
alike, in the same way as the word embedding approach. In this case, the embedding-layer vector
for "STT" and that for "Star Tracker" in Japanese are quite similar: the latter ranks 14th when all
18,274 word vectors are sorted by cosine similarity to the "STT" word vector.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Comparison with the Similarity Based Approach</title>
      <p>Our model uses the CNN-based architecture to estimate the value of the correlation between design documents
and anomaly reports, as annotated by examiners. On the other hand, the similarity between documents is
often used as an important feature, for example in keyword matching approaches. However, our underlying assumption
is that similarity does not work sufficiently for our task, because the correlation between our target documents is not simple
from the viewpoint of knowledge. To confirm this, we conducted an experiment using a similarity estimation approach,
as described in this section.</p>
      <p>In the experiment, the similarity is estimated using a Siamese network [Bro93]. It is a well-known model that
can handle the similarity of two chunks of data, and its performance in learning the
similarity of sentences has recently been reported [Mue16]. Hence, if the Siamese network can estimate the value of the
correlation with high accuracy, similarity can be said to serve as the basis for judging it.</p>
      <p>This network consists of two parts. One is the feature extraction part, which is a neural network; the other is
the similarity measuring part, which compares two feature vectors. The weights of the feature extraction network
are trained to reduce the error between the feature distance and the supervised distance. For example, when
cosine similarity is used as the distance measure, the supervised signal is 1 for a positive sample and
0 or -1 for a negative one.</p>
      <p>Figure 4 shows an overview of our Siamese network model. The CNN described in Subsection 3.2 was adopted
as the feature extraction part, with two slight modifications. The first is the output dimension of the fully
connected layer, which was changed to 256. The second is the length of the input word list, which was changed to
431. The distance was calculated as the dot product, so the supervised signal for a valuable pair is 1.</p>
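      <p>The training objective of the Siamese model described above can be sketched as follows. The encoder is passed in as a function so that the same weights are applied to both inputs. The squared-error form of the loss and the target of 0 for non-valuable pairs are assumptions, and the identity encoder in the usage lines stands in for the CNN only for illustration.</p>
      <preformat>
```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def siamese_similarity(encode, design_doc, report):
    """Encode both inputs with the same function (shared weights) and compare."""
    return dot(encode(design_doc), encode(report))


def siamese_loss(encode, pairs, negative_target=0.0):
    """Mean squared error between predicted similarity and the supervised target.

    Valuable pairs have target 1. The target for non-valuable pairs and the
    squared-error form are assumptions of this sketch.
    """
    total = 0.0
    for design_doc, report, is_valuable in pairs:
        target = 1.0 if is_valuable else negative_target
        error = siamese_similarity(encode, design_doc, report) - target
        total += error * error
    return total / len(pairs)


def identity_encoder(vector):
    """Hypothetical stand-in for the CNN feature extractor."""
    return vector


toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0], True),   # similar features, valuable
    ([1.0, 0.0], [0.0, 1.0], False),  # dissimilar features, not valuable
]
loss = siamese_loss(identity_encoder, toy_pairs)
```
      </preformat>
      <p>The loss can only approach zero when valuable pairs have similar features and non-valuable pairs have dissimilar ones, which is the property that the experiment in this section tests.</p>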
      <sec id="sec-4-1">
        <title>Evaluation of Similarity Based Model</title>
        <p>This subsection describes the evaluation of the Siamese network model. It was trained using the same dataset
described in Subsection 3.1. The evaluation set and other parameters are also the same as in Subsections 4.1 and
4.2. However, the number of epochs was doubled to 400 because the training did not progress well.</p>
        <p>The results are listed in Table 6. They are significantly lower than those of our proposed model (Table 5). Figure 5
shows how the loss and accuracy change during a training run. The loss could not be reduced enough to achieve
sufficient accuracy even on the training set. This reveals that the examiners' judgments of valuable correlation
rest on attributes different from document similarity. These results support our underlying assumption.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We explored appropriate techniques for estimating the value of the correlation between a design document and an
anomaly report, in terms of whether the report provides valuable knowledge for finding missing requirements that could lead to the
recurrence of past anomalies. As a result, the CNN-based model, which can learn non-linear relationships, worked
effectively. In our evaluation experiments, the model achieved an F-measure of 71.0% and a MAP@5 of 80.0%.
This is better than the other approaches described in Sections 2 and 5. In particular,
the performance of the Siamese network shows that the correlation we want to estimate appears to be
a different feature from the similarity of two documents. This is important information for modifying the
model to improve its performance or for finding new models.</p>
      <p>As future work, we plan to use the model in an actual space system development process. This
trial should reveal improvements needed to use the model efficiently in actual development. Although
the evaluation focused on satellite attitude control systems, extending the approach to other technical areas is necessary
to spread the benefit widely across our organization. We will not only add annotated data but also
use transfer learning or more sophisticated neural language models.</p>
    </sec>
    <sec id="sec-6">
      <title>Appendix A</title>
    </sec>
    <sec id="sec-7">
      <title>Sample of documents</title>
      <p>Actual samples of a design document and anomaly records are presented here. The sample design document
concerns the input generated by the star tracker for attitude control calculations. The following is an English
translation of the Japanese original.</p>
      <p>The star tracker captures and tracks several visible stars and outputs their position, star coordinates and
star luminosity in the field of view. The obtained star coordinates are subjected to star identification
processing and attitude determination processing, and the results are output as attitude quaternions.
(Number of words: 45)</p>
      <p>Two contrasting estimation results are shown in Table 7. The correlation column indicates the relevance
to the design document. The first content shows a potential failure mode that may occur when using
the star tracker for the attitude control calculation. Although the second content refers to an
optical component, its main topic is the calculation of the field of view, which is outside the context of attitude control
calculations.</p>
      <table-wrap>
        <label>Table 7</label>
        <table>
          <thead>
            <tr>
              <th>Content</th>
              <th>Correlation</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>The reaction wheel operated with a larger control amount than expected. The data from
the STT was rejected, the attitude determination process was performed using the attitude
data including the error, and STT was unable to identify stars, and repeated acquisition
and tracking. During that time, the input and update of the attitude quaternion are not
performed, the attitude is determined based on the data including the error obtained from
the other sensors, and the attitude angle recognized by the attitude control software and
the actual attitude angle are shifted. Data from STT was rejected even if STT returned
since the gap between data from STT and calculated attitude angle get larger than the
threshold. (Number of words: 115)</td>
              <td></td>
            </tr>
            <tr>
              <td>When capturing a target object with an optical camera, the field of view range was
calculated. However, the pose of the target object and the own pose were set in reverse.
Therefore, in conjunction with the gradient of the object pose, field of view for the optical
camera inclined as well. Then, the angle of field of view get different between + X-axis
direction and the -X direction. (Number of words: 70)</td>
              <td>X</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>[Fal13] Falessi, Davide et al. Automated classification of NASA anomalies using natural language
processing techniques. 2013 IEEE International Symposium on Software Reliability Engineering Workshops
(ISSREW 2013), pp. 5-6, Nov. 2013.</p>
      <p>[Rob09] Robertson, Stephen et al. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations
and Trends in Information Retrieval 3 (4), pp. 333-389, 2009.</p>
      <p>[Tom13] Mikolov, Tomas et al. Distributed Representations of Words and Phrases and their Compositionality.
Advances in Neural Information Processing Systems, Oct. 2013.</p>
      <p>[Kim14] Kim, Yoon. Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746-1751, Oct. 2014.</p>
      <p>[Ye15] Zhang, Ye et al. A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks
for Sentence Classification. arXiv preprint arXiv:1510.03820, 2015.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [You17]
          <string-name>
            <surname>Young</surname>
          </string-name>
          , Tom et al.
          <article-title>Recent Trends in Deep Learning Based Natural Language Processing [Review Article]</article-title>
          .
          <source>IEEE Computational Intelligence Magazine</source>
          ,
          <volume>13</volume>
          (3), pp.
          <fpage>55</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Tak04] Kudo, Taku et al.
          <article-title>Applying Conditional Random Fields to Japanese Morphological Analysis</article-title>
          .
          <source>Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP-2004)</source>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          , May
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Gla16]
          <string-name>
            <surname>Gladkova</surname>
          </string-name>
          , Anna et al.
          <article-title>Intrinsic Evaluations of Word Embeddings: What Can We Do Better?</article-title>
          <source>Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP</source>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>42</lpage>
          , Aug.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>