=Paper= {{Paper |id=Vol-2385/paper4 |storemode=property |title=Legal Query Reformulation using Deep Learning |pdfUrl=https://ceur-ws.org/Vol-2385/paper4.pdf |volume=Vol-2385 |authors=Arunprasath Shankar,Venkata Nagaraju Buddarapu |dblpUrl=https://dblp.org/rec/conf/icail/ShankarB19 }} ==Legal Query Reformulation using Deep Learning== https://ceur-ws.org/Vol-2385/paper4.pdf
Legal Query Reformulation using Deep Learning

Arunprasath Shankar, LexisNexis, Raleigh, USA, arunprasath.shankar@lexisnexis.com
Venkata Nagaraju Buddarapu, LexisNexis, Raleigh, USA, venkatanagaraju.buddarapu@lexisnexis.com

ABSTRACT
Query reformulation is the process of iteratively modifying a query to improve the quality of search engine results. In recent years, the task of reformulating natural language (NL) queries has received considerable attention from both industry and academic communities. Traditionally, query reformulation has mostly been approached using the noisy channel model. Since legal queries are diverse and multi-faceted, these traditional approaches cannot effectively handle low-frequency and out-of-vocabulary (OOV) words. Motivated by these issues, we rethink the task of legal query reformulation as a type of monolingual neural machine translation (NMT) problem, where the input (source) query is potentially erroneous and the output (target) query is its corrected form. We propose a unified and principled framework with multiple levels of granularity. Specifically: (i) an encoder with character attention which augments the subword representation; (ii) a decoder with attentions that enable the representations from different levels of granularity to control the translation cooperatively; and (iii) a semi-supervised methodology to extract and augment a large-scale dataset of NL query pairs combining syntactic and semantic operations. We establish the effectiveness of our methodology using an internal dataset, where the training data is automatically obtained from user query logs. We further demonstrate that training deep neural networks on additional data with synthesized errors can improve translation performance.

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada.
©2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes.
Published at http://ceur-ws.org.

1.    INTRODUCTION
Technology and innovation are transforming the legal profession in manifold ways. The legal industry is undergoing significant disruption, and machine intelligence is arguably advancing most in the area of discovery and search. Technologies are moving from basic keyword searches to predictive coding, which involves algorithms that predict whether a document is relevant or not. In [1], McGinnis et al. envisage two phases of technological change in this area. The first phase, expected to come in the next 10 years, involves perfecting semantic search that will allow lawyers to input NL queries to interfaces and systems responding to those queries directly with relevant information. The second phase involves technology that is able to identify issues, given a set of facts, and then suggest relevant authorities that might apply to those issues.

Query reformulation or correction is a crucial component of any service that requires users to type in NL, such as a search engine. It is an iterative process where a user reformulates (rewrites) queries to improve search results or gain new information. The user achieves this by using either their own prior knowledge or assisted tools. Legal query reformulation is not a trivial task, as legal search queries are often short, complex and lacking in context. Automatic query correction systems are among the most widely used tools within NL applications. Search engines support users in this task in two ways: (a) explicitly, by suggesting related queries or query completions, or (b) implicitly, by expanding the query to improve the quality and recall of organic results. Successful reformulations must closely align with the original query both syntactically, as sequences of characters or words, and semantically, often involving transparent taxonomic relations.

From a probabilistic perspective, given an incorrect query q− that we wish to reformulate to a correct query q+, we seek to model q+ = arg max_q P(q | q−), where q ranges over candidate queries and the ground truth is the originally intended query. A traditional noisy channel model [2] consists of two parts: (i) a language model, i.e., P(q), that represents the prior probability of the intended correct input query; and (ii) an error model, i.e., P(q− | q), that represents the process by which the correct input query gets corrupted into its incorrect form. There are several drawbacks to this approach: (i) we need two separate models, and error in estimating one model affects the performance of the final output; (ii) it is not easy to model the channel, since there are many sources for these mistakes, e.g., typing too fast, unintentional keystrokes, phonetic ambiguity, etc.; and (iii) it is not easy to obtain clean training data for the language model, as the input does not follow what is typical in NL. Since the end goal is to get a query that maximizes P(q | q−), can we directly model this conditional distribution instead?

In this work, we explore this route, which bypasses the need for multiple models and avoids compounding errors from multiple sources. We achieve this by applying the encoder-decoder framework combined with attention learning using recurrent neural networks [3], rethinking the query correction problem as an NMT problem where the incorrect input is treated as a foreign language. We also propose a semi-supervised methodology to perform error detection and construct a large-scale dataset for training and experimentation. To the best of our knowledge, there is no prior research on using attention learning in a legal setting. Our work is also the first to use neural machine translation based methodologies on user-generated legal data for query reformulation. We demonstrate that NMT models can successfully be applied to the task of query reformulation in a legal setting, and validate that the trained models are naturally capable of handling orthographic errors and rare words and can flexibly correct a variety of error types. We further find that augmenting the network training data with queries containing synthesized errors can result in significant performance gains.
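To make the noisy-channel decomposition discussed in the introduction concrete, the sketch below reranks a toy candidate set by P(q− | q) · P(q). The candidate queries, probabilities and the simplistic edit-based channel model are illustrative assumptions, not part of the paper's system.

```python
import math

# Toy noisy-channel reranker: P(q | q_minus) is proportional to P(q_minus | q) * P(q).
# All probabilities below are made-up illustrative numbers.

language_model = {            # P(q): prior over intended queries
    "motion to compel": 0.6,
    "motion to dismiss": 0.4,
}

def error_model(observed, intended):
    # P(observed | intended): toy channel that prefers small character distance.
    dist = sum(a != b for a, b in zip(observed, intended))
    dist += abs(len(observed) - len(intended))
    return math.exp(-dist)

def correct(observed):
    # argmax over candidates q of P(observed | q) * P(q)
    return max(language_model, key=lambda q: error_model(observed, q) * language_model[q])

print(correct("motiom to compell"))  # -> motion to compel
```

The paper's point is that this two-model factorization can instead be replaced by directly learning P(q | q−) with a single sequence-to-sequence model.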
2.    APPLICATION
An "answer card" is a search feature, usually displayed in a box or panel, that appears above organic results and tries to directly answer a question. Most answer boxes are primarily text, containing a relatively short answer, and may provide limited information based on the user's entered query. In a legal context, the source can be a legal question, a profile search, or many other forms. Figure 1 shows a card that answers the legal question "define motion to compel" and Figure 2 shows a profile card for a judge search using the query "who is judge antonin scalia ?".

Figure 1: Answer Card. (Card text: "Motion to compel: A motion to compel asks the court to order either the opposing party or a third party to take some action.")

Vertical engines are search engines that specialize in different types of search. Answer cards are usually backed by vertical search engines that are tightly coupled with AI/NLP systems. These NLP systems may be machine/deep learning models recognizing and classifying entities that trigger and render the respective cards based on the user's query intent [4]. NLP models rely heavily on vocabulary, and any OOV words contained in the query can have an adverse effect on the coverage or functioning of answer cards.

Figure 2: Profile Card. (Card text: "Antonin Scalia, Former Associate Justice, Supreme Court (U.S.). Born: March 11, 1936, Trenton, NJ. Died: February 13, 2016. Appointed by: Ronald Reagan.")

The scope of this research is to develop a framework to detect these erroneous queries and autocorrect them. The framework acts as an extra layer between the answer cards and the NLP recognition systems, providing a coherent mechanism for coping with missing information (vocabulary), which in turn enables us to return more appropriate cards. For example, for the above-mentioned answer card queries, say a user typed in "defin e motiom to compell" or "whi is jufge antonin scakia?" instead; our deployed framework would fix (reformulate) the queries to their correct forms and display the cards successfully.

3.    BACKGROUND AND RELATED WORK
There are two major areas related to our research and the proposed models. First is the work on the task of query reformulation in information retrieval, involving traditional, statistical and neural approaches; second is the process of error detection (selecting candidate queries that are incorrect) for training the reformulation models. We introduce related studies specific to these areas in this section.

3.1    Query Reformulation
3.1.1 Traditional Approaches: In [5], Xu et al. used the top results retrieved by the original query to perform reformulation by expansion. This method is popular and influenced by the initial ranking of results; however, it cannot utilize user-generated data. Other approaches focused on using user query logs to expand a query by means of clickthrough rate [6], co-occurrence in search sessions, or query similarity based on click-through graphs [7]. The advantage of these approaches is that user feedback is readily available in user query logs and can efficiently be precomputed. However, they all face the problem of requiring some resource dependency on a specific search engine.

3.1.2 Statistical Machine Translation: The task of machine correction is defined as correcting an N-character (or word) source query S = s_1, ..., s_N into an M-character (or word) target query T = t_1, ..., t_M. Thus, the correction system can be defined as a function F:

    T_D = F(S)    (1)

which returns a correction hypothesis T_D given an input query S. Recently, several studies have adopted data from user query logs as input to Statistical Machine Translation (SMT) for query reformulation and expansion [8][9]. These methods treat user queries as one language and the reformulated queries as another. However, their major drawback is the difficulty of modeling corrections at different granularities, e.g., characters or words, which is necessary to reduce the rate of unknown words that are detrimental to their proper functioning.

Our approach differs in several ways. First, we consider full query pairs as training data, and do not use single word tokens as a primary mode of operation. Second, we do not train explicit error models P(w | s) for words w and observed corrections s, but use standard word/character based NMT models to derive more meaningful and accurate lexical translation models.

3.1.3 Neural Machine Translation: Recently, the use of neural networks has delivered significant gains for mapping tasks between pairs of sequences, due to their ability to learn a better representation of the data and context. Recurrent Neural Networks (RNNs) are a family of neural networks for processing sequential data and modeling long-distance dependencies, a common phenomenon in human language. NMT models [10] use RNNs to learn to map from source language input to target language output via continuous-space intermediate representations. We use an encoder-decoder pair of uni/bi-directional RNNs with an attention mechanism as the core component of our proposed system [11]. The encoder maps the input query to a higher-level representation with a uni- or bi-directional RNN architecture similar to that of [12]. The decoder also employs an RNN that uses a content-based attention mechanism [13] to attend to the encoded representation and generate the output query one character at a time.
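The content-based attention step described in 3.1.3 can be sketched in a few lines of numpy. The dot-product scoring used here is an illustrative assumption; the paper does not specify which attention variant it employs, and the dimensions are arbitrary toy values.

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """Content-based attention: score each encoder state against the
    current decoder state, softmax the scores over source positions,
    and return the weighted sum (context vector) plus the weights."""
    scores = encoder_states @ decoder_state   # (T,) dot-product scores
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_states        # (H,) context vector
    return context, weights

# Toy example: 4 encoded source characters, hidden size 3.
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 3))   # encoder states, one per source character
dec = rng.normal(size=3)        # current decoder hidden state
ctx, w = attention(dec, enc)    # w sums to 1; ctx has shape (3,)
```

At each decoding step the context vector is combined with the decoder state to predict the next output character.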
3.1.4 Character Level Reasoning: Word-level NMT models are poorly suited to handling OOV words due to their fixed vocabulary [14]. Recent works have proposed workarounds for this problem. Since a word is usually thought of as the basic unit of language communication [15], early NMT systems built their representations starting from the word level [16]. Machine learning/NLP models typically limit vocabulary size due to the complexity of training [17]. Therefore, they are unable to translate rare words, and it is standard practice to replace OOV words with an unknown (UNK) symbol. By doing so, useful information is discarded, resulting in systems that are not able to correct erroneous words. In our framework, we strive to circumvent this issue by using smaller units such as subwords to address the problem of OOV words.

3.2    Error Detection
In most translation systems, before any reformulation is carried out, a detection process is conducted on the input query to extract any potentially incorrect words. In [18], He et al. proposed a learning-to-rewrite framework that focuses on candidate ranking for error detection. A non-word error refers to a potentially incorrect word that does not exist in a given dictionary. Dictionary lookup is one of the basic techniques employed to compare input strings with the entries of a language resource, e.g., a lexicon or corpus [19]. Such a language resource must contain all inflected forms of the words and should be updated regularly. If a given word does not exist in the language resource, it is marked as potentially incorrect, which is a major disadvantage of this approach.

Minimum edit distance is one of the most studied techniques for error detection. It is based on counting edit operations, which are defined in most systems as insertion, deletion, substitution and transposition. Jaro-Winkler [20], Wagner-Fischer [21], and Levenshtein [22] are among the most famous edit distance algorithms. This approach has a main drawback: it penalizes all change operations in the same way, without taking into account the character that is used in the change operation. In contrast to the above-mentioned approaches, our error detection methodology combines both syntactic and semantic attributes of a user query to extract incorrect non-word occurrences. Here, we establish a unified framework (see Figure 3) that encompasses aspects like frequency distribution, query rank and subword representations for error detection, alongside NMT and attention learning.

4.    PROPOSED FRAMEWORK
4.1    Query Preparation
For our experiments, we collected a total of ∼62M distinct queries grouped by frequency, collected over a period of 5 years. The queries under study included a considerable proportion of boolean queries, ∼32M of them. We used a heuristic approach to filter search queries. Using a combination of regular expressions and strict constraints, we excluded any query with boolean or special-character connectors; this left us with ∼29M NL queries.

We used the punkt sentence tokenizer [23] to tokenize the queries into words. The tokens were further filtered to remove noise and short forms. Any token that is a digit and of length ⩾ 5 was excluded from our study. The created vocabulary contained ∼1M words (tokens). Table 1 shows the top 10 ranked queries (left) and tokens (right) at the end of the query preparation process.

Table 1: Top 10 Queries and Tokens

    Rank | Query                   | Rank | Token
       1 | breach of contract      |    1 | motion
       2 | unjust enrichment       |    2 | contract
       3 | summary judgement       |    3 | court
       4 | res judicata            |    4 | judgment
       5 | fraud                   |    5 | insurance
       6 | motion to dismiss       |    6 | evidence
       7 | conversion              |    7 | property
       8 | statute of limitations  |    8 | attorney
       9 | defamation              |    9 | liability
      10 | civil procedure         |   10 | statute

4.2    Query Representation
The first step in creating a model is to choose a representation for the input and output. For instance, a query can be represented at the word level, transferring the data as indices that refer to words, or at the character level, transferring the data as indices that refer to a closed vocabulary of language-specific characters and symbols. In this work, we anticipate the need to handle an exponentially large input space of tokens by choosing a character-level query representation. Thus, model inputs are individual characters, i.e., a sequence of indices corresponding to characters.

Figure 3: Proposed Framework. (Pipeline: legal corpora, legal names and user query logs feed error detection and data augmentation; together with fastText embeddings, these produce the training pairs for the NMT model, which reformulates user input.)

We have defined special start-of-query and end-of-query marker characters for each query; these markers are later used as delimiters in the training process. Note that in the character-level representation, we have also considered spaces, certain diacritics and punctuation as unique characters, i.e., 40 unique characters in all. Once a query is mapped into indices, the characters are embedded, i.e., each character is represented by a one-hot vector and multiplied by a trainable matrix of size input vocabulary × embedding size. The embedding allows the model to group together characters that are similar for the task.

4.3    Candidate Selection
Our scoring heuristic calculates a sequence of feature scores for each token from the prepared vocabulary. The key motivation is to separate these tokens into +ve (correct) and −ve (incorrect) buckets. We achieve this by combining three distributional attributes.
First, since our vocabulary is derived from user log queries, we                     is computed as follows:
can associate the individual tokens to the rank of its constituent                          
queries. Second, we can also affiliate a token to its individual fre-                        1, if (x ∈ ▽H ) ∨ (x ∈ ▽ J ) ∨ (x ∈ ▽E ) ∨ (x ∈ ▽C )
                                                                                            
                                                                                        ϕ=  1                                                    (5)
quency within the vocabulary. And third, by the membership of                                , otherwise
                                                                                            2
the tokens to selected legal lexicons. Our selected lexicons com-
prise of extracted vocabulary from headnotes (∼ 217K) derived                        4.3.5 Relevance Score: Finally, a relevance score (Ω) is com-
from US legal caselaw documents, legal names consisting of judge                     puted by combining all the scores derived from equations 2, 3, 4
(∼ 67K) and expert witness (∼ 388K) names derived legal master                       and 5 as follows:
                                                                                                       (      1 ) (             ) (       1 )
databases. We also use a secondary lexicon of common US english                             Ω= ϕ· 1−                · 1 − e −χ · 1 −               (6)
names (∼ 234K).                                                                                           e 2ξ + 1                      1+ζ
    Let Q be an ordered set of distinct legal queries derived from                   The relevance score expresses the relative importance of a token
user logs and ranked by frequency. Q ⇒ {q 1 , q 2 , . . . , qr } where r             and can be used to classify a word into +ve or −ve bucket. We es-
is the rank of the query; r ∈ R. Let R denote an ordered set of ranks;               tablish this separation by using a cut-off threshold (∼0.83) obtained
R ⇒ {1, 2, . . . , L}, where L is the length of Q i.e., total number of              via trial and error experiments.
distinct queries. Let ▽ ⇒ {x 1 , x 2 , . . . , xl } denote the vocabulary of
distinct tokens derived from Q where l is length of the individual                   4.4    Ranking Strategy using FastText
query. For any given token x in vocabulary ▽, it can belong to a                     Traditional approaches to representing rare and OOV words follow
subset of queries Q̂; Q̂ ⊆ Q. This implies that the token can mapped                 assigning a random vector [24] or binning all rare words into a new
to a closed set of ranks R̂; R̂ ⊆ R. E.g., the token discriminatory can              “UNK” word type. Another popular technique is to encode word
belong to multiple queries within Q, hence R̂ could be a closed set                  definitions with an external resource (Long et al., 2016) and train
e.g., {6, 37, 763, 23445}.                                                           at the morpheme level [25]. Creating semantic representations for
                                                                                     OOV words is a difficult NLP task. And word-based (Mikolov et
4.3.1 Mean Rank Score: Given R̂ ⊆ R, for a token x ∈ Q, R̂ can
                                                                                     al., 2013; aka word2vec)[26] approaches do not contribute well to
be defined as {r 1 , r 2 , . . . , rl }. We compute the mean rank score (ξ )
                                                                                     resolve this problem.
as follows:
                                    ∑
                                    l            ri                                                Token       Score         Token     Score
                                          1−
                                   i =1          L                       (2)                      roberds     0.9967        cbreach    0.9839
                             ξ =
                                          |R̂|                                                     robers     0.9962         fbreach   0.9771
                                                                                                  robergs     0.9913       onbreach 0.9684
where |R̂| is size of R̂.                                                                         robertus 0.9851           breachh 0.9683
4.3.2 Neighbor Rank Score: Given a token x 0 , it can co-occur                                   robertson 0.9797          bebreach 0.9659
alongside other tokens from the vocabulary contained to the queries                                robertt    0.9777         breachj   0.9657
under investigation. Let ▽  ˆ ⇒ {x 1 , x 2 , . . . , xl } denote the subset of                     rowert     0.9766        breachn 0.9643
tokens co-occurring alongside x. For every candidate token inside                                 rochfort 0.9758           ofbreach 0.9641
▽
ˆ , we have a corresponding mean rank score ξ derived from equa-                                   rogerts    0.9756       mcbreach 0.9627
tion 2. We restrict the window of co-occurrence to 2 neighbors, one                                robtert    0.9731       orbreach 0.9612
before and after the token. Let ξˆ be the set of corresponding mean                            Table 2: Top-10 Similar Words by fastText
rank scores; ξˆ ⇒ {x 1 , x 2 , . . . , xl } The neighbor rank score (χ ) is             Character-based embedding approaches are more suited to han-
computed as:                                                                         dle the OOV problem (Bojanowski et al., 2017; aka fastText)[27].
                                                                                     For mapping the +ve tokens to their corresponding variants (in-
                                      ∑
                                      l
                                        ξi                                           correct −ve counterparts), we used fastText [28]. fastText is a
                               χ=                                        (3)         character-based embedding approach that it is well-suited to pro-
                                                 ˆ
4.3.3 Frequency Score: Let ϵ denote the frequency of a token x ∈ ▽. The frequency score (ζ) is a normalized value representing the importance of a token's frequency of occurrence in the corpus. It is defined as follows:

                              ζ = ϵ / |▽|                          (4)

4.3.4 Lexicon Membership Score: Let ▽H denote the vocabulary of tokens derived from headnotes, let ▽J denote the vocabulary of tokens derived from the judge master (judge name database), let ▽E denote the tokens from the expert witness database, and let ▽C denote all the tokens from the common valid US English names vocabulary. For any given token x, the lexicon membership score (ϕ)

duce embeddings for OOV words. It is able to do this by learning vectors for the character n-grams within the word and summing those vectors to produce the final vector, or embedding, for the word itself. For training, we set the embedding size to 100 and the training window to 10. We also set the alpha value to 0.025, and the min and max values of n to 2 and 6, respectively. The embedding matrix was created using a batch size of 10K words over 100 epochs. We trained fastText embeddings on the 29M NL queries using a p3.2xlarge EC2 instance with 1 Tesla V100 GPU. Table 2 shows the top 10 most similar words for the OOV +ve word “roberts” (left) and the OOV −ve word “breach” (right).

4.5 Query Augmentation
Legal names, such as judge, expert witness and attorney names, are often misspelled by users during search.
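Such typos can be modeled by weighting edit costs with physical key distance. Below is a toy sketch of this idea; the linear scaling formula is our own assumption, calibrated only to the two example weights (A to S = 0.4, A to D = 0.6) discussed later in this section, and is not the paper's exact scheme.

```python
# Toy sketch of keyboard-aware substitution costs (our reconstruction).
# Keys are projected onto a Cartesian grid and the mistake weight grows
# linearly with distance; the 0.2 + 0.2 * d scaling is an assumption
# calibrated to the A->S (0.4) and A->D (0.6) examples.

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# x = column index within a row, y = row index
COORDS = {ch: (x, y) for y, row in enumerate(ROWS) for x, ch in enumerate(row)}

def substitution_weight(a: str, b: str) -> float:
    """Mistake weight for confusing key `a` with key `b` on a QWERTY layout."""
    ax, ay = COORDS[a]
    bx, by = COORDS[b]
    dist = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5  # Euclidean key distance
    return round(0.2 + 0.2 * dist, 2)
```

Under this weighting, adjacent keys receive a low substitution cost while distant keys approach a plain unit cost, which is what would let a Needleman-Wunsch-style aligner prefer plausible typos over arbitrary ones.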
Since we were not able to capture a considerable amount of these errors from query logs (real-world data), we decided to synthetically augment these types of errors and generate queries for legal names algorithmically. For this purpose, we took all the names (+ve instances) from our legal corpora and applied a variant of the Needleman-Wunsch algorithm to create their −ve counterparts.
   The distance computation is identical to Levenshtein distance, except that character mistakes are given different weights depending on how far apart two characters are on a standard keyboard layout (QWERTY). The weights are assigned by projecting the keyboard characters onto a Cartesian system. For example, A to S is given a mistake weight of 0.4, while A to D is given 0.6. We generate query pairs using four types of transformation: addition, deletion, replacement and transposition. Table 3 shows a few of the generated examples for augmentation.

      Misspelled           Corrected           Operation
      mchael lowy          michael lowy        addition
      david yamhamoto      david yamamoto      deletion
      christibe ballard    christine ballard   replacement
      warren glciker       warren glicker      transposition
                    Table 3: Augmented Queries

4.6 Training Pairs
In previous sections, we discussed the need to segregate the vocabulary into +ve and −ve tokens and the use of fastText character n-grams, or subwords, to capture word mappings (+ve to −ve). At the end of the candidate selection process, the entire vocabulary is broken into two groups of ∼180K +ve and ∼830K −ve tokens. At the end of ranking using fastText, we end up with ∼2.4M legal term mappings and ∼50K legal name mappings. The legal name mappings are expanded via the augmentation process elaborated in the previous section to ∼1.5M pairs. The legal term and name mappings are then matched against the indexed 60M user queries to create query pairs that will be used for training our translation models.

4.7 Neural Machine Translation
Most NMT systems follow the encoder-decoder framework with the attention mechanism proposed by Bahdanau et al. [29]. Given a source query q⁻ = q₁⁻ · · · qᵢ⁻ · · · q_I⁻ and a target query q⁺ = q₁⁺ · · · qⱼ⁺ · · · q_J⁺, we aim to directly model the translation probability as:

               P(q⁺ | q⁻; Θ) = ∏ⱼ₌₁ᴶ P(qⱼ⁺ | q₁⁺ · · · qⱼ₋₁⁺, q⁻; Θ)

and uses the hidden state for the next input character. Figure 4(a) portrays a simple overview of the encoder.

[Figure 4: (a) the encoder maps an input character and the previous hidden state through an embedding layer and a GRU to an output and a new hidden state; (b) the decoder additionally applies a ReLU after the embedding and a softmax over the GRU output.]

                    Figure 4: Encoder and Decoder

   Following [29], we use a bi-directional RNN with Gated Recurrent Units (GRUs) to encode the source query:

               →hᵢ = GRU(→hᵢ₋₁, sᵢ; →Θ)
               ←hᵢ = GRU(←hᵢ₋₁, sᵢ; ←Θ)                            (8)

where sᵢ is the i-th source embedding, GRU is a gated recurrent unit, and →Θ and ←Θ are the parameters of the forward and backward GRU, respectively. The annotation of each source character qᵢ⁻ is obtained by concatenating the forward and backward hidden states:

               hᵢ = [→hᵢ ; ←hᵢ]                                    (9)

4.7.2 Decoder: The decoder is also built using an RNN. It takes the same representation of the input context, along with the previous hidden state and output character prediction, and generates a new hidden decoder state and a new output character prediction. The decoder is a forward (uni-directional) RNN with GRUs predicting the translation y character by character. This prediction takes the form of a probability distribution over the entire output vocabulary (100 characters). At every step of decoding, the decoder is given an input character and hidden state. The initial input character is the start-of-query character, and the first hidden state
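To make Eqs. (8) and (9) concrete, the following is a minimal pure-Python sketch of the bi-directional encoding. It is an illustration only, not the authors' implementation: a real GRU uses full weight matrices per gate, whereas this sketch collapses each gate to a single scalar weight, and all dimensions are arbitrary.

```python
import math

# Toy sketch of a bi-directional GRU encoder: a forward and a backward GRU
# (with separate parameters) run over the source embeddings, and each
# annotation h_i concatenates the two hidden states, as in Eqs. (8)-(9).

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class GRUCell:
    """Minimal GRU cell on plain Python lists (one scalar weight per gate)."""

    def __init__(self, hidden_size, w=0.1):
        self.hidden_size = hidden_size
        self.wz = self.wr = self.wh = w  # real GRUs use full matrices here

    def step(self, h, x):
        new_h = []
        for j in range(self.hidden_size):
            s = sum(x) + h[j]                       # crude [h, x] mixing
            z = sigmoid(self.wz * s)                # update gate
            r = sigmoid(self.wr * s)                # reset gate
            h_tilde = math.tanh(self.wh * (r * h[j] + sum(x)))
            new_h.append((1 - z) * h[j] + z * h_tilde)
        return new_h

def encode(embeddings, hidden_size=4):
    fwd, bwd = GRUCell(hidden_size), GRUCell(hidden_size)
    h = [0.0] * hidden_size
    forward_states = []
    for s_i in embeddings:                          # forward pass of Eq. (8)
        h = fwd.step(h, s_i)
        forward_states.append(h)
    h = [0.0] * hidden_size
    backward_states = []
    for s_i in reversed(embeddings):                # backward pass of Eq. (8)
        h = bwd.step(h, s_i)
        backward_states.append(h)
    backward_states.reverse()
    # Eq. (9): annotation h_i = [forward h_i ; backward h_i]
    return [f + b for f, b in zip(forward_states, backward_states)]
```

For a source of three embedded characters and hidden size 4, `encode` returns three annotations of length 8 each, one per source position, which is exactly the shape an attention mechanism would consume.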