Learn2Sign: Explainable AI for Sign Language Learning

Prajwal Paudyal, Junghyo Lee, Azamat Kamzin, Mohamad Soudki, Ayan Banerjee, Sandeep K.S. Gupta
Arizona State University, Tempe, Arizona, United States
(ppaudyal,jlee375,akamzin,msoudki,abanerj3,sandeep.gupta)@asu.edu

ABSTRACT
Languages are best learned in immersive environments with rich feedback. This is especially true for signed languages due to their visual and poly-componential nature. Computer Aided Language Learning (CALL) solutions successfully incorporate feedback for spoken languages, but no such solution exists for signed languages. Current Sign Language Recognition (SLR) systems are not interpretable and hence not applicable for providing feedback to learners. In this work, we propose a modular and explainable machine learning system that is able to provide fine-grained feedback on location, movement and hand-shape to learners of ASL. In addition, we propose a waterfall architecture for combining the sub-modules to prevent cognitive overload for learners and to reduce computation time for feedback. The system has an overall test accuracy of 87.9% on real-world data consisting of 25 signs with 3 repetitions each from 100 learners.

CCS CONCEPTS
• Human-centered computing → Interaction design; • Computing methodologies → Artificial intelligence; • Applied computing → Interactive learning environments.

KEYWORDS
Explainable AI; Sign Language Learning; Computer-aided learning

ACM Reference Format:
Prajwal Paudyal, Junghyo Lee, Azamat Kamzin, Mohamad Soudki, Ayan Banerjee, Sandeep K.S. Gupta. 2019. Learn2Sign: Explainable AI for Sign Language Learning. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 7 pages.

IUI Workshops'19, March 20, 2019, Los Angeles, USA
© 2019 Copyright for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION
Signed languages are natural mediums of communication for the estimated 466 million deaf or hard-of-hearing people worldwide [16]. Families and friends of the deaf can also benefit from being able to sign. The Modern Language Association [2] reports that enrollment in American Sign Language (ASL) courses in the U.S. has increased nearly 6,000 percent since 1990, which shows that interest in acquiring sign languages is increasing. However, the lack of resources for self-paced learning makes them difficult to acquire, especially outside of the traditional classroom setting [26].

The ideal environment for language learning is immersion with rich feedback [27], and this is especially true for sign languages [9]. Extended studies have shown that providing item-based feedback in CALL systems is very important [35]. Towards this goal, language learning software for spoken languages such as Rosetta Stone or Duolingo supports some form of assessment and automatic feedback [31]. Although there are numerous instructive books [23], video tutorials and smartphone applications for learning popular sign languages, there has not yet been any work towards providing automatic feedback, as seen in Table 1. We conducted a survey [13] of 52 first-time ASL users (29M, 21F) in 2018, and 96.2% said that reasonable feedback is important but lacking in solutions for sign language learning (Table 2).

Table 1: Some ASL learning applications for smartphones.

Application          Can Increase Vocab   Feedback
ASL Coach            No                   None
The ASL App          No                   None
ASL Fingerspelling   No                   None
Marlee Signs         Yes                  None
SL for Beginners     No                   None
WeSign               No                   None

Studies show that elaborated feedback, such as providing meaningful explanations and examples, produces a larger effect on learning outcomes than feedback regarding correctness alone [35]. The simplest feedback that can be given to a learner is whether their execution of a particular sign was correct. State-of-the-art SLR and activity recognition systems can be easily trained to accomplish this. However, to truly help a learner identify mistakes and learn from them, the feedback and explanations generated must be more fine-grained.

The various ways in which a signer can make mistakes during the execution of a sign can be directly linked to how minimal pairs are formed in the phonetics of that language. The work of Stokoe postulates that the manual portion of an ASL sign is composed of 1) location, 2) movement and 3) hand-shape and orientation [30]. A black-box recognition system cannot provide this level of feedback; thus there is a need for an explainable AI system, because feedback from the system is analogous to explanations for its final decision. Non-manual markers such as facial expressions and body gaits also change the meaning of signs to some extent, but they are less important for beginner-level language acquisition, so they are left for future work.

Studies have also shown that the effect of feedback is highest if provided immediately [35]; thus feedback systems should be real-time.
The requirement for immediate feedback also restricts the usage of complicated learning algorithms that require heavy computing [6] and extensive training. The usability and usefulness of applications is enhanced if learning is self-paced, learners are allowed to use their own devices, and the learning vocabulary can be easily extended. However, current solutions for SLR require large initial datasets and retraining to support unseen words. To solve these challenges, we designed Learn2Sign (L2S), a smartphone application that utilizes explainable AI to provide fine-grained feedback on location, movement, orientation and hand-shape for ASL learners. L2S is built using a waterfall combination of three non-parametric models, as seen in Figure 1, to ensure extendibility to new vocabulary. Learners can use L2S with any smartphone or computer with a front-facing camera. L2S utilizes the bone localization technique proposed by [17] for movement- and location-based feedback and a light-weight pre-trained Convolutional Neural Network (CNN) as a feature extractor for hand-shape feedback.

Figure 1: System Model for Feedback.

The methodology and evaluations are provided in Sections 3 and 4. As part of the work, we collected video data from 100 users executing 25 ASL signs three times each. The videos were recorded by L2S users in real-world settings without restrictions on device type, lighting conditions, distance to the camera or recording pose (sitting or standing up). This was to ensure generalization to real-world conditions; however, it also makes the dataset more challenging. More details about the resulting dataset of about 7,500 instances can be found in [13].

Table 2: Survey Results from 52 Users of the Application.

Category                 Response
Importance of Feedback   Yes: 96.2%   No: 3.8%
Movement Feedback        Correctness: 9.6%   Colored Bones: 1.9%   Sentence: 15.4%   Correctness+Sentence: 9.6%   All: 65.3%
Handshape Feedback       Circle around handshape: 63.5%   Actual handshape: 36.5%
Self-Assessment          Not helpful: 1.9%   Somewhat: 5.8%   Very helpful: 93.3%
Expandability            Not helpful: 5.8%   Somewhat: 7.7%   Very helpful: 86.5%

2 RELATED WORK
There have been many works on providing meaningful feedback for spoken language learners [8, 21, 22]. On the practical side, Rosetta Stone provides both waveform and spectrograph feedback for pronunciation mistakes by comparing the acoustic waves of a learner to those of a native speaker [31]. There has also been some recent work on design principles for using Automatic Speech Recognition (ASR) techniques to provide feedback for language learners [36]. Sign Language Recognition (SLR) is a research field that closely mirrors ASR and can potentially be utilized by systems for sign language learning. However, to the best of our knowledge, no such system exists. This can be explained by the inherent difficulties in SLR as well as the lack of detailed studies on design principles for such systems. In this work, we propose some design principles and an explainable smart system to meet this goal.
Continuously translating a video recording of a signed language to a spoken language is a very challenging problem and has been tackled recently by various researchers with some success [6]. For the purposes of this application, such complex measures are not desirable, as they mandate extensive datasets for training and large models for translation, which decreases their usability. Isolated Sign Language Recognition has the goal of classifying sign tokens into classes that represent spoken language words [11, 12, 18, 29]. Some researchers have utilized videos [14] while others have attempted to use wearable sensors [18, 19], with varying performance. In this work, we utilize the insights and advances from such systems to help a new learner acquire sign language words. To our knowledge, this work is the first attempt at such a practical and much-needed application.

For this work, we require an estimation of human pose, specifically estimates of the locations of various joints throughout a video, known as keypoints. There have been several works towards this goal [3–5, 25, 32, 33]. Some of these works first detect the keypoints in 2D and then attempt to 'lift' that set to 3D space, while others return the 2D coordinates of the various keypoints relative to the image. In order to fulfill the requirement to use pervasive cameras, we did not focus on approaches that utilize depth information such as the Microsoft Kinect [20]. Thus, we utilized the pose estimates from a TensorFlow JS implementation of the model proposed by Papandreou et al. [17], which can run on devices with or without GPUs (Graphical Processing Units).

3 METHODOLOGY
Stokoe proposed that a sign in ASL consists of three parts which combine simultaneously: the tab (location of the sign), the dez (hand-shape) and the sig (movement) [30]. Signs like 'HEADACHE' and 'STOMACH ACHE' that are similar in hand-shape and movement may differ only by the signing location. Similarly, there are other minimal pairs of signs that differ only by the movement or hand-shape. Following this understanding, L2S is composed of three corresponding recognition and feedback modules.

3.1 User Interface
For initial data collection and for testing the UI, we developed an Android application called L2S. We preloaded the application with 25 tutorial videos from Signing Savvy corresponding to 25 ASL signs [24]. The application has three main components: a) Learning Module, b) Practice Module, and c) Extension Module.

3.1.1 Learning Module. The learning module of the L2S application is where all the tutorial videos are accessible. A learner selects an ASL word/phrase to learn and can then view the tutorial videos. The learner can pause, play, and repeat the tutorials as many times as needed. In this module, the learner can also record executions of their signs for self-assessment.

3.1.2 Practice Module. The practice module is designed to give automatic feedback to the learners. A learner selects a sign to practice and sets up their device to record their execution. After this, L2S determines whether the learner performed the sign correctly. The result is correct if the sign meets the thresholds for movement, location, and hand-shape, and a 'correct' feedback is given. If the system determines that the learner did not execute the sign correctly, an appropriate feedback is provided as seen in Figure 1. Details about the recognition and feedback mechanisms are discussed in Section 3.4.

3.1.3 Extension Module. To extend the supported vocabulary of L2S, a learner can upload one or more tutorial videos from a source of their choosing. The application processes them for usability before they appear in the Learning Module as new tutorial sign(s).

3.2 Data Collection
We collected signing videos from 100 learners for 25 ASL signs with three repetitions each in real-world settings using the L2S app. Learners used their own devices, with no restrictions on lighting conditions, distance to the camera or recording pose (sitting or standing up). After reviewing a tutorial video, a learner was given a 5 s setup time before recording a 3 s video using a front-facing camera. Both the tutorial and the newly recorded video were then displayed on the same screen for the user to accept or reject. This self-assessment served not only as a review but also helped prune incorrect data due to device or timing errors, as suggested by the new-learner survey in Table 2.

3.3 Preprocessing
Determining joint locations: Since different devices record at different resolutions, all videos for learning, practice or extension are first converted to a 320×240 resolution. Then, the PoseNet JavaScript API for single-pose estimation [17] is used to compute the estimated locations and confidence levels for the various keypoints, as seen in Table 3. Figure 2 shows the estimated eye, shoulder and wrist locations for the signs TIGER and DECIDE for all the frames in one video.

Table 3: 4 out of 17 'keypoints' for one frame in a video.

Part             Score     x          y
Left Shoulder    0.8325    180.0198   196.2646
Right Shoulder   0.78601   138.7879   195.5847
Left Wrist       0.6844    198.9818   223.4856
Right Wrist      0.1564    1.9084     211.0473

Figure 2: Automatic bucketing for Location Identification for varying distances from camera. Left Wrist: Yellow, Right Wrist: Red, Eyes: White, Shoulders: Green. (a) TIGER mostly in bucket 1 for the left hand. (b) DECIDE in buckets 3 and 6 for the right hand.
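To make the keypoint representation concrete, the sketch below (not the authors' code) turns per-frame keypoint records shaped like the rows of Table 3 into per-joint coordinate tracks; the field names and the confidence cut-off are assumptions for illustration.

```python
import numpy as np

def keypoints_to_tracks(frames, parts=("leftWrist", "rightWrist"), min_score=0.5):
    """frames: list of per-frame keypoint lists -> {part: (n_frames, 2) array}.

    Keypoints below min_score (e.g. the low-confidence right wrist in Table 3)
    reuse the previous confident estimate so that noisy detections do not
    inject jumps into the trajectory.
    """
    tracks = {p: [] for p in parts}
    for frame in frames:
        by_part = {kp["part"]: kp for kp in frame}
        for p in parts:
            kp = by_part.get(p)
            if kp is not None and kp["score"] >= min_score:
                tracks[p].append([kp["x"], kp["y"]])
            elif tracks[p]:                       # carry over the last estimate
                tracks[p].append(tracks[p][-1])
            else:                                 # no estimate yet
                tracks[p].append([np.nan, np.nan])
    return {p: np.asarray(v, dtype=float) for p, v in tracks.items()}

# One frame built from the values in Table 3:
frame0 = [{"part": "leftShoulder", "score": 0.8325, "x": 180.0198, "y": 196.2646},
          {"part": "rightShoulder", "score": 0.78601, "x": 138.7879, "y": 195.5847},
          {"part": "leftWrist", "score": 0.6844, "x": 198.9818, "y": 223.4856},
          {"part": "rightWrist", "score": 0.1564, "x": 1.9084, "y": 211.0473}]
print(keypoints_to_tracks([frame0]))
```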
Normalization: There is a difference in the scale of bodies relative to the frame size corresponding to the distance between the learner and the camera. This scaling factor can negatively impact recognition since the relative location, movement and hand-shape will vary with distance. We perform min-max normalization and zeroing based on the distance between the average estimated locations of the right and left shoulders throughout the video frames, as suggested by [15]. Normalization was found to be especially important for correct movement recognition.
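A minimal sketch of such shoulder-based normalization is given below; the exact reference point and scaling used in L2S may differ, so treat this as an illustration of the idea rather than the authors' implementation.

```python
import numpy as np

def normalize_keypoints(xy, left_shoulder, right_shoulder):
    """xy: (n_frames, n_joints, 2) raw pixel coordinates.
    left_shoulder, right_shoulder: (n_frames, 2) estimated shoulder positions.

    Coordinates are zeroed at the average mid-shoulder point and scaled by the
    average shoulder distance, so trajectories become comparable across
    learners recorded at different distances from the camera."""
    mid = (left_shoulder + right_shoulder).mean(axis=0) / 2.0
    scale = np.linalg.norm(left_shoulder - right_shoulder, axis=1).mean()
    return (xy - mid) / scale
```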
3.4 Recognition and Feedback
L2S is designed to give incremental feedback to learners for the various modalities in sign language: a) Location, b) Movement and c) Hand-shape. The various models are arranged in a waterfall architecture as seen in Figure 1. If the location of signing was not correct, then immediate feedback is provided and the learner is prompted to try again. Similarly, if the movement of the elbows or the wrists for either hand was incorrect, the learner is prompted to try again. Finally, if the shape and orientation of either of the hands does not appear to be correct, a hand-shape based feedback is provided. Consequently, the learner can move on to practice a new sign only if all these modalities were sufficiently correct. A waterfall architecture was chosen in the final application over a linear weighted combination to make learning progressive and to decrease the cognitive load on the learner due to the potential of mistakes in multiple modalities. This architecture also helps to reduce the time taken for recognition and feedback since the models are stacked in increasing order of execution time. Each of the feedback screens shown to the user also has a link to the tutorial video. Users can also manually tune the amount of feedback by altering the value of 'feedback sensitivity' in the application settings. Increasing this value alters the thresholds for each of the sub-modules so that the overall rate of feedback is increased. This involves a trade-off in performance, which is summarized in Figure 3.

Figure 3: Feedback Sensitivity vs. Performance.
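The sketch below illustrates the waterfall flow and how a sensitivity setting could tighten the per-module thresholds; the score names, threshold handling and sensitivity scaling are illustrative assumptions, with Figure 1 defining the actual flow in L2S.

```python
def waterfall_feedback(location_sim, movement_dist, handshape_sim,
                       thresholds, sensitivity=1.0):
    """Return the first feedback type triggered, or 'correct'.

    Raising `sensitivity` tightens every threshold, so more attempts receive
    feedback (the trade-off summarized in Figure 3)."""
    # Location: cosine similarity must exceed its (tightened) threshold.
    if location_sim < thresholds["location"] * sensitivity:
        return "location"            # cheapest check, evaluated first
    # Movement: DTW distance must stay below its (tightened) threshold.
    if movement_dist > thresholds["movement"] / sensitivity:
        return "movement"
    # Hand-shape: cosine similarity on CNN features, the costliest check.
    if handshape_sim < thresholds["handshape"] * sensitivity:
        return "handshape"
    return "correct"

# Example with made-up per-sign thresholds learned from training videos:
print(waterfall_feedback(0.91, 3.2, 0.75,
                         {"location": 0.8, "movement": 4.0, "handshape": 0.7}))
```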
3.5 Location
To correctly and efficiently determine the location of signing, we first assume that the shoulders stay fairly stationary throughout the execution of a sign. This is a fair assumption for ASL since there are no minimal pairs exclusively associated with a signer's shoulders. We then divide the video canvas into 6 different sub-sections called buckets, as seen in Figure 2. Then, as the learner executes any given sign, the location of both wrist joints is tracked for each bucket, resulting in a vector of length 6.

This same procedure is followed for the tutorials, and a cosine-based comparison is done between the two vectors. A heuristic threshold that is determined during training is utilized as a cut-off point. If the resulting cosine similarity is lower than the threshold, feedback is shown to the learner as seen in Figure 4. For each hand, the user's own video is replayed in Graphics Interchange Format (GIF) with a red highlight on the location section that was incorrect and a green highlight on the section of the frame where the sign should have been executed. A text feedback with details and a link to the tutorial is also provided, and the learner is prompted to try again.
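A minimal sketch of this bucket-occupancy comparison for one wrist follows; the 2×3 grid split and the use of frame-occupancy counts are assumptions consistent with the six buckets in Figure 2, not the exact L2S implementation.

```python
import numpy as np

def bucket_histogram(wrist_xy, width, height, rows=2, cols=3):
    """wrist_xy: (n_frames, 2) wrist positions in pixels -> length rows*cols
    vector counting how many frames the wrist spent in each bucket."""
    hist = np.zeros(rows * cols)
    for x, y in wrist_xy:
        c = min(int(x / width * cols), cols - 1)
        r = min(int(y / height * rows), rows - 1)
        hist[r * cols + c] += 1
    return hist

def location_is_correct(learner_xy, tutorial_xy, size=(320, 240), threshold=0.8):
    a = bucket_histogram(learner_xy, *size)
    b = bucket_histogram(tutorial_xy, *size)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return cos >= threshold        # below threshold -> location feedback
```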
a linear weighted combination to make learning progressive and          3.7    Hand Shape and Orientation
to decrease the cognitive load on the learner due to the potential      ASL signs which are otherwise similar, may differ only by the
of mistakes in multiple modalities. This architecture also helps        shape or orientation of the hands. Since, CNNs have state-of-the-
to reduce the time taken for recognition and feedback since the         art image recognition results, we utilized Inception v3 or Mobilenet
models are stacked in an increasing order of execution time. Each       CNN depending on the device being used. A model that was pre-
of the feedback screens shown to the user also has a link to the        trained on ImageNet is retrained using hand-shape images from the
tutorial video. Users can also manually tune the amount of feedback     training users. The wrist location obtained during pre-processing
by altering the value of ‘feedback sensitivity’ in the application      was used as a guide to auto-crop these hand-shape images. During
settings. Increasing this value alters the thresholds for each of the   recognition time, hand-shape images from each hand are extracted
sub-modules so that the overall rate of feedback is increased. This     automatically in a similar way from a learner’s recording. Then 6
involves a trade-off in performance which is summarized in Figure 3.    images for each hand are passed separately through the CNN and
                                                                        the softmax layer is obtained and are concatenated together as seen
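The sketch below compares one normalized joint trajectory against the tutorial. Plain DTW is used as a stand-in for the segmental (unbounded) DTW of [1] that L2S uses, and the down-sampling is a naive frame-skipping variant; treat both as illustrative assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """a: (n, d), b: (m, d) trajectories; returns the DTW alignment cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def movement_is_correct(learner_traj, tutorial_traj, threshold):
    learner_traj = np.asarray(learner_traj, dtype=float)
    tutorial_traj = np.asarray(tutorial_traj, dtype=float)
    # Down-sample the longer recording so that frame-count differences do not
    # inflate the distance, as discussed above.
    if len(learner_traj) > len(tutorial_traj):
        idx = np.linspace(0, len(learner_traj) - 1, len(tutorial_traj)).astype(int)
        learner_traj = learner_traj[idx]
    elif len(tutorial_traj) > len(learner_traj):
        idx = np.linspace(0, len(tutorial_traj) - 1, len(learner_traj)).astype(int)
        tutorial_traj = tutorial_traj[idx]
    return dtw_distance(learner_traj, tutorial_traj) <= threshold
```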
3.7 Hand Shape and Orientation
ASL signs which are otherwise similar may differ only by the shape or orientation of the hands. Since CNNs have state-of-the-art image recognition results, we utilized an Inception v3 or MobileNet CNN depending on the device being used. A model that was pre-trained on ImageNet is retrained using hand-shape images from the training users. The wrist locations obtained during preprocessing were used as a guide to auto-crop these hand-shape images. At recognition time, hand-shape images for each hand are extracted automatically in a similar way from a learner's recording. Then 6 images for each hand are passed separately through the CNN, and the resulting softmax outputs are concatenated together as seen in Figure 1. Similar processing is done on the tutorial video to obtain a vector of the same length. Then a cosine similarity is calculated between the resulting vectors. If the similarity between a learner's sign and that of a tutorial is above a set threshold for a sign, then the execution is determined to be correct; otherwise the hand-shape based feedback, as seen in Figure 5, is provided.

Although the retrained CNN could theoretically be used as a classifier, we use it only as a feature extractor for cosine similarity to ensure that the system can extend to unseen classes. A new tutorial can then be effectively added to the system without the need for retraining. An analysis of the effectiveness of the hand-shape and orientation recognizer is provided in Section 4. Similar to location and movement, feedback for hand shape and orientation is also provided in the form of a replay GIF and text. A zoomed-in image of the incorrect hand shape is shown side by side with the correct image from a tutorial as seen in Figure 5(a).
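The sketch below illustrates this comparison: a pre-trained CNN turns each cropped hand image into a vector, the six per-hand vectors are concatenated, and cosine similarity against the tutorial decides whether feedback is shown. Using MobileNetV2's ImageNet softmax directly, without the retraining step described above, is a simplifying assumption for illustration.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights="imagenet")  # softmax outputs

def handshape_vector(hand_crops):
    """hand_crops: list of 6 auto-cropped (224, 224, 3) uint8 images for one hand."""
    x = tf.keras.applications.mobilenet_v2.preprocess_input(
        np.stack(hand_crops).astype("float32"))
    feats = model.predict(x, verbose=0)          # (6, 1000) softmax vectors
    return feats.reshape(-1)                     # concatenate into one vector

def handshape_is_correct(learner_crops, tutorial_crops, threshold=0.7):
    a, b = handshape_vector(learner_crops), handshape_vector(tutorial_crops)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return cos >= threshold        # below threshold -> hand-shape feedback
```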

Figure 4: Feedback given by the app. (a) Hand-shape feedback for AFTER. (b) Movement feedback for ABOUT.

Figure 5: Feedback for incorrect location for right hand. (a) HERE: red box (upper) shows the detected location, green box (lower) the correct location. (b) DEAF: red box (upper) shows the detected location, green box (lower) the correct location.

4 RESULTS AND EVALUATION
An ideal system should give feedback to a learner only if their execution is incorrect. Giving unnecessary feedback for correct executions will hinder the learning process and decrease usability. Conversely, providing sound and timely explanations for incorrect executions helps to improve utility and user trust. Smart systems such as L2S that use explainable machine learning tend to have a trade-off between explainability and performance, which should be minimized.

The overall performance of the system was tested on 10 test users for a total of 750 signs. The training of the CNN for hand-shape feature extraction and optimal threshold determination was done using the remaining users. For each sign, 30 executions from the test dataset were taken as the true class while 30 randomly selected executions from the pool of remaining signs were taken as the incorrect class to avoid class imbalance. A pre-trained model from C3D [34] was retrained with the data we collected and was used as the baseline for comparison. This model has an accuracy of 82.3% on the UCF101 [28] dataset and 87.7% on the YUPENN-Scene [7] dataset. The final recognition accuracy of C3D on the L2S dataset using the same train-test split was 45.38%. Our approach achieves a higher accuracy of 87.9% while still offering explanations about its decisions in the form of learner feedback.

To obtain the results, data collected from one learner was selected at random and served as the tutorial dataset. Then each sign for each user in the test dataset was compared against the corresponding tutorial sign. The location module had an overall recall of 96.4% and a precision of 24.3%. The lower precision is due to the fact that many signs in the test dataset had similar locations. We performed a test comparing only the sign 'LARGE' to the sign 'FATHER', and both the precision and recall were 100%. The movement module had an overall recall of 93.2% and a precision of 52.4%. The hand-shape module had a recall of 89% and a precision of 74%. The overall model is constructed as a waterfall combination of all three models such that the movement model is executed only when the location was found to be correct, and the hand-shape model is executed only when both the location and movement were correct. The overall precision, recall, F-1 score and accuracy are summarized in Table 4.
                         Table 4: Precision(P), Recall(R), F-1 Score (F1) and Accuracy(A) for 25 ASL tokens.

            Sign       P      R      F1     A      Sign            P      R      F1       A       Sign             P       R        F1       A
            About      0.92   0.71   0.80   0.85   Decide          0.91   0.55   0.69     0.74    Here             0.92    0.96     0.94     0.96
            After      0.85   0.92   0.88   0.92   Father          0.86   0.86   0.86     0.92    Hospital         0.96    0.86     0.91     0.93
            And        0.86   0.86   0.86   0.92   Find            0.54   0.81   0.65     0.81    Hurt             0.81    0.92     0.86     0.91
            Can        0.96   0.77   0.85   0.89   Gold            0.88   0.81   0.84     0.89    If               0.79    0.90     0.84     0.91
            Cat        0.96   0.59   0.73   0.78   Goodnight       0.96   0.59   0.73     0.78    Large            0.96    0.63     0.76     0.79
            Cop        0.91   0.91   0.91   0.95   goout           0.88   0.85   0.87     0.91    Sorry            0.85    1.00     0.92     0.95
            Cost       0.85   0.79   0.81   0.87   Hearing         0.85   1.00   0.92     0.95    Tiger            0.58    0.64     0.61     0.76
            Day        0.96   0.80   0.87   0.91   Hello           0.96   0.81   0.88     0.91    Average          0.86    0.82     0.83     0.88
            Deaf       0.56   0.88   0.68   0.83   Help            0.88   1.00   0.93     0.96


5 DISCUSSION AND FUTURE WORK
We demonstrated the need for a feedback-based technological solution for sign language learning and provided an implementation with a modular feedback mechanism. The user preference for the desired amount of feedback can be changed by altering the value of 'Feedback Sensitivity'. The trade-off between 'Feedback Sensitivity' and the amount of feedback received, as well as other performance metrics, is summarized in Figure 3. Although we designed our feedback mechanism based on principles from linguistics and a user survey, only large-scale usage of such an application will provide definitive best practices for the most effective feedback. In such future studies, issues such as the extent of user control in determining the types of feedback and the possibility of peer-to-peer feedback for on-line learning have to be evaluated, as suggested by works such as [10]. This work provides the foundations and demonstrates the feasibility of interactive and intelligent sign language learning to pave the path for such future work.

We collected usage and interaction data from 100 new learners as part of this work, which will be foundational in assisting future researchers. Although the focus of this work was on the manual portion of sign languages, the preprocessing includes location estimates for the eyes, ears and nose. These can be utilized for including facial expression recognition and feedback in future work. We evaluated only 25 isolated ASL words, but in the future this work can be extended to more words and phrases and to other sign languages, since the general principles remain the same. In this work, we used sign language as a test application; however, the insights from this work can be easily applied to other gesture domains such as combat sign training for the military or industrial operator signs.

6 CONCLUSION
There is an increasing need and demand for learning sign language. Feedback is very important for language learning, and intelligent language learning software must provide effective and meaningful feedback. There have also been significant advances in research on recognizing sign languages; however, technological solutions that leverage them to provide intelligent learning environments do not exist. In this work, we identify different types of potential feedback we can provide to learners of sign language and address some challenges in doing so. We propose a pipeline of three non-parametric recognition modules and an incremental feedback mechanism to facilitate learning. We tested our system on real-world data from a variety of devices and settings, achieving a final recognition accuracy of 87.9%. This demonstrates that using explainable machine learning for gesture learning is desirable and effective. We also provided different types of feedback mechanisms based on the results of a user survey and best practices in implementing them. Finally, we collected data from 100 users of L2S with 3 repetitions for each of the 25 signs, for a total of 7,500 instances [13].

7 ACKNOWLEDGMENTS
We thank Signing Savvy [24] for letting us use their tutorial videos in the application.

REFERENCES
[1] Xavier Anguera, Robert Macrae, and Nuria Oliver. 2010. Partial sequence matching using an unbounded dynamic time warping algorithm. In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 3582–3585.
[2] Modern Language Association. 2016. Language Enrollment Database. https://apps.mla.org/flsurvey_search. [Online; accessed 24-September-2018].
[3] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. 2016. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European Conference on Computer Vision. Springer, 561–578.
[4] Ching-Hang Chen and Deva Ramanan. 2017. 3D human pose estimation = 2D pose estimation + matching. In CVPR, Vol. 2. 6.
[5] Xianjie Chen and Alan L Yuille. 2014. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems. 1736–1744.
[6] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. 2018. Neural Sign Language Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7784–7793.
[7] Konstantinos G Derpanis, Matthieu Lecce, Kostas Daniilidis, and Richard P Wildes. 2012. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 1306–1313.
[8] Farzad Ehsani and Eva Knodt. 1998. Speech technology in computer-aided language learning: Strengths and limitations of a new CALL paradigm. (1998).
[9] Karen Emmorey. 2001. Language, cognition, and the brain: Insights from sign language research. Psychology Press.
[10] Rebecca Fiebrink, Perry R Cook, and Dan Trueman. 2011. Human model evaluation in interactive supervised learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 147–156.
[11] Kirsti Grobel and Marcell Assan. 1997. Isolated sign language recognition using hidden Markov models. In Systems, Man, and Cybernetics, 1997. Computational Cybernetics and Simulation, 1997 IEEE International Conference on, Vol. 1. IEEE, 162–167.
[12] Pradeep Kumar, Himaanshu Gauba, Partha Pratim Roy, and Debi Prosad Dogra. 2017. Coupled HMM-based multi-sensor data fusion for sign language recognition. Pattern Recognition Letters 86 (2017), 1–8.
[13] Impact Lab. 2018. Learn2Sign Details Page. https://impact.asu.edu/projects/sign-language-recognition/learn2sign
[14] Kian Ming Lim, Alan WC Tan, and Shing Chiang Tan. 2016. A feature covariance matrix with serial particle filter for isolated sign language recognition. Expert Systems with Applications 54 (2016), 208–218.
[15] Malek Nadil, Feryel Souami, Abdenour Labed, and Hichem Sahbi. 2016. KCCA-based technique for profile face identification. EURASIP Journal on Image and Video Processing 2017, 1 (2016), 2.
[16] World Health Organization. 2018. Deafness and hearing loss. http://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss. [Online; accessed 24-September-2018].
[17] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. 2017. Towards accurate multi-person pose estimation in the wild. In CVPR, Vol. 3. 6.
[18] Prajwal Paudyal, Ayan Banerjee, and Sandeep KS Gupta. 2016. Sceptre: a pervasive, non-invasive, and programmable gesture recognition technology. In Proceedings of the 21st International Conference on Intelligent User Interfaces. ACM, 282–293.
[19] Prajwal Paudyal, Junghyo Lee, Ayan Banerjee, and Sandeep KS Gupta. 2017. Dyfav: Dynamic feature selection and voting for real-time recognition of fingerspelled alphabet using wearables. In Proceedings of the 22nd International Conference on Intelligent User Interfaces. ACM, 457–467.
[20] Fabrizio Pedersoli, Sergio Benini, Nicola Adami, and Riccardo Leonardi. 2014. XKin: an open source framework for hand pose and gesture recognition using Kinect. The Visual Computer 30, 10 (2014), 1107–1122.
[21] Martha C Pennington and Pamela Rogerson-Revell. 2019. Using Technology for Pronunciation Teaching, Learning, and Assessment. In English Pronunciation Teaching and Research. Springer, 235–286.
[22] Sean Robertson, Cosmin Munteanu, and Gerald Penn. 2018. Designing Pronunciation Learning Tools: The Case for Interactivity against Over-Engineering. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 356.
[23] Russell S Rosen. 2010. American sign language curricula: A review. Sign Language Studies 10, 3 (2010), 348–381.
[24] Signing Savvy. 2018. Signing Savvy: Your Sign Language Resource. https://www.signingsavvy.com/. [Online; accessed 28-September-2018].
[25] Nikolaos Sarafianos, Bogdan Boteanu, Bogdan Ionescu, and Ioannis A Kakadiaris. 2016. 3D human pose estimation: A review of the literature and analysis of covariates. Computer Vision and Image Understanding 152 (2016), 1–20.
[26] YoungHee Sheen. 2004. Corrective feedback and learner uptake in communicative classrooms across instructional settings. Language Teaching Research 8, 3 (2004), 263–300.
[27] Peter Skehan. 1998. A cognitive approach to language learning. Oxford University Press.
[28] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
[29] Thad Starner, Joshua Weaver, and Alex Pentland. 1998. Real-time American sign language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 12 (1998), 1371–1375.
[30] William C Stokoe Jr. 2005. Sign language structure: An outline of the visual communication systems of the American deaf. Journal of Deaf Studies and Deaf Education 10, 1 (2005), 3–37.
[31] Rosetta Stone. 2016. Talking back required. https://www.rosettastone.com/speech-recognition. [Online; accessed 28-September-2018].
[32] Denis Tome, Christopher Russell, and Lourdes Agapito. 2017. Lifting from the deep: Convolutional 3D pose estimation from a single image. CVPR 2017 Proceedings (2017), 2500–2509.
[33] Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. 2014. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems. 1799–1807.
[34] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[35] Fabienne M Van der Kleij, Remco CW Feskens, and Theo JHM Eggen. 2015. Effects of feedback in a computer-based learning environment on students' learning outcomes: A meta-analysis. Review of Educational Research 85, 4 (2015), 475–511.
[36] Ping Yu, Yingxin Pan, Chen Li, Zengxiu Zhang, Qin Shi, Wenpei Chu, Mingzhuo Liu, and Zhiting Zhu. 2016. User-centred design for Chinese-oriented spoken English learning system. Computer Assisted Language Learning 29, 5 (2016), 984–1000.