=Paper=
{{Paper
|id=Vol-2327/ExSS8
|storemode=property
|title=Learn2Sign: Explainable AI for Sign Language Learning
|pdfUrl=https://ceur-ws.org/Vol-2327/IUI19WS-ExSS2019-13.pdf
|volume=Vol-2327
|authors=Prajwal Paudyal,Junghyo Lee,Azamat Kamzin,Mohamad Soudki,Ayan Banerjee,Sandeep Gupta
|dblpUrl=https://dblp.org/rec/conf/iui/PaudyalLKSBG19
}}
==Learn2Sign: Explainable AI for Sign Language Learning==
Prajwal Paudyal, Junghyo Lee, Azamat Kamzin, Mohamad Soudki, Ayan Banerjee, Sandeep K.S. Gupta
Arizona State University
Tempe, Arizona, United States
(ppaudyal,jlee375,akamzin,msoudki,abanerj3,sandeep.gupta)@asu.edu
ABSTRACT
Languages are best learned in immersive environments with rich feedback. This is especially true for signed languages due to their visual and poly-componential nature. Computer Aided Language Learning (CALL) solutions successfully incorporate feedback for spoken languages, but no such solution exists for signed languages. Current Sign Language Recognition (SLR) systems are not interpretable and hence not applicable for providing feedback to learners. In this work, we propose a modular and explainable machine learning system that is able to provide fine-grained feedback on location, movement and hand-shape to learners of ASL. In addition, we propose a waterfall architecture for combining the sub-modules to prevent cognitive overload for learners and to reduce computation time for feedback. The system has an overall test accuracy of 87.9 % on real-world data consisting of 25 signs with 3 repetitions each from 100 learners.

CCS CONCEPTS
• Human-centered computing → Interaction design; • Computing methodologies → Artificial intelligence; • Applied computing → Interactive learning environments.

KEYWORDS
Explainable AI; Sign Language Learning; Computer-aided learning

ACM Reference Format:
Prajwal Paudyal, Junghyo Lee, Azamat Kamzin, Mohamad Soudki, Ayan Banerjee, Sandeep K.S. Gupta. 2019. Learn2Sign: Explainable AI for Sign Language Learning. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 7 pages.

IUI Workshops'19, March 20, 2019, Los Angeles, USA. © 2019 Copyright for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
1 INTRODUCTION
Signed languages are natural mediums of communication for the estimated 466 million deaf or hard of hearing people worldwide [16]. Families and friends of the deaf can also benefit from being able to sign. The Modern Language Association [2] reports that enrollment in American Sign Language (ASL) courses in the U.S. has increased nearly 6,000 percent since 1990, which shows that interest in acquiring sign languages is growing. However, the lack of resources for self-paced learning makes sign languages difficult to acquire, especially outside of the traditional classroom setting [26].

The ideal environment for language learning is immersion with rich feedback [27], and this is especially true for sign languages [9]. Studies have shown that providing item-based feedback in CALL systems is very important [35]. Towards this goal, language learning software for spoken languages such as Rosetta Stone or Duolingo supports some form of assessment and automatic feedback [31]. Although there are numerous instructive books [23], video tutorials and smartphone applications for learning popular sign languages, there has not yet been any work towards providing automatic feedback, as seen in Table 1. We conducted a survey [13] of 52 first-time ASL users (29M, 21F) in 2018, and 96.2 % said that reasonable feedback is important but lacking in solutions for sign language learning (Table 2).

Table 1: Some ASL learning applications for smartphones.

Application          Can Increase Vocab    Feedback
ASL Coach            No                    None
The ASL App          No                    None
ASL Fingerspelling   No                    None
Marlee Signs         Yes                   None
SL for Beginners     No                    None
WeSign               No                    None

Studies show that elaborated feedback, such as providing meaningful explanations and examples, produces a larger effect on learning outcomes than feedback about correctness alone [35]. The simplest feedback that can be given to a learner is whether their execution of a particular sign was correct. State-of-the-art SLR and activity recognition systems can be easily trained to accomplish this. However, to truly help a learner identify mistakes and learn from them, the feedback and explanations generated must be more fine-grained.

The various ways in which a signer can make mistakes during the execution of a sign can be directly linked to how minimal pairs are formed in the phonetics of that language. The work of Stokoe postulates that the manual portion of an ASL sign is composed of 1) location, 2) movement and 3) hand-shape and orientation [30]. A black-box recognition system cannot provide this level of feedback; thus there is a need for an explainable AI system, because feedback from the system is analogous to explanations for its final decision. Non-manual markers such as facial expressions and body gaits also change the meaning of signs to some extent, but they are less important for beginner-level language acquisition, so they are left for future work.

Studies have also shown that the effect of feedback is highest if it is provided immediately [35], thus feedback systems should be real-time. The requirement for immediate feedback also restricts the use of complicated learning algorithms that require heavy computing [6] and extensive training.
The usability and usefulness of such applications is enhanced if learning is self-paced, learners are allowed to use their own devices, and the learning vocabulary can be easily extended. However, current solutions for SLR require large initial datasets and re-training to support unseen words. To address these challenges, we designed Learn2Sign (L2S), a smartphone application that utilizes explainable AI to provide fine-grained feedback on location, movement, orientation and hand-shape for ASL learners. L2S is built as a waterfall combination of three non-parametric models, as seen in Figure 1, to ensure extendibility to new vocabulary. Learners can use L2S with any smartphone or computer with a front-facing camera. L2S utilizes a bone localization technique proposed by [17] for movement- and location-based feedback and a light-weight pre-trained Convolutional Neural Network (CNN) as a feature extractor for hand-shape feedback.

Figure 1: System Model for Feedback.

The methodology and evaluations are provided in Sections 3 and 4. As part of this work, we collected video data from 100 users executing 25 ASL signs three times each. The videos were recorded by L2S users in real-world settings without restrictions on device type, lighting conditions, distance to the camera or recording pose (sitting or standing up). This was to ensure generalization to real-world conditions; however, it also makes the dataset more challenging. More details about the resulting dataset of about 7,500 instances can be found in [13].

Table 2: Survey Results from 52 Users of the Application.

Category                 Response
Importance of Feedback   Yes: 96.2%, No: 3.8%
Movement Feedback        Correctness: 9.6%, Colored Bones: 1.9%, Sentence: 15.4%, Correctness+Sentence: 9.6%, All: 65.3%
Handshape Feedback       Circle around handshape: 63.5%, Actual handshape: 36.5%
Self-Assessment          Not helpful: 1.9%, Somewhat: 5.8%, Very helpful: 93.3%
Expandability            Not helpful: 5.8%, Somewhat: 7.7%, Very helpful: 86.5%

2 RELATED WORK
There have been many works on providing meaningful feedback for spoken language learners [8, 21, 22]. On the practical side, Rosetta Stone provides both waveform and spectrograph feedback for pronunciation mistakes by comparing the acoustic waves of a learner to those of a native speaker [31]. There has also been some recent work on design principles for using Automatic Speech Recognition (ASR) techniques to provide feedback for language learners [36]. Sign Language Recognition (SLR) is a research field that closely mirrors ASR and can potentially be utilized by systems for sign language learning. However, to the best of our knowledge, no such system exists. This can be explained by the inherent difficulties in SLR as well as the lack of detailed studies on design principles for such systems. In this work, we propose some design principles and an explainable smart system to meet this goal.

Continuously translating a video recording of a signed language into a spoken language is a very challenging problem and has been tackled recently by various researchers with some success [6]. For the purposes of this application, such complex measures are not desirable, as they require extensive datasets for training and large models for translation, which decreases their usability. Isolated Sign Language Recognition has the goal of classifying sign tokens into classes that represent spoken language words [11, 12, 18, 29]. Some researchers have utilized videos [14] while others have attempted to use wearable sensors [18, 19], with varying performance. In this work, we utilize the insights and advances from such systems to help a new learner acquire sign language words. To our knowledge, this work is the first attempt at such a practical and much-needed application.

For this work, we require an estimation of human pose, specifically estimates of the locations of various joints throughout a video, known as keypoints. There have been several works towards this goal [3–5, 25, 32, 33]. Some of these works first detect the keypoints in 2D and then attempt to 'lift' that set to 3D space, while others return the 2D coordinates of the various keypoints relative to the image. In order to fulfill the requirement of using pervasive cameras, we did not focus on approaches that utilize depth information such as Microsoft Kinect [20]. Instead, we utilized the pose estimates from a TensorFlow.js implementation of the model proposed by Papandreou et al. [17], which can run on devices with or without GPUs (Graphical Processing Units).
3 METHODOLOGY
Stokoe proposed that a sign in ASL consists of three parts which combine simultaneously: the tab (location of the sign), the dez (hand-shape) and the sig (movement) [30]. Signs like 'HEADACHE' and 'STOMACH ACHE' that are similar in hand-shape and movement may differ only by the signing location. Similarly, there are other minimal pairs of signs that differ only by movement or hand-shape. Following this understanding, L2S is composed of three corresponding recognition and feedback modules.

3.1 User Interface
For initial data collection and for testing the UI, we developed an Android application called L2S. We preloaded the application with 25 tutorial videos from Signing Savvy corresponding to 25 ASL signs [24]. The application has three main components: a) Learning Module, b) Practice Module, and c) Extension Module.

3.1.1 Learning Module. The learning module of the L2S application is where all the tutorial videos are accessible. A learner selects an ASL word/phrase to learn and can then view the tutorial videos. The learner can pause, play, and repeat the tutorials as many times as needed. In this module, the learner can also record executions of their signs for self-assessment.

3.1.2 Practice Module. The practice module is designed to give automatic feedback to learners. A learner selects a sign to practice and sets up their device to record their execution. After this, L2S determines whether the learner performed the sign correctly. The result is correct if the sign meets the thresholds for movement, location, and hand-shape, and a 'correct' feedback is given. If the system determines that the learner did not execute the sign correctly, an appropriate feedback is provided, as seen in Figure 1. Details about the recognition and feedback mechanisms are discussed in Section 3.4.
3.1.3 Extension Module. To extend the supported vocabulary of L2S, a learner can upload one or more tutorial videos from a source of their choosing. The application processes them for usability before they appear in the Learning Module as new tutorial sign(s).

3.2 Data Collection
We collected signing videos from 100 learners, for 25 ASL signs with three repetitions each, in real-world settings using the L2S app. Learners used their own devices, with no restrictions on lighting conditions, distance to the camera or recording pose (sitting or standing up). After reviewing a tutorial video, a learner was given a 5 s setup time before recording a 3 s video using a front-facing camera. Both the tutorial and the newly recorded video were then displayed on the same screen for the user to accept or reject. This self-assessment served not only as a review but also helped prune incorrect data due to device or timing errors, as suggested by the new learner survey in Table 2.

3.3 Preprocessing
Determining joint locations: Since different devices record at different resolutions, all videos for learning, practice or extension are first converted to a 320x240 resolution. Then, the PoseNet JavaScript API for single-pose estimation [17] is used to compute the estimated locations and confidence levels of the various keypoints, as seen in Table 3. Figure 2 shows the estimated eye, shoulder and wrist locations for the signs TIGER and DECIDE for all the frames in one video.

Table 3: 4 out of 17 'keypoints' for one frame in a video.

Part             Score     x         y
Left Shoulder    0.8325    180.0198  196.2646
Right Shoulder   0.78601   138.7879  195.5847
Left Wrist       0.6844    198.9818  223.4856
Right Wrist      0.1564    1.9084    211.0473

Figure 2: Automatic bucketing for Location Identification for varying distances from camera. Left Wrist: Yellow, Right Wrist: Red, Eyes: White, Shoulders: Green. (a) TIGER mostly in bucket 1 for the left hand. (b) DECIDE in buckets 3 and 6 for the right hand.
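As an illustration only (not the authors' code), the per-frame output of this preprocessing step can be represented as a list of keypoint records like the rows in Table 3. The sketch below assumes a hypothetical `estimate`-style wrapper (`pose_model`) around a pose estimator such as PoseNet; the resizing step uses OpenCV and mirrors the 320x240 conversion described above.

```python
from dataclasses import dataclass
from typing import Callable, List

import cv2  # used only to resize frames to the 320x240 working resolution


@dataclass
class Keypoint:
    part: str     # e.g. "leftShoulder", "rightWrist"
    score: float  # detection confidence in [0, 1]
    x: float      # pixel coordinates in the resized 320x240 frame
    y: float


def extract_keypoints(frames, pose_model: Callable) -> List[List[Keypoint]]:
    """Resize each frame and run single-pose estimation on it.

    `pose_model` is a stand-in for whatever pose estimator is used
    (the paper uses a PoseNet model from TensorFlow.js); it is assumed
    to return a list of (part, score, x, y) tuples per frame.
    """
    per_frame = []
    for frame in frames:
        small = cv2.resize(frame, (320, 240))
        raw = pose_model(small)  # hypothetical wrapper call
        per_frame.append([Keypoint(p, s, x, y) for (p, s, x, y) in raw])
    return per_frame
```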
Normalization: There is a difference in the scale of bodies relative to the frame size, corresponding to the distance between the learner and the camera. This scaling factor can negatively impact recognition, since the relative location, movement and hand-shape will vary with distance. We perform min-max normalization and zeroing based on the distance between the average estimated locations of the right and left shoulders throughout the video frames, as suggested by [15]. Normalization was found to be especially important for correct movement recognition.
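A minimal sketch of this normalization, assuming keypoints in the per-frame form produced above: coordinates are zeroed on the shoulder midpoint and scaled by the average inter-shoulder distance over the video, so that learners at different distances from the camera become comparable. The exact normalization used in L2S may differ in detail.

```python
import numpy as np

def normalize_keypoints(frames):
    """frames: list of dicts mapping part name -> (x, y) for one video.

    Zero each frame on the shoulder midpoint and scale by the average
    inter-shoulder distance across the whole video (a proxy for the
    learner's distance from the camera).
    """
    # Average shoulder positions over all frames.
    ls = np.mean([f["leftShoulder"] for f in frames], axis=0)
    rs = np.mean([f["rightShoulder"] for f in frames], axis=0)
    scale = np.linalg.norm(ls - rs)   # average shoulder width in pixels
    origin = (ls + rs) / 2.0          # shoulder midpoint as the new origin

    normalized = []
    for f in frames:
        normalized.append({part: (np.asarray(xy) - origin) / scale
                           for part, xy in f.items()})
    return normalized
```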
3.4 Recognition and Feedback
L2S is designed to give incremental feedback to learners for the various modalities in sign language: a) location, b) movement and c) hand-shape. The models are arranged in a waterfall architecture as seen in Figure 1. If the location of signing was not correct, then immediate feedback is provided and the learner is prompted to try again. Similarly, if the movement of the elbows or the wrists of either hand was incorrect, the learner is prompted to try again. Finally, if the shape and orientation of either hand does not appear to be correct, a hand-shape based feedback is provided. Consequently, the learner can move on to practice a new sign only if all of these modalities were sufficiently correct.

A waterfall architecture was chosen in the final application over a linear weighted combination to make learning progressive and to decrease the cognitive load on the learner due to the potential of mistakes in multiple modalities. This architecture also helps to reduce the time taken for recognition and feedback, since the models are stacked in increasing order of execution time. Each of the feedback screens shown to the user also has a link to the tutorial video. Users can also manually tune the amount of feedback by altering the value of 'feedback sensitivity' in the application settings. Increasing this value alters the thresholds for each of the sub-modules so that the overall rate of feedback is increased. This involves a trade-off in performance, which is summarized in Figure 3.

Figure 3: Feedback Sensitivity vs. Performance.
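A sketch of the waterfall control flow described above, under the assumption that each module exposes a score-against-tutorial function and a per-sign threshold (the function and parameter names here are illustrative, not the actual L2S implementation). The earliest failing module short-circuits the pipeline and determines which single feedback is shown.

```python
def waterfall_feedback(learner_video, tutorial, modules, sensitivity=1.0):
    """modules: list of (name, score_fn, passes_fn, threshold) tuples,
    ordered by increasing execution time (location, movement, hand-shape).

    Returns the first failing modality so that only one kind of feedback
    is shown at a time, mirroring the waterfall design described above.
    """
    for name, score_fn, passes, threshold in modules:
        score = score_fn(learner_video, tutorial)
        # 'sensitivity' tightens or relaxes the per-sign threshold, as the
        # in-app 'feedback sensitivity' setting is described to do.
        if not passes(score, threshold, sensitivity):
            return {"correct": False, "modality": name, "score": score}
    return {"correct": True}
```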
3.5 Location
To correctly and efficiently determine the location of signing, we first assume that the shoulders stay fairly stationary throughout the execution of a sign. This is a fair assumption for ASL, since there are no minimal pairs exclusively associated with a signer's shoulders. We then divide the video canvas into 6 sub-sections called buckets, as seen in Figure 2. As the learner executes any given sign, the location of each wrist joint is tracked per bucket, resulting in a vector of length 6.

The same procedure is followed for the tutorials, and a cosine-based comparison is done between the two vectors. A heuristic threshold determined during training is used as a cut-off point. If the resulting cosine similarity is lower than the threshold, feedback is shown to the learner as seen in Figure 4. For each hand, the user's own video is replayed as a Graphics Interchange Format (GIF) animation with a red highlight on the location section that was incorrect and a green highlight on the section of the frame where the sign should have been executed. A text feedback with details and a link to the tutorial is also provided, and the learner is prompted to try again.
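A minimal sketch of this bucket-based location check, assuming per-frame wrist positions and a 2x3 grid of buckets over the canvas (the paper only states that there are six buckets, so the grid layout is an assumption). Each hand yields a length-6 histogram that is compared to the tutorial's histogram with cosine similarity.

```python
import numpy as np

def location_histogram(wrist_xy, frame_w=320, frame_h=240, rows=2, cols=3):
    """Count how many frames a wrist spends in each of rows*cols buckets.

    wrist_xy: iterable of (x, y) wrist positions, one per frame.
    The 2x3 grid is an assumed layout; the paper only specifies six buckets.
    """
    hist = np.zeros(rows * cols)
    for x, y in wrist_xy:
        c = min(int(x / (frame_w / cols)), cols - 1)
        r = min(int(y / (frame_h / rows)), rows - 1)
        hist[r * cols + c] += 1
    return hist

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# Location is judged correct when the similarity between the learner's
# and the tutorial's bucket histograms exceeds a per-sign threshold.
def location_correct(learner_hist, tutorial_hist, threshold):
    return cosine_similarity(learner_hist, tutorial_hist) >= threshold
```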
3.6 Movement
Determining correct movement is perhaps the single most important feedback we can provide to a learner. We compute a segmental DTW distance between a learner's recording and the tutorial using keypoints for the wrists, elbows and shoulders, as suggested in [1]. Normalization, as discussed in Section 3.3, was found to be very important. Experimental results showed that segmental DTW outperformed both plain DTW and the Global Alignment Kernel (GAK).

The dataset had a wide variation in the number of frames per video, and this was found to affect the distance scores adversely. Thus, as an additional preprocessing step, the video with the higher number of frames is down-sampled before comparison, and segmental DTW is used to find the best sub-sample match. Thresholds for the signs were determined experimentally using 10 training videos per sign. If the segmental DTW distance between a learner's recording and a tutorial is higher than the threshold for an arm section, a movement-based feedback is provided as seen in Figure 1. A GIF is replayed to the user with the section(s) of the arm for which the movement was incorrect highlighted in red, as seen in Figure 5b. A textual feedback is also generated with an explanation, after which the user is prompted to watch the tutorial and try again.
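The sketch below is a simplified subsequence (segmental) DTW in the spirit of [1], not the authors' implementation: the learner's keypoint trajectory is aligned with the best-matching contiguous portion of the (down-sampled) tutorial trajectory, and the resulting distance is compared to a per-sign threshold. The frame features are assumed to be flattened, normalized (x, y) coordinates of the wrists, elbows and shoulders.

```python
import numpy as np

def subsequence_dtw(query, reference):
    """Best-match DTW distance of `query` against any contiguous
    sub-sequence of `reference` (a simplified segmental DTW).

    query, reference: arrays of shape (n_frames, n_features).
    """
    n, m = len(query), len(reference)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0  # the match may start anywhere in the reference
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(query[i - 1] - reference[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, 1:].min())  # ...and end anywhere in the reference

def downsample(seq, target_len):
    """Down-sample the longer video to roughly the length of the shorter one."""
    idx = np.linspace(0, len(seq) - 1, target_len).astype(int)
    return seq[idx]

def movement_correct(learner, tutorial, threshold):
    if len(learner) > len(tutorial):
        learner = downsample(learner, len(tutorial))
    else:
        tutorial = downsample(tutorial, len(learner))
    return subsequence_dtw(learner, tutorial) <= threshold
```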
3.7 Hand Shape and Orientation
ASL signs that are otherwise similar may differ only by the shape or orientation of the hands. Since CNNs achieve state-of-the-art image recognition results, we utilize an Inception v3 or MobileNet CNN, depending on the device being used. A model pre-trained on ImageNet is retrained using hand-shape images from the training users. The wrist locations obtained during preprocessing are used as a guide to auto-crop these hand-shape images. At recognition time, hand-shape images for each hand are extracted automatically in the same way from a learner's recording. Then 6 images for each hand are passed separately through the CNN, and the resulting softmax layers are concatenated together as seen in Figure 1. Similar processing is done on the tutorial video to obtain a vector of the same length. A cosine similarity is then calculated between the resulting vectors. If the similarity between a learner's sign and that of a tutorial is above a set threshold for that sign, the execution is determined to be correct; otherwise, the hand-shape based feedback, as seen in Figure 5, is provided.

Although the retrained CNN could theoretically be used as a classifier, we use it only as a feature extractor for cosine similarity, to ensure that the system can extend to unseen classes. A new tutorial can then be added to the system without the need for retraining. An analysis of the effectiveness of the hand-shape and orientation recognizer is provided in Section 4. Similar to location and movement, feedback for hand shape and orientation is also provided in the form of a replay GIF and text. A zoomed-in image of the incorrect hand shape is shown side by side with the correct image from a tutorial, as seen in Figure 5(a).
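A hedged sketch of this hand-shape comparison, assuming hand crops have already been extracted around the wrist keypoints and using an off-the-shelf Keras MobileNet (ImageNet weights) in place of the retrained L2S network. Six per-hand softmax vectors are concatenated and compared to the tutorial's vector by cosine similarity.

```python
import numpy as np
import tensorflow as tf

# An off-the-shelf MobileNet stands in for the retrained network described
# in the paper; its softmax output is used purely as a feature vector.
_model = tf.keras.applications.MobileNet(weights="imagenet")

def handshape_features(hand_crops):
    """hand_crops: list of 6 RGB hand images (H, W, 3) for one hand.

    Returns the concatenation of the 6 softmax vectors, analogous to the
    per-hand feature vector described above.
    """
    batch = np.stack([
        tf.image.resize(img, (224, 224)).numpy() for img in hand_crops
    ]).astype("float32")
    batch = tf.keras.applications.mobilenet.preprocess_input(batch)
    probs = _model.predict(batch, verbose=0)  # shape (6, 1000)
    return probs.reshape(-1)                  # shape (6000,)

def handshape_correct(learner_crops, tutorial_crops, threshold):
    a = handshape_features(learner_crops)
    b = handshape_features(tutorial_crops)
    sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sim >= threshold
```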
Figure 4: Feedback given by the app. (a) Hand-shape feedback for AFTER. (b) Movement feedback for ABOUT.

Figure 5: Feedback for incorrect location for right hand. (a) HERE: Red box (upper): detected location, green box (lower): correct location. (b) DEAF: Red box (upper): detected location, green box (lower): correct location.

4 RESULTS AND EVALUATION
An ideal system should give feedback to a learner only if their execution is incorrect. Giving unnecessary feedback for correct executions will hinder the learning process and decrease usability. Conversely, providing sound and timely explanations for incorrect executions helps to improve utility and user trust. Smart systems such as L2S that use explainable machine learning tend to have a trade-off between explainability and performance, which should be minimized.

The overall performance of the system was tested on 10 test users for a total of 750 signs. The training of the CNN for hand-shape feature extraction and the determination of optimal thresholds were done using the remaining users. For each sign, 30 executions from the test dataset were taken as the true class, while 30 randomly selected executions from the pool of remaining signs were taken as the incorrect class, to avoid class imbalance. A pre-trained model from C3D [34] was retrained with the data we collected and used as the baseline for comparison. This model has an accuracy of 82.3 % on the UCF101 [28] and 87.7 % on the YUPENN-Scene [7] datasets. The final recognition accuracy of C3D on the L2S dataset using the same train-test split was 45.38 %. Our approach achieves a higher accuracy of 87.9 % while still offering explanations about its decisions in the form of learner feedback.

To obtain the results, data collected from one learner was selected at random and served as the tutorial dataset. Each sign for each user in the test dataset was then compared against the corresponding tutorial sign. The location module had an overall recall of 96.4 % and a precision of 24.3 %. The lower precision is due to the fact that many signs in the test dataset had similar locations. We performed a test comparing only the sign 'LARGE' to the sign 'FATHER', and both the precision and recall were 100 %. The movement module had an overall recall of 93.2 % and a precision of 52.4 %. The hand-shape module had a recall of 89 % and a precision of 74 %. The overall model is constructed as a waterfall combination of all three models, such that the movement model is executed only when the location was found to be correct, and the hand-shape model is executed only when both the location and movement were correct. The overall precision, recall, F-1 score and accuracy are summarized in Table 4.
Table 4: Precision(P), Recall(R), F-1 Score (F1) and Accuracy(A) for 25 ASL tokens.
Sign P R F1 A Sign P R F1 A Sign P R F1 A
About 0.92 0.71 0.80 0.85 Decide 0.91 0.55 0.69 0.74 Here 0.92 0.96 0.94 0.96
After 0.85 0.92 0.88 0.92 Father 0.86 0.86 0.86 0.92 Hospital 0.96 0.86 0.91 0.93
And 0.86 0.86 0.86 0.92 Find 0.54 0.81 0.65 0.81 Hurt 0.81 0.92 0.86 0.91
Can 0.96 0.77 0.85 0.89 Gold 0.88 0.81 0.84 0.89 If 0.79 0.90 0.84 0.91
Cat 0.96 0.59 0.73 0.78 Goodnight 0.96 0.59 0.73 0.78 Large 0.96 0.63 0.76 0.79
Cop 0.91 0.91 0.91 0.95 goout 0.88 0.85 0.87 0.91 Sorry 0.85 1.00 0.92 0.95
Cost 0.85 0.79 0.81 0.87 Hearing 0.85 1.00 0.92 0.95 Tiger 0.58 0.64 0.61 0.76
Day 0.96 0.80 0.87 0.91 Hello 0.96 0.81 0.88 0.91 Average 0.86 0.82 0.83 0.88
Deaf 0.56 0.88 0.68 0.83 Help 0.88 1.00 0.93 0.96
5 DISCUSSION AND FUTURE WORK
We demonstrated the need for a feedback-based technological solution for sign language learning and provided an implementation with a modular feedback mechanism. The user preference for the desired amount of feedback can be changed by altering the value of 'Feedback Sensitivity'. The trade-off between 'Feedback Sensitivity' and the amount of feedback received, as well as other performance metrics, is summarized in Figure 3. Although we designed our feedback mechanism based on principles from linguistics and a user survey, only large-scale usage of such an application will provide definitive best practices for the most effective feedback. In such future studies, issues such as the extent of user control in determining the types of feedback and the possibility of peer-to-peer feedback for on-line learning have to be evaluated, as suggested by works such as [10]. This work provides the foundations and demonstrates the feasibility of interactive and intelligent sign language learning to pave the path for such future work.

We collected usage and interaction data from 100 new learners as part of this work, which will be foundational in assisting future researchers. Although the focus of this work was on the manual portion of sign languages, the preprocessing includes location estimates for the eyes, ears and nose. These can be utilized for including facial expression recognition and feedback in future work. We evaluated only 25 isolated words of ASL, but in the future this work can be extended to more words and phrases and to other sign languages, since the general principles remain the same. In this work we used sign language as a test application; however, the insights from this work can be easily applied to other gesture domains, such as combat sign training for the military or industrial operator signs.

6 CONCLUSION
There is an increasing need and demand for learning sign language. Feedback is very important for language learning, and intelligent language learning software must provide effective and meaningful feedback. There have also been significant advances in research on recognizing sign languages; however, technological solutions that leverage them to provide intelligent learning environments do not exist. In this work, we identify different types of potential feedback we can provide to learners of sign language and address some challenges in doing so. We propose a pipeline of three non-parametric recognition modules and an incremental feedback mechanism to facilitate learning. We tested our system on real-world data from a variety of devices and settings and achieved a final recognition accuracy of 87.9 %. This demonstrates that using explainable machine learning for gesture learning is desirable and effective. We also provided different types of feedback mechanisms based on the results of a user survey and best practices in implementing them. Finally, we collected data from 100 users of L2S with 3 repetitions for each of the 25 signs, for a total of 7,500 instances [13].

7 ACKNOWLEDGMENTS
We thank Signing Savvy [24] for letting us use their tutorial videos in the application.

REFERENCES
[1] Xavier Anguera, Robert Macrae, and Nuria Oliver. 2010. Partial sequence matching using an unbounded dynamic time warping algorithm. In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 3582–3585.
[2] Modern Language Association. 2016. Language Enrollment Database. https://apps.mla.org/flsurvey_search. [Online; accessed 24-September-2018].
[3] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. 2016. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European Conference on Computer Vision. Springer, 561–578.
[4] Ching-Hang Chen and Deva Ramanan. 2017. 3D human pose estimation = 2D pose estimation + matching. In CVPR, Vol. 2. 6.
[5] Xianjie Chen and Alan L Yuille. 2014. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems. 1736–1744.
[6] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. 2018. Neural Sign Language Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7784–7793.
[7] Konstantinos G Derpanis, Matthieu Lecce, Kostas Daniilidis, and Richard P Wildes. 2012. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 1306–1313.
[8] Farzad Ehsani and Eva Knodt. 1998. Speech technology in computer-aided language learning: Strengths and limitations of a new CALL paradigm. (1998).
[9] Karen Emmorey. 2001. Language, cognition, and the brain: Insights from sign language research. Psychology Press.
[10] Rebecca Fiebrink, Perry R Cook, and Dan Trueman. 2011. Human model evaluation in interactive supervised learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 147–156.
[11] Kirsti Grobel and Marcell Assan. 1997. Isolated sign language recognition using hidden Markov models. In Systems, Man, and Cybernetics, 1997. Computational Cybernetics and Simulation., 1997 IEEE International Conference on, Vol. 1. IEEE, 162–167.
[12] Pradeep Kumar, Himaanshu Gauba, Partha Pratim Roy, and Debi Prosad Dogra. 2017. Coupled HMM-based multi-sensor data fusion for sign language recognition. Pattern Recognition Letters 86 (2017), 1–8.
[13] Impact Lab. 2018. Learn2Sign Details Page. https://impact.asu.edu/projects/sign-language-recognition/learn2sign
[14] Kian Ming Lim, Alan WC Tan, and Shing Chiang Tan. 2016. A feature covariance matrix with serial particle filter for isolated sign language recognition. Expert Systems with Applications 54 (2016), 208–218.
[15] Malek Nadil, Feryel Souami, Abdenour Labed, and Hichem Sahbi. 2016. KCCA-based technique for profile face identification. EURASIP Journal on Image and Video Processing 2017, 1 (2016), 2.
[16] World Health Organization. 2018. Deafness and hearing loss. http://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss. [Online; accessed 24-September-2018].
[17] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. 2017. Towards accurate multi-person pose estimation in the wild. In CVPR, Vol. 3. 6.
[18] Prajwal Paudyal, Ayan Banerjee, and Sandeep KS Gupta. 2016. Sceptre: a pervasive, non-invasive, and programmable gesture recognition technology. In Proceedings of the 21st International Conference on Intelligent User Interfaces. ACM, 282–293.
[19] Prajwal Paudyal, Junghyo Lee, Ayan Banerjee, and Sandeep KS Gupta. 2017. Dyfav: Dynamic feature selection and voting for real-time recognition of fingerspelled alphabet using wearables. In Proceedings of the 22nd International Conference on Intelligent User Interfaces. ACM, 457–467.
[20] Fabrizio Pedersoli, Sergio Benini, Nicola Adami, and Riccardo Leonardi. 2014. XKin: an open source framework for hand pose and gesture recognition using Kinect. The Visual Computer 30, 10 (2014), 1107–1122.
[21] Martha C Pennington and Pamela Rogerson-Revell. 2019. Using Technology for Pronunciation Teaching, Learning, and Assessment. In English Pronunciation Teaching and Research. Springer, 235–286.
[22] Sean Robertson, Cosmin Munteanu, and Gerald Penn. 2018. Designing Pronunciation Learning Tools: The Case for Interactivity against Over-Engineering. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 356.
[23] Russell S Rosen. 2010. American sign language curricula: A review. Sign Language Studies 10, 3 (2010), 348–381.
[24] Signing Savvy. 2018. Signing Savvy: Your Sign Language Resource. https://www.signingsavvy.com/. [Online; accessed 28-September-2018].
[25] Nikolaos Sarafianos, Bogdan Boteanu, Bogdan Ionescu, and Ioannis A Kakadiaris. 2016. 3D human pose estimation: A review of the literature and analysis of covariates. Computer Vision and Image Understanding 152 (2016), 1–20.
[26] YoungHee Sheen. 2004. Corrective feedback and learner uptake in communicative classrooms across instructional settings. Language Teaching Research 8, 3 (2004), 263–300.
[27] Peter Skehan. 1998. A cognitive approach to language learning. Oxford University Press.
[28] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
[29] Thad Starner, Joshua Weaver, and Alex Pentland. 1998. Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 12 (1998), 1371–1375.
[30] William C Stokoe Jr. 2005. Sign language structure: An outline of the visual communication systems of the American deaf. Journal of Deaf Studies and Deaf Education 10, 1 (2005), 3–37.
[31] Rosetta Stone. 2016. Talking back required. https://www.rosettastone.com/speech-recognition. [Online; accessed 28-September-2018].
[32] Denis Tome, Christopher Russell, and Lourdes Agapito. 2017. Lifting from the deep: Convolutional 3D pose estimation from a single image. CVPR 2017 Proceedings (2017), 2500–2509.
[33] Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. 2014. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems. 1799–1807.
[34] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[35] Fabienne M Van der Kleij, Remco CW Feskens, and Theo JHM Eggen. 2015. Effects of feedback in a computer-based learning environment on students' learning outcomes: A meta-analysis. Review of Educational Research 85, 4 (2015), 475–511.
[36] Ping Yu, Yingxin Pan, Chen Li, Zengxiu Zhang, Qin Shi, Wenpei Chu, Mingzhuo Liu, and Zhiting Zhu. 2016. User-centred design for Chinese-oriented spoken English learning system. Computer Assisted Language Learning 29, 5 (2016), 984–1000.