A Real-Time Vision Based System for Recognition of Static Dactyls of Albanian Alphabet

Eriglen Gani, Department of Informatics, Faculty of Natural Sciences, University of Tirana, eriglen.gani@fshn.edu.al
Alda Kika, Department of Informatics, Faculty of Natural Sciences, University of Tirana, alda.kika@fshn.edu.al
Bruno Goxhi, Department of Informatics, Faculty of Natural Sciences, University of Tirana, bruno.goxhi@fshnstudent.info

Abstract

The aim of this paper is to present a real-time vision based system that is able to recognize the static dactyls of the Albanian alphabet. We use the Kinect device as the image receiving technology; it has simplified the process of vision based object recognition, especially the segmentation phase. Unlike hardware based methods, our approach does not require signers to wear extra objects such as data gloves. Two pre-processing techniques, border extraction and image normalization, are applied to the segmented images. The Fourier transform is then applied to the resulting images, generating 15 Fourier coefficients that represent each gesture uniquely. Classification is based on a similarity distance measure, the Euclidean distance: the gesture with the lowest distance is considered a match. Our system achieves an accuracy of 72.32% and is able to process 68 frames per second.

1 Introduction

Sign language is used as a natural way of communication between hearing impaired people. It is very important for the inclusion of deaf people in society. There exists a gap in communication between hearing impaired people and hearing ones, which comes from the inability of hearing people to understand sign language. To overcome this gap, interpreters can be used most of the time. The other, more comfortable solution is the use of technology. Natural interfaces can capture signs and understand their meaning from body positions, hand trajectories and head movements. Using technology to capture, process and translate dactyls into a form understandable to non-deaf people would help deaf people integrate faster into society [GK16]. A real-time dactyl translator system would provide many facilities for this community. Many countries have tried to develop real-time sign language translators, as in [Ull11, GK15c, TL11]. Unfortunately, Albanian sign language (AlbSL) has not received as much attention as other languages.

Deaf people in Albania communicate in a way that is based on finger-spelled Albanian words [ANA13]. Although not very efficient, dactyls play an important role in this type of communication; they form the basis of communication for deaf people. The Albanian dactyl alphabet is composed of 36 dactyls. Among them, 32 are static dactyls and 4 are dynamic ones, obtained from consecutive sequences of frames. The dynamic dactyls are Ç, Ë, SH and ZH [ANA13]. Our work focuses only on the 32 static dactyls.

The two most widely used approaches for building real-time translator systems are hardware based and vision based [ZY14]. In hardware based methods the signers have to wear data gloves or some other marker devices, which is not very natural for them. Vision based methods are more challenging to develop but are more natural for deaf people. Their two most common problems are a) complex background and b) illumination change [ZY14]. Sometimes it is hard to distinguish human hands from other objects in the same environment, and sometimes shadows or lighting effects hinder the correct identification of the human hand. The Kinect sensor by Microsoft has simplified the process of vision based object recognition, especially the segmentation phase. It offers several advantages: it provides color and depth data simultaneously, it is inexpensive, the body skeleton can be obtained easily and it is not affected by the light [GK16]. We are using the Kinect sensor as the real-time image receiving technology for our work.

Our Albanian sign language translator system includes a limited set of number signs and dactyls. In the future other numbers, dynamic dactyls and signs will be integrated, making the system usable in many scenarios that require the participation of deaf people. One possible usage of the system is a program in a bar that could help deaf people place orders by combining number and dactyl gestures.

Until now no gesture data set has existed for Albanian sign language. We are trying to build a system that is able to translate static dactyl signs of Albanian sign language; in the future it will be extended to dynamic dactyls and other signs. Creating and continuously adding new signs to an Albanian gesture data set would help build a more reliable and useful recognition system for our sign language.

The rest of the paper is organized as follows. Section 2 summarizes related work. Section 3 presents an overview of the methodology and a brief description of each of its processes. Section 4 describes the experiment environment. Section 5 presents the experiments and results. Section 6 concludes the paper and outlines future work.

2 Related Work

Many researchers have followed different methodologies for building sign language recognition systems. They can be categorized into several types based on input data and hardware dependency. Signs, which are mostly performed by human hands, can be static or dynamic, and the recognition systems themselves are categorized as hardware based or vision based.

Much work has been done on integrating hardware based technologies to capture and translate sign gestures; among them the most widely used are data gloves. Authors at [Sud14] built a portable system for deaf people using a smart glove capable of capturing finger and hand movements. In general data gloves achieve high performance, but they are expensive and not a proper choice from a human-computer interaction perspective [GK15b].

Web cameras with an image processing system can be used in vision based approaches. Research at [SSKK16] presents a vision based methodology using web cameras to recognize gestures from Indian sign language; the system achieves a high recognition rate. Authors at [WKSE02] and [LGS08] use a color camera to capture input gestures and then SVM (Support Vector Machine) and Fuzzy C-Means, respectively, to classify hand gestures. Despite this, web cameras in general produce low quality images and cannot capture other body parts. It is also hard to generalize the algorithms for web cameras due to the many different shapes and colors of hands [GK15b].

The Kinect sensor by Microsoft has simplified the process of vision based object recognition. It has many advantages: it provides color and depth data simultaneously, it is inexpensive, the body skeleton can be obtained easily and it is not affected by the light. Various researchers are using the Microsoft Kinect sensor for sign language recognition, as in [GK15c], [SB13], [VAC13].

Vision based hand gesture recognition provides a more intuitive interaction with a system, but identifying and classifying hand gestures is a challenging task. Shape and movement play an important role in gesture categorization. A comparison between the two most widely used algorithms for shape recognition is done at [CBM07]. It compares Fourier descriptors (FD) and Hu moments in terms of performance and accuracy; the algorithms are compared against a custom and a real-life gesture vocabulary. Experimental results show that FD is more efficient in terms of accuracy and performance.

Research at [BGRS11] addresses the issue of feature extraction for gesture recognition. It compares moment invariants and Fourier descriptors in terms of invariance to certain transformations and discrimination power. ASL images were used to form the gesture dictionary. Both approaches found it difficult to classify some classes of ASL correctly.

Authors at [BF12] compare different methods for shape representation in terms of accuracy and real-time performance. The compared methods include region based moments (Hu moments and Zernike moments) and Fourier descriptors. The conclusions showed that Fourier descriptors have the highest recognition rate.

Shape is an important factor for gesture recognition, and there exist many methods for shape representation and retrieval. Among them, Fourier descriptors achieve good representation and normalization. Authors at [ZL+02] compare different shape signatures used to derive Fourier descriptors, among them complex coordinates, centroid distance, curvature signature and cumulative angular function. The article concludes that the centroid distance is significantly better than the other three signatures.

Sign language is not limited to static gestures; the majority of signs are dynamic. Research at [RMP+15] proposes a hand gesture recognition method using Microsoft Kinect. It uses two different classification algorithms, DTW and HMM, and discusses the pros and cons of each technique.

3 Methodology

Figure 1 gives an overview of the methodology followed in our work: real-time image retrieval, segmentation, hand contour tracing, Fourier transformation and gesture classification.

Figure 1: Methodology

Microsoft Kinect is used for real-time image retrieval. The Kinect consists of an RGB camera, an IR emitter, an IR depth sensor, a microphone array and a tilt. The RGB camera can capture three-channel data at a 1280 x 960 resolution at 12 FPS or at a 640 x 480 resolution at 30 FPS. In our work images have a 640 x 480 resolution at 30 FPS. The valid operating distance of the Kinect is approximately 0.8 m to 4 m [MSD16]. Due to these advantages, it has simplified the process of vision based object recognition, especially the segmentation phase.

Fourier descriptors can be derived from complex coordinates, the centroid distance, the curvature signature or the cumulative angular function. In our case the centroid distance is used, following [ZL+02]. After locating the center of the white pixels in the image, we calculate the distance of every border pixel from it. This gives the centroid distance function, a one-dimensional signature of the two-dimensional hand area.

The normalization process consists of extracting the same number of pixels, equally distributed, along the hand border. Choosing a lower number of border pixels decreases the system accuracy, while choosing a higher number decreases the system performance. In our case 128 pixels have been chosen.

Fourier descriptors are used to transform the resulting signature into the frequency domain. For each image, only the first 15 Fourier coefficients are used to define it uniquely; further Fourier coefficients do not affect system accuracy. Every input gesture is compared against a training data set using a similarity distance measure, the Euclidean distance. The gesture with the lowest distance is considered a match.

4 Experiment Environment

The experiment environment used for implementing and testing our real-time static dactyl recognition system is composed of the following hardware: a notebook with a 2.5 GHz Intel Core i5 processor, 6 GB of RAM and a 64-bit Windows 10 operating system. Microsoft Kinect for Xbox 360 is used as the real-time image retrieval technology. It generates 30 frames per second, can be used as an RGB camera and can also provide depth data.

The system was developed using .NET technology. Kinect for Windows SDK 1.8.0.0 was used as the library.
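The feature extraction pipeline of Section 3 (a centroid distance signature over 128 equally spaced border pixels, followed by the first 15 Fourier coefficients) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the index-based resampling and the division by the first coefficient for scale invariance are our own assumptions.

```python
import numpy as np

def fourier_descriptor(border_pixels, n_samples=128, n_coeffs=15):
    """Centroid-distance Fourier descriptor sketch (illustrative only)."""
    pts = np.asarray(border_pixels, dtype=float)
    # Normalization: keep a fixed number of equally distributed
    # border pixels (128 in the paper).
    idx = np.linspace(0, len(pts) - 1, n_samples).astype(int)
    pts = pts[idx]
    # Centroid-distance signature: distance of every border pixel
    # from the center of the hand pixels.
    centroid = pts.mean(axis=0)
    signature = np.linalg.norm(pts - centroid, axis=1)
    # Fourier transform of the signature; keep only the first
    # coefficients, scaled by |F0| so the descriptor is size-invariant.
    coeffs = np.abs(np.fft.fft(signature))
    return coeffs[1:n_coeffs + 1] / coeffs[0]
```

With this scaling, two contours that differ only in size yield (numerically) the same 15-element descriptor, which is what makes a plain Euclidean comparison against the training set meaningful.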
It provides a way to process Kinect signals and acts as the bridge between the Kinect device and our application. An overview of the system architecture is given in Figure 2: the static dactyl recognition application sits on top of the Kinect for Windows SDK, which in turn communicates with the RGB camera and the depth sensor.

Figure 2: System Architecture

Every pixel generated by the Kinect device carries information about its depth layer and its player index. By using the player index we focus only on pixels that are part of a human body [WA12]; all other pixels are excluded. By applying a constant threshold we can then obtain the human hand, since it is the part of the human body closest to the Kinect device [GK15a].

In order to perform the Fourier transform we have to generate a centroid function, which is based on the hand image contour. Theo Pavlidis' algorithm is used for hand contour tracing [Pav12]. The segmented hand is first transformed to greyscale, where each pixel is classified as either white or black; after applying Theo Pavlidis' algorithm, the resulting image contains only the border pixels of the human hand.

5 Experiment and Results

To test the proposed system, several experiments were conducted. Each experiment examines two aspects: accuracy and computational latency. The first experiment measured the accuracy of correct identification and classification of static dactyls; our system handles only static dactyls and is not able to identify and classify dynamic ones. First, a training data set was created. It contains 320 dactyl gestures taken from two different signers: each gesture is performed 5 times by each signer and is represented by 15 Fourier coefficients, giving 4800 coefficients in total (15 x 320). For real-time testing, 4 different signers were used. Each signer performed 5 gestures for each dactyl sign.
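The segmentation step described above (player-index masking followed by a constant depth threshold, so that only the hand, the body part closest to the sensor, survives) can be sketched as follows. This is a minimal NumPy illustration under assumptions of ours: the array layout and the `hand_band_mm` tolerance are hypothetical, not values from the paper or the Kinect SDK.

```python
import numpy as np

def segment_hand(depth_mm, player_index, hand_band_mm=150):
    """Sketch of depth-based hand segmentation (illustrative only)."""
    # Keep only pixels the sensor attributes to a tracked player,
    # i.e. pixels that are part of a human body.
    body = player_index > 0
    depth = np.where(body, depth_mm, np.inf)   # ignore background pixels
    # The hand is the body part closest to the device, so a constant
    # threshold above the nearest body pixel isolates it.
    nearest = depth.min()
    return depth <= nearest + hand_band_mm     # binary hand mask
```

The resulting binary mask is what the greyscale conversion and the contour tracing step operate on.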
In total they performed 640 experiments. Each element of the testing data set is compared against all elements of the training data set, and the element with the lowest Euclidean distance is considered a match. The average recognition accuracy for each static dactyl is given in Table 1 and Table 2.

Table 1: Average Recognition Accuracy of Testing Data Set

Static Dactyl   True Recognition Rate (%)   False Recognition Rate (%)
A               100                         0
B               87.5                        12.5
C               70                          30
D               67                          33
DH              72.5                        27.5
E               70                          30
F               60                          40
G               75                          25
GJ              72                          28
H               87.5                        12.5
I               82.5                        17.5
J               98.75                       1.25
K               100                         0
L               50                          50
LL              52.5                        47.5
M               85                          15
N               55                          45

Table 2: Average Recognition Accuracy of Testing Data Set (continued)

Static Dactyl   True Recognition Rate (%)   False Recognition Rate (%)
NJ              87.5                        12.5
O               65                          35
P               54                          46
Q               52.5                        47.5
R               65                          35
RR              85                          15
S               84                          16
T               67.5                        32.5
TH              54                          46
U               67.5                        32.5
V               97.5                        2.5
X               87.5                        12.5
XH              60                          40
Y               50                          50
Z               52                          48

For all static dactyls, the system achieves an average accuracy rate of 72.32%. The results show that the dactyls with the highest accuracy rate are 'A', 'J', 'K' and 'V', whose accuracy rate is above 95%. The dactyls with the lowest accuracy rate are 'L', 'Y' and 'Z', whose accuracy rate is not above 52%.

Table 3 and Table 4 give information on the confusion percentages of the static dactyls. Some Albanian dactyls are easily confused with other dactyls due to their similarity; based on the experimental results, the dactyls "D", "E", "F", "N" and "O" are the most confused ones.

The second experiment deals with system performance. We want to achieve a performance that allows the system to be deployed in real time. For every sign we analyzed the time required for the following phases: hand segmentation, hand contour tracing, normalization, centroid function generation, Fourier transformation and gesture classification. Table 5 summarizes the results.

Table 5: Computational Latency Results

Phase                          Computational Latency (ms)
Hand Segmentation              5.49308
Hand Contour Tracing           0.08145
Normalization                  0.03261
Centroid Function Generation   0.03705
Fourier Transformation         2.45306
Gesture Classification         6.67130
Total                          14.7686

The system needs approximately 12 to 17 ms to process a static dactyl. Most of the overall time is consumed by the hand segmentation and gesture classification processes, which together occupy approximately 82% of the total time. The system can therefore be deployed without any latency in a real-time system that uses Microsoft Kinect.

6 Conclusion and Future Work

The aim of this paper was to build a real-time system that is able to recognize the static dactyls of the Albanian alphabet by using Microsoft Kinect. The Albanian alphabet is composed of 36 dactyls, 32 of which are static; the static dactyls are used as the inputs of our system. The Kinect device provides a vision based approach and is used as the image retrieval technology; its main feature is the depth sensor. For every static dactyl, a data set of 15 Fourier coefficients was built, so the data set consists of 4800 coefficients in total. For testing purposes, 4 different signers were used; each of them performed each static dactyl 5 times, for a total of 640 experiments. For classification, a similarity distance measure, the Euclidean distance, was used: every element of the testing data set is compared against each element of the training data set, and the element with the lowest Euclidean distance is considered a match. The system was tested for accuracy and performance. Based on the experimental results, the system achieves an accuracy rate of 72.32% and needs on average 14.05 ms to compute a static dactyl. It can be deployed with an image receiving technology that generates 68 frames per second.
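The classification rule used throughout the experiments, a nearest-neighbour match by Euclidean distance over the 15 Fourier coefficients, can be sketched as follows. The data layout (label plus coefficient vector pairs) is an assumption for illustration, not the authors' data structure.

```python
import numpy as np

def classify(sample, training_set):
    """Return the label of the training element nearest to `sample`
    in Euclidean distance (illustrative sketch only)."""
    best_label, best_dist = None, np.inf
    for label, coeffs in training_set:      # e.g. 320 gestures, 15 coefficients each
        dist = np.linalg.norm(np.asarray(sample) - np.asarray(coeffs))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

Because every test element is compared against all 320 training elements, this step dominates the latency budget together with segmentation, consistent with the timings in Table 5.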
Table 3: Confusion Dactyls Percentages (%)

Static Dactyl   Confusion Dactyls Percentages
A               {A,100}
B               {B,87.5}; {TH,12.5}
C               {C,70}; {E,12}; {X,12}; {Y,6}
D               {B,20}; {D,67}; {DH,5}; {H,5}; {U,3}
DH              {DH,72.5}; {E,2.5}; {F,12.5}; {U,12.5}
E               {C,2.5}; {E,70}; {J,2.5}; {TH,12.5}; {V,12.5}
F               {B,12.5}; {DH,12.5}; {F,60}; {TH,2.5}; {U,12.5}
G               {G,75}; {I,12.5}; {J,12.5}
GJ              {GJ,72}; {X,8}; {Z,20}
H               {B,5}; {F,5}; {H,87.5}; {TH,2.5}
I               {I,82.5}; {J,15}; {Y,2.5}
J               {I,1.25}; {J,98.75}
K               {K,100}
L               {L,50}; {XH,35}; {Z,15}

Table 4: Confusion Dactyls Percentages (%)

Static Dactyl   Confusion Dactyls Percentages
LL              {GJ,32.5}; {LL,52.5}; {M,15}
M               {E,12.5}; {M,85}; {Y,2.5}
N               {A,12.5}; {N,55}; {O,15}; {Q,15}; {Z,2.5}
NJ              {GJ,12.5}; {NJ,87.5}
O               {C,2.5}; {N,2.5}; {O,65}; {P,12.5}; {Q,17.5}
P               {O,23}; {P,54}; {R,23}
Q               {N,25}; {O,22.5}; {Q,52.5}
R               {J,18}; {N,17}; {R,65}
RR              {J,15}; {RR,85}
S               {S,84}; {Z,16}
T               {P,32.5}; {T,67.5}
TH              {DH,15}; {P,16}; {Q,15}; {TH,54}
U               {N,32.5}; {U,67.5}
V               {E,2.5}; {V,97.5}
X               {K,12.5}; {X,87.5}
XH              {DH,12.5}; {TH,15}; {X,12.5}; {XH,60}
Y               {A,25}; {J,25}; {Y,50}
Z               {DH,20}; {F,14}; {N,14}; {Z,52}

Future work consists of improving the overall system performance and accuracy by building a more reliable data set. This can be done by including more diverse signers who have good knowledge of Albanian sign language. Future work also consists of adding dynamic dactyls, as well as other gestures of Albanian sign language.

References

[ANA13] ANAD. Gjuha e Shenjave Shqipe 1. Shoqata Kombëtare Shqiptare e Njerëzve që nuk Dëgjojnë, 2013.

[BF12] Salah Bourennane and Caroline Fossati. Comparison of shape descriptors for hand posture recognition in video. Signal, Image and Video Processing, 6(1):147–157, 2012.

[BGRS11] Andre L. C. Barczak, Andrew Gilman, Napoleon H. Reyes, and Teo Susnjak. Analysis of feature invariance and discrimination for hand images: Fourier descriptors versus moment invariants. In International Conference Image and Vision Computing New Zealand IVCNZ2011, 2011.

[CBM07] Simon Conseil, Salah Bourennane, and Lionel Martin. Comparison of Fourier descriptors and Hu moments for hand posture recognition. In Signal Processing Conference, 2007 15th European, pages 1960–1964. IEEE, 2007.

[GK15a] Eriglen Gani and Alda Kika. Identifikimi i dores nepermjet teknologjise Microsoft Kinect. Buletini i Shkencave te Natyres, 20:82–90, 2015.

[GK15b] Eriglen Gani and Alda Kika. Review on natural interfaces technologies for designing Albanian sign language recognition system. The Third International Conference On: Research and Education Challenges Towards the Future, 2015.

[GK15c] Archana S. Ghotkar and Gajanan K. Kharate. Dynamic hand gesture recognition and novel sentence interpretation algorithm for Indian sign language using Microsoft Kinect sensor. Journal of Pattern Recognition Research, 1:24–38, 2015.

[GK16] Eriglen Gani and Alda Kika. Albanian sign language (AlbSL) number recognition from both hand's gestures acquired by Kinect sensors. International Journal of Advanced Computer Science and Applications, 7(7), 2016.

[LGS08] Yun Liu, Zhijie Gan, and Yu Sun. Static hand gesture recognition and its application based on support vector machines. In Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2008. SNPD'08. Ninth ACIS International Conference on, pages 517–521. IEEE, 2008.

[MSD16] MSDN. Kinect for Windows sensor components and specifications, April 2016.

[Pav12] Theodosios Pavlidis. Algorithms for Graphics and Image Processing. Springer Science & Business Media, 2012.

[RMP+15] J. L. Raheja, M. Minhas, D. Prashanth, T. Shah, and A. Chaudhary. Robust gesture recognition using Kinect: A comparison between DTW and HMM. Optik - International Journal for Light and Electron Optics, 126(11-12):1098–1104, June 2015.

[SB13] Kalin Stefanov and Jonas Beskow. A Kinect corpus of Swedish sign language signs. In Proceedings of the 2013 Workshop on Multimodal Corpora: Beyond Audio and Video, 2013.

[SSKK16] S. Shruthi, K. C. Sona, and S. Kiran Kumar. Classification on hand gesture recognition and translation from real time video using SVM-KNN. International Journal of Applied Engineering Research, 11(8):5414–5418, 2016.

[Sud14] B. H. Sudantha. A portable tool for deaf and hearing impaired people, 2014.

[TL11] Pedro Trindade and Jorge Lobo. Distributed accelerometers for gesture recognition and visualization. In Technological Innovation for Sustainability, pages 215–223. Springer, 2011.

[Ull11] Fahad Ullah. American sign language recognition system for hearing impaired people using Cartesian genetic programming. In Automation, Robotics and Applications (ICARA), 2011 5th International Conference on, pages 96–99. IEEE, 2011.

[VAC13] Harsh Vardhan Verma, Eshan Aggarwal, and Swarup Chandra. Gesture recognition using Kinect for sign language translation. In Image Information Processing (ICIIP), 2013 IEEE Second International Conference on, pages 96–100. IEEE, 2013.

[WA12] Jarrett Webb and James Ashley. Beginning Kinect Programming with the Microsoft Kinect SDK. Apress, 2012.

[WKSE02] Juan Wachs, Uri Kartoun, Helman Stern, and Yael Edan. Real-time hand gesture telerobotic system using fuzzy c-means clustering. In Automation Congress, 2002 Proceedings of the 5th Biannual World, volume 13, pages 403–409. IEEE, 2002.

[ZL+02] Dengsheng Zhang, Guojun Lu, et al. A comparative study of Fourier descriptors for shape representation and retrieval. In Proc. 5th Asian Conference on Computer Vision. Citeseer, 2002.

[ZY14] Yanmin Zhu and Bo Yuan. Real-time hand gesture recognition with Kinect for playing racing video games. In 2014 International Joint Conference on Neural Networks (IJCNN). IEEE, July 2014.