
Geometrically and Temporally Consistent Visual Annotation for Smart Glasses

Kanade Sumino

Naoya Wakita

Ikuhisa Mitsugami

Hiroshima City University, 3-4-1, Ozuka-Higashi, Asaminami-ku, Hiroshima, 731-3194, JAPAN

In this study, we propose a wearable face recognition system using commercially available smart glasses. The system makes two technical contributions. First, we propose a geometric calibration between the display area in the user's visual field and the camera mounted on the smart glasses, so that visual annotations are correctly overlaid on the physical world observed by the user's eyes. Second, we propose a method for reducing the delay in showing the visual annotation, in order to maintain geometric and temporal consistency. We developed the whole system and experimentally confirmed that it could show geometrically and temporally correct annotations.

Keywords: wearable system, face detection, face recognition, calibration, multi-processing

1. Introduction

Many readers will have experienced situations where they could not recall the name or affiliation of a person they happened to meet, even though they knew his/her face. It would be useful if there were a system that superimposes the name and affiliation of the person on his/her face in the user’s field of view. Such a system is also useful in many other situations, such as helping elderly people with dementia to recall the people around them. In this study, we thus propose a wearable face recognition system using commercially available smart glasses. The system performs face detection and recognition processing and shows visual annotations on a transparent display in the user’s field of view, which enables the user to know the names and attributes of the people in front of him/her. There are two technical challenges to realizing this system. The first challenge is that the annotations must be shown appropriately at the face in the user’s field of view, for which we propose a geometric calibration method between the display in the field of view and a camera. The second challenge is that, even after the geometric calibration is done, the delay of the face detection and visual annotation causes misalignment of the annotation, since the user and the person in front of him/her are always moving. To solve this problem, we propose a multi-process architecture in which the face recognition and the visual annotation run as separate processes. This architecture significantly reduces the delay and realizes geometrically and temporally correct visual annotation.

2. System Configuration

Augmented Reality (AR) is a technology that superimposes digital information on the field of view of a person observing the real world to enable a visually augmented representation of reality. Most existing studies use special devices such as Microsoft HoloLens or video see-through VR goggles. Those devices are useful as they offer functions for maintaining the geometric and temporal consistency between the annotations and the real world. However, due to their weight and conspicuous appearance, they are not suitable for wearing in daily life. For daily use, smart glasses can be a good alternative. Though they usually do not have functions for maintaining geometric and temporal consistency, they are small and lightweight and look like ordinary glasses, which are important characteristics for practical use. In this study, therefore, we use the optically transparent smart glasses EPSON Moverio BT-30E [1]. They have binocular transparent displays that appear around the center of the user’s field of view and a camera that captures his/her field of view. Figure 1 shows an overview of the system. The device is connected to a PC via USB, and the displays and camera are recognized as an external display and a webcam, respectively.

... show the visual annotation at the face in the user’s field of view. Since the positions of the camera and the eyes are not identical, it is necessary to geometrically calibrate them in advance to realize the positionally correct visual annotation. The following sections describe each step of the proposed system.

3.1. Face detection

For face detection, we apply a Haar-like feature-based face detector [2]. When a face is detected within the area corresponding to the display in the user’s field of view, the face is cropped and saved in the system. When multiple people are detected at the same time, the system selects only the person closest to the center of the image.
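The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_face`, the `(x, y, w, h)` box format, and the criterion that a face counts as "within the display area" when its box center lies inside that area are all choices made for this sketch, not details given in the paper. Boxes in this format could come, for example, from OpenCV's `CascadeClassifier.detectMultiScale`.

```python
def select_face(faces, image_size, display_region):
    """Return the face box nearest the image center, or None.

    faces:          list of (x, y, w, h) boxes in camera-image pixels.
    image_size:     (width, height) of the camera image.
    display_region: (x0, y0, x1, y1), the part of the camera image that
                    corresponds to the display (assumed to be known).
    """
    cx, cy = image_size[0] / 2, image_size[1] / 2
    x0, y0, x1, y1 = display_region
    best, best_d2 = None, None
    for (x, y, w, h) in faces:
        fx, fy = x + w / 2, y + h / 2          # center of the face box
        # Assumption of this sketch: keep only faces whose center
        # falls inside the display region.
        if not (x0 <= fx <= x1 and y0 <= fy <= y1):
            continue
        d2 = (fx - cx) ** 2 + (fy - cy) ** 2   # squared distance to center
        if best is None or d2 < best_d2:
            best, best_d2 = (x, y, w, h), d2
    return best


# Hypothetical detections in a 640 x 480 camera image.
faces = [(0, 0, 50, 50), (300, 200, 60, 60), (250, 180, 40, 40)]
chosen = select_face(faces, (640, 480), (160, 120, 480, 360))
```

Here the first box is discarded because it lies outside the display region, and the second is chosen because its center is nearest the image center.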

OpenFace [3], which is one of the popular face recognition libraries, is then applied to the cropped face images. OpenFace calculates similarities between the face image and the images of people stored in a database and returns a confidence level (in the range of 0 to 1) for each person. In our system, we experimentally set the threshold of the confidence level to 0.5.

3.2. Geometric Calibration for Visual Annotation

While the face detection is performed on the images captured by the camera, the visual annotation for the detected face should be shown on the display in the user’s field of view. It is thus required to obtain the relation between the camera and display coordinates, as shown in Figure 2.

To obtain the relation, we propose a homography-based calibration method. As a homography matrix is a 3 × 3 matrix with a scale uncertainty, it has 8 unknowns, which means that four pairs of points are required to calculate it. A common approach is to capture the display showing (no fewer than) four points with the camera to obtain the four pairs of points. In the case of this system, however, it is impossible to capture the display with the camera. To solve this problem, we propose to obtain the four pairs of points in an indirect way as follows. We first put a landmark point in the real world, and ask a user wearing the system to gaze at the landmark while matching each corner of the display to the landmark in his/her sight while the camera captures the landmark, as shown in Figure 3. We then integrate those four cases corresponding to the four corners, as in Figure 4. As shown in this figure, this process gives four pairs of points on the camera image and the display in the user’s field of view, so that it is possible to calculate the homography matrix between those image coordinates. Once the homography is obtained, the position of the face in the camera coordinates can be transformed into the display coordinates, so that the visual annotation can be shown correctly at the face in the user’s field of view, as shown in Figure 5.

Note that this homography-based calibration assumes that 1) the four landmarks and the face to be recognized are located on the same plane, or 2) the centers of the eyes and the camera are colocated. Though neither is strictly fulfilled, the second assumption is reasonable because the distance between the eyes and the camera is small compared with the distance to the face. Besides, considering the first assumption, it is desirable that the landmark be at a similar depth to the face to be recognized in actual use.
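To make the calibration step concrete, the following is a minimal sketch of estimating a homography from exactly four point pairs and applying it to a detected face position. It fixes the scale by setting the last matrix entry to 1 and solves the resulting 8 × 8 linear system directly. The function names and all coordinates are hypothetical values chosen for illustration, and the 1280 × 720 display size is an assumption of the example. In practice a library routine such as OpenCV's `cv2.getPerspectiveTransform` could serve the same purpose.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_pairs(camera_pts, display_pts):
    """Estimate the 3x3 matrix H (h33 = 1) mapping camera to display points."""
    a, b = [], []
    for (x, y), (u, v) in zip(camera_pts, display_pts):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), same form for v
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def camera_to_display(h, point):
    """Map a camera-image point into display coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


# Hypothetical calibration data: where the landmark appeared in the camera
# image while the user aligned it with each of the four display corners.
camera_pts = [(200, 150), (440, 150), (440, 330), (200, 330)]
display_pts = [(0, 0), (1280, 0), (1280, 720), (0, 720)]

H = homography_from_pairs(camera_pts, display_pts)
face_on_display = camera_to_display(H, (320, 240))  # approximately (640, 360)
```

With more than four pairs, a least-squares estimate would be the natural extension; the four-pair case is shown because it matches the procedure described above.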

Even after the geometric calibration, the visual annotation can still be misaligned due to the delay caused by the face recognition process. For example, as shown in Fig. 6, even when the annotation is drawn at the position where the face was detected, if the person in the real world moves during the process, a gap occurs between the annotation and the person in the user’s sight. In this study, we thus propose a method for reducing the delay by separating the whole process into one for the face recognition and one for calculating the position of the visual annotation, considering that the face recognition process takes much longer than the others and that the identity of the person in front of the user does not change frequently while the face position changes frame by frame. Figure 7 shows the idea of the proposed method. By separating the face recognition process from the others, the facial annotation can be shown at a high frame rate with very little delay.

Figure 7: Multi-process processing for reducing delays.

4. Experiments

We experimentally evaluated the performance of the proposed method. We implemented the system and asked a participant (user) to keep looking at another person who was moving in front of him/her. It was confirmed that the visual annotation was always shown at the face in the user’s sight even when the person was moving. Table 1 shows the effect of the proposed method. By applying the multi-process method, the delay was reduced by 90%.

5. Conclusion

In this paper, we proposed a wearable face recognition system using commercially available smart glasses. The system performs face detection and recognition processing and shows visual annotations on a transparent display in the user’s field of view, which enables the user to know the names and attributes of the people in front of him/her. The main contributions of this system are 1) the geometric calibration between the camera and the display in the user’s sight, and 2) the multi-processing for reducing the delay in showing the visual annotation. We confirmed the effectiveness of those contributions by actually implementing the system and performing the experiments.

Future work includes making the system smaller and lighter for practical use. In the current system, the smart glasses are connected to a desktop PC, but in practical situations, the PC must be replaced by a wearable mobile PC or a smartphone. Another important issue for practical use is how to register the people to be recognized.
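As a supplementary illustration of the multi-process design for reducing annotation delay, the following Python sketch separates a slow recognition worker from a fast per-frame loop using the standard `multiprocessing` module. It is not the authors' implementation: `slow_recognizer`, the simulated face position, the queue layout, and all timing values are hypothetical stand-ins chosen for this sketch.

```python
import multiprocessing as mp
import queue
import time


def recognition_worker(face_queue, name_queue, recognize):
    """Slow path: label each cropped face and publish the result."""
    while True:
        face = face_queue.get()
        if face is None:              # sentinel: shut down the worker
            break
        name_queue.put(recognize(face))


def latest(q, current):
    """Drain q without blocking; return its newest item, else `current`."""
    while True:
        try:
            current = q.get_nowait()
        except queue.Empty:
            return current


def slow_recognizer(face):
    """Hypothetical stand-in for face recognition; sleeps to mimic its cost."""
    time.sleep(0.2)
    return "person-A"


def run_demo(n_frames=30, frame_period=0.03):
    """Fast path: update the position every frame, reuse the latest name."""
    face_queue = mp.Queue(maxsize=1)  # at most one crop waiting
    name_queue = mp.Queue()
    worker = mp.Process(target=recognition_worker,
                        args=(face_queue, name_queue, slow_recognizer),
                        daemon=True)
    worker.start()
    name, shown = None, []
    for frame in range(n_frames):
        position = (100 + frame, 80)  # simulated per-frame face position
        try:
            face_queue.put_nowait(("crop", frame))
        except queue.Full:
            pass                      # recognizer busy: skip this crop
        name = latest(name_queue, name)
        shown.append((position, name))  # annotation drawn for this frame
        time.sleep(frame_period)
    face_queue.put(None)
    worker.join()
    return shown
```

Calling `run_demo()` from under an `if __name__ == "__main__":` guard runs the loop at the frame period regardless of how long recognition takes. The first few frames carry no name, and later frames reuse the most recent identity while the position stays current, which is the behavior the design relies on.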

[1] https://www.epson.jp/products/moverio/bt35e/

[2] P. Viola, M. Jones, “Rapid object detection using a boosted cascade of simple features,” Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001.

[3] B. Amos, B. Ludwiczuk, M. Satyanarayanan, “OpenFace: A general-purpose face recognition library with mobile applications,” CMU-CS-16-118, CMU School of Computer Science, Tech. Rep., 2016.