=Paper=
{{Paper
|id=None
|storemode=property
|title= Toward Robust Features for Remote Audio-Visual Classroom
|pdfUrl=https://ceur-ws.org/Vol-710/paper38.pdf
|volume=Vol-710
|dblpUrl=https://dblp.org/rec/conf/maics/SchlittenhartWSI11
}}
== Toward Robust Features for Remote Audio-Visual Classroom ==
Isaac Schlittenhart, Jason Winters, Kyle Springer, Atsushi Inoue∗
Eastern Washington University
Cheney, WA 99004 USA
Abstract

We present two studies on the robustness of feature extraction for a remote-classroom intelligent autopilot: (1) robust feature extraction and (2) a simple automated calibration of webcams. For robust feature extraction, the use of quantified vectors is studied as input features for the fuzzy classifiers in the Perceptual State Machine, i.e. our core Computational Intelligence model for this intelligent autopilot. The simple automated calibration of devices is studied mainly for the sake of maximizing device utility. These studies have shown promising results for actual use of this intelligent autopilot in ordinary classrooms that are not necessarily ideal for teleconference lectures.

Keywords: Remote Classroom, Autopilot, Perceptual State Machine, Fuzzy Classifiers.

Introduction

In this paper we present an improvement on the robustness of an intelligent autopilot for remote audio-visual classrooms. This intelligent autopilot, currently under development, recognizes students in a remote classroom who need their instructor's attention. It autonomously controls various audio and visual devices, such as microphones and CCD cameras, as deemed appropriate. Currently, we incorporate a simple Computational Intelligence model, the so-called Perceptual State Machine, i.e. a finite state machine that uses fuzzy classifiers as its transition functions, and we study mainly its feature extraction in order to achieve satisfactory performance. This paper describes our most recent progress on robustness against the lighting conditions of remote classrooms, which are often considered problematic for image processing.

Background and Motivation

Operating standard distance-learning remote classrooms requires skilled operators. This often makes the cost of remote classrooms prohibitive for smaller institutions and may result in additional costs to student tuition and the institution. Further, high-quality audio and visual devices often demand frequent calibration, which can require specialized, skilled technicians and generate yet another cost. As a consequence, real-time remote lectures are frequently considered infeasible and expensive, despite their potential and the need for them.

Due to the recent growth in Internet communication, cost-effective webcams, condenser microphones, projectors or large screens, and ordinary PCs (desktop and laptop) have become widely available. In addition to hardware, there has been a large increase in useful online services and methods of delivering remote content, such as remote desktop control, video chat, and conference calls. Considering such availability of off-the-shelf products, we aim to utilize them in order to keep implementation costs down.

Since all the products we employ are ready to use out of the box, the main issue is device control and integration. Our goal is a robust, intelligent autopilot that utilizes fuzzy sets, so that off-the-shelf products are maximally utilized while requiring very little or no calibration, i.e. virtually maintenance-free operation. In addition, it is ideal to make the entire remote-classroom system (consisting of off-the-shelf devices, PCs, software, and this intelligent autopilot) compact and portable, e.g. a few components on a cart. When such a system is available, remote lectures can be set up in any standard classroom within a matter of minutes by anyone (e.g. student assistants) without extensive technical training.

The Perceptual State Machine, our Computational Intelligence model, is essentially a finite state machine that uses fuzzy classifiers as its transition functions (Beaver and Inoue 2006). These fuzzy classifiers map between perceptual states naturally recognized by human beings (e.g. 'no session', 'need attention', and 'in session') and inputs (i.e. features) extracted from the video and audio streams captured through physical sensors such as webcams and condenser microphones. All previous studies on feature extraction have focused only on various pixel histograms over certain color ranges, mostly for the sake of simplicity and real-time response (Beaver and Inoue 2005; Moore et al. 2008). These studies include palm pixel counts and their extensions, frame pixel differencing, gesture recognition using hue-saturation-value, pixel count/position histograms, counting moving colored pixels, and locating a student via audio amplitude. These feature extractions performed at a satisfactory level and thus held promise under ideal lighting conditions, e.g. no sunshine coming into the room as a result of shutting the window shades.

∗ E-mail: inoueatsushij@gmail.com
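As a rough illustration of how such a machine might be structured, the sketch below pairs a finite state machine with fuzzy memberships acting as its transition functions. The state names follow the paper; the single activity feature and the triangular partitions are hypothetical placeholders, not the actual model.

```python
# Illustrative sketch only: states follow the paper, but the feature and the
# membership functions below are invented placeholders.

def tri(x, a, b, c):
    """Triangular fuzzy membership: 0 outside [a, c], peak 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

class PerceptualStateMachine:
    """Finite state machine whose transition moves to the state with the
    highest fuzzy membership for the current feature value."""
    def __init__(self):
        self.state = "no session"
        # Hypothetical fuzzy partitions over one activity feature in [0, 1].
        self.memberships = {
            "no session":     lambda x: tri(x, -0.5, 0.0, 0.4),
            "in session":     lambda x: tri(x, 0.2, 0.5, 0.8),
            "need attention": lambda x: tri(x, 0.6, 1.0, 1.5),
        }
    def step(self, feature):
        self.state = max(self.memberships,
                         key=lambda s: self.memberships[s](feature))
        return self.state
```

For example, a mid-range activity value lands in 'in session', while a high value transitions the machine to 'need attention'.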
Anticipated Improvements

The following two anticipated improvements are presented in this paper for improving robustness against lighting conditions that are not necessarily ideal but are rather common in many ordinary classrooms. If these anticipated improvements succeed, they represent a very significant technological advancement.

Toward robust question detection using quantified vectors. The well-known pitfall of the pixel-counting feature extractions concerns scalability and position shifts in the frame. Assuming the video frame was cropped into sections of interest, what would happen if someone's foot or hand protruded into the cropped area of interest, or the subject shifted rapidly in the frame? What if the individual in the frame had a large amount of skin exposed, or their skin was a different hue than expected? The result was a potentially false question state when no question existed, or a failure to detect a question at all. The use of quantified vectors presented here potentially solves this problem and seems well suited to larger scale.

Figure 1: Sample raised hand range

Toward robust images using color and lighting correction. Using off-the-shelf components has its benefits but also adds issues concerning quality control and calibration. We found that many inexpensive webcams do have automatic white balance and exposure controls, but these controls can be inadequate in various classroom lighting conditions. Two simple correction methods are discussed in this paper that likely improve the quality of the video images fed to the intelligent autopilot.

Question Detection Using Quantified Vectors

The detection of any object through computer vision has been an evolutionary process. The real-time requirement of this intelligent autopilot system puts some constraints on processing power. Because of this, more simplistic methodologies have been employed in detecting question states. Initially, in Beaver's work (Beaver and Inoue 2005), a method was proposed by which pixels matching the grayscale color of the palm were counted. Through the use of a fuzzy classifier, three states were identified: in session, no session, and question. Beaver's method worked well under ideally set-up conditions, but such pixel count methods can have difficulty with scale or with computing distance from the camera. A following work (Moore et al. 2008), while still focusing on pixel counting, includes pixel locations extracted from vertical and horizontal histograms of pixels in certain color ranges. Additionally, instead of using only the grayscale palm color range, skin detection using hue-saturation-value has been employed. This method has held promise and showed selected skin segments well enough when taken under a white-balanced condition.

Although the pixel count methods and skin detection data plots have been distinct and held promise, they are plagued by stability and reliability issues under the lighting conditions of ordinary classrooms. The premise of the intelligent autopilot system is a simple cart, requiring little to no calibration, to be wheeled into a classroom and function. Through experimentation, we have found that white balance affects color; more specifically, white balance varies with both internal and external light sources and with camera hardware. The variation in lighting causes the skin color detection algorithms to mistake objects in the room for skin. This had a very large impact on the ability to detect a question. Color change in image pixels also suffers from other problems that are difficult to address: variation in skin color, skin-colored objects in the room, individuals with large amounts of skin exposed, and shifts in scale and framing of individuals. Clearly, color change in image pixels alone is not robust for recognizing perceptual states of students in remote classrooms.

With further increases in computational power and recent advancements in devices, computational cost has become a less significant issue. As a result, more informative methods for detecting objects in video frames are becoming feasible. In this study, we propose a feature extraction method combining hand and face position data as quantified vectors for better robustness in perceptual state recognition.

Approach

Even in live classrooms, evaluating whether a student has a question is subjective. Since the likelihood of a question is not an absolute yes or an absolute no, fuzzy sets are best suited for determining such perceptual states of a classroom.

From a human perspective, we generally and naturally analyze the location of the hand in relation to the face. Since our minds are assumed to process the position of the hand in relation to the face, we do not compute this exactly; rather, we simply know (i.e. perceive) whether an individual has a question by observing them. Taking this into consideration, we utilize the centers of the face and the hand in such a way that a line can be drawn between them. We simply consider this a vector. As for its coordinates, given the inherent representation of such a vector in an image composed of rows of pixels, we use polar coordinates rather than Cartesian. In doing so, the vector coordinates of the data points can be represented as x-axis and y-axis distances from center to center, and their angles and magnitudes are inherently contained within them.
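To illustrate why fuzzy sets fit this grading problem, the sketch below maps the hand-to-face vector angle to a graded question likelihood rather than a hard yes/no. The function name and the partition values are our own illustrative choices, not the paper's.

```python
import math

# Sketch under assumptions: the fuzzy set below is hypothetical, chosen only
# to show how a question becomes a matter of degree rather than a hard yes/no.

def question_likelihood(theta):
    """Map the hand-to-face vector angle (radians) to a graded likelihood:
    near-vertical (hand directly above the face) scores close to 1.0."""
    upright = math.pi / 2      # hand straight above the face
    spread = math.pi / 3       # tolerance before the likelihood reaches 0
    return max(0.0, 1.0 - abs(theta - upright) / spread)
```

A hand straight overhead scores 1.0, one tilted 30 degrees off vertical scores 0.5, and a hand level with the face scores 0.0, giving the fuzzy classifier a smooth grade instead of a threshold.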
Since the position of the face can be described relative to the hand straightforwardly using this polar vector, the only remaining aspect to be addressed is how to actually recognize the hands and faces in the target image. Viola and Jones (2001) suggested a method that has proven highly effective in recognizing objects given a set of training images. Other studies have shown that the Haar-like classifiers proposed in their work are superior in recognition rate per CPU cycle to many other conventional methods (Santana et al. 2008). Since Haar-like classifiers are boosted classifiers trained on image integrals, our current concerns such as color, white balance, and skin-colored objects in the room no longer impact the feature extraction (Viola and Jones 2001).
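The image-integral trick behind Haar-like features can be sketched in a few lines: after one pass over the image, any rectangle sum (and hence any rectangle-difference feature) costs a constant number of lookups. This is a minimal stdlib sketch, not OpenCV's implementation; the function names are ours.

```python
# Integral image (summed-area table) and an O(1) two-rectangle Haar-like
# feature. Input is a grayscale image as a list of rows.

def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]   # one extra zero row/column
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle with top-left corner (x, y),
    computed from four table lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_two_rect(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

Because every feature evaluation is constant-time regardless of rectangle size, a boosted cascade of such features stays fast enough for real-time detection.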
Model

A single video frame is assumed to contain an image of the classroom. The three perceptual states of the classroom can then be outlined from a logical viewpoint as follows:

No session: No faces present in the video frame.

In session: Faces present in the video frame but no hands (if any) in question positions.

Question: Faces and raised hands present in the video frame.

First, proximities of faces and hands are presented as vectors in Cartesian coordinates such that Haar-like classifiers find the following:

• The centroid of the i-th face, Fi = (x1, y1).
• The centroid of the i-th hand, Hi = (x2, y2).
• The width of the face, s, i.e. a scalar.

Then the quantified vector V is constructed in Cartesian coordinates such that

V = ⟨x2 − x1, y2 − y1⟩ · s

Finally, this is converted into polar coordinates such that

V = ⟨r, θ⟩

where r = |V| = s · √((x2 − x1)² + (y2 − y1)²), θ = arctan((y2 − y1)/(x2 − x1)) if x2 − x1 > 0, and θ = π/2 if x2 − x1 = 0. Clusters of such vectors are summarized (i.e. their histograms are generated) in order to generate fuzzy sets for the fuzzy classifiers.

Experiment and Evaluation

The open source software library OpenCV provides the necessary tools to train Haar-like classifiers. The experimental procedure follows:

1. Train Haar-like classifiers for hands and faces using OpenCV. Alternatively, there are pre-configured Haar-like classifiers in OpenCV.
2. Use the trained Haar-like classifiers to detect faces and hands in a video frame.
3. Collect the quantified vectors in order to identify fuzzy classifiers for perceptual state recognition.

In our experiment, we used printed images of faces and hands and then captured video frames of them through a webcam, so that vector data points could be simulated as if actual images of actual students had been captured. The pre-configured face classifiers worked quite well in our experiment, while the pre-configured hand classifiers did not. As a result, we have encountered some training overhead for those classifiers (a future work). Some results can be seen in figure 2. The y-axis displays the range in which the hand can be located vertically from the head. This information, combined with the x-axis showing the horizontal hand-face distance, creates a visual map that can be used to define fuzzy partitions. As shown, the data groupings are very distinct, and the fuzzy classifiers should be well identified.

Figure 2: Sample data points

Findings

The robustness of using quantified vectors as features holds promise and seems to bypass many of the issues encountered with pixel/color methods. The distinct data sets in figure 2 show that this method will be able to recognize classroom states to a high degree of accuracy. However, a new set of issues is introduced. Haar-like classifiers may have trouble detecting objects if the object is rotated slightly. This could pose a problem for both faces and hands. A solution has been suggested by (Barczak, Johnson, and Messom 2005), and further investigation must follow. Another issue is the dependency of Haar-like classifiers on a number of parameters, including sample size, training parameters, and optimal training image size. For the time being, the default parameter settings appear to be sufficient. It also appears that a larger sample of images is highly demanded for satisfactory classifier training. If this is indeed the case, Haar-like classifiers may not be suitable for this intelligent autopilot. A further investigation into this critical issue is currently underway.
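The construction of V can be transcribed almost literally into code. The sketch below follows the model's equations; the paper leaves x2 − x1 < 0 unspecified, so as an assumption we fold it into the π/2 branch.

```python
import math

# Transcription of the model: V = <x2-x1, y2-y1> * s in Cartesian form,
# then V = <r, theta> with r = s * sqrt((x2-x1)^2 + (y2-y1)^2).

def quantified_vector(face_center, hand_center, face_width):
    (x1, y1), (x2, y2) = face_center, hand_center
    dx, dy = x2 - x1, y2 - y1
    r = face_width * math.hypot(dx, dy)    # r = |V| = s * sqrt(dx^2 + dy^2)
    if dx > 0:
        theta = math.atan(dy / dx)         # theta = arctan(dy / dx)
    else:
        theta = math.pi / 2                # dx == 0 case from the text;
                                           # dx < 0 is unspecified there and
                                           # is treated the same way here
    return r, theta
```

Scaling by the face width s makes the feature roughly invariant to the subject's distance from the camera, which is exactly the weakness of raw pixel counts.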
Figure 3: Image of excessive light coming into the camera

Figure 4: Image of insufficient light coming into the camera

Robust Images Using Color and Lighting Correction

Despite some extreme lighting conditions, such as direct sunlight coming into a classroom, sensory devices may still be able to capture images. In theory, our Computational Intelligence model is capable of recognizing states as long as the images are captured distinctly enough from noise, even if they are not of good quality for human eyes. The goal of this work is to maximize recognition performance through color and light correction. The following two steps are considered to maximize the quality of images coming into the system under extreme conditions:

1. Detect whether the appropriate magnitude of light is present.
2. If the magnitude of light is not appropriate (i.e. excessive or insufficient), perform a color correction.

Figure 5: Histogram of excessive light coming into the camera

Step 1: Lighting Conditions

Lighting conditions in a room can greatly affect the quality of an image taken through a digital camera such as a web camera. If the amount of light in a room is insufficient, there is a chance that the camera will not pick up details of objects in the room. If there is excessive lighting, the chances that the picture is washed out are very high. Figures 3 and 4 show two examples of how the amount of lighting can affect the image taken through a webcam. By looking at these images, it is possible to see how badly the details of an image can be lost depending on the lighting conditions at the camera.

It is impossible to correct images that have either too much or too little light, because details in the image will be missing. This means that the system for the virtual classroom needs an automated warning to alert the user when the amount of lighting coming into the camera is not appropriate for the system. The right amount of lighting depends on the camera being used. Each camera will have a different aperture size. The smaller the aperture, the less light enters the lens; the bigger the aperture, the more light is brought in (Busch 2008). Since most web cameras have a fixed aperture size, the lighting of the room will have to be adjusted to meet the needs of the camera selected for the virtual classroom. The traditional method to detect the correct amount of lighting for an image is the use of histograms (Busch 2008). If most of the weight of the histogram falls on the left side, there is not enough light in the room to preserve details. If the histogram has most of its weight on the right side, there is too much lighting in the room. See figures 5 and 6 for the histograms of the images above.

Once a histogram of an image is made, it is possible to automate the process of alerting the user to improper lighting conditions for the lens being used inside the classroom. OpenCV has several tools for building histograms of images (Bradski and Kaehler 2008). These histograms can be built and analyzed in real time. The data in the histograms can be placed into buckets, each containing the count of a certain color. If there is too much black in the image, there is insufficient light for the camera. If there is too much white, there is too much light for the camera. At this point it is unknown how much black and how much white indicates a problem with the lighting in the system.
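The bucketed-histogram check can be prototyped without OpenCV. The sketch below buckets grayscale values and applies a dark/bright fraction threshold of the kind discussed here; the 16-bucket layout, bucket boundaries, and the default 30-percent threshold (reported later in the paper) are our illustrative choices.

```python
# Sketch of the lighting check: bucket grayscale pixels and flag the frame
# when too many fall in the near-black or near-white bucket.

def lighting_condition(pixels, threshold=0.30, buckets=16):
    """pixels: iterable of grayscale values in 0..255.
    Returns 'insufficient', 'excessive', or 'reasonable'."""
    counts = [0] * buckets
    n = 0
    for p in pixels:
        counts[min(p * buckets // 256, buckets - 1)] += 1
        n += 1
    dark = counts[0] / n       # fraction in the near-black bucket
    bright = counts[-1] / n    # fraction in the near-white bucket
    if dark >= threshold:
        return "insufficient"
    if bright >= threshold:
        return "excessive"
    return "reasonable"
```

Exposing the threshold as a parameter matches the observation that rooms containing a lot of black or white to begin with need a user-adjusted value.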
Forty images were tested for this purpose: twenty were examples of insufficient light, while the other twenty contained too much light. It was found that if 30 percent of the pixels fell in the white range or the black range, the lighting conditions were not right for the room. If the room contains a lot of black or a lot of white to begin with, 30 percent may not be a good threshold for that room. This means that the threshold will have to be adjusted by the user, if needed, depending on the objects in the room.

Figure 6: Histogram of insufficient light coming into the camera

Result 1: Lighting Conditions

To test the system, 30 images were used: ten each of excessive, insufficient, and reasonable lighting. Each image was fed into the system to see if it detected the type of the input image. At this stage, "reasonable" lighting does not mean perfect lighting; the lighting in the room just needs to be the right amount for the system to work reasonably well. Of the images tested, all the poor images were correctly detected. However, out of the good images, two were detected as poor lighting conditions. Given that the purpose of this part of the system is just to alert the user that there could be a problem, this rate of false positives is acceptable.

Step 2: Color Correction

The next step in improving the image is white balancing the image, i.e. color correction. In any image, the light in the scene affects the colors of the image. The only type of light that does not is white light (light that comes from the sun). All artificial light sources impart a color cast known as temperature (Busch 2008). The temperature changes the color of the objects in the scene. The goal of white balancing is to take that cast out of the objects in the scene and restore their colors as they would appear in natural white light.

While most decent image processing software packages can white balance an image, the system for the virtual classroom needs to do it in real time, since the lighting conditions of the room may change while the system is in session as various lights are turned on or off. Another factor to consider is that two different cameras (even of the same model) might capture images with different color values (Bradski and Kaehler 2008). The goal of the experiment is to take an image of a known color value, see how the image is changed by the lighting conditions of the room, and correct the values of the image of the room by the values changed on the target.

The target for the test was a piece of white foam board picked up at an arts and crafts store. Given that white reflects all colors, and that we are looking for the color change of the target and not the color of the target itself, this material seemed a reasonable choice (Serway 1996). Next, two cameras were tested to see if they picked up different color values. This was done by putting three colored papers (green, red, and blue) in front of each camera and recording the values they reported. In each instance, camera number 2 recorded values two units more red than camera number 1. It is possible to calibrate the system by adding two units to the red channel of camera number 1 or by subtracting two units from camera number 2.

The next step is to set up the cameras to get input into the system. One camera always points at the target; the other camera always points into the room. The captured image is then color-corrected in accordance with the color change caused by the temperature of the lighting used in the scene. To summarize:

• There are two cameras.
• The color difference between the two cameras is tested and corrected.
• One camera points at a known color target.
• One camera points into the room (audience-facing).
• The color difference between the known color of the target and the color detected in the room is used to provide color correction for the audience-facing camera.

Result 2: Color Correction

Image color correction is subjective and based on human perspective. We feel that the results of our color correction method show promise. The corrected images' color was greatly improved: the brown tint caused by the temperature of the lighting disappeared, and the images looked more like they were taken in natural sunlight.

Integrated Test Result

We propose that using color-corrected images likely produces better results than previously tested methods for the system. In the original study, it was shown that it is possible to count the number of skin-tone pixels in the image to see whether a hand is raised or not (Moore et al. 2008). One problem is that if the temperature of the lighting in the room is not properly white balanced, objects in the room such as tables and clothing can look like flesh tones to the system. During the original test, it was found that the best accuracy of the system was 86 percent. Our test images showed an accuracy of 85 percent with good lighting in the room. With poor lighting, that accuracy dropped to about 64 percent, which indicates the impact proper lighting has on the system. After having the system correct the images, the results on the good images did not change, but the poor images reached 73 percent accuracy. This means that by having the system improve the images, it is possible to increase the accuracy of the system by correcting for the lighting conditions of the room.
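The two-camera correction summarized above can be sketched as a per-channel offset: the deviation of the white target's reported color from true white estimates the lighting cast, which is then subtracted from the room camera's pixels. The function names, the pure-RGB formulation, and the clamping behavior are our assumptions, not the paper's exact procedure.

```python
# Sketch of target-based color correction under the assumptions stated above.

def correction_offsets(target_rgb, reference_rgb=(255, 255, 255)):
    """Per-channel cast introduced by the scene lighting: how far the
    observed white target drifted from true white."""
    return tuple(t - r for t, r in zip(target_rgb, reference_rgb))

def correct_pixel(pixel, offsets):
    """Remove the lighting cast from one RGB pixel, clamping to 0..255."""
    return tuple(min(255, max(0, p - o)) for p, o in zip(pixel, offsets))
```

For instance, warm lighting that makes the white target read (255, 240, 210) yields offsets (0, -15, -45), so every room pixel gets its green and blue channels boosted back up, removing the brown tint.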
Conclusion

In this paper, two studies on the robustness of feature extraction from images are presented. They are necessary in order to achieve the goal of placing our intelligent autopilot in ordinary classrooms instead of the ideally set-up experimental environments of many image processing works.

The use of quantified vectors generated by Haar-like classifiers provides a more robust feature extraction that is virtually immune to many of the encountered issues with the scaling and positioning of objects in classroom images. Future work should easily allow for multiple individuals in each frame, since a hand could be mapped to the nearest face while generating the quantified vectors, whereas multiple individuals residing in one frame posed a large problem for methods based on changes in pixel colors. Quantified vectors as robust features would potentially increase accuracy when multiple individuals are present.

Furthermore, it has been shown that it is possible to improve the video quality through color and lighting correction. This simple method provides an automated solution to a problem commonly encountered in ordinary classrooms that are not necessarily suited for teleconference lectures. This suggests that the system may feasibly be used in any ordinary classroom (as long as a network connection is available) and that the state recognition may reach a satisfactory level.

The next step for this intelligent autopilot is mainly implementation and system integration toward the first complete prototype.

Acknowledgment

This research was conducted as 10-week course work for CSCD581 Computational Intelligence at EWU in Winter 2009. The authors would like to thank James Lamphere for additional system administration support, Brian Kamp and his students for looking at bright lights while waving at the camera, and Emily Schlittenhart for her help editing and proofing this paper.

References

Barczak, A. L. C.; Johnson, M. J.; and Messom, C. H. 2005. Real-time Computation of Haar-like Features at Generic Angles for Detection Algorithms. In Research Letters in the Information and Mathematical Sciences, ISSN 1175-2777.

Beaver, I., and Inoue, A. 2005. Perceptual Recognition of States in Remote Classrooms. In Proceedings of the International Conference of the North American Fuzzy Information Processing Society (NAFIPS 2005).

Beaver, I., and Inoue, A. 2006. Using Fuzzy Classifiers for Perceptual State Recognition. In International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU 2006).

Bradski, G., and Kaehler, A. 2008. Learning OpenCV. O'Reilly, first edition.

Busch, D. 2008. Mastering Digital SLR Photography. Thomson Course Technology.

Moore, Z. I.; Schlittenhart, I. W.; Simpson, D. M.; Sorna, C. T.; Springer, K. A.; and Inoue, A. 2008. Intelligent Autopilot for Remote Classroom: Feature Extraction. In Proceedings of the Midwest Artificial Intelligence and Cognitive Science Conference (MAICS 2008).

Santana, M. C.; Déniz-Suárez, O.; Antón-Canalís, L.; and Lorenzo-Navarro, J. 2008. Face and Facial Feature Detection Evaluation: Performance Evaluation of Public Domain Haar Detectors for Face and Facial Feature Detection. In Ranchordas, A., and Araújo, H., eds., VISAPP (2), 167–172. INSTICC.

Serway, R. A. 1996. Physics for Scientists and Engineers with Modern Physics. Saunders College Publishing, fourth edition.

Viola, P., and Jones, M. 2001. Robust Real-time Object Detection. International Journal of Computer Vision.