University of Victoria
Faculty of Engineering
Tracking hand motion with a color dotted glove for sign language recognition
ASM Saeedul Alam
Electrical Engineering
September 4, 2012
TABLE OF CONTENTS
List of Tables and Figures
Summary
Glossary
1.0 Introduction
2.0 American Sign Language (ASL)
3.0 Hand Tracking Technologies
    3.1 Tracking with interface
        3.1.1 Optical Tracking
            3.1.1.1 Marker Systems
            3.1.1.2 Silhouette Analysis
        3.1.2 Magnetic tracking
        3.1.3 Acoustic tracking
    3.2 Glove Tracking
4.0 Interpreting sign language
    4.1 Using a recurrent neural network
    4.2 Using a hidden Markov model
5.0 Proposed Methodology
    5.1 Glove Design
        5.1.1 Glove vs. Bare Hand Tracking
    5.2 Rasterizing the frame
    5.3 Color and pixel correcting
    5.4 Indexing the library database
    5.5 Matching and tracking
    5.6 Pose estimation and finding the nearest neighbor
    5.7 Blending nearest neighbor
    5.8 Recognizing ASL
6.0 Discussion
7.0 Conclusion
8.0 Recommendations
References
LIST OF TABLES AND FIGURES
FIGURES
Figure 2.0: Usage of ASL
Figure 3.1.1.2: Krueger's manipulation of graphics by hand
Figure 4.1.1: Sign language word recognition system using a recurrent neural network
Figure 4.1.2: Recurrent neural networks
Figure 4.2: The four-state HMM used for recognition
Figure 5.1: Glove design
Figure 5.1.1: Bare hand estimation and edge detection
Figure 5.6: Hausdorff-like image distance
Figure 5.8: Interpretation of sign language alphabets
TABLES
Table 5.3: Estimating neighboring colors
Table 6.1: Technical details comparison
Table 6.2: Comparison of advantages and limitations
Summary
Sign language is an essential communication tool for deaf people. Sign language use is not
associated with a specific ethnicity, location, or even household. Rather, people learn ASL
because they are deaf, hearing impaired or, less commonly, speech impaired, or because they
have family or friends who sign. Because sign language is not practiced in all walks of life, a
disabled person faces difficulties in daily conversations. To address this problem, hand and
finger gestures can be tracked, and sign language can be recognized by a computer system and
then translated using a voice output device.
For the last two decades, hand-tracking systems have been widely used in industrial applications,
virtual reality, and medicine, but their expense and complexity have kept them out of reach of
regular consumers. The purpose of this report is to reduce the gap in communication between
deaf and hearing people by developing an inexpensive and simple hand-tracking system that can
be used for interpreting and translating sign language. The report focuses on American Sign
Language (ASL) because it is practiced throughout the world. Different hand-tracking
technologies and two possible strategies for interpreting ASL are discussed and compared in
terms of their advantages and limitations. Based on this comparison, the report proposes a
consumer hand-tracking system using a real-time, data-driven pose estimation technique that
requires only a webcam and a polymer glove with a specific pattern. The glove design and simple
algorithms make it possible to employ a nearest-neighbor approach to track hands at interactive
rates. The tracked motion can then be interpreted and translated using a sign language
recognition library.
Glossary
Silhouette A silhouette is the image of a person, an object or scene
represented as a solid shape of a single color, usually black, its
edges matching the outline of the subject
Virtual Reality Virtual reality (VR) is a term that applies to computer
simulated environments that can simulate physical presence in
places in the real world, as well as in imaginary worlds.
Neural networks The term neural network was traditionally used to refer to a
network or circuit of biological neurons. The modern usage of the
term often refers to artificial neural networks, which are
composed of artificial neurons or nodes.
Rasterisation Rasterisation is the task of taking an image described in a vector
graphics format (shapes) and converting it into a raster
image (pixels or dots) for output on a video display or printer, or
for storage in a bitmap file format.
Degree of Freedom (DOF) The number of independent parameters that define the configuration
of a system; for a hand model, the independent motions of the joints
and of the hand as a whole.
Hausdorff distance Hausdorff distance measures how far two subsets of a metric
space are from each other. It turns the set of non-empty compact
subsets of a metric space into a metric space in its own right. It is
named after Felix Hausdorff.
Gaussian radial basis kernel A real-valued kernel whose value depends only on the distance
between its two arguments.
1.0 Introduction
Conveying meaning using hand shapes, body language, and expression, otherwise known as
sign language, is an essential tool for visually, hearing, and speech impaired people to establish
communication with other parties. Because of its diverse patterns and complexity, a disabled
person who excels at sign language often fails to communicate with listeners who do not know
it. As sign language is not practiced in all walks of life, a disabled person faces difficulties in daily
conversations. Speech synthesizers and generators (e.g., text to speech) are widely used by
people with visual impairment or reading disabilities, but communicating with a listener through
this type of device is rarely fluid and requires a significant amount of time. Computer recognition
of sign language can instead enable communication with hearing, visually, or speech impaired
people. Articulated finger-tracking systems have been widely used in professional and scientific
arenas, but they are rarely developed for consumer applications because of their price and
complexity.
To date, hand-tracking systems have been developed using different technologies, e.g., optical
tracking of LEDs or infrared-reflecting markers, image-based visual tracking, magnetic tracking,
acoustic tracking, LED gloves, and digital data entry gloves. In this report, different methods of
finger tracking that can be used for computer recognition of sign language are discussed. Based
on facts and research, the report proposes a simple but effective system for real-time tracking of
hand motion that requires only a webcam and a cloth glove with color markers placed in a
custom pattern. A database library is created for reference, and an estimated pose is deduced
to confirm the track.
2.0 American Sign Language (ASL)
American Sign Language (ASL) is the language of choice for most deaf people in the United
States, Canada, and Africa [1]. ASL is a sign language in which the hands, arms, head, facial
expression, and body language are used to speak without sound. ASL features an entirely
different grammar and vocabulary from phonetic languages such as English [2]. Although the
number of ASL speakers is unknown, there were 2.5 million deaf people in the United States in
the year 2000 who depended on ASL [1]. The availability of ASL throughout the world is shown
in Figure 2.0.
ASL's grammar allows more flexibility in word order than English and sometimes uses
redundancy for emphasis. ASL uses approximately 6000 gestures for common words and
communicates obscure words or proper nouns through finger spelling [2]. Because of ASL's
availability and ease of use, this report focuses on ASL.
Figure 2.0: Usage of ASL [2]. (Map legend: ASL is the national sign language; ASL is used
alongside other sign languages; insignificant use of ASL.)
3.0 Hand Tracking Technologies
The tracking system is developed with a focus on user-data interaction. The objective is to
establish communication between the human and the computer and to synchronize virtual data
with the hand gestures. Different types of tracking technologies have been used to track the 3D
position of the hand and to capture finger configuration. The history of hand tracking goes back
to the post-WWII development of master-slave manipulator arms, and even earlier to the
Renaissance development of the pantograph [3]. In this section, different types of tracking
technologies are discussed, with emphasis on their advantages and limitations. These
technologies can be divided into interface tracking and glove technologies. Tracking with
interface uses optical, magnetic, or acoustic sensing to determine the 3-space position of the
hand. Glove technologies use an electromechanical device fitted over the hand and fingers to
determine hand shape [4].
3.1 Tracking with interface
In this system, the hand is tracked by following the orientation of the hand and the configuration
of the fingers. Position tracking can be done using the following three technologies [5]:
Optical tracking, using a single camera or multiple cameras at a distance.
Magnetic tracking, radiating a magnetic pulse from a fixed source.
Acoustic tracking, using triangulation of ultrasonic waves to locate the hand.
3.1.1 Optical Tracking
In optical tracking, small markers are placed on the major bone segments of the body. The
markers may emit infrared light and can be either LEDs or reflective dots. One or more cameras
capture the motion of the subject along with the markers. The software locates each marker in
2D coordinates and triangulates to calculate its 3D position [6]. Another method uses a single
camera to capture a silhouette image of the subject, which is analyzed to determine the
positions of the various parts of the body and the user's gestures.
3.1.1.1 Marker Systems
Flashing infrared LEDs have been widely used as markers in the medical and entertainment
industries. On each hand, markers are placed on each "operative" finger, and cameras capture
each marker and measure its position. Two types of marker systems have been developed to
record the motion of the limbs of the body:
Infrared LED systems such as Selspot® [7], Op-Eye®, and Optotrak® [8].
Reflective marker systems such as Elite® and Vicon Avalon® [9].
Limitations
1. High processing time is required to analyze several camera images and to determine
each marker's 3D position [6].
2. Complex algorithms are needed to infer a pose estimate [7].
3. Multiple cameras are needed to accurately resolve the ambiguities that arise when
markers coincide in the visual field.
4. The inability to resolve these ambiguities restricts the use of marker systems for tracking
fingers in interactive applications.
3.1.1.2 Silhouette Analysis
Silhouette analysis of an image can easily distinguish body parts such as the head, legs, arms,
and fingers. Myron Krueger successfully analyzed complex motions in real time by processing
silhouette images with custom hardware. Based on this technique, he developed a wide
collection of interactions and games that required no gloves or goggles. The movements and
actions were integrated into his system, called Videoplace® [10]. Inspired by Krueger's work,
Pierre Wellner developed DigitalDesk® [11]. The idea behind DigitalDesk is to mount a video
camera above an ordinary physical desk, pointing down at the work surface. By processing the
camera output, the system can determine when the user points (using an LED-tipped pen) or
gestures above a real or projected object. This allows the user to run and edit a projected text
file or a calculator by making gestures (Figure 3.1.1.2).
Figure 3.1.1.2: Krueger's manipulation of graphics by hand, fingertips controlling a spline curve
[10].
Limitations
1. Consumer-grade cameras usually have low frame rates (24-60 fps), which makes it difficult
to track rapidly moving fingers.
2. The poor resolution (less than 300 dpi) of consumer-grade cameras makes it difficult to
determine the locations of the fingers as they occlude each other and are occluded by the
hand [5].
3. Complex algorithms and techniques are needed to interpret complex real-time motions.
3.1.2 Magnetic tracking
Magnetic tracking technology is quite robust and widely used for single- or double-hand tracking
[12]. Magnetic tracking uses a source element radiating a magnetic field and a small sensor that
reports its position and orientation with respect to the source. Magnetic systems do not rely on
line-of-sight observation as optical and acoustic systems do, but metallic objects in the
environment distort the magnetic field, giving erroneous readings. They also require a cable
attachment to a central device (as do LED and acoustic systems) [5]. Polhemus FASTRAK®
and Ascension Technologies trakSTAR® provide various multi-source, multi-sensor magnetic
systems that will track a number of points at up to 100 Hz in ranges from 3 to 20 feet [13], [14].
3.1.3 Acoustic tracking
Acoustic tracking uses high-frequency sound to triangulate a source within the work area. Most
systems, such as the Logitech tracker [15] and the Mattel Power Glove [16], send out pings from
a source (usually mounted on the hand) that are received by microphones in the environment.
Precise placement of the microphones allows the system to locate the source in space to within
a few millimeters. These systems rely on line-of-sight between the source and the microphones
and can suffer from acoustic reflections if surrounded by hard walls or other acoustically
reflective surfaces. Multiple acoustic trackers must operate at non-conflicting frequencies, a
strategy also used in magnetic tracking [5].
3.2 Glove Tracking
Motion tracking with gloves instrumented with sensors, or with gloves that emit or reflect
infrared light, produces accurate results [12]. These techniques give real-time results but are
expensive and may put some constraints on the possible hand movements. Inspired by Rich
Sayre's work on the world's first data glove [17], Thomas et al. developed an inexpensive,
lightweight glove using flexible tubes with a light source at one end and a photocell at the other;
the voltage from each photocell was correlated with finger configuration [5]. In 1983, Gary
Grimes developed the Digital Data Entry Glove, the first glove used to recognize sign language.
He used a cloth glove with numerous sensors sewn at specific positions to track finger
movement [18]. Thomas Zimmerman's DataGlove® [19], the Dexterous HandMaster® (DHM)
[20], and the VPL® DataGlove [19] are the finest examples of modern glove-tracking
technologies. The advantages of glove technologies are fast response time, minimal
environmental restrictions, availability in industry, and minimal data loss when occlusion
occurs. On the other hand, reliance on software for data resolution and high expense are their
main limitations [6].
4.0 Interpreting sign language
Sign language recognition from static and dynamic hand gestures has been an active area of
research for the last two decades. While there are many different types of gestures, the most
structured sets belong to the sign languages. In sign language, where each gesture already has
an assigned meaning, strong rules of context and grammar can be applied to make recognition
tractable. To date, most work on sign language recognition has employed expensive
"datagloves" that tether the user to a stationary machine [21] or computer vision systems
limited to a calibrated area [22]. Current successful gesture recognition systems are based on
computer vision technology and virtual reality (VR) [23]. VR glove-based gesture recognition
systems use a VR glove to extract a sequence of 3D hand configuration sets containing finger
orientation angles, and use various structures of neural networks [24] or hidden Markov models
(HMMs) [25] to recognize the 3D motion data as gestures.
4.1 Using a recurrent neural network
An artificial neural network (ANN) can be defined as a massively parallel distributed processor
made up of simple processing units (Figure 4.1.1), which has a natural tendency to store
experiential knowledge and make it available for use [24]. An ANN consists of many
interconnected processing elements (Figure 4.1.2) [25] and is used for identification and control,
game playing and decision making, pattern recognition, and medical diagnosis [26]. ANNs also
have adaptive, self-organizing abilities [25].
Manar [27] used two recurrent neural network architectures for static hand gestures to
recognize Arabic Sign Language (ArSL): Elman recurrent neural networks and fully recurrent
neural networks (Figure 4.1.2). A digital camera and a colored glove were used to acquire the
input image data. RGB color classification was used to segment the video frames. Thirty
segmented features of the hand image were then extracted and grouped to represent a single
image; angles and distances were measured between the fingertips and the wrist. A total of 900
colored images were used for the training set and 300 colored images for testing. The results
showed that the fully recurrent neural network (with a recognition rate of 95.11%) performed
better than the Elman neural network (89.67%) [27]. A minimal sketch of an Elman-style network
is given below.
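To make the recurrent architecture concrete, the following is a minimal sketch of an Elman-style
forward pass over a sequence of per-frame feature vectors. It is illustrative only: the dimensions
(30 features per frame, echoing the thirty features above, 40 hidden units, and 30 output classes)
are assumptions rather than values from Manar's system, and training is omitted.

```python
import numpy as np

class ElmanRNN:
    """Minimal Elman-style recurrent network (forward pass only)."""

    def __init__(self, n_in=30, n_hidden=40, n_out=30, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))       # input -> hidden
        self.W_ctx = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # context -> hidden
        self.W_out = rng.normal(0.0, 0.1, (n_out, n_hidden))     # hidden -> output

    def classify(self, frames):
        """frames: iterable of per-frame feature vectors of length n_in."""
        h = np.zeros(self.W_ctx.shape[0])  # context units start at zero
        for x in frames:
            # The hidden state depends on the current input and on the
            # previous hidden state fed back through the context weights.
            h = np.tanh(self.W_in @ x + self.W_ctx @ h)
        return int(np.argmax(self.W_out @ h))  # recognized gesture class

# Example: classify a 25-frame sequence of random 30-dimensional features.
rnn = ElmanRNN()
sequence = [np.random.rand(30) for _ in range(25)]
print("predicted class:", rnn.classify(sequence))
```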
Figure 4.1.1: Sign language word recognition system using a recurrent neural network [25].
(Pipeline: data glove input → verifying the start point with a posture-recognition neural
network → sign language recognition with a recurrent neural network → verifying the sampling
endpoint against history → result.)
Figure 4.1.2: Recurrent neural networks (left: fully recurrent network; right: Elman recurrent
network).
4.2 Using a Hidden Markov Model
A hidden Markov model (HMM) is a statistical Markov model in which the system being
modeled is assumed to be a Markov chain (Figure 4.2) with unobserved (hidden) states. A
hidden Markov model can be defined by [28]:

A set of states $C$, where $C_1$ is the initial state and $C_2$ is the final state.

The transition probability matrix $M = \{m_{ij}\}$, where $m_{ij}$ is the probability of taking
the transition from state $i$ to state $j$.

The output probability matrix $N = \{n_j(A_k)\}$ for a discrete HMM, or $N = \{n_j(x)\}$ for a
continuous HMM, where $A_k$ stands for a discrete observation symbol and $x$ stands for
continuous observations of $k$-dimensional random vectors.

For a discrete HMM, $m_{ij}$ and $n_j(A_k)$ have the following properties:

$$m_{ij} \ge 0, \quad n_j(A_k) \ge 0, \qquad \sum_j m_{ij} = 1, \qquad \sum_k n_j(A_k) = 1$$

If the initial state distribution is $T = \{T_i\}$, an HMM can be written in compact notation to
represent the complete parameter set of the model:

$$\lambda = (M, N, T)$$
HMMs are widely used in speech recognition, gesture recognition, and signal processing
systems. HMMs provide algorithms for modeling dynamical dependencies and correlations
between 3-D measurements. The dynamical dependencies are modeled implicitly by a Markov
chain with a specified number of hidden states. The number of states for an HMM can be
determined by estimating how many distinct states are involved in specifying a sign. While
better results might be obtained by tailoring the number of states to each sign, a four-state
HMM with one skip transition was determined to be sufficient for this task [29] (Figure 4.2). A
sketch of such a topology is given below.
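As an illustration of this topology, the sketch below builds a hypothetical four-state left-to-right
HMM with one skip transition, as in Figure 4.2, and scores a discrete observation sequence with
the forward algorithm. All probability values are invented for the example; a real recognizer
would estimate them from training data.

```python
import numpy as np

# Four-state left-to-right HMM with one skip transition (Figure 4.2):
# each state may loop, advance one state, or skip one state ahead.
M = np.array([              # transition matrix m_ij; each row sums to 1
    [0.5, 0.3, 0.2, 0.0],   # state 0: loop, step to 1, or skip to 2
    [0.0, 0.5, 0.3, 0.2],   # state 1: loop, step to 2, or skip to 3
    [0.0, 0.0, 0.6, 0.4],   # state 2: loop or step to 3
    [0.0, 0.0, 0.0, 1.0],   # state 3: final, absorbing state
])
N = np.array([              # output probabilities n_j(A_k), 3 symbols
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.3, 0.3, 0.4],
])
T = np.array([1.0, 0.0, 0.0, 0.0])  # initial state distribution T_i

def forward_likelihood(observations):
    """Forward algorithm: P(observations | lambda = (M, N, T))."""
    alpha = T * N[:, observations[0]]
    for symbol in observations[1:]:
        alpha = (alpha @ M) * N[:, symbol]
    return alpha.sum()

# Recognition picks the sign whose HMM assigns the highest likelihood.
print(forward_likelihood([0, 0, 1, 2, 2]))
```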
Figure 4.2: The four-state HMM used for recognition [29].
Schlenzig et al. [30] used hidden Markov models to recognize "hello," "good-bye," and "rotate"
in sign language. Wilson and Bobick [31] explored incorporating multiple representations in
HMM frameworks, and Campbell et al. [21] used an HMM-based gesture system to recognize 18
T'ai Chi gestures with 98% accuracy.
5.0 Proposed Methodology
This report proposes an inexpensive and lightweight tracking approach influenced by the work
of B. Dorner [33] and Robert Wang [34]. An experiment was set up to validate the facts and data
in this report. The principal method is to infer a pose from a single frame of the hand wearing a
color-dotted glove. The glove is designed so that this inference task reduces to searching for the
pose in a library database. The library database is generated by sampling records of natural hand
poses and is indexed by rasterized images of the poses. A (noisy) input image from the camera is
first transformed into a normalized query. It is then compared to each entry in the database
according to a robust distance metric. An evaluation of this data-driven pose estimation
algorithm shows a steady increase in retrieval accuracy with the size of the database [34].
5.1 Glove Design
The glove design is sufficiently distinctive that the pose of the hand can be inferred from a single
frame captured by a consumer-grade camera. The glove is made of a simple transparent
polymer. It has 16 bright orange (hexadecimal code #FFA500) patches on the back, 16 lime
(hexadecimal code #00FF00) patches on the front, and 5 magenta (hexadecimal code #FF00FF)
patches at the fingertips. The system looks only for these three fully saturated colors (#FFA500,
#00FF00, #FF00FF) to distinguish the front, the back, and the fingertips of the hand; they are
classified as master colors. The color patches and their pattern on the glove enable quicker and
more robust pose estimation with less complex color identification algorithms [36]. The orange
and lime patches meet at the sides of each finger, which makes the sides of the fingers easy to
distinguish. The 3D hand model has 21 degrees of freedom (DOF), including 6 DOFs for global
transformation and 4 DOFs per finger.
Figure 5.1: Glove design (back, front, and side views).
5.1.1 Glove vs. Bare Hand Tracking
In bare-hand pose estimation, two very different poses can map to very similar images. This is a
difficult challenge that requires slower and more complex inference algorithms to address, and
an extra step is needed to obtain skin data (edge detection) for good results [37]. With a gloved
hand, very different poses always map to very different images (see Figure 5.1.1). This allows the
use of a simple image lookup approach.
Figure 5.1.1: Bare hand estimation and edge detection [17]
5.2 Rasterizing the frame
A typical consumer webcam has a refresh rate of 30-60 Hz and captures 24-30 frames per
second. In the experiment, a Sony® Visual Communication 2.0 web camera (30 fps at a 60 Hz
refresh rate) was used, and the video was captured with iPi Recorder®. The captured video is a
sequence of frames (Ψ). A bilateral filter is used to reduce noise and smooth each frame image.
Each frame of Ψ is then rasterized using Adobe Image Processor®. A sketch of this preprocessing
step is shown below.
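The following sketch shows what this per-frame preprocessing could look like with OpenCV
standing in for the commercial capture and rasterizing tools named above; the file name and the
filter parameters are assumptions.

```python
import cv2

# Read a recorded glove session and build the frame sequence Psi.
capture = cv2.VideoCapture("glove_session.avi")  # hypothetical file name
psi = []

while True:
    ok, frame = capture.read()
    if not ok:
        break
    # Bilateral filtering reduces sensor noise while preserving the sharp
    # edges of the colored glove patches (d=9, sigma=75 are common values).
    psi.append(cv2.bilateralFilter(frame, 9, 75, 75))

capture.release()
print(f"collected {len(psi)} smoothed frames")
```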
5.3 Color and pixel correcting
The rasterized frames are then converted into the primary pixel set Ω, a set of 100x100-pixel
images. The system recognizes only three distinct colors using color pixel classification:
magenta, lime, and orange. Due to ambient light, webcam sensor sensitivity, image format
conversion quality, hue, and shadow, a captured frame may lose a significant number of color
pixels. To solve this problem, all colors close to magenta, lime, and orange are classified as glove
pixels (Table 5.3). The system rejects every other color in the frame, including the background
color. For best results, the three master colors should not appear anywhere in the visual area
except on the glove.
Table 5.3: Estimating neighboring colors

Master color                        | Neighboring colors accepted
#FFA500 (RGB decimal 255, 165, 0)   | All colors in RGB range (205~255, 130~180, 0)
#00FF00 (RGB decimal 0, 255, 0)     | All colors in RGB range (0~90, 170~255, 0~90)
#FF00FF (RGB decimal 255, 0, 255)   | All colors in RGB range (170~255, 0~80, 170~255)
After color pixel classification, only two pixel classes remain: glove pixels and non-glove pixels.
The glove pixels are cropped and reduced to 40x40-pixel micro images. Let µ denote the micro
image set, which is classified as the hand region. Once the hand region is acquired, it is queried
against the library database for a positive match. Reducing the pixel count to micro images
speeds up the subsequent query for the positive match. A sketch of this classification and
cropping step follows.
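A minimal sketch of the classification and cropping steps, assuming frames arrive as RGB arrays;
the acceptance ranges are taken directly from Table 5.3.

```python
import cv2
import numpy as np

# Accepted neighborhoods around the three master colors (Table 5.3),
# given as (lower, upper) bounds in RGB order.
RANGES = [
    ((205, 130, 0), (255, 180, 0)),    # orange neighborhood of #FFA500
    ((0, 170, 0), (90, 255, 90)),      # lime neighborhood of #00FF00
    ((170, 0, 170), (255, 80, 255)),   # magenta neighborhood of #FF00FF
]

def glove_mask(image_rgb):
    """Mark every pixel near a master color as a glove pixel."""
    mask = np.zeros(image_rgb.shape[:2], dtype=np.uint8)
    for lower, upper in RANGES:
        mask |= cv2.inRange(image_rgb, np.array(lower), np.array(upper))
    return mask  # 255 = glove pixel, 0 = non-glove pixel

def micro_image(image_rgb):
    """Crop the glove region and reduce it to a 40x40 micro image."""
    ys, xs = np.nonzero(glove_mask(image_rgb))
    if xs.size == 0:
        return None  # no hand region found in this frame
    crop = image_rgb[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return cv2.resize(crop, (40, 40), interpolation=cv2.INTER_AREA)
```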
5.4 Indexing the library database
The library database is produced by sampling 40x40-pixel images of natural hand configurations,
sign language alphabets, and common hand gestures. This database is used as the reference
database. An enriched database that covers all natural hand gestures helps the system achieve
good retrieval accuracy across gesture configurations [34]. In the experiment, a set of 1000
finger configurations D was sampled using the iPi MoCap® system. The members of D are
denoted d1, d2, d3, ..., dn (n a natural number), and the distance metric between dm and dn is
denoted s(dm, dn). Low-dispersion sampling was used to create a uniform set of samples D from
an overcomplete collection of finger configurations Ω. A sampling algorithm [35] is used to
minimize dispersion by choosing, at each iteration, the next sample at the furthest distance from
the previously chosen samples:

$$m_{i+1} = \arg\max_{m \in \Omega} \; \min_{n \in D_i} s(d_m, d_n)$$

where $D_i$ is the set of samples given at iteration $i$.
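The iteration above is the classic farthest-point rule; a brute-force sketch, with a placeholder
Euclidean metric standing in for s(d_m, d_n), could look like this.

```python
import numpy as np

def s(d_m, d_n):
    """Placeholder pose distance; the report's metric would go here."""
    return np.linalg.norm(d_m - d_n)

def low_dispersion_sample(omega, k):
    """Pick k well-spread samples D from the overcomplete set Omega."""
    D = [omega[0]]  # seed with an arbitrary first configuration
    while len(D) < k:
        # Distance from each candidate to its closest chosen sample...
        gaps = [min(s(m, n) for n in D) for m in omega]
        # ...then add the candidate that maximizes that minimum distance.
        D.append(omega[int(np.argmax(gaps))])
    return D

omega = [np.random.rand(21) for _ in range(1000)]  # 1000 configurations
D = low_dispersion_sample(omega, 50)
print(len(D), "samples selected")
```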
5.5 Matching and Tracking
Tracking is done between frames. The centroid of each visible colored patch in the rasterized
frame sequence (Ψ) is calculated, and the system identifies the closest vertex of the hand model
to each centroid. The displacement of each centroid of the moving hand is then calculated from
the difference between two consecutive frames, and correspondence is established between
the centroids of each frame. A sketch of this step is given below.
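A sketch of this matching step, assuming each frame has already been reduced to an array of
patch labels (0 for background); the labeling scheme is an assumption for illustration.

```python
import numpy as np

def patch_centroids(labels):
    """Centroid (row, col) of every visible patch in a label image."""
    return {p: np.argwhere(labels == p).mean(axis=0)
            for p in np.unique(labels) if p != 0}

def track_displacements(prev_labels, curr_labels):
    """Pair each current centroid with the nearest previous centroid
    and return the per-patch displacement between the two frames."""
    prev_c = patch_centroids(prev_labels)
    curr_c = patch_centroids(curr_labels)
    moves = {}
    for patch, c in curr_c.items():
        if not prev_c:
            break
        nearest = min(prev_c, key=lambda q: np.linalg.norm(prev_c[q] - c))
        moves[patch] = c - prev_c[nearest]  # displacement vector
    return moves
```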
5.6 Pose estimation and finding nearest neighbor
The nearest neighbor is found by calculating a distance metric between two micro images.
Hausdorff-like distances are computed between the pixel points to obtain a precise divergence.
Each micro image is searched against the library database for a positive match: the micro image
and each database image are compared, and the divergence from the database to the query
and from the query to the database are both calculated. For each non-background pixel in one
image, the distance to the closest pixel of the same color in the other image is added as a
penalty.
The distance metric [34] is:

$$\hat{s}(\mu_1, \mu_2) = \frac{1}{|A_1|} \sum_{(x,y) \in A_1} \min_{(u,v) \in U_{xy}} \sqrt{(u-x)^2 + (v-y)^2}$$

where

$$U_{xy} = \{(u,v) \mid \mu_1(x,y) = \mu_2(u,v)\}, \qquad A_1 = \{(x,y) \mid \mu_1(x,y) \ne \text{background}\}$$

and the symmetric distance is

$$s(\mu_1, \mu_2) = \hat{s}(\mu_1, \mu_2) + \hat{s}(\mu_2, \mu_1)$$
Figure 5.6: Hausdorff-like image distance. A database image and a query image are compared
by computing the divergence from the database to the query and from the query to the
database [34].
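A sketch of this Hausdorff-like distance, assuming micro images are stored as 40x40 arrays of
color labels (0 = background, 1-3 = the three master colors); the penalty charged when a color
has no counterpart at all is an assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

NO_MATCH_PENALTY = 40.0  # charged when a color has no counterpart

def one_sided(mu1, mu2):
    """Divergence s_hat(mu1, mu2): average distance from each
    non-background pixel of mu1 to the nearest same-color pixel of mu2."""
    total, count = 0.0, 0
    for color in (1, 2, 3):
        pts1 = np.argwhere(mu1 == color)
        pts2 = np.argwhere(mu2 == color)
        count += len(pts1)
        if len(pts1) == 0:
            continue
        if len(pts2) == 0:
            total += NO_MATCH_PENALTY * len(pts1)
        else:
            dists, _ = cKDTree(pts2).query(pts1)  # nearest same-color pixel
            total += dists.sum()
    return total / max(count, 1)

def image_distance(mu1, mu2):
    """Symmetric distance s = s_hat(mu1, mu2) + s_hat(mu2, mu1)."""
    return one_sided(mu1, mu2) + one_sided(mu2, mu1)
```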
5.7 Blending nearest neighbor
To maximize the smoothness of tracking, a number of the closest database entries are chosen
and blended. Blending a certain number of neighboring micro images, in addition to pose
estimation, helps find the distance and thus calculate the motion more accurately. Let ϒ denote
the set of the ten closest micro images; the blended pose is calculated with a Gaussian radial
basis kernel [34]:

$$d_p(\mu) = \frac{\sum_{i \in \Upsilon} d_i \, \exp\left(-s(\mu_i, \mu)^2 / \sigma^2\right)}{\sum_{i \in \Upsilon} \exp\left(-s(\mu_i, \mu)^2 / \sigma^2\right)}$$

where $\sigma$ is chosen to be the average distance to the neighbors.
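A sketch of this blending step, reusing the image_distance() sketch from Section 5.6; the pose
vectors d_i and the ten-neighbor count follow the text, while everything else is illustrative.

```python
import numpy as np

def blend_pose(query_mu, neighbors):
    """neighbors: list of (pose_vector d_i, micro_image mu_i) pairs for
    the ten database entries closest to the query micro image."""
    dists = np.array([image_distance(query_mu, mu_i)
                      for _, mu_i in neighbors])
    sigma = dists.mean()  # sigma = average distance to the neighbors
    weights = np.exp(-dists ** 2 / sigma ** 2)  # Gaussian RBF weights
    poses = np.array([d_i for d_i, _ in neighbors])
    # Weighted average of the neighbor poses gives the blended estimate.
    return np.average(poses, axis=0, weights=weights)
```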
5.8 Recognizing ASL
Once the hand region is determined, pose estimation is completed, and the nearest neighbors
are blended, the system queries the database library for the positive match. The results of the
experiment are shown in Figure 5.8: hand tracking enables recognition of the sign language
alphabet (J and Z, which involve motion, are not shown).
Figure 5.8: Interpretation of sign language alphabets. (For each sign alphabet, the figure shows
the captured frame, the de-noised and normalized image, and the matching database candidate
as a micro image.)
6.0 Discussion
The proposed hand-tracking method is a combination of optical and glove tracking. Although
LED markers, magnetic sensors, silhouette analysis, and acoustic tracking provide robust and
smooth tracking and are used widely in the automation, medical, and entertainment industries,
the proposed design avoids these technologies because they demand more sophisticated
algorithms, greater expense, and more time; this keeps the system affordable for consumers. A
detailed comparison of the technical data of the available hand-tracking systems, along with the
proposed prototype, is given in Table 6.1.
Table 6.1: Technical details comparison [7], [8], [9], [10], [11], [5].

Device/System name | Tracking system | Retail price/manufacturing expense | Camera/sensor used/DOF | Speed | Weight | Resolution/area of coverage
trakSTAR® | Magnetic tracking | $50,000.00 | 6 DOF | 1000 fps | 2 kg | 6 megapixel
Vicon® optical system | Optical and marker tracking | $30k-150k | 10-24 cameras | 1000 fps | 6 kg | 1280x1024 pixel
Optotrak Certus® | Marker system, optical tracking | $70,000.00+ | 540 cameras, 8 sensor positions | 900 | 18 kg + 3.4 kg (system control) | Capture region 3x4 meter, 512 markers
Ascension (Polhemus®) | Magnetic tracking | $50,000.00 | 18 sensors, six DOFs | 120 fps | 1.8 kg | 3 m capture area
Exoskeletons (joint sensors plus a gyroscope), mechanical system | Electromagnetic tracking | $40,000.00 | 180 sensors | 500 samples/sec | 5 kg | No range limit, wired system
Mattel Power Glove | Acoustic tracking | $10,000.00 | 6 DOF | 120 samples/sec | 1.1 kg | 2 m radial area
DigitalDesk [5] | Optical tracking, silhouette analysis | $5,000.00 | 2 cameras | 30 fps | - | Less than 1 m
Videoplace | Optical tracking, silhouette analysis | $2,000.00 | 1 camera | 24-60 fps | - | Less than 1 m
MIT LED glove | LED glove technology | - | 16 DOF | 100~120 samples/sec | 0.8 kg | -
CyberGlove | 22 thin-foil strain gauges sewn into the fabric glove, electromagnetic | $5,000.00+ | 22 thin-foil gauges | 300 samples/sec | 0.5 kg | -
Boujou silver bullet | Optical tracking | $40,000.00 | 10 cameras | Full frame 120 fps | 2 kg | 16 megapixel
Proposed prototype: color-dotted glove | Marker and optical tracking | $105.00 (glove + webcam + capturing software) | 1 camera; glove has 21 DOF | 30 fps at 60 Hz refresh rate | Glove weighs 100 grams | 0.1 meter at 640x360 pixel
A detailed comparison of advantages and limitations is given in Table 6.2.

Table 6.2: Comparison of advantages and limitations [7], [8], [9], [10], [11], [5].

Device/System name | Advantages | Limitations
trakSTAR® | High-rate data; highly available in industry | Expense; occlusion
Vicon optical system (motion analysis) | High-rate data; highly available in industry | Expense; occlusion; relies on software for data resolution
Optotrak Certus® | Minimal data loss after occlusion is identified; no environment restriction | Low capture rate; small region
Ascension (Polhemus®) | No occlusion; orientation information recorded | Environment restriction; can be bulky
Exoskeletons (joint sensors plus a gyroscope), mechanical system | Fits a rigid body skeleton well; high data rate | Not accurate in body location
VPL Data Glove | Reasonable cost | Slow capture speed
Sayre Glove | Effective for multi-functional control | Few gestures
Digital Data Entry Glove | First ASL recognizer | Slow processing time
CyberGlove (Virtual Technologies) | Comfortable, easy to use; accuracy and precision well suited for complex gestural work or fine manipulations | -
Vicon MX | Precise and accurate tracking | -
Proposed prototype: color-dotted glove | Lightweight and comfortable; faster than using an HMM or recurrent neural network because it queries the database for a positive match rather than processing topologies | Slow estimation process time; limited accuracy due to inadequate library database
Although the proposed system has a slow estimation response time because the database is
accessed for every frame, it shows credibility in terms of expense and complexity.
7.0 Conclusion
This report introduced a hand-tracking user-input device composed of a single camera and a
polymer glove. The report shows that, without using an HMM or a recurrent neural network,
this system can work effectively. The system is logically balanced and should work effectively in
3-D manipulation and pose recognition tasks. It could be improved by installing fine sensors and
inverse kinematics algorithms, but that would conflict with the goal of cost effectiveness,
because the primary purpose of this report is to deliver a robust, low-cost user-input system.
8.0 Recommendations
The proposed system admits several possible extensions. More cameras can be installed for
higher accuracy as long as the hands do not occlude each other. Hand movements and finger
configurations could be replaced with LED pens or multi-touch interfaces for ease of use.
Inverse kinematics [25] and optimal smoothness [34] can be applied for more accurate
detection and tracking. The camera calibration process can be improved with better sensor
alignment and resolution. The system can also be used in virtual surgery, virtual games, and
sports, alongside recognition of sign language.
References
1. Judith Holt, Sue Hotto, and Kevin Cole, Demographic Aspects of Hearing Impairment:
Questions and Answers, Third Edition, 1994.
2. Karen Nakamura, "About ASL," Deaf Resource Library, http://www.deaflibrary.org.
3. Robert A. Heinlein, "Science Fiction: Its Nature, Faults and Virtues," The Science Fiction
Novel, Chicago: Advent, 1959.
4. G. J. Grimes, "Digital Data Entry Glove Interface Device," Bell Telephone Laboratories,
Murray Hill, NJ, US Patent 4,414,537, Nov. 8, 1983.
5. D. J. Sturman and D. Zeltzer, "A Survey of Glove-Based Input," IEEE Computer Graphics and
Applications, January 1994.
6. J. Rehg and T. Kanade, DigitEyes: Vision-Based Human Hand Tracking, School of Computer
Science Technical Report CMU-CS-93-220, December 1993.
7. Herman J. Woltring, "Optotrak, Selspot: Gait Measurement in Two- and Three-Dimensional
Space: A Preliminary Report," Cleveland, 1994.
8. Optotrak Certus® Motion Capture System, available at:
http://www.ndigital.com/lifesciences/certus-techspecs.php.
9. Vicon MX®, available at: http://www.vicon.com/products/sensors.html.
10. Myron Krueger, Artificial Reality 2, Addison-Wesley Professional, 1991.
11. Pierre Wellner, "Interacting with Paper on the DigitalDesk," Rank Xerox EuroPARC,
Cambridge, UK, Volume 36, Issue 7, pp. 87-96, July 1993.
12. Jannick P. Rolland, Yohan Baillot, and Alexei A. Goon, "A Survey of Tracking Technology for
Virtual Environments," Center for Research and Education in Optics and Lasers (CREOL),
University of Central Florida.
13. Polhemus FASTRAK® official website, available at:
http://www.polhemus.com/?page=Motion_Fastrak
14. Ascension Technologies trakSTAR® official website, available at: http://www.ascension-
tech.com/medical/trakSTAR.php
15. Logitech® video technologies, http://www.logitech.com/en-us/488/455
16. A.G.E. Tech, Abrams Gentile Entertainment, 2009.
17. Vitor F. Pamplona, Leandro A. F. Fernandes, João Prauchner, Luciana P. Nedel, and Manuel
M. Oliveira, "The Image-Based Data Glove," Proceedings of the X Symposium on Virtual
Reality (SVR 2008), João Pessoa, 2008, pp. 204-211.
18. G. Grimes, Digital Data Entry Glove, US Patent 4,414,537, patented Nov. 8, 1983.
19. Tom Zimmerman et al., "DataGlove: A Hand Gesture Interface Device," 1985.
20. Ken Pimentel and Kevin Teixeira, Virtual Reality: Through the New Looking Glass,
Intel/Windcrest/McGraw-Hill, 1993.
21. L. Campbell, D. Becker, A. Azarbayejani, A. Bobick, and A. Pentland, "Invariant Features for
3-D Gesture Recognition," Intl. Conf. on Face and Gesture Recognition, pp. 157-162, 1996.
22. Y. Cui and J. Weng, "Learning-Based Hand Sign Recognition," Intl. Workshop on Automatic
Face and Gesture Recognition (IWAFGR), pp. 201-206, 1995.
23. Thad Starner, Joshua Weaver, and Alex Pentland, "A Wearable Computer Based American
Sign Language Recognizer," The Media Laboratory, Massachusetts Institute of Technology,
2001.
24. Marcus Vinicius Lamar, "Hand Gesture Recognition Using T-CombNET: A Neural Network
Model Dedicated to Temporal Information Processing," Doctoral Thesis, Institute of
Technology, Japan, 2001.
25. Ankit Chaudhary, J. L. Raheja, Karen Das, and Sonia Raheja, "Intelligent Approaches to
Interact with Machines Using Hand Gesture Recognition in Natural Way: A Survey,"
International Journal of Computer Science & Engineering Survey (IJCSES), vol. 2(1), Feb.
2011.
26. Jian-kang Wu, Neural Networks and Simulation Methods, Marcel Dekker, Inc., USA, 1994.
Available at:
http://books.google.co.in/books/about/Neural_networks_and_simulation_methods.html?id=95iQOxLDdK4C&redir_esc=y
27. Manar Maraqa and Raed Abu-Zaiter, "Recognition of Arabic Sign Language (ArSL) Using
Recurrent Neural Networks," IEEE First International Conference on the Applications of
Digital Information and Web Technologies, pp. 478-48, 2008.
28. Tie Yang and Yangsheng Xu, Hidden Markov Model for Gesture Recognition, May 1994.
29. Thad Eugene Starner, Visual Recognition of American Sign Language Using Hidden Markov
Models, 1999.
30. J. Schlenzig, E. Hunter, and R. Jain, "Recursive Identification of Gesture Using Hidden
Markov Models," Proc. Second Ann. Conf. on Applications of Computer Vision, pp. 187-194,
1994.
31. A. Wilson and A. Bobick, "Learning Visual Behavior for Gesture Analysis," Proc. IEEE Int'l
Symposium on Computer Vision, Nov. 1995.
32. C. Y. Suen, M. Berthod, and S. Mori, "Automatic Recognition of Handprinted Characters:
The State of the Art," Proceedings of the IEEE, Vol. 68, No. 4, pp. 469-487, 1980.
33. B. Dorner, Chasing the Colour Glove: Visual Hand Tracking, 1994.
34. Robert Y. Wang and Jovan Popović, "Real-Time Hand-Tracking with a Color Glove," 2009.
35. R. White, K. Crane, and D. A. Forsyth, "Capturing and Animating Occluded Cloth," ACM
Transactions on Graphics, 2008.
36. M. Yuan, F. Farbiz, C. M. Manders, and T. K. Yin, "Robust Hand Tracking Using a Simple
Color Classification Technique," The International Journal of Virtual Reality, 8(2), 2009.