Gesture
Recognition of hand gestures from Kinect skeleton data: feature extraction, segmentation, and matching with Dynamic Time Warping.
Gesture module
Main idea: gestures can be distinguished by the specific position of the hand with respect to the head.

Skeleton extraction

The input for gesture recognition is usually the hand trajectory. The hand can be tracked in various ways; a common approach is to track skeleton joints: hands, shoulders, elbows, head, etc. This requires estimating the joint positions in the image. For example, an interesting approach we came across is proposed in [9]: it estimates the joints from a single depth image, using decision trees for per-pixel classification into body parts and then Mean Shift to estimate the joint positions. In this project, we use the 2D joint positions provided directly by the Microsoft Kinect SDK.

Feature vector as gesture representation

In each frame, we extract the 2D positions of the joints used below: the tracked hand, the head, and the left and right shoulders.
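As an illustration, the per-frame joint data can be kept in a simple record like the following (the field names and numeric values are ours, for illustration only; the actual data comes from the Kinect SDK skeleton stream):

```python
# One frame of skeleton data: 2D joint positions in image coordinates.
# Field names and values are illustrative; the real values come from the
# Microsoft Kinect SDK skeleton stream.
frame = {
    'hand':           (412.0, 310.0),   # tracked hand (left or right)
    'head':           (400.0, 120.0),
    'left_shoulder':  (360.0, 190.0),
    'right_shoulder': (445.0, 190.0),
}
```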
A gesture is represented by a temporal vector A of (x, y) positions with respect to the coordinate frame shown below: in each frame, the hand's (x, y) position is recorded.
The origin of the coordinate system is at the head, so the (x, y) position of each joint is first translated by subtracting the head coordinates. To cope with inter-person variations in body size, we then normalize by the distance between the left and right shoulders (δ).
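A minimal sketch of this feature extraction, assuming per-frame joint records like the one above (the function names are our own, not part of the Kinect SDK):

```python
import numpy as np

def normalize_hand_position(hand_xy, head_xy, left_shoulder_xy, right_shoulder_xy):
    """Express a hand position in the head-centered, shoulder-normalized frame."""
    hand = np.asarray(hand_xy, dtype=float)
    head = np.asarray(head_xy, dtype=float)
    # Translate so that the head becomes the origin of the coordinate frame.
    translated = hand - head
    # Normalize by the shoulder-to-shoulder distance (delta) to remove
    # inter-person variation in body size.
    delta = np.linalg.norm(np.asarray(left_shoulder_xy, dtype=float)
                           - np.asarray(right_shoulder_xy, dtype=float))
    return translated / delta

def gesture_trajectory(frames):
    """Build the temporal vector A: one normalized (x, y) hand position per frame."""
    return np.array([normalize_hand_position(f['hand'], f['head'],
                                             f['left_shoulder'], f['right_shoulder'])
                     for f in frames])
```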
Gesture segmentation

Gesture (video) segmentation means determining the frame sequence that contains one full gesture. It should not be confused with image segmentation, where image regions are extracted. Gesture segmentation is done by imposing a constraint: the starting and ending position of the hand must be the resting position, i.e. both arms lying along the body (the position shown in the figure above); a sketch of this check is given below.
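A minimal sketch of the resting-position check and the resulting segmentation, assuming the same per-frame joint record as above; the drop_factor heuristic and its value are our own assumptions, not taken from the project:

```python
def is_resting(frame, drop_factor=1.5):
    """Heuristic resting-position test: the hand hangs well below shoulder level
    (image y grows downwards). drop_factor is an assumed, empirically tuned
    value expressed in shoulder-width units.
    """
    hand_y = frame['hand'][1]
    left_sx, left_sy = frame['left_shoulder']
    right_sx, right_sy = frame['right_shoulder']
    shoulder_width = abs(right_sx - left_sx)
    shoulder_y = (left_sy + right_sy) / 2.0
    return hand_y - shoulder_y > drop_factor * shoulder_width

def segment_gestures(frames):
    """Yield lists of consecutive non-resting frames, i.e. candidate gestures
    that start and end in the resting position.
    """
    current = []
    for frame in frames:
        if is_resting(frame):
            if current:              # the hand returned to rest: gesture ended
                yield current
                current = []
        else:
            current.append(frame)    # inside a gesture
```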
Matching a new gesture with learned gestures

Once a gesture is segmented, the next task is to compute the distance between the new gesture trajectory and each of the learned gestures in the database, and to match the gesture with the closest one. Any distance (Euclidean, Manhattan, …) that aligns the i-th point of one time series with the i-th point of the other will produce a poor similarity score. A non-linear (elastic) alignment produces a more intuitive similarity measure, allowing similar shapes to match even if they are out of phase in the time axis and have a different number of elements. This is needed because a gesture is performed at a different pace every time, even by the same person, and especially by different people. Therefore, the Dynamic Time Warping (DTW) algorithm is used to compute the distance.

Dynamic Time Warping (DTW)

DTW is an algorithm that computes the distance between two sequences with a different number of elements. Let us denote the two sequences as $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_m)$. The dynamic programming formula used to compute the cumulative distance between each pair of points is

$D(i, j) = d(a_i, b_j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}$,

with $D(0, 0) = 0$ and $D(i, 0) = D(0, j) = \infty$, where $d(a_i, b_j)$ is the local distance between points $a_i$ and $b_j$. Finally, the distance between the two sequences is obtained as $DTW(A, B) = D(n, m)$.

We choose the learned gesture with the smallest distance to the current gesture, compute the confidence of this winning gesture, and send this information to the integrator module if the confidence is above a predefined, empirically set threshold. Please refer to paper [3] for more details about the method we follow. Note that we do not use the mirroring mentioned in the paper, i.e. if a gesture is learned with a certain hand, it has to be performed with the same hand in order to be recognized. We skip mirroring because we want to recognize "Left" and "Right" as different commands.
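A minimal sketch of the DTW distance and the nearest-gesture matching, assuming gestures are stored as normalized (x, y) trajectories (e.g. as produced by gesture_trajectory above); the confidence measure and threshold handling below are illustrative assumptions, not necessarily the exact method of [3]:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two trajectories of shapes
    (n, 2) and (m, 2); the trajectories may have different lengths.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)     # D(i, 0) = D(0, j) = infinity
    D[0, 0] = 0.0                           # D(0, 0) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local distance d(a_i, b_j)
            # Recurrence: extend the cheapest of the three allowed predecessors.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]                          # DTW(A, B) = D(n, m)

def match_gesture(trajectory, learned, threshold=1.2):
    """Return the label of the closest learned gesture, or None if the
    confidence is below the (empirically set) threshold. `learned` maps
    gesture labels to reference trajectories; the second-best/best distance
    ratio used as confidence is our own assumption.
    """
    ranked = sorted((dtw_distance(trajectory, ref), label)
                    for label, ref in learned.items())
    best_dist, best_label = ranked[0]
    if len(ranked) > 1 and best_dist > 0:
        confidence = ranked[1][0] / best_dist   # how clearly the winner stands out
    else:
        confidence = float('inf')
    return best_label if confidence >= threshold else None
```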