Karappa, Virendra (2014-12). Detection of Sign-Language Content in Video through Polar Motion Profiles. Master's Thesis. Thesis uri icon

abstract

  • Locating sign language (SL) videos on video sharing sites (e.g., YouTube) is challenging because search engines generally do not use the visual content of videos for indexing. Instead, indexing is done solely based on textual content (e.g., title, description, metadata etc.). As a result, untagged SL videos do not appear in the search results. In this thesis, we present and evaluate an approach to detect SL content in videos based on their visual content. Our work focuses on detection of SL content and not on transcription. Our approach relies on face detection and background modeling techniques, combined with a head-centric polar representation of hand movements. The approach uses an ensemble of Haar-based face detectors to define regions of interest (ROI) and a probabilistic background model to segment movements in the ROI. The resulting two-dimensional (2D) distribution of foreground pixels in the ROI is then reduced to two 1D polar motion profiles (PMPs) by means of a polar-coordinate transformation. These profiles are then used for classification of SL videos from others. We evaluate three distinct approaches to process information from the PMPs for classification/detection of SL videos. In the first method, we average out the PMPs across all the ROIs to obtain a single PMP vector for each video. These vectors are then used as input features for an SVM classifier. In the second method, we follow the bag-of-words approach of information retrieval to compute a distribution of PMPs (bag-of-PMPs) for each video. In the third method, we perform linear discriminant analysis (LDA) of PMPs and use the distribution of PMPs projected in the LDA space for classification. When evaluated on a dataset comprising of 205 videos (obtained from YouTube), the average PMP approach achieves a precision of 81% and recall of 94%, whereas the bag-of-PMPs approach leads to a precision of 72% and recall of 70%. In contrast to the first two methods, supervised feature extraction by the third method achieves a higher precision (84%) and recall (94%). Though this thesis presents a successful means by which to detect sign language in videos, our approaches do not consider temporal information, only the distribution of profiles for a given video. Future work should consider extracting temporal information from the sequence of PMPs to utilize the dynamic signatures of sign languages and potentially improve retrieval results. The SL detection techniques presented in this thesis may be used as an automatic tagging tool to annotate user-contributed videos in sharing sites such as YouTube, in this way making sign-language content more accessible to members of the deaf community.
  • Locating sign language (SL) videos on video sharing sites (e.g., YouTube) is challenging because search engines generally do not use the visual content of videos for indexing. Instead, indexing is done solely based on textual content (e.g., title, description, metadata etc.). As a result, untagged SL videos do not appear in the search results. In this thesis, we present and evaluate an approach to detect SL content in videos based on their visual content. Our work focuses on detection of SL content and not on transcription. Our approach relies on face detection and background modeling techniques, combined with a head-centric polar representation of hand movements. The approach uses an ensemble of Haar-based face detectors to define regions of interest (ROI) and a probabilistic background model to segment movements in the ROI. The resulting two-dimensional (2D) distribution of foreground pixels in the ROI is then reduced to two 1D polar motion profiles (PMPs) by means of a polar-coordinate transformation. These profiles are then used for classification of SL videos from others.

    We evaluate three distinct approaches to process information from the PMPs for classification/detection of SL videos. In the first method, we average out the PMPs across all the ROIs to obtain a single PMP vector for each video. These vectors are then used as input features for an SVM classifier. In the second method, we follow the bag-of-words approach of information retrieval to compute a distribution of PMPs (bag-of-PMPs) for each video. In the third method, we perform linear discriminant analysis (LDA) of PMPs and use the distribution of PMPs projected in the LDA space for classification. When evaluated on a dataset comprising of 205 videos (obtained from YouTube), the average PMP approach achieves a precision of 81% and recall of 94%, whereas the bag-of-PMPs approach leads to a precision of 72% and recall of 70%. In contrast to the first two methods, supervised feature extraction by the third method achieves a higher precision (84%) and recall (94%).

    Though this thesis presents a successful means by which to detect sign language in videos, our approaches do not consider temporal information, only the distribution of profiles for a given video. Future work should consider extracting temporal information from the sequence of PMPs to utilize the dynamic signatures of sign languages and potentially improve retrieval results. The SL detection techniques presented in this thesis may be used as an automatic tagging tool to annotate user-contributed videos in sharing sites such as YouTube, in this way making sign-language content more accessible to members of the deaf community.

publication date

  • December 2014