I am so happy to be selected for GSoC, and super excited to be working on my project “News Shot Classification” as a RedHen! While searching for organizations that aligned with my interests, I came across RedHen Lab and found that they were just the right organization for me. RedHen focuses on using multi-modal learning for research in communication studies. The organization primarily works on its NewsScape archive, which comprises a huge number of news broadcasts across various news networks, along with caption data (obtained from CCExtractor). Multi-modal processing involves extracting semantic information from visual, textual, and audio data (all of which are salient in a video).
My task is to analyze news videos at a low, ‘shot’ level, and classify shots using predominantly visual features, plus text when possible. My first meeting with Prof. Steen (UCLA, Communication Studies) helped me gain some real insight into what RedHen is all about. It is an awesome feeling to be part of such a community!
As the community bonding period was coming to an end, I decided to organize my thoughts, make plans, and track my progress throughout the summer here on my blog.
My objective -> Detect and classify news shots primarily on the basis of visual semantic information.
Coming up with labels for the classification -> As most frames in a news broadcast contain a person talking, I need to think about what kind of labels would distinguish shots and return useful information that can be used for further processing.
Possible categories I could try:
- Indoor (for example news studio) and outdoor scene (any kind of field footage).
- Main focus being a talking head (human-centric) (e.g. anchor, reporter, meteorologist) or background visual (b-roll)
- Salient objects/event in the scene (vehicle [accidents], building, mic, gun, fire/natural disaster etc.)
- Presence of a crowd
- Graphics (present for very short duration)
Feature extraction from convolutional neural networks (CNNs):
Normally, a classification task like the above would involve extracting features using HOG, BoVW, SIFT matching, face detection, histogram correlation, etc. However, my dataset is large and mostly unlabeled, and training an image recognition system from scratch does not seem feasible. Therefore, my goal is to use an existing image recognition system (such as a pretrained model in Caffe) to extract useful features from my dataset, which can then be used as input to a separate machine learning system.
Read about CNNs here.
generic features (from CNNs) -> strong discriminative classifiers (SVMs) -> labels
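A minimal sketch of this pipeline, with random vectors standing in for CNN features (the 4096 dimension mimics a CaffeNet fc7 vector; the two classes, data, and scikit-learn's LinearSVC as the discriminative classifier are all placeholder assumptions):

```python
# Sketch of the features -> SVM -> labels pipeline. The 4096-d vectors
# stand in for fc7 activations from a pretrained CNN; the data and the
# two shot classes are synthetic placeholders.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

# Pretend CNN features for two shot classes, e.g. "anchor" vs "field".
anchor_feats = rng.normal(loc=0.0, scale=1.0, size=(50, 4096))
field_feats = rng.normal(loc=1.0, scale=1.0, size=(50, 4096))

X = np.vstack([anchor_feats, field_feats])
y = np.array([0] * 50 + [1] * 50)   # 0 = anchor shot, 1 = field footage

clf = LinearSVC(C=1.0)
clf.fit(X, y)

# Classify a new "shot" feature vector.
pred = clf.predict(rng.normal(loc=1.0, size=(1, 4096)))
```

With real fc7 features in place of the random vectors, the same two lines of training code would apply unchanged.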
Caffe is an open-source neural network library developed at UC Berkeley, with a focus on image recognition. It can be used to construct and train your own network, or to load one of the pretrained models. Currently, I plan to use the pretrained models, as training my own network would still involve a great deal of manual annotation.
To get a brief idea of what kind of features a pretrained model would return on images from news videos, I tried out their web demo, which I found here.
Humans (reporter shot, anchor shot) -> clothing, garment, neck brace etc.
Outdoor scenes with buildings -> Structure, building, formation etc.
Natural disaster, burning house, explosion -> natural elevation, geological formation, volcano
The demo uses the CaffeNet model, based on the network architecture for ImageNet. However, the results were a bit unflattering on my dataset. As most of my images are similar (most of the frame is a person talking), the labels were generic and clubbed together across different scenes. I also noticed that the on-screen text (the ticker where headlines are displayed) could return unwanted features (‘scoreboard’) and may need to be segmented out during preprocessing.
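One simple preprocessing step would be to crop the ticker band away before the frame reaches the CNN. A sketch, assuming the ticker sits in the bottom of the frame (the 20% band height is a guess I would need to tune per network):

```python
# Sketch: drop the lower ticker band from a frame before feature
# extraction. The band height (20% of the frame) is an assumed value.
import numpy as np

def crop_ticker(frame, band_fraction=0.2):
    """Return the frame with the bottom `band_fraction` of rows removed."""
    h = frame.shape[0]
    cut = int(h * (1.0 - band_fraction))
    return frame[:cut]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder frame
cropped = crop_ticker(frame)                     # -> shape (384, 640, 3)
```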
The last fully connected layer of the CNN feeds a softmax that produces the (ImageNet) label probabilities, so its output tries to fit my image to the ImageNet categories. Hence, I will need to experiment with features from earlier layers (e.g. fc6/fc7) rather than the last FC layer, as these contain more generic semantic information.
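In Caffe's Python interface, intermediate activations are available through `net.blobs` after a forward pass, so grabbing fc7 instead of the softmax output looks roughly like this (the model file paths are placeholders for the deploy prototxt and caffemodel that ship with the Caffe distribution):

```python
# Sketch: pull fc7 activations from a pretrained CaffeNet instead of the
# final softmax probabilities. Assumes the image is already preprocessed
# to the network's input shape (3, 227, 227).
import numpy as np

def extract_fc7(net, image):
    """Forward one preprocessed image and return its fc7 activations."""
    net.blobs['data'].data[...] = image
    net.forward()
    return net.blobs['fc7'].data[0].copy()  # 4096-d generic feature vector

# Typical wiring (requires a working Caffe install; paths are placeholders):
# import caffe
# net = caffe.Net('deploy.prototxt', 'bvlc_reference_caffenet.caffemodel',
#                 caffe.TEST)
# feat = extract_fc7(net, preprocessed_image)
```

These fc7 vectors would then feed the SVM stage rather than being squashed into ImageNet class probabilities.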
The Caffe distribution comes with a few models, and more can be found at the Model Zoo wiki page. I will need to play around with different models to see which features work best for my dataset. As of now, I will experiment with:
1. GoogLeNet (instant image recognition)
2. R-CNN (for object detection)
3. Multilabel classification on PASCAL VOC
4. Places-CNN model from MIT (or Places_CNDS_models) for scene recognition (indoor/outdoor)
5. VGG Face CNN descriptor
Other things I plan to try:
– Use Weixin Li’s (PhD student at UCLA and fellow RedHen) face detection pipeline output to compare key frames of shots to determine if the shot is human centric i.e. primarily involving a talking head.
– Use indicators such as very short shot duration, absence of caption data and faces, low-confidence CNN labels, and color histograms to flag graphics.
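The color-histogram cue for graphics can be sketched as follows: flat computer graphics pile most of their pixel mass into a few histogram bins, while natural footage spreads it out. The bin count, top-k, and any threshold are assumptions I would have to tune:

```python
# Sketch: a concentrated grayscale histogram as a crude "graphics"
# indicator. Bin count and top_k are assumed values to be tuned.
import numpy as np

def histogram_peak_mass(gray, bins=32, top_k=3):
    """Fraction of pixels falling in the top_k most populated bins."""
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256))
    hist = hist / hist.sum()
    return np.sort(hist)[-top_k:].sum()

# Synthetic examples: a flat graphic vs. a noisy natural frame.
flat_graphic = np.full((100, 100), 200, dtype=np.uint8)
noisy_scene = np.random.RandomState(0).randint(0, 256, (100, 100),
                                               dtype=np.uint8)
```

A high peak mass (near 1.0 for the flat graphic) would then be combined with the other indicators rather than used alone.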
As of now, my plan is to see how accurately shots can be classified using only visual features, and later incorporate text features from caption data and on-screen text.
- Go through CSA imageflow and search for clips with significant field footage.
- Detect shots using pyscenedetect (an open-source tool), which efficiently implements threshold-based and content-aware detection.
- Annotate a sufficient number of shots using RedHen Rapid Annotator or ELAN. Annotate more later, depending on the classification output.
- Experiment with pretrained models on a small set of images.
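The content-aware idea behind pyscenedetect can be sketched as thresholding frame-to-frame histogram differences. Everything here (bin count, the 0.5 threshold, grayscale instead of the tool's HSV comparison) is a stand-in for the tool's own tuned defaults:

```python
# Sketch of content-aware shot detection: flag a shot boundary wherever
# consecutive frames' histograms differ by more than a threshold.
import numpy as np

def frame_hist(frame, bins=32):
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_cuts(frames, threshold=0.5):
    """Return indices where a new shot begins."""
    cuts = []
    prev = frame_hist(frames[0])
    for i in range(1, len(frames)):
        cur = frame_hist(frames[i])
        if np.abs(cur - prev).sum() > threshold:  # L1 histogram distance
            cuts.append(i)
        prev = cur
    return cuts

# Synthetic "video": 5 dark frames, then 5 bright frames -> one cut.
dark = [np.full((48, 64), 20, dtype=np.uint8)] * 5
bright = [np.full((48, 64), 230, dtype=np.uint8)] * 5
cuts = detect_cuts(dark + bright)
```

In practice I will rely on pyscenedetect itself; this only illustrates the principle its content mode is built on.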