Google Summer of Code #2

The last month has been quite busy but thoroughly enjoyable. Between developing my code, studying for the upcoming GRE exam, making time for friends, and trying out new cuisines, I’ve had to learn to manage my time well. I’m back home in LA, and the last two days have been catch-up time with my family, hence the delay in this post. With that said, here is my midterm update:

Objective: Generate labels using visual features to classify news shots:

Current labels:

  1. Scene_type: indoor/outdoor
  2. Studio shot
  3. Reporter/Correspondent shot
  4. Background roll – background visual used by the anchor or reporter to describe the scene
  5. Graphics
  6. Hybrid – shots that combine background roll with a reporter or anchor in a split screen
  7. Weather report – occurs in most news programs
  8. Sports report

Other labels:

  1. Vehicle
  2. Weapon
  3. Nature
  4. Clothing
  5. Place/building

Tasks done:

  1. Shot detection using PySceneDetect.
  2. Keyframe extraction: To avoid processing every frame in a shot and to increase efficiency, I decided to use keyframes which best represent each shot. One set of keyframes is the I-frames (intra-coded pictures): reference pictures that encode a full image, detected with ffmpeg.
  3. Face detection using dlib, which I used for gender recognition as well as studio shot detection. I elaborate on the use cases in the corresponding sections.
  4. Experimented with different BVLC Caffe models: The pretrained Caffe models have each been trained on a particular type of dataset, but are said to be quite powerful, so I wanted to see how they would perform on our news dataset. I also made a minor change in my code to allow batch processing of images through Caffe, speeding up classification.
  5. AlexNet Places CNN by MIT: The output AlexNet Places CNN was originally intended for is a scene label from among 205 proposed classes. The output label seems reasonable in some cases but is not consistent overall, as our dataset is quite multifarious. For example, the same scene is classified as television_studio in some frames and then conference_center in a few others. Though of some relevance, it isn’t reliable enough, so I am not keen on retaining this output as a label.
    Right now, I have used this model to directly output the scene_type label, which says whether a scene is indoor or outdoor. A problem exists for those cases which show a projection of an outdoor background while being indoor (such as a studio background); there doesn’t seem to be a way for a machine to identify the truth when it is being deceived. Apart from scene_type, I have also extracted the SUN scene attributes (such as man-made, working, enclosed area, business), which seem quite reasonable and contain possibly useful information.
    Lastly, I have also extracted the fully connected layers fc7 and fc6 as features to train the SVM classifiers on.
  6. I also came across an age and gender CNN from the Caffe model zoo, trained on a large dataset in which each image comprises a single cropped face. For it to work on our dataset, I first needed to identify frames containing only one face and then feed the cropped face to the gender CNN. I used dlib for face detection and stored the cropped images in a temporary directory, which is erased after the task completes. However, this relies on the face detection output being absolutely accurate, which is not the case. There were cases where only a single face was detected in the presence of multiple faces, giving the false impression that the frame primarily shows a single person of a given gender. I have not annotated frames with a gender label, so I do not know the accuracy of this step.
  7. Used GoogLeNet to identify salient objects in a frame, although directly testing our dataset on the model did not give impressive results. This makes sense, since the data the model was trained on predominantly contains a single object per image, whereas our frames contain various entities which could easily confuse the CNN. I also considered trying the R-CNN object detector, which requires MATLAB to be configured, but in the end decided against using these models for this purpose, as a large number of frames are inappropriate for object classification and would return too many false results.
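As a rough sketch of the I-frame extraction step mentioned above, ffmpeg's `select` filter can keep only intra-coded frames. The file names and output pattern here are placeholders, not the actual paths used in the project:

```python
import subprocess


def iframe_extract_cmd(video_path, out_pattern):
    """Build an ffmpeg command that dumps only I-frames (pict_type == I)."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", "select='eq(pict_type,I)'",  # keep intra-coded frames only
        "-vsync", "vfr",                    # one output image per selected frame
        out_pattern,
    ]


cmd = iframe_extract_cmd("news.mp4", "keyframes/frame_%04d.jpg")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually invoke ffmpeg
```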

SVM Classification:

For camera_shot_type: (Studio, Reporter, Weather, Sports, Hybrid, Background_roll, Graphics, Problem/Unclassified, Commercial):

  • Total number of frames annotated so far: 8683
  • Using a single SVM in one-vs-rest mode to predict all classes of camera_shot_type
  • fc7 features of the AlexNet Places CNN
  • The entire screen of each keyframe was used (including the news ticker), which may be hurting performance
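As a minimal sketch of this setup (using random stand-ins for the fc7 features, since the real ones come from the Caffe forward pass), scikit-learn's `LinearSVC` fits one binary SVM per class in one-vs-rest fashion:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Random stand-ins for the 4096-d fc7 features of annotated keyframes;
# in the real pipeline these come from the AlexNet Places CNN.
rng = np.random.RandomState(0)
X_train = rng.rand(200, 4096)
y_train = rng.randint(0, 9, 200)   # 9 camera_shot_type classes
X_test = rng.rand(50, 4096)

# LinearSVC uses a one-vs-rest scheme internally: one binary SVM per class.
clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(pred.shape)  # one camera_shot_type label per test keyframe
```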

Dataset:    Train – 10 full-length videos (6285 keyframes) from May 2016
Test – 3 full-length videos (2398 keyframes) from 2014
Accuracy:    79%
Remark:    The test and train data differ considerably in news content due to the time gap; however, similar programs from the same networks existed in the train set.

Dataset:    Train – 8 full-length videos from May 2016 and 1 full-length video from 2014
Test – 1 full-length video from May 2016 and 2 full-length videos from 2014
Accuracy:    71%
Remark:    The May 2016 video in the test set had no data in the train set from the same network.

Dataset:    Train – 8 videos from May 2016 and 1 from 2014
Test – 1 from May 2016 and 2 from 2014
Accuracy:    82.5%
Remark:    The May 2016 video in the test set had data in the train set from the same news network.

For scene_type labels (indoor/outdoor):

  • Using the direct output of pretrained AlexNet Places CNN model
  • In general, outdoor scenes are not obvious or clear-cut, whereas indoor scenes such as studio shots are; when indoor shots are frequent, overall accuracy is higher.

Dataset:    2016-06-07_0000_US_CNN_Anderson_Cooper_360_0-3595.mp4
Accuracy:    96.6%
Remark:    This is a talk show and so most shots are talking heads, hence indoor.
Outdoor scenes are not clear-cut.
Indoor: 98% , Outdoor: 77.2%

Dataset:    2016-06-07_0100_US_KABC_Eyewitness_News_6PM_0-1793.mp4
Accuracy:    77.5%
Remark:    Indoor: 95% , Outdoor: 71%

Dataset:    2016-06-07_0000_US_FOX-News_The_OReilly_Factor_0-3595.mp4
Accuracy:    93.5%
Remark:    This is a talk show and so most shots are talking heads, hence indoor.
Outdoor scenes are not clear-cut.
Indoor: 99.6% , Outdoor: 38%

Dataset:    2016-06-07_0100_US_KCBS_CBS_2_News_at_6_0-1735.mp4
Accuracy:    73%
Remark:    Indoor: 94% , Outdoor: 58%

To do:

  1. Use one SVM for each class to improve accuracy.
  2. Remove news ticker
  3. Try features from other Caffe architectures or layers and see how they affect performance
  4. Try to include more labels
    Tentative: crowds, nature, vehicle_traffic, accident/man-made_disaster?
  5. Annotate more videos spread across time, cover more networks
  6. Try training on indoor/outdoor features instead of directly outputting label from CNN and check accuracy
  7. Distinguish between reporter live and reporter on-set?

Libraries and packages used:

  1. PySceneDetect
  2. Caffe and its dependencies (numpy, scipy, matplotlib, etc.)
  3. skimage
  4. sklearn
  5. dlib (face detection)


Tried but decided against using:

Studio shot detection: I wanted to see how well an unsupervised detection method would work compared to the supervised learning above. However, this method turned out to be very specific to certain kinds of shots.

For example, it achieves accuracy above 85% on talk shows like CNN Anderson Cooper 360 or Fox News The O’Reilly Factor, which have a high frequency of studio talking-head shots. It doesn’t work well for shorter programs such as KABC or KCBS.
Taking reference from the paper, I implemented a graph clustering algorithm which detects studio shots under two assumptions. The first is that a studio shot occurs throughout the length of the video, i.e., its lifetime is above a certain threshold. The second is that the shot recurs a minimum number of times, also above a threshold. These are general assumptions which are almost always satisfied.
How the algorithm works:

    1. The color histogram of every keyframe is computed and stored. Histograms are computed for 16 regions in the image and with 10 equally spaced histogram bins.
    2. Histograms of differences are computed between every pair of keyframes and stored in an adjacency matrix (the matrix is sparse, which speeds up computation). The 8 largest (out of 16) regional histogram differences between the two images are rejected, in order to discount small localized changes in the video and compare the histograms of the generalized setting of the frames.
    3. Create the minimum spanning tree using the keyframes as vertices; the distances between keyframes (the edge weights, which determine how closely frames are related) are given by the values of the adjacency matrix.
    4. The MST gives the shortest path connecting all frames, but it does not give groups of similar frames, or clusters. To form clusters, an edge of the MST needs to be removed. To determine which edge, k-means clustering (k = 2) is run on the MST edge weights, forming two clusters: one with low weights, the other with high weights. The greatest weight in the low-weight cluster is taken as the edge to be removed from the MST to form clusters.
    5. Each cluster represents a set of coherently similar frames. The cluster which represents studio shots is the cluster which satisfies the aforementioned assumptions. Thus the ‘lifetime’ of every cluster is evaluated as well as the frequency of occurrence. The clusters which remain after the pruning process have to undergo a final pruning step.
    6. The final pruning uses the fact that anchor shots predominantly contain a face. For every accepted cluster, the average number of faces per frame is computed using face detection. Clusters that return a mean value of more than 1 (i.e., on average at least one face per frame) are finally labelled as studio shots. A slight problem is that the face detection output is not always accurate and may miss a few faces, resulting in a studio cluster being pruned. Another case which needs to be resolved is obtaining more than one accepted cluster (usually not more than two, if at all). This can occur when a particular scene other than the anchor, such as an interview, runs for a prolonged period and satisfies all the above conditions. Also, as of now, the hybrid case (studio background with a split screen displaying other content) is classified as studio, which is not incorrect, but it has not been made a separate class yet.
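The MST and edge-cutting steps above can be sketched roughly as follows. This is a toy illustration: 2-D points stand in for the trimmed histogram-of-differences distances, and for simplicity all edges landing in the high-weight k-means cluster are cut rather than a single threshold edge:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from sklearn.cluster import KMeans

# Toy stand-in for the keyframe distance matrix: two well-separated groups.
rng = np.random.RandomState(1)
pts = np.vstack([rng.rand(5, 2), rng.rand(5, 2) + 5])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

# Steps 3-4: minimum spanning tree over the keyframe graph, then k-means
# (k = 2) on its edge weights to decide which edge(s) to cut.
mst = minimum_spanning_tree(csr_matrix(D))
w = mst.data
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(w.reshape(-1, 1))
high = km.cluster_centers_.argmax()
mst.data[km.labels_ == high] = 0   # cut the high-weight edge(s)
mst.eliminate_zeros()

# The remaining connected components are the candidate frame clusters.
n_clusters, labels = connected_components(mst, directed=False)
print(n_clusters, labels)
```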
