GSOC : Work Product Submission

Organization: Red Hen Lab

My project was done for Red Hen Lab, although it is officially listed under CCExtractor, a sister organization of Red Hen Lab.

Table of Contents:

  1. Introduction
  2. Usage Instructions
  3. The Full Pipeline
  4. Example Output
  5. Update-1: YOLO for Person Detection
  6. Update-2: JSON Lines Output Format
  7. Dataset Preparation
  8. Classification Approaches
  9. Performance
  10. Deployment Details
  11. Repository Link
  12. Previous blog posts
  13. Known Issues/Future Work
  14. Future Collaboration
  15. Citations/Licenses


Introduction

The aim of my project is to classify and analyze news shots of the various kinds encountered in the broadcast videos that Red Hen processes every day. The system extracts visual features (such as the presence of a particular object, the scene type, etc.) in the form of class labels, providing low-level semantic information that may be used for further processing. I have also developed a classifier based on camera shot type (such as Newsperson(s), Background_roll, Graphic, etc.). The pipeline has been deployed on the Case Western Reserve University High Performance Computing cluster, where it processes each new day's incoming videos and extracts visual features from them.

Return to top

Usage Instructions

Clone the repository (linked in the Repository section below), ensuring that all dependencies are correctly installed.
A video can be processed as follows:-

python <path_to_video>

The path to the video file may be absolute, or relative to the script's location.
The code will generate two output files with the same base name as the video: a .sht file (a self-coined extension for "shot file") and a .json file in JSON Lines format.


Required External Files and Models

All the required external files and classifier models can be found here:
The paths to all external files required by the code can be modified to suit the user's setup.
Return to top

The Full Pipeline

Output labels:

  • Camera shot type → [ Newsperson(s), Background_roll, Graphic, Weather, Sports ]
  • Object category → [ Vehicle, Natural formation, Building/Structure, Person(s)/Clothing, Weapon, Sports ]
  • Scene type → [ Indoor, Outdoor ]
  • Imagenet labels with probabilities
  • Places205 labels with probabilities
  • Scene attributes

What Camera Shot type labels mean:

  • Newsperson(s) consists of Studio, Reporter/Correspondent, Hybrid (split screen view of studio, reporter, or background roll)
  • Background_roll consists of footage used to describe a scene by a reporter or studio anchor
  • Graphic consists of graphical images such as branding graphics of the news network (appearing between commercials, at the start of a news story, and so on) and textual graphics present in background roll (quotations or statements)
  • Weather: Graphic images of weather maps or forecasts, may or may not be in the presence of the meteorologist
  • Sports: Sports segment in a news broadcast


The steps followed by the pipeline are:-

  1. Extract intra frames (I frames) using ffmpeg. Intra frames are representative frames of each shot.
  2. Detect shots using the PySceneDetect module, and extract the first and last frame of each shot to add to the list of keyframes.
  3. Crop the news ticker from the bottom of keyframes. This is done so that the text present in the ticker (headlines etc.) does not affect the neural network’s performance.
  4. Obtain scene location labels (scene type and Places205 labels) along with scene attributes.
  5. Extract and save fc7 features extracted from Places205 CNN model.
  6. Train SVM on fc7 features to classify frames into camera shot type labels.
  7. Obtain object category and imagenet labels.
  8. Classify frames using fine tuned caffe net, and get another set of camera shot type labels. The two sets can be used as a crosscheck or validation.
  9. Identify the labels with the highest frequencies within a shot’s set of keyframes and assign the labels to the detected shot boundary.
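Step 1 above can be sketched as follows. This is only an illustration: the exact ffmpeg flags used in the real pipeline may differ, and `iframe_extract_cmd` is a helper name of my own invention.

```python
from pathlib import Path

def iframe_extract_cmd(video_path, out_dir):
    """Return an ffmpeg argv list that writes one PNG per intra (I) frame."""
    out_pattern = str(Path(out_dir) / "keyframe_%05d.png")
    return [
        "ffmpeg", "-i", str(video_path),
        # keep only frames whose picture type is I (the shot-representative frames)
        "-vf", "select=eq(pict_type\\,I)",
        "-vsync", "vfr",  # don't duplicate frames to fill the original timeline
        out_pattern,
    ]

cmd = iframe_extract_cmd("news.mp4", "frames")
```

Running the returned command (e.g. via subprocess) dumps the I-frames, which then become the keyframe pool that the later classification steps operate on.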

Return to top

Example Output

timestamp1| timestamp2| Shot_class | Other shot labels with probabilities

Each keyframe has five lines corresponding to it (Finetune shot class, SVM shot class, Obj class, Scene location, Scene attributes). Each shot boundary is a single line with the most frequently occurring labels among the keyframes belonging to that shot.
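A minimal sketch of parsing these pipe-delimited lines and taking the majority label within a shot; the field layout is assumed from the format above, and the function names are my own, not the pipeline's.

```python
from collections import Counter

def parse_sht_line(line):
    """Split 'ts1| ts2| LABEL_TYPE | label | extras' into its fields."""
    fields = [f.strip() for f in line.split("|")]
    return {"start": fields[0], "end": fields[1],
            "type": fields[2], "label": fields[3]}

def majority_label(keyframe_labels):
    """Most frequent label among a shot's keyframes wins the shot boundary."""
    return Counter(keyframe_labels).most_common(1)[0][0]

lines = [
    "20140510230372.426| 20140510230372.426| SVM_CLASS | Sports",
    "20140510230372.500| 20140510230372.500| SVM_CLASS | Sports",
    "20140510230372.600| 20140510230372.600| SVM_CLASS | Background_roll",
]
labels = [parse_sht_line(l)["label"] for l in lines]
shot_label = majority_label(labels)  # "Sports"
```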

Snippet of the output along with the corresponding keyframe, and true labels written in the form [Camera shot type, Object type, Scene type].

OBJ_SHOT_CLASS=Unclassified| SCENE_TYPE=Outdoor

[Background_roll, Natural formation, Outdoor]

2014051023085.379| 2014051023085.379| FINETUNED_CLASS | Background_roll | (‘Background_roll’, 1.0)
2014051023085.379| 2014051023085.379| SVM_CLASS | Background_roll
2014051023085.379| 2014051023085.379| OBJ_CLASS | Natural formation | (‘geological formation’, 1.75), (‘natural elevation’, 1.0), (‘shore’, 0.96), (‘seashore’, 0.6), (‘promontory’, 0.57)
2014051023085.379| 2014051023085.379| SCENE_LOCATION | Outdoor | (‘dam’, 0.21), (‘river’, 0.2)
2014051023085.379| 2014051023085.379| SCENE_ATTRIBUTES | open area, natural light, sailing/boating, natural, moist

[Sports, Sports, Outdoor]

20140510230372.426| 20140510230372.426| FINETUNED_CLASS | Sports | (‘Sports’, 1.0)
20140510230372.426| 20140510230372.426| SVM_CLASS | Sports
20140510230372.426| 20140510230372.426| OBJ_CLASS | Sports | (‘rugby ball’, 1.9), (‘ball’, 1.89), (‘game equipment’, 1.69), (‘equipment’, 1.19), (‘soccer ball’, 0.61)
20140510230372.426| 20140510230372.426| SCENE_LOCATION | Outdoor | (‘stadium’, 0.99)
20140510230372.426| 20140510230372.426| SCENE_ATTRIBUTES | sports, nohorizon, competing, exercise, open area

[Newsperson(s), Person(s)/Clothing, Outdoor]

20160308200022.119| 20160308200022.119| FINETUNED_CLASS | Newsperson(s) | (‘Newsperson(s)’, 1.0)
20160308200022.119| 20160308200022.119| SVM_CLASS | Newsperson(s)
20160308200022.119| 20160308200022.119| OBJ_CLASS |  Person(s)/Clothing | (‘garment’, 1.04), (‘clothing’, 1.02), (‘sweater’, 1.0), (‘cardigan’, 0.91), (‘covering’, 0.9)
20160308200022.119| 20160308200022.119| SCENE_LOCATION | Indoor
20160308200022.119| 20160308200022.119| SCENE_ATTRIBUTES | nohorizon, natural light, competing, touring, open area

A field reporter can sometimes be misclassified as Background_roll (as below), due to highly frequent and visually similar frames such as an image still of a person, interviews, speeches, and guest speakers, none of which are Newsperson(s).

[Newsperson(s), Person(s)/Clothing, Outdoor]

20150725130452.375| 20150725130452.375| FINETUNED_CLASS | Background_roll | (‘Background_roll’, 0.986)
20150725130452.375| 20150725130452.375| SVM_CLASS | Background_roll
20150725130452.375| 20150725130452.375| OBJ_CLASS |  Person(s)/Clothing | (‘clothing’, 0.65), (‘garment’, 0.62), (‘consumer goods’, 0.58), (‘covering’, 0.58), (‘commodity’, 0.55)
20150725130452.375| 20150725130452.375| SCENE_LOCATION | Outdoor
20150725130452.375| 20150725130452.375| SCENE_ATTRIBUTES | natural light, man-made, far-away horizon, warm, leaves

[Background_roll, Vehicle, Outdoor]

20160308200045.311| 20160308200045.311| FINETUNED_CLASS | Background_roll | (‘Background_roll’, 1.0)
20160308200045.311| 20160308200045.311| SVM_CLASS | Background_roll
20160308200045.311| 20160308200045.311| OBJ_CLASS |  Vehicle | (‘car’, 2.23), (‘motor vehicle’, 2.18), (‘self-propelled vehicle’, 1.97), (‘wheeled vehicle’, 1.72), (‘vehicle’, 1.39)
20160308200045.311| 20160308200045.311| SCENE_LOCATION | Outdoor | (‘parking_lot’, 0.45)
20160308200045.311| 20160308200045.311| SCENE_ATTRIBUTES | man-made, nohorizon, glossy, electric lighting, working

[Background_roll, Building/Structure, Outdoor]

20160308200040.621| 20160308200040.621| FINETUNED_CLASS | Background_roll | (‘Background_roll’, 1.0)
20160308200040.621| 20160308200040.621| SVM_CLASS | Background_roll
20160308200040.621| 20160308200040.621| OBJ_CLASS |  Building/Structure | (‘structure’, 0.82), (‘mercantile establishment’, 0.66), (‘place of business’, 0.65), (‘shop’, 0.63), (‘establishment’, 0.63)
20160308200040.621| 20160308200040.621| SCENE_LOCATION | Outdoor | (‘hospital’, 0.35)
20160308200040.621|20160308200040.621| SCENE_ATTRIBUTES | man-made, open area, natural light, nohorizon, horizontal components

[Newsperson(s), Person(s)/Clothing, Indoor]

201507025130001.202| 201507025130001.202| FINETUNED_CLASS | Newsperson(s) | (‘Newsperson(s)’, 1.0)
201507025130001.202| 201507025130001.202| SVM_CLASS | Newsperson(s)
201507025130001.202| 201507025130001.202| OBJ_CLASS |  Person(s)/Clothing | (‘clothing’, 1.29), (‘garment’, 1.28), (‘consumer goods’, 1.13), (‘commodity’, 1.07), (‘covering’, 1.04)
201507025130001.202| 201507025130001.202| SCENE_LOCATION | Indoor | (‘conference_center’, 0.61)
201507025130001.202| 201507025130001.202| SCENE_ATTRIBUTES | man-made, cloth, enclosed area, business, working

[Newsperson(s), Person(s)/Clothing, Undefined]
20150404190732.544| 20150404190732.544| FINETUNED_CLASS | Newsperson(s) | (‘Newsperson(s)’, 1.0)
20150404190732.544| 20150404190732.544| SVM_CLASS | Newsperson(s)
20150404190732.544| 20150404190732.544| OBJ_CLASS |  Person(s)/Clothing | (‘clothing’, 1.21), (‘garment’, 1.21), (‘consumer goods’, 1.07), (‘commodity’, 1.01), (‘covering’, 0.97)
20150404190732.544| 20150404190732.544| SCENE_LOCATION | Indoor | (‘conference_center’, 0.22)
20150404190732.544| 20150404190732.544| SCENE_ATTRIBUTES | nohorizon, enclosed area, cloth, man-made, studying

[Background_roll, Unclassified, Indoor]

2016060701057.934| 2016060701057.934| FINETUNED_CLASS | Background_roll | (‘Background_roll’, 1.0)
2016060701057.934| 2016060701057.934| SVM_CLASS | Background_roll
2016060701057.934| 2016060701057.934| OBJ_CLASS | Unclassified | (‘furniture’, 0.62), (‘furnishing’, 0.58), (‘desk’, 0.58), (‘table’, 0.54), (‘structure’, 0.35)
2016060701057.934| 2016060701057.934| SCENE_LOCATION | Indoor | (‘office’, 0.25)
2016060701057.934| 2016060701057.934| SCENE_ATTRIBUTES | nohorizon, enclosed area, cloth, working, paper

[Graphic, Unclassified, None]

20160607010062.230| 20160607010062.230| FINETUNED_CLASS | Graphic | (‘Graphic’, 1.0)
20160607010062.230| 20160607010062.230| SVM_CLASS | Graphic
20160607010062.230| 20160607010062.230| OBJ_CLASS | Unclassified | (‘device’, 0.36), (‘fastener’, 0.29), (‘covering’, 0.28), (‘restraint’, 0.28), (‘equipment’, 0.22)
20160607010062.230| 20160607010062.230| SCENE_LOCATION | Indoor | (‘bar’, 0.31), (‘music_studio’, 0.29)
20160607010062.230| 20160607010062.230| SCENE_ATTRIBUTES | nohorizon, enclosed area, man-made, farming, using tools

[Background_roll, Weapon, Indoor]

20160621060193.656| 20160621060193.656| FINETUNED_CLASS | Background_roll | (‘Background_roll’, 1.0)
20160621060193.656| 20160621060193.656| SVM_CLASS | Background_roll
20160621060193.656| 20160621060193.656| OBJ_CLASS |  Weapon | (‘revolver’, 1.65), (‘pistol’, 1.63), (‘firearm’, 1.57), (‘gun’, 1.47), (‘weapon’, 1.3)
20160621060193.656| 20160621060193.656| SCENE_LOCATION | Indoor
20160621060193.656| 20160621060193.656| SCENE_ATTRIBUTES | enclosed area, nohorizon, wood, man-made, electric lighting

Sometimes, the FINETUNED_CLASS and SVM_CLASS may disagree, as in the example below; in such cases, the fine-tuned output is usually the more accurate of the two.

[Weather, Unclassified, Indoor]

20150725130712.533| 20150725130712.533| FINETUNED_CLASS | Weather | (‘Weather’, 0.999)
20150725130712.533| 20150725130712.533| SVM_CLASS | Background_roll
20150725130712.533| 20150725130712.533| OBJ_CLASS |  Unclassified | (‘elasmobranch’, 0.97), (‘ray’, 0.97), (‘cartilaginous fish’, 0.95), (‘electric ray’, 0.8), (‘fish’, 0.79)
20150725130712.533| 20150725130712.533| SCENE_LOCATION | Indoor | (‘office’, 0.25)
20150725130712.533| 20150725130712.533| SCENE_ATTRIBUTES | nohorizon, warm, vegetation, man-made, electric lighting

[Weather, Unclassified, Indoor]

20160621060428.835| 20160621060428.835| FINETUNED_CLASS | Weather | (‘Weather’, 1.0)
20160621060428.835| 20160621060428.835| SVM_CLASS | Weather
20160621060428.835| 20160621060428.835| OBJ_CLASS |  Unclassified | (‘musical instrument’, 0.48), (‘clothing’, 0.34), (‘covering’, 0.34), (‘device’, 0.32), (‘consumer goods’, 0.32)
20160621060428.835| 20160621060428.835| SCENE_LOCATION | Indoor | (‘office’, 0.25)
20160621060428.835| 20160621060428.835| SCENE_ATTRIBUTES | nohorizon, warm, vegetation, man-made, electric lighting

Return to top

Update-1: YOLO for Person Detection

The Caffe model for YOLO (based on Darknet) was later added to the pipeline to focus on person detection within shots. See YOLO: Real-Time Object Detection

Return to top

Update-2: JSON Lines Output Format

The initial output file was stored as a .sht file with the format:
(YYYYMMDDHHMM.S) timestamp1| timestamp2| Shot_class | Other shot labels with probabilities

We now use JSON Lines, converting the .sht file to a .json file structured as follows:

Shots: an array of shots
[ {Shot1}, {Shot2}, {Shot3}, … ]

Shot: each shot is a dictionary
"SCENE_TYPE": label,
"FRAMES": [ {Frame1}, {Frame2}, … ],
"START": value,
"END": value

Frames: an array of frames
[ {Frame1}, {Frame2}, {Frame3}, … ]

Frame: each frame is a dictionary
"FINETUNED_CLASS": { "Probable Label": value, "Other Labels": { "label": probability } },
"SVM_CLASS": label,
"OBJ_CLASS": { "Probable Label": value, "Other Labels": { "label": probability } },
"SCENE_LOCATION": { "Probable Label": value, "Other Labels": { "label": probability } },
"SCENE_ATTRIBUTES": [ label1, label2, label3, … ],
"Count": value,
"Person#X": { "Probability": value, "Position": { "y": val, "x": val, "w": val, "h": val } }
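A minimal sketch of the JSON Lines convention itself: one complete JSON object per line, with no enclosing array. The sample shot below is invented, following the schema above.

```python
import io
import json

shot = {
    "START": "20160308200022.119",
    "END": "20160308200045.311",
    "SCENE_TYPE": "Outdoor",
    "FRAMES": [
        {
            "FINETUNED_CLASS": {"Probable Label": "Background_roll",
                                "Other Labels": {"Background_roll": 1.0}},
            "SVM_CLASS": "Background_roll",
            "SCENE_ATTRIBUTES": ["man-made", "nohorizon"],
        }
    ],
}

# Writing: each shot is serialized onto its own line.
buf = io.StringIO()
buf.write(json.dumps(shot) + "\n")

# Reading back: every line parses independently, so files can be streamed.
buf.seek(0)
shots = [json.loads(line) for line in buf if line.strip()]
```

The per-line independence is the main advantage over a single JSON array: a day's output can be appended to and consumed incrementally.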

Return to top

Dataset Preparation

The dataset was compiled from videos in Red Hen Lab's NewsScape database: a selection of full-length news broadcasts from a variety of networks, from different shows within a network, and from different periods of time.
The training set comprises 29 videos (approx. 25 hours of data). The test set contains 10 videos, no two of which are from the same show, as it was crucial to ensure that the learning model had never seen the test data.
The annotations were done by a single annotator (me) with the help of Red Hen Lab's Rapid Annotator.

A total of 21,101 images (keyframes) were annotated for training, of which 12,407 images were ‘Commercials’ and redundant. Hence, the actual train data consisted of 8,694 images.
Similarly, out of 7,303 images (including commercials), 3,339 images were used for testing.
The training set is imbalanced, because classes such as Newsperson(s) and Background_roll occur much more frequently than Weather or Sports. However, since the same class distribution is present in the actual data to be predicted on, this was not a significant issue in practice.

The list of videos can be found here:
Return to top

Tried Approaches for Classification

Graph Clustering

My first attempt was an unsupervised learning method, graph clustering; after an initial look at the data, it had seemed that supervised learning would not work very well. Taking a research paper as reference, I implemented a graph clustering algorithm.
However, this method was discarded, as it was limited to the nature of one class, and a supervised learning approach was taken up instead.

Linear SVMs

One method of transfer learning is to use a convolutional neural network as a feature extractor and then train a linear classifier such as an SVM on the dataset. The model I used was PlacesCNN from the BVLC Model Zoo, which is based on the AlexNet architecture.
By removing the last fully connected layer (whose outputs are the class scores for the network's original task), I extracted a 4096-dimensional feature vector for each keyframe image from the fc7 (penultimate) layer.

After combining classes with high visual similarity, I trained an SVM for each combination:
– 5 class SVM in OneVsRest mode → [ Newsperson(s), Background_roll, Graphic, Weather, Sports ]
– 5 class SVM in OneVsOne mode
– 3 class SVM in OneVsRest mode → [ Newsperson(s), Background_roll, Graphic]
A comparison of how the models fared is given in the Performance Analysis section.
Finally, the 5class SVM model in OneVsRest mode is used in the pipeline.
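To make the one-vs-rest decision rule concrete, here is a stdlib-only sketch in which a trivial per-class scorer (negative distance to the class mean) stands in for the real linear SVM decision functions trained on the 4096-dimensional fc7 features; `fit_ovr` and `predict_ovr` are names of my own invention, not the pipeline's.

```python
from math import dist  # Python 3.8+

def fit_ovr(features, labels):
    """One 'scorer' per class: here, simply the mean feature vector."""
    centroids = {}
    for c in set(labels):
        rows = [f for f, l in zip(features, labels) if l == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_ovr(centroids, x):
    """Each class scores the sample independently; the highest score wins."""
    return max(centroids, key=lambda c: -dist(x, centroids[c]))

# Toy 2-D features standing in for 4096-D fc7 vectors.
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
y = ["Graphic", "Graphic", "Sports", "Sports"]
model = fit_ovr(X, y)
pred = predict_ovr(model, [4.8, 5.2])  # "Sports"
```

The structure is the same as scikit-learn's OneVsRest mode: one per-class scoring function, with an argmax over scores at prediction time.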

Fine tuning a Convolutional Neural Network


The other transfer learning method employed was fine-tuning the weights of the pre-trained reference Caffe network. I fine-tuned the network by replacing only the last layer, since the earlier layers of a CNN capture more generic features, while the later layers become more specific to the classes of the original dataset.
Again, combinations of classes were used to create three fine-tuned models:
– 8 class →  [ Studio, Reporter, Hybrid, Talking_head, Background_roll, Graphic, Weather, Sports ]
– 5 class → [ Newsperson(s), Background_roll, Graphic, Weather, Sports ]
– 3 class → [ Newsperson(s), Background_roll, Graphic]
Each model was fine tuned for 100,000 iterations, which took around 4 hours of processing time each, on a K40 GPU.
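Caffe drives fine-tuning from a solver definition. A solver of the kind used here might look like the following; the paths and all hyperparameters other than max_iter are illustrative assumptions, not the project's actual values.

```
net: "models/shot_type/train_val.prototxt"   # placeholder path
base_lr: 0.001          # low base rate so pre-trained layers barely move
lr_policy: "step"
stepsize: 20000
gamma: 0.1
max_iter: 100000        # matches the 100,000 iterations mentioned above
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "models/shot_type/finetune"
solver_mode: GPU
```

Fine-tuning is then launched with `caffe train --solver=solver.prototxt --weights=bvlc_reference_caffenet.caffemodel`, which initializes the shared layers from the pre-trained weights while the replaced last layer starts fresh.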
Return to top

Performance Analysis

Usually, when the class distribution is imbalanced, as in this case, accuracy is not used, since it rewards models that simply predict the most frequent class. Hence, metrics such as Precision, Recall, and F-score were computed class-wise.

Performance measures were calculated for the 8 camera shot type labels (such as ‘Newsperson(s)’, ‘Hybrid’, etc.), for scene type (‘Indoor’, ‘Outdoor’), and for the object type categories (such as ‘Vehicle’, ‘Natural formation’, etc.). However, no performance measures were calculated for Places205 labels, ImageNet labels, or scene attributes: these come directly from pre-trained models, and annotating all their possible labels on the test data would be infeasible. Instead, the probability of each label is output, to assist in judging its reliability.
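The class-wise metrics can be computed directly from parallel lists of true and predicted labels. A minimal stdlib sketch (`per_class_prf` is a hypothetical helper, not the pipeline's own code):

```python
def per_class_prf(true, pred):
    """Per-class (precision, recall, F-score) from parallel label lists."""
    scores = {}
    for c in set(true) | set(pred):
        tp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores

true = ["Sports", "Sports", "Weather", "Graphic"]
pred = ["Sports", "Weather", "Weather", "Graphic"]
scores = per_class_prf(true, pred)
```

Note how a majority-class predictor would score zero precision and recall on every minority class here, which is exactly what plain accuracy hides.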

The detailed results for each label type for which I have trained a classifier can be seen below:-

Fine tuned caffenet shot type:

SVM Shot type:

Scene type and Object category:

Why were some classes combined?

Classes such as Studio, Reporter, and Hybrid were visually similar, in the sense that they all consist of talking human heads with varying backgrounds, but they were also logically one category, i.e. Newsperson(s).

This visual similarity occurred in Background_roll as well, in the case of interviews or guest speakers. Hence these talking heads in the Background_roll class could not logically be combined as Newsperson(s), leading to some definite misclassification.

Sometimes, it was a tricky situation between Newsperson(s) and another label such as Sports or Weather, as both seemed valid.

I also tried merging Weather into the Graphic class; however, performance wasn't significantly better, so I decided to preserve the diversity of classes and keep them separate.

Studio frames which had backgrounds projected as outdoor environments, were misleading with respect to scene type (Outdoor instead of Indoor).


Return to top

Chart Analysis:



The chart is a comparison of three finetuned models. The 5class fine tuned model was marginally better, and hence used in the final pipeline.

Comparing the SVM models, the 5class SVM in OVR mode gave higher performance measures and was incorporated into the pipeline.


By comparing the two selected models incorporated in the pipeline, we can see that the 5class finetuned Caffe net performs significantly better than the 5class SVM.

Final Results
Metric            | 5class Finetune | 3class Finetune | 8class Finetune | 5class SVM OVR | 3class SVM OVR | 5class SVM OVO | Object Type | Scene Type
Overall Precision | 86.51           | 86.05           | 79.56           | 78.63          | 78.22          | 79.22          | 72.15       | 87.27
Overall Recall    | 84.23           | 84.76           | 74.74           | 74.56          | 71.47          | 69.97          | 66.48       | 86.52
Overall F-score   | 85.34           | 85.39           | 76.28           | 76.42          | 74.57          | 73.39          | 69.13       | 86.85

Return to top

Deployment Details

The entire pipeline has been deployed on the Case Western Reserve University High Performance Computing cluster. A challenging part of working in a cluster environment was setting up all the required dependencies, such as Caffe, without root privileges. Issues such as conflicting dependencies and various compilation errors made the setup process time-consuming, while also being a great learning experience.

I wrote scripts which interface with SLURM, the resource allocation manager that the cluster uses. The scripts copy the day’s incoming news videos to the cluster and then submit batch jobs that process the videos and create .sht (and .json) output files. On average, the entire day’s videos (around 150 hours of data) get processed in approximately 3-4 hours.
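As a sketch of what such a SLURM interface looks like, the helper below builds an sbatch invocation for one video; the job-script name and resource limits are placeholders, not the pipeline's actual values.

```python
def sbatch_cmd(video, time_limit="02:00:00", mem="8G"):
    """Build the sbatch argv for processing a single video file."""
    return [
        "sbatch",
        f"--job-name=shots_{video}",
        f"--time={time_limit}",
        f"--mem={mem}",
        "process_video.sh",   # hypothetical per-video job script
        video,
    ]

cmd = sbatch_cmd("2016-06-21_0600_US_CNN_Newsroom.mp4")
# On the cluster, subprocess.run(cmd) would submit the batch job.
```

Submitting one job per video lets SLURM schedule the day's roughly 150 hours of footage across nodes in parallel, which is how the whole day finishes in 3-4 hours.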


You can process news videos on Case HPC in two ways:

  1. Process a list of videos using -l flag:
    Run  ./ -l <list>.txt
    <list>.txt contains YYYY-MM-DD_HOUR_NETWORKNAME.mp4 (only basenames of files)
  2. Process a particular day’s worth of news videos using -d flag:
    Run  ./ -d YYYY/MM/DD

Edit the VIDEO_DST variable to change the destination path of the processed video files.
Return to top

Link to Repository / Commits

The project repository is at:

All the commits can be seen at:
Return to top

Links to Blog Posts

I have written two blog posts about GSOC during the program:-
Return to top

Known Issues / Future Work Needed

  • The current system performs well on videos that share some similarity with the training data, which primarily comprises US networks and certain programs.
  • Scene type labels (Indoor, Outdoor) are output for every shot, irrespective of shot type. Technically, scene type is most applicable to Newsperson(s) and Background_roll shots; however, this conditioning was deliberately not applied, so as to keep the two sets of labels independent of each other's performance. Similarly, ImageNet labels are output for every shot and may not be relevant for some shot classes (such as Graphic or Weather).
  • The system works best for videos which follow a definite program structure. Work needs to be done to expand the dataset quantitatively and in terms of diversity of shows/programs.
  • Commercials need to be identified before using .sht (or .json) output, as labels such as Camera Shot type are not applicable.

Return to top

Carrying on working with Red Hen

It was an excellent experience working for Red Hen Lab this summer, and learning so much. I will most certainly continue to maintain and improve my project, and keep contributing to the organization.
Return to top

Citations / Licenses

  1. Places205-AlexNet model:
    B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva
    Learning Deep Features for Scene Recognition using Places Database.
    Advances in Neural Information Processing Systems 27 (NIPS) spotlight, 2014.
  2. GoogleNet model:
    Szegedy et al., Going Deeper with Convolutions, CoRR 2014
    Used the BVLC GoogLeNet model, trained by S. Guadarrama.
  3. Reference Caffenet model:
    AlexNet trained on ILSVRC 2012, with a minor variation from the version as described in ImageNet classification with deep convolutional neural networks by Krizhevsky et al. in NIPS 2012. Model trained by J. Donahue.
  4. CWRU HPC:
    This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University.
  5. Red Hen Lab NewsScape Dataset:
    This work made use of the NewsScape dataset and the facilities of the Distributed Little Red Hen Lab, co-directed by Francis Steen and Mark Turner.

Return to top

