Organization: Red Hen Lab
My project was for Red Hen Lab, although it is officially listed under CCExtractor, a sister organization of Red Hen Lab.
Table of Contents:
- Usage Instructions
- The Full Pipeline
- Example Output
- Dataset Preparation
- Classification Approaches
- Deployment Details
- Repository Link
- Previous blog posts
- Known Issues/Future Work
- Future Collaboration
The aim of my project is to classify and analyze the various kinds of news shots encountered in the broadcast videos that Red Hen processes every day. The system extracts visual features (such as the presence of a particular object, the scene type, etc.) in the form of class labels, providing low-level semantic information that can be used for further processing. I have also developed a classifier for camera shot type (Newsperson(s), Background roll, Graphics, etc.). The pipeline has been deployed on the Case Western Reserve University High Performance Computing cluster, where it processes each new day's incoming videos and extracts visual features from them.
Clone the repository from https://github.com/gshruti95/news-shot-classification , while ensuring that all dependencies are correctly installed.
A video can be processed as follows:
python main.py <path-to-videofile>
The path to the video file is either absolute or relative with respect to the main.py file.
The code will generate an output file with the same name as the video and a .vis extension (a self-coined term for visual feature file).
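The output naming convention can be sketched in a few lines (the helper name `vis_path` is my own, not from the repository):

```python
import os

def vis_path(video_path):
    """Return the .vis output path for a given video file."""
    root, _ = os.path.splitext(video_path)
    return root + '.vis'

# e.g. vis_path('/tv/2015-02-02_1600_US_CNN_Newsroom.mp4')
#      -> '/tv/2015-02-02_1600_US_CNN_Newsroom.vis'
```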
Dependencies:
- Python (https://www.python.org/downloads/) : The language of the project. The code has been tested with Python 2.7.8 and 2.7.12, and should work with any recent version of Python 2. Python 3 and other versions are your own adventure.
- Caffe (https://github.com/BVLC/caffe) : Neural network framework. Needs to be compiled with Python support (pycaffe). The code has been tested with Caffe rc3, and should work with the GitHub version.
- FFMpeg (https://github.com/FFmpeg/FFmpeg) : For video processing. The code has been tested with v2.8.2 and v3.1.0, and should work with the GitHub version.
- PySceneDetect (https://github.com/Breakthrough/PySceneDetect) : For shot detection. The code has been tested with v0.3.5.
- Scikit-Learn (http://scikit-learn.org/stable/install.html) : For various classifiers. The pip installation of scikit-learn should work.
Required External Files and Models
All the required external files and classifier models can be found here:
The paths to all external files required by the code can be modified in path_params.py according to the user's convenience.
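As an illustration, path_params.py might look like the following; every variable name and path here is hypothetical, so check the actual file for the real names:

```python
# path_params.py -- hypothetical layout; the real variable names may differ.
# Points the pipeline at the external models and files it needs.
CAFFE_ROOT = '/home/user/caffe'                              # pycaffe installation
PLACES_MODEL = '/home/user/models/places205CNN.caffemodel'   # Places205-AlexNet weights
PLACES_DEPLOY = '/home/user/models/places205_deploy.prototxt'
FINETUNED_MODEL = '/home/user/models/shot_type_5class.caffemodel'
SVM_MODEL = '/home/user/models/shot_type_svm.pkl'            # pickled scikit-learn SVM
```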
Return to top
The Full Pipeline
For each detected shot, the pipeline outputs the following label types:
- Camera shot type → [ Newsperson(s), Background_roll, Graphic, Weather, Sports ]
- Object category → [ Vehicle, Natural formation, Building/Structure, Person(s)/Clothing, Weapon, Sports ]
- Scene type → [ Indoor, Outdoor ]
- Imagenet labels with probabilities
- Places205 labels with probabilities
- Scene attributes
What Camera Shot type labels mean:
- Newsperson(s) consists of Studio, Reporter/Correspondent, Hybrid (split screen view of studio, reporter, or background roll)
- Background_roll consists of footage used to describe a scene by a reporter or studio anchor
- Graphic consists of graphical images such as branding graphics of the news network (occurring between commercials and the start of a news story, and so on) and textual graphics present in background roll (quotations or statements)
- Weather: Graphic images of weather maps or forecasts, may or may not be in the presence of the meteorologist
- Sports: Sports segment in a news broadcast
The steps followed by the pipeline are:
- Extract intra-coded frames (I-frames) using FFmpeg. These serve as representative frames of each shot.
- Detect shots using the PySceneDetect module, and add the first and last frame of each shot to the list of keyframes.
- Crop the news ticker from the bottom of each keyframe, so that the text in the ticker (headlines, etc.) does not affect the neural network's performance.
- Obtain scene location labels (scene type and Places205 labels) along with scene attributes.
- Extract and save fc7 features extracted from Places205 CNN model.
- Classify frames into camera shot type labels using an SVM trained on the fc7 features.
- Obtain object category and imagenet labels.
- Classify frames using the fine-tuned CaffeNet to get a second set of camera shot type labels. The two sets can be used to cross-check or validate each other.
- Identify the labels with the highest frequencies within a shot’s set of keyframes and assign the labels to the detected shot boundary.
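The last step, majority voting over a shot's keyframe labels, can be sketched with the standard library (the function name is illustrative):

```python
from collections import Counter

def shot_label(keyframe_labels):
    """Assign a shot the label that occurs most often among its keyframes."""
    counts = Counter(keyframe_labels)
    label, _ = counts.most_common(1)[0]
    return label

# e.g. three keyframes vote Background_roll, one votes Graphic:
# shot_label(['Background_roll', 'Graphic', 'Background_roll', 'Background_roll'])
# -> 'Background_roll'
```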
Each output line has the format:
timestamp1 | timestamp2 | Shot_class | Other shot labels with probabilities
Each keyframe has 5 lines corresponding to it (Finetune shot class, SVM shot class, Obj class, Scene location, Scene attributes). Each shot boundary is a single line with the most frequently occurring labels among the keyframes belonging to that shot.
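Since the fields are '|'-delimited, a .vis line can be parsed with a few lines of Python (this parser and its field names are my own sketch, not part of the project):

```python
def parse_vis_line(line):
    """Split a .vis line into (start, end, record_type, remaining fields)."""
    fields = [f.strip() for f in line.split('|')]
    start, end, record_type = fields[0], fields[1], fields[2]
    return start, end, record_type, fields[3:]

line = "20140510230372.426| 20140510230372.426| SVM_SHOT_CLASS | Sports"
start, end, rtype, rest = parse_vis_line(line)
# rtype == 'SVM_SHOT_CLASS', rest == ['Sports']
```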
Snippet of the output along with the corresponding keyframe, and true labels written in the form [Camera shot type, Object type, Scene type].
2015020216002.369| 2015020216006.006| SHOT_DETECTED >> | Finetuned_Shot_Class= Background_roll | SVM_Shot_Class= Background_roll | Obj_Class= Unclassified | Scene_Type= Outdoor
2014051023085.379| 2014051023085.379| FINETUNED_SHOT_CLASS | Background_roll | (‘Background_roll’, 1.0)
2014051023085.379| 2014051023085.379| SVM_SHOT_CLASS | Background_roll
2014051023085.379| 2014051023085.379| OBJ_CLASS | Natural formation | (‘geological formation’, 1.75), (‘natural elevation’, 1.0), (‘shore’, 0.96), (‘seashore’, 0.6), (‘promontory’, 0.57)
2014051023085.379| 2014051023085.379| SCENE_LOCATION | Outdoor | (‘dam’, 0.21), (‘river’, 0.2)
2014051023085.379| 2014051023085.379| SCENE_ATTRIBUTES | open area, natural light, sailing/boating, natural, moist
20140510230372.426| 20140510230372.426| FINETUNED_SHOT_CLASS | Sports | (‘Sports’, 1.0)
20140510230372.426| 20140510230372.426| SVM_SHOT_CLASS | Sports
20140510230372.426| 20140510230372.426| OBJ_CLASS | Sports | (‘rugby ball’, 1.9), (‘ball’, 1.89), (‘game equipment’, 1.69), (‘equipment’, 1.19), (‘soccer ball’, 0.61)
20140510230372.426| 20140510230372.426| SCENE_LOCATION | Outdoor | (‘stadium’, 0.99)
20140510230372.426| 20140510230372.426| SCENE_ATTRIBUTES | sports, nohorizon, competing, exercise, open area
20160308200022.119| 20160308200022.119| FINETUNED_SHOT_CLASS | Newsperson(s) | (‘Newsperson(s)’, 1.0)
20160308200022.119| 20160308200022.119| SVM_SHOT_CLASS | Newsperson(s)
20160308200022.119| 20160308200022.119| OBJ_CLASS | Person(s)/Clothing | (‘garment’, 1.04), (‘clothing’, 1.02), (‘sweater’, 1.0), (‘cardigan’, 0.91), (‘covering’, 0.9)
20160308200022.119| 20160308200022.119| SCENE_LOCATION | Indoor
20160308200022.119| 20160308200022.119| SCENE_ATTRIBUTES | nohorizon, natural light, competing, touring, open area
A field reporter can sometimes be misclassified as Background_roll (as below), because the class contains many similar frames: image stills of a person, interviews, speeches, guest speakers, none of which are Newsperson(s).
20150725130452.375| 20150725130452.375| FINETUNED_SHOT_CLASS | Background_roll | (‘Background_roll’, 0.986)
20150725130452.375| 20150725130452.375| SVM_SHOT_CLASS | Background_roll
20150725130452.375| 20150725130452.375| OBJ_CLASS | Person(s)/Clothing | (‘clothing’, 0.65), (‘garment’, 0.62), (‘consumer goods’, 0.58), (‘covering’, 0.58), (‘commodity’, 0.55)
20150725130452.375| 20150725130452.375| SCENE_LOCATION | Outdoor
20150725130452.375| 20150725130452.375| SCENE_ATTRIBUTES | natural light, man-made, far-away horizon, warm, leaves
20160308200045.311| 20160308200045.311| FINETUNED_SHOT_CLASS | Background_roll | (‘Background_roll’, 1.0)
20160308200045.311| 20160308200045.311| SVM_SHOT_CLASS | Background_roll
20160308200045.311| 20160308200045.311| OBJ_CLASS | Vehicle | (‘car’, 2.23), (‘motor vehicle’, 2.18), (‘self-propelled vehicle’, 1.97), (‘wheeled vehicle’, 1.72), (‘vehicle’, 1.39)
20160308200045.311| 20160308200045.311| SCENE_LOCATION | Outdoor | (‘parking_lot’, 0.45)
20160308200045.311| 20160308200045.311| SCENE_ATTRIBUTES | man-made, nohorizon, glossy, electric lighting, working
20160308200040.621| 20160308200040.621| FINETUNED_SHOT_CLASS | Background_roll | (‘Background_roll’, 1.0)
20160308200040.621| 20160308200040.621| SVM_SHOT_CLASS | Background_roll
20160308200040.621| 20160308200040.621| OBJ_CLASS | Building/Structure | (‘structure’, 0.82), (‘mercantile establishment’, 0.66), (‘place of business’, 0.65), (‘shop’, 0.63), (‘establishment’, 0.63)
20160308200040.621| 20160308200040.621| SCENE_LOCATION | Outdoor | (‘hospital’, 0.35)
20160308200040.621|20160308200040.621| SCENE_ATTRIBUTES | man-made, open area, natural light, nohorizon, horizontal components
201507025130001.202| 201507025130001.202| FINETUNED_SHOT_CLASS | Newsperson(s) | (‘Newsperson(s)’, 1.0)
201507025130001.202| 201507025130001.202| SVM_SHOT_CLASS | Newsperson(s)
201507025130001.202| 201507025130001.202| OBJ_CLASS | Person(s)/Clothing | (‘clothing’, 1.29), (‘garment’, 1.28), (‘consumer goods’, 1.13), (‘commodity’, 1.07), (‘covering’, 1.04)
201507025130001.202| 201507025130001.202| SCENE_LOCATION | Indoor | (‘conference_center’, 0.61)
201507025130001.202| 201507025130001.202| SCENE_ATTRIBUTES | man-made, cloth, enclosed area, business, working
20150404190732.544| 20150404190732.544| FINETUNED_SHOT_CLASS | Newsperson(s) | (‘Newsperson(s)’, 1.0)
20150404190732.544| 20150404190732.544| SVM_SHOT_CLASS | Newsperson(s)
20150404190732.544| 20150404190732.544| OBJ_CLASS | Person(s)/Clothing | (‘clothing’, 1.21), (‘garment’, 1.21), (‘consumer goods’, 1.07), (‘commodity’, 1.01), (‘covering’, 0.97)
20150404190732.544| 20150404190732.544| SCENE_LOCATION | Indoor | (‘conference_center’, 0.22)
20150404190732.544| 20150404190732.544| SCENE_ATTRIBUTES | nohorizon, enclosed area, cloth, man-made, studying
2016060701057.934|2016060701057.934| FINETUNED_SHOT_CLASS | Background_roll | (‘Background_roll’, 1.0)
2016060701057.934|2016060701057.934| SVM_SHOT_CLASS | Background_roll
2016060701057.934| 2016060701057.934| OBJ_CLASS | Unclassified | (‘furniture’, 0.62), (‘furnishing’, 0.58), (‘desk’, 0.58), (‘table’, 0.54), (‘structure’, 0.35)
2016060701057.934| 2016060701057.934| SCENE_LOCATION | Indoor | (‘office’, 0.25)
2016060701057.934| 2016060701057.934| SCENE_ATTRIBUTES | nohorizon, enclosed area, cloth, working, paper
20160607010062.230| 20160607010062.230| FINETUNED_SHOT_CLASS | Graphic | (‘Graphic’, 1.0)
20160607010062.230| 20160607010062.230| SVM_SHOT_CLASS | Graphic
20160607010062.230| 20160607010062.230| OBJ_CLASS | Unclassified | (‘device’, 0.36), (‘fastener’, 0.29), (‘covering’, 0.28), (‘restraint’, 0.28), (‘equipment’, 0.22)
20160607010062.230| 20160607010062.230| SCENE_LOCATION | Indoor | (‘bar’, 0.31), (‘music_studio’, 0.29)
20160607010062.230| 20160607010062.230| SCENE_ATTRIBUTES | nohorizon, enclosed area, man-made, farming, using tools
20160621060193.656| 20160621060193.656| FINETUNED_SHOT_CLASS | Background_roll | (‘Background_roll’, 1.0)
20160621060193.656| 20160621060193.656| SVM_SHOT_CLASS | Background_roll
20160621060193.656| 20160621060193.656| OBJ_CLASS | Weapon | (‘revolver’, 1.65), (‘pistol’, 1.63), (‘firearm’, 1.57), (‘gun’, 1.47), (‘weapon’, 1.3)
20160621060193.656| 20160621060193.656| SCENE_LOCATION | Indoor
20160621060193.656| 20160621060193.656| SCENE_ATTRIBUTES | enclosed area, nohorizon, wood, man-made, electric lighting
Sometimes the FINETUNED_SHOT_CLASS and SVM_SHOT_CLASS disagree, as below. In such cases the fine-tuned output is generally the more accurate of the two.
20150725130712.533| 20150725130712.533| FINETUNED_SHOT_CLASS | Weather | (‘Weather’, 0.999)
20150725130712.533| 20150725130712.533| SVM_SHOT_CLASS | Background_roll
20150725130712.533| 20150725130712.533| OBJ_CLASS | Unclassified | (‘elasmobranch’, 0.97), (‘ray’, 0.97), (‘cartilaginous fish’, 0.95), (‘electric ray’, 0.8), (‘fish’, 0.79)
20150725130712.533| 20150725130712.533| SCENE_LOCATION | Indoor | (‘office’, 0.25)
20150725130712.533| 20150725130712.533| SCENE_ATTRIBUTES | nohorizon, warm, vegetation, man-made, electric lighting
20160621060428.835| 20160621060428.835| FINETUNED_SHOT_CLASS | Weather | (‘Weather’, 1.0)
20160621060428.835| 20160621060428.835| SVM_SHOT_CLASS | Weather
20160621060428.835| 20160621060428.835| OBJ_CLASS | Unclassified | (‘musical instrument’, 0.48), (‘clothing’, 0.34), (‘covering’, 0.34), (‘device’, 0.32), (‘consumer goods’, 0.32)
20160621060428.835| 20160621060428.835| SCENE_LOCATION | Indoor | (‘office’, 0.25)
20160621060428.835| 20160621060428.835| SCENE_ATTRIBUTES | nohorizon, warm, vegetation, man-made, electric lighting
The dataset was compiled from some videos obtained from Red Hen Lab’s NewsScape database. A selection of full length news broadcasts from a variety of news networks, different shows of a network, and from different periods of time, were put together.
The train dataset comprises 29 videos (approx. 25 hours of data). The test set contains 10 videos, no two of which are from the same show, as it was crucial to ensure that the learning model had never seen the test data.
The annotations were done by a single annotator (me) with the help of Red Hen Lab’s Rapid Annotator: https://github.com/RedHenLab/RapidAnnotator
A total of 21,101 images (keyframes) were annotated for training, of which 12,407 images were ‘Commercials’ and redundant. Hence, the actual train data consisted of 8,694 images.
Similarly, out of 7,303 images (including commercials), 3,339 images were used for testing.
The train set is imbalanced because some classes, such as Newsperson(s) and Background_roll, occur much more frequently than Weather or Sports. However, since the same class distribution is present in the actual data to be predicted on, this was not a significant issue.
The list of videos can be found here:
Return to top
Classification Approaches
My first attempt was an unsupervised learning method, graph clustering: after an initial look at the data, it had seemed that supervised learning would not work very well. Following the approach of http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1699433&tag=1 , I implemented a graph clustering algorithm.
However, this method was discarded because it was limited to the characteristics of a single class, and a supervised learning approach was taken up instead.
One method of transfer learning is to use a convolutional neural network as a feature extractor and then train a linear classifier, such as an SVM, on the dataset. The model I used was the PlacesCNN from the BVLC Model Zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo), which is based on the AlexNet architecture.
By removing the last fully connected layer (whose outputs are the class scores for the network's original task), I extracted a 4096-dimensional feature vector for each keyframe image from the fc7 (penultimate) layer.
By merging classes that are visually similar to one another, I trained an SVM for each combination:
– 5-class SVM in OneVsRest mode → [ Newsperson(s), Background_roll, Graphic, Weather, Sports ]
– 5-class SVM in OneVsOne mode
– 3-class SVM in OneVsRest mode → [ Newsperson(s), Background_roll, Graphic ]
A comparison of how the models fared is given in the Performance Analysis section.
Finally, the 5-class SVM model in OneVsRest mode is used in the pipeline.
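The SVM stage can be sketched with scikit-learn. In this sketch the fc7 features are replaced by random vectors just to show the shapes; the real pipeline trains on 4096-dimensional fc7 features extracted from annotated keyframes:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

CLASSES = ['Newsperson(s)', 'Background_roll', 'Graphic', 'Weather', 'Sports']

rng = np.random.RandomState(0)
X_train = rng.rand(50, 4096)             # stand-in for fc7 features of 50 keyframes
y_train = rng.choice(CLASSES, size=50)   # stand-in camera shot type labels

# One-vs-rest: one binary LinearSVC is trained per shot type class
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, y_train)

pred = clf.predict(rng.rand(3, 4096))    # predict shot types for 3 new keyframes
```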
Fine-tuning a Convolutional Neural Network
The other transfer learning method employed was fine-tuning the weights of the pre-trained reference Caffe network. I fine-tuned the network by replacing only the last layer, since the earlier layers of a CNN capture more generic features, while later layers become more specific to the classes in the original dataset.
Again, combinations of classes were used to create three fine-tuned models:
– 8-class → [ Studio, Reporter, Hybrid, Talking_head, Background_roll, Graphic, Weather, Sports ]
– 5-class → [ Newsperson(s), Background_roll, Graphic, Weather, Sports ]
– 3-class → [ Newsperson(s), Background_roll, Graphic ]
Each model was fine-tuned for 100,000 iterations, which took around 4 hours of processing time each on a K40 GPU.
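Fine-tuning in Caffe is done by initializing training from the pre-trained weights; a hedged sketch of the command, where the solver and model paths are hypothetical:

```shell
# Paths are illustrative. The solver.prototxt would define the replaced
# final layer's net and the 100,000-iteration training schedule.
./build/tools/caffe train \
    --solver=models/news_shot_5class/solver.prototxt \
    --weights=models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel \
    --gpu=0
```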
Return to top
Performance Analysis
When the class distribution is imbalanced, as it is here, accuracy is a poor metric: it rewards models that simply predict the most frequent class. Hence Precision, Recall and F-score were computed class-wise.
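Class-wise metrics can be computed with scikit-learn; a small sketch using made-up ground truth and predictions:

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels for three shot type classes, purely for illustration
y_true = ['Newsperson(s)', 'Newsperson(s)', 'Background_roll',
          'Background_roll', 'Weather']
y_pred = ['Newsperson(s)', 'Background_roll', 'Background_roll',
          'Background_roll', 'Newsperson(s)']

labels = ['Newsperson(s)', 'Background_roll', 'Weather']
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0)

# e.g. Background_roll: precision 2/3 (2 of 3 predictions correct),
# recall 1.0 (both true Background_roll frames found)
```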
Performance measures were calculated for the 8 camera shot type labels (‘Newsperson(s)’, ‘Hybrid’, etc.), the scene type (‘Indoor’, ‘Outdoor’), and the object type categories (‘Vehicle’, ‘Natural formation’, etc.). No performance measures were calculated for Places205 labels, ImageNet labels, or scene attributes, as these come directly from pre-trained models and it would be impossible to annotate all possible labels on the test data. Instead, the probability of each label is output to help judge its reliability.
The detailed results for each label type for which I trained a classifier are shown below:
Fine-tuned CaffeNet shot type:
Scene type and Object category:
Why were some classes combined?
Classes such as Studio, Reporter, and Hybrid were visually similar, in the sense that they all consist of talking human heads against varying backgrounds, but they were also logically one category, i.e. Newsperson(s).
The same visual similarity occurred in Background_roll as well, in the case of interviews or guest speakers. These talking heads in the Background_roll class could not logically be merged into Newsperson(s), which led to some unavoidable misclassification.
Sometimes, it was a tricky situation between Newsperson(s) and another label such as Sports or Weather, as both seemed valid.
I also tried combining Weather into the Graphic class, but performance was not significantly better, so I kept them separate to preserve the diversity of classes.
Studio frames which had backgrounds projected as outdoor environments, were misleading with respect to scene type (Outdoor instead of Indoor).
The chart is a comparison of the three fine-tuned models. The 5-class fine-tuned model was marginally better, and hence was used in the final pipeline.
Comparing the SVM models, the 5-class SVM in OVR mode gave higher performance measures and was incorporated into the pipeline.
Comparing the two selected models incorporated in the pipeline, the 5-class fine-tuned CaffeNet performs significantly better than the 5-class SVM.
(Table: metric-by-metric comparison of the 5-class Finetune, 3-class Finetune, 8-class Finetune, 5-class SVM OVR, 3-class SVM OVR, 5-class SVM OVO, Object Type, and Scene Type models.)
The entire pipeline has been deployed on the Case Western Reserve University High Performance Computing cluster. A challenging part of working in a cluster environment was setting up all the required dependencies, such as Caffe, without root privileges. Conflicting dependencies and assorted compilation errors made the setup time-consuming, but it was also a great learning experience.
I wrote scripts which interface with SLURM, the resource allocation manager that the cluster uses. The scripts copy the day’s incoming news videos to the cluster and then submit batch jobs that process the videos and create .vis output files. On average, the entire day’s videos (around 150 hours of data) get processed in approximately 3-4 hours.
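A hedged sketch of what such a SLURM submission script might look like; the job parameters, paths, and script name are all hypothetical:

```shell
#!/bin/bash
#SBATCH --job-name=news-shot      # illustrative job parameters
#SBATCH --time=04:00:00
#SBATCH --mem=16G
#SBATCH --gres=gpu:1

# Process one copied video; main.py writes its .vis file next to it
python main.py "$1"
```

Each of the day's videos would then be submitted as a separate batch job, e.g. `sbatch process_video.sh /scratch/news/2016-06-21_0600_US_CNN.mp4`.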
Return to top
Link to Repository / Commits
The project repository is at:
All the commits can be seen at:
Return to top
Links to Blog Posts
I have written two blog posts about GSoC during the program:
Return to top
Known Issues / Future Work Needed
- The current system performs well on videos that resemble the training data, which primarily comprises US networks and certain programs.
- Scene type labels (indoor, outdoor) are output for every shot, irrespective of shot type. Strictly, scene type is most applicable to Newsperson(s) and Background_roll shots; this conditioning was deliberately not applied, so that the two sets of labels remain independent of each other's performance. Similarly, ImageNet labels are output for every shot and may not be relevant for some shot classes (such as Graphic or Weather).
- The system works best for videos which follow a definite program structure. Work needs to be done to expand the dataset quantitatively and in terms of diversity of shows/programs.
- Commercials need to be identified before using .vis output, as labels such as Camera Shot type are not applicable.
Carrying on working with Red Hen
It was an excellent experience working for Red Hen Lab this summer, and learning so much. I will most certainly continue to maintain and improve my project, and keep contributing to the organization.
Return to top
- Places205-AlexNet model:
B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva
Learning Deep Features for Scene Recognition using Places Database.
Advances in Neural Information Processing Systems 27 (NIPS) spotlight, 2014.
- GoogleNet model:
Szegedy et al., Going Deeper with Convolutions, CoRR 2014
Used the BVLC GoogLeNet model, trained by S. Guadarrama.
- Reference Caffenet model:
AlexNet trained on ILSVRC 2012, with a minor variation from the version as described in ImageNet classification with deep convolutional neural networks by Krizhevsky et al. in NIPS 2012. Model trained by J. Donahue.
- CWRU HPC:
This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University.
- Red Hen Lab NewsScape Dataset:
This work made use of the NewsScape dataset and the facilities of the Distributed Little Red Hen Lab, co-directed by Francis Steen and Mark Turner.