this is my blog

Calculating Clusters
at 14:47 | General, Thesis.

Today I finished the code for subsampling the descriptors. This is needed because clustering on all descriptors would take far too long.
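A minimal sketch of what such a subsampling step could look like, assuming the descriptors are stored one per row in a matrix. The names `descriptors`, `nSample`, `sampleIdx` and `subset` are made up for this illustration, and the random matrix only stands in for the real data.

```matlab
% Toy stand-in for the real data: one descriptor per row.
% The sizes here are assumptions chosen for illustration only.
descriptors = rand(5000, 162);
nSample     = 1000;              % the real run samples far more

% Never ask for more rows than there are
nSample = min(nSample, size(descriptors, 1));

% randperm(n, k) draws k unique indices from 1..n,
% i.e. sampling without replacement
sampleIdx = randperm(size(descriptors, 1), nSample);
subset    = descriptors(sampleIdx, :);
```

Sampling without replacement avoids feeding the same descriptor to the clustering twice.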

Together with Ivo, I copied all the code to the Kameleon server here in Science Park. This server has 72 GB of internal memory, dual quad-core CPUs and 3 GPU cores. Obviously this will process the data a little bit faster than my MacBook Pro.

We fixed some small bugs in my code and copied all the Hollywood2 data to the server as well. The script to cluster all data is currently running. We use the same settings as the paper by Ivan Laptev (Learning realistic human actions from movies).

  • 100k features, sampled from the training videos
  • 4000 clusters
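For readers unfamiliar with k-means, here is a small sketch of the basic algorithm (Lloyd's iterations) on toy 2-D data. It is only an illustration and not the script running on the server: the data, the variable names and `k = 3` are assumptions, whereas the real run uses the settings listed above.

```matlab
% Toy data: three separated 2-D blobs standing in for the sampled descriptors
X = [randn(50, 2); randn(50, 2) + 10; bsxfun(@plus, randn(50, 2), [10 -10])];
k = 3;

% Initialise the centers with k distinct random data points
centers = X(randperm(size(X, 1), k), :);

for iter = 1:100
    % Squared Euclidean distance from every point to every center (n-by-k)
    d2 = bsxfun(@plus, sum(X.^2, 2), sum(centers.^2, 2)') - 2 * (X * centers');

    % Assignment step: index of the nearest center for each point
    [~, labels] = min(d2, [], 2);

    % Update step: move each center to the mean of its assigned points
    newCenters = centers;
    for c = 1:k
        members = (labels == c);
        if any(members)          % keep the old center if a cluster is empty
            newCenters(c, :) = mean(X(members, :), 1);
        end
    end

    if isequal(newCenters, centers)   % converged: no center moved
        break;
    end
    centers = newCenters;
end
```

The result depends on the random initialisation, so different runs can end in different local optima.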

Hopefully we will end up with results similar to Ivan Laptev's, so we can easily make a comparison. Next up, I will start working on the code to create visual words using the calculated clusters.

After a few weeks of reading to get familiar with the project I am working on, it is finally time to start working on the software. Ivo already retrieved the necessary Space-Time Interest Points (STIP) using Ivan Laptev's software. I will be using these descriptors to build a visual vocabulary for the video fragments. This method is best explained in the article by Josef Sivic and Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos.

Basically, all Space-Time descriptors from all video fragments are put together in a bag. A subset of these descriptors is clustered using k-means. Based on these clusters, a histogram (the so-called visual word) can be created for each video sample. It is created by taking each descriptor in the sample and accumulating the Euclidean distances to all cluster centers.
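To make the histogram step concrete, here is a sketch of one common variant, hard assignment, in which every descriptor simply votes for its nearest cluster center. This is an assumption on my part: a variant that accumulates the distances themselves would weight the votes differently. The names `centers`, `sampleDesc`, `nearest` and `bow`, and the toy numbers, are made up for the example.

```matlab
% Hypothetical inputs: k cluster centers and the descriptors of one video sample
centers    = [0 0; 10 10; 10 -10];                          % k-by-d
sampleDesc = [0.5 0.2; 9 11; 10.5 9.5; -1 0.3; 11 -9];      % m-by-d
k = size(centers, 1);

% Squared Euclidean distance from every descriptor to every center (m-by-k)
d2 = bsxfun(@plus, sum(sampleDesc.^2, 2), sum(centers.^2, 2)') ...
     - 2 * (sampleDesc * centers');

% Hard assignment: each descriptor votes for its nearest center
[~, nearest] = min(d2, [], 2);

% Count the votes per cluster, then normalise so that samples with
% different numbers of descriptors stay comparable
bow = accumarray(nearest, 1, [k 1])';
bow = bow / sum(bow);
```

For these toy numbers two descriptors land on the first center, two on the second and one on the third, so `bow` comes out as `[0.4 0.4 0.2]`.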

As the Space-Time descriptors are based on changes in time, a lot of descriptors will be found when going from one scene to another (so-called shot boundaries). As the descriptors on these boundaries will not be relevant for action detection, they will have to be filtered out. Luckily, the Hollywood2 dataset contains the frame numbers for each shot boundary in the video fragments. As each descriptor spans several frames, some code is required to find all the ones that need filtering.

Yesterday I wrote the code for this filtering. It is written in Matlab, which will be the main programming tool for this project. As I want to show off the syntax coloring on my website, I will post a small piece of the code I created.

        % Get the frame numbers of the descriptors
        frameNums       = descStruct.t;
        
        % Get the start and end frame, using tau2
        frameStart      = descStruct.t - descStruct.tau2;
        frameEnd        = descStruct.t + descStruct.tau2;
        
        n_descriptors   = size(frameNums, 1);
        n_boundaries    = size(boundaries, 2);
        
        % Make sure all matrices are the same size
        frameStartAll   = repmat(frameStart, [1 n_boundaries]);
        frameEndAll     = repmat(frameEnd, [1 n_boundaries]);
        boundariesAll   = repmat(boundaries, [n_descriptors 1]);
        
        % Subtract the boundaries from the frame start and end. If the
        % descriptor spans a boundary frame, start is <= 0 and end is >= 0,
        % so their product is <= 0
        frameStartFinal = frameStartAll - boundariesAll;
        frameEndFinal   = frameEndAll - boundariesAll;
        
        % Find the descriptors that do not lie on the boundary
        indices =  (sum((frameStartFinal .* frameEndFinal <= 0), 2) == 0);

This code shows the filtering process. It first determines the start and end frame of each descriptor. Next, it subtracts the frame numbers of the shot boundaries. Lastly, every descriptor whose start is lower than or equal to zero and whose end is higher than or equal to zero for some boundary is a descriptor that has to be filtered out.
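A small worked example, reusing the variable names from the snippet, shows the filter on made-up numbers. The three descriptors and two boundaries are invented for illustration, and `descStruct` is built by hand here instead of being loaded from the STIP output.

```matlab
% Made-up input: three descriptors (center frame t, temporal extent tau2)
% and two shot boundaries at frames 48 and 102
descStruct.t    = [10; 50; 100];
descStruct.tau2 = [4; 4; 2];
boundaries      = [48 102];

frameStart = descStruct.t - descStruct.tau2;    % [6; 46; 98]
frameEnd   = descStruct.t + descStruct.tau2;    % [14; 54; 102]

n_descriptors = size(descStruct.t, 1);
n_boundaries  = size(boundaries, 2);

% Expand everything to n_descriptors-by-n_boundaries and subtract
boundariesAll   = repmat(boundaries, [n_descriptors 1]);
frameStartFinal = repmat(frameStart, [1 n_boundaries]) - boundariesAll;
frameEndFinal   = repmat(frameEnd,   [1 n_boundaries]) - boundariesAll;

% A product <= 0 means the boundary lies inside [frameStart, frameEnd]
indices = (sum((frameStartFinal .* frameEndFinal <= 0), 2) == 0);
```

Only the first descriptor is kept: the second spans frames 46 to 54 and so covers boundary 48, and the third ends exactly on boundary 102.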

Master thesis: Action Recognition in Movies
at 12:53 | General, Thesis.

In this first blog entry I will briefly explain the topic of my Master thesis. I started this thesis around a month ago and it is scheduled to take up around 6 months of my time.

The topic for my Master thesis is: "Action Recognition in Movies". The aim of this research is to learn realistic human actions in diverse and realistic video settings. Actions in this context can be kissing, answering a phone, getting out of a car, etc. You can think of many purposes for this kind of action recognition. For instance, it would be possible to search for specific movies in which particular actions occur. Such a query could then return the videos together with the time of the action. Another use might be in surveillance, where particular actions, like fighting, could be of interest. If these actions are detected automatically, it could save time and costs.

Example actions: kissing, picking up a phone, getting out of a car.

I will work together with my supervisor Ivo Everts to build the software for this recognition. The software will rely on previous work and combine it, hopefully resulting in better performance.

The dataset that we will be using is the same dataset Ivan Laptev used in his work on action recognition in movies (Hollywood2). It consists of a large set of short video fragments from a variety of Hollywood movies. The dataset can be found on his website: http://www.irisa.fr/vista/actions/hollywood2/.

For previous work on this subject, you can read the following articles. These articles show that most work is done in detecting the context of a video or detecting actions in movies. I will be combining both methods, context and action recognition, to try to build a more robust system.