this is my blog

Feature Vectors & Classifying Data
at 12:12 | General, Thesis.

In the previous entry I talked about clustering the descriptors. Even though the clustering algorithm was run on a really fast machine with loads of memory, it was still running after two days. It was not even finished with clustering the HoG descriptors, while it still had to do the HoF descriptors afterwards (which would take at least the same amount of time). As I had to continue with my work, a faster clustering algorithm was necessary.

Luckily Ivo had a solution: careful seeding. More is explained in the paper by Arthur and Vassilvitskii: k-means++: The Advantages of Careful Seeding. This proved to be a good temporary solution. The clustering was now done in less than 5 minutes (note that we don't know if the result is optimal, but it is usable nonetheless).
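To give an idea of what careful seeding means, here is a minimal sketch of the seeding step as I understand it from the paper: the first center is picked uniformly at random, and every next center is picked with a probability proportional to the squared distance to the nearest center chosen so far. The toy data X, the cluster count k and all variable names are made up for illustration, and this is not necessarily the implementation we ran.

```matlab
% Toy data: 200 points in 2-D (hypothetical stand-in for the descriptors)
X = [randn(100, 2); randn(100, 2) + 5];
k = 4;                                   % number of centers to seed

n = size(X, 1);
centroids = zeros(k, size(X, 2));

% 1. Pick the first center uniformly at random
centroids(1, :) = X(randi(n), :);

% Squared distance from every point to its nearest chosen center so far
d2 = sum(bsxfun(@minus, X, centroids(1, :)) .^ 2, 2);

for c = 2:k
    % 2. Pick the next center with probability proportional to d2
    cumD = cumsum(d2);
    idx  = find(cumD >= rand() * cumD(end), 1, 'first');
    centroids(c, :) = X(idx, :);

    % 3. Update the distance to the nearest chosen center
    d2 = min(d2, sum(bsxfun(@minus, X, centroids(c, :)) .^ 2, 2));
end

% centroids now holds k starting centers for the regular k-means iterations
```

Points far away from all chosen centers are more likely to be picked next, so the starting centers tend to be spread out over the data.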

After the initial clustering was done, it was time for the creation of feature vectors for each video sample. To do this, we simply go through all HoG and HoF descriptors in the video sample and measure the distance to each cluster center (Euclidean distance in our case). Then for each cluster we count the number of descriptors for which it is the nearest cluster. Finally we normalize all the counts by dividing them by the number of descriptors in the sample. The values for all clusters together form the feature vector for the video sample. The following code block contains the code to calculate these feature vectors.

calcVisualWordHist
function vwHist = calcVisualWordHist(descriptors, centroids)
 
    % Calculate necessary sizes
    nCentroids = size(centroids, 1);
    nDescs = size(descriptors, 1);
    
    % Calculate all distances for each descriptor to each cluster
    distances = zeros([nCentroids nDescs]);
    for i=1:nDescs
        distances(:, i) = sqrt(sum(bsxfun(@minus, centroids, descriptors(i, :)) .^ 2, 2));
    end
    
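    % Find the nearest centroid for each descriptor (column-wise minimum;
    % the explicit dimension keeps this correct for a single centroid)
    [minVals, minIdx] = min(distances, [], 1); %#ok
    
    % Count the descriptors per centroid and normalize by the total
    vwHist = histc(minIdx, 1:nCentroids)';
    vwHist = vwHist ./ nDescs;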
 
end
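For what it is worth, the loop over the descriptors can probably be avoided by expanding the squared distance as ||c - d||^2 = ||c||^2 + ||d||^2 - 2 c·d. The sketch below is an alternative under that assumption, shown on made-up toy numbers (2 centroids, 4 descriptors). It skips the square root, which does not change which centroid is nearest, and uses accumarray instead of histc for the counting.

```matlab
% Toy example (made-up numbers): 2 centroids and 4 descriptors in 2-D
centroids   = [0 0; 10 10];
descriptors = [1 0; 0 2; 9 9; -1 1];

% Squared distances via ||c - d||^2 = ||c||^2 + ||d||^2 - 2*c*d'
c2     = sum(centroids .^ 2, 2);          % nCentroids x 1
d2     = sum(descriptors .^ 2, 2)';       % 1 x nDescs
sqDist = bsxfun(@plus, c2, d2) - 2 * (centroids * descriptors');

% Nearest centroid per descriptor (column-wise minimum)
[minVals, minIdx] = min(sqDist, [], 1); %#ok

% Count the descriptors per centroid and normalize by the total
vwHist = accumarray(minIdx(:), 1, [size(centroids, 1) 1]) ./ size(descriptors, 1);

disp(vwHist')   % 0.75 0.25
```

Three of the four toy descriptors lie closest to the first centroid, hence the 0.75 and 0.25.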

As there are 1700 video samples (each containing a few thousand descriptors), it would once again take too long to run the script on my own laptop. Instead Ivo made use of this opportunity to teach me how to work with the national computer cluster Lisa. One can simply write a bash script and submit it as a job to Lisa, which will run the script as soon as it has an available spot on the cluster. For each job you have to specify how much time and memory you will need. So we split up the work into 18 jobs, each calculating the feature vectors for up to 100 samples. This took about 3 hours to complete.
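Splitting the work over the jobs comes down to simple index arithmetic. A minimal sketch, assuming 1707 samples (the 1700 above is rounded, and the exact count used here is an assumption) and hypothetical variable names; the actual job submission to Lisa is not shown.

```matlab
nSamples  = 1707;   % assumed sample count, slightly over 1700
chunkSize = 100;    % at most 100 samples per job
nJobs     = ceil(nSamples / chunkSize);   % 18 jobs

for job = 1:nJobs
    first = (job - 1) * chunkSize + 1;
    last  = min(job * chunkSize, nSamples);
    fprintf('job %2d: samples %4d to %4d\n', job, first, last);
end
```

The last job only gets the few remaining samples, which is why 18 jobs are needed instead of 17.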

The Hollywood2 dataset also contains labels for each video sample, describing whether a certain action is performed in the video. Using these labels and the feature vectors we created, it was possible to train a classifier and to see how well it performs on our data. As we do not have the specific frames where each action is performed, the classifier will simply be able to specify if a certain action happened somewhere in a given video sample. It will not be able to tell us the exact location of the action. Although we obviously want to know at what point in the video the action is taking place, it is nevertheless a good starting point to know in which video the action is taking place. The results from the first classification experiments look quite promising. The errors are quite low for most actions (under 0.1).

The next step is to build a multi-class classifier to distinguish between different actions. I will also be experimenting with methods for finding the specific frames where the action takes place. However, the latter problem does require me to go through the whole dataset myself, as this information is not in the Hollywood2 dataset by default (a daunting task...).

Calculating Clusters
at 14:47 | General, Thesis.

Today I finished the code for subsampling the descriptors. This is needed because clustering all descriptors would take far too long.

Together with Ivo, I copied all the code to the Kameleon server here in Science Park. This server has 72 GB of internal memory, dual quad-core CPUs and 3 GPU cores. Obviously this will process the data a little bit faster than my MacBook Pro.

We fixed some small bugs in my code and copied all the Hollywood2 data to the server as well; the script to cluster all data is currently running. We use the same settings as in the paper by Ivan Laptev (Learning realistic human actions from movies):

  • 100k features, sampled from the training videos
  • 4000 clusters
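The subsampling itself can be sketched in a few lines: shuffle the row indices and keep the first part. The matrix allDescs, its size and the sample size below are toy values made up for illustration (the real setup samples 100k features), and this is not necessarily identical to the code I wrote.

```matlab
% Toy stand-in for the real descriptors: one descriptor per row
allDescs = rand(5000, 8);

% Number of descriptors to keep (100000 in the real setup)
nSample = min(1000, size(allDescs, 1));

% Random order of all row indices; taking the first nSample gives a
% uniform sample without repeats
perm      = randperm(size(allDescs, 1));
sampleIdx = perm(1:nSample);
subset    = allDescs(sampleIdx, :);
```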

Hopefully we will end up with results similar to Ivan Laptev's, so we can easily make a comparison. Next up, I will start working on the code to create visual words using the calculated clusters.

After a few weeks of reading to get familiar with the project I am working on, it is finally time to start working on the software. Ivo already retrieved the necessary Space-Time Interest Points (STIP) using Ivan Laptev's software. I will be using these descriptors to build a visual vocabulary for the video fragments. This method is best explained in the article by Josef Sivic and Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos.

Basically all Space-Time descriptors from all video fragments are put together in a bag. A subset of these descriptors is clustered using k-means, and each resulting cluster acts as a so-called visual word. Based on these clusters, a histogram of visual words can be created for each video sample. It is created by assigning each descriptor in the sample to its nearest cluster center (by Euclidean distance) and counting the descriptors per cluster.

As the Space-Time descriptors are based on changes over time, a lot of descriptors will be found when going from one scene to another (at the so-called shot boundaries). As the descriptors on these boundaries are not relevant for action detection, they have to be filtered out. Luckily, the Hollywood2 dataset contains the frame numbers for each shot boundary in the video fragments. As each descriptor spans several frames, some code is required to find all the ones that need filtering.

Yesterday I wrote the code for this filtering. It is written in Matlab, which will be the main programming tool for this project. As I want to show off the syntax coloring on my website, I will post a small piece of the code I created.

        % Get the frame numbers of the descriptors
        frameNums       = descStruct.t;
        
        % Get the start and end frame, using tau2
        frameStart      = descStruct.t - descStruct.tau2;
        frameEnd        = descStruct.t + descStruct.tau2;
        
        n_descriptors   = size(frameNums, 1);
        n_boundaries    = size(boundaries, 2);
        
        % Make sure all matrices are the same size
        frameStartAll   = repmat(frameStart, [1 n_boundaries]);
        frameEndAll     = repmat(frameEnd, [1 n_boundaries]);
        boundariesAll   = repmat(boundaries, [n_descriptors 1]);
        
        % Subtract the boundaries from the frame start and end. If a
        % descriptor spans a boundary frame, start and end have opposite
        % signs (or one of them is 0), so their product is <= 0
        frameStartFinal = frameStartAll - boundariesAll;
        frameEndFinal   = frameEndAll - boundariesAll;
        
        % Find the descriptors that do not lie on the boundary
        indices =  (sum((frameStartFinal .* frameEndFinal <= 0), 2) == 0);

This code shows the filtering process. It first determines the start and end frame of each descriptor. Next, it subtracts the frame numbers of the shot boundaries from both. Lastly, every descriptor that, for some boundary, ends up with a start lower than or equal to zero and an end higher than or equal to zero spans that boundary and has to be filtered.
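A small worked example may make the sign trick easier to follow. All numbers below are made up, and the variables t, tau2 and boundaries are toy stand-ins for the fields used above. The sketch uses bsxfun instead of repmat, which should give the same result without building the repeated matrices first.

```matlab
% Made-up values: 4 descriptors (center frame t, temporal extent tau2)
t          = [10; 48; 75; 100];
tau2       = [ 4;  4;  4;   4];
boundaries = [50 100];            % shot boundary frames (row vector)

frameStart = t - tau2;            % [ 6; 44; 71;  96]
frameEnd   = t + tau2;            % [14; 52; 79; 104]

% Offsets of start and end relative to every boundary
% (n_descriptors x n_boundaries)
startOff = bsxfun(@minus, frameStart, boundaries);
endOff   = bsxfun(@minus, frameEnd,   boundaries);

% A boundary lies inside [start, end] when the offsets have opposite
% signs or one of them is zero, so their product is <= 0
onBoundary = any(startOff .* endOff <= 0, 2);
keep       = ~onBoundary;         % [1; 0; 1; 0]
```

The second descriptor spans frames 44 to 52 and therefore crosses the boundary at frame 50; the fourth one is centered exactly on the boundary at frame 100. Both are filtered, the other two are kept.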

Master thesis: Action Recognition in Movies
at 12:53 | General, Thesis.

In this first blog entry I will briefly explain the topic of my Master thesis. I started this thesis around a month ago and it is scheduled to take up around 6 months of my time.

The topic for my Master thesis is: "Action Recognition in Movies". The aim of this research is to learn realistic human actions in diverse and realistic video settings. Actions in this context can be kissing, answering a phone, getting out of a car, etc. You can think of many purposes for this kind of action recognition. For instance, it would be possible to search for specific movies in which particular actions occur. Such a query could then return the videos together with the time of the action. Another use might be in surveillance, where particular actions, like fighting, could be of interest. If these actions are detected automatically, it could save time and costs.

[Images: example frames for the actions kissing, pick up phone, and getting out of a car]

I will work together with my supervisor Ivo Everts to build the software for this recognition. The software will rely on previous work, combining existing methods and hopefully resulting in a better performance.

The dataset that we will be using is the same dataset Ivan Laptev used in his work on action recognition in movies (Hollywood2). It consists of a large set of short video fragments from a variety of Hollywood movies. The dataset can be found on his website: http://www.irisa.fr/vista/actions/hollywood2/.

For previous work on this subject, you can read the following articles. These articles show that most work focuses on either detecting the context of a video or detecting actions in movies. I will be combining both methods, context and action recognition, to try and build a more robust system.