Blog
This week I am trying to finish the new labeling for the Hollywood2 data. Instead of relying on the automatic labeling done by Ivan Laptev, I am now manually labeling all 1700 videos. Obviously this will take quite some time, but it will be very useful for further experiments. I will also make this data openly available for future use.
To aid me in labeling, I created a simple GUI application in Matlab. With this tool I can easily step through all frames of a video, rewind, pause, etc. To save an action, I simply press the button for that action. Using this tool I should be able to finish the labeling before the end of the week.

Since my last entry I haven't done many new things, mainly because I have been running more experiments with different settings, but also because I have been working a bit more on my iPhone program for a course called Game Programming (I will probably write a blog entry on this as well).
Yesterday, however, I finished some code for checking performance on the results of splitting the test data into smaller pieces of video. The splitting was done on the shot boundaries that come with the Hollywood2 dataset. As we do not have specific labels for this split data, I cannot say anything specific about the performance. What I can do is check random videos and verify the results.
So I wrote a script that picks a random video and plots the likelihood of every action for each part of the video (divided using the shot boundaries). The results look promising: for most videos I looked at, each action gets its highest likelihood in the right shot. However, this is not true for every video, and sometimes a video still gets high likelihoods for some actions even though no action takes place in it. As the labels for each video are generated by an automated process, described in Learning realistic human actions from movies, it is likely that they contain errors. Because this can become quite a problem if you want a more specific location of an action in a video, in the coming days I will be looking at a way of labeling the data by hand. This might take some time, but will be useful for further experiments.
Just to show one of the promising results I got yesterday, here's a short video from Raising Arizona (1987), together with the likelihoods for each shot. In the video, a man is running in the second, fourth and sixth shots. The likelihood results clearly show the same pattern.
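The visual check above can also be done numerically. The snippet below is only a sketch in Python (not the Matlab script used for the plots): the function name, the threshold of 0.5 and the likelihood values are all made up for illustration and are not the real numbers from the Raising Arizona clip.

```python
def shots_above(likelihoods, threshold):
    """Return the 1-based numbers of the shots whose likelihood exceeds threshold."""
    return [i for i, value in enumerate(likelihoods, start=1) if value > threshold]


# Hypothetical per-shot likelihoods for a running action in a six-shot clip.
run_likelihoods = [0.10, 0.85, 0.20, 0.80, 0.15, 0.90]

print(shots_above(run_likelihoods, 0.5))  # [2, 4, 6]
```

With values like these, the second, fourth and sixth shots stand out, which is the kind of pattern the plot shows.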

Today we finally got the first results that are comparable to the ones from Ivan Laptev's paper. Interestingly enough, our results seem significantly better. This should not be the case, as we followed the same path Laptev used in his paper with the same dataset and same values for the different algorithms involved.
Obviously somewhere along the way, something must be different. Ivo and I will have to find out how this is possible and at least come up with a logical explanation.
The next step for me will be creating code for splitting up the dataset into smaller time frames. It will then be possible to see whether the trained classifier can also find the correct actions in smaller time frames. This is useful for finding the actual position of the action in a movie (not just indicating whether an action occurs in a given movie).
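A minimal sketch of such a split, written in Python for illustration. It assumes the cut points are given as frame indices (for example shot boundaries) where a new segment starts; that input format and the function name are assumptions, not something taken from the dataset documentation.

```python
def split_into_segments(n_frames, cut_points):
    """Return (start, end) frame ranges, end exclusive, one per segment.

    Each entry of cut_points is assumed to be the first frame of a new segment.
    """
    # Ignore duplicates and cut points outside the video.
    cuts = sorted(c for c in set(cut_points) if 0 < c < n_frames)
    starts = [0] + cuts
    ends = cuts + [n_frames]
    return list(zip(starts, ends))


print(split_into_segments(250, [40, 95, 180]))
# [(0, 40), (40, 95), (95, 180), (180, 250)]
```

Each range can then be turned into its own feature vector and passed to the classifier separately.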

In the previous entry I talked about clustering the descriptors. Even though the clustering algorithm was run on a really fast machine with loads of memory, it was still running after two days. It was not even finished with clustering the HoG descriptors, while it still had to do the HoF descriptors afterwards (which takes at least the same amount of time). As I had to continue with my work, a faster clustering algorithm was necessary.
Luckily Ivo had a solution: careful seeding. More is explained in the paper by Arthur and Vassilvitskii, k-means++: The Advantages of Careful Seeding. This proved to be a good temporary solution. The clustering was now done in less than 5 minutes (note that we don't know if the result is optimal, but it is usable nonetheless).
After the initial clustering was done, it was time to create feature vectors for each video sample. To do this, we simply go through all HoG and HoF descriptors in the video sample and measure the distance to each cluster (Euclidean distance in our case). Then for each cluster we count the number of descriptors that are closest to it. Finally we normalize all the values by dividing by the number of descriptors. The values for all clusters together form the feature vector for the video sample. The following code block contains the code to calculate these feature vectors.
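For readers unfamiliar with careful seeding, here is a small Python sketch of the seeding step from the Arthur and Vassilvitskii paper. It is not the code we ran: the function names are made up, it works on plain lists of tuples, and it only picks the initial centers. The first center is chosen uniformly at random, and every next center is chosen with probability proportional to its squared distance to the nearest center picked so far.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers from points using k-means++ seeding."""
    if rng is None:
        rng = random.Random()
    # First center: chosen uniformly at random.
    seeds = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center so far.
    d2 = [squared_distance(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to a uniform pick.
            new_seed = rng.choice(points)
        else:
            # Next center: sampled with probability proportional to d2.
            new_seed = rng.choices(points, weights=d2, k=1)[0]
        seeds.append(new_seed)
        d2 = [min(old, squared_distance(p, new_seed)) for old, p in zip(d2, points)]
    return seeds


# Toy 2-D "descriptors"; real HoG/HoF descriptors have many more dimensions.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeanspp_seeds(toy_points, 2, random.Random(0)))
```

Because far-away points are more likely to be picked, the initial centers tend to be spread out over the data, which is why far fewer iterations are needed afterwards.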
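calcVisualWordHist
    function vwHist = calcVisualWordHist(descriptors, centroids)
        % Calculate necessary sizes (assumes one descriptor / centroid per row)
        nDescs = size(descriptors, 1);
        nClusters = size(centroids, 1);
        vwHist = zeros(1, nClusters);
        % Calculate all distances for each descriptor to each cluster
        for i=1:nDescs
            diffs = centroids - repmat(descriptors(i, :), nClusters, 1);
            % Count the descriptor for its nearest cluster (Euclidean distance)
            [~, nearest] = min(sum(diffs .^ 2, 2));
            vwHist(nearest) = vwHist(nearest) + 1;
        end
        % Make histogram and normalize
        vwHist = vwHist ./ nDescs;
    end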
As there are 1700 video samples (each containing a few thousand descriptors), it would once again take too long to run the script on my own laptop. Instead, Ivo used this opportunity to teach me how to work with the national computer cluster Lisa. One can simply write a bash script and submit it as a job to Lisa, which will run the script as soon as it has an available spot on the cluster. For each script you have to specify how much time and memory you will need. So we split up the script into 18 pieces, each calculating the feature vectors for up to 100 samples. This took about 3 hours to complete.
The Hollywood2 dataset also contains labels for each video sample, describing whether a certain action is performed in the video. Using these labels and the feature vectors we created, it was possible to create a classifier and to see how well it performs on our data. As we do not have the specific frames where each action is performed, the classifier can only tell whether a certain action happened somewhere in a given video sample. It cannot tell us the exact location of the action. Although we obviously want to know at what point in the video the action is taking place, it is nevertheless a good starting point to know in which video it takes place. The results from the first classification experiments look quite promising. The errors are quite low for most actions (under 0.1).
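The split into jobs is simple arithmetic, sketched here in Python. The function name is made up, and the sample count of 1707 is an assumption: it is a value that agrees with both "about 1700 samples" and 18 pieces of at most 100 samples (exactly 1700 samples would give only 17 pieces).

```python
import math


def chunk_ranges(n_samples, chunk_size):
    """Return 1-based inclusive (first, last) sample ranges, one per job."""
    n_chunks = math.ceil(n_samples / chunk_size)
    return [
        (i * chunk_size + 1, min((i + 1) * chunk_size, n_samples))
        for i in range(n_chunks)
    ]


ranges = chunk_ranges(1707, 100)  # 1707 is an assumed sample count
print(len(ranges), ranges[0], ranges[-1])  # 18 (1, 100) (1701, 1707)
```

Each range can be passed as arguments to one job script, so every job only touches its own samples.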
The next step is to build a multi-class classifier to distinguish between different actions. Also I will be experimenting with methods for finding the specific frames where the action is taking place. However, the latter problem does require me to go through the whole dataset, as this information is not in the Hollywood2 dataset by default (a daunting task...).

Today I finished the code for subsampling the descriptors. This is needed because it would take way too long to cluster all descriptors.
Together with Ivo, I copied all the code to the Kameleon server here in Science Park. This server has 72 GB of internal memory, dual quad-core CPUs and 3 GPU cores. Obviously this will process the data a little bit faster than my MacBook Pro.
We fixed some small bugs in my code and copied all the Hollywood2 data to the server as well. The script to cluster all data is currently running. We use the same settings as in the paper by Ivan Laptev (Learning realistic human actions from movies):
- 100k features, sampled from the training videos
- 4000 clusters
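Drawing a fixed-size random subset from a very large pool of descriptors can be done without loading everything into memory. The Python sketch below uses reservoir sampling for this. It is one possible approach and not necessarily how my own subsampling code works; the function name and the toy numbers are made up for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    if rng is None:
        rng = random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the n-th item with probability k / n.
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample


# Toy run: 10 "descriptors" out of 1000 (the real setting would be k = 100000).
subset = reservoir_sample(range(1000), 10, random.Random(1))
print(len(subset))  # 10
```

The descriptors can be streamed video by video, so memory use stays proportional to the sample size rather than to the total number of descriptors.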
Hopefully we will end up with results similar to those of Ivan Laptev, so that we can easily make a comparison. Next up, I will start working on the code to create visual words using the calculated clusters.