Marginalised Stacked Denoising Autoencoders for Robust Representation of Real-Time Multi-View Action Recognition.

Gu F, Flórez-Revuelta F, Monekosso D, Remagnino P - Sensors (Basel) (2015)

Bottom Line: Based on the internal evaluation, the codebook size of BoWs and the number of layers of mSDA may not significantly affect recognition performance. According to results on three multi-view benchmark datasets, the proposed framework improves recognition performance across all three datasets and achieves record recognition performance, outperforming the state-of-the-art algorithms in the literature. It is also capable of performing real-time action recognition at a frame rate ranging from 33 to 45 frames per second, which could be further improved by using more powerful machines in future applications.


Affiliation: School of Computing and Information Systems, Kingston University, Penrhyn Road, Kingston upon Thames KT1 2EE, UK. F.Gu@kingston.ac.uk.

ABSTRACT
Multi-view action recognition has gained great interest in video surveillance, human-computer interaction, and multimedia retrieval, where multiple cameras of different types are deployed to provide complementary fields of view. Fusion of multiple camera views evidently leads to more robust decisions on both tracking multiple targets and analysing complex human activities, especially where there are occlusions. In this paper, we incorporate the marginalised stacked denoising autoencoders (mSDA) algorithm to further improve the bag-of-words (BoWs) representation in terms of robustness and usefulness for multi-view action recognition. The resulting representations are fed into three simple fusion strategies as well as a multiple kernel learning algorithm at the classification stage. Based on the internal evaluation, the codebook size of BoWs and the number of layers of mSDA may not significantly affect recognition performance. According to results on three multi-view benchmark datasets, the proposed framework improves recognition performance across all three datasets and achieves record recognition performance, outperforming the state-of-the-art algorithms in the literature. It is also capable of performing real-time action recognition at a frame rate ranging from 33 to 45 frames per second, which could be further improved by using more powerful machines in future applications.
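For readers unfamiliar with mSDA, the sketch below shows in NumPy the closed-form, single-layer marginalised denoising mapping introduced by Chen et al., which the framework stacks on top of the per-view BoW histograms to obtain more robust representations. This is a minimal illustration only; the function names, the corruption probability of 0.5, and the three-layer stack are assumptions for the example, not the paper's reported settings.

```python
import numpy as np

def mda_layer(X, p, reg=1e-5):
    """One marginalised denoising autoencoder layer (closed-form mapping).

    X : (d, n) column-wise feature matrix, e.g. BoW histograms of n clips.
    p : feature corruption probability that is marginalised out analytically.
    Returns the linear mapping W (d, d+1) and the output h = tanh(W [X; 1]).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])            # append a bias row
    q = np.vstack([np.full((d, 1), 1.0 - p), [[1.0]]])
    S = Xb @ Xb.T                                   # scatter matrix
    Q = S * (q @ q.T)                               # E[Xtilde Xtilde^T]
    np.fill_diagonal(Q, q.ravel() * np.diag(S))     # a feature co-occurs with itself with prob. q_i
    P = S * q.T                                     # E[Xb Xtilde^T], columns scaled by q_j
    # W = P[:d, :] (Q + reg I)^{-1}, solved via the symmetric system below
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P[:d, :].T).T
    return W, np.tanh(W @ Xb)

def msda(X, p=0.5, n_layers=3):
    """Stack mDA layers; each layer denoises the previous layer's output.

    The final representation concatenates the input with all hidden layers.
    """
    reps, h = [X], X
    for _ in range(n_layers):
        _, h = mda_layer(h, p)
        reps.append(h)
    return np.vstack(reps)
```

In this sketch, the learned representation for each camera view would then be passed to the fusion step (either one of the simple fusion strategies or the multiple kernel learning classifier) exactly as the original BoW histograms would be.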



Figure 3 (f3-sensors-15-17209): An example of all the camera views in the ACT42 dataset.

Mentions: The ACT42 dataset was recorded to overcome the bottleneck of existing action recognition approaches, by providing a framework for investigating both colour (RGB) and depth information and handling action variants across multiple viewpoints [39]. All the videos are recorded in a living room environment by 4 Microsoft Kinect cameras with a resolution of 640 × 480 at 30 frames per second, as shown in Figure 3. Please note that, due to the scope of this work, we use only the RGB information in our experiments, not the depth information. There are 24 actors performing 14 daily actions, which results in a total of 6844 action instances. However, only a subset of the whole dataset containing 2648 action instances, made publicly available by the authors of [39], is used in this work.

