Marginalised Stacked Denoising Autoencoders for Robust Representation of Real-Time Multi-View Action Recognition.

Gu F, Flórez-Revuelta F, Monekosso D, Remagnino P - Sensors (Basel) (2015)

Bottom Line: Based on the internal evaluation, the codebook size of BoWs and the number of layers of mSDA may not significantly affect recognition performance. According to results on three multi-view benchmark datasets, the proposed framework improves recognition performance across all three datasets and achieves record recognition performance, beating the state-of-the-art algorithms in the literature. It is also capable of performing real-time action recognition at frame rates ranging from 33 to 45 FPS, which could be further improved by using more powerful machines in future applications.


Affiliation: School of Computing and Information Systems, Kingston University, Penrhyn Road, Kingston upon Thames KT1 2EE, UK. F.Gu@kingston.ac.uk.

ABSTRACT
Multi-view action recognition has gained great interest in video surveillance, human-computer interaction, and multimedia retrieval, where multiple cameras of different types are deployed to provide complementary fields of view. Fusing multiple camera views evidently leads to more robust decisions on both tracking multiple targets and analysing complex human activities, especially where there are occlusions. In this paper, we incorporate the marginalised stacked denoising autoencoders (mSDA) algorithm to further improve the bag-of-words (BoWs) representation in terms of robustness and usefulness for multi-view action recognition. The resulting representations are fed into three simple fusion strategies, as well as a multiple kernel learning (MKL) algorithm, at the classification stage. Based on the internal evaluation, the codebook size of BoWs and the number of layers of mSDA may not significantly affect recognition performance. According to results on three multi-view benchmark datasets, the proposed framework improves recognition performance across all three datasets and achieves record recognition performance, beating the state-of-the-art algorithms in the literature. It is also capable of performing real-time action recognition at frame rates ranging from 33 to 45 FPS, which could be further improved by using more powerful machines in future applications.
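To make the representation step concrete, the following is a minimal sketch of the closed-form marginalised denoising autoencoder on which mSDA is built (Chen et al., 2012). It assumes column-major features and a single feature-corruption probability p; the function and variable names are illustrative, not the authors' implementation.

import numpy as np

def mda_layer(X, p):
    # One marginalised denoising autoencoder layer (Chen et al., 2012).
    # X: (d, n) matrix of n samples; p: probability of corrupting each feature.
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])         # append a constant bias feature
    q = np.full((d + 1, 1), 1.0 - p)
    q[-1] = 1.0                                  # the bias is never corrupted
    S = Xb @ Xb.T                                # scatter matrix
    Q = S * (q @ q.T)
    np.fill_diagonal(Q, q.ravel() * np.diag(S))  # diagonal uses q_i, not q_i^2
    P = S[:d, :] * q.T                           # expected cross-correlation
    W = np.linalg.solve(Q.T, P.T).T              # closed-form solution of W Q = P
    return np.tanh(W @ Xb)                       # nonlinear hidden representation

def msda(X, p, n_layers):
    # Stack layers and concatenate the input with every hidden layer,
    # which is why the final representation is higher dimensional.
    reps, h = [X], X
    for _ in range(n_layers):
        h = mda_layer(h, p)
        reps.append(h)
    return np.vstack(reps)

The concatenation across layers is what produces the higher-dimensional but denser representation discussed in the figure analysis below.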


Figure 6 (f6-sensors-15-17209): Performance comparison of the proposed methods on the ACT42 dataset, in terms of average recognition rates, with respect to (a) the codebook size of BoWs (from 4 K to 20 K, with five mSDA layers) and (b) the number of mSDA layers (from 1 to 5, with a 4 K codebook).
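Since panel (a) sweeps the codebook size, a brief sketch of BoW encoding may help fix ideas: local descriptors are clustered into k visual words, and each video becomes a k-bin histogram, so the codebook size directly sets the dimensionality (and, for large k, the sparsity) of the representation. The sketch below uses scikit-learn's MiniBatchKMeans; all names are illustrative.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(train_descriptors, k):
    # Cluster pooled local descriptors (e.g. IDT features) into k visual words.
    # train_descriptors: (n_descriptors, d) array gathered from training videos.
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(train_descriptors)

def bow_histogram(codebook, video_descriptors):
    # Encode one video as an L1-normalised histogram of visual-word counts;
    # a larger k yields a higher-dimensional but sparser histogram.
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)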

Mentions: Figure 6 shows the evaluation of the BoWs codebook size and the number of mSDA layers on the ACT42 dataset, where similar trends can be observed. On the one hand, increasing the codebook size yields minor improvements initially, but recognition performance drops as the size approaches 20 K: once again, a larger codebook produces a sparser and less discriminative representation. On the other hand, increasing the number of mSDA layers leads to slight but consistent improvements, as the higher-dimensional representation is denser and more robust for discriminating between action classes.

The results of all compared methods on the ACT42 dataset are shown in Table 3. The top half lists the state-of-the-art algorithms in the literature, the majority of which use both RGB and depth data in their experiments, while the bottom half lists the methods described in this work, which use only the RGB colour data. Another significant difference is the much higher dimensionality of the systems using the mSDA representation, due to the deep network structure and the concatenation of layer outputs during the transformation. However, since all the classifiers are SVM-based models, which are known to handle high-dimensional data well, the impact on classification performance and computational complexity is minimal. Although our proposed methods use only RGB data, they outperform the state-of-the-art algorithms, thanks to the richer visual features of the improved dense trajectories (IDT) descriptor and the more robust and useful mSDA representation. Similarly, the MKL algorithm performs significantly better than the simple fusion strategies, and the systems with mSDA representations noticeably outperform those using only BoWs representations. The best performance is achieved by the MKL algorithm with the IDT descriptor and mSDA representation, reaching 0.857 at 33 FPS.
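The paper's MKL formulation is not reproduced here, but a fixed-weight sum of per-view kernels fed to a precomputed-kernel SVM sketches the fusion idea: each camera view contributes its own kernel over mSDA-encoded features, and MKL would learn the combination weights that are held fixed in this illustration. All names are hypothetical.

import numpy as np
from sklearn.svm import SVC

def sum_kernel(A_views, B_views, weights):
    # Weighted sum of per-view linear kernels K_v = A_v B_v^T.
    # A_views/B_views: lists of (n_a, d_v) and (n_b, d_v) feature matrices,
    # one per camera view (e.g. mSDA-encoded BoW histograms).
    return sum(w * (A @ B.T) for w, A, B in zip(weights, A_views, B_views))

# Hypothetical usage with per-view train/test features and labels y_train:
# weights = np.ones(n_views) / n_views        # uniform; MKL would learn these
# clf = SVC(kernel="precomputed", C=1.0)
# clf.fit(sum_kernel(Xtr_views, Xtr_views, weights), y_train)
# preds = clf.predict(sum_kernel(Xte_views, Xtr_views, weights))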

