Single-column CNN for crowd counting with pixel-wise attention mechanism

  • Bisheng Wang
  • Guo Cao
  • Yanfeng Shang
  • Licun Zhou
  • Youqiang Zhang
  • Xuesong Li
Original Article


This paper presents a novel method for accurate people counting in highly dense crowd images. The proposed method consists of three modules: extraction of foreground regions (EF), a pixel-wise attention mechanism (PAM), and a single-column density map estimator (S-DME). EF uses a fully convolutional network to suppress the disturbance of complex backgrounds efficiently; PAM performs pixel-wise classification of crowd images to generate high-quality local crowd density maps; and S-DME is a carefully designed single-column network that learns more representative features with far fewer parameters. In addition, two new evaluation metrics are introduced to give a comprehensive understanding of how the different modules of the algorithm perform. Experiments demonstrate that the approach achieves state-of-the-art results on several challenging datasets, including our own dataset with highly cluttered environments and various camera perspectives.
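As a rough illustration of why suppressing background matters for density-map counting, the sketch below reads a count off a toy density map by summing it, with and without a foreground mask. It is only a sketch under stated assumptions: the function names, the toy values, and the idea of applying the mask by element-wise multiplication are hypothetical and are not taken from the paper, which does not specify here how EF, PAM and S-DME are combined.

```python
# Illustrative sketch only: function names, toy data, and the masking step
# are assumptions, not the authors' implementation.

def apply_foreground_mask(density, mask):
    """Zero out density values where the (hypothetical) foreground mask is 0."""
    return [
        [d if m else 0.0 for d, m in zip(d_row, m_row)]
        for d_row, m_row in zip(density, mask)
    ]


def count_from_density(density):
    """Estimated count = sum (discrete integral) of the density map."""
    return sum(sum(row) for row in density)


# Toy 3x4 density map; the first column stands in for background clutter
# that was wrongly assigned some density.
density = [
    [0.25, 0.5, 0.5, 0.0],
    [0.25, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Toy binary mask: 1 = foreground (crowd), 0 = background.
mask = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]

raw_count = count_from_density(density)
masked_count = count_from_density(apply_foreground_mask(density, mask))
print(raw_count, masked_count)
```

In this toy case the background column adds spurious density to the unmasked count, which is the kind of error a foreground-extraction stage is meant to reduce.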


Crowd counting; CNN; Pixel-wise attention mechanism; FCN


Compliance with ethical standards

Conflict of interest

The authors declare no conflict of interest.



Copyright information

© The Natural Computing Applications Forum 2018

Authors and Affiliations

  1. Nanjing University of Science and Technology, Nanjing, China
  2. The Third Research Institute of the Ministry of Public Security, Shanghai, China
