Leaderboard for Our CVPR-2017 Workshop Challenge

The challenge has ended. We received 63 submissions during the submission period. Below is the final leaderboard: the parsing challenge is ranked by Mean IoU (%) and the pose challenge by PCKh. Results without a clear method description have been omitted.

Human Parsing Challenge

| Ranking | Method | Pixel accuracy | Mean accuracy | Mean IoU | Frequency weighted IoU | Submit Time |
|---|---|---|---|---|---|---|
| 1 | VSNet-SLab+Samsung | 87.06 | 66.73 | 54.13 | 77.98 | 2017-06-04 15:14:38 |
| 2 | Self-Supervised Neural Aggregation Networks | 87.29 | 63.35 | 52.26 | 78.25 | 2017-06-04 13:23:59 |
| 3 | WhiskNet | 86.16 | 57.95 | 47.74 | 76.45 | 2017-06-03 14:22:05 |
| 4 | BUPTMM-Parsing | 84.93 | 55.62 | 45.44 | 74.60 | 2017-06-04 14:54:06 |
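For reference, the Mean IoU that ranks the parsing track can be sketched as below. This is a minimal NumPy version for illustration only; the official evaluation script may differ in details such as how absent classes are handled.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class intersection-over-union, averaged over classes.
    pred, gt: (H, W) integer label maps.
    Classes absent from both prediction and ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return 100.0 * np.mean(ious)  # reported as a percentage
```

On a real submission, `pred` and `gt` would be the predicted and annotated part-label maps for each test image, with per-class intersections and unions accumulated over the whole test set before the final average.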


WhiskNet
Contributors: Haoshu Fang, Yuwing Tai, Cewu Lu
Description: Multi-scale features have been shown to improve semantic segmentation performance. However, without careful network design, deep models such as ResNet-101 cannot fully exploit the atrous convolution structure proposed in [1] to leverage multi-scale features. We propose WhiskNet, which uses ResNet building blocks to extract very deep multi-scale features and incorporate them into a single network. WhiskNet additionally applies a multi-atrous-convolution module at each scale, which performs well when merging multi-scale features.
References:
[1] Attention to Scale: Scale-aware Semantic Image Segmentation. Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. CVPR 2016.

Self-Supervised Neural Aggregation Networks
Contributors: ZHAO Jian (NUS & NUDT), NIE Xuecheng (NUS), XIAO Huaxin (NUS & NUDT), CHEN Yunpeng (NUS), LI Jianshu (NUS), YAN Shuicheng (NUS & Qihoo360 AI Institute). The first three authors contributed equally.
Description: We present a Self-Supervised Neural Aggregation Network (SS-NAN) for human parsing. SS-NAN adaptively learns to aggregate multi-scale features at each pixel location. To further improve the discriminative capacity of the features, a self-supervised joint loss is adopted as an auxiliary learning strategy, which imposes human joint structure on the parsing results without resorting to extra supervision. SS-NAN is end-to-end trainable and can be integrated into any advanced neural network to aggregate features according to their importance at different positions and scales, and to incorporate rich high-level knowledge of human joint structure from a global perspective, which in turn improves the parsing results. To further boost overall performance, we also leverage a robust multi-view strategy with several state-of-the-art backbone models.

BUPTMM-Parsing
Contributors: Peng Cheng, Xiaodong Liu, Peiye Liu, Wu Liu
Description: We revised and fine-tuned Attention+SSL [1] and Attention to Scale [2] on the LIP training set, then combined the two models with different fusion strategies.
References:
[1] Look into Person: Self-supervised Structure-sensitive Learning and a New Benchmark for Human Parsing. Ke Gong, Xiaodan Liang, Xiaohui Shen, Liang Lin. CVPR 2017.
[2] Attention to Scale: Scale-aware Semantic Image Segmentation. Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. CVPR 2016.

VSNet-SLab+Samsung
Contributors: Lejian Ren [1], Renda Bao [1], Yao Sun [1], Si Liu [1], Yinglu Liu [2], Yanli Li [2], Junjun Xiong [2]. [1] IIE, CAS; [2] Beijing Samsung Telecom R&D Center.
Description: We propose a view-specific contextual human parsing method with two core contributions. (1) The model has a cascade structure consisting of a view classifier and a human parsing model specific to each view. The view classifier predicts whether the person is seen from the front or the back; the view ground truth is generated automatically by analyzing the parsing ground truth with human knowledge. We observe that the IoUs of the left/right legs and left/right shoes are significantly boosted on the validation set. (2) We train a category classifier to estimate the labels of the images [1]; the classification results serve as context for the parsing and boost performance. Two human parsing models, based on RefineNet [2] and PSPNet [3], are implemented, and the best results were obtained by combining them. No extra datasets were used.
References:
[1] Human Parsing with Contextualized Convolutional Neural Network. Xiaodan Liang, Chunyan Xu, Xiaohui Shen, Jianchao Yang, Si Liu, Jinhui Tang, Liang Lin, Shuicheng Yan. TPAMI 2016.
[2] RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. Guosheng Lin, Anton Milan, Chunhua Shen, Ian Reid. CVPR 2017.
[3] Pyramid Scene Parsing Network. Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia. CVPR 2017.
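Several of the parsing entries combine multiple models "with different fusion strategies". The teams do not publish their exact fusion code, but one common strategy, shown here as a hypothetical minimal sketch, is to average the per-pixel class probabilities of the models before taking the argmax:

```python
import numpy as np

def fuse_parsing_outputs(prob_maps):
    """Fuse the softmax outputs of several parsing models by averaging.
    prob_maps: list of (H, W, C) arrays of per-pixel class probabilities,
    all aligned to the same image size.
    Returns the fused (H, W) integer label map."""
    fused = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return fused.argmax(axis=-1)
```

Other plausible choices (weighted averaging, per-class model selection, max-voting on label maps) fit the same interface; the descriptions above do not say which the teams used.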

Human Pose Challenge

| Ranking | Method | PCKh | Submit Time |
|---|---|---|---|
| 1 | NTHU-Pose | 87.400 | 2017-06-02 03:06:07 |
| 2 | Pyramid Stream Network (Multi-Model) | 82.100 | 2017-06-03 08:03:30 |
| 3 | BUPTMM-POSE | 80.200 | 2017-06-04 14:53:20 |
| 4 | Hybrid Pose Machine | 77.200 | 2017-06-04 13:38:59 |
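The pose track is ranked by PCKh: the percentage of predicted keypoints that fall within a fraction of the head-segment length of the ground truth. A minimal sketch of PCKh@0.5 is given below; it ignores the invisible-joint masking that the official evaluation applies, so treat it as illustrative only.

```python
import numpy as np

def pckh(pred_joints, gt_joints, head_sizes, alpha=0.5):
    """PCKh: percentage of predictions within alpha * head size of the truth.
    pred_joints, gt_joints: (N, J, 2) arrays of (x, y) joint coordinates.
    head_sizes: (N,) array of per-person head-segment lengths."""
    dists = np.linalg.norm(pred_joints - gt_joints, axis=-1)  # (N, J)
    correct = dists <= alpha * head_sizes[:, None]
    return 100.0 * correct.mean()
```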


NTHU-Pose
Contributors: Chia-Jung Chou, Jui-Ting Chien, and Hwann-Tzong Chen, National Tsing Hua University ("Self Adversarial Training for Human Pose Estimation").
Description: We adapt the Boundary Equilibrium GAN as our learning model, setting up two stacked hourglass networks, one as the generator and the other as the discriminator. After training, the generator serves as the pose estimator. The discriminator distinguishes ground-truth heatmaps from generated ones and back-propagates the adversarial loss to the generator, enabling the generator to learn plausible human body configurations. The entire model is trained from scratch using only the LIP training data.

Pyramid Stream Network (Multi-Model)
Contributors: NIE Xuecheng (NUS), ZHAO Jian (NUS & NUDT), XIAO Huaxin (NUS & NUDT), CHEN Yunpeng (NUS), LI Jianshu (NUS), YAN Shuicheng (NUS & Qihoo360 AI Institute). The first three authors contributed equally.
Description: The Pyramid Stream Network (PSN) is composed of a stream of pyramid units that predict body-joint confidence maps at different resolutions. Its two main advantages are: (1) exploiting contextual information to iteratively refine the confidence maps by learning implicit spatial relationships between body joints; (2) combining high-resolution, semantically weak features with low-resolution, semantically strong features via a top-down pathway and lateral connections.

Hybrid Pose Machine
Contributors: Yue Liao* [1], Ruihe Qian* [1], Si Liu [1], Yao Sun [1], Yinglu Liu [2], Yanli Li [2], Junjun Xiong [2]. [1] IIE, CAS; [2] Beijing Samsung Telecom R&D Center.
Description: We propose a hybrid method for human pose estimation. We first extract human bounding boxes from the pose ground truth and train a Faster R-CNN [1] based human detector to infer the human box at test time. Convolutional Pose Machines with Part Affinity Fields [2] and Stacked Hourglass Networks [3] are then applied to estimate the poses, and the results are merged. No extra data are used in our method.
References:
[1] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Shaoqing Ren, Kaiming He, et al. TPAMI 2016.
[2] Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. Zhe Cao, Tomas Simon, Shih-En Wei, et al. CVPR 2017.
[3] Stacked Hourglass Networks for Human Pose Estimation. Alejandro Newell et al. ECCV 2016.

BUPTMM-POSE
Contributors: Wu Liu, Huadong Ma, Peng Cheng, Cheng Zhang, Haoran Lv, Xiongxiong Dong
Description: We revised and fine-tuned Stacked Hourglass [1] and Convolutional Pose Machines [2] on the LIP training set, then combined them with different fusion strategies.
References:
[1] Stacked Hourglass Networks for Human Pose Estimation. Alejandro Newell, Kaiyu Yang, and Jia Deng. ECCV 2016.
[2] Convolutional Pose Machines. Shih-En Wei, Varun Ramakrishna, Takeo Kanade, Yaser Sheikh. CVPR 2016.
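All of the pose entries above predict per-joint confidence heatmaps (stacked hourglass, CPM, and PAF-based models alike). The final step of turning heatmaps into joint coordinates can be sketched as follows, assuming one heatmap per joint and taking the peak location as the prediction; real pipelines typically add sub-pixel refinement, which is omitted here:

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Decode joint locations from confidence heatmaps.
    heatmaps: (J, H, W) array, one confidence map per joint.
    Returns a (J, 2) array of (x, y) peak locations."""
    J, H, W = heatmaps.shape
    flat_idx = heatmaps.reshape(J, -1).argmax(axis=1)  # peak per joint
    ys, xs = np.unravel_index(flat_idx, (H, W))
    return np.stack([xs, ys], axis=1)
```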