Human Versus Machine: Establishing a Human Baseline for Multimodal Location Estimation

Choi, Jaeyoung; Lei, Howard; Ekambaram, Venkatesan; Kelm, Pascal; Gottlieb, Luke; Sikora, Thomas; Ramchandran, Kannan; Friedland, Gerald

doi:10.1007/978-3-319-09861-6_9

Abstract

In recent years, the problem of video location estimation (i.e., estimating the longitude/latitude coordinates of a video without GPS information) has been approached with diverse methods and ideas in the research community, and significant improvements have been made. So far, however, systems have only been compared against each other, and no systematic study of human performance has been conducted. Based on a human-subject study with 11,900 experiments, this article presents a human baseline for location estimation for different combinations of modalities (audio, audio/video, audio/video/text). Furthermore, this article compares state-of-the-art location estimation systems with the human baseline. Although the overall performance of humans on multimodal video location estimation is better than that of current machine learning approaches, the difference is quite small: for 41% of the test set, the machine’s accuracy was superior to the humans’. We present case studies and discuss why machines did better for some videos and not for others. Our analysis suggests new directions and priorities for future work on the improvement of location inference algorithms.

Notes

1. http://multimediaeval.org/.
2. http://www.flickr.com.

Author information

Authors and Affiliations

International Computer Science Institute, Berkeley, CA, USA
Jaeyoung Choi, Howard Lei, Luke Gottlieb & Gerald Friedland
California State University, East Bay, CA, USA
Howard Lei
University of California at Berkeley, Berkeley, CA, USA
Venkatesan Ekambaram & Kannan Ramchandran
Technische Universität, Berlin, Germany
Pascal Kelm & Thomas Sikora

Corresponding author

Correspondence to Jaeyoung Choi.

Editor information

Editors and Affiliations

International Computer Science Institute, Berkeley, California, USA
Jaeyoung Choi
International Computer Science Institute, Berkeley, California, USA
Gerald Friedland

About this chapter

Cite this chapter

Choi, J. et al. (2015). Human Versus Machine: Establishing a Human Baseline for Multimodal Location Estimation. In: Choi, J., Friedland, G. (eds) Multimodal Location Estimation of Videos and Images. Springer, Cham. https://doi.org/10.1007/978-3-319-09861-6_9

DOI: https://doi.org/10.1007/978-3-319-09861-6_9
Published: 05 October 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09860-9
Online ISBN: 978-3-319-09861-6
eBook Packages: Engineering, Engineering (R0)
