Abstract
On social media, users like to share food pictures. One intelligent feature, potentially attractive to amateur chefs, is the recommendation of a recipe along with a shared food picture. Providing this feature, unfortunately, remains technically challenging. First, current food recognition technology scales only to a few hundred categories, which is far from practical for recognizing tens of thousands of food categories. Second, even a single food category can have recipe variants that differ in ingredient composition. Finding the best-match recipe requires knowledge of ingredients, which is a fine-grained recognition problem. In this paper, we consider the problem from the viewpoint of cross-modality analysis. Given a large number of image and recipe pairs acquired from the Internet, a joint space is learnt to locally capture the ingredient correspondence between images and recipes. As learning happens at the region level for images and the ingredient level for recipes, the model is able to generalize recognition to unseen food categories. Furthermore, the embedded multi-modal ingredient feature sheds light on the retrieval of best-match recipes. On an in-house dataset, our model doubles the retrieval performance of DeViSE, a popular cross-modality model that does not consider region information during learning.
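To make the retrieval setting concrete, the sketch below shows how ranking works once a joint space has been learned: the query image and all candidate recipes are embedded into the same space, and recipes are ranked by cosine similarity to the image. This is a minimal illustration, not the authors' implementation; the embeddings here are random placeholders standing in for features produced by a trained cross-modal model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned embeddings in a shared d-dimensional space:
# one query food image and a pool of candidate recipes.
d = 128
image_vec = rng.normal(size=d)             # embedding of the query image
recipe_vecs = rng.normal(size=(1000, d))   # embeddings of 1000 recipes

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

image_vec = l2_normalize(image_vec)
recipe_vecs = l2_normalize(recipe_vecs)

# Cosine similarity between the image and every recipe, then rank.
scores = recipe_vecs @ image_vec
ranking = np.argsort(-scores)              # best-match recipe first
top_10 = ranking[:10]
```

In the paper's setting the two encoders are trained so that matching image regions and recipe ingredients land close together in this space, which is what allows retrieval to extend to food categories never seen during training.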
References
Meyers, A., Johnston, N., Rathod, V., Korattikara, A., Gorban, A., Silberman, N., Guadarrama, S., Papandreou, G., Huang, J., Murphy, K.P.: Im2calories: towards an automated mobile vision food diary. In: ICCV, pp. 1233–1241 (2015)
Bossard, L., Guillaumin, M., Van Gool, L.: Food-101: mining discriminative components with random forests. In: ECCV, pp. 446–461 (2014)
Matsuda, Y., Hoashi, H., Yanai, K.: Recognition of multiple-food images by detecting candidate regions. In: ICME (2012)
Beijbom, O., Joshi, N., Morris, D., Saponas, S., Khullar, S.: Menu-match: restaurant-specific food logging from images. In: WACV, pp. 844–851 (2015)
Kawano, Y., Yanai, K.: Foodcam-256: a large-scale real-time mobile food recognition system employing high-dimensional features and compression of classifier weights. In: ACM MM, pp. 761–762 (2014)
Chen, J., Ngo, C.-W.: Deep-based ingredient recognition for cooking recipe retrieval. In: ACM MM (2016)
Kitamura, K., Yamasaki, T., Aizawa, K.: Food log by analyzing food images. In: ACM MM, pp. 999–1000 (2008)
Aizawa, K., Ogawa, M.: Foodlog: multimedia tool for healthcare applications. IEEE Multimedia 22(2), 4–8 (2015)
Zhang, W., Qian, Y., Siddiquie, B., Divakaran, A., Sawhney, H.: Snap-n-eat: food recognition and nutrition estimation on a smartphone. J. Diab. Sci. Technol. 9(3), 525–533 (2015)
Xu, R., Herranz, L., Jiang, S., Wang, S., Song, X., Jain, R.: Geolocalized modeling for dish recognition. TMM 17(8), 1187–1199 (2015)
Probst, Y., Nguyen, D.T., Rollo, M., Li, W.: mHealth diet and nutrition guidance. mHealth (2015)
Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274 (2015)
Xie, H., Yu, L., Li, Q.: A hybrid semantic item model for recipe search by example. In: IEEE International Symposium on Multimedia (ISM), pp. 254–259 (2010)
Wang, X., Kumar, D., Thome, N., Cord, M., Precioso, F.: Recipe recognition with large multimodal food dataset. In: ICMEW, pp. 1–6 (2015)
Su, H., Lin, T.-W., Li, C.-T., Shan, M.-K., Chang, J.: Automatic recipe cuisine classification by ingredients. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, pp. 565–570 (2014)
Matsunaga, H., Doman, K., Hirayama, T., Ide, I., Deguchi, D., Murase, H.: Tastes and textures estimation of foods based on the analysis of its ingredients list and image. In: Murino, V., Puppo, E., Sona, D., Cristani, M., Sansone, C. (eds.) ICIAP 2015. LNCS, vol. 9281, pp. 326–333. Springer, Heidelberg (2015). doi:10.1007/978-3-319-23222-5_40
Maruyama, T., Kawano, Y., Yanai, K.: Real-time mobile recipe recommendation system using food ingredient recognition. In: Proceedings of the ACM International Workshop on Interactive Multimedia on Mobile and Portable Devices, pp. 27–34 (2012)
Yamakata, Y., Imahori, S., Maeta, H., Mori, S.: A method for extracting major workflow composed of ingredients, tools and actions from cooking procedural text. In: 8th Workshop on Multimedia for Cooking and Eating Activities (2016)
Rasiwasia, N., Pereira, J.C., Coviello, E., Doyle, G., Lanckriet, G.R.G., Levy, R., Vasconcelos, N.: A new approach to cross-modal multimedia retrieval. In: ACM MM, pp. 251–260 (2010)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: DeViSE: a deep visual-semantic embedding model. In: NIPS, pp. 2121–2129 (2013)
Karpathy, A., Joulin, A., Li, F.F.: Deep fragment embeddings for bidirectional image sentence mapping. In: NIPS, pp. 1889–1897 (2014)
Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)
Rosipal, R., Krämer, N.: Overview and recent advances in partial least squares. In: Saunders, C., Grobelnik, M., Gunn, S., Shawe-Taylor, J. (eds.) SLSFS 2005. LNCS, vol. 3940, pp. 34–51. Springer, Heidelberg (2006). doi:10.1007/11752790_2
Gong, Y., Ke, Q., Isard, M., Lazebnik, S.: A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV 106(2), 210–233 (2014)
Andrew, G., Arora, R., Bilmes, J.A., Livescu, K.: Deep canonical correlation analysis. In: ICML, pp. 1247–1255 (2013)
Yan, F., Mikolajczyk, K.: Deep correlation for matching images and text. In: CVPR, pp. 3441–3450 (2015)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, pp. 580–587 (2014)
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: a deep convolutional activation feature for generic visual recognition. In: ICML, pp. 647–655 (2014)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (No. 61272290), and the National Hi-Tech Research and Development Program (863 Program) of China under Grant 2014AA015102.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Chen, J., Pang, L., Ngo, CW. (2017). Cross-Modal Recipe Retrieval: How to Cook this Dish?. In: Amsaleg, L., Guðmundsson, G., Gurrin, C., Jónsson, B., Satoh, S. (eds) MultiMedia Modeling. MMM 2017. Lecture Notes in Computer Science(), vol 10132. Springer, Cham. https://doi.org/10.1007/978-3-319-51811-4_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-51810-7
Online ISBN: 978-3-319-51811-4
eBook Packages: Computer Science (R0)