
Cross-Modal Recipe Retrieval: How to Cook this Dish?

  • Conference paper
MultiMedia Modeling (MMM 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10132)


Abstract

Users of social media like to share food pictures. One intelligent feature, potentially attractive to amateur chefs, is recommending a recipe along with a food picture. Providing this feature, unfortunately, is still technically challenging. First, current food-recognition technology scales only to a few hundred categories, far short of the tens of thousands of food categories encountered in practice. Second, even a single food category can have many recipe variants that differ in ingredient composition. Finding the best-match recipe requires knowledge of the ingredients, which is a fine-grained recognition problem. In this paper, we consider the problem from the viewpoint of cross-modality analysis. Given a large number of image-recipe pairs acquired from the Internet, a joint space is learnt that locally captures the ingredient correspondence between images and recipes. Because learning happens at the region level for images and the ingredient level for recipes, the model can generalize recognition to unseen food categories. Furthermore, the embedded multi-modal ingredient feature sheds light on retrieving best-match recipes. On an in-house dataset, our model doubles the retrieval performance of DeViSE, a popular cross-modality model that does not consider region information during learning.
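The joint-space idea described above can be sketched in a few lines. This is a minimal illustration only, not the paper's actual architecture: the projection matrices, dimensions, and mean-pooling aggregation are hypothetical stand-ins for trained components, and the max-margin ranking loss is one common choice for training such cross-modal embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: raw region features, ingredient vectors, joint space.
D_IMG, D_TXT, D_JOINT = 16, 8, 4

# Learnable linear projections (random stand-ins for trained weights).
W_img = rng.normal(size=(D_IMG, D_JOINT))
W_txt = rng.normal(size=(D_TXT, D_JOINT))

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x) + eps)

def embed_image(region_feats):
    """Project each image region into the joint space, then mean-pool."""
    return l2_normalize((region_feats @ W_img).mean(axis=0))

def embed_recipe(ingredient_vecs):
    """Project each ingredient vector (e.g. a word embedding), then mean-pool."""
    return l2_normalize((ingredient_vecs @ W_txt).mean(axis=0))

def ranking_loss(img, pos_recipe, neg_recipe, margin=0.3):
    """Max-margin loss: the paired recipe should score above a mismatched one."""
    pos = float(embed_image(img) @ embed_recipe(pos_recipe))
    neg = float(embed_image(img) @ embed_recipe(neg_recipe))
    return max(0.0, margin - pos + neg)

# Toy example: one image with 5 regions, two recipes with 3 ingredients each.
img = rng.normal(size=(5, D_IMG))
recipe_a = rng.normal(size=(3, D_TXT))
recipe_b = rng.normal(size=(3, D_TXT))
loss = ranking_loss(img, recipe_a, recipe_b)
print(round(loss, 4))
```

At retrieval time the same embeddings serve both directions: rank all recipes by cosine similarity to a query image's embedding, or vice versa.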


Notes

  1. https://www.xiachufang.com.

References

  1. Meyers, A., Johnston, N., Rathod, V., Korattikara, A., Gorban, A., Silberman, N., Guadarrama, S., Papandreou, G., Huang, J., Murphy, K.P.: Im2calories: towards an automated mobile vision food diary. In: ICCV, pp. 1233–1241 (2015)


  2. Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: ECCV, pp. 446–461 (2014)


  3. Matsuda, Y., Hoashi, H., Yanai, K.: Recognition of multiple-food images by detecting candidate regions. In: ICME (2012)


  4. Beijbom, O., Joshi, N., Morris, D., Saponas, S., Khullar, S.: Menu-match: restaurant-specific food logging from images. In: WACV, pp. 844–851 (2015)


  5. Kawano, Y., Yanai, K.: Foodcam-256: a large-scale real-time mobile food recognition system employing high-dimensional features and compression of classifier weights. In: ACM MM, pp. 761–762 (2014)


  6. Chen, J., Ngo, C.-W.: Deep-based ingredient recognition for cooking recipe retrieval. In: ACM MM (2016)


  7. Kitamura, K., Yamasaki, T., Aizawa, K.: Food log by analyzing food images. In: ACM MM, pp. 999–1000 (2008)


  8. Aizawa, K., Ogawa, M.: Foodlog: multimedia tool for healthcare applications. IEEE Multimedia 22(2), 4–8 (2015)


  9. Zhang, W., Qian, Y., Siddiquie, B., Divakaran, A., Sawhney, H.: Snap-n-eat: food recognition and nutrition estimation on a smartphone. J. Diab. Sci. Technol. 9(3), 525–533 (2015)


  10. Xu, R., Herranz, L., Jiang, S., Wang, S., Song, X., Jain, R.: Geolocalized modeling for dish recognition. TMM 17(8), 1187–1199 (2015)


  11. Probst, Y., Nguyen, D.T., Rollo, M., Li, W.: mHealth diet and nutrition guidance. mHealth (2015)


  12. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274 (2015)

  13. Xie, H., Yu, L., Li, Q.: A hybrid semantic item model for recipe search by example. In: IEEE International Symposium on Multimedia (ISM), pp. 254–259 (2010)


  14. Wang, X., Kumar, D., Thome, N., Cord, M., Precioso, F.: Recipe recognition with large multimodal food dataset. In: ICMEW, pp. 1–6 (2015)


  15. Su, H., Lin, T.-W., Li, C.-T., Shan, M.-K., Chang, J.: Automatic recipe cuisine classification by ingredients. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, pp. 565–570 (2014)


  16. Matsunaga, H., Doman, K., Hirayama, T., Ide, I., Deguchi, D., Murase, H.: Tastes and textures estimation of foods based on the analysis of its ingredients list and image. In: Murino, V., Puppo, E., Sona, D., Cristani, M., Sansone, C. (eds.) ICIAP 2015. LNCS, vol. 9281, pp. 326–333. Springer, Heidelberg (2015). doi:10.1007/978-3-319-23222-5_40


  17. Maruyama, T., Kawano, Y., Yanai, K.: Real-time mobile recipe recommendation system using food ingredient recognition. In: Proceedings of the ACM International Workshop on Interactive Multimedia on Mobile and Portable Devices, pp. 27–34 (2012)


  18. Yamakata, Y., Imahori, S., Maeta, H., Mori, S.: A method for extracting major workflow composed of ingredients, tools and actions from cooking procedural text. In: 8th Workshop on Multimedia for Cooking and Eating Activities (2016)


  19. Rasiwasia, N., Pereira, J.C., Coviello, E., Doyle, G., Lanckriet, G.R.G., Levy, R., Vasconcelos, N.: A new approach to cross-modal multimedia retrieval. In: ACM MM, pp. 251–260 (2010)


  20. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: DeViSE: a deep visual-semantic embedding model. In: NIPS, pp. 2121–2129 (2013)


  21. Karpathy, A., Joulin, A., Li, F.F.: Deep fragment embeddings for bidirectional image sentence mapping. In: NIPS, pp. 1889–1897 (2014)


  22. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)


  23. Rosipal, R., Krämer, N.: Overview and recent advances in partial least squares. In: Saunders, C., Grobelnik, M., Gunn, S., Shawe-Taylor, J. (eds.) SLSFS 2005. LNCS, vol. 3940, pp. 34–51. Springer, Heidelberg (2006). doi:10.1007/11752790_2


  24. Gong, Y., Ke, Q., Isard, M., Lazebnik, S.: A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV 106(2), 210–233 (2014)


  25. Andrew, G., Arora, R., Bilmes, J.A., Livescu, K.: Deep canonical correlation analysis. In: ICML, pp. 1247–1255 (2013)


  26. Yan, F., Mikolajczyk, K.: Deep correlation for matching images and text. In: CVPR, pp. 3441–3450 (2015)


  27. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, pp. 580–587 (2014)


  28. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)


  29. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: a deep convolutional activation feature for generic visual recognition. In: ICML, pp. 647–655 (2014)


  30. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)



Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (No. 61272290), and the National Hi-Tech Research and Development Program (863 Program) of China under Grant 2014AA015102.

Author information

Correspondence to Chong-Wah Ngo.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Chen, J., Pang, L., Ngo, C.-W. (2017). Cross-Modal Recipe Retrieval: How to Cook this Dish? In: Amsaleg, L., Guðmundsson, G., Gurrin, C., Jónsson, B., Satoh, S. (eds) MultiMedia Modeling. MMM 2017. Lecture Notes in Computer Science, vol. 10132. Springer, Cham. https://doi.org/10.1007/978-3-319-51811-4_48


  • DOI: https://doi.org/10.1007/978-3-319-51811-4_48


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-51810-7

  • Online ISBN: 978-3-319-51811-4

