Multimedia Tools and Applications, Volume 78, Issue 2, pp 2401–2425

Screen recording segmentation to scenes for eye-tracking analysis

  • Jakub Simko
  • Jakub Vrba


In usability studies involving eye-tracking, quantitative analysis of gaze data requires information about so-called scene occurrences. Scene occurrences are time segments during which the application user interface remains more or less static, so gaze events (e.g., fixations) can be mapped to particular areas of interest (user interface elements). Scene occurrences typically start and end with user interface changes such as page-to-page transitions, menu expansions, overlay prompts, etc. Normally, one would record such changes programmatically through application logging, yet in many studies this is not possible. For example, in early-prototype mobile app testing, often only a camera recording of the smart device screen is available as evidence. In such cases, analysts must annotate the recordings manually. To reduce the need for manual annotation of scene occurrences, we present an image processing method for segmenting user interface video recordings. The method exploits specific properties of user interface recordings, which differ greatly from real-world video shots (for which many segmentation methods exist). The core of our method is the application of the SSIM and SIFT similarity metrics to video frames, combined with several pre-processing and filtering procedures. The main advantage of our method is that it requires no training data apart from a single example screenshot for each scene, to which the recording frames are compared. The method is also able to cope with user finger overlays, which are always present in mobile device recordings. We evaluate the accuracy of our method on recordings from several real-life studies and compare it with other image similarity techniques.
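The idea of comparing each recording frame against one reference screenshot per scene can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a simplified single-window (global) SSIM instead of the usual windowed variant, omits SIFT and all pre-processing and filtering steps, and treats frames as flat lists of grayscale values. The function names `global_ssim` and `segment_scenes`, the threshold of 0.8, and the toy four-pixel "screens" are assumptions made for illustration only.

```python
def global_ssim(a, b, dynamic_range=255.0):
    """Simplified SSIM computed over the whole image as a single window.

    `a` and `b` are equally sized grayscale images given as flat lists.
    """
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and equally sized")
    n = len(a)
    # Stabilising constants with the commonly used K1 = 0.01, K2 = 0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def segment_scenes(frames, references, threshold=0.8):
    """Label each frame with its best-matching scene and merge runs.

    Returns a list of (scene_name, first_frame, last_frame) tuples.
    Frames that match no reference above `threshold` are labelled None
    (e.g. transitions between scenes).
    """
    occurrences = []
    for index, frame in enumerate(frames):
        best_name, best_score = None, threshold
        for name, screenshot in references.items():
            score = global_ssim(frame, screenshot)
            if score >= best_score:
                best_name, best_score = name, score
        if occurrences and occurrences[-1][0] == best_name:
            # Same label as the previous frame: extend the current occurrence.
            name, start, _ = occurrences[-1]
            occurrences[-1] = (name, start, index)
        else:
            occurrences.append((best_name, index, index))
    return occurrences


# Toy data (hypothetical): two scenes and a five-frame "recording".
references = {
    "home": [0, 0, 255, 255],
    "menu": [255, 0, 255, 0],
}
frames = [
    [0, 0, 255, 255],      # exact "home" screen
    [5, 0, 250, 255],      # "home" with slight camera noise
    [128, 128, 128, 128],  # uniform transition frame
    [255, 0, 255, 0],      # "menu" screen
    [255, 0, 255, 0],
]
occurrences = segment_scenes(frames, references)
print(occurrences)
```

On this toy input the two "home" frames merge into one occurrence, the uniform frame stays unlabelled, and the two "menu" frames form a second occurrence. A real recording would additionally need the robustness to finger overlays and the filtering that the abstract mentions, which this sketch does not attempt.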


Image processing, Video segmentation, Scenes, User interface, User experience studies, Eye-tracking



This work was partially supported by the Scientific Grant Agency of the Slovak Republic, grant No. VG 1/0646/15, and by the Slovak Research and Development Agency under contract No. APVV-15-0508, and was created with the support of the Ministry of Education, Science, Research and Sport of the Slovak Republic within the Research and Development Operational Programme for the project “University Science Park of STU Bratislava”, ITMS 26240220084, co-funded by the ERDF.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Slovak University of Technology in Bratislava, Bratislava, Slovakia