
2003: Designing, Playing, and Performing with a Vision-Based Mouth Interface

Chapter in A NIME Reader

Part of the book series: Current Research in Systematic Musicology (CRSM, volume 3)

Abstract

The role of the face and mouth in speech production as well as non-verbal communication suggests the use of facial action to control musical sound. Here we document work on the Mouthesizer, a system which uses a headworn miniature camera and computer vision algorithm to extract shape parameters from the mouth opening and output these as MIDI control changes. We report our experiences with various gesture-to-sound mappings and musical applications, and describe a live performance which used the Mouthesizer interface.
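
The abstract summarizes the pipeline: camera frames are reduced to mouth-shape parameters, which are then sent out as MIDI control changes. As a minimal, purely illustrative sketch of that final mapping step (not the authors' implementation), assuming the mido library, its default output port, and an arbitrary controller number:

```python
# Illustrative sketch only: send a normalized mouth-opening parameter as a MIDI
# control change. The library (mido), the default output port, and the
# controller number are assumptions, not details from the original Mouthesizer.
import mido

CC_NUMBER = 1                    # hypothetical controller, e.g. modulation wheel
outport = mido.open_output()     # default MIDI output port

def send_mouth_cc(mouth_openness: float) -> None:
    """Map a mouth-openness value in [0.0, 1.0] to a 7-bit CC value and send it."""
    value = max(0, min(127, int(round(mouth_openness * 127))))
    outport.send(mido.Message('control_change', control=CC_NUMBER, value=value))

# Example: a half-open mouth maps to roughly the middle of the CC range.
send_mouth_cc(0.5)
```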

Notes

  1. http://www.davidrokeby.com/vns.html

  2. http://steim.org/2012/01/bigeye-1-1-4/

  3. https://www.google.com/atap/project-soli/

References

  • Behne, K.-E., & Woellner, C. (2011). Seeing or hearing the pianists? A synopsis of an early audiovisual perception experiment and a replication. Musicae Scientiae, 15(3), 324–342.

  • Camurri, A., Hashimoto, S., Ricchetti, M., Ricci, A., Suzuki, K., Trocca, R., et al. (2000). EyesWeb: Towards gesture and affect recognition in dance/music interactive systems. Computer Music Journal, 24(1), 57–69.

  • Card, S. K., Mackinlay, J. D., & Robertson, G. G. (1991). A morphological analysis of the design space of input devices. ACM Transactions on Information Systems, 9(2), 99–122.

  • Chan, C.-H., & Lyons, M. J. (2008). Mouthbrush: A multimodal interface for sketching and painting. International Journal of Computer Science, 1(1), 40–57.

  • Cook, P. R., & Scavone, G. (1999). The Synthesis ToolKit (STK). In Proceedings of the International Computer Music Conference (pp. 164–166).

  • Dobrian, C., & Koppelman, D. (2006). The E in NIME: Musical expression with new computer interfaces. In Proceedings of the International Conference on New Interfaces for Musical Expression, Paris, France.

  • Duda, R. O., Hart, P. E., et al. (1973). Pattern Classification and Scene Analysis. New York: Wiley.

  • Dudley, H. (1939). Remaking speech. The Journal of the Acoustical Society of America, 11(2), 169–177.

  • Fels, S., & Mase, K. (1999). Iamascope: A graphical musical instrument. Computers & Graphics, 23(2), 277–286.

  • Funk, M., Kuwabara, K., & Lyons, M. J. (2005). Sonification of facial actions for musical expression. In Proceedings of the International Conference on New Interfaces for Musical Expression (pp. 127–131), Vancouver, Canada.

  • Gillespie, B. (1999). Haptic manipulation. In P. Cook (Ed.), Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics (pp. 247–260). Cambridge, MA: MIT Press.

  • Hunt, A., Wanderley, M. M., & Paradis, M. (2003). The importance of parameter mapping in electronic instrument design. Journal of New Music Research, 32(4), 429–440.

  • Lyons, M. J. (2004). Facial gesture interfaces for expression and communication. In IEEE International Conference on Systems, Man and Cybernetics (Vol. 1, pp. 598–603). IEEE.

  • Lyons, M. J., & Tetsutani, N. (2001). Facing the music: A facial action controlled musical interface. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI) (pp. 309–310). ACM.

  • Lyons, M. J., Campbell, R., Plante, A., Coleman, M., Kamachi, M., & Akamatsu, S. (2000). The Noh mask effect: Vertical viewpoint dependence of facial expression perception. Proceedings of the Royal Society of London B: Biological Sciences, 267, 2239–2245.

  • Lyons, M. J., Haehnel, M., & Tetsutani, N. (2003). Designing, playing, and performing with a vision-based mouth interface. In Proceedings of the International Conference on New Interfaces for Musical Expression (pp. 116–121), Montreal, Canada.

  • Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12), 1357–1362.

  • McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746–748.

  • Morikawa, C., & Lyons, M. J. (2013). Design and evaluation of vision-based head and face tracking interfaces for assistive input. In Assistive Technologies and Computer Access for Motor Disabilities (pp. 180–205). IGI Global.

  • Ng, K. (2002). Interactive gesture music performance interface. In Proceedings of the International Conference on New Interfaces for Musical Expression, Dublin, Ireland.

  • Orio, N. (1997). A gesture interface controlled by the oral cavity. In Proceedings of the International Computer Music Conference (pp. 141–144), Thessaloniki, Greece.

  • Pantic, M., & Rothkrantz, L. J. (2000). Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1424–1445.

  • Paradiso, J., & Sparacino, F. (1997). Optical tracking for music and dance performance. In Optical 3-D Measurement Techniques IV (pp. 11–18).

  • Penfield, W., & Rasmussen, T. (1950). The Cerebral Cortex of Man: A Clinical Study of Localization of Function. New York: Macmillan.

  • Poepel, C., Feitsch, J., Strobel, M., & Geiger, C. (2014). Design and evaluation of a gesture controlled singing voice installation. In Proceedings of the International Conference on New Interfaces for Musical Expression (pp. 359–362), London, UK.

  • Poupyrev, I., Berry, R., Kurumisawa, J., Nakao, K., Billinghurst, M., Airola, C., et al. (2000). Augmented groove: Collaborative jamming in augmented reality. In ACM SIGGRAPH Conference Abstracts and Applications (p. 77).

  • Rizzolatti, G., Fadiga, L., Gallese, V., & Fogassi, L. (1996). Premotor cortex and the recognition of motor actions. Cognitive Brain Research, 3(2), 131–141.

  • de Silva, G. C., Smyth, T., & Lyons, M. J. (2004). A novel face-tracking mouth controller and its application to interacting with bioacoustic models. In Proceedings of the International Conference on New Interfaces for Musical Expression (pp. 169–172), Hamamatsu, Japan.

  • Tarabella, L. (2000). Gestural and visual approaches to performance. In Trends in Gestural Control of Music (pp. 604–615). Paris: IRCAM.

  • Varela, F. J., Thompson, E., & Rosch, E. (1992). The Embodied Mind. Cambridge, MA: MIT Press.

  • Vogt, F., McCaig, G., Ali, M. A., & Fels, S. (2002). Tongue’n’groove: An ultrasound based music controller. In Proceedings of the International Conference on New Interfaces for Musical Expression (pp. 60–64), Dublin, Ireland.

Acknowledgements

Thank you to Palle Dahlstedt, Sidney Fels, Steven Jones, Axel Mulder, Ivan Poupyrev, Ichiro Umata, Jordan Wynnychuk, and Tomoko Yonezawa for stimulating and helpful interactions.

Author information

Corresponding author

Correspondence to Michael J. Lyons.

Appendices

Author Commentary: Tales of the Mouthesizer—Facial Movements and Musical Expression

Michael J. Lyons

While working on automatic face recognition in the mid-nineties, I became interested in exploring facial movements in the context of real-time human-computer interaction. This was partly in reaction to the dominant tendency of face recognition researchers to posit artificial agency as a primary research goal, implicitly defining the ideal ‘user’ as a passive object surveilled by a machine. I was also influenced by the then-emerging activity in embodied human-computer interfaces, which challenged the keyboard-and-mouse interaction style. A proposal to explore facial gesture interfaces was funded by the Annenberg Center at the University of Southern California. We recorded actors’ facial movements, detected and tracked facial features, and coded and analyzed local texture displacements using a biologically inspired multi-scale Gabor filter representation. An unexpected and exciting discovery on facial expression representation led me to take a multi-year detour away from intentional gesture interfaces.

By mid-2000 I was again acutely feeling the limitations of the artificial agency paradigm, tacit also in most facial expression research, and I resumed work on the facial gesture interface project. Hoping to explore seamless, real-time, and actively expressive interaction, I combined the project with my long-term interest in electronic music and began to develop a musical interface based on facial movements. I gradually reduced the complexity of our facial expression system, finally obtaining acceptable latency with a version which required the user to visually register their face with a virtual frame. The system was further simplified by analyzing just the shadow area of the open mouth, a fairly robust feature directly influenced by mouth movements. Adding a head-worn camera eliminated the need for active face registration. The Mouthesizer was born and, soon afterwards, was demonstrated as a guitar and keyboard effects controller at the annual ATR Labs Open House.
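
As a rough illustration of the "shadow area" idea (not the original code), the dark open-mouth region could be segmented from a head-worn camera frame with standard OpenCV operations; the threshold value, and the assumption that the frame is already dominated by the mouth region, are guesses:

```python
# Rough sketch of the "shadow area" feature: threshold the dark open-mouth
# region and measure its area and bounding box. Assumes OpenCV (>= 4) and a
# frame framed tightly around the mouth; the threshold is an arbitrary
# illustration, not the original Mouthesizer setting.
import cv2
import numpy as np

def mouth_shape_parameters(frame_bgr: np.ndarray, dark_thresh: int = 50):
    """Return (area, width, height) of the largest dark blob, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # The open mouth appears as a dark "shadow": keep pixels below the threshold.
    _, mask = cv2.threshold(gray, dark_thresh, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    blob = max(contours, key=cv2.contourArea)   # largest dark blob ~ mouth opening
    _, _, w, h = cv2.boundingRect(blob)
    return cv2.contourArea(blob), w, h
```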

A collaboration with artist Jordan Wynnychuk led to more complex mappings and the first public performance with the Mouthesizer. The experience encouraged us to continue the project. Jordan developed a new hardware prototype, the aluminum half-mask used in the NIME 2004 club performance that Cornelius Poepel mentions in his commentary. A hand-held version was developed for a musical video game. Concurrently, I explored facial gesture interfaces in other contexts: text entry, augmented digital painting, and a hands-free keyboard and cursor control system (Chan and Lyons 2008; Lyons 2004; Morikawa and Lyons 2013). The studies were not only exploratory but also involved measurements of controllability using custom evaluation tasks. To our surprise, single-parameter control (mouth area) proved possible with a signal-to-noise ratio exceeding 60 dB (Chan and Lyons 2008; Morikawa and Lyons 2013)!
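
The cited papers describe the actual evaluation tasks; purely as an illustration of what such a figure can mean, one common way to compute a tracking signal-to-noise ratio in dB is sketched below, with hypothetical array names and no claim that this matches the published procedure:

```python
# Illustration only: express controllability of a tracked parameter as a
# signal-to-noise ratio in dB. Array names are hypothetical; the actual
# evaluation procedures are those of Chan and Lyons (2008) and
# Morikawa and Lyons (2013).
import numpy as np

def tracking_snr_db(target: np.ndarray, response: np.ndarray) -> float:
    """Treat the target trajectory as signal and the tracking error as noise."""
    error = response - target
    return 10.0 * np.log10(np.mean(target ** 2) / np.mean(error ** 2))
```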

We returned to expressive audio-visual play with a study of a mouth-activated bio-acoustic simulation (Silva et al. 2004), which allowed a player to use mouth movements to sing like a bird, specifically a crow. Visiting research student Mathias Funk joined the facial gesture interface project, and we collaborated on a biologically inspired system that combined a face detector, used to periodically saccade to the face, with optical flow estimation, allowing one to play a sampler by moving various facial features (Funk et al. 2005). The work was demonstrated at NIME 2005, used in live dance performances, and used for special effects in a technology-augmented theatre production.
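
As a loose sketch of the "periodic saccade plus optical flow" idea (an approximation, not the system of Funk et al. 2005), standard OpenCV building blocks could be combined along these lines; the cascade file, timing, and the mapping from flow to control values are all assumptions:

```python
# Loose sketch of "periodic saccade + optical flow": a face detector relocates
# the face every few frames, and dense optical flow inside that region yields a
# motion value that could drive a sampler. This only approximates Funk et al.
# (2005); cascade file, parameters, and flow-to-control mapping are assumptions.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def run(camera_index: int = 0, saccade_every: int = 30):
    cap = cv2.VideoCapture(camera_index)
    prev_gray, face_box, frame_count = None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if frame_count % saccade_every == 0 or face_box is None:
            faces = face_cascade.detectMultiScale(gray, 1.3, 5)  # "saccade" to the face
            if len(faces) > 0:
                face_box = faces[0]
        if face_box is not None and prev_gray is not None:
            x, y, w, h = face_box
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray[y:y + h, x:x + w], gray[y:y + h, x:x + w],
                None, 0.5, 3, 15, 3, 5, 1.2, 0)
            motion_energy = float(np.mean(np.linalg.norm(flow, axis=2)))
            # motion_energy would then be mapped to sampler parameters (not shown).
        prev_gray, frame_count = gray, frame_count + 1
    cap.release()
```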

As sensor technologies and machine learning advance, we can expect to see powerful new approaches to gauging facial movements non-invasively. For example, the recent radar sensors from Project Soli (see Note 3) may be suitable for use as a facial gesture interface. Close-range depth imaging also seems promising. Machine learning should be useful for leveraging the existing expertise involved in facial expression and speech production, and should lead to intriguing new approaches to expressive performance, just as the Mouthesizer allowed us to map mouth movements to acoustic and emotionally expressive effects. The complex facial sensory-motor anatomy offers a still largely unexplored territory for scientific and artistic experiments in embodied musical expression.

Expert Commentary: Musical Control and Musical Expression

Cornelius Poepel

As a musician, I focus on the question of musical expression. I have long been interested in expanding artistic expressivity through the use of technology and computation. During my visits to several NIME conferences, the concerts were especially important to me, particularly those performances where I could hear and see in action the instruments that had been described in papers.

I remember listening to a NIME club night concert in 2004 at Hamamatsu. A performer was playing the Mouthesizer (Lyons et al. 2003). A camera tracking the mouth was mounted inside an aluminum case which the performer had fixed in front of his mouth. A video screen displayed one window showing the video-tracked mouth, another showing an Ableton Live set, and a third showing artificial visual objects. In my subjective perception the whole setup resembled a mixture of Star Wars’ Darth Vader, C-3PO, and Luke Skywalker. I loved watching this performance, following the varying shapes of the mouth while listening to the linked variations in the resulting sound.

The face and facial gestures undeniably play a central role in communication. Music psychologists have shown that visual factors can play an important role in the perceived quality of musical expression (Behne and Woellner 2011). One may say that perceived musical expression is created not only by the acoustic outcome but by a mixture of acoustic and other (e.g. visual) stimuli.

Thus, the idea of using the expressivity we know from facial gestures has powerful potential. The question is how this expressivity can be mapped to music, that is, to the input parameters of a synthesis engine or an effects unit.

Lyons et al. (2003) blazed a trail with this work, and a decade later papers still cite it. The broad analysis of related research presented in their paper, the detailed explanation of how their idea was implemented, and the incorporation of experiences from users and a live performance laid a foundation that was, and remains, valuable many years later.

From my point of view, how best to use the expressive power of facial gestures for musical expression is still an open question. I was involved in the development of a singing voice synthesis installation that incorporated a mouth-tracking system to control a voice synthesizer (Poepel et al. 2014). Since the facial gestures drove a voice synthesizer, the resulting sounds corresponded to human expectations.

In comparison, the Mouthesizer as I saw it at the NIME 2004 club concert did not allow the complete facial expression to be seen. The nose, mouth, cheeks, and chin were covered by the aluminum case mounted on the performer’s face. Thus the visual part of the facial expressivity was reduced to what one could see on the video screen.

Dobrian and Koppelman have presented their thoughts and findings on the ‘E’ in NIME (Dobrian and Koppelman 2006). They are convinced that control should not be equated with expression, and they distinguish between control of a tool or medium and the expression a performer puts into the control mechanism of a sound generator. How can this conviction be of use in constructing a vision-based mouth interface?

What kinds of expression come to mind when I imagine faces? I see expressions like sadness, happiness, anger, fear, and effort. In the singing voice installation (Poepel et al. 2014) it was possible to play with facial gestures addressing these expressions, since the mouth tracking was used to generate the vowels. Pitch, for example, was controlled by the arms. Thus the facial gestures had a degree of freedom for visual expression that was not directly coupled to the sound.

Thanks to Lyons et al., the exciting research field of facial gestures and musical expression has been opened. The map still has several blank regions for further exploration. Consider the following two possibilities. The first is to explore how already known forms of facial expression can be coupled to a music generator so that it produces music corresponding to their emotional and communicative content. The second is the question of how much facial movement should not control the sound at all, but should instead be left free to communicate with the audience via facial expression, in order to enhance the audio-visually perceived musical expression (cf. Behne and Woellner 2011). This includes the question of which parts of the face should be kept visible when performing with a vision-based mouth interface.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Lyons, M.J., Hähnel, M., Tetsutani, N. (2017). 2003: Designing, Playing, and Performing with a Vision-Based Mouth Interface. In: Jensenius, A., Lyons, M. (eds) A NIME Reader. Current Research in Systematic Musicology, vol 3. Springer, Cham. https://doi.org/10.1007/978-3-319-47214-0_8

  • DOI: https://doi.org/10.1007/978-3-319-47214-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47213-3

  • Online ISBN: 978-3-319-47214-0

  • eBook Packages: Engineering, Engineering (R0)
