1 Introduction

The usage of touch-sensitive interfaces has increased rapidly over the last ten years, partially due to many successful applications for smartphones and tablets. Another reason is the enhanced interaction capabilities of touchscreens in comparison with the mouse. For example, their multi-touch capability allows the device to recognize more than one point of contact. Gesture-based communication can be realized easily using touchscreens. Additional interface elements, such as buttons, knobs and sliders, can be individually arranged depending on the application. These aspects make devices with touch-sensitive surfaces very interesting for music-based applications. Virtual musical instruments as well as audio mixing and music composition applications benefit strongly from this trend. There are various apps which try to simulate existing musical instruments or to create new music experiences (Fig. 12.1).

Fig. 12.1

Digital touch instrument apps: a piano, b drum and c liveloops from the GarageBand (http://www.apple.com/ios/garageband/, last accessed on 25 Nov 2017) DAW, d sound objects (https://itunes.apple.com/us/app/sound-objects/id656640735?mt=8, last accessed on 25 Nov 2017)

Wanderley and Battier [1] described the importance of gestures and their recognition for music performance. Choi categorized gestural primitives as trajectory-based primitives, force-based primitives and pattern-based primitives. Several of these primitives can be recognized using touch-sensitive interfaces [2].

Several table-based interfaces for musical applications have been developed recently: the Reactable (Rotor), Akustich, Bricktable, Surface Music, Sound Storm or ToCoPlay [3,4,5,6]. Most of these devices use a tangible interface, where the player controls the system by means of real objects. Musical applications running on touchscreen devices such as smartphones and tablets followed this trend. However, not only gesture recognition but also haptic feedback plays an important role in the success of such applications. The lack of haptic feedback in touchscreen-based devices strongly limits the capabilities of the system, and the design of musical applications calls for the addition of advanced haptic feedback [7, 8]. For audio mixing, music composition applications and musical performances, touchscreen systems with haptic feedback are therefore very promising.

Several technical solutions have been developed to integrate haptic feedback into touchscreen devices. Various types of low-cost and compact actuators with different characteristics are currently used in consumer electronics [9]. In recent years, electrostatic and ultrasonic technologies have been researched for use in haptic interfaces. On touchscreens using electrostatic technology, a voltage applied to the screen induces an electrostatic force on the fingertip, which modulates the friction experienced as the finger moves over the touch surface [10, 11]. Various systems exist based on ultrasonic technology, either mid-air (no direct contact with the surface) [12, 13] or touch interfaces [14,15,16]. The latter employ ultrasonic vibrations to create a squeeze film of air between the vibrating surface and the fingertip, thus modulating the surface’s friction. Focused ultrasound is capable of inducing tactile, thermal and tickling sensations [17, 18]. Neither electrostatic nor ultrasonic technology uses any moving mechanical parts.

Over the last few years, the authors have conducted several investigations with touchscreen-based devices to understand and improve the capabilities of such systems for musical applications [19,20,21,22,23,24]. In this chapter, various aspects of these investigations are summarized, extended and discussed. In particular, musical interactions with touchscreens require consideration of both auditory and haptic perception. In most cases, the haptic feedback is generated by means of the audio signal; therefore, the interaction of the two modalities is an important issue. This chapter aims to illustrate some fundamental aspects of haptic and audio feedback for touchscreen-based musical applications and to introduce the benefits of audio–tactile interaction.

2 Perceptual Aspects of Auditory and Haptic Modalities for Musical Touchscreen Applications

Playing a musical instrument is a complex task, and optimized multisensory stimuli may be useful, e.g. by supporting spatial and temporal accuracy. Sound and vibration are physically coupled while playing a musical instrument or listening to music live or through loudspeakers. Knowledge of auditory and haptic psychophysics is therefore necessary for designers of multimodal interfaces to develop high-quality devices. In this section, the perception of intensity, frequency and temporal aspects is discussed with respect to its importance to musical applications.

2.1 Intensity

The dynamic ranges of auditory and tactile perception differ greatly. While the perceivable dynamic range for hearing is approximately 130 dB, tactile perception can only discriminate a dynamic range of about 50 dB. The just-noticeable differences (JNDs) in level for both modalities are about 1 dB. In music applications, such dynamic range differences should be taken into account, especially if haptic feedback is produced from audio signals: The perceived vibration magnitude might rise rapidly from imperceptible to strong if vibrations are generated from an audio signal with a wide dynamic range. Therefore, it might be advantageous to apply dynamic compression [21].
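
As an illustration, the sketch below applies simple feed-forward compression to an audio-derived vibration signal before playback. All parameter values (threshold, ratio, time constants) are assumptions for illustration, not values from the cited study [21]:

```python
import numpy as np

def compress_for_tactile(x, fs, threshold_db=-40.0, ratio=4.0,
                         attack_ms=5.0, release_ms=50.0):
    """Feed-forward dynamic range compressor for audio-driven vibration
    signals. All parameter values are illustrative assumptions."""
    x = np.asarray(x, dtype=float)
    # One-pole envelope follower with separate attack/release constants.
    a_att = np.exp(-1.0 / (fs * attack_ms / 1000.0))
    a_rel = np.exp(-1.0 / (fs * release_ms / 1000.0))
    env = np.empty_like(x)
    level = 0.0
    for i, s in enumerate(np.abs(x)):
        a = a_att if s > level else a_rel
        level = a * level + (1.0 - a) * s
        env[i] = level
    env_db = 20.0 * np.log10(np.maximum(env, 1e-9))
    # Above the threshold, reduce gain so the output level grows at
    # 1/ratio dB per dB of input level; below the threshold, gain is 0 dB.
    gain_db = np.minimum(0.0, (threshold_db - env_db) * (1.0 - 1.0 / ratio))
    return x * 10.0 ** (gain_db / 20.0)
```

Compressing in the dB domain maps the wide auditory dynamic range onto the narrower tactile one without flattening the temporal envelope entirely.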

Intensity perception also behaves differently in the two modalities. At 1 kHz, an increase of 10 dB in sound pressure level roughly doubles the perceived loudness. At 250 Hz, an increase of 4–8 dB in vibration level roughly doubles the perceived vibration intensity. In Fig. 12.2, the perceived intensity growth functions of the auditory and tactile modalities are compared at the same frequency (250 Hz): The rate of growth for the tactile modality is higher than for the auditory modality.

Fig. 12.2

Growth of perceived magnitude as a function of sensation level for acoustical and vibratory stimuli at 250 Hz [19, 21, 22]
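
These doubling rules can be cast as Stevens-type power functions. A short derivation, assuming a power-law relation between perceived magnitude $\psi$ and stimulus level $L$ in dB (the power-law form is an assumed model; the doubling figures are those given above):

$$\psi \propto 10^{\alpha L/20}, \qquad 2 = 10^{\alpha \Delta L/20} \;\Rightarrow\; \alpha = \frac{20\log_{10}2}{\Delta L} \approx \frac{6\ \mathrm{dB}}{\Delta L}.$$

With $\Delta L = 10$ dB for loudness, this yields $\alpha \approx 0.6$, whereas $\Delta L = 4$–$8$ dB for vibration yields $\alpha \approx 0.75$–$1.5$, consistent with the steeper tactile growth function in Fig. 12.2.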

2.2 Frequency

In most musical applications, the frequency spectra of auditory and vibrotactile cues are coupled to each other by physical laws. Such frequency coupling plays an important role in how humans integrate auditory and tactile information [19].

Sounds audible to the human ear fall in the frequency range of about 20–20,000 Hz, with highest sensitivity between 500 and 4000 Hz. Just-noticeable frequency differences (JNFDs) for the auditory system were reported by Zwicker and Fastl [25]. They found that, at frequencies below 500 Hz, humans are able to differentiate between two tone bursts with a frequency difference of only about 1 Hz, and that this value increases with frequency. Above 500 Hz, the JNFD is approximately 0.002 times the frequency.
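
A simplified piecewise reading of these values:

$$\Delta f_{\mathrm{JND}}(f) \approx \begin{cases} 1\ \mathrm{Hz}, & f < 500\ \mathrm{Hz},\\ 0.002\,f, & f \geq 500\ \mathrm{Hz}, \end{cases}$$

so that, for example, the auditory JNFD at 2 kHz is about 4 Hz.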

The frequency range of auditory perception is much wider than that of tactile perception: The skin is sensitive to frequencies between 1 and 1000 Hz, with highest sensitivity in the range of 200–300 Hz. JNFDs for sinusoidal vibrations and tactile pulses on the finger and volar forearm were measured by different researchers [25,26,27]. The values for the Weber fraction (difference threshold divided by stimulus intensity) range from 0.07 to 0.2. Frequency discrimination of the tactile channel is fairly good at low frequencies but deteriorates rapidly as frequency increases [25].

Overall, these results indicate that the skin is rather poor at discriminating frequency in comparison with the ear.

2.3 Temporal Acuity and Rhythm Perception

The auditory modality shows extraordinary temporal resolution. As an example, two impulses will be perceived as separate sounds if there is only a 1–2 ms gap between them. Although the temporal acuity of the cutaneous system is not as high as that of the auditory system, individuals can still distinguish an 8–10 ms gap between two tactile sinusoidal bursts [28, 29]. In any case, in comparison with vision, both audition and vibrotaction have very high temporal resolution.

Apart from temporal acuity, the perception of rhythm is an important capability of both modalities. In all cultures, it is common for people to tap or move their hand, foot or other body parts in synchrony with music [30]. The processing of such metric information is only possible through the auditory and tactile/somatosensory channels, but not by means of vision. A study by Brochard and colleagues shows that humans can abstract the metric structure from tactile rhythmic sequences as efficiently as from equivalent auditory patterns [31]. This ability is independent of musical expertise. Various scientists assume that the early-developing relationship between the auditory modality and movement-related sensory inputs is maintained into adulthood [32]. The results of Bresciani et al. [33] show that the visual modality alone plays a minor role in feeling the contact with objects, at least when the tactile and auditory modalities are available.

2.4 Synchrony

Temporal correlation is an important cue for the brain to integrate multiple sensory inputs generated by a single event, as well as to differentiate inputs related to separate events occurring at the same time. However, the synchronization of different modalities in multimedia applications is a major issue, due to technical constraints such as data transfer time, computer processing time and delays that occur during feedback generation processes. As the asynchrony between different modalities increases, the sense of presence and realism of multimedia applications decrease.

Several results are available on audio–tactile asynchrony perception [34, 35], indicating that, in order to preserve a unitary percept, the temporal discrepancy between the auditory and tactile modalities must stay within 25 ms for various multimedia systems. However, for the purpose of the discussion in this chapter, it is necessary to consider the literature focusing on touchscreens. Kaaresoja measured the tolerable multimodal latency in mobile touchscreen virtual button interaction, showing that tactile feedback latency should not exceed 25 ms and audio feedback latency should not exceed 100 ms [36]. Unfortunately, most current mobile phones and tablets cannot meet these latency requirements. Such latency issues have a negative effect on the quality of musical interaction. Therefore, the progress of multimodal technology with respect to synchrony and latency will play an important role in the success of musical touchscreen applications.

3 Experiment 1: Identification of Audio-Driven Tactile Feedback on a Touchscreen

Grooveboxes can be considered a combination of a control surface, a sampler, a music sequencer and a drum computer. They are widely used for the production of various kinds of loop-based music styles, such as electro, techno and hip hop, especially in live concerts. Touchscreen-based grooveboxes may enable the user to redefine the combination, organization and size of the knobs, sliders and buttons [20]. In groovebox applications, the ability to identify and discriminate the available musical loops is crucial for the user. A series of four experiments (referred to as 1a–d) was set up, whereby tactile feedback was generated from audio signals based on four different approaches. Tactile signal parameters were systematically varied according to the perceptual characteristics discussed in Sect. 12.2. The objective was to test which tactile feedback processing strategies helped distinguish audio loops. Furthermore, the attractiveness of the system, including pragmatic and hedonic qualities, was evaluated.

3.1 Stimuli

The main discriminant acoustic features of musical instruments are the frequency and amplitude structure, and temporal envelope of the produced tones. Most percussive instruments are unpitched (e.g. the snare), while others excite auditory pitch perception (e.g. the kettledrum). Features such as melody, rhythm and dynamics must be processed to some extent to generate a suitable vibrotactile signal from the acoustical signal. To this end, various strategies have been applied in the experiments reported in this chapter, similar to what is described in Sect. 7.3.

The simplest way to generate tactile feedback from acoustic signals is low-pass filtering, as done in experiments 1a and 1d with the cut-off frequency set to 1 kHz. As discussed above, auditory and tactile signals have strong similarities in the frequency domain; however, the tactile system is not sensitive to frequencies above 1 kHz.
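
A minimal sketch of this strategy, assuming a Butterworth design (the chapter specifies only the 1 kHz cut-off):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def audio_to_tactile_lowpass(audio, fs, cutoff_hz=1000.0, order=4):
    """Derive a vibrotactile signal by low-pass filtering the audio loop
    at the upper limit of tactile sensitivity (experiments 1a and 1d)."""
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    # Zero-phase filtering keeps the tactile signal time-aligned with
    # the audio signal, preserving the temporal envelope cues.
    return sosfiltfilt(sos, np.asarray(audio, dtype=float))
```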

Experiment 1b investigated a frequency-shift strategy to generate vibrotactile feedback from the original audio signal. Assuming that good integration between auditory and tactile information occurs when the acoustical frequency is a harmonic of the vibration frequency, the spectrum of the audio signal was shifted down one octave by means of a granular synthesis technique. While this preserved accurate timing, the processing introduced some unwanted artefacts. However, such artefacts arise especially at higher frequencies, mostly above the range of tactile perception (see Sect. 4.2).
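
The chapter used a granular synthesis technique for the shift; as a stand-in, the sketch below uses librosa's phase-vocoder-based pitch shifting, which likewise moves the spectrum one octave down, followed by the 1 kHz low-pass used in experiment 1b to suppress processing artefacts:

```python
import librosa
from scipy.signal import butter, sosfiltfilt

def audio_to_tactile_octave_down(audio, fs):
    """Shift the audio spectrum one octave down (-12 semitones), then
    low-pass at 1 kHz to suppress high-frequency processing artefacts."""
    shifted = librosa.effects.pitch_shift(y=audio, sr=fs, n_steps=-12)
    sos = butter(4, 1000.0, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, shifted)
```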

In experiment 1c, beat information was extracted from the audio loops by looking for fast attack transients in the amplitude envelope. The detected beats triggered sinusoidal pulses at 100 Hz lasting 80 ms, which are easily perceived.
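
A minimal sketch of this substitution strategy; the onset detector, its settings and the pulse fade-out are assumptions (the chapter specifies only the 100 Hz, 80 ms pulses):

```python
import numpy as np
import librosa

def audio_to_tactile_beats(audio, fs, pulse_hz=100.0, pulse_ms=80.0):
    """Replace each detected attack transient with a short sinusoidal
    pulse, preserving the rhythmic sequence but not the source timbre."""
    onsets = librosa.onset.onset_detect(y=audio, sr=fs, units="samples")
    out = np.zeros(len(audio))
    n = int(fs * pulse_ms / 1000.0)
    t = np.arange(n) / fs
    # 100 Hz burst with a Hann fade-out so the pulse decays smoothly.
    pulse = np.sin(2.0 * np.pi * pulse_hz * t) * np.hanning(2 * n)[n:]
    for s in onsets:
        end = min(s + n, len(out))
        out[s:end] += pulse[: end - s]
    return out
```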

3.2 Set-up

An Apple iPod Touch was used as the touch-sensitive input device, while tactile feedback was delivered by an electrodynamic exciter (Monacor BR-25) mounted at the back of the iPod (see Fig. 12.3). Its touchscreen surface was divided into six virtual buttons, each of which corresponded to a specific audio loop. When the participant pressed a button, tactile feedback for the respective channel was rendered in real time using Pure Data, while the audio signals were reproduced over closed-back reference headphones (Sennheiser HDA 200). The headphones offer effective sound isolation and therefore masked the background noise generated by the tactile system. The task was to associate each vibrating button with the corresponding audio signal.

Fig. 12.3

Touchscreen device was mounted on an electrodynamic shaker for vibration reproduction

3.3 Subjects

Twenty subjects, sixteen male and four female, aged between 20 and 40 years, participated voluntarily in the experiment. They had no background in acoustics. All subjects were right-handed and had self-reported normal hearing.

3.4 Results and Discussion

In this section, the results of the identification investigations for different signal processing strategies are summarized.

3.4.1 Low-Pass Filtering

In experiment 1a, the six vibrotactile stimuli were generated by low-pass filtering the audio loops at 1 kHz.

The percentage of correct responses for each stimulus is shown in Fig. 12.4a. Subjects could correctly identify most of the instruments. Errors were particularly low for percussion instruments that generate mainly higher frequencies: The percentage of correct responses for the snare, hi-hat and tambourine was higher than 80%. The participants reported that temporal envelope and frequency content were important cues.

Fig. 12.4

Results of the identification experiment for different percussive instruments (audio loops). The vibration signals were generated by processing the audio signal via a low-pass filtering with cut-off at 1 kHz and b pitch shifting one octave down

3.4.2 Pitch Shifting

In experiment 1b, the vibration signals were generated by shifting the spectra of the audio loops down by one octave. The resulting signals were low-pass filtered at 1 kHz to remove high-frequency artefacts caused by the processing.

The percentage of correct responses for the six stimuli is shown in Fig. 12.4b. Compared to simple low-pass filtering, octave shifting improved the identification of the loops. Indeed, pitch shifting made important components of the original sounds perceivable through the tactile sense. For instance, the attack of the kick drum contains relevant content at frequencies above 1 kHz. The kick drum and shaker could be identified better than in the low-pass filtering condition, but there were slightly more confusions between the hi-hat and snare, perhaps because the hi-hat was perceived as more intense than before, since its dominant high-frequency energy was shifted towards lower frequencies. However, it is unclear whether features of the sequence (e.g. rhythm), features of the source (e.g. frequency content) or both influenced the results; therefore, experiments 1c and 1d focused on separating sequence and source features.

3.4.3 Beat Detection

In experiment 1c, the individual loops were analysed and their beats were detected, which in turn triggered artificial vibration signals. Thus, source features such as frequency content were not conveyed by the vibration signal, while the original rhythmic sequence was preserved.

Results are shown in Fig. 12.5a. While rhythm is an important factor for loop identification, the overall detection rate decreased, showing that other features of musical signals also play an important role.

Fig. 12.5

Identification results for different instruments. The vibration signals were generated using a sequence features (beat detection and signal substitution) and b source features (low-passed percussive hits)

3.4.4 Single Hits

In experiment 1d, rhythm (sequence) information was removed to test whether a percussion instrument could be identified with only source features; thus, only a single hit was reproduced. Accordingly, the bass line and tambourine loops were removed from the stimulus set, and other percussion sounds (guiro and handclap) with distinct source features were added. The vibration signals were generated by low-pass filtering single hits at 1 kHz.

As seen in Fig. 12.5b, the kick drum and snare were identified with 100% accuracy, possibly due to their characteristic frequency content, which resulted in clearly distinct tactile perceptual qualities. Of the remaining instruments, the guiro had the highest number of correct identifications, perhaps because of its typical rattle-like time structure, which distinguishes it from the bang-like instruments. The high-frequency percussive sounds were not differentiated well. Subsequent experiments revealed that the detection rate improved neither by octave shifting the single hits nor by adding a preliminary training phase.

3.4.5 Summary

The best identification rates were obtained when the source and sequence features were preserved (low-pass filtered or octave-shifted signals). Identification relying on rhythm information alone (beat detection) was time-consuming and varied largely between subjects: The average identification time was approximately 10 s per loop in experiment 1c, while only 6 s were needed in experiments 1a and 1b, and 8 s in experiment 1d.

3.5 Usability and Attractiveness

Before and after the experiments reported above, participants were asked to mix the six audio loops into a 90 s composition using the set-up described in Sect. 12.3.2. Instead of buttons, six faders were used to blend the different audio signals. In the first set, a conventional groovebox without tactile feedback was simulated. In the second set, audio-driven tactile feedback was rendered using the octave shift approach described in Sect. 12.3.4.2. When the finger of the user came in contact with a fader, vibration for the respective channel was rendered.

After completion, participants were asked to judge the usability and attractiveness of the groovebox using the AttrakDiff [37] semantic differential. This method uses pairs of bipolar adjectives to evaluate the pragmatic and hedonic qualities of interactive products. The adjectives, grouped under four categories, and the corresponding across-participant mean ratings are reported in Fig. 12.6. The pragmatic quality was on average rated better without tactile feedback; this was likely due to participants experiencing some difficulty with the audio–tactile association in the prior experiments. The individual ratings for the tactile feedback set-up varied, indicating disagreement between subjects. However, the difference in pragmatic quality is not statistically significant (dependent t test for paired samples, p > 0.05). On average, the hedonic quality was rated better with tactile feedback, especially for the “stimulation” aspect (p < 0.05). The hedonic category “stimulation” refers to the ability of a product to support the user’s personal development. The groovebox with audio-driven tactile feedback was rated as more innovative, captivating and challenging. These results are in agreement with other studies that evaluated multimodal feedback [38]. The overall attractiveness of the groovebox remained the same with or without audio-driven tactile feedback. This result is plausible if attractiveness is understood as composed of the hedonic and pragmatic qualities, each contributing in equal parts to the attractiveness of a product [35].

Fig. 12.6

Mean values of the AttrakDiff semantic differential for seven items on each of the four dimensions: pragmatic quality, hedonic quality (identity), hedonic quality (stimulation) and attractiveness
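
For reference, such a dependent-samples comparison can be run with a standard paired t test. The data below are synthetic placeholders, not the chapter's ratings:

```python
import numpy as np
from scipy.stats import ttest_rel

# Synthetic placeholder ratings: one mean AttrakDiff score per
# participant and condition (the chapter's raw data are not reproduced).
rng = np.random.default_rng(0)
no_tactile = rng.normal(loc=1.0, scale=0.5, size=20)
with_tactile = no_tactile + rng.normal(loc=0.3, scale=0.6, size=20)

t_stat, p_value = ttest_rel(no_tactile, with_tactile)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # significant if p < 0.05
```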

Obviously, the presented data are only valid for the specific exercise and the laboratory conditions described above, while results might change depending on task and context. For example, in a real live set it might be more important to know if a finger is on the correct fader; tactile feedback might also help DJs match beats between different tracks, influencing their pragmatic quality perception. Thus, conclusions should be drawn carefully.

In most touchscreen-based consumer devices, such as mobile phones and tablets, smaller low-fidelity actuators are used instead of the electrodynamic exciter employed in the described experiments. Small actuators have several limitations in terms of the achievable vibration intensity and frequency range. Additionally, they have a slow temporal response in comparison with other technologies, such as voice coil or piezoelectric actuators (see Sect. 13.2 for a review of actuator technology). To overcome such limitations, multimodal interaction is very promising, as it can compensate for what is lacking in one modality with higher fidelity in another channel. In this perspective, a further experiment was conducted to investigate crossmodal intensity interaction between the auditory and tactile channels.

4 Experiment 2: Effect of Loudness on Perceived Tactile Intensity of Virtual Buttons

For several conventional or digital musical instruments, one fundamental interaction is that of pressing a button or a key [39]. Also, interaction with the user interface of DMIs (e.g. a groovebox) or mixing consoles is often mediated by buttons. This experiment aims to investigate the effect of loudness on the perceived intensity of tactile feedback provided by a touchscreen.

4.1 Stimuli

An impulsive waveform representing the feedback produced by a conventional button was selected as the tactile signal. The stimulus amplitude corresponds to the perpendicular displacement of the surface, with positive values indicating movement towards the subject. In order to be compatible with the characteristics of small actuators, a relatively small amplitude was selected: The maximum amplitude of 20 μm occurs at the beginning of the interaction, after which the impulse decays exponentially over 100 ms. As the audio signal, a 400 Hz sinusoid, also lasting 100 ms and likewise decaying exponentially, was selected; its initial (and maximum) sound pressure level could be set to 50, 60 or 70 dB.
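
A sketch of how such stimuli might be synthesized; the sampling rate and the exact decay law are assumptions, while the 20 μm peak displacement, the 100 ms duration and the 400 Hz tone at 50/60/70 dB SPL follow the description above:

```python
import numpy as np

FS = 44100           # sampling rate in Hz (assumption)
DUR = 0.100          # stimulus duration: 100 ms

def decay_envelope(end_ratio=0.001, fs=FS, dur=DUR):
    """Exponential decay from 1.0 down to end_ratio over dur seconds."""
    t = np.arange(int(fs * dur)) / fs
    return np.exp(np.log(end_ratio) * t / dur)

def tactile_pulse(peak_um=20.0):
    """Impulsive surface displacement: 20 um peak at onset, decaying
    exponentially over 100 ms (positive = towards the subject)."""
    return peak_um * decay_envelope()

def audio_tone(level_db=60.0, freq=400.0, fs=FS):
    """400 Hz sinusoid whose initial SPL is 50, 60 or 70 dB re 20 uPa,
    with the same 100 ms exponential decay as the tactile pulse."""
    t = np.arange(int(fs * DUR)) / fs
    peak_pa = np.sqrt(2.0) * 20e-6 * 10.0 ** (level_db / 20.0)
    return peak_pa * np.sin(2.0 * np.pi * freq * t) * decay_envelope()
```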

4.2 Set-up

The experiment made use of the same hardware set-up as in experiment 1 (see Sect. 12.3.2). In this case, the surface of the touchscreen was divided into two virtual buttons.

4.3 Subjects

Eighteen subjects, twelve male and six female, aged between 20 and 35 years, participated voluntarily in this experiment. They had no background in acoustics. All subjects were right-handed and had self-reported normal hearing.

4.4 Procedure

The task was to estimate the intensity of the feedback delivered by the virtual button. Participants were instructed to concentrate only on the tactile feedback. The magnitude estimation method with an anchor stimulus was used [40]. After the tactile-only anchor stimulus, a test stimulus was presented, and participants had to assign a number proportional to their subjective impression of the stimulus intensity relative to the anchor stimulus, assuming that the intensity of the latter corresponded to 100. When participants did not perceive the test stimulus, they had to assign 0. Each stimulus pair was presented ten times in random order.

4.5 Results and Discussion

Figure 12.7 shows the responses of all subjects. Geometric mean values were computed for the magnitude estimates obtained from all subjects for each stimulus condition.

Fig. 12.7

Perceived tactile feedback intensity for different stimulus conditions
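
Since magnitude estimates are ratio-scaled, the geometric mean is the usual average; a minimal sketch (zero responses, i.e. "not perceived", must be excluded or treated separately):

```python
import numpy as np

def geometric_mean(estimates):
    """Geometric mean of magnitude estimates relative to the anchor
    (anchor = 100); zero responses must be filtered out beforehand."""
    x = np.asarray(estimates, dtype=float)
    return float(np.exp(np.mean(np.log(x))))

# e.g. geometric_mean([80, 120, 100]) ~= 98.6
```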

All audio–tactile conditions produced higher estimates than the tactile-only condition. Dependent t tests of the means showed that three conditions (tactile only, audio–tactile 50 dB and audio–tactile 70 dB) differed significantly (p < 0.05).

The results show that if tactile button feedback is combined with audio feedback, the perceived intensity of the tactile feedback increases. When the tactile stimulus was accompanied by the acoustic stimulus, the tactile intensity was perceived as being, on average, 56–96% higher.

The perceived tactile intensity increased with increasing sound level, despite no change in the actual tactile feedback level. Similarly, in a previous investigation the authors found that, for a virtual drum, the perceived strength of force feedback increased with increasing loudness, despite no change in the force feedback itself [19].

Overall, these results indicate that auditory information can be useful in overcoming the current limitations of haptic devices.

5 Conclusions

In this chapter, first the fundamental perceptual aspects of auditory and tactile perception were discussed focusing on musical touchscreen applications. Based on this knowledge, various audio–tactile signal generation techniques were introduced and evaluated.

In a first series of experiments, it was found that percussive instruments can be identified to some degree when audio-driven tactile feedback is rendered. The detection rate was best when both source characteristics and rhythmic features were maintained while translating from audio to tactile signals. A qualitative study showed that tactile feedback can improve the hedonic quality of touchscreen-based music interfaces, making them more stimulating and innovative for users.

A second investigation based on the same set-up focused on the perceived tactile feedback intensity of virtual buttons, showing that this can be significantly influenced by parallel auditory feedback. This result may be used to compensate for the limitations of the small actuator technology currently found in consumer devices. The coupled perception of sound and vibration is important for the implementation of innovative touch-based musical interaction, and tactile feedback is useful to enrich the musical interaction.