Speech Mashups

Di Fabbrizio, Giuseppe; Okken, Thomas; Wilpon, Jay

doi:10.1007/978-3-642-36403-7_7

Abstract

Speech mashups are an emerging type of mashup that exposes cloud-based speech and language processing technologies as web services. They allow researchers, practitioners, and developers to access commercial-grade speech recognition and text-to-speech systems without having to install, configure, or manage speech processing software or equipment. Because all the necessary components and tools are available in the network, this approach significantly lowers the barrier to building speech applications. Compared with traditional mashups, speech mashups introduce a number of new concepts, such as audio capture, audio playback, media streaming across the network, and resource configuration management.
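As a rough illustration of the idea, the sketch below packages captured audio and a small amount of resource configuration into a single HTTP request, which is the general shape of a call to a speech recognition web service. It is not the interface described in the chapter: the endpoint URL, the parameter names `grammar` and `sampleRate`, and the function `build_recognition_request` are all hypothetical, and the request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration; not a real service URL.
ENDPOINT = "https://speech.example.com/recognize"


def build_recognition_request(audio: bytes, grammar: str, sample_rate: int = 8000) -> Request:
    """Package captured audio and resource configuration into one HTTP POST."""
    # Resource configuration travels as query parameters (names are assumptions).
    query = urlencode({"grammar": grammar, "sampleRate": sample_rate})
    return Request(
        url=f"{ENDPOINT}?{query}",
        data=audio,  # the raw audio bytes form the request body
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


# Build (but do not send) a request for a few bytes of placeholder audio.
request = build_recognition_request(b"\x00\x01" * 4, grammar="city-names")
print(request.get_method(), request.full_url)
```

In a sketch like this the client needs only an HTTP library; the recognizer, its models, and its grammars would all live on the server side.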

Scotty: “Computer—”

Bones steps in quickly, picks up the “Mouse” and shoves it into Scotty’s hand. Scotty looks at the mouse, baffled, then puts it to his lips like a mike.

Scotty: (continuing) “Hello? Computer...?”

Nichols: (bewildered) “Just use the keyboard...”

Scotty: “The keyboard... How quaint.”

Star Trek IV—The Voyage Home

Original movie script, March 11, 1986

Acknowledgements

We would like to thank Linda Crane and Amanda Stent for their contributions and continuous unconditional support.

Author information

Authors and Affiliations

Research, AT&T Labs, 180 Park Avenue, Florham Park, NJ, USA
Giuseppe Di Fabbrizio, Thomas Okken & Jay Wilpon

Corresponding author

Correspondence to Giuseppe Di Fabbrizio.

Editor information

Editors and Affiliations

Expo Plaza 12, Hannover, 30539, Niedersachsen, Germany
Brigitte Endres-Niggemeyer

About this chapter

Cite this chapter

Di Fabbrizio, G., Okken, T., Wilpon, J. (2013). Speech Mashups. In: Endres-Niggemeyer, B. (eds) Semantic Mashups. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36403-7_7

DOI: https://doi.org/10.1007/978-3-642-36403-7_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36402-0
Online ISBN: 978-3-642-36403-7
eBook Packages: Computer Science, Computer Science (R0)
