Speech Mashups

Abstract

Speech mashups are an emerging type of mashup that exposes cloud-based speech and language processing technologies as web services. They give researchers, practitioners, and developers access to commercial-grade speech recognition and text-to-speech systems without the need to install, configure, or manage speech processing software or equipment. Because all the necessary components and tools are available in the network, this approach significantly lowers the barrier to building speech applications. Compared to traditional mashups, speech mashups introduce a number of new concepts, such as audio capture, audio playback, streaming media across the network, and resource configuration management.

Scotty: “Computer—”

Bones steps in quickly, picks up the “Mouse” and shoves it into Scotty’s hand. Scotty looks at the mouse, baffled, then puts it to his lips like a mike.

Scotty: (continuing) “Hello? Computer...?”

Nichols: (bewildered) “Just use the keyboard...”

Scotty: “The keyboard... How quaint.”

Star Trek IV—The Voyage Home

Original movie script, March 11, 1986

Acknowledgements

We would like to thank Linda Crane and Amanda Stent for their contributions and continuous unconditional support.

Author information

Corresponding author

Correspondence to Giuseppe Di Fabbrizio.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Di Fabbrizio, G., Okken, T., Wilpon, J. (2013). Speech Mashups. In: Endres-Niggemeyer, B. (ed.) Semantic Mashups. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36403-7_7

  • DOI: https://doi.org/10.1007/978-3-642-36403-7_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36402-0

  • Online ISBN: 978-3-642-36403-7

  • eBook Packages: Computer Science (R0)