Multimedia Tools and Applications, Volume 61, Issue 2, pp 353–388

A multimodal alignment framework for spoken documents

  • Dalila Mekhaldi
  • Denis Lalanne
  • Rolf Ingold


We present a multimodal document alignment framework that highlights the alignment relationships between documents that are discussed and recorded during multimedia events such as meetings. These relationships, which should help index the archives of these events, are detected using various techniques from natural language processing and information retrieval. The main alignment strategies studied are based on thematic, quotation and reference relationships. The framework was applied at several levels of document granularity, which required specific document segmentation techniques. The framework is language independent and was evaluated on corpora in French and English, including meetings and scientific presentations. The satisfactory evaluation results obtained at several stages show the value of our approach in bridging the gap between meeting documents, independently of language and domain. They also highlight the utility of multimodal alignment in advanced applications such as multimedia document browsing and content-based or temporal-based searching.
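To make the idea of thematic alignment concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state which similarity measure or threshold is used, so cosine similarity over raw term frequencies, the function names (`tokenize`, `cosine`, `align`) and the `threshold` parameter are all assumptions chosen for illustration. The sketch links each document segment to each speech transcript segment whose vocabulary overlaps strongly enough.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def align(doc_segments, speech_segments, threshold=0.2):
    """Return (doc_index, speech_index, score) for every pair scoring at or above threshold.

    The threshold value is a hypothetical default, not a figure from the paper.
    """
    doc_vecs = [Counter(tokenize(s)) for s in doc_segments]
    speech_vecs = [Counter(tokenize(s)) for s in speech_segments]
    links = []
    for i, d in enumerate(doc_vecs):
        for j, s in enumerate(speech_vecs):
            score = cosine(d, s)
            if score >= threshold:
                links.append((i, j, score))
    return links


if __name__ == "__main__":
    docs = ["project budget report", "holiday office schedule"]
    speech = ["discuss project budget", "office holiday closure"]
    for i, j, score in align(docs, speech, threshold=0.5):
        print(f"document segment {i} <-> speech segment {j}: {score:.2f}")
```

Because the tokenizer relies only on Unicode letter classes, the sketch makes no language-specific assumptions. A practical system would probably also filter stop words and weight terms, which this sketch omits.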


Multimodal document; Document structure; Thematic alignment; Quotation alignment; Reference alignment



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. Computational Linguistics Group, University of Wolverhampton, Wolverhampton, UK
  2. Department of Informatics, University of Fribourg, Fribourg, Switzerland
