1 Preface

Communication is at the core of the cultural, scientific, economic, and technological evolution of contemporary society, its progress and innovation. Nowadays, more and more Information and Communication Technologies emphasize the central role of all issues connected with communication. The numerous different ways people use to communicate with each other, the multiplicity of features for producing and exchanging information, and the relevance and pervasiveness of multimodal, mobile devices and phenomena such as the Internet are all elements that increase the complexity of communication processes and their management.

Devices supporting multimodal interaction are becoming more and more widespread. When applied in an appropriate way, multimodal interaction provides users with a flexible, natural, and robust interaction approach, allowing them to communicate in a synergistic manner using their five senses along with several communication channels. It also permits organizations to store, index, retrieve, and more generally manage large amounts of multimodal data and information, thus enabling people to use a multimodal dialog approach to access information and/or services.

In this context, we have decided to contribute to the scientific debate concerning these emerging questions by defining this special issue. After a very selective peer review, nine papers were selected for publication. They deal with theories and techniques concerning multimodal information retrieval, indexing, query processing, feature extraction from multimodal data, multimodal interaction issues, and applications.

This special issue is organized as follows.

In the first paper, titled “Scene Extraction System for Video Clips using Attached Comment Interval and Pointing Region”, S. Wakamiya et al. address the issue of retrieving and querying large amounts of multimodal information and data. They describe a method enabling users of video sharing websites to easily retrieve video scenes according to their interests. Video analysis techniques, such as image processing and speech recognition, are useful for recognizing objects in a video clip; however, these types of analysis are expensive. The proposed method enables users to view scenes and their attached annotations, considering both text and non-text information, according to their specific interests.

In the second paper, titled “Bayesian Belief Network Based Broadcast Sports Video Indexing”, Maheshkumar H. Kolekar presents a method for the automatic indexing of excitement clips in sports video sequences based on a probabilistic Bayesian belief network (BBN). The paper offers a general approach to the automatic tagging of large-scale multimedia content with rich semantics, enabling the browsing, searching, and manipulation of video documents. The proposed method has been validated in the sports domain by demonstrating successful indexing of soccer and cricket video excitement clips.

The third paper, “MQSS: Multimodal Query Suggestion and Searching for Video Search”, is authored by L. Li and J. Li. Here, the authors aim at improving access to video information by describing a multimodal query suggestion method for video search. Suggested keywords and representative image examples are presented in an easy-to-use drop-down list in order to support users in specifying their queries precisely and effortlessly. The effectiveness of the proposed approach has been evaluated: 96% of users judged the query suggestions of MQSS useful, while 74% and 66% of users considered the query suggestions of Google Video Search and Yahoo! Video Search useful, respectively.

In “Multimedia Selection Operation Placement”, Z. Wu et al. address the complexity of queries involving multimedia operations. They describe their theory and an algorithm for optimizing queries that use expensive multimedia operations. In particular, they determine the optimal placement of each multimedia operation in a query plan by considering the selectivity and unit execution cost of each operation. The proposed algorithm has a time complexity that is polynomial in the number of multimedia operations in the query plan.

The fifth paper, authored by P. Luigi Scala et al., is titled “TMS for Multimodal Information Processing”. It elaborates on a component-based system for modeling and automating the management of complex information-oriented working processes. In particular, it considers working processes that are able to search for, acquire, describe, and assemble computational agents.

In the sixth paper, “Adaptation in Virtual Environments: Conceptual Framework and User Models”, J. Renny Octavia et al. propose a conceptual framework for adaptive, personalized interaction in virtual environments. Switching between interaction techniques in virtual environments according to the constructed user model has been implemented using the adaptation framework. An evaluation has demonstrated that users respond positively to the use of an adaptable system, as their frustration usually decreases.

In the seventh paper, “Multimodal Behavior Realization for Embodied Conversational Agents”, A. Cerekovic et al. argue that decreasing frustration and increasing naturalness are not sufficient for adequate interaction; interaction also has to help overcome the digital divide. In this scenario, the naturalness of the interaction process is a particularly relevant factor. Thus, they describe how to create believable and expressive virtual characters in order to enhance the communication abilities of machines. They use Embodied Conversational Agent technologies to bring human-like verbal and nonverbal abilities to machines, discussing issues of multimodal behavior realization with a focus on motion control.

In “On creating multimodal virtual humans—real time speech driven facial gesturing”, G. Zoric et al. describe a multimodal interface, based on virtual humans, which uses speech as input and speech with facial gestures as output. They provide an original method for automatic audio-to-visual mapping that produces a wide set of facial gestures. The mapping is based solely on the speech signal and occurs in real time, using a hybrid data-driven and rule-based approach.

In the last paper of this special issue, “Improving Multimodal Web Accessibility for Deaf People: Sign Language Interpreter Module”, M. Debevc et al. focus on the problem of web site accessibility. The authors address the necessity of reducing the digital divide in the particular case of deaf and hard-of-hearing people. The paper underlines the importance of providing information in sign language rather than in written form only. The Sign Language Interpreter (SLI) module is presented, which allows deaf and hard-of-hearing people to adjust web access to their needs on a contextual and technical basis.

We hope this special issue motivates researchers to take the next step beyond building models: to implement, evaluate, compare, and extend their proposed approaches. Making this special issue a reality required a considerable effort from many people. We would first like to gratefully acknowledge and sincerely thank all the reviewers for their timely, insightful, and valuable comments and evaluations of the manuscripts, which greatly improved the quality of the final versions. Of course, thanks are due to the authors, who provided excellent articles and timely extended revisions. Finally, we are grateful to the editors of MTAP for their trust in us.