The number of digital images stored, managed and shared through the internet is growing at a phenomenal rate. Press and photo agencies, photo-sharing networks and image search engines face the challenge of managing billions of images effectively. Millions of people register with photo-sharing networks or simply visit such sites through a variety of devices, ranging from mobile phones to broadband-connected digital video recorders. High-quality cameras now come part and parcel with mobile devices, boosting consumer-generated visual content as well as the exchange of such content with others through photo-messaging. These trends point to a shift in user groups, user needs and content type in digital imaging: it is not only professionals but also laymen who generate, access and interact with digital images for professional, educational or entertainment-related purposes; consequently, the content itself is not always of the highest quality or of professionally staged scenes, but carries all the idiosyncrasies of consumer-generated content.

Within such conditions, finding the most relevant or the most appealing image for a given task (e.g. to illustrate a story) has become an extremely difficult process, one that requires taking advantage of every piece of information related to the images. For example, metadata related to image capture, such as date, location, camera settings or the name of the photographer, is often available from the digital camera used to take the photograph. The owner can further add a relevant title, filename and/or descriptive caption or any other textual reference. If the image is uploaded to a shared photo collection, additional comments are frequently added to it by other users. Likewise, images used in documents such as web pages frequently have captions and surrounding text. All this information can be considered image metadata and is of value for organizing, sharing, and processing images.

However, how can one exploit the information contained in such metadata in an intelligent, generic or task-specific way? Linking this information with the actual image content is still an open challenge. The aim of this special issue is to present research on bridging textual and visual data in order to develop technology that shows a more “advanced understanding” of image content. To this end, related work from two major research communities is brought together: computer vision and computational linguistics. The papers presented in this volume have been reviewed by researchers from both communities, as a further step towards fostering dialogue between the communities on such topics of common interest.

This special issue sprang out of the First International Workshop on Metadata Mining for Image Understanding (MMIU 2008), which took place in January 2008 in Madeira, Portugal, as a satellite event of the Computer Vision Theory and Applications Conference (VISAPP 2008). The event was an opportunity for researchers, content providers and related user-service providers to elaborate on the needs and practices of digital image management, to share ideas pointing to new directions in using metadata for image understanding, and to demonstrate related technology representative of the state of the art and beyond. The workshop was successful in attracting interdisciplinary and international interest, as evidenced by the diverse programme committee as well as the papers presented. Computer vision and computational linguistics researchers from academia and industry showed that the time is ripe for the two communities to interact more closely and gain from the resulting cross-fertilization of ideas in addressing several multimedia application challenges. The best papers of the workshop were selected, extended and revised, and now form the main volume of this special issue.

In particular, the papers that we have selected for this volume introduce methods for combining visual and textual metadata for a variety of multimedia applications. Half of them focus on general-purpose methods applicable across multimedia applications, while the other half present methods tuned to the idiosyncrasies of specific applications:

Battiato et al., Using Visual and Text Features for Direct Marketing in the Multimedia Messaging Services Domain, combine visual and textual features in a cascade of regression methods to learn which advertisements, in the form of mobile multimedia messages, are more likely to appeal to users. The work demonstrates the usefulness of such technology in direct marketing. In turn, Ah-Pine et al., Crossing textual and visual content in different application scenarios, present two “trans-media” feedback metrics for automatic image annotation, text illustration, and multimedia retrieval and clustering. The metrics have been used within a travel blog assistant system and a tool for browsing Wikipedia. In Kludas et al., Can Feature Information Interaction help for Information Fusion in Multimedia Problems?, the authors present an information fusion method that takes feature interaction into account in multivariate settings. The method is compared to bivariate dependence measures on both artificial and real-world data, the latter comprising a captioned image database.

Turning to the idiosyncrasies of personal photo collections, Carvalho et al., Attributing Semantics to Personal Photographs, present work towards the automatic propagation of image tags in personal photo albums. They introduce a hybrid information extraction method for extracting person, object and location information from image captions, and implement their approach of combining visual content and capture-time metadata to cluster photographs in personal photo albums and propagate their location-related tags. Lindstaedt et al., Automatic Image Annotation using Visual Content and Folksonomies, present automatic image classification and similarity methods for a tag recommendation system in the context of collaborative annotation. The system relies on visual content analysis, tag association and user preferences. Last, the needs of professional image cataloguers in large image libraries are addressed in Klavans et al., Computational Linguistics for Metadata Building (CLIMB): Using Text Mining for the Automatic Identification, Categorization, and Disambiguation of Subject Terms for Image Metadata. The authors provide an overview of a toolkit for image cataloguers which augments image metadata by associating web-based text segments with image captions and categorizes this metadata by the type of information it reveals (e.g. historical context). Word sense disambiguation is also performed on these textual resources for more accurate indexing and retrieval.

We believe that in these papers readers will find valuable information and ideas not only on methods for combining visual and textual metadata, but also on the multimedia applications that benefit from such research and shape the specifics of the methods to be followed. Vision–language integration has a long past in Artificial Intelligence and a very active present; we hope that the current volume will contribute to an even more vibrant future.