An important characteristic of multimedia content analysis is that multimedia objects exhibit much richer structure than simple objects. In some cases, a set of interrelated instances must be labeled jointly, because the class label of one object depends on the class labels of spatially or temporally related objects. For example, part-of-speech tagging, also called grammatical tagging, is the process of automatically determining the grammatical role (or attribute) of each word in a text. This problem cannot be solved without examining both the word itself and its neighboring words in the same sentence, because many words in natural languages can serve as more than one part of speech depending on context. Similarly, detecting "home run" events in a baseball video requires jointly labeling a sequence of frames, because such an event is composed of a sequence of actions that spans many video frames. The label of each frame cannot be determined without examining both its visual content and its context within the sequence.
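
To make the contrast between independent and joint labeling concrete, the following is a minimal sketch for the part-of-speech case. It is an illustration only, not a method described in this text: it assumes a first-order chain model decoded with the Viterbi algorithm, and the tag set, the three-word sentence, and every probability in `START`, `TRANS`, and `EMIT` are hypothetical values invented for the example. Labeling each word by its own evidence mislabels the ambiguous word "can", while scoring the whole sequence recovers a consistent tagging.

```python
import math

TAGS = ["DET", "NOUN", "VERB"]

# All numbers below are hypothetical, chosen only to illustrate the idea.
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}       # P(first tag)
TRANS = {                                            # P(next tag | previous tag)
    "DET":  {"DET": 0.05, "NOUN": 0.90, "VERB": 0.05},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.50, "NOUN": 0.30, "VERB": 0.20},
}
EMIT = {                                             # per-word evidence for each tag
    "the":   {"DET": 0.90, "NOUN": 0.05, "VERB": 0.05},
    "can":   {"DET": 0.01, "NOUN": 0.40, "VERB": 0.59},
    "rusts": {"DET": 0.01, "NOUN": 0.30, "VERB": 0.69},
}


def label_independently(words):
    """Pick the best tag for each word using only that word's own evidence."""
    return [max(TAGS, key=lambda t: EMIT[w][t]) for w in words]


def label_jointly(words):
    """Viterbi decoding: find the single best tag sequence for the sentence."""
    # Log-score of the best path ending in each tag after the first word.
    score = {t: math.log(START[t]) + math.log(EMIT[words[0]][t]) for t in TAGS}
    backpointers = []
    for w in words[1:]:
        new_score, pointers = {}, {}
        for t in TAGS:
            # Best previous tag given that the current word takes tag t.
            prev = max(TAGS, key=lambda p: score[p] + math.log(TRANS[p][t]))
            new_score[t] = (score[prev] + math.log(TRANS[prev][t])
                            + math.log(EMIT[w][t]))
            pointers[t] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    path = [max(TAGS, key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


SENTENCE = ["the", "can", "rusts"]
print(label_independently(SENTENCE))  # ['DET', 'VERB', 'VERB']
print(label_jointly(SENTENCE))        # ['DET', 'NOUN', 'VERB']
```

The same structure would apply to the video case if the words were replaced by frames and the tags by event stages, although a realistic system would learn the scores from data rather than fix them by hand.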


