
1 Introduction

In the field of higher education, it has often been suggested that lectures be recorded on video so that instructors can review them for Faculty Development (FD) [1,2,3]. However, instructors cannot easily select which scenes to watch, and reviewing a lecture by watching the whole video is very time-consuming. To reduce this heavy workload, previous works have proposed recognizing various situations related to the instructor and the students in order to index the videos [4,5,6,7,8,9].

Those previous works can be classified into two types: those that mainly consider situations related to the instructor [4,5,6,7], and those that focus on the students [8, 9]. Works of the first type discuss how to recognize the instructor’s behaviors, such as writing on the blackboard, presenting slides, and talking to the students. Works of the second type focus mainly on students’ behaviors, because a relation between students’ behaviors and their interest during lectures has been pointed out: the behavior of looking ahead often reflects interest in the lecture [10].

Additionally, recent work has analyzed the relation between the postures taken by students during a lecture and their understanding of it; as a result, it has been shown that behaviors such as dozing off and looking away, as well as looking ahead, can serve as useful clues for estimating the students’ understanding of the lecture [11]. Based on these results, this article discusses how to recognize combinations of these behaviors across the whole group of students in the classroom as the situation of the lecture. To this aim, it is necessary to clarify what kinds of situations can be observed in a lecture, because the situations related to the behaviors of the whole group of students are not as well organized as those of the instructors, who give the lectures with the specific purpose of providing clear explanations using slides and whiteboards. Moreover, whereas most students look ahead when they are paying attention to the lecture, the postures taken while dozing off or looking away may differ from student to student.

In our work, we classify different types of situations from the combinations of the behaviors observed for the whole group of students at different moments of the lectures. To cope with individual differences in the postures taken for the same behavior, we assume that the same posture taken by the same student implies the same behavior, and classify the different behaviors of each student based on the similarity between the postures actually observed for that student. More precisely, we first obtain representative postures for each student by clustering his/her postures observed at each moment of the lecture. Then, we describe the specific situation at each moment of the lecture by combining the representative postures of all students attending the lecture. Finally, those situations are again clustered based on the similarity in the combination of representative postures, and the different situations related to the students during the lecture are recognized.

In Sect. 2, we will provide a more detailed explanation of the procedure used in this study. In Sect. 3, we will present the results of an experiment conducted by one of the authors in his university to evaluate the procedure described above. Finally, in Sect. 4, we will summarize the main points of this article and discuss possible future steps for our research.

2 The Classification of Students’ Situations by Clustering Their Postures

2.1 The Identification of Representative Postures for Each Student

The posture of each student in each frame of the lecture video can be obtained by conventional human image processing techniques for pose estimation. The obtained posture is described by the two-dimensional (2D) coordinates of all the observable feature points of the student’s body. Let \( \varvec{x}_{i} (t) \) denote the posture of the i-th student, denoted by \( S_{i} \), observed in the t-th frame, denoted by \( F_{t} \), of the lecture video (\( i = 1, \cdots ,N; t = 1, \cdots ,T \)), where \( N \) and \( T \) denote the number of students observed in the lecture video and the number of frames constituting the video, respectively. The posture \( \varvec{x}_{i} (t) \) is a 2\( J \) dimensional vector, where \( J \) denotes the number of feature points, mainly the joints, of a student’s body. In this article, this vector is called the observed posture of student \( S_{i} \) at frame \( F_{t} \). Since each observed posture describes only the 2D positions of the feature points in the image frame, and therefore contains no depth information, it changes with the geometric relation between the student and the camera used to record the lecture, even when the same student takes the same posture. However, this geometric relation can be kept unchanged by fixing the camera in the classroom, given that each student sits in the same seat throughout the lecture. Under this condition, a difference in observed posture \( \varvec{x}_{i} (t) \) reflects a difference in the actual 3D posture of student \( S_{i} \).

The set of all the observed postures obtained for student \( S_{i} \) over the video frames is denoted by \( O_{i} = \left\{ {\varvec{x}_{i} \left( 1 \right), \cdots ,\varvec{x}_{i} (T)} \right\} \). Assuming that each student takes similar postures for the same behavior, the postures in \( O_{i} \) are grouped into clusters \( C_{i} = \left\{ {C_{i}^{1} , \cdots ,C_{i}^{K(i)} } \right\} \), where \( K(i) \) denotes the number of clusters; this number should correspond to the number of different postures actually taken by student \( S_{i} \), and thus differs from one student to another (see Fig. 1). Since the observed postures \( \left\{ {\varvec{x}_{i} \left( t \right)| \varvec{x}_{i} (t) \in C_{i}^{k\left( i \right)} , C_{i}^{k\left( i \right)} \in C_{i} } \right\} \) classified into the same cluster \( C_{i}^{k(i)} \) are similar to each other, they are regarded as representing the same posture taken by student \( S_{i} \) for the same behavior. The representative postures are defined to indicate these observed postures taken for the same behavior by each student. The \( k(i) \)-th representative posture \( \varvec{X}_{i}^{k(i)} \) of student \( S_{i} \) is defined as the centroid of \( C_{i}^{k(i)} \) as follows:

Fig. 1.
figure 1

Representative postures obtained by clustering observed postures.

$$ \varvec{X}_{i}^{k(i)} = \frac{1}{{|C_{i}^{k(i)} |}}\mathop \sum \limits_{{\varvec{x}_{i} (t) \in C_{i}^{k(i)} }} \varvec{x}_{i} (t) $$
(1)

where all the representative postures of student \( S_{i} \) are given by the set \( X_{i} = \left\{ {\varvec{X}_{i}^{1} , \cdots , \varvec{X}_{i}^{K(i)} } \right\} \).
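As an illustration, this step can be sketched with k-means, the clustering method used in the experiment of Sect. 3; the array layout and function names below are assumptions of the sketch, not the authors’ implementation.

```python
# A minimal sketch of Sect. 2.1, assuming the observed postures of one student
# are stacked into a T x 2J NumPy array (this layout is an assumption of the
# sketch, not part of the paper). k-means is the method used in Sect. 3.2, and
# its cluster centers coincide with the centroids of Eq. (1).
import numpy as np
from sklearn.cluster import KMeans

def representative_postures(observed: np.ndarray, n_clusters: int, seed: int = 0):
    """observed: array of shape (T, 2J), one observed posture x_i(t) per row.
    n_clusters: K(i), the assumed number of distinct postures of this student.
    Returns (centroids, labels): the representative postures X_i^{k(i)} of
    Eq. (1) and the cluster index assigned to each frame."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(observed)
    return km.cluster_centers_, km.labels_
```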

To describe the behavior associated with an observed posture of each student at any frame, the observed posture is substituted by the representative posture that is most similar to it among all the representative postures of that student. Let \( \varvec{y}_{i} \left( t \right) \) denote the representative posture that substitutes observed posture \( \varvec{x}_{i} (t) \) of student \( S_{i} \) in frame \( F_{t} \). This representative posture is the element of \( X_{i} \) with the minimal Euclidean distance from \( \varvec{x}_{i} (t) \):

$$ \varvec{y}_{i} \left( t \right) = \mathop {\text{argmin}}\limits_{{\varvec{X}_{i}^{k(i)} \in X_{i} }} \left\| {\varvec{x}_{i} \left( t \right) - \varvec{X}_{i}^{k(i)} } \right\| $$
(2)
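Equation (2) amounts to a nearest-centroid assignment. A possible NumPy sketch, with illustrative variable names, is:

```python
# Sketch of Eq. (2): replace each observed posture x_i(t) by the representative
# posture of student S_i closest to it in Euclidean distance.
import numpy as np

def substitute_postures(observed: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """observed: (T, 2J) observed postures of one student.
    centroids: (K(i), 2J) representative postures X_i.
    Returns the substituted postures y_i(t), one row per frame."""
    # Pairwise Euclidean distances between every frame and every centroid.
    dists = np.linalg.norm(observed[:, None, :] - centroids[None, :, :], axis=-1)
    return centroids[np.argmin(dists, axis=1)]
```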

2.2 The Classification of Different Situations in the Whole Group of Students

As a result of the procedure described in Sect. 2.1, the representative postures \( \varvec{y}_{1} \left( t \right), \cdots ,\varvec{y}_{N} \left( t \right) \) of all the students \( S_{1} , \cdots , S_{N} \) are obtained for each frame \( F_{t} \). Since any observed posture is described as a 2\( J \) dimensional vector, each of the \( N \) representative postures is also described as a 2\( J \) dimensional vector. These \( N \) representative postures are employed to describe the situation of the whole group of students in each frame. The situation of the whole group of students in frame \( F_{t} \) is denoted by \( \varvec{y}(t) \), called here the combined representative posture, and is defined as the 2\( JN \) dimensional vector whose elements are constituted by those of the \( N \) representative postures as follows:

$$ \varvec{y}\left( t \right) = \left[ {\varvec{y}_{1} \left( t \right) \cdots \varvec{y}_{N} \left( t \right)} \right] $$
(3)

Let \( R = \left\{ {\varvec{y}\left( 1 \right), \cdots ,\varvec{y}\left( T \right)} \right\} \) denote the set of the combined representative postures \( \varvec{y}\left( t \right) \) over all the frames. Frames in which each student takes observed postures substituted by the same representative posture of his/her own should be regarded as frames showing the same behavior of the whole group of students; therefore, frames with similar combined representative postures can be regarded as representing the same situation for the whole group. Based on this idea, the set \( R \) of all combined representative postures is classified into clusters, each including similar combined representative postures (see Fig. 2). The resultant set of clusters is denoted by \( D = \left\{ {D^{1} , \cdots ,D^{L} } \right\} \), where \( L \) is the number of clusters, corresponding to the number of different situations that actually occurred during the observed lecture.
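A possible sketch of this step, again using k-means as in Sect. 3 and with illustrative variable names, simply concatenates the substituted postures of the \( N \) students frame by frame and clusters the resulting 2\( JN \) dimensional vectors:

```python
# Sketch of Sect. 2.2: build the combined representative postures y(t) of
# Eq. (3) and cluster them into situations D^1, ..., D^L. The list-of-arrays
# input layout is an assumption of the sketch.
import numpy as np
from sklearn.cluster import KMeans

def cluster_situations(per_student_y, n_situations: int, seed: int = 0):
    """per_student_y: list of N arrays, each of shape (T, 2J), holding y_i(t).
    n_situations: L, the assumed number of distinct situations.
    Returns (labels, centroids): the situation index of each frame and the
    centroids Y_l of the clusters."""
    combined = np.hstack(per_student_y)   # shape (T, 2JN): the vectors y(t)
    km = KMeans(n_clusters=n_situations, n_init=10, random_state=seed).fit(combined)
    return km.labels_, km.cluster_centers_
```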

Fig. 2.
figure 2

Lecture situations obtained by clustering representative postures.

If the situation for each frame needs to be further recognized among its possible variations obtained as \( D \) described above, the situation to be recognized for frame \( F_{t} \) can be obtained by replacing combined representative posture \( \varvec{y}\left( t \right) \) with the centroid of the cluster including \( \varvec{y}\left( t \right) \). Let \( \varvec{Y}_{l} \) denote the centroid of cluster \( D^{l} \) \( \left( {l = 1, \cdots ,L} \right) \), which is defined as follows:

$$ \varvec{Y}_{l} = \frac{1}{{|D^{l} |}}\mathop \sum \limits_{{\varvec{y}(t) \in D^{l} }} \varvec{y}(t) $$
(4)

where \( Y = \left\{ {\varvec{Y}_{1} , \cdots , \varvec{Y}_{L} } \right\} \) describes different situations of the whole group of the students. Thus, the situation of the whole group of students in frame \( F_{t} \) can be recognized by finding \( \varvec{z}\left( t \right) \), which denotes the element with the minimal Euclidean distance from \( \varvec{y}\left( t \right) \) among \( Y \):

$$ \varvec{z}\left( t \right) = \mathop {\text{argmin}}\limits_{{\varvec{Y}_{l} \in Y}} \left\| {\varvec{y}\left( t \right) - \varvec{Y}_{l} } \right\| $$
(5)
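Equations (4) and (5) can be transcribed directly as below; note that when k-means is used for the clustering, as in Sect. 3, the centroids and the per-frame assignments are already produced by the clustering itself, so this sketch is only for illustration.

```python
# Direct transcription of Eqs. (4)-(5), assuming `combined` holds the vectors
# y(t) as rows of shape (T, 2JN) and `labels` the cluster index of each frame.
import numpy as np

def situation_centroids(combined: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Eq. (4): the centroid Y_l of every cluster D^l."""
    return np.vstack([combined[labels == l].mean(axis=0) for l in np.unique(labels)])

def recognize_situation(y_t: np.ndarray, centroids: np.ndarray) -> int:
    """Eq. (5): index of the centroid Y_l closest to the combined posture y(t)."""
    return int(np.argmin(np.linalg.norm(centroids - y_t, axis=1)))
```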

Since \( Y \) is not given in advance but is obtained from the similarity between the students’ postures, in order to identify the situations in which the students are involved, we need to know in advance neither what kinds of situations may occur during the lecture nor what postures are actually taken by each student in each situation.

3 Experimental Results

3.1 Students’ Observed Postures

We ran an experiment to evaluate whether the method described in Sect. 2 can be used to identify situations that are useful for instructors to review and improve their lectures. After obtaining the students’ approval, we recorded a seminar supervised by one of the authors of this article with a camera fixed in the classroom. The recorded video lasted 90 min and consisted of 2771 frames (T = 2771). OpenPose [12] was employed for pose estimation. The postures of 13 of the attending students were obtained in each frame (N = 13); the postures of the other students could not be obtained due to occlusions among the students. The observed posture of a student in each frame is described as a 24-dimensional vector, which consists of the 2D image coordinates of 12 feature points, including the nose, neck, shoulders, elbows, wrists, eyes, and ears (J = 12). Figure 3 illustrates the observed postures of the 13 students in one frame of the lecture video. Different lines indicate different pairs of adjacent feature points. The face of each student is hidden in the image for privacy protection.
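For reference, the sketch below shows one way the 24-dimensional observed posture of a detected person could be read from OpenPose’s per-frame JSON output. The BODY_25 keypoint indices, the file layout, and the matching of a detected person to a particular student (e.g., by seat position) are assumptions of the sketch and should be checked against the OpenPose version actually used.

```python
# Sketch of extracting an observed posture (J = 12 keypoints, 24 values) from
# an OpenPose per-frame JSON file. The BODY_25 index list below (nose, neck,
# shoulders, elbows, wrists, eyes, ears) is an assumption of this sketch.
import json
import numpy as np

KEYPOINT_IDS = [0, 1, 2, 5, 3, 6, 4, 7, 15, 16, 17, 18]

def observed_posture(json_path: str, person_index: int) -> np.ndarray:
    """Return the 2J = 24 dimensional posture vector of one detected person.
    Matching person_index to a specific student across frames (e.g., by seat
    position under the fixed camera) is a separate step not shown here."""
    with open(json_path) as f:
        frame = json.load(f)
    kp = frame["people"][person_index]["pose_keypoints_2d"]  # flat [x, y, conf, ...]
    xy = np.array(kp).reshape(-1, 3)[:, :2]                  # keep 2D coordinates
    return xy[KEYPOINT_IDS].ravel()                          # shape (24,)
```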

Fig. 3.
figure 3

An example of the observed postures.

3.2 Representative Postures Obtained for Each Student

The observed postures obtained for each student in all frames were classified into clusters of similar postures. The k-means method [13] was employed for clustering. Since this method requires that the number of clusters K(i) be specified, we tried different values of K(i) in order to find an appropriate number of clusters, as in the sketch below. As a result, clusters whose observed postures can be interpreted as meaningful behaviors were obtained for K(i) = 2–8.
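The sweep over K(i) can be expressed as a small loop over candidate cluster counts whose resulting clusters are then inspected manually; the helper below is only an illustrative sketch.

```python
# Sketch of trying several cluster counts K(i) for one student, as described
# above; the fitted models are kept so that the resulting clusters can be
# inspected for meaningful behaviors.
from sklearn.cluster import KMeans

def sweep_cluster_counts(observed, k_range=range(2, 9), seed=0):
    """observed: (T, 2J) postures of one student. Returns {K: fitted KMeans}."""
    return {k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(observed)
            for k in k_range}
```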

Figures 4 and 5 show examples of the observed postures included in each of the three clusters obtained for two different students when K(i) = 3. The observed postures in this example can be interpreted as the behaviors of looking ahead, taking notes, and looking away. However, the observed postures included in the clusters corresponding to the same behavior for different students are not necessarily similar in their geometric shapes. This result implies that different students show a similar range of behaviors during the same lecture, whereas the geometric shapes of the observed postures that can be interpreted as the same behavior often differ between individuals. Nevertheless, simply by clustering the observed postures of each student separately, our method extracts meaningful behaviors occurring during the lecture while tolerating these individual differences.

Fig. 4.
figure 4

The representative postures of student A.

Fig. 5.
figure 5

The representative postures of student B.

3.3 Obtaining the Situations of the Whole Group of Students

The representative postures of each student were obtained as the centroids of the clusters of observed postures described in Sect. 3.2. The observed posture of each student in every frame was then replaced with one of his/her representative postures, and these were combined to form the combined representative posture of the whole group of students in that frame. Different situations of the group of students during the lecture were obtained by clustering the combined representative postures over all frames. The k-means method was employed again for this clustering. Since the number of clusters L is unknown, we also tried different values for L. As a result, most clusters could be interpreted as meaningful situations for the whole group of students when L = 4.

Figures 6 and 7 show examples of frames classified into different clusters. In each figure, the representative posture of each student is shown at the position of the student in the image frame. Figure 6 shows examples of situations that can be given a meaningful interpretation, whereas the situations depicted in Fig. 7 cannot be interpreted meaningfully. For example, the situations illustrated in Fig. 6 can be interpreted respectively as (a) paying attention to the lecture, (b) taking notes, and (c) looking away, because almost all students show the same behavior although the geometric shapes of the representative postures are different. On the other hand, the examples in Fig. 7 are not easily interpreted in a meaningful way for the whole group, because some students are paying attention to the lecture while others are taking notes.

Fig. 6.
figure 6

Examples of situations that have meaningful interpretations.

Fig. 7.
figure 7

Examples of situations that are difficult to interpret meaningfully.

From the examples reported above, it can be said that our method is fairly useful for obtaining meaningful situations of the students regardless of their individual differences in posture, but it still needs further improvement. One reason why the situations in Fig. 7 cannot be interpreted unambiguously is that the students begin and finish taking notes at different moments. To deal with this asynchrony of behaviors, our clustering method needs to be made tolerant of slight temporal differences.

4 Conclusions

This article discussed the possibility of identifying various situations related to the whole group of students during lectures from the videos obtained with a fixed camera in the classroom. The proposed method first obtains observed postures for each student, described as 2D positions of the feature points of the body, by pose estimation for each frame of the recorded lecture. Since each student is seated at the same location throughout the lecture and the camera is fixed in the classroom, the differences in the observed postures of each student reflect the changes in his/her posture. Thus, assuming that the same posture of the same student reflects the same behavior, the observed postures of each student in all the frames are classified into clusters based on their similarity to obtain the representative postures as the centroids of the clusters. The representative postures of all students in each frame are used to form the combined representative postures in the frame, and different situations of the whole group of students during the lecture are obtained by further clustering the combined representative postures in all the frames. Applying this method to the analysis of the video of a seminar, most of the obtained clusters could be given meaningful interpretations, although some of them were difficult to interpret meaningfully.

In future research, we need to modify the method so that the clustering can tolerate individual differences related to the moment in which the posture changes. Although the most straightforward solution would be to reduce the temporal resolution of the video frames, further discussion is required to understand how to address this issue properly.

It is also important to consider the different relevance of different feature points when evaluating the similarity between postures based on their positions. For example, the position of each hand is not as relevant as the position of the head for evaluating the similarity of whole-body postures, because the hands tend to take more varied positions than the head within the same behavior, whether paying attention, taking notes, or looking away. Thus, it becomes necessary for the clustering to give different weights to different feature points, or to normalize the distances between feature points, when evaluating differences in posture.