1 Introduction

In this technological era, the accessibility and sharing of multimedia content on social media platforms have increased as the speed and reliability of internet connections have improved. Animated images, such as Graphics Interchange Format (GIF) images, are exceedingly popular; notably, they are used to share varied kinds of stories, summarize events, express emotion, gain attention, and enhance (or even replace) text-based communication [2]. Among the many available media formats, GIFs have become prevalent over the last decade owing to their distinct features: they are instantaneous (very short in duration) and rely on purely visual storytelling (no audio involved). Owing to their low bandwidth requirements and lightweight nature, GIFs are also integrated into streaming platforms to highlight videos. Figure 1 shows an example of animated images (WebP) on YouTube that are employed to instantaneously highlight a recommended video. Notably, despite their increasing popularity and unique visual characteristics, only a few studies in multimedia research have addressed GIF generation.

Fig. 1 Streaming platforms commonly use animated images to highlight recommended videos

According to a recent study [31], more than 500 million users spend approximately 11 million hours every day watching GIFs on the GIPHY website. Nevertheless, despite the ubiquitous adoption and prevalence of GIFs, no real-time, lightweight GIF generation framework has been established, particularly for streaming platforms. Server-driven techniques can provide real-time solutions to this problem, as all the information, including user data and video content, already exists on servers. However, there are three main concerns regarding server-driven solutions: (i) providing real-time responses to many concurrent users with limited computational resources; (ii) violating user privacy in a personalized approach; and (iii) the fact that current solutions process entire videos to create GIFs, which increases the overall computation time and demands substantial computational resources. These concerns prompted us to explore a lightweight client-driven technique for GIF generation.

This paper proposes a novel, lightweight, and computationally efficient client-driven framework that requires minimal computational resources to generate animated GIFs. It analyzes an acoustic feature to track and estimate the timestamp corresponding to the maximum pitch (henceforth referred to as the “maximum pitch timestamp”) within the climax section of the corresponding video. Instead of processing the entire video, it processes only a small segment of the video to generate a GIF, which makes the process efficient in terms of computation, communication, and storage. Sixteen publicly broadcast videos are analyzed to evaluate the effectiveness of the proposed approach. The main technical contributions of this study are summarized as follows:

  • A novel lightweight client-driven method is proposed to generate animated GIFs in accordance with user preferences.

  • A two-dimensional convolutional neural network (CNN) model is designed to classify audio files according to the music genre interest of the user.

  • To validate the effectiveness of the proposed framework, Hypertext Transfer Protocol (HTTP) Live Streaming (HLS) clients and servers are locally configured.

  • Extensive quantitative experimental results obtained for 16 videos show that the proposed method is 3.76 times more computationally efficient than the baseline when an Nvidia Jetson TX2 is employed as the client device. Moreover, it is 1.87 times more computationally efficient than the baseline when a device with substantial computational resources is employed (details in Section 4.3). Qualitative evaluations were conducted on 16 videos with 16 participants (details in Section 4.4).

To the best of our knowledge, this is the first attempt to design an entirely client-driven technique that generates animated GIFs from an acoustic feature for streaming platforms.

The remainder of this paper is organized as follows. Section 2 summarizes related work. Section 3 presents the details of the proposed client-driven framework. Section 4 presents the qualitative and quantitative results, along with a discussion. Finally, concluding remarks are provided in Section 5.

2 Related research

This section briefly reviews the related research on animated GIF generation methods. We also review current music genre classification (MGC) methods; notably, MGC is an important technique for classifying audio files based on the genre preference of users in the proposed method.

2.1 GIF generation methods

In recent years, there has been growing interest in research on animated GIFs. Many qualities that make animated GIFs more engaging than videos and other media on social network websites have been identified [2]. Facial expressions, histograms, and aesthetic features have been predicted and compared [18] to determine the video features that best express useful emotions in GIFs. Another recent study [22] used sentiment analysis to estimate textual-visual sentiment scores for annotated GIFs. Several researchers have collected and prepared datasets to annotate animated GIFs [14]. In particular, the Video2GIF dataset was collected for video highlighting and extended to include emotion recognition [13]. The GIFGIF+ dataset has been proposed for emotion recognition [4]. Another dataset, Image2GIF, has been proposed for video prediction [46], together with a method to generate cinemagraphs from a single image by predicting future frames.

2.2 MGC methods

MGC has become a prevalent topic in machine learning since a seminal report [38] was published. MGC has commercial value and many practical applications, such as music recommendation [39], music tagging [6], and genre classification [5]. Recent research [8, 41] has shown that spectrograms transformed from audio signals, such as short-term Fourier transform spectrograms and Mel-frequency cepstral coefficient (MFCC) spectrograms, can be successfully applied to MGC tasks, owing to their capability of describing temporal changes in energy distributions with respect to frequency. CNN models have been used for different MGC tasks. Early studies on MGC using neural networks [37] confirmed that techniques such as dropout, rectified linear units, and Hessian-free optimization can enhance feature learning. To exploit the feature learning capability of CNNs, an initially trained CNN has been used as a feature extractor, and the extracted features were then fed to a classifier [20]; good results were achieved on the GTZAN dataset [38] by combining the extracted features with majority voting. CNN-based approaches obtain notable results in MGC tasks; however, they neglect the temporal information in spectrograms, which may be useful. Based on this reasoning, a long short-term memory recurrent neural network (RNN) has been used [9] to extract features from scatter spectrograms [1] of audio segments and fuse them with those obtained using CNNs. In addition, to take advantage of both CNNs and RNNs, a convolutional RNN has been designed for music tagging [7], which exploits both the spatial and temporal information of the spectrograms.

Despite the extensive research on GIFs and MGC, a lightweight client-driven GIF generation technique specialized for streaming platforms has not been developed. Most modern end-user devices have limited computational resources, and analyzing an entire video to generate an animated GIF is time-consuming, which is not feasible for real-time solutions. This paper presents a novel method to generate animated GIFs on end-user devices. It uses an acoustic feature and video segments, which makes it computationally efficient and robust enough to create GIFs in real time. The following section explains the major components of the proposed framework and the GIF generation process using an acoustic feature.

3 Proposed framework

Streaming platforms manage the video and audio streams separately for each video. The video is split and stored as small, continuous segments. Dividing a video into segments and separating the audio allows streaming platforms to manage them independently according to different specifications. As described in Section 2, existing techniques process the entire video to generate an animated GIF, which is not an efficient approach. In this context, a novel animated GIF generation method is proposed to reduce the consumption of computational resources and the computation time. Using an audio file instead of the entire video enables animated GIFs to be created within an acceptable computation time on end-user devices such as the Nvidia Jetson TX2.

The high-level system architecture of the proposed method is illustrated in Fig. 2. It comprises two main parts: the HLS Server and the HLS Client. The proposed method mainly focuses on the client-side implementation. In the following subsections, the configuration and role of each component of the proposed method are explained.

Fig. 2 High-level system architecture of the proposed client-driven GIF generation framework

3.1 HLS server

The first component of the proposed system architecture is the HLS server. The purpose of the HLS server is to smoothly transmit audio files and segments to concurrent users on heterogeneous end-user devices. Internet Information Services (IIS) was selected for this purpose and locally configured on Microsoft Windows 10. IIS supports most network protocols [29]. To reduce potential corruption or loss of packets during transmission [17], all videos are encoded as H.264/AAC Moving Picture Experts Group 2 (MPEG-2) transport stream (.ts) segments using FFmpeg [12]. Each video segment corresponds to approximately ten seconds of playback with a continuous timestamp. Similarly, the list of segments for each video is stored in a text-based playlist file (M3U8) in the playback order of the segments. Along with the segments, the HLS server also contains the audio file (.mp3) of the video, which is separately extracted from the source video using FFmpeg [12]. The detailed hardware specifications of the HLS server are described in Section 4.1.1. The following sections describe the details and roles of each HLS client component.
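
For illustration, the server-side preparation described above could be scripted as follows; the FFmpeg flags, file names, and directory layout are assumptions made for this sketch rather than the exact server configuration.

```python
# Sketch of the server-side content preparation: encode the source video
# as H.264/AAC, split it into ~10 s MPEG-2 TS segments listed in an M3U8
# playlist, and extract the audio track as a separate .mp3 file.
import subprocess

def prepare_hls_content(source_video: str, out_dir: str) -> None:
    # Segment the video for HLS delivery.
    subprocess.run([
        "ffmpeg", "-i", source_video,
        "-c:v", "libx264", "-c:a", "aac",
        "-hls_time", "10",                    # target segment duration (s)
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/segment_%03d.ts",
        f"{out_dir}/playlist.m3u8",
    ], check=True)

    # Extract the audio track served alongside the segments.
    subprocess.run([
        "ffmpeg", "-i", source_video, "-vn",
        "-codec:a", "libmp3lame", "-q:a", "2",
        f"{out_dir}/audio.mp3",
    ], check=True)
```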

3.2 HLS client

The purpose of the HLS client is to process the audio file and a segment of the corresponding video to generate a GIF on the end-user device. To this end, an Nvidia Jetson TX2 was configured as the HLS client. This device is a GPU-based board with a 256-core Nvidia Pascal architecture [11]. The JetPack 4.3 SDK was used to automate the basic installation, and the maximum energy profile was used in the proposed method. The HLS client consists of four major components: the HTTP persistent connection, the MGC module, the animated GIF generation module, and the web-based user interface. The details of each component are described in the following subsections.

3.2.1 HTTP persistent connection

Several requests are initiated from the end-user device to obtain the audio files and segments during the GIF generation process. An HTTP persistent connection is used to download all the corresponding files because it can carry multiple requests and responses over a single Transmission Control Protocol (TCP) connection [3]. Persistent connections offer many advantages, such as fewer Transport Layer Security handshakes, lower overall CPU usage, and fewer round trips [47].
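
A minimal sketch of how the client could reuse one persistent connection for all downloads is given below, assuming the Python requests library; the host name and file paths are placeholders, not the actual server layout.

```python
# One persistent (keep-alive) HTTP connection is reused for the playlist,
# the audio file, and the later segment request.
import requests

session = requests.Session()  # keep-alive: the TCP connection is reused
BASE = "http://hls-server.local/video01"  # placeholder host and path

playlist = session.get(f"{BASE}/playlist.m3u8").text
audio = session.get(f"{BASE}/audio.mp3").content
with open("audio.mp3", "wb") as f:
    f.write(audio)

# Later in the pipeline, the selected segment is fetched over the same connection.
segment = session.get(f"{BASE}/segment_012.ts").content
with open("segment.ts", "wb") as f:
    f.write(segment)
```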

3.2.2 MGC module

The purpose of the MGC module is to analyze the audio files according to the user's music genre preference. For this purpose, the GTZAN dataset, which is extensively used as a benchmark for MGC [38], was used in the experiments. The dataset has ten genres, and each genre includes 100 soundtracks of 30 s duration with a sampling rate of 22,050 Hz and a bit depth of 16 bits. The genres are blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock. The dataset is divided into two sub-datasets: 70% for training and 30% for testing. The Librosa library was used to extract MFCC spectrograms, which are a good representation of music signals [7], from the raw audio data [24]. The extracted features were used as input to the CNN model for training.
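
A minimal sketch of this MFCC extraction step with Librosa is shown below; the window length, hop length, and number of coefficients are assumptions chosen to roughly match the input dimensions described in the next paragraph, not the exact configuration used in the paper.

```python
# Extract per-window MFCC spectrograms from a GTZAN track.
import numpy as np
import librosa

def mfcc_windows(path: str, sr: int = 22050, win_s: float = 3.0) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)              # mono audio at 22,050 Hz
    samples_per_win = int(win_s * sr)
    windows = []
    for start in range(0, len(y) - samples_per_win + 1, samples_per_win):
        chunk = y[start:start + samples_per_win]
        # 128 coefficients, ~129-130 frames per 3 s window with hop 512
        mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=128, hop_length=512)
        windows.append(mfcc.T[..., np.newaxis])    # (time, freq, 1)
    return np.stack(windows)
```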

The extracted MFCC features, represented as a two-dimensional array over time and feature value, were input to the CNN model. Each 30-second audio clip was split into 3-second windows, yielding an input tensor of 19,000 samples × 129 (time) × 128 (frequency) × 1 (channel). The backbone of the proposed network was based on the VGG16 neural network. The network structure of the music classification model is shown in Fig. 3. The model was trained using the SGDW optimization algorithm with a learning rate of 0.01, a momentum of 0.9, and the default weight decay value [23]. The training data were fed into the model with a batch size of 256 and a learning rate of 0.001 for cost minimization, and 1,000 iterations were performed to learn the sequential patterns in the data. Early stopping with a patience of ten epochs was adopted. The network was trained for up to 100 epochs, and the best validation accuracy was obtained at epoch 51, within a training phase of approximately 30 minutes. The Keras toolbox was used for feature extraction and training, and the network was trained on a GeForce RTX 2080 Ti GPU. The results of the trained model are described in Section 4.1.
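
The following Keras sketch illustrates a classifier of the kind described above, with a VGG16 backbone over the MFCC windows; the classification head, loss, and the optimizer stand-in (plain SGD with momentum instead of SGDW) are assumptions for illustration, not the authors' exact model.

```python
# Illustrative VGG16-backbone genre classifier for (129, 128, 1) MFCC windows.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_mgc_model(input_shape=(129, 128, 1), n_genres=10):
    backbone = VGG16(include_top=False, weights=None, input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(backbone.output)
    out = layers.Dense(n_genres, activation="softmax")(x)
    model = models.Model(backbone.input, out)
    # The paper uses SGDW; plain SGD with momentum stands in here
    # (decoupled weight decay could be added, e.g., via TensorFlow Addons' SGDW).
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_mgc_model()
early_stop = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# x_train, y_train, x_val, y_val: MFCC windows and integer genre labels (see above)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=256, epochs=100, callbacks=[early_stop])
```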

Fig. 3 Network structure of the music classification model

3.2.3 Animated GIF generation module

The objective of this module is to extract the climax section from the corresponding audio file and estimate the maximum pitch timestamp from it, so that a video segment can be requested to generate the animated GIF. The first three seconds of the segment are used to create the GIF in the proposed technique. Here, the length of each GIF is fixed, but the method can be extended to generate GIFs of a specific length. The proposed method mainly focuses on music videos that have a plot. A composed story generally consists of an exposition, rising action, climax, and resolution. The most exciting part of the plot is the climax section, where all the key events happen, and this represents the most memorable part [16]. Generally, the climax section in a classical story plot begins at 2/3 of the total running time. Figure 4 shows the classical story plot structure of the Big Buck Bunny (2008) video. The details of GIF generation from the climax part using the proposed method are explained in Section 4.2.

Fig. 4 Classical story plot structure in a video

3.2.4 Web-based user interface

The user can select the video and preferred music genre and view the generated GIFs using the web-based user interface. The open-source hls.js player is used for this purpose [10]. Hypertext Markup Language 5 (HTML5) video and Media Source Extensions are needed to play back the transmuxed MPEG-2 transport stream. This player supports client-driven data delivery, meaning that the player can decide when to request a segment. The details of the animated GIF generation process are explained in the following sections.

4 Experimental results and discussion

This section presents an extensive experimental evaluation of the proposed method. First, the experimental setup is described along with the baseline approach, which processes the entire video and audio. The complete flow of the proposed GIF generation process is then explained from the user perspective. The accuracy of the proposed MGC model is presented and compared with those of other well-known approaches. Finally, the performance of the proposed method is compared with that of the baseline method.

4.1 Experimental setup

4.1.1 Hardware configuration

In the experimental evaluation, both the HLS clients and the HLS server were locally configured. Two different hardware configurations were used for the HLS client: A high-computational-resources (HCR) device ran on the open-source Ubuntu 18.04 LTS operating system, and a low-computational-resources (LCR) device was configured using an Nvidia Jetson TX2. The proposed and baseline methods were deployed on the HLS clients separately for the LCR and HCR devices. The HLS server was configured on Windows 10 and used in all experiments. All hardware devices were locally connected to the SKKU school network. Table 1 lists the specifications of each hardware device used in the experiments. The entire GIF generation process from the user perspective using the proposed approach is explained in the following subsection.

Table 1 Specification of hardware devices

4.1.2 Proposed GIF generation process

This section explains the entire flow of the proposed GIF generation process from the user perspective. The flow is explained based on 16 popular videos selected from YouTube. A complete description of the videos is provided in Table 2. The statistics for the number of views were collected in June 2020. Because of their popularity, some of the videos have been viewed more than a billion times on YouTube. All the videos used in the experiments had a resolution of 480 × 360 pixels.

Table 2 List of videos used for analysis in the proposed method

To start the process, the user selects a video and a preferred music genre using the web-based interface. The system then requests the audio file of the corresponding video. The downloaded audio file is analyzed by the trained CNN model according to the user's music genre preference. If the music genre of the audio file is consistent with the user preference, the system extracts the climax section from it using the Pydub library [32]. As described in Section 3.2.3, most videos follow the same plot structure, and the climax section begins after 2/3 of the running time; thus, only the last 25% of the audio file is used as the climax section. The maximum pitch timestamp is estimated from the climax section using Crepe [19]. The timestamp, obtained in seconds, determines which segment to download; Equation (2) is used to estimate the segment number from the pitch timestamp.

The system then requests the specific segment from the HLS server and uses FFmpeg [12] to create the GIF from that segment. Algorithm 1 shows all the processing steps required by the proposed method to generate a GIF for each video. The variables are Ct (climax start time), A (length of the audio file in samples), Sn (segment number), Asr (sample rate of the audio file), Sd (segment duration), and Pt (timestamp of the maximum pitch within the climax section). Sd is a constant with a value of ten seconds.

$$ Ct = \frac{A}{Asr} \times 0.75 \qquad (1) $$

$$ Sn = \frac{Ct + Pt}{Sd} \qquad (2) $$
Algorithm 1
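
The climax extraction, pitch estimation, and segment selection in (1) and (2) could be sketched as follows; the Pydub and CREPE usage, the file names, and the flooring of the segment index are illustrative assumptions rather than the exact implementation of Algorithm 1.

```python
# Estimate the maximum pitch timestamp in the climax section and map it
# to an HLS segment number following (1) and (2).
import numpy as np
import crepe
from pydub import AudioSegment

SEGMENT_DURATION = 10  # Sd, seconds per HLS segment

audio = AudioSegment.from_file("audio.mp3").set_channels(1)
duration_s = len(audio) / 1000.0           # A / Asr

# Eq. (1): the climax section is assumed to start at 75% of the audio.
climax_start_s = duration_s * 0.75
climax = audio[int(climax_start_s * 1000):]

# Estimate pitch over the climax with CREPE and take the frame with the
# highest estimated fundamental frequency (Pt, relative to the climax start).
samples = np.array(climax.get_array_of_samples()).astype(np.float32)
time, frequency, confidence, _ = crepe.predict(samples, climax.frame_rate, viterbi=True)
pt = float(time[np.argmax(frequency)])

# Eq. (2): map the absolute timestamp to a segment number (floored here).
segment_number = int((climax_start_s + pt) // SEGMENT_DURATION)
print(segment_number)
```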

4.1.3 Baseline method

This subsection explains the baseline method used for comparison with the proposed GIF generation method. As described in Section 2, previous methods used the entire video in the GIF generation process; accordingly, the entire video and audio are used in the baseline method. As highlighted in Section 4.1.2, the baseline method uses the same web-based interface. After the video and music genre preference are selected, the client-side device requests the corresponding video file. The audio is extracted from the video using FFmpeg [12] and then analyzed using the trained CNN model to classify the music genre. The baseline model then estimates the maximum pitch timestamp from the entire audio file using Crepe [19]. The timestamp information is obtained in seconds to determine the starting point in the video for generating the GIF, and the GIF is then generated from the video using FFmpeg [12]. Algorithm 2 shows the processing steps employed in the baseline method. The computation times of the proposed and baseline approaches are compared in the following experiments.

Algorithm 2
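
For comparison, a hedged sketch of the baseline flow of Algorithm 2 is given below; the file names and FFmpeg flags are assumptions, and the genre-classification step with the trained CNN is omitted for brevity.

```python
# Baseline: process the whole video and its full audio track, then cut a
# 3-second GIF at the maximum-pitch timestamp.
import subprocess
import numpy as np
import crepe
from pydub import AudioSegment

# 1) Extract the full audio track from the downloaded video.
subprocess.run(["ffmpeg", "-i", "video.mp4", "-vn",
                "-codec:a", "libmp3lame", "audio_full.mp3"], check=True)

# 2) Estimate the maximum-pitch timestamp over the entire audio file.
audio = AudioSegment.from_file("audio_full.mp3").set_channels(1)
samples = np.array(audio.get_array_of_samples()).astype(np.float32)
time, frequency, _, _ = crepe.predict(samples, audio.frame_rate, viterbi=True)
start = float(time[np.argmax(frequency)])

# 3) Create a 3-second GIF from the video starting at that timestamp.
subprocess.run(["ffmpeg", "-ss", str(start), "-t", "3",
                "-i", "video.mp4", "output.gif"], check=True)
```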

4.2 Experimental evaluation of MGC

This subsection evaluates current CNN methods and the proposed model on the GTZAN dataset. To the best of our knowledge, Senac et al. [36] have achieved the best previously reported performance on the GTZAN dataset. The proposed method performed 4.06% better in terms of validation accuracy, within 133.83 million floating-point operations per second. The experimental results of the compared methods and the proposed method on the GTZAN dataset are shown in Table 3. The proposed CNN model was used in all subsequent experiments to identify the music genre according to user interest.

Table 3 Performance comparison on GTZAN dataset

4.3 Performance analysis of the proposed method

This section compares the performance of the proposed GIF generation method with that of the baseline scheme described in Section 4.1.3. The performance evaluation was conducted on 16 videos with different playtimes (see Table 2 for details). The computation time for the baseline method comprises the time required to (i) download the video, (ii) extract the audio, (iii) identify the genre, (iv) estimate the pitch from the audio, and (v) generate the GIF from the video. The computation time for the proposed method comprises the time required to (i) download the audio corresponding to the video, (ii) identify the genre, (iii) estimate the pitch from the climax section, (iv) download the segment, and (v) generate the GIF from the segment. The model loading time is excluded from the computation time in all experiments.

In the first experiment, the computation times (in seconds) required to generate the animated GIF using the baseline and proposed methods were compared. Both approaches were configured on the HCR device (refer to Table 1 for its detailed specifications). The segment and climax section used to estimate the pitch in the proposed method are significantly smaller than the full video and audio used in the baseline method. Consequently, the overall computation time of the proposed method was significantly lower than that of the baseline method. Tables 4 and 5 show the computation times required for the HCR device to create a GIF using the baseline method and the proposed method, respectively.

Table 4 Computation time needed to create the GIF using the baseline method on HCR device
Table 5 Computation time needed to create the GIF using the proposed method on HCR device

Since this study focused on creating GIFs using the computational resources of the end-user device, in the next experiment, the proposed and baseline approaches were configured on the LCR device (i.e., an Nvidia Jetson TX2). Tables 6 and 7 show the computation times required to create a GIF on the LCR device using the baseline method and the proposed method, respectively. The overall computation times obtained using the proposed method were significantly lower than those obtained using the baseline method.

Table 6 Computation time needed to create the GIF using the baseline method on LCR device
Table 7 Computation time needed to create the GIF using the proposed method on LCR device

The combined duration of the 16 videos was 70 min. To generate the 16 corresponding GIFs on the HCR device, the baseline method required 11.27 min, whereas the proposed method required 6.01 min. Furthermore, on the LCR device, the baseline method required 77.67 min, whereas the proposed method required 20.64 min. Thus, based on the analysis of these 16 videos, on average, the proposed method was 1.87 times and 3.76 times faster than the baseline method on the HCR and LCR devices, respectively. In conclusion, these results show that the proposed method is more computationally efficient than the baseline method on both HCR and LCR devices.
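
The reported speedup factors follow directly from these totals:

$$ \frac{11.27}{6.01} \approx 1.87, \qquad \frac{77.67}{20.64} \approx 3.76 $$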

4.4 Qualitative evaluation

This section evaluates the quality of the GIFs generated using the proposed method by comparing them with GIFs obtained from YouTube and GIFs generated using the baseline approach. The evaluation was based on a survey conducted with 16 participants, who were undergraduate students recruited from our university. They were divided into four groups based on their music genre of interest. Each group of participants was shown three GIFs (i.e., YouTube, baseline, and proposed).

The survey was based on the 16 videos listed in Table 2. The quality of the generated GIFs was evaluated using the same rating scale, and the participants were asked to rate the GIFs according to the arousal aspect. The questionnaire was anonymized so that the participants could not determine which method (YouTube, baseline, or proposed) produced each GIF. They were asked to watch all the GIFs and rate them on a scale of 1–10 (1 being the worst and 10 being the best). Table 8 shows the ratings given by the participants for all three approaches. The average ratings over all 16 videos for the YouTube GIFs, the baseline method, and the proposed method were 6.6, 7, and 7.9, respectively. Figure 5 shows sample frames from the GIFs generated using each of the three approaches.

Fig. 5 Illustrations of the frame samples from the generated GIFs

Table 8 Average rating (1–10) by the participants

4.5 Discussion

The previous sections evaluated the overall effectiveness of the present study by comparing the proposed and baseline approaches. The proposed approach exhibited better performance and reduced computation time on both the HCR device and the LCR device (Nvidia Jetson TX2). Instead of processing the entire audio and video (as in the baseline), the proposed method used the climax section of the audio and a single video segment to generate an animated GIF. This is reflected in the experimental comparison with the baseline, in which the proposed approach was 3.76 and 1.87 times more computationally efficient on the LCR and HCR devices, respectively. This reduces the overall demand for computational resources and the computation time required to generate GIFs on end-user devices.

In the qualitative evaluation in Section 4.4, the proposed method received a higher overall average rating than the other approaches. One of the main reasons for the higher ratings is that the GIFs generated using the proposed method were created from the most exciting parts of the videos.

This study demonstrates the use of an acoustic feature in the GIF generation process while relying on the computational resources of the client device. Instead of processing the entire audio and video, the proposed method uses a small portion of the audio (the climax section) and a single video segment to generate the animated GIF, which makes it computationally efficient. One constraint of analyzing an acoustic feature is that the baseline and proposed methods may yield the same maximum pitch timestamp, and thus the same timestamp information.

The proposed framework is designed to support a wide range of end-user devices with diverse computational capabilities. Because of its simplicity and scalability across devices with various configurations [21], the proposed approach can be easily adapted to other animated image formats (e.g., WebP), recommendation techniques [25, 43], and streaming protocols such as Dynamic Adaptive Streaming over HTTP (DASH). In addition to reducing the required server computational resources, the proposed method can serve as a privacy-preserving solution using efficient encryption techniques [28, 30] and can be integrated into other client-driven solutions [26, 27]; moreover, it can be adapted to three-screen TV solutions [33, 34].

5 Conclusion

This paper proposes a novel, lightweight method for generating animated GIFs that uses end-user-device computational resources for the entire process. The proposed method analyzes the climax section of the audio file, estimates the maximum pitch timestamp, and obtains the corresponding video segment to generate the animated GIF. This improves computational efficiency and decreases the demand for communication and storage resources on resource-constrained devices. Extensive experimental results on a set of 16 videos show that the proposed approach is 3.76 times more computationally efficient than the baseline on an Nvidia Jetson TX2 and 1.87 times more computationally efficient than the baseline on the HCR device. Qualitative results show that the proposed method outperforms the other methods and receives higher overall ratings.