1 Introduction

Alignment of subtitles to audio is the task of producing accurate timings for the words within a subtitle track given a corresponding audio track. The task is of significant relevance in the media industry. It is required at several stages of producing television shows and motion pictures, for instance when adding the speech track to the general audio track, when generating dubbed versions in multiple languages [26], and when preparing captions for final broadcasting [42]. Semi–automated solutions exist, involving different degrees of human post–editing of the automatically generated results. Further automation would reduce production time and the expensive human labour costs involved in producing accurate timing labels.

Spoken language technologies, such as Automatic Speech Recognition (ASR), can provide tools for automating these processes. Thanks to deep learning techniques [18], recent improvements in empirical performance make ASR and related technologies suitable for a wide range of tasks in the multimedia domain, including the automatic generation of subtitles [1]. For the alignment task, a set of previously trained acoustic models is typically used first to decode the audio signal into time–marked strings. The Viterbi algorithm, a dynamic programming technique, is then used to pair the reference and decoded strings. In this way text can be aligned to audio quickly. However, the technique is unreliable for long audio files, so alternatives based on automatic speech segmentation and recognition have long been proposed [29]. Furthermore, the quality of such alignments degrades significantly when the subtitles deviate substantially from the actual spoken content, which is common in subtitling for media content.

For scenarios where transcriptions are incomplete, [5] and [20] propose the use of large background acoustic and language models, and [39] implements a method for sentence–level alignment based on grapheme acoustic models. When the transcript quality is very poor, [24] presents an alternative that improves lightly supervised decoding using phone–level mismatch information. To deal with complex audio conditions, [11] proposes adding audio markups to the audio file in order to facilitate the later alignment procedure. Other related works, such as [4], also consider situations where transcripts include a mixture of languages.

The main contribution of this paper is a fully automated system for aligning subtitles to audio using lightly supervised alignment. The lightly supervised alignment process involves running automatic speech recognition on the audio and then matching its output, which includes timings, to the subtitles. For this process to be effective, the ASR output should match the subtitle text as closely as possible, which is achieved through lightly supervised decoding, i.e., by biasing the language model towards the subtitles. In this work, both recurrent neural network language models (RNNLMs) and n–gram language models are biased to the subtitles. Whilst biasing n–gram language models to the subtitles is a known procedure, achieved by merging the n–gram counts, biasing RNNLMs to subtitles has not previously been explored for lightly supervised decoding and is a novelty proposed in this paper. Another contribution is an error correction algorithm that improves the correctness of word–level alignments and removes non–spoken words. The proposed system is made publicly available through the webASR web API, which is free to use for both industrial and academic users. By making this system available for research and demonstration purposes, this paper aims to encourage users operating in the field of broadcast media to investigate lightly supervised approaches to deal with subtitling requirements.

This paper is organised as follows: Section 2 discusses the alignment task and the different ways it can be performed and measured. Section 3 describes the proposed lightly supervised alignment system using lightly supervised decoding. Section 4 presents the experimental conditions and the results of the proposed system on the Multi–Genre Broadcast (MGB) challenge data [2]. Section 5 describes the deployment of this system through webASR and how applications can be built using the webASR API. Finally, Section 6 presents the conclusions of this work.

2 The alignment task

Many tasks in the speech technology domain have very clearly defined targets and measures of quality. For instance, in ASR the target is to produce the same sequence of words spoken in the audio, and this can be evaluated by direct comparison with a manual transcription, counting the number of correct and incorrect words. The alignment task, however, is open to multiple interpretations, and its assessment often relies on subjective measurements.

In general terms, there are two approaches to the alignment task. The first takes as input a list of word sequences and aims to provide, for each sequence, the start time and end time of the portion of the input audio to which it corresponds. With these time markings, the input audio is split into short segments, each linked to one of the word sequences. This approach is the most relevant to subtitling and closed captioning. Here, a perfect alignment requires the word sequences to be mapped to the audio even when they are not verbatim transcriptions and may contain paraphrases, deletions or insertions of words for subtitling reasons.

The second approach aims to produce time information at the word level, i.e., the precise time at which every given word is pronounced. In this case, either the word sequence to align is an exact verbatim transcription of the audio, or the alignment procedure must discard all words not actually pronounced, as they cannot be acoustically matched to any section of the audio. This approach is relevant when a finer time resolution is required, for instance in dubbing procedures.

How the quality of an alignment is measured differs depending on whether sequence–level or word–level alignment is required. For sequence–level alignments, it is possible to manually produce a ground truth labelling of where the segments should be aligned and then compare it with the automatically derived sequence boundaries. In applications such as subtitling, this manual ground truth depends on several subjective elements, such as the speed at which viewers can plausibly read the subtitles or the way sentences are paraphrased, which makes measuring the quality of the alignment very subjective.

For word–level alignments, objective measurements are more feasible, as the start time and end time of any given word in the audio can always be identified manually. In this case, the alignment task becomes a classification task where the target is to correctly determine the timings of each word in the ground truth. As with any classification task, it can then be measured in terms of the True Positive (TP) rate, the rate of words for which the times are correctly given; the False Negative (FN) rate, the rate of words for which no time or incorrect times are given; and the False Positive (FP) rate, the rate of words for which a time is given when no such word exists. From these values, standard classification metrics such as accuracy, precision, recall, sensitivity, specificity or F1 score can be computed.

Figure 1 visualises the differences between sequence–level and word–level alignment. In this example, the utterance “Not far from here is a modern shopping mall. There are all kinds of shops there.” is subtitled as “A modern shopping mall is nearby. There are all kinds of shops there.”. Looking at the first half of the subtitles, “A modern shopping mall is nearby.”, sequence–level alignment should align it to the utterance “Not far from here is a modern shopping mall.”, since they are paraphrases of each other. In word–level alignment, only the words actually spoken, “A modern shopping mall”, can be aligned. For the second half of the subtitles, “There are all kinds of shops there.”, sequence–level and word–level alignment output the same words, just with sequence or word timings as required.

Fig. 1

Sequence–level (top, blue) and word–level (bottom, yellow) alignment for the subtitle text “A modern shopping mall is nearby. There are all kinds of shops there.”. Actual utterance is “Not far from here is a modern shopping mall. There are all kinds of shops there”

3 Lightly supervised alignment system

The system proposed in this paper follows the concept of lightly supervised alignment, i.e., an alignment system in which the input subtitles are used to train lightly supervised models that inform the alignment procedure. The main building blocks required for this setup are speech segmentation, lightly supervised decoding and the alignment itself.

3.1 Speech segmentation

Speech segmentation is the process in which a large piece of audio, i.e. one of long duration, is split into many short utterances, normally delimited by speech pauses or non–speech areas. The process initially consists of a Voice Activity Detection (VAD) procedure, in which areas containing purely speech are identified in the audio signal. From these sections of pure speech, speech utterances are created by merging several of them into larger chunks. The final goal is to generate acoustically coherent segments of continuous speech that can then be used independently in downstream processes such as speech recognition.

VAD is a well studied problem in the speech community, for which several solutions have long been proposed [34]. Earlier approaches used acoustic properties of the speech signal to identify speech areas. The most basic VAD systems detect areas of higher energy, usually associated with speech, while more complex approaches perform an online estimation of the spectral characteristics of speech and non–speech areas to carry out this separation [35].

Statistical approaches to VAD have produced improved performance, including the use of Neural Networks (NNs) to learn the classification of speech and non–speech [10, 14]. Deep Neural Networks (DNNs) have provided further improvements in this task [37] and are the basis of the VAD proposed in this system. In this setup, the neural network is trained to classify each frame into one of two classes, one corresponding to speech being present and the other to speech not being present.

When the VAD is applied, a DNN provides an estimate of the posterior probability that each audio frame contains speech. Subsequently, a Hidden Markov Model (HMM), which takes as input the posteriors from the DNN, determines the optimal sequence of speech and non–speech chunks by considering the speech/non–speech likelihoods smoothed over multiple frames. The final output is a speech segmentation corresponding to segments of continuous speech.
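
The smoothing step just described can be illustrated with a minimal sketch: a two–state (speech/non–speech) HMM decoded with the Viterbi algorithm over per–frame DNN posteriors. The self–transition probability and the DNN posteriors are placeholders for illustration, not the values used in the actual system.

```python
import numpy as np

def smooth_vad(posteriors, p_stay=0.99):
    """Viterbi decoding of a 2-state (0=non-speech, 1=speech) HMM.

    posteriors: array of shape (T, 2) with per-frame DNN posteriors.
    p_stay: self-transition probability; a high value discourages
            rapid state switching and yields longer, smoother segments.
    """
    T = posteriors.shape[0]
    log_trans = np.log([[p_stay, 1 - p_stay],
                        [1 - p_stay, p_stay]])
    log_obs = np.log(posteriors + 1e-10)

    delta = np.zeros((T, 2))             # best log-score ending in each state
    back = np.zeros((T, 2), dtype=int)   # backpointers
    delta[0] = log_obs[0]
    for t in range(1, T):
        for s in (0, 1):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + log_obs[t, s]

    # Backtrace the most likely speech/non-speech label sequence
    labels = np.zeros(T, dtype=int)
    labels[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        labels[t] = back[t + 1, labels[t + 1]]
    return labels
```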

Figure 2 provides an example of the speech segmentation process on a 25-second audio clip. The red chunks correspond to areas of speech as detected by the VAD, which has identified 8 speech areas. The green chunks are the speech segments obtained after agglomerating the VAD segments. In this case, speech segments with short pauses in between are merged into a single segment, resulting in only 4 output segments from the initial 8.
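
The agglomeration illustrated in Fig. 2 can be sketched as below; the 0.5–second pause threshold is an illustrative value, not the one used in the deployed system.

```python
def merge_segments(segments, max_pause=0.5):
    """Merge VAD segments (start, end), in seconds, separated by pauses
    shorter than max_pause into longer speech segments."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < max_pause:
            merged[-1] = (merged[-1][0], end)   # extend the previous segment
        else:
            merged.append((start, end))
    return merged

# e.g. 8 VAD chunks may collapse into 4 output segments, as in Fig. 2
```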

Fig. 2

Example of speech segmentation process

3.2 Lightly supervised decoding

Once a set of speech segments has been identified, decoding is the process in which ASR is run to provide a hypothesis transcript for each segment. The decoding process proposed in this system employs a 3–stage procedure based on a standard setup using the Kaldi toolkit [33]. First, it performs decoding using a set of previously trained hybrid DNN–HMM acoustic models [19] and a Weighted Finite State Transducer (WFST) compiled from a previously trained 3–gram language model. This generates a set of lattices, which are then rescored using a 4–gram language model. Finally, the 25–best hypotheses for each segment are rescored again using a Recurrent Neural Network Language Model (RNNLM) [7, 27]. The hypothesis with the best final score after this process is given as the output for each segment.
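
The final N–best rescoring stage can be sketched as follows: each of the 25 hypotheses keeps its acoustic score, its language model score is recomputed with the RNNLM, and the best–scoring hypothesis is returned. The `rnnlm_score` function and the language model weight are illustrative assumptions, not components of the actual system.

```python
def rescore_nbest(nbest, rnnlm_score, lm_weight=12.0):
    """nbest: list of (words, acoustic_logprob, ngram_logprob) tuples for
    one segment. rnnlm_score: callable returning an RNNLM log-probability
    for a word sequence (assumed to exist; not defined in this sketch).
    Returns the word sequence with the best combined score."""
    def combined(hyp):
        words, am_logprob, _ngram_logprob = hyp
        return am_logprob + lm_weight * rnnlm_score(words)
    return max(nbest, key=combined)[0]
```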

Lightly supervised decoding uses the input subtitles that are to be aligned to the audio to adapt the language model components of the decoding system described above, in order to improve the quality of the decoding hypotheses. Figure 3 shows the block diagram of the proposed system. On the left–hand side of the diagram, the decoder works in the 3 stages described above: decoding, lattice rescoring and N–best RNNLM rescoring. On the right–hand side, the lightly supervised procedure is depicted.

Fig. 3

Block diagram of lightly supervised decoding system

First, the subtitles are tokenised, i.e., the text is normalised according to a set of constraints used by the decoding procedure. This includes capitalisation, punctuation removal, numeral–to–text conversion and acronym expansion. For instance, the subtitle text “Today is 14th of July, you are watching BBC1.” is converted to “TODAY IS FOURTEENTH OF JULY YOU ARE WATCHING B. B. C. ONE”.
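
A minimal sketch of such a tokeniser is given below. The numeral and acronym handling is deliberately simplistic (toy lookup tables instead of a full numeral–to–text converter and a curated acronym list) and is not the normalisation used in the actual system.

```python
import re

ORDINALS = {"1st": "FIRST", "2nd": "SECOND", "14th": "FOURTEENTH"}  # toy lookup
DIGITS = {"1": "ONE", "2": "TWO"}                                    # toy lookup

def tokenise(text):
    """Normalise subtitle text for decoding: uppercase, strip punctuation,
    expand numerals and spell out acronyms such as BBC1 -> B. B. C. ONE."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9']+", text):
        if word in ORDINALS:
            tokens.append(ORDINALS[word])
        elif re.fullmatch(r"[A-Z]{2,}\d*", word):          # acronym, maybe with digits
            letters = [c + "." for c in word if c.isalpha()]
            digits = [DIGITS.get(c, c) for c in word if c.isdigit()]
            tokens.extend(letters + digits)
        elif word.isdigit():
            tokens.append(DIGITS.get(word, word))
        else:
            tokens.append(word.upper())
    return " ".join(tokens)

print(tokenise("Today is 14th of July, you are watching BBC1."))
# TODAY IS FOURTEENTH OF JULY YOU ARE WATCHING B. B. C. ONE
```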

After tokenisation, the decoding lexicon, which contains the set of words that can be recognised, is expanded with out–of–vocabulary (OOV) words, i.e., words from the subtitles that are not covered by the decoding lexicon. For each such word, a phonetic transcription is either extracted from a carefully crafted dictionary or generated by automatic phonetisation. In the proposed system, the Combilex dictionary [36] is used to extract manual pronunciations and the Phonetisaurus toolkit [30] is used to derive new pronunciations for words not covered by Combilex.
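
The lexicon expansion logic can be sketched as follows, assuming a `combilex` dictionary mapping words to pronunciations and a `g2p` callable wrapping an automatic phonetiser such as a Phonetisaurus model; both names are assumptions for illustration.

```python
def expand_lexicon(lexicon, subtitle_words, combilex, g2p):
    """Add pronunciations for subtitle words missing from the decoding lexicon.

    lexicon: dict word -> list of pronunciations used by the decoder.
    combilex: dict word -> pronunciation from the Combilex dictionary.
    g2p: callable word -> pronunciation (e.g. a Phonetisaurus wrapper).
    """
    for word in set(subtitle_words):
        if word in lexicon:
            continue                              # already covered
        if word in combilex:
            lexicon[word] = [combilex[word]]      # manual pronunciation
        else:
            lexicon[word] = [g2p(word)]           # automatic phonetisation
    return lexicon
```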

The next step is n–gram language model (LM) interpolation between a previously trained baseline n–gram LM, which provides full coverage of the target language, and an n–gram LM trained exclusively on the subtitles. In this work this is achieved using the SRILM toolkit [40], and it biases the decoding and lattice rescoring towards hypotheses that are closer to the words and language used in the subtitles. Such interpolation of n–grams [21] has been shown to improve accuracy in ASR systems when a large out–of–domain n–gram model is interpolated with a smaller in–domain n–gram model.
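
Conceptually, the interpolated LM assigns to each word, given its history, a weighted mixture of the background and subtitle–specific probabilities, as in the sketch below. The interpolation weight is a placeholder; in practice the merging is carried out with SRILM rather than in Python.

```python
def interpolate_lm(p_background, p_subtitles, lam=0.5):
    """Return an interpolated LM probability function.

    p_background, p_subtitles: callables (word, history) -> probability.
    lam: interpolation weight given to the subtitle LM (placeholder value).
    """
    def p_interp(word, history):
        return (lam * p_subtitles(word, history)
                + (1 - lam) * p_background(word, history))
    return p_interp
```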

Finally, a previously trained baseline RNNLM is fine–tuned [6, 9] on the subtitles, i.e., the RNNLM is further trained using the subtitle text as input for a given number of iterations, moving the model closer to the linguistic space of the subtitles. Fine–tuning of RNNLMs has also been shown to produce better accuracies in the ASR task [9, 41]. Once the adapted n–grams and RNNLMs are trained, they are used in the decoding procedure instead of the baseline language models, i.e., the n–grams/WFSTs in decoding and lattice rescoring and the RNNLM in N–best rescoring.
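
Fine–tuning amounts to a few additional training passes over the tokenised subtitle text with a small learning rate, starting from the baseline RNNLM weights. The sketch below assumes a PyTorch model that maps a batch of input token ids to per–token logits; the model, batching and hyper–parameters are illustrative assumptions, not those of the deployed system.

```python
import torch
import torch.nn.functional as F

def finetune_rnnlm(model, subtitle_batches, epochs=3, lr=1e-4):
    """Continue training a baseline RNNLM on subtitle token id batches.

    subtitle_batches: iterable of (inputs, targets) LongTensors of shape
    (batch, seq_len), where targets are the inputs shifted by one position.
    """
    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for inputs, targets in subtitle_batches:
            optimiser.zero_grad()
            logits = model(inputs)                     # (batch, seq_len, vocab)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1))
            loss.backward()
            optimiser.step()
    return model
```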

3.3 Alignment

Finally, once an ASR hypothesis transcript is available for the audio file, a Dynamic Time Warping (DTW) alignment is performed between the hypothesis and the input subtitles. The aim of this alignment is to assign words in the subtitles to segments in the hypothesis. It proceeds in several stages. First, sequences of words from the hypothesis and the subtitles with high matching content are paired together. For each of these matching pairs, the timing of the subtitle words is derived from that of the corresponding ASR hypothesis. When all the best matches have been found, the residual subtitle words not yet matched are assigned timings that fill the time gaps left by the previous matching.
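
A simplified stand–in for the matching step can be written with Python's difflib, which finds long common blocks between the ASR hypothesis and the tokenised subtitles and lets the matched subtitle words inherit the hypothesis word timings. This is only a rough sketch of the idea, not the DTW procedure used in the system, and it leaves the gap–filling of unmatched words out.

```python
from difflib import SequenceMatcher

def match_subtitles(hyp_words, hyp_times, sub_words):
    """hyp_words: ASR hypothesis words; hyp_times: (start, end) per word;
    sub_words: tokenised subtitle words.
    Returns a list of (subtitle_index, start, end) for matched subtitle words."""
    matcher = SequenceMatcher(None, hyp_words, sub_words, autojunk=False)
    aligned = []
    for hyp_i, sub_i, size in matcher.get_matching_blocks():
        for k in range(size):
            start, end = hyp_times[hyp_i + k]
            aligned.append((sub_i + k, start, end))
    return aligned
```

Subtitle words left unmatched by this step would then be assigned to the remaining time gaps, as described above.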

Table 1 presents an example of this procedure on a 34–second audio clip. The speech segmentation identifies two segments, from 1.47 to 17.59 seconds and from 21.96 to 34.15 seconds, and the lightly supervised decoding gives the hypotheses presented in rows 2 and 3 of the table. The original and tokenised subtitles for this clip are shown in rows 4 and 5. The output of the lightly supervised alignment system is then presented, which gives 3 segments, different from the original ones. The first segment matches the subtitles in the range “Justice, wombats ... one, go!” with the hypothesis in the range “JUSTICE WOMBATS ... ONE GO”. The first word of the hypothesis (“PENDULUM”) is deleted as it does not match the subtitles. The next match occurs between the subtitles in “looking forward ... Chipmunk. Chipmunk.” and the hypothesis in “LOOKING FORWARD ... CHIPMUNK CHIPMUNK”. The remaining subtitle words cannot be matched with any remaining hypothesis, so they are assigned to a new intermediate segment covering “I’m Anthony. Who are you”.

Table 1 Lightly supervised alignment example

From here on, the system provides word–level time information by performing Viterbi forced alignment of the words in each segment. At this step, some segments may be dropped from the output if the alignment procedure cannot find an acoustic match in the audio, resulting in the loss of some words in the output. Conversely, words that appear in the subtitles but are not pronounced in the audio will usually still appear in the output. To improve the correctness of word–level alignments and remove non–spoken words, an algorithm [31] has been proposed to find such cases and remove them. This algorithm uses a previously trained binary regression tree to identify these words based on acoustic properties of each aligned word, such as duration or confidence measure.
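
The error correction step of [31] learns a binary regression tree over per–word acoustic features; the sketch below replaces the learned tree with hand–picked thresholds on duration and confidence purely to illustrate the kind of decision being made. The threshold values are assumptions, not those learned in [31].

```python
def filter_aligned_words(words, min_confidence=0.5, min_duration=0.04):
    """words: list of dicts with 'word', 'start', 'end' (seconds), 'confidence'.
    Drop words whose acoustic evidence suggests they were never spoken.
    Thresholds are illustrative; [31] learns the decision rule from data."""
    kept = []
    for w in words:
        duration = w["end"] - w["start"]
        if w["confidence"] >= min_confidence and duration >= min_duration:
            kept.append(w)
    return kept
```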

The alignment procedure generates its output in tokenised form, so in order to recover the original textual form of the subtitles a re–normalisation procedure is performed to restore punctuation, casing, numerals and acronyms to their initial form. This is straightforward because a hash table linking each original word in the subtitles to one or more tokens in the normalised form is generated during tokenisation.
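
A sketch of this de–tokenisation is shown below, under the assumption that tokenisation recorded, for each original word, the list of tokens it produced, and that the aligned output lists every token in order with timings set to None for tokens dropped by the alignment. These data structures are assumptions for illustration.

```python
def detokenise(aligned_tokens, token_map):
    """aligned_tokens: (token, start, end) for every token in order, with
    start/end set to None when the token was dropped by the alignment.
    token_map: list of (original_word, [tokens]) recorded at tokenisation time.
    Returns (original_word, start, end) spans for words whose tokens survived."""
    output, i = [], 0
    for original, tokens in token_map:
        span = [t for t in aligned_tokens[i:i + len(tokens)] if t[1] is not None]
        if span:
            output.append((original, span[0][1], span[-1][2]))
        i += len(tokens)
    return output
```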

4 Experiments and results

The experimental setup in which the proposed system was evaluated was based on Task 2 of the MGB challenge 2015 [3]. This task was defined as “Alignment of broadcast audio to a subtitle file” and was one of the four core tasks of the challenge. The MGB challenge aimed to evaluate and improve several speech technology tasks in the area of media broadcasts, extending the work of previous evaluations such as Hub4 [32], TDT [8], Ester [12], Albayzin [45] and MediaEval [23]. MGB was the first evaluation campaign in the media domain to propose lightly supervised alignment of broadcasts as a main task.

The focus of the MGB challenge was on multi–genre data. Most previous work on broadcast audio has focused on broadcast news and similar content; however, the performance achieved on such data degrades dramatically in the presence of more complex genres. The MGB challenge thus defined 8 broadcast genres: advice, children’s, comedy, competition, documentary, drama, events and news.

4.1 Experimental setup

The experimental data provided in the MGB challenge 2015 consisted of more than 1,600 hours of television shows broadcast on the BBC during April and May 2008. It was divided into training, development and evaluation sets as shown in Table 2.

Table 2 Datasets in the MGB challenge

The only transcriptions available for the 1,200 hours of training speech were the original BBC subtitles, aligned to the audio data using a lightly supervised approach [25]. No other audio data could be used to train acoustic models according to the evaluation conditions. More than 650 million words of BBC subtitles, from the 1970s to 2008, were also provided for language model training. As with the acoustic model training data, no other linguistic materials could be used for training language models. For building lexical models, a version of the Combilex dictionary [36] was distributed and was the only available source for developing the lexicon.

The system used for decoding [38] was based on acoustic models trained on 700 hours of speech, extracted from the available 1,200 hours using a segment–level confidence measure based on posterior estimates obtained with a DNN [46], in order to select only segments with an accurate transcription. A 6–hidden–layer DNN with 2,048 neurons was trained using Deep Belief Network (DBN) pretraining and then fine–tuned first with the Cross–Entropy (CE) criterion, followed by the state–level Minimum Bayes Risk (sMBR) criterion. The input to the DNN is 15 spliced Perceptual Linear Prediction (PLP) acoustic frames. The vocabulary used for decoding was a set of 50,000 common words from the linguistic training data, and n–gram language models were trained from the available linguistic resources and then converted to WFSTs. For the rescoring, an RNNLM was also trained using the available language training data.

For speech segmentation, two strategies based on 2–hidden–layer DNNs were explored [28]. In the first, all the available acoustic training data was separated into speech and non–speech segments, providing 800 hours of speech and 800 hours of non–speech for training the DNN with the CE criterion. This is referred to as DNN VAD 1. In the second strategy, data selection was applied to yield 400 hours of audio, with 300 hours of speech and 100 hours of non–speech content. Identical training was performed on this carefully selected data set to give a second 2–hidden–layer DNN (DNN VAD 2).

4.2 Results

Task 2 of the MGB challenge was a word–level alignment task and was therefore evaluated as a classification task. The evaluation metrics were the precision and recall of the system, with the final metric being the F1 score (or F–measure). Precision is the number of TPs divided by the total number of words in the system output, which is the sum of TPs and FPs. Recall is the number of TPs divided by the total number of words in the reference, which is the sum of TPs and FNs. From these two measures, the F1 score is computed as the harmonic mean of precision and recall:

$$ F1 = 2\,\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}} = \frac{2\,TP}{2\,TP+FN+FP} $$
(1)

For scoring purposes in the MGB challenge, a word is considered correct if its output start and end times are each less than 100 milliseconds away from the ground truth start and end times of that word. A set of experiments was performed on the MGB development set to investigate which setup of the proposed lightly supervised alignment system yields the best F1 scores. Table 3 presents the results for four different configurations. The first two rows show the differences between an unadapted decoding system using the two speech segmentation strategies, DNN VAD 1 and DNN VAD 2. The use of DNN VAD 2 leads to an increase in Segmentation Error Rate (SER), but reduces the Word Error Rate (WER) and improves the F1 score by a significant 1%. This is because DNN VAD 2 misses only 1.2% of speech frames, which helps the alignment procedure to identify matches between the hypothesis and the subtitles. Lightly supervised decoding using n–gram adaptation reduces the WER to 24.9% and increases the F1 score to 90.3%. Finally, RNNLM fine–tuning provides an extra 1.7% reduction in WER and a 0.2% increase in F1 score.
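
The word–level scoring rule and (1) can be re–implemented approximately as follows. This is an illustrative sketch, not the official MGB scoring script, and it assumes that reference and system words have already been paired.

```python
def word_correct(ref, hyp, tol=0.1):
    """A system word counts as a true positive if both its start and end
    times are within tol seconds (100 ms) of the reference word's times."""
    return (abs(ref["start"] - hyp["start"]) <= tol and
            abs(ref["end"] - hyp["end"]) <= tol)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, as in (1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
    # equivalently: 2 * tp / (2 * tp + fn + fp)
```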

Table 3 Results in the MGB development set

The final proposed system, achieving a 90.5% F1 score on the MGB development set, was then run on the MGB evaluation set, where it achieved an 88.8% F1 score. Table 4 compares this result with other previously reported systems. Among the systems officially submitted to the MGB challenge and reported in [3], the proposed system would achieve second place, only 0.5% below the University of Cambridge system, and it substantially improves on the original submission by the University of Sheffield.

Table 4 Results in the MGB evaluation set

5 System deployment

The automatic alignment system described in this paper has been made available through webASR. webASR was set up as a free cloud–based speech recognition engine in 2006 [15, 17, 43, 44] and was redeveloped in 2016 as a multi–purpose speech technology engine [16]. It allows research and non–commercial users to freely run several speech technology tasks, including automatic speech recognition, speech segmentation, speaker diarisation, spoken language translation and lightly supervised alignment. It runs as a cloud service on servers located at the University of Sheffield, using the Resource Optimization ToolKit (ROTK) [13] as its backend. ROTK is a workflow engine developed at the University of Sheffield that allows very complex systems to run as a set of smaller asynchronous tasks through job scheduling software in a grid computing environment.

The web interface of webASR allows new users to register for free and, once registered, to submit their audio files to one of the available tasks. Once the processing in the backend is finished, the users can retrieve the result files directly from their accounts in webASR. As processes run asynchronously, users can run multiple systems at the same time and wait for the results of each one as they happen.

To facilitate the building of applications on top of the webASR cloud service, an API was implemented using the Django web framework. Figure 4 depicts the integration of an ASR backend system into webASR using the API. The API acts as a wrapper around the backend system and handles all post and query requests from the user. Taking the alignment system as an example, a user first submits an audio file and an untimed subtitle file to webASR through a POST command (http://webasr.org/newupload). This triggers webASR to connect to the ASR backend and run the alignment system in ROTK. The user can poll the status of the system through a GET command (http://webasr.org/getstatus), which returns whether or not the backend has finished processing the file. When ROTK finishes, it updates webASR with the outcome. At that point, the user can use a final GET command (http://webasr.org/getfile) to retrieve a set of files containing the aligned subtitles in PDF, XML, TTML, WebVTT and SRT formats.
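
A minimal client for this workflow, using the Python requests library, might look as follows. Only the three endpoint URLs are taken from the description above; the form field names, credential handling and response payloads are assumptions for illustration and would need to be adapted to the actual API documentation.

```python
import time
import requests

def align_subtitles(audio_path, subtitle_path, credentials):
    """Submit an audio file and untimed subtitles to webASR and poll until
    the aligned subtitle files are ready. Field names are illustrative."""
    with open(audio_path, "rb") as audio, open(subtitle_path, "rb") as subs:
        resp = requests.post("http://webasr.org/newupload",
                             files={"audio": audio, "subtitles": subs},
                             data=credentials)
    upload_id = resp.json()["id"]              # assumed response format

    while True:                                # poll until the backend finishes
        status = requests.get("http://webasr.org/getstatus",
                              params={"id": upload_id, **credentials}).json()
        if status["finished"]:                 # assumed response format
            break
        time.sleep(30)

    return requests.get("http://webasr.org/getfile",
                        params={"id": upload_id, **credentials}).content
```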

Fig. 4

Use of the webASR API for lightly supervised alignment of subtitles

6 Conclusions

This paper has presented a lightly supervised system for aligning subtitles to broadcast audio. A thorough description of the required steps has been given, from speech segmentation and lightly supervised decoding to text alignment. The results show that a minimal rate of missed speech in the upstream speech segmentation is essential for downstream performance improvements.

In terms of the methodologies proposed in this work, and in contrast to the other lightly supervised alignment systems proposed for the MGB challenge 2015 whose results are given in Table 4, two main novelties must be noted. The first is the use of RNNLM adaptation, achieved by fine–tuning the RNNLM on the subtitle text in order to bias it towards the subtitles, which was shown both to reduce recognition errors and to improve the accuracy of the alignment output. The second is the use of the error correction algorithm proposed by the same authors [31], which improves the correctness of word–level alignments and removes non–spoken words, using a binary regression tree to identify these words based on acoustic values such as duration and confidence measures.

From the point of view of lightly supervised decoding, the experiments have shown how RNNLM rescoring helps not only to reduce recognition errors but, more importantly, to improve the accuracy of the alignment output. Adaptation of the n–grams and RNNLMs produces a significant reduction in recognition errors and an associated increase in alignment accuracy. In general, the lightly supervised approach has been shown to significantly improve the outcome of the alignment task.

The current state of the art for the MGB challenge alignment task is the University of Cambridge lightly supervised alignment system [22]. Its lightly supervised decoding and alignment steps are very similar to those presented in this paper, except for the two novel contributions detailed above. The Cambridge system produced the best results in the challenge because its audio segmentation and lightly supervised decoding components were better, making use of enhanced Deep Neural Network (DNN) architectures. In this work, the improvements proposed for both the lightly supervised decoding and alignment stages help achieve results close to the state of the art.

The proposed alignment system achieves F1 scores of 90.5% and 88.8% on the development and evaluation sets, respectively, of Task 2 of the MGB challenge. The evaluation result is the third best reported on this setup and would achieve second place in the official challenge results, behind the Cambridge system [22]. To improve these results, even larger improvements in the acoustic and language modelling of the lightly supervised decoding stage would be necessary. While the presented system achieves 23.2% WER on the development set, reducing this error rate to below 20% would be expected to increase the F1 score further, bringing it closer to the best reported result of 90.0% on the evaluation set.

To facilitate the use of this system for research and non–commercial purposes, it has been implemented in webASR. Through its API, webASR allows easy integration into any given workflow. Given an audio file and its corresponding untimed subtitles, timed subtitles are produced that can be used in further processing. This can greatly facilitate work on subtitling, closed captioning and dubbing.

7 Data access management

All the data related to the MGB challenge, including audio files, subtitle text and scoring scripts, is available via a special license with the BBC at http://www.mgb-challenge.org/. All recognition outputs and scoring results are available at https://doi.org/10.15131/shef.data.3495854.