SubtitleFormatter: Making Subtitles Easier to Read for Deaf and Hard of Hearing Viewers on Personal Devices
For deaf or hard of hearing (DHH) viewers who cannot understand speech, many countries require video producers and distributors to provide speech-to-text over the video. This text, also called subtitles, can be turned on or off by the viewer and must comply with national subtitle quality standards. The growth in video-capable personal devices has shifted viewers away from watching broadcast video on a standardized television display and towards watching video on interactive personal devices. However, personal devices range widely from tiny watch displays to enormous television displays, with different proportions that affect subtitle readability. SubtitleFormatter automatically formats subtitles according to a display’s screen size and minimum font size for reading. A user study of subtitle formatting evaluates subtitle readability and finds that viewers preferred SubtitleFormatter-segmented subtitles over wrap-around (arbitrarily formatted) subtitles.
Keywords: Subtitles, Speech-to-text, Deaf or hard of hearing
Hearing loss is an invisible but significant barrier in daily life, including education and streaming television. In addition to approximately 2% of people born deaf or hard of hearing, approximately 31% of people over 65 have significant hearing loss. Many rely on subtitles to access and enjoy videos.
Videos are a vital part of our shared cultural experience and shape our identity as citizens. Subtitles provide accessibility to these individuals so that they are not shut out of society and culture. Subtitling laws and policies in many countries, including the United States, United Kingdom, Brazil and India, guarantee some access to video programming [2, 3].
In the United States, pre-prepared subtitles were included in national TV shows from 1973, and real-time subtitles were included in television shows from 1982. The timeline for introducing subtitles in television was similar in most developed countries. For over 30 years, subtitling standards were developed for an average DHH viewer who watched analog broadcasts on non-interactive, fixed-format television displays. The television screen resolution and proportions were set for standard resolution (e.g., NTSC: 720 by 480 pixels or PAL: 720 by 576 pixels with an aspect ratio of 4:3). This resolution and aspect ratio remained unchanged until the advent of digital broadcasts in the 2000s.
2.1 Subtitled Videos on Personal Devices
The advent of digital television on personal displays led to far more diversity in resolution, size and proportions. Viewers consume video programming on personal devices with varying resolutions and aspect ratios. This diversity of personal display characteristics can influence a viewer’s preferred size and number of caption lines. Although viewers can view videos anywhere, at any time, and are no longer tethered to their couches, it becomes harder to fit subtitles on the widely varying sizes and proportions of personal devices. Viewing devices range from tiny smart watch displays to enormous television displays. While a substantial body of research, built over 30 years, has focused on a standardized speech-to-text presentation for all users watching television programs, there is scant research on adapting subtitles to the display characteristics.
Currently, most video providers follow television captioning standards. When they do add features to their speech-to-text caption interfaces, these features create additional problems. For example, if the interface offers resizable fonts and the font is made bigger, the caption lines become too wide for the video display and wrap around, which disrupts the reading process. To the best of our knowledge, no video platform provides a feature to reformat subtitles depending on screen size or user preferences.
3 Related Work
We focus on improving two parts of the closed caption reading process – the cognitive process of reading subtitles, and the process of segmenting and formatting subtitles to fit the video display.
3.1 Cognitive Process of Subtitle Reading
Prior research has shown that the cognitive process of reading subtitles, which are a form of real-time speech-to-text, is different from reading static text or print. Speech-to-text is short and changes regularly, while print is long, formatted and does not change. Reading subtitles often takes more time and energy than listening to spoken language or watching signed language, and those watching subtitles must often split their attention between the subtitles and other visual information, such as whatever is happening on the TV screen.
Research on automatic caption segmentation suggests it can have a positive impact, but the evidence is not conclusive. Perego et al. found that segmenting subtitles in inappropriate places had no impact on sentence recall or eye movement. However, Rajendran et al. found a significant difference in eye movement for different kinds of subtitle segmentation. Waller and Kushalnagar suggest that segmentation may have an impact on our memory of the text.
4 SubtitleFormatter Design and Evaluation
We created a linguistically aware automatic formatting system, called SubtitleFormatter. The system automatically formats subtitles by parsing the text to break the text at linguistically appropriate places.
We conducted a user study to verify the utility of the system’s parsing and breaking of subtitles. We compared the readability of unparsed subtitles versus human-parsed subtitles generated by professional closed caption stenographers, on both regular television screens and small phone screens. We also compared the readability of unparsed subtitles versus automatically parsed subtitles on both regular and small phone screens, to investigate whether their readability can reach the level and quality of human-parsed subtitles generated by professional closed caption stenographers.
The SubtitleFormatter system analyzes the subtitles and the display so that it can format the subtitles according to the display. It has two parts – a linguistic analyzer and a display analyzer.
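As an illustration of what a display analyzer might compute, the following minimal sketch estimates how many characters fit on one subtitle line for a given screen width and font size. It is not the system's actual implementation: the function name, the average glyph width of half the font height, and the 10% side margins are all assumptions chosen for the example.

```python
import math


def max_chars_per_line(screen_width_px: int,
                       font_px: float,
                       glyph_width_ratio: float = 0.5,
                       side_margin: float = 0.1) -> int:
    """Estimate the number of characters that fit on one subtitle line.

    Hypothetical model, not taken from the paper:
    - glyph_width_ratio: assumed average glyph width as a fraction of font height
    - side_margin: assumed fraction of the width kept empty on each side
    """
    if screen_width_px <= 0 or font_px <= 0:
        raise ValueError("screen width and font size must be positive")
    usable_px = screen_width_px * (1.0 - 2.0 * side_margin)
    avg_glyph_px = font_px * glyph_width_ratio
    # At least one character per line, even on a very narrow display.
    return max(1, math.floor(usable_px / avg_glyph_px))


# The two display widths used in the study, with the same (assumed) font height,
# so the text is proportionately larger on the phone.
tv_limit = max_chars_per_line(1920, font_px=54)
phone_limit = max_chars_per_line(1136, font_px=54)
print(tv_limit, phone_limit)
```

Under these assumptions the phone allows markedly fewer characters per line than the television, which is why the same subtitle text needs different line breaks on each display.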
4.2 Linguistic Analyzer
The linguistic analyzer breaks down the sentence into phrases, which in turn are broken down into smaller and smaller phrases down to the level of words. The SubtitleFormatter system uses this information to identify optimal break points.
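The idea of choosing break points from a phrase structure can be sketched as follows. This is an illustrative approximation, not SubtitleFormatter's algorithm: the parse tree is written by hand as nested tuples rather than produced by a parser, and the cost function (squared depth of the smallest phrase that a break splits) is an assumption made for the example.

```python
def leaves_with_paths(tree, path=()):
    """Yield (word, path), where path holds the child indices from the root."""
    if isinstance(tree, str):
        yield tree, path
        return
    _label, *children = tree
    for i, child in enumerate(children):
        yield from leaves_with_paths(child, path + (i,))


def gap_depths(tree):
    """Return the words and, for each gap between adjacent words, the depth of
    the smallest phrase containing both. A deeper phrase is a worse place to break."""
    items = list(leaves_with_paths(tree))
    depths = []
    for (_, p), (_, q) in zip(items, items[1:]):
        common = 0
        for a, b in zip(p, q):
            if a != b:
                break
            common += 1
        depths.append(common)
    return [word for word, _ in items], depths


def segment(tree, max_chars):
    """Split the sentence into lines of at most max_chars characters, minimizing
    (sum of squared break depths, number of lines). Hypothetical cost function."""
    words, depths = gap_depths(tree)
    n = len(words)
    # best[i] = (break cost, line count, end of first line) for words[i:]
    best = [None] * (n + 1)
    best[n] = (0, 0, n)
    for start in range(n - 1, -1, -1):
        length = -1
        for end in range(start + 1, n + 1):
            length += 1 + len(words[end - 1])
            # A single overlong word is still allowed on a line of its own.
            if length > max_chars and end > start + 1:
                break
            penalty = depths[end - 1] ** 2 if end < n else 0
            cost = (best[end][0] + penalty, best[end][1] + 1, end)
            if best[start] is None or cost < best[start]:
                best[start] = cost
    lines, start = [], 0
    while start < n:
        end = best[start][2]
        lines.append(" ".join(words[start:end]))
        start = end
    return lines


# Hand-built example tree (assumption): S -> NP VP, VP -> verb NP PP, PP -> prep NP
example_tree = (
    "S",
    ("NP", "The", "young", "viewers"),
    ("VP", "preferred",
     ("NP", "the", "segmented", "subtitles"),
     ("PP", "on", ("NP", "small", "screens"))),
)

for line in segment(example_tree, max_chars=28):
    print(line)
```

With a 28-character limit, the sketch keeps "on small screens" together on the last line, whereas a greedy wrap at the same width would leave "on" stranded at the end of the second line.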
To evaluate caption segmentation and formatting, we evaluated the difference in readability between unparsed subtitles (A), human-parsed (H), and SubtitleFormatter-parsed (P) subtitles on a regular television display and on a personal phone.
We recruited 34 deaf and hard of hearing participants. All participants regularly use captions when watching online videos, TV, and other audio-video content. The participants ranged in age from 20 to 48 years old: 20 men, and 14 women. By ethnicity, 18 participants identified as white, 5 identified as black, 5 identified as Asian or Asian-American, 4 identified as Hispanic and 2 as multiracial.
Each participant watched six 4-minute videos with A, H or P subtitles, with breaks in between. The survey and videos were shown either on a 40-inch television set or on an iPhone 5s. The entire experiment took 30–45 min. Half of the participants viewed the first three videos on a 40-inch high-definition (1920 × 1080) television display and the next three videos on a 4-inch iPhone 5 (1136 × 640) display, and the other half watched in the reverse order. Each viewer watched all six videos in a balanced, randomized order. We gathered data from all participants through a Likert rating questionnaire, a comprehension questionnaire, a sentence completion task, and eye tracking. After each video, the participants were given a sentence completion task in which they were presented with the beginning part of a sentence from the text and asked to complete it. Afterwards the researcher explained the purpose of the study, and the participant was invited to add a comment on either the study or the subtitles.
The results of the evaluations were grouped into Likert ratings and comprehension scores, the latter including sentence completion scores. The Shapiro-Wilk test indicated that the observed values were not normally distributed, so non-parametric testing was done. The Wilcoxon signed-rank test was used to perform post-hoc comparisons between pairs of samples, with Bonferroni corrections to address the multiple comparisons.
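For readers unfamiliar with these procedures, the sketch below shows how a Wilcoxon signed-rank statistic and a Bonferroni-adjusted significance threshold can be computed for paired ratings. The function names and the ratings are invented for illustration and are not data from this study; the sketch computes only the rank sums, and in practice a statistics package would supply the p-values.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus), the signed-rank sums for paired samples.

    Zero differences are dropped and tied absolute differences share
    the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


def bonferroni_alpha(alpha, num_comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / num_comparisons


# Invented Likert ratings from six hypothetical viewers (NOT study data).
ratings_p = [4, 5, 3, 4, 2, 5]  # e.g. condition P
ratings_a = [3, 3, 3, 5, 1, 2]  # e.g. condition A

w_plus, w_minus = wilcoxon_signed_rank(ratings_p, ratings_a)
# Three pairwise comparisons: A vs. H, A vs. P, H vs. P.
threshold = bonferroni_alpha(0.05, 3)
print(w_plus, w_minus, threshold)
```

The smaller of the two rank sums is the statistic usually reported, and each of the three pairwise p-values would be compared against the adjusted threshold rather than against 0.05.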
6.1 Likert Ratings
For the subjective responses, no significant differences were found: A vs. H: Wilcox = 54, p > 0.05; A vs. P: Wilcox = 45.5, p > 0.10; H vs. P: Wilcox = 48.5, p > 0.05.
For satisfaction, none of the three conditions had a significant impact on the ratings relative to each other: A vs. H: Wilcox = 48.5, p > 0.05; A vs. P: Wilcox = 51, p > 0.05; H vs. P: Wilcox = 75.5, p > 0.05.
For the subjective and satisfaction responses, there was a significant difference between A vs. H, and A vs. P, but not H vs. P.
For satisfaction, there was no significant difference between A vs. H and A vs. P, but there was a significant difference between P and H: H vs. P: Wilcox = 14.8, p < 0.05.
6.2 Comprehension Scores
For general comprehension questions, there was no significant difference: A vs. H: t = 0.583, p > 0.05; A vs. P: t = 0.432, p > 0.05; H vs. P: t = –0.24, p > 0.05.
For sentence completion, A was significantly different than either H or P: A vs. H: t = –1.048, p < 0.05; A vs. P: t = 1.052, p < 0.05, but not H vs. P: t = 0.414, p > 0.05.
However, participants got significantly more sentence completion questions correct for P than A: t = 2.169, p < 0.05, but not for the other pairwise comparisons.
For general comprehension, none of these means differed significantly: A vs. H: t = 0.541, p > 0.05; A vs. P: t = 0.471, p > 0.05; P vs. H: t = 0.349, p > 0.05.
For sentence completion, H was not significantly different than either of the other conditions: A vs. H: t = –1.276, p > 0.05; H vs. P: t = 1.874, p > 0.05.
However, participants got significantly more sentence completion questions correct for P than A: t = 2.3224, p < 0.05, but not for the other pairwise comparisons.
The participant comments were generally negative about unsegmented subtitles. They were generally positive about grammatically segmented subtitles, generated either by professionals or by the SubtitleFormatter program, on both small and large displays. When the participants had different comments between large and small displays, they said that the lines or words were too hard to see on small displays. On large displays, they said the lines were not complete, or that they did not like it and could not explain why. For human-segmented subtitles, 28 participants wrote comments, and 15 wrote down identical comments for both small and large displays. When they wrote the same comment for both, they said that it was easy to read, or that they could understand each line. When the participants had different comments between large and small displays, they generally said that the lines or words were too fast on small displays, and that the lines were not complete on large displays.
Participants significantly preferred H or P segmented subtitles for smaller screens, but not for bigger screens. They performed significantly better on sentence completion tasks for either P or H over A. However, they did not perform significantly better on either H or P over A for either large or small screens. The general lack of significant differences between H and A for bigger screens agrees with the assertion by Perego et al. that segmentation in captioning has little or no impact on readability. It is possible that human or SubtitleFormatter subtitles were easier to remember, but the difference from unsegmented subtitles did not rise to the level of significance.
The preference for human or SubtitleFormatter subtitles on smaller screens could be explained by line length: when lines are shorter, the sentence concepts are more likely to be distributed over multiple lines, and breaking outside of phrase boundaries is likely to be confusing. The fact that segmentation has an impact on user preferences and sentence recall for smaller screens has important implications for captioning, as current caption guidelines all encourage proper segmentation.
The SubtitleFormatter supports viewer preferences for proportionately larger text on small screens by automatically adjusting caption line width according to screen size. It can be viewed as an automatic enhancement of accessibility for viewers who use captions, like how people with diverse magnification needs can benefit from automatic magnification. We present a novel approach to automatic subtitle segmentation which generates and selects optimal segmentation points according to SubtitleFormatter. SubtitleFormatter segmentation can be an inclusive approach for viewers who wish to follow best practice in segmentation guidelines and rules, or one that fits their own needs.
8 Future Work
Automatically formatting subtitles by display can also benefit people with limited English proficiency, or viewers with situational auditory barriers, e.g., quiet public spaces.
- 2. United States Congress, Americans with Disabilities Act, Pub. L. No. 101-336, 104 Stat. 328. United States of America (1990)
- 3. National Congress of Brazil, Ley de Igualdad de Oportunidades para las Personas con Discapacidad (Law of Equal Opportunities for Persons with Disabilities) (1992)
- 4. NIDCD: Captions For Deaf and Hard-of-Hearing Viewers, NIH Publ., no. 4834 (2017). https://www.nidcd.nih.gov/health/captions-deaf-and-hard-hearing-viewers
- 5. Block, M.H., Okrand, M.: Real-time closed captioned television as an educational tool. Am. Ann. Deaf 128(5), 636–641 (1983)
- 6. United States General Publications Office, 47 CFR 79.A.1. (2015)
- 8. Kushalnagar, R.S., Behm, G.W., Stanislow, J.S., Gupta, V.: Enhancing caption accessibility through simultaneous multimodal information: visual-tactile captions. In: ASSETS14 - Proceedings of the 16th International ACM SIGACCESS Conference on Computers and Accessibility (2014)
- 11. Waller, J.M., Kushalnagar, R.S.: Evaluation of automatic caption segmentation. In: Proceedings of the 18th International ACM SIGACCESS Conference on Computers and Accessibility - ASSETS 2016, pp. 331–332 (2016)
- 12. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - ACL 2003, vol. 1, pp. 423–430 (2003)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.