Synthesized Audio Descriptions
Hironobu Takagi, Chieko Asakawa (IBM)

National Women's Education Center - July 6th, 2010.
IBM History of Accessibility
1960s Talking Typewriter
1975 1403 Braille Printer
1984 Talking 3270 Terminal
1988 Screen Reader/DOS
1990 VoiceType™
1994 Screen Magnifier™/2
1997 Home Page Reader
1998 ViaVoice®
2000 Accessibility Center
2004 aDesigner
2007 aiBrowser for Multimedia
2007 Eclipse Accessibility Tools Framework
2008 Social Accessibility
2009 ARIA (Accessible Rich Internet Application)
[Image caption: 1999 Home Page Reader: Japanese, Italian, French, German, Spanish, US English, UK English]
© 2010 IBM Corporation

IBM Research - Tokyo
Status of Audio Descriptions in Japan
§ Movies (from NPO Media Access Support Center): 0.9%, 12.0%
– Chart labels: ratio of Japanese movies with captions (2008); ratio of Japanese movies with audio descriptions
§ TV
– Ratio of TV programs with captions (2008) (*1): public TV 49.4%, private TV 42.3%
– Ratio of TV programs with audio descriptions (2008) (*2): public TV 5.6%, private TV 0.4%
*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT (National Institute of Information and Communications Technology)

Captions and Audio Descriptions for TV Programs based on data

Problems: Workload and Cost
[Figure: workload for captions versus audio descriptions, split into recording and transcribing]
§ Recording an audio description calls for a skilled narrator and a good recording environment.
§ Writing an audio description script requires special expertise to describe the scenes between dialogues and scene changes.

History of Text-to-speech Engines
[Timeline, 1980 to 2010]
1983 DECtalk
1985 IBM
1996 ProTalker (IBM)
2004 Super Voice (IBM)
2008 Emotional TTS (IBM)

Possible Reduction of Workload
[Figure: workload comparison]
§ Current audio descriptions: recording and transcribing
§ Synthesized audio descriptions: transcribing only
– Recording workload: reduction by synthesis
– Transcribing workload: reduction by tool support

Acceptance Ratio (United States)
§ Method: online survey
§ Participants: 236 (39 low-vision, 197 blind)
§ Genre: education and documentary
§ Voice quality: human and TTS (Heather)
[Figure: share of ratings (Uncomfortable, Slightly Uncomfortable, Neutral, Acceptable, Comfortable) for Set 1 to Set 4, on a 0% to 100% scale]
Consistently, 70% to 80% answered above neutral.
Project: Research and development of standard specifications for audio guides and caption descriptions for people with visual and hearing impairments

Video Accessibility Project: Goals
§ Prove feasibility of text-based audio descriptions via user studies.
– Work with professional teams for audio descriptions
  • Japan: IBM with CAP and content from NHK
  • U.S.: WGBH
§ Create an open source platform for audio descriptions and captions
– Authoring tools and players
– Captions and text-based audio descriptions
– Based on Eclipse.org Accessibility Tools Framework (ACTF)
§ Contribute to standardization of Internet media accessibility
– Focus on “missing markups” in the existing standards.
– Maintain neutrality for existing standards.
– HTML5 is the primary target.
Supported by the Japanese government agency NICT (National Institute of Information and Communications Technology)

ACTF Script Editor
§ Authoring tool, specialized for audio descriptions.
§ Flexible to import and export various formats.
§ Planned for release as open source in March.

Audio guides in museums and on the stage
§ Museums: Audio guides are widely used in museums and art museums. (Their main purpose is not to support visually impaired visitors but to help everyone study the exhibits.)
– Examples of audio guide providers:
  • National Museum of Nature and Science, Tokyo
  • The National Museum of Western Art
  • Hiroshima Museum of Art
  • Osaka Museum of Natural History
  • Tokyo Museum of Fire Department
  • Shimane Museum of Ancient Izumo
– Almost every museum in Japan provides an audio guide.
– Audio guide equipment is generally purpose-built by the manufacturer and loaded with prerecorded voice. A newer approach uses the Nintendo DS, with the content downloaded onto it at the museum.
§ The stage: audio guides come mainly from small drama groups.
– Examples of audio guide providers:
  • The drama group "Bakkari-Bakkari" provides an audio guide once per performance period.
  • A drama group in the city of Kawasaki, Kanagawa Pref.
  • The drama group "DORA"
– As for captions, SHIKI THEATRE COMPANY, for example, provides them. Very few large-scale theatre productions provide audio guides.

Laws and Regulations
§ 1993 Act on Advancement of Facilitation Program for Disabled Persons' Use of Telecommunications and Broadcasting Services, with a View to Enhance Convenience of Disabled Persons
§ 1997 MIC defined a goal to “provide captions to all TV programs by 1997”
§ 1998 BROADCAST LAW
– Article 3-2 (4)
– Any broadcaster shall, in compiling the broadcast programs for domestic broadcasting, provide as many broadcasting programs as possible which provide voices and other sounds to explain about transient images of fixed or moving objects for blind persons, and providing characters or patterns to explain about voices and other sounds for deaf persons.
§ 2007 Signed the “Convention on the Rights of Persons with Disabilities”
§ 2010 New JIS (Japanese Industrial Standard) for Web Accessibility
– Technical guidelines are fully harmonized with WCAG 2.0

ACTF aiBrowser
1. Direct audio control
– Allows users to raise or lower the volume, stop or play, and control audio speed with simple keyboard commands.
2. User interface simplification
– Structurally simplifies interfaces by converting dynamic visual interfaces into static text-based interfaces.
– Dynamically adds alternative texts to images and buttons.
3. Audio descriptions with text
– Infrastructure to provide video descriptions at low cost.

Status of Audio Descriptions in Japan
§ Movies (from NPO Media Access Support Center): 0.9%, 12.0%
– Chart labels: ratio of Japanese movies with captions (2008); ratio of Japanese movies with audio descriptions
§ TV
– Ratio of TV programs with captions (2008) (*1): public TV 49.4%, private TV 42.3%
– Ratio of TV programs with audio descriptions (2008) (*2): public TV 5.6%, private TV 0.4%
§ Internet (team investigation)
– 0.2%: ratio of video content with captions in the Open Courseware project (2 among 1,474)
– 0.0%: popular video sharing services and educational online videos were checked, but no videos with audio descriptions were found (except for videos prepared as examples of audio descriptions).
*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT (National Institute of Information and Communications Technology)

Analysis of Standards and Possible Focus
Layers of markups (vocabulary lists) for text-based audio descriptions:
§ Personalization
§ Association with video contents, multilingual, etc.: Mozilla <itext>, etc.
§ Index structure for video (scenes and chapters, etc.): each video format has its own specifications (DVD, MPEG, etc.)
§ Unique for audio descriptions (extended, audio control, block, etc.): FOCUS AREA!
§ Voice styles and emotional expressions: W3C SSML, W3C EmotionML, etc.
§ Description (textual information)
§ Addressing (timing): SRT, W3C SMIL, W3C TT DFXP; flexible addressing
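
To make the "description", "addressing", and "voice styles" layers concrete, here is a minimal Python sketch, not the project's actual format. It assumes a hypothetical `Cue` record (the field names and the `extended` flag are illustrative only) and renders it two ways: as SRT-style timed text for addressing, and as a W3C SSML fragment that sets the speaking rate for voice style.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Cue:
    """One text-based audio description (hypothetical structure)."""
    start_ms: int           # where the description starts in the video
    end_ms: int             # where it should finish (e.g. before the next dialogue)
    text: str               # the description itself
    extended: bool = False  # assumption: marks a description that needs the video paused


def srt_timestamp(ms: int) -> str:
    """Format milliseconds as the HH:MM:SS,mmm timestamp used by SRT."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(cues) -> str:
    """Render cues as numbered SRT blocks separated by blank lines."""
    blocks = []
    for number, cue in enumerate(cues, start=1):
        timing = f"{srt_timestamp(cue.start_ms)} --> {srt_timestamp(cue.end_ms)}"
        blocks.append(f"{number}\n{timing}\n{cue.text}\n")
    return "\n".join(blocks)


def to_ssml(cue: Cue, rate: str = "medium", lang: str = "en-US") -> str:
    """Wrap one cue in SSML 1.0, using <prosody rate=...> for the speech speed."""
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<prosody rate="{rate}">{escape(cue.text)}</prosody></speak>'
    )
```

SRT covers only timing and text, which is why properties such as `extended` have no place to go in it; that is the kind of gap the slide labels as the focus area.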

2nd study: Level of Description
[Figure: rate of correct answers for each level of description, heard once or twice; chart annotation: 30%]
Using the extended description and listening twice both improved comprehension.

Difficulties in Online Videos
§ News
§ Entertainment
§ E-Learning
§ Historical Videos
§ Consumer-Generated Videos
Now is the time to create a new technical framework for audio descriptions!

Prior Projects
§ e-Inclusion project in Canada supported by Canadian Heritage.
– CRIM (Centre de recherche informatique de Montréal)
– Four-year project completed this year
– Authoring tool and playback tool
§ LiveDescribe by Ryerson University
– Community-based authoring system
– Authoring tool and playback tool
§ NHK Research
– Prototyped and tested TTS-based audio descriptions
§ aiBrowser
– Developed by IBM Research and contributed to Eclipse.org
– Audio descriptions with Flash, QuickTime, and Windows Media Player
§ Other trials
– HTML5 + Live Region demo (Firefox team)
– WebShake
  • Japanese online caption provider prototyped with TTS-based audio descriptions.
– ACAV, etc.

Distribution Flexibility
[Diagram: four distribution models]
§ Human voice (current model): human narrator -> audio
§ Pre-recorded synthesized audio: text -> synthesizer -> audio
§ Server-side synthesizer: text -> synthesizer -> audio
§ Client-side synthesizer: text -> synthesizer
Comparison columns: voice quality, authoring cost, system cost.
Cell values in source order: High, Low*, Low, High**, Low, High, Lowest, Low***
* Server-side synthesis is better than client-side synthesis.
** The systems for human voices can be reused.
*** Client-side software support is required.

Experimental Results (Japan)
§ 1st study (Sep 2009)
– 3 blind or visually impaired participants
– Face-to-face, one-to-one sessions
– Focused on the voice quality, level of description, and speech speed
§ 2nd study (Feb 2010)
– 24 blind or visually impaired participants
– Face-to-face, small group sessions
– Consisted of 4 sub-studies for long-term listening, expressive voices, describer expertise, and level of description

1st study: Results
The descriptions greatly improved the user experience regardless of the voice quality.
The participants’ comments indicated that modern TTS was almost comparable to a human voice, though the human voice was still preferred.

2nd study: Sub-studies
1. Long-term listening
– Assess if TTS-based descriptions are acceptable for listening to full-length programs
– Target videos: cartoon (comedy), drama (tragedy), documentary
2. Expressive voices
– Determine if the expressive TTS improves the user experience
– Target videos: cartoon (comedy), drama (tragedy)
3. Describer expertise
– Assess how the describer expertise affects understanding
– Target video: public service announcement (warning about fraud)
4. Level of description
– Assess how the level of description and repetitive listening affect understanding
– Target video: instructional program (how to fold and store clothing)

2nd study

2nd study: Long-term Listening
[Figure: effectiveness scores for each video category]
TTS-based descriptions were generally acceptable for full-length programs.
In the comments, the documentary film received the highest evaluation, but this was not evident from the effectiveness scores.

2nd study: Describer Expertise
[Figure: effectiveness scores for each describer expertise and level of description; frequency (0 to 12) against score (1 to 5) for Expert (Normal), Expert (Extended), Novice (Normal), Novice (Extended)]
Novice (Normal) was not preferred (score: 3.0).
Novice (Extended) was comparable (score: 4.3) to expert descriptions (score: 4.3 for normal, 4.6 for extended).

Typical Client-side TTS Setting
[Diagram: components are Online Video Website, Script Editor, Video Player, Audio Description Script, and Metadata Repository; arrows are labeled Refer, Browse, Post, and Fetch]
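
One possible reading of this diagram is that the script editor posts an audio description script to the metadata repository, and the video player later fetches it and speaks each description with a synthesizer on the client. The Python sketch below illustrates that flow only under these assumptions; `MetadataRepository`, `DescriptionPlayer`, their methods, and the `speak` callback are hypothetical names, not part of ACTF or aiBrowser.

```python
from dataclasses import dataclass, field


@dataclass
class Cue:
    start_s: float  # playback time at which the description should be spoken
    text: str       # description text handed to the client-side synthesizer


@dataclass
class MetadataRepository:
    """In-memory stand-in for the metadata repository (hypothetical API)."""
    scripts: dict = field(default_factory=dict)

    def post(self, video_url, cues):
        # The script editor posts a script keyed by the video it refers to.
        self.scripts[video_url] = sorted(cues, key=lambda cue: cue.start_s)

    def fetch(self, video_url):
        # The player fetches a copy; a video without a script yields no cues.
        return list(self.scripts.get(video_url, []))


class DescriptionPlayer:
    """Speaks fetched descriptions as the video position advances."""

    def __init__(self, repository, video_url, speak):
        self.pending = repository.fetch(video_url)
        self.speak = speak  # callback wrapping whatever TTS engine the client has

    def on_time_update(self, position_s):
        # Speak, in order, every cue whose start time has been reached.
        while self.pending and self.pending[0].start_s <= position_s:
            self.speak(self.pending.pop(0).text)


if __name__ == "__main__":
    repo = MetadataRepository()
    url = "https://example.com/video/1"  # placeholder URL
    repo.post(url, [Cue(5.0, "She closes the drawer."),
                    Cue(1.5, "A woman folds a shirt.")])
    player = DescriptionPlayer(repo, url, speak=print)
    player.on_time_update(2.0)  # speaks only the first description
```

Because the video itself stays on the original website and only text travels through the repository, this arrangement keeps system cost low, at the price of requiring TTS support on the client.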