T02: Speech-based Interaction: Myths, Challenges and Opportunities

Sunday, 15 July 2018, 08:30 – 12:30

 

Gerald Penn

University of Toronto, Canada

Cosmin Munteanu

University of Toronto Mississauga, Canada

 

Objectives:

  • How Automatic Speech Recognition (ASR) and Speech Synthesis (Text-To-Speech, or TTS) systems work, and why these are such computationally difficult problems
  • Where ASR and TTS are used in current commercial interactive applications
  • What usability issues surround speech-based interaction systems, particularly in mobile and pervasive computing
  • What challenges lie in enabling speech as a modality for mobile interaction
  • What the current state of the art in ASR and TTS research is
  • How commercial ASR systems' accuracy claims differ from the needs of mobile interactive applications
  • What difficulties arise in evaluating the quality of TTS systems, particularly from a usability and user perspective
  • What opportunities exist for HCI researchers to enhance systems' interactivity by enabling speech

 

Content:

HCI research has long been dedicated to facilitating better and more natural information transfer between humans and machines. Unfortunately, our most natural form of communication, speech, is also one of the most difficult modalities for machines to understand. Despite significant recent advances in speech understanding, HCI has been relatively timid in embracing this modality as a central research focus, partly due to the relatively discouraging accuracy of speech understanding in some genres (exaggerated claims from industry notwithstanding), but also due to the intrinsic difficulty of designing and evaluating speech and natural language interfaces. On the engineering side, improving speech technology with respect to largely arbitrary measures of performance has led to systems that deviate from user-centered design principles and fail to consider usability or usefulness.
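The standard "measure of performance" in ASR engineering is word error rate (WER): word-level edit distance between the recognizer's hypothesis and a reference transcript, normalized by reference length. As a minimal illustrative sketch (not the presenters' material; production scoring tools additionally handle text normalization and alignment reporting), WER can be computed with a dynamic-programming edit-distance table:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER weights every word equally, which is precisely why it can diverge from usability: a system with a lower WER may still garble the few words that matter most to the user's task.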

The goal of this course is to inform the HCI community of the current state of speech and natural language research, to dispel some of the myths surrounding speech-based interaction, as well as to provide an opportunity for researchers and practitioners to learn more about how speech recognition and speech synthesis work, their limitations, and how they could be used to enhance current interaction paradigms.

Our approach is two-fold: present new concepts to the audience, and foster discussions and exchange of ideas. Slides are used to introduce the main points, while videos and audio clips are played to illustrate examples. After each main concept is presented, time is allocated for interaction with the audience.

Variations of this tutorial have been presented at HCII 2016, MobileHCI 2010-2015, CHI 2011-2017, and I/ITSEC 2010-2016 (next month, I/ITSEC 2017). A recent version of our slides is available at: http://www.cs.toronto.edu/~gpenn/chi2017_speech-interaction_tutorial.pdf

We continuously update the presentation to include recent advances, such as commercial adoption of and advancements in speech recognition, mobile platforms, applications involving language translation, and HCI for international development and marginalized user groups. For HCII 2018, we will be extensively expanding the dialogue systems section in light of recent advances in deep learning and commercial-grade adoption.

 

Target Audience:

The course will benefit HCI researchers and practitioners without strong expertise in ASR or TTS who remain committed to HCI's goal of developing methods and systems that allow humans to interact naturally with increasingly ubiquitous mobile technology, but who are disappointed by the limited success of speech and natural language in achieving this goal.

No prior technical experience is required of participants. The classroom activities will be conducted using the participants' smartphones (Android or iPhone), but only built-in phone functions will be used, so no software download is required. Participants will work in small groups, ensuring that even participants without a smartphone are able to fully contribute.

Bio Sketches of Presenters:

Gerald Penn is a Professor of Computer Science at the University of Toronto, specializing in mathematical linguistics and spoken language processing. His lab played a pivotal role in the invention of neural-network-based acoustic models, which are now the standard in speech recognition systems, and specializes in human-subject interaction with speech-enabled devices. He is a senior member of both the IEEE and AAAI, and a recipient of the Ontario Early Researcher Award.

Cosmin Munteanu is an Assistant Professor at the Institute for Communication, Culture, Information, and Technology, University of Toronto Mississauga, and Associate Director of the Technologies for Ageing Gracefully lab. Until 2014 he was a Research Officer with the National Research Council of Canada. His area of expertise is at the intersection of Human-Computer Interaction, Automatic Speech Recognition, Natural Language Processing, Mobile Computing, and Assistive Technologies. He has extensively studied the human factors of using imperfect speech recognition systems, and has designed and evaluated systems that improve humans' access to and interaction with information-rich media and technologies through natural language. Cosmin's multidisciplinary interests include speech and natural language interaction for mobile devices, mixed reality systems, learning technologies for marginalized users, assistive technologies for older adults, and ethics in human-computer interaction research.