Intelligent Multimodal Interaction: Challenges and Promise, Mark T. Maybury
Intelligent Multimodal Interaction: Challenges and Promise
Mark T. Maybury, maybury@mitre.org
Schloss Dagstuhl, Germany, 29 October 2001
www.mitre.org/resources/centers/it/maybury/mark.html
This data is the copyright and proprietary data of the MITRE Corporation. It is made available subject to Limited Rights, as defined in paragraph (a)(15) of the clause at DFAR 252.227-7013. The restrictions governing the use and disclosure of these materials are set forth in the aforesaid clause.
MITRE
What are we talking about?
[Figure labels: information; perception, cognition, emotion; visualization (see), smell, speech, facial expression, haptics/gesture. Image source: Dr. Nahum Gershon and Ellaine Mullen, copyright The MITRE Corporation.]
Why Multimedia?
[Figure not recovered. Adapted from Cohen, P. 1992. The role of natural language in a multimodal interface. In Proceedings of the ACM SIGGRAPH Symposium on User Interface Software and Technology (UIST), Monterey, CA, 143-149.]
Why Multimedia?
- Evidence that users prefer both:
  - Flexibility (user, task, situation)
    - e.g., speech text, pen #'s
  - Efficiency and expressive power
Our Challenges
- Empirical studies of the optimal combination of text, audio, video, and gesture for the device, cognitive load and style, task, etc., for both input and output
  - Perceptual, cognitive, and emotional effects
- Multi* Input
  - Integration of imprecise, ambiguous, and incomplete input
  - Interpretation of uncertain multi* inputs
- Multi* Output
  - Select, design, allocate, and realize coherent, cohesive, and coordinated output
- Interaction Management
  - Natural, joyous, agent-based (?) mixed-initiative interaction
- Integrative Architectures
  - Common components, well-defined interfaces, levels of representation
- Methodology and Evaluation
  - Ethnographic studies, community tasks, corpus-based methods
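One common way to frame "integration of imprecise, ambiguous and incomplete input" is joint scoring of n-best hypotheses from each recognizer. The sketch below is a minimal version of that idea, assuming each recognizer returns (interpretation, probability) pairs, that the modes are conditionally independent, and that a simple type check stands in for semantic compatibility; the command names, object ids, and probabilities are hypothetical, and this is not the integration method of any system named in these slides.

```python
from itertools import product

# Hypothetical n-best hypotheses from two recognizers.
SpeechHyp = tuple[str, str, float]   # (command, object type it needs, probability)
GestureHyp = tuple[str, str, float]  # (object id, object type, probability)


def fuse(speech_nbest: list[SpeechHyp],
         gesture_nbest: list[GestureHyp]) -> list[tuple[str, str, float]]:
    """Return compatible (command, object_id, joint_score) triples, best first.

    Assumes the two modes are conditionally independent, so a joint score is the
    product of the two probabilities; semantically incompatible pairs are dropped.
    """
    fused = []
    for (cmd, needs, p_s), (obj, obj_type, p_g) in product(speech_nbest, gesture_nbest):
        if needs == obj_type:  # compatibility constraint between the two modes
            fused.append((cmd, obj, p_s * p_g))
    return sorted(fused, key=lambda hyp: hyp[2], reverse=True)


speech = [("zoom to", "area", 0.6), ("delete", "unit", 0.4)]
gesture = [("unit_7", "unit", 0.7), ("region_3", "area", 0.3)]
print(fuse(speech, gesture))
```

Here the best joint reading pairs "delete" with "unit_7" even though "zoom to" was the top speech hypothesis: the gesture evidence reorders the speech n-best list, so each mode helps disambiguate the other.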
Multimedia Presentation Generation: “No Presentation without Representation”
Data (Athens):
| Philosopher | Born | Died | Works | Emphasis |
|---|---|---|---|---|
| Aristotle | 384 BC | 322 BC | Poetics | Science |
| Plato | 428 BC | 348 BC | Republic | Virtue |
| Socrates | 470 BC | 399 BC | None | Conduct |
[Figure: the same data rendered as tables, maps (Athens), graphs (lifespans of Socrates, Plato, and Aristotle on a 500-300 BC axis), video, animated agents, and natural language ("Socrates, Plato, and Aristotle were Greek philosophers...").]
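The slogan "no presentation without representation" suggests keeping one underlying representation and realizing it in several media. A minimal sketch, using the philosopher data from the slide: the `Philosopher` record and the `as_table`, `as_text`, and `as_timeline` functions are hypothetical names, and the text "timeline" is only a stand-in for a real graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Philosopher:
    name: str
    born_bc: int   # years BC, so a larger number means earlier
    died_bc: int
    works: str
    emphasis: str


# The data from the slide's table: one representation for every medium.
DATA = [
    Philosopher("Aristotle", 384, 322, "Poetics", "Science"),
    Philosopher("Plato", 428, 348, "Republic", "Virtue"),
    Philosopher("Socrates", 470, 399, "None", "Conduct"),
]


def as_table(rows: list[Philosopher]) -> str:
    lines = [f"{'Philosopher':<12}{'Born':<8}{'Died':<8}{'Works':<10}Emphasis"]
    for p in rows:
        born, died = f"{p.born_bc} BC", f"{p.died_bc} BC"
        lines.append(f"{p.name:<12}{born:<8}{died:<8}{p.works:<10}{p.emphasis}")
    return "\n".join(lines)


def as_text(rows: list[Philosopher]) -> str:
    # Mention the philosophers in chronological order (earliest birth first).
    names = [p.name for p in sorted(rows, key=lambda p: -p.born_bc)]
    return f"{', '.join(names[:-1])}, and {names[-1]} were Greek philosophers."


def as_timeline(rows: list[Philosopher], start: int = 500, end: int = 300, step: int = 10) -> str:
    # One character per `step` years from `start` BC toward `end` BC; '#' marks a lifespan.
    lines = []
    for p in sorted(rows, key=lambda p: -p.born_bc):
        bar = "".join("#" if p.died_bc <= year <= p.born_bc else "."
                      for year in range(start, end, -step))
        lines.append(f"{p.name:<10}{bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(as_table(DATA))
    print(as_text(DATA))
    print(as_timeline(DATA))
```

Because every rendering reads the same records, a correction to the data propagates to all media at once, which is the practical payoff of representing content before presenting it.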
Common Presentation Design Tasks
- Communication management
- Content selection (information, task, user …)
- Presentation design
- Media allocation
- Media realization
- Media coordination
- Media layout
Notes from the diagram:
- Expressivity of different languages, e.g., the "ven acá" ("come here") gesture
- Length affects layout in space or time (e.g., EYP, audio)
- Co-constraining
- Cascaded processes
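The tasks above are cascaded but co-constraining: a late decision such as how long the spoken channel runs can force an earlier one, content selection, to be revisited. A minimal sketch under stated assumptions: the content types, the allocation rule, the 2.5 words-per-second speaking rate, and the "retry with one item fewer" feedback policy are all hypothetical simplifications.

```python
from dataclasses import dataclass

SPEECH_RATE_WPS = 2.5  # assumed speaking rate, roughly 150 words per minute


@dataclass
class Item:
    text: str
    kind: str      # hypothetical content types: "fact", "quantity", "location"
    priority: int  # higher means more important


def select_content(items: list[Item], limit: int) -> list[Item]:
    """Content selection: keep the `limit` highest-priority items (stable on ties)."""
    return sorted(items, key=lambda i: i.priority, reverse=True)[:limit]


def allocate_media(items: list[Item]) -> list[tuple[Item, str]]:
    """Media allocation: quantities to a graph, locations to a map, the rest to speech."""
    medium_for = {"quantity": "graph", "location": "map"}
    return [(item, medium_for.get(item.kind, "speech")) for item in items]


def spoken_seconds(allocation: list[tuple[Item, str]]) -> float:
    """Layout in time: estimated duration of everything allocated to speech."""
    words = sum(len(item.text.split()) for item, medium in allocation if medium == "speech")
    return words / SPEECH_RATE_WPS


def design(items: list[Item], time_budget_s: float) -> list[tuple[Item, str]]:
    """Run selection -> allocation -> layout; when the spoken part overruns the
    time budget, feed that constraint back into content selection and retry
    with one item fewer."""
    for limit in range(len(items), 0, -1):
        allocation = allocate_media(select_content(items, limit))
        if spoken_seconds(allocation) <= time_budget_s:
            return allocation
    return []


items = [
    Item("Socrates taught in Athens and left no writings", "fact", 3),
    Item("Lifespans from 470 BC to 322 BC", "quantity", 2),
    Item("Athens", "location", 2),
    Item("Plato founded the Academy", "fact", 1),
]
for item, medium in design(items, time_budget_s=4.0):
    print(f"{medium:<7}{item.text}")
```

With a 4-second budget the full plan needs 12 spoken words (4.8 s), so the lowest-priority fact is dropped and the remaining items go to speech, graph, and map.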
Common Representations: Communicative Acts [Maybury, 1993; Wahlster, André, Rist 1993]
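In plan-based presentation generation, a presentation can be represented as a hierarchy of communicative acts in which abstract acts decompose into media-specific ones. The sketch below shows only that data structure; the act names and the example plan are hypothetical and are not the act taxonomy of the cited papers.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Act:
    """A communicative act. Abstract acts decompose into subacts; leaf acts
    are realized in a single medium."""
    name: str
    medium: Optional[str] = None   # None for abstract (e.g. rhetorical) acts
    content: str = ""
    subacts: list["Act"] = field(default_factory=list)

    def leaves(self) -> Iterator["Act"]:
        """Yield the media-specific acts that must actually be realized, in order."""
        if not self.subacts:
            yield self
        for sub in self.subacts:
            yield from sub.leaves()


# Hypothetical plan: an abstract act decomposed into coordinated speech,
# gesture, and graphics acts.
plan = Act("explain-location", subacts=[
    Act("identify-city", subacts=[
        Act("refer", medium="speech", content="this city"),
        Act("point", medium="gesture", content="Athens on the map"),
    ]),
    Act("depict", medium="graphics", content="map of Greece"),
])

for act in plan.leaves():
    print(act.medium, act.name, repr(act.content))
```

Keeping the abstract acts in the tree, rather than only the leaves, preserves why each media act exists, which is what later coordination and follow-up dialogue can refer back to.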
Traditional Architecture
[Figure components: user(s), presentation, dialog control, application interface, applications, people, information.]
Architecture of the SmartKom Agent (cf. Maybury/Wahlster 1998)
[Figure components:
- Media input processing: language, graphics, gesture, biometrics
- Media/mode analysis, media fusion
- Interaction management: discourse modeling, intention recognition, user modeling, presentation design, dialog control
- Media/mode design and media output rendering: language, graphics, gesture, animated presentation agent
- Representation and inference: user model, discourse model, domain model, task model, media models
- Application interface (initiation, request, information, response) to applications and people, serving the user(s).]
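A media fusion component has to decide which gesture a spoken deictic ("this", "there") refers to. A minimal sketch of time-window fusion for the classic "put that there" command, assuming word and gesture timestamps on a shared clock and a hypothetical 1.5-second window; this is an illustration, not SmartKom's actual fusion module.

```python
from dataclasses import dataclass

FUSION_WINDOW_S = 1.5  # assumed maximum gap between a deictic word and its gesture
DEICTICS = {"this", "that", "here", "there"}


@dataclass
class Word:
    token: str
    t: float  # seconds since the start of the utterance


@dataclass
class Point:
    target: str
    t: float  # seconds since the start of the utterance


def fuse(words: list[Word], points: list[Point], window: float = FUSION_WINDOW_S) -> list[str]:
    """Resolve each deictic word to the nearest-in-time unused pointing gesture.

    A deictic with no gesture inside the window is left as-is, so a dialogue
    manager could notice it and ask a clarification question.
    """
    unused = list(points)
    resolved = []
    for word in words:
        if word.token in DEICTICS and unused:
            nearest = min(unused, key=lambda p: abs(p.t - word.t))
            if abs(nearest.t - word.t) <= window:
                unused.remove(nearest)
                resolved.append(nearest.target)
                continue
        resolved.append(word.token)
    return resolved


words = [Word("put", 0.0), Word("that", 0.4), Word("there", 1.2)]
points = [Point("red_block", 0.5), Point("table_corner", 1.3)]
print(fuse(words, points))
```

Each gesture is consumed once, so two deictics in one utterance bind to two different gestures rather than both to the closest one.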
DARPA Galaxy Communicator
The Galaxy Communicator Software Infrastructure (GCSI) is a distributed, message-based, hub-and-spoke infrastructure optimized for constructing spoken dialogue systems.
[Figure: a central hub connected to speech recognition, frame construction, context tracking, dialogue management, application backend, language generation, text-to-speech conversion, and audio server components.]
Open source and documentation available at fofoca.mitre.org and sourceforge.net/projects/communicator
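The hub-and-spoke idea can be sketched in a few lines: servers only exchange messages with the hub, and routing rules in the hub decide which server handles each message. This is a minimal illustration under assumed names (`Hub`, `register`, `route`, `dispatch`, and the toy servers are hypothetical) and does not use the GCSI API.

```python
from typing import Callable

Frame = dict  # a message: a "name" key plus arbitrary key/value pairs
Server = Callable[[Frame], Frame]


class Hub:
    """Minimal hub-and-spoke router: servers never call each other directly;
    the hub decides, by frame name, which server handles each message."""

    def __init__(self) -> None:
        self.servers: dict[str, Server] = {}
        self.rules: dict[str, str] = {}   # frame name -> server name
        self.trace: list[str] = []        # servers visited, for inspection

    def register(self, name: str, server: Server) -> None:
        self.servers[name] = server

    def route(self, frame_name: str, server_name: str) -> None:
        self.rules[frame_name] = server_name

    def dispatch(self, frame: Frame) -> Frame:
        # Forward each reply to the next server until no rule matches its name.
        while frame["name"] in self.rules:
            server_name = self.rules[frame["name"]]
            self.trace.append(server_name)
            frame = self.servers[server_name](frame)
        return frame


# Toy servers standing in for recognition, parsing, dialogue, and synthesis.
hub = Hub()
hub.register("recognizer", lambda f: {"name": "text", "text": f["audio"]})  # pretend ASR
hub.register("parser", lambda f: {"name": "parsed", "city": f["text"].split()[-1]})
hub.register("dialogue", lambda f: {"name": "reply", "text": f"Searching for shelters near {f['city']}."})
hub.register("tts", lambda f: {"name": "spoken", "text": f["text"]})  # pretend synthesis
for frame_name, server_name in [("audio", "recognizer"), ("text", "parser"),
                                ("parsed", "dialogue"), ("reply", "tts")]:
    hub.route(frame_name, server_name)

result = hub.dispatch({"name": "audio", "audio": "show shelters in Boston"})
print(hub.trace, result)
```

Because control flow lives in the hub's rules, inserting a new server (for example, context tracking between parsing and dialogue) changes only the routing table, not the existing servers.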
An Example: Communicator-Compliant Emergency Management Interface
- Hub
- Speech recognition converts speech to text (MIT SUMMIT engine and wrapper)
- Frame construction extracts information from input text (Colorado Phoenix engine, MITRE wrapper)
- Dialogue management tracks information, decides what to do, and formulates answers (MITRE)
- SQL generation converts abstract requests to SQL (MITRE)
- Database (open source Postgres engine, MITRE wrapper)
- Text-to-speech converts output text to audio (CMU Festival engine, Colorado wrapper)
- MIT phone connectivity connects audio to a telephone line
- I/O podium displays input and output text (MITRE)
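The dialogue management and SQL generation steps above can be illustrated with a toy backend. This sketch assumes a hypothetical `shelters` table and frame slots (`city`, `min_capacity`), and uses Python's built-in sqlite3 in place of Postgres; it asks for a missing required slot, otherwise turns the abstract request into a parameterized query.

```python
import sqlite3

# Hypothetical schema for an emergency-management backend.
SCHEMA = "CREATE TABLE shelters (name TEXT, city TEXT, capacity INTEGER, is_open INTEGER)"

# Frame slot -> (column, comparison operator); only these slots become SQL.
SLOT_COLUMNS = {"city": ("city", "="), "min_capacity": ("capacity", ">=")}
REQUIRED_SLOTS = {"city"}


def to_sql(frame: dict) -> tuple[str, list]:
    """Translate an abstract request frame into a parameterized SELECT.

    Slot values travel as query parameters, never pasted into the SQL text.
    """
    clauses, params = ["is_open = 1"], []
    for slot, value in frame.items():
        if slot in SLOT_COLUMNS:
            column, op = SLOT_COLUMNS[slot]
            clauses.append(f"{column} {op} ?")
            params.append(value)
    sql = "SELECT name FROM shelters WHERE " + " AND ".join(clauses) + " ORDER BY name"
    return sql, params


def respond(frame: dict, conn: sqlite3.Connection) -> str:
    """Toy dialogue step: ask for a missing required slot, otherwise query and answer."""
    missing = sorted(REQUIRED_SLOTS - set(frame))
    if missing:
        return f"Which {missing[0]}?"
    sql, params = to_sql(frame)
    names = [row[0] for row in conn.execute(sql, params)]
    return "Open shelters: " + ", ".join(names) if names else "No open shelters found."


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany("INSERT INTO shelters VALUES (?, ?, ?, ?)", [
    ("Central High", "Boston", 300, 1),
    ("North Armory", "Boston", 120, 1),
    ("Harbor Gym", "Boston", 500, 0),
    ("Elm School", "Salem", 200, 1),
])
print(respond({"min_capacity": 150}, conn))
print(respond({"city": "Boston", "min_capacity": 150}, conn))
```

Passing slot values as parameters keeps recognition errors or odd user input from ever becoming part of the SQL statement itself.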
[Figure: brain activation during single-word processing, for auditory and visual word presentation. Source: Science or Nature, Univ Washington.]
Conditions (each compared by subtraction):
- Sensory: passive words minus fixation
- Output: repeat words minus passive words
- Association: generate use minus repeat words (e.g., cake → eat)
- Semantic: monitor semantic category minus passive words
Regions labeled (2 = primary auditory cortex):
- 1.6 cm above the AC-PC line: temporoparietal; bilateral superior posterior temporal; inferior anterior cingulate
- Non-speech audio: no 2
- Occipital cortex (4 cm above the AC-PC line)
- Rolandic cortex (anterior superior motor cortex) (8 cm below)
- Inferior anterior frontal cortex (Brodmann area 47)
- Left inferior frontal (semantic association)
- Anterior cingulate gyrus (attentional system for action selection, e.g., picking dangerous animals)
Evaluation Techniques
- Evaluating intelligent user interfaces (IUIs) is harder than conventional HCI evaluation:
  - The user influences interface behavior (i.e., through a user model)
  - The interface influences user behavior (e.g., critiquing, cooperating, challenging)
  - Task complexity and environment vary
  - Evaluation therefore requires more care
- Many techniques:
  - Inspection methods: “heuristic evaluation”, cognitive walkthrough
  - Analytic/formal/theoretic methods (e.g., GOMS, CCT, ICS): model the resources required, task complexity, and time to complete in order to predict performance and critique the interface
  - Ablation studies
  - Wizard-of-Oz studies, simulations
  - Instrumentation of live environments
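As an example of the analytic methods listed above, the Keystroke-Level Model, the simplest member of the GOMS family, predicts expert execution time by summing operator times. The sketch uses the commonly cited Card, Moran & Newell values; the two operator sequences being compared are hypothetical designs, not results from these slides.

```python
# Commonly cited Keystroke-Level Model operator times, in seconds
# (Card, Moran & Newell; K is the value for an average skilled typist).
OPERATOR_TIME_S = {
    "K": 0.20,  # press a key or mouse button
    "P": 1.10,  # point to a target with the mouse
    "H": 0.40,  # move a hand between keyboard and mouse
    "M": 1.35,  # mentally prepare for the next action
}


def predict_time(operators: str) -> float:
    """Predicted execution time for a space-separated sequence of KLM operators."""
    return sum(OPERATOR_TIME_S[op] for op in operators.split())


# Two hypothetical designs for issuing the same command:
menu = "H M P K M P K"   # reach for mouse, open a menu, choose an item
typed = "M K K K K K"    # recall a 4-letter command, type it, press Enter
print(f"menu: {predict_time(menu):.2f} s, typed: {predict_time(typed):.2f} s")
```

Such predictions assume error-free expert performance, which is one reason the slide pairs analytic methods with empirical ones such as Wizard-of-Oz studies and instrumented field use.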
Instrumented Evaluation Process
[Figure components: instrumented interactive application; interaction logging; indexed, enriched log; replay, data visualization, and annotation; analysis and evaluation; corpus-based adaptation. Source: DARPA IC&V.]
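The logging, indexing, annotation, and replay stages of this process can be sketched as a small class. The `InteractionLog` name, its methods, the event kinds, and the "repair" label are hypothetical; a real instrumented application would also persist the log, which `to_jsonl` only gestures at.

```python
import json
import time
from collections import defaultdict


class InteractionLog:
    """Append-only log of timestamped UI events, indexed by event type."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.events = []                 # the raw log, in arrival order
        self.index = defaultdict(list)   # event kind -> positions in self.events

    def record(self, kind: str, **details) -> None:
        event = {"t": self.clock(), "kind": kind, **details}
        self.index[kind].append(len(self.events))
        self.events.append(event)

    def annotate(self, position: int, label: str) -> None:
        """Enrich a logged event with an analyst's label."""
        self.events[position].setdefault("labels", []).append(label)

    def of_kind(self, kind: str) -> list:
        return [self.events[i] for i in self.index[kind]]

    def to_jsonl(self) -> str:
        """Serialize one JSON object per line, e.g. for a shared corpus."""
        return "\n".join(json.dumps(e) for e in self.events)

    def replay(self, handler) -> None:
        """Feed events back, in order, to a handler such as a visualizer."""
        for event in self.events:
            handler(event)


ticks = iter([0.0, 0.8, 2.5])            # fake clock for a reproducible example
log = InteractionLog(clock=lambda: next(ticks))
log.record("speech", text="show shelters")
log.record("click", target="map")
log.record("speech", text="zoom in")
log.annotate(2, "repair")
print(log.to_jsonl())
```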
WOSIT and COLLAGEN
[Figure components: end-user application: TARGETS; instrumentation: JOSIT; tutoring agent: Collagen (MERL); arrows labeled observe, interpret, simulate, perform, interact, communicate.]
Instrumentation software:
- JOSIT: http://www.mitre.org/tech_transfer/josit/
- WOSIT: http://www.mitre.org/technology/wosit/
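The diagram places an instrumentation layer between the end-user application and a tutoring agent, so the agent can observe user actions and also perform actions on the application. A minimal sketch of that mediator arrangement; every class and method name here is hypothetical, and none of it reflects the actual JOSIT, WOSIT, or Collagen interfaces.

```python
class Application:
    """Stand-in for the end-user application: named widgets holding values."""

    def __init__(self) -> None:
        self.widgets = {"target_name": "", "priority": "low"}

    def set(self, widget: str, value: str) -> None:
        self.widgets[widget] = value


class TutorAgent:
    """Toy agent: interprets observed actions against an expected procedure
    and, when one is wrong, corrects it through the instrumentation layer."""

    def __init__(self, expected: dict[str, str]) -> None:
        self.expected = expected   # hypothetical: widget -> correct value
        self.messages: list[str] = []

    def observe(self, widget: str, value: str, instrumentation: "Instrumentation") -> None:
        want = self.expected.get(widget)
        if want is not None and value != want:
            self.messages.append(f"{widget} should be {want!r}; correcting it.")
            instrumentation.perform(widget, want)


class Instrumentation:
    """Mediates between user, application, and agents: each user action is
    applied and then reported to the agents (observe); agents can drive the
    application through it (simulate/perform)."""

    def __init__(self, app: Application, agents: list[TutorAgent]) -> None:
        self.app = app
        self.agents = agents

    def user_action(self, widget: str, value: str) -> None:
        self.app.set(widget, value)
        for agent in self.agents:
            agent.observe(widget, value, self)

    def perform(self, widget: str, value: str) -> None:
        # Agent actions are not re-reported to agents, which avoids feedback loops.
        self.app.set(widget, value)


app = Application()
tutor = TutorAgent({"priority": "high"})
instrumentation = Instrumentation(app, [tutor])
instrumentation.user_action("target_name", "Bridge 4")
instrumentation.user_action("priority", "low")
print(app.widgets, tutor.messages)
```

Routing both directions through one layer means the application itself needs no knowledge of the agent, which is what makes it possible to attach tutoring to an existing application.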
Summary: Our Challenges
- Empirical studies of the optimal combination of text, audio, video, and gesture for the device, cognitive load and style, task, etc., for both input and output
  - Perceptual, cognitive, and emotional effects
- Multi* Input
  - Integration of imprecise, ambiguous, and incomplete input
  - Interpretation of uncertain multi* inputs
- Multi* Output
  - Select, design, allocate, and realize coherent, cohesive, and coordinated output
- Interaction Management
  - Natural, joyous, agent-based (?) mixed-initiative interaction
- Integrative Architectures
  - Common components, well-defined interfaces, levels of representation
- Methodology and Evaluation
  - Ethnographic studies, community tasks, corpus-based methods
Conclusion
- Emerging techniques for parsing simultaneous multimedia input, generating coordinated multimedia output, and tailoring interaction to the user, task, and situation
- Laboratory prototypes that integrate these techniques to support multimedia dialogue and agent-based interaction
- Personalization is increasing; privacy is a concern
- Range of application areas: decision support, information retrieval, education and training, entertainment
- Potential benefits:
  - Increase the raw bit rate of information flow (the right media/modality mix for the job)
  - Increase the relevance of information (e.g., information selection, tailored presentation)
  - Simplify and speed task performance via interface agents (e.g., speech inflections, facial expressions, hand gestures, task delegation)