ALADIN (Adaptation and Learning for Assistive Domestic Vocal Interfaces)
Voice control of the devices we use in our daily lives is often perceived as a luxury or a gadget. For persons with a physical impairment, however, the default user interface, i.e. pushing buttons, is not always as easy as it is for most people, and voice control is a viable alternative for them. Nevertheless, vocal interfaces are currently not widely used in assistive devices, for several reasons:
1. Due to the high variation in user requirements for assistive technology, individual adaptation and development are expensive. Development costs of vocal interfaces are high, as they require specialized personnel with experience in grammar development, knowledge of the strengths and weaknesses of speech technology, and the ability to translate these into applications.
2. Users for whom voice control could be of added value often also have a speech pathology, such that state-of-the-art speech recognizers are unusable for them.
3. Current speech recognizers are not robust to pronunciation variation caused by regional accents and speaking style.
4. Vocal interfaces require the user to speak according to a predefined grammar and to use predefined words, which often do not match the user's speaking style and choice of words. Hence the user has to adapt to the device instead of the other way around. Speaker-dependent speech recognition technology only partially solves this problem.
5. The user's voice may change over time due to progressive speech impairments, the environment and body position, requiring blind or unsupervised adaptation.
In ALADIN, we propose an approach that is based on learning and adaptation. The interface should learn what the user means with commands, which words he or she uses and what his or her vocal characteristics are. Users should formulate commands as they like, using the words they like and addressing only the functionality they are interested in. Learning takes place through use of the device, i.e. by mining the vocal commands and the changes they provoke in the device.
To move the state of the art to the point where learning speech interfaces can achieve the intended impact, the following scientific goals need to be realized:
(1) To develop methods for robust and adaptive vocabulary acquisition from a small number of examples. By observing recurring acoustic patterns in the user's utterances, subwords, words and phrases must be identified and related to their meaning. The vocabulary for the applications we have in mind ranges from tens to hundreds of words. The proposed approach builds on ideas related to exemplar-based speech recognizers and on pattern discovery based on non-negative matrix factorization (a minimal sketch follows after this list). Once they have determined their phonemic space, humans can learn a new word from as few as one repetition. This is our ultimate goal for normal speech, while remaining competitive with existing technology.
(2) To develop methods for robust and adaptive grammar induction from a small number of examples. The order in which words are placed in a sentence and how this order relates to the semantics should be learned as well. Given the limited domain, the grammars will not be of the complexity of, e.g., newspaper text, but at least verb - subject - direct object and even indirect object relations may be relevant, as well as adjective/noun relations. The approach is based on the induction of dependency grammars and on automatically selecting skeletons for them (see the toy word-order sketch after this list). Our goal is to achieve faster learning than existing technologies.
(3) To handle deviant speech. The user's speech can be dialectal or pathological to varying degrees. Since we propose a learning approach from the acoustic to the grammatical level, this is not a structural hurdle. To improve accuracy, we propose to enrich the speech representation with articulatory features as well as pitch and timing (see the pitch-estimation sketch after this list). We will evaluate the approach on at least four pathologies, demonstrating successful learning from interaction.
(4) To design adequate signal preprocessing ensuring that the speech signal reaches the speech recognizer in good order. Several signal processing modules are combined to cancel unwanted signals from noise sources (such as the television set) using acoustic sensor networks and to mitigate reverberation effects (see the single-channel denoising sketch after this list). We aim for only a limited reduction in recognition accuracy at speaker-to-microphone distances of up to two meters.
(5) To investigate the user requirements, i.e. for which tasks users with an impairment could or would like to use voice control. Using interviews, videos and demonstrations of voice control for specific tasks, we establish which problems and needs form a basis for speech interfaces. In a Wizard-of-Oz experiment we will study how users interact and, for instance, how they deal with corrections.
(6) To design adaptive vocal human-machine interfaces that learn from their usage. We investigate how the concept of learning speech interfaces can be applied in practice by confronting ever more advanced prototypes with users. We examine how functionality can be disclosed or hidden, how fallback strategies can be unified with providing training examples, and how users describe both continuous and discrete variables. Furthermore, we investigate how to deal with the initial phase, in which the user interface has acquired too little learning data to respond appropriately, and how to transition to the phase in which the interface learns and adapts based on usage. We will realize two applications: a personal assistant and a motor-adjustable bed.
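The following is a minimal sketch of the kind of NMF-based pattern discovery referred to in goal (1), assuming each utterance has already been summarized as a non-negative histogram of acoustic events. The codebook size, number of latent patterns and toy data are hypothetical and only illustrate the factorization step, not the actual ALADIN implementation.

```python
# Minimal sketch: discovering recurring acoustic patterns with NMF.
# Each utterance is assumed to be summarized as a histogram of quantized
# acoustic events; the toy data below is synthetic and purely illustrative.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

n_events = 50       # size of the acoustic event codebook (assumed)
n_patterns = 5      # latent word-like patterns to discover
n_utterances = 40

# Build toy utterance histograms as additive mixtures of hidden patterns.
true_patterns = rng.random((n_patterns, n_events))
activations = rng.random((n_utterances, n_patterns))
V = activations @ true_patterns            # utterances x events, non-negative

# Factorize V ~= W H: rows of H are recurring patterns ("word" candidates),
# W tells which patterns are active in each utterance.
model = NMF(n_components=n_patterns, init="nndsvda", max_iter=500)
W = model.fit_transform(V)
H = model.components_

print("reconstruction error:", round(model.reconstruction_err_, 3))
print("pattern activations for utterance 0:", np.round(W[0], 2))
```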
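The word-order learning of goal (2) can be illustrated with a deliberately simplified toy: given a few commands paired with the semantic frame they triggered, count which word position tends to fill which slot. The example commands, slot names and position heuristic are invented for illustration; the actual approach induces dependency grammars rather than fixed positions.

```python
# Toy illustration (not the ALADIN grammar-induction method): learn which
# word position tends to fill which semantic slot from a few paired examples.
from collections import defaultdict

examples = [
    ("move the bed up", {"action": "move", "object": "bed", "value": "up"}),
    ("move the bed down", {"action": "move", "object": "bed", "value": "down"}),
    ("turn the light on", {"action": "turn", "object": "light", "value": "on"}),
]

counts = defaultdict(lambda: defaultdict(int))
for utterance, frame in examples:
    for position, word in enumerate(utterance.split()):
        for slot, value in frame.items():
            if word == value:
                counts[slot][position] += 1

# The dominant position per slot is a crude "skeleton" of the word order.
for slot, positions in counts.items():
    best = max(positions, key=positions.get)
    print(f"slot '{slot}' is typically realized at word position {best}")
```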
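Goal (3) mentions enriching the speech representation with pitch and timing. As a simple illustration of such a feature, and not the project's actual feature extractor, the sketch below estimates pitch on a synthetic frame with plain autocorrelation.

```python
# Autocorrelation-based pitch estimate on a synthetic, vowel-like frame.
import numpy as np

fs = 16000                                    # sample rate in Hz
t = np.arange(0, 0.03, 1 / fs)                # one 30 ms analysis frame
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)

frame = frame - frame.mean()
corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]

# Search for the autocorrelation peak in a plausible pitch range (60-400 Hz).
lo, hi = fs // 400, fs // 60
lag = lo + np.argmax(corr[lo:hi])
print(f"estimated pitch: {fs / lag:.1f} Hz")  # close to the true 120 Hz
```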
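Goal (4) targets multi-microphone noise cancellation and dereverberation with acoustic sensor networks. As a much-simplified single-channel stand-in, the sketch below applies spectral subtraction with an idealized noise estimate to show the basic principle.

```python
# Single-channel spectral subtraction on a synthetic signal (idealized:
# the noise spectrum is taken from the noise itself; in practice it would
# be estimated from speech-free segments).
import numpy as np

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
clean = np.sin(2 * np.pi * 300 * t)
rng = np.random.default_rng(1)
noise = 0.5 * rng.standard_normal(len(t))
noisy = clean + noise

# Subtract the noise magnitude spectrum, keep the noisy phase.
noise_mag = np.abs(np.fft.rfft(noise))
noisy_fft = np.fft.rfft(noisy)
mag = np.maximum(np.abs(noisy_fft) - noise_mag, 0.0)
denoised = np.fft.irfft(mag * np.exp(1j * np.angle(noisy_fft)), n=len(noisy))

def snr_db(ref, x):
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - x) ** 2))

print(f"SNR before: {snr_db(clean, noisy):.1f} dB, after: {snr_db(clean, denoised):.1f} dB")
```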
Period: 1 Jan 2011 - 31 Dec 2014
K.U.Leuven department ESAT - Prof. dr. ir. Hugo Van hamme
Subcontractor: InHAM - Peter Deboutte
K.U.Leuven CUO - Prof. dr. Dirk De Grooff
K.H.K. MOBILAB - Prof. dr. ir. Bart Vanrumste
U.A. CLiPS - Prof. dr. Walter Daelemans