Design notes from other ASR + SDS research investigations

There have been many bespoke spoken dialogue systems designed for language acquisition. These generally contain the following constituent parts 1):

  • Speech recognition and understanding
  • Dialogue manager
  • Response generator
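The three-part structure above can be sketched as a toy pipeline. All class and function names here are illustrative, and the keyword-spotting "recognition" is a stand-in for a real ASR+NLU stage, not the approach of any cited system:

```python
from dataclasses import dataclass

@dataclass
class Recognition:
    intent: str   # what the learner is trying to do
    correct: bool # the "correctness" judgement unique to language-learning SDS

def recognise(utterance: str) -> Recognition:
    # Stand-in for speech recognition and understanding: keyword spotting only.
    intent = "order_coffee" if "coffee" in utterance.lower() else "unknown"
    return Recognition(intent=intent, correct=intent != "unknown")

def manage(rec: Recognition) -> str:
    # Dialogue manager: situate the recognised intent inside the script.
    script = {"order_coffee": "ask_size"}
    return script.get(rec.intent, "request_repair")

def generate(action: str) -> str:
    # Response generator: render the chosen action for the learner.
    responses = {"ask_size": "What size would you like?",
                 "request_repair": "Sorry, could you say that again?"}
    return responses[action]

print(generate(manage(recognise("One coffee please"))))
# → "What size would you like?"
```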

The automatic speech recognition system processes the user's input in an attempt to understand both its meaning and (in a step unique to language-learning SDS) the linguistic, grammatical, phonetic and/or communicative “correctness” of the input.

The dialogue manager receives the intent of the user's interaction (as understood by the ASR) and contextualises it within the scope of the interaction and interactional script. That is, the dialogue manager attempts to situate the meaning derived from the user's expression inside a pre-ordered interaction process. For example, if the user mentions ordering a coffee (“One coffee please”), and the dialogue manager is focussed on guiding a user through a coffee-shop experience, it knows to present the user with an appropriate response (“What size would you like?”). The scope of this context is directly related to how constrained the designers choose the system to be.

The dialogue manager stage also contains the backchanneling and repair framework, requesting clarification from the user if the received user meaning does not adequately match the dialogue manager's expected interaction script or context.

The response generator creates an output for the user to receive. This is presented in the form of a text or spoken response, as well as a mixture of agent gestures, agent expressions, environmental changes, UI changes, and the presentation of contextual information to provide learner scaffolding.

Speech recognition + understanding

This covers automatic speech recognition and natural language understanding - that is, both the process of recognising speech and that of understanding the intent and variables present in the utterance. Across the past 20 years of spoken dialogue systems there have been many approaches (as outlined by Bibauw), but the prevailing approach to ASR+NLU is intent/entity based, backed by AI-trained recognition systems.
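The intent/entity paradigm can be illustrated with a toy parser. The intent and entity names below are hypothetical, and the word-matching stands in for the trained models that platforms in this space actually use:

```python
# Toy intent/entity extraction: the NLU stage returns an intent label
# plus the variables ("entities") found in the utterance.
NUMBERS = {"a": 1, "one": 1, "two": 2, "three": 3}

def parse(utterance: str) -> dict:
    tokens = utterance.lower().split()
    entities = {}
    for i, tok in enumerate(tokens):
        if tok in ("coffee", "tea"):
            entities["drink"] = tok
            # A preceding number word becomes the quantity entity.
            if i > 0 and tokens[i - 1] in NUMBERS:
                entities["quantity"] = NUMBERS[tokens[i - 1]]
    intent = "order_drink" if "drink" in entities else "unknown"
    return {"intent": intent, "entities": entities}

print(parse("One coffee please"))
# → {'intent': 'order_drink', 'entities': {'drink': 'coffee', 'quantity': 1}}
```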

While this paradigm is the backbone of most major consumer SDSs, only the most recent language-learning ASR+SDS use this approach, with the majority of pre-2014 systems using hand-coded rules and other, less efficient solutions. This creates a conflict between prior CALL SDS design and modern SDS approaches.

A noted drawback of ASR based upon entities, intents and AI-trained recognition is the effort it takes to train these systems - specifically the amount of training data required. Post-2014 CALL SDS systems tend to use off-the-shelf recognition platforms (Google's Dialogflow, IBM's Watson, Microsoft's LUIS, Facebook's equivalent) but have yet to come to terms with the issue of non-native pronunciation recognition. This concern is relieved somewhat for systems focussing on English as a foreign language, as the English-language training of these platforms is considered both wider (with a variety of pronunciation inputs in training) and deeper (a large number of these inputs). For non-English recognition, however, it is unclear how well these systems perform when dealing with a non-native speaker.

This is an important problem to understand, as poor recognition rates for user inputs have proven alienating and demotivating in previous CALL SDS systems. For example, the word error rate of the original Let's Go system on non-native speakers was 52%, more than 2.5 times that of native speakers (20.4%) 2).

In the case of this research, which relies heavily on the successful implementation of existing ASR+SDS systems, it is important to understand whether there are any significant differences between Dialogflow, Watson and LUIS in understanding non-native speakers' Japanese pronunciation, and to select the most appropriate option for maximising recognition.

Previous investigations have found it difficult to eliminate this non-native negative bias in speech recognition tools. Designers have attempted to mitigate this limitation through a thoughtful use of dialogue manager design, including effective back-channels, useful repair processes and limited dialogue trees that constrict interaction to a specific context.

Dialogue management

The design of the dialogue management aspect of the SDS is highly dependent on the desired type of SDS (Bibauw). For task-based interaction, this component typically contains the following elements (Raux):

  • Context: a description of the task to be performed, indicating the information that the system must gather from the user and in what order, what to do with this information, and which information to give to the user in return.
  • Repair process + backchanneling: a set of strategies modeling the behavior of a human speaker (e.g. asking for repetition or confirmation, nodding in affirmation as the speaker continues)

The most common approach to setting interaction context is through environment presentation and limitation of the dialogue tree. Designers attempt to set the user's interaction expectations by providing a context for the environment. For example, by presenting an image of a waiter, the user is guided to engage with the SDS in a fashion typical of a normal restaurant situation. Limitations on the dialogue tree (both by design and feasibility) also prevent the user from expanding beyond the confines of the desired interaction scenario, with out-of-context messages guided back on track by agent responses of varying specificity. These range from non-specific clarifiers - “huh?” - to the very contextually specific - “please say if you would like to order meat or fish”. These responses are not always the same: some systems increase the specificity of the clarification with repeated miscommunication or unexpected communication attempts.
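The escalating-clarification behaviour described above can be sketched as a simple lookup keyed on the number of consecutive failed turns. The prompt wordings reuse the examples from the text; the clamping logic is an assumption about how such systems might behave:

```python
# Repair prompts ordered from least to most contextually specific.
REPAIR_PROMPTS = [
    "Huh?",
    "Sorry, I didn't catch that.",
    "Please say if you would like to order meat or fish.",
]

def repair_prompt(consecutive_failures: int) -> str:
    # Escalate with each failed turn, clamping at the most specific prompt.
    index = min(consecutive_failures, len(REPAIR_PROMPTS) - 1)
    return REPAIR_PROMPTS[index]

for n in range(4):
    print(repair_prompt(n))
# First failure gets "Huh?"; repeated failures stay at the most specific prompt.
```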

The majority of CALL SDS use dialogue trees to manage the interaction with the user, with specific desired responses leading to pre-set next-steps. Some, however, have generalised interaction into a series of steps, allowing users to skip interaction steps based on their communication, creating a more naturalistic conversation process rather than following an exhaustive, pre-set script 3). The use of steps also provides a contextual response to user communication - the system is adapting its presentation to the user input, rather than just validating a response in order to continue through a pre-set route, with simple variable or entity changes. For either design, the dialogue manager is waiting to receive an expected input in order for the interaction to proceed and respond to the user with the correct contextual next-step.
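The step-based alternative to a rigid dialogue tree can be sketched as slot-filling: one utterance may satisfy several pending steps at once, letting the conversation skip ahead. The step names and keyword matching are illustrative, not taken from any cited system:

```python
# Information the system must gather, in default order.
STEPS = ["drink", "size", "milk"]

def advance(slots: dict, utterance: str) -> str:
    """Fill any steps the utterance satisfies, then ask for the next gap."""
    words = utterance.lower()
    if "coffee" in words:
        slots["drink"] = "coffee"
    if "large" in words:
        slots["size"] = "large"
    if "milk" in words:
        slots["milk"] = "yes"
    # Respond to the first still-missing step, whatever order was filled.
    for step in STEPS:
        if step not in slots:
            return f"ask_{step}"
    return "confirm_order"

slots = {}
print(advance(slots, "A large coffee please"))
# → "ask_milk" — the size step was skipped because the utterance covered it.
```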

It is unclear if any systems have included a “fuzzy” response process, in which the conversational routes are dynamically presented. For example, a waiter might be equally likely to ask first if you prefer smoking or non smoking, or to offer to take your coat. Both of these are valid first-steps for a realistic restaurant interaction, but many systems avoid this kind of complication.
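Such a “fuzzy” opening could be approximated by sampling among equally valid first steps rather than always starting the same way; the sketch below (step names illustrative) shows the restaurant example from the text:

```python
import random

# Both openings are valid first steps for a realistic restaurant interaction.
FIRST_STEPS = ["ask_smoking_preference", "offer_to_take_coat"]

def opening(rng: random.Random) -> str:
    # Dynamically pick one of the equally likely conversational routes.
    return rng.choice(FIRST_STEPS)

print(opening(random.Random()))
```

The cost, of course, is that every sampled route still needs its own downstream dialogue handling, which is presumably why many systems avoid this complication.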


There are many aspects of an SDS that could cause user inputs to fail. Failures in the ASR, in the NLU, or in the dialogue manager's attempt to match an expected response with a message from the NLU could all cause a breakdown in the interaction process. An ASR failure is almost certainly a phonetic problem on the part of the user or the ASR system; the NLU can additionally have issues with grammar and context; the dialogue manager will only experience contextual issues.

While the causes of these problems are distinct, the solution across SDS systems is consistent: prompt the user to re-provide an input that the system expects. The form that this “repair process” takes, however, varies amongst implementations. Its success also differs: in Raux's Let's Go, 40.7% of the repair prompts were “false positives” (i.e. triggered although the user utterance was grammatical and the dialogue could have proceeded correctly).

Perhaps the most thorough investigation into a CALL SDS repair system is Ayedoun et al.'s work on affective backchannels.

Raux, A., & Eskenazi, M. (2004). Using task-oriented spoken dialogue systems for language learning: potential, practical applications and challenges. In InSTIL/ICALL Symposium 2004
Ayedoun, E., Hayashi, Y., & Seta, K. (2016). Web-services based conversational agent to encourage willingness to communicate in the EFL context. The Journal of Information and Systems in Education, 15(1), 15-27.