
Predicting user mental states in spoken dialogue systems

Abstract

In this paper we propose a method for predicting the user mental state for the development of more efficient and usable spoken dialogue systems. This prediction, carried out for each user turn in the dialogue, makes it possible to adapt the system dynamically to the user needs. The mental state is built on the basis of the emotional state of the user and their intention, and is recognized by means of a module conceived as an intermediate phase between natural language understanding and the dialogue management in the architecture of the systems. We have implemented the method in the UAH system, for which the evaluation results with both simulated and real users show that taking into account the user's mental state improves system performance as well as its perceived quality.

Introduction

In human conversation, speakers adapt their message and the way they convey it to their interlocutors and to the context in which the dialogue takes place. Thus, the interest in developing systems capable of maintaining a conversation as natural and rich as a human conversation has fostered research on adaptation of these systems to the users.

For example, Jokinen [1] describes different levels of adaptation. The simplest one is through personal profiles in which the users make static choices to customize the interaction (e.g. whether they want a male or female system voice), which can be further improved by classifying users into preference groups. Systems can also adapt to the user environment, as in the case of Ambient Intelligence applications [2]. A more sophisticated approach is to adapt the system to the user's specific knowledge and expertise, in which case the main research topics are the adaptation of systems to proficiency in the interaction language [3], age [4], different user expertise levels [5] and special needs [6]. Despite their complexity, these characteristics are to some extent rather static. Jokinen [1] identifies a more complex degree of adaptation in which the system adapts to the user's intentions and state.

Most spoken dialogue systems that employ user mental states address these states as intentions, plans or goals. One of the first models of mental states was introduced by Ginzburg [7] in his information state theory for dialogue management. According to this theory, dialogue is characterized as a set of actions to change the interlocutor's mental state and reach the goals of the interaction. This way, the mental state is addressed as the user's beliefs and intentions. During the last decades, this theory has been successfully applied to build spoken dialogue systems with reasonable flexibility [8].

Another pioneering work that implemented the concept of mental state was the spoken dialogue system TRAINS-92 [9]. This system integrated a domain plan reasoner which recognized the user mental state and used it as a basis for utterance understanding and dialogue management. The mental state was conceived as a dialogue plan which included goals, actions to be achieved and constraints on the plan execution.

More recently, some authors have considered mental states as equivalent to emotional states [10], given that affect is an evolutionary mechanism that plays a fundamental role in human interaction to adapt to the environment and carry out meaningful decision making [11]. As stated by Sobol-Shikler [12], the term affective state may refer to emotions, attitudes, beliefs, intents, desires, pretending, knowledge and moods.

Although emotion is gaining increasing attention from the dialogue systems community, most research described in the literature is devoted exclusively to emotion recognition. For example, a comprehensive and updated review can be found in [13]. In this paper we propose a mental-state prediction method which takes into account both the users' intentions and their emotions, and describe how to incorporate such a state into the architecture of a spoken dialogue system to adapt dialogue management accordingly.

The rest of the paper is organized as follows. In the "Background" section we describe the motivation of our proposal and related work. The section entitled "New model for predicting the user mental state" presents in detail the proposed model and how it can be included into the architecture of a spoken dialogue system. To test the suitability of the proposal we have carried out experiments with the UAH system, which is described in "The UAH dialogue system" section together with the annotation of a corpus of user interactions. The "Evaluation methodology" section describes the methodology used to evaluate the proposal, whereas in "Evaluation results" we discuss the evaluation results obtained by comparing the initial UAH system with an enhanced version of it that adapts its behaviour to the perceived user mental state. Finally, in "Conclusions and future work" we present the conclusions and outline guidelines for future work.

Background

In traditional computational models of the human mind, it is assumed that mental processes respect the semantics of mental states, and the only computational explanation for such mental processes is a computing mechanism that manipulates symbols related to the semantic properties of mental states [14]. However, there is no universally agreed-upon description of such semantics, and mental states are defined in different ways, usually ad hoc, even though they are a shared subject of study across different disciplines.

Initially, mental states were reduced to a representation of the information that an agent or system holds internally and uses to solve tasks. Following this approach, Katoh et al. [15] proposed to use mental states as a basis to decide whether an agent should participate in an assignment according to its self-perceived proficiency in solving it. Using this approach, negotiation and workload distribution can be optimized in multi-agent systems. As the authors themselves acknowledge, their approach has no basis in communication theory; rather, the mental state stores and prioritizes features that are used for action selection. In spoken dialogue systems, however, it is necessary to establish the relationship between mental states and communicative acts.

Beun [16] claimed that in human dialogue, speech acts are intentionally performed to influence "the relevant aspects of the mental state of a recipient". The author considers that a mental state involves beliefs, intentions and expectations. Dragoni [17] followed this vision to formalize the consequences of an utterance or series of dialogue acts on the mental state of the hearer in a multi-context framework. This framework relied on a representation of mental states that covered only beliefs (representations of the real state of the world) and desires (representations of an "ideal" state of the world). Other aspects that could be considered mental states, such as intentions, had to be derived from these primitive ones.

The transitions between mental states and the situations that trigger them have also been studied from perspectives other than dialogue. For example, Jonker and Treur [18] proposed a formalism for mental states and their properties by describing their semantics in temporal traces, thus accounting for their dynamic changes during interactions. However, they only considered physical values such as hunger, pain or temperature.

In psychophysiology, these transitions have been addressed by directly measuring the state of the brain. For example, Fairclough [19] surveyed the field of psychophysiological characterization of user states, and defined mental states as a representation of the progress within a task-space or problem-space. Das et al. [20] presented a study on mental-state estimation for Brain-Computer Interfaces, where the focus was on mental states obtained from the electrocorticograms of patients with medically intractable epilepsy. In this study, mental states were defined as a set of stages which the brain undergoes when a subject is engaged in certain tasks, and brain activity was the only way for the patients to communicate due to motor disabilities.

Other authors have used dynamic actions and physical movements as a main source of information for recognizing mental states. For example, Sindlar et al. [21] used dynamic logic to model the ascription of beliefs, goals or plans on the grounds of observed actions in order to interpret other agents' actions. Oztop et al. [22] developed a computational model of mental-state inference based on the circuitry underlying motor control. In this way, the mental state of an agent could be described as the goal of a movement or the intention of the agent performing that movement. Lourens et al. [23] also carried out mental-state recognition from motor movements following the mirror neuron system perspective.

In the research described so far, affective information is not explicitly considered, although it can sometimes be represented using a number of formalisms. However, recent work has highlighted the affective and social nature of mental states. This is the case of recent psychological studies in which mental states are not restricted to beliefs, intentions or actions, but rather are considered emotional states. For example, Dyer et al. [24] presented a study on the cognitive development of mental-state understanding in children in which they found that storybook reading has a positive effect on children's awareness of mental states. The authors related English terms found in storybooks to mental states, including not only terms such as think, know or want, but also words that refer to emotion, desire, moral evaluation and obligation.

Similarly, Lee et al. [25] investigated mental-state decoding abilities in depressed women and found that they were significantly less accurate than non-depressed women at identifying mental states from pictures of eyes. They accounted for mental states as beliefs, intentions and especially emotions, highlighting their relevance for understanding behaviour. The authors also pointed out that the inability to decode and reason about mental states has a severe impact on the socialization of patients with schizophrenia, autism, psychopathy and depression.

In [26], the authors investigate the impairment derived from the inability to recognize others' mental states, as well as the impaired accessibility of certain self-states. They thus include in the concept of mental state not only terms related to emotion (happy, sad and fearful) but also terms related to personality, such as assertive, confident or shy.

Sobol-Shikler [12] shares this vision and proposes a representation method that comprises a set of affective-state groups or archetypes that often appear in everyday life. His method is designed to infer combinations of affective states that can occur simultaneously and whose level of expression can change over time within a dialogue. By affective states, the author understands moods, emotions and mental states. Although he does not provide any definition of mental state, the categories employed in his experiments do not account for intentional information.

In the area of dialogue systems, emotion has been used for several purposes, as summarized in the taxonomy of applications proposed by Batliner et al. [27]. In some application domains, it is fundamental to recognize the affective state of the user to adapt the system's behaviour. For example, in emergency services [28] or intelligent tutors [29], it is necessary to know the user's emotional state to calm them down or to encourage them in learning activities. In other application domains, emotion recognition can also play an important role in detecting stages of the dialogue that cause negative emotional states, so that they can be avoided and positive ones fostered in future interactions.

Emotions affect the explicit message conveyed during the interaction. They change people's voices, facial expressions, gestures and speech speed, a phenomenon referred to as emotional colouring [30, 31]. This effect can be of great importance for the interpretation of user input, for example, to overcome the Lombard effect in the case of angry or stressed users [32], and to disambiguate the meaning of the user utterances depending on their emotional status [33].

Emotions can also affect the actions that the user chooses to communicate with the system. According to Wilks et al. [34], emotion can be understood more widely as a manipulation of the range of interaction affordances available to each counterpart in a conversation. Riccardi and Hakkani-Tür [35] studied the impact of emotion temporal patterns in user transcriptions, semantic and dialogue annotations of the How May I help you? system. In their study, the representation of the user state was defined "only in terms of dialogue act or expected user intent". They found that emotional information can be useful to improve the dialogue strategies and predict system errors, but it was not employed in their system to adapt dialogue management.

Boril et al. [36] measured speech production variations during the interactions of drivers with commercial automated dialogue systems. They showed that cognitive load and emotional states affect the number of query repetitions required for users to obtain the information they are looking for.

Baker et al. [37] described a specific experience for the case of computer-based learning systems. They found that boredom significantly increases the chance that a student will game the system on the next observation. However, the authors do not describe any method to couple emotion and the space of afforded possible actions.

Gnjatovic and Rösner [38] implemented an adapted strategy for providing support to users depending on their emotional state while they solved the Tower-of-Hanoi puzzle in the NIMITEK system. Although the help policy was adapted to emotion, the rest of the decisions of the dialogue manager were carried out without taking into account any emotional information.

In our proposal, we merge the traditional view of the dialogue act theory in which communicative acts are defined as intentions or goals, with the recent trends that consider emotion as a vital part of mental states that makes it possible to carry out social communication. To do so, we propose a mental-state prediction module which can be easily integrated in the architecture of a spoken dialogue system and that is comprised of an intention recognizer and an emotion recognizer as explained in "New model for predicting the user mental state" section.

Delaborde and Devillers [39] proposed a similar idea to analyze the immediate expression of emotion of a child playing with an affective robot. The robot reacted according to the prediction of the child's emotional response. Although there was no explicit reference to "mental state", their approach processed the child's state and employed both the emotion and the action that the child would prefer according to an interaction profile. There was no dialogue between the children and the robot, as the user input was based mainly on non-speech cues. Thus, the actions considered in the representation of the child's state are not directly comparable to the dialogue acts that we address in this paper.

Very recently, other authors have developed affective dialogue models which take into account both emotions and dialogue acts. The dialogue model proposed by Pitterman et al. [40] combined three different submodels: an emotional model describing the transitions between user emotional states during the interaction regardless of the data content, a plain dialogue model describing the transitions between existing dialogue states regardless of the emotions, and a combined model including the dependencies between combined dialogue and emotional states. The next dialogue state was then derived from a combination of the plain dialogue model and the combined model. The dialogue manager was written in Java embedded in a standard VoiceXML application enhanced with ECMAScript. In our proposal, we employ statistical techniques for inferring user acts, which makes it easier to port the approach to different application domains. Also, the proposed architecture is modular and thus makes it possible to employ different emotion and intention recognizers, as the intention recognizer is not linked to the dialogue manager as in the case of Pitterman et al. [40].

Bui et al. [41] based their model on Partially Observable Markov Decision Processes [42] that adapt the dialogue strategy to the user actions and emotional states, which are the output of an emotion recognition module. Their model was tested in the development of a route navigation system for rescues in an unsafe tunnel in which users could experience five levels of stress. In order to reduce the computational cost required for solving the POMDP problem for dialogue systems in which many emotions and dialogue acts might be considered, the authors employed decision networks to complement POMDP. We propose an alternative to this statistical modelling which can also be used in realistic dialogue systems and evaluate it in a less emotional application domain in which emotions are produced more subtly.

New model for predicting the user mental state

We propose a model for predicting the user mental state which can be integrated in the architecture of a spoken dialogue system as shown in Figure 1. As can be observed, the model is placed between the natural language understanding (NLU) and the dialogue management phases. The model is comprised of an emotion recognizer, an intention recognizer and a mental-state composer. The emotion recognizer detects the user emotional state by extracting an emotion category from the voice signal and the dialogue history. The intention recognizer takes the semantic representation of the user input and predicts the next user action. Then, in the mental-state composition phase, a mental-state data structure is built from the emotion and intention recognized and passed on to the dialogue manager.

Figure 1 Integration of mental-state prediction into the architecture of a spoken dialogue system.
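The sketch below is a minimal illustration, not the authors' implementation, of the mental-state composition step: the label produced by the emotion recognizer and the user act predicted by the intention recognizer are merged into a single structure that is passed on to the dialogue manager. The class and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class MentalState:
    emotion: str    # e.g. "angry", "bored" or "doubtful" (neutral is decided later by the manager)
    intention: str  # predicted next user dialogue act, e.g. "Subject"

def compose_mental_state(emotion_label: str, predicted_intention: str) -> MentalState:
    """Merge the outputs of the two recognizers into the structure passed to the dialogue manager."""
    return MentalState(emotion=emotion_label, intention=predicted_intention)

state = compose_mental_state("doubtful", "Subject")
print(state)  # MentalState(emotion='doubtful', intention='Subject')
```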

An alternative to the proposed method would be to directly estimate the mental state from the voice signal, the dialogue features and the semantics of the user input in a single step. However, we have considered several phases that differentiate the emotion and intention recognizers in order to provide a more modular architecture, in which different emotion and intention recognizers can be plugged in. Nevertheless, we consider it interesting, as a guideline for future work, to compare this alternative estimation method with our proposal and check whether performance improves, and if so, how to balance it against the benefits of modularization.

The emotion recognizer

As the architecture shown in Figure 1 has been designed to be highly modular, different emotion recognizers could be employed within it. We propose to use an emotion recognizer based solely on acoustic and dialogue information because in most application domains the user utterances are not long enough for linguistic parameters to be significant for the detection of emotions. However, emotion recognizers which make use of linguistic information, such as the one in [43], can easily be employed within the proposed architecture by accepting an extra input containing the result of the automatic speech recognizer.

Our recognition method, based on the previous work described in [44], first uses acoustic information to distinguish between the emotions that are acoustically most different, and then dialogue information to disambiguate between those that are more similar.

We are interested in recognizing negative emotions that might discourage users from employing the system again or even lead them to abort an ongoing dialogue. Concretely, we have considered three negative emotions: anger, boredom and doubtfulness (where the latter refers to a situation in which the user is uncertain about what to do next).

Following the proposed approach, our emotion recognizer employs acoustic information to distinguish anger from doubtfulness or boredom and dialogue information to discriminate between doubtfulness and boredom, which are more difficult to discriminate only by using phonetic cues. This process is shown in Figure 2.

Figure 2 Schema of the emotion recognizer.
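As an illustration of this two-stage scheme, the following sketch (an assumption about the control flow, not the authors' code) takes the acoustic classifier and the dialogue-based classifier as callables; the dialogue-based second stage is sketched later in this section.

```python
from typing import Callable, List, Sequence

def recognize_emotion(acoustic_classifier: Callable[[Sequence[float]], str],
                      dialogue_classifier: Callable[[int, int, List[str]], str],
                      acoustic_features: Sequence[float],
                      depth: int, width: int,
                      previous_emotions: List[str]) -> str:
    """Two-stage recognition: acoustic information first, dialogue information for the ambiguous case."""
    label = acoustic_classifier(acoustic_features)   # "angry" or "doubtful_or_bored"
    if label == "angry":
        return "angry"
    return dialogue_classifier(depth, width, previous_emotions)  # disambiguate doubtful vs. bored
```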

As can be observed in the figure, the emotion recognizer always chooses one of the three negative emotions under study, not taking neutral into account. This is due to the difficulty of distinguishing neutral from emotional speech in spontaneous utterances when the application domain is not highly affective. This is the case of most information-providing spoken dialogue systems, for example the UAH system, which we have used to evaluate our proposal and is described in "The UAH dialogue system" section, in which 85% of the utterances are neutral. Thus, a baseline algorithm which always chooses "neutral" would have a very high accuracy (in our case 85%), which is difficult to improve upon by classifying the rest of the emotions, which are produced very subtly.

Instead of considering neutral as another emotional class, we calculate the most likely non-neutral category and then the dialogue manager employs the intention information together with this category to decide whether to take the user input as emotional or neutral, as will be explained in the "Evaluation methodology" section.

The first step for emotion recognition is feature extraction. The aim is to compute features from the speech input which can be relevant for the detection of emotion in the user's voice. We extracted the most representative features from the list of 60 features shown in Table 1. The feature selection process is carried out on a corpus of dialogues on demand, so that when new dialogues become available, the selection algorithms can be executed again and the list of representative features updated. The features are selected by majority voting of a forward selection algorithm, a genetic search, and a ranking filter, using the default values of their respective parameters provided by Weka [45].

Table 1 Features employed for emotion detection from the acoustic signal
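The following sketch illustrates the majority-voting step under the assumption that each of the three selectors (forward selection, genetic search and ranking filter, run in Weka in our setup) returns the set of feature names it keeps; the feature names in the example are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set

def majority_vote_selection(selections: Iterable[Set[str]], min_votes: int = 2) -> List[str]:
    """Keep a feature when at least min_votes of the selectors agree on it."""
    votes = Counter()
    for selected in selections:
        votes.update(selected)
    return sorted(name for name, count in votes.items() if count >= min_votes)

forward = {"pitch_mean", "energy_max", "jitter"}
genetic = {"pitch_mean", "speech_rate", "jitter"}
ranking = {"pitch_mean", "energy_max", "speech_rate"}
print(majority_vote_selection([forward, genetic, ranking]))
# ['energy_max', 'jitter', 'pitch_mean', 'speech_rate']
```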

The second step of the emotion recognition process is feature normalization, in which the features extracted in the previous phase are normalized around the user's neutral speaking style. This enables us to make more representative classifications, as it might happen that a user 'A' always speaks very fast and loudly, while a user 'B' always speaks in a very relaxed way. Some acoustic features may then have the same value when 'A' is neutral as when 'B' is angry, which would make the automatic classification fail for one of the users if the features were not normalized.

The values for all features in the neutral style are stored in a user profile. They are calculated as the most frequent values of the user's previous utterances which have been annotated as neutral. This can be done when the user logs in to the system before starting the dialogue. If the system does not have information about the identity of the user, we take the first user utterance as neutral, assuming that the user does not place the telephone call already in a negative emotional state. In our case, the corpus of spontaneous dialogues employed to train the system (the UAH corpus, to be described in "The UAH dialogue system" section) does not have login information, and thus the first utterances were taken as neutral. For the new user calls of the experiments (described in the "Evaluation methodology" section), recruited users were provided with a numeric password.
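A minimal sketch of this normalization step follows; expressing each feature as a ratio to its stored neutral value is one simple option (an assumption, since the exact operation is not detailed here), and the feature names are illustrative.

```python
from typing import Dict

def normalize_features(features: Dict[str, float],
                       neutral_profile: Dict[str, float],
                       eps: float = 1e-9) -> Dict[str, float]:
    """Express each acoustic feature relative to the user's neutral speaking style."""
    return {name: value / (neutral_profile.get(name, value) + eps)
            for name, value in features.items()}

# A fast, loud user and a relaxed user end up with comparable normalized values:
print(normalize_features({"speech_rate": 6.2, "energy_mean": 72.0},
                         {"speech_rate": 5.9, "energy_mean": 70.0}))
```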

Once we have obtained the normalized features, we classify the corresponding utterance with a multilayer perceptron (MLP) into two categories: angry and doubtful_or_bored. If an utterance is classified as angry, the emotional category is passed to the mental-state composer, which merges it with the intention information to represent the current mental state of the user. If the utterance is classified as doubtful_or_bored, it is passed through an additional step in which it is classified according to two dialogue parameters: depth and width. The precision values obtained with the MLP are discussed in detail in [44] where we evaluated the accuracy of the initial version of this emotion recognizer.

Dialogue context is considered for emotion recognition by calculating depth and width. Depth represents the total number of dialogue turns up to a particular point of the dialogue, whereas width represents the total number of extra turns needed throughout a subdialogue to confirm or repeat information. This way, the recognizer has information about the situations in the dialogue that may lead to certain negative emotions, e.g. a very long dialogue might increase the probability of boredom, whereas a dialogue in which most turns were employed to confirm data can make the user angry.

The computation of depth and width is carried out according to the dialogue history, which is stored in log files. Depth is initialized to 1 and incremented with each new user turn, as well as each time the interaction goes backwards (e.g. to the main menu). Width is initialized to 0 and is increased by 1 for each user turn generated to confirm, repeat data or ask the system for help.
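The computation of both parameters can be sketched as follows, assuming each user turn in the log is tagged with its type (confirmation, repetition, help request) and with whether the interaction went backwards; the tag names are assumptions.

```python
from typing import Iterable, Tuple

def dialogue_depth_width(turns: Iterable[dict]) -> Tuple[int, int]:
    """Depth: total turns (plus backward jumps); width: confirmation/repetition/help turns."""
    depth, width = 1, 0
    for turn in turns:
        depth += 1
        if turn.get("goes_backwards"):                   # e.g. back to the main menu
            depth += 1
        if turn.get("type") in {"confirm", "repeat", "help"}:
            width += 1
    return depth, width

log = [{"type": "query"}, {"type": "confirm"}, {"type": "repeat", "goes_backwards": True}]
print(dialogue_depth_width(log))   # (5, 2)
```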

Once these parameters have been calculated, the emotion recognizer carries out a classification based on thresholds, as schematized in Figure 3. An utterance is recognized as bored when more than 50% of the dialogue has been employed to repeat or confirm information to the system. The user can also be bored when the number of errors is low (below 20%) but the dialogue has been long. If the dialogue has been short and with few errors, the user is considered to be doubtful, because in the first stages of the dialogue it is more likely that users are unsure about how to interact with the system.

Figure 3 Emotion classification based on dialogue features (blue = depth, red = width).

Finally, an utterance is recognized as angry when the user was considered to be angry in at least one of his two previous turns in the dialogue (as with human annotation), or the utterance is not in any of the previous situations (i.e. the percentage of the full dialogue depth comprised by the confirmations and/or repetitions is between 20 and 50%).

The thresholds employed are based on an analysis of the UAH emotional corpus, which will be described in "The UAH dialogue system" section. The computation of such thresholds depends on the nature of the task for the dialogue system under study and how "emotional" the interactions can be.
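The following sketch condenses these rules; the 20% and 50% ratios come from the analysis above, whereas the definition of a "long" dialogue (here 10 turns) and the exact ordering of the checks are assumptions.

```python
from typing import List

def classify_doubtful_or_bored(depth: int, width: int,
                               previous_emotions: List[str],
                               long_dialogue: int = 10) -> str:
    """Second-stage classification of an utterance already labelled doubtful_or_bored."""
    ratio = width / depth if depth else 0.0
    if "angry" in previous_emotions[-2:]:
        return "angry"        # angry in at least one of the two previous turns
    if ratio > 0.5:
        return "bored"        # most of the dialogue spent confirming or repeating
    if ratio < 0.2:
        return "bored" if depth > long_dialogue else "doubtful"
    return "angry"            # ratio between 20% and 50%
```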

The intention recognizer

The methodology that we have developed for modelling the user intention extends our previous work in statistical models for dialogue management [46]. We define user intention as the predicted next user action to fulfil their objective in the dialogue. It is computed taking into account the information provided by the user throughout the history of the dialogue, and the last system turn.

The formal description of the proposed model is as follows. Let A_i be the output of the dialogue system (the system answer) at time i, expressed in terms of dialogue acts, and let U_i be the semantic representation of the user intention. We represent a dialogue as a sequence of pairs (system-turn, user-turn):

(A_1, U_1), (A_2, U_2), \ldots, (A_i, U_i), \ldots, (A_n, U_n)

where A_1 is the greeting turn of the system (the first dialogue turn), and U_n is the last user turn.

We refer to the pair (A_i, U_i) as S_i, which is the state of the dialogue sequence at time i. Given the representation of a dialogue as this sequence of pairs, the objective of the user intention recognizer at time i is to select an appropriate user answer U_i. This selection is a local process for each time i, which takes into account the sequence of dialogue states that precede time i and the system answer at time i. If the most likely user intention U_i is selected at each time i, the selection is made using the following maximization rule:

\hat{U}_i = \operatorname*{argmax}_{U_i \in \mathcal{U}} P(U_i \mid S_1, \ldots, S_{i-1}, A_i)

where the set \mathcal{U} contains all the possible user answers.

As the number of possible sequences of states is very large, we establish a partition in this space (i.e. in the history of the dialogue up to time i). Let UR_i be what we call the user register at time i. The user register is defined as a data structure that contains information about the concept and attribute values provided by the user throughout the previous dialogue history. The information contained in UR_i is a summary of the information provided by the user up to time i, that is, the semantic interpretation of the user utterances during the dialogue together with the information contained in the user profile.

The user profile comprises the user's:

  • Id, which he can use to log in to the system;

  • Gender;

  • Experience, which can be either 0 for novice users (first time the user calls the system) or the number of times the user has interacted with the system;

  • Skill level, estimated taking into account the level of expertise, the duration of their previous dialogues, the time that was necessary to access specific content, and the date of the last interaction with the system. A low, medium, high or expert level is assigned using these measures;

  • Most frequent objective of the user;

  • Reference to the location of all the information regarding the previous interactions and the corresponding objective and subjective parameters for that user;

  • Parameters of the user neutral voice as explained in "The emotion recognizer" section.

The partition that we establish in this space is based on the assumption that two different sequences of states are equivalent if they lead to the same UR. After applying the above considerations and establishing the equivalence relations in the histories of dialogues, the selection of the best U_i is given by:

\hat{U}_i = \operatorname*{argmax}_{U_i \in \mathcal{U}} P(U_i \mid UR_{i-1}, A_i)

To recognize the user intention, we assume that the exact values of the attributes provided by the user are not significant. They are important for accessing the databases and constructing the system prompts. However, the only information necessary to determine the user intention and their objective in the dialogue is the presence or absence of concepts and attributes. Therefore, the values of the attributes in the UR are coded in terms of three values {0, 1, 2}, where each value has the following meaning (a minimal coding sketch is given after the list):

  • 0: The concept is not activated, or the value of the attribute has not yet been provided by the user.

  • 1: The concept or attribute is activated with a confidence score that is higher than a certain threshold (between 0 and 1). The confidence score is provided during the recognition and understanding processes and can be increased by means of confirmation turns.

  • 2: The concept or attribute is activated with a confidence score that is lower than the given threshold.
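A minimal sketch of this coding, assuming the confidence score of each slot is available from the recognition and understanding modules (the 0.6 threshold is an illustrative value):

```python
def code_slot(provided: bool, confidence: float, threshold: float = 0.6) -> int:
    """Code a UR slot as 0 (not provided), 1 (reliable) or 2 (needs confirmation)."""
    if not provided:
        return 0
    return 1 if confidence >= threshold else 2

print([code_slot(False, 0.0), code_slot(True, 0.9), code_slot(True, 0.3)])  # [0, 1, 2]
```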

We propose the use of a classification process to predict the user intention following the previous equation. The classification function can be defined in several ways. We previously evaluated four alternatives: a multinomial naive Bayes classifier, an n-gram based classifier, a classifier based on grammatical inference techniques, and a classifier based on neural networks [46, 47]. The accuracy results obtained with these classifiers were respectively 88.5, 51.2, 75.7 and 97.5%. As the best results were obtained with the multilayer perceptron (MLP), we used MLPs as classifiers for these experiments, where the input layer receives the current situation of the dialogue, represented by the term (UR_{i-1}, A_i). The values of the output layer can be viewed as the a posteriori probabilities of selecting the different user intentions given the current situation of the dialogue.
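As an illustration of this classification step, the sketch below trains an MLP whose input is the coded user register concatenated with a one-hot encoding of the last system act; it uses scikit-learn as a stand-in for the original toolkit, and the training examples, act inventory and network size are all hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

SYSTEM_ACTS = ["Opening", "Confirmation-SubjectName", "Subject", "Closing"]  # illustrative subset

def encode_input(user_register, system_act):
    """Concatenate the coded UR with a one-hot encoding of the last system act."""
    one_hot = [1.0 if act == system_act else 0.0 for act in SYSTEM_ACTS]
    return np.array(list(user_register) + one_hot, dtype=float)

# Hypothetical training pairs: (UR_{i-1}, A_i) -> next user act
X = np.stack([encode_input([0] * 16, "Opening"),
              encode_input([1] + [0] * 15, "Confirmation-SubjectName")])
y = ["Subject", "Affirmation"]

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X, y)
print(clf.predict([encode_input([0] * 16, "Opening")]))        # most likely next user act
print(clf.predict_proba([encode_input([0] * 16, "Opening")]))  # posterior over user acts
```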

The UAH dialogue system

Universidad Al Habla (UAH - University on the Line) is a spoken dialogue system that provides spoken access to academic information about the Department of Languages and Computer Systems at the University of Granada, Spain [48, 49]. The information that the system provides can be classified in four main groups: subjects, professors, doctoral studies and registration, as shown in Table 2. As can be observed, the system asks the user for different pieces of information before producing a response.

Table 2 Information provided by the UAH system

A corpus of 100 dialogues was acquired with this system from student telephone calls. The callers were not recruited and their interactions with the system corresponded to real needs to obtain academic information. This resulted in a corpus of spontaneous Spanish speech dialogues with 60 different speakers. The total number of user turns was 422, and the recorded material has a duration of 150 min. In order to endow the system with the capability to adapt to the user mental state, we carried out two different annotations of the corpus: intention annotation and emotional annotation.

Firstly, we estimated the user intention at each user utterance by using concepts and attribute-value pairs. One or more concepts represented the intention of the utterance, and a sequence of attribute-value pairs contained the information about the values provided by the user. We defined four concepts to represent the different queries that the user can perform (Subject, Lecturers, Doctoral studies and Registration), three task-independent concepts (Affirmation, Negation and Not-Understood), and eight attributes (Subject-Name, Degree, Group-Name, Subject-Type, Lecturer-Name, Program-Name, Semester and Deadline). An example of the semantic interpretation of an input sentence is shown in Figure 4.

Figure 4 Example of the semantic interpretation of a user utterance with the UAH system.
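As a hypothetical illustration of this format (the sentence and values below are not taken from Figure 4), a request for information about a subject could be represented as one concept plus two attribute-value pairs:

```python
semantic_interpretation = {
    "concepts": ["Subject"],
    "attribute_values": {
        "Subject-Name": "Language Processors",   # illustrative values
        "Degree": "Computer Science",
    },
}
print(semantic_interpretation["concepts"], semantic_interpretation["attribute_values"])
```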

The labelling of the system turns is similar to the labelling defined for the user turns. To do so, 30 concepts were defined:

  • Task-independent concepts (Affirmation, Negation, Not-Understood, New-Query, Opening and Closing).

  • Concepts used to inform the user about the result of a specific query (Subject, Lecturers, Doctoral-Studies and Registration).

  • Concepts defined to ask the user for the attributes that are necessary for a specific query (Subject-Name, Degree, Group-Name, Subject-Type, Lecturer-Name, Program-Name, Semester and Deadline).

  • Concepts used for the confirmation of concepts (Confirmation-Subject, Confirmation-Lecturers, Confirmation-DoctoralStudies, Confirmation-Registration) and attributes (Confirmation-SubjectName, Confirmation-Degree, Confirmation-GroupName, Confirmation-SubjectType, Confirmation-LecturerName, Confirmation-ProgramName, Confirmation-Semester and Confirmation-Deadline).

The UR defined for the task is a sequence of 16 fields, corresponding to the four concepts (Subject, Lecturers, Doctoral-Studies and Registration) and eight attributes (Subject-Name, Degree, Group-Name, Subject-Type, Lecturer-Name, Program-Name, Semester and Deadline) defined for the task, the three task-independent concepts that the users can provide (Affirmation, Negation and Not-Understood), and a reference to the user profile.

Using the codification previously described for the information in the UR, every dialogue begins with a user register in which every value is equal to 0 in the greeting turn of the system. Each time the user provides information, it is used to update the previous UR and obtain the current one, as shown in Figure 5. If there is information available about the user's gender, usage statistics and skill level, it is incorporated into a user profile that is referenced from the user register, as explained in "The intention recognizer" section.

Figure 5 Excerpt of a dialogue with its corresponding user profile and user register for one of the turns.
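A minimal sketch of this update step, reusing the {0, 1, 2} coding described earlier (slot names, values and the confidence threshold are illustrative):

```python
def update_user_register(user_register: dict, semantic_frame: dict,
                         threshold: float = 0.6) -> dict:
    """Re-code every slot mentioned in the semantic interpretation of the current user turn."""
    updated = dict(user_register)
    for slot, (value, confidence) in semantic_frame.items():
        updated[slot] = 1 if confidence >= threshold else 2
    return updated

ur = {"Subject": 0, "Subject-Name": 0, "Degree": 0}
frame = {"Subject": ("query", 0.92), "Subject-Name": ("Language Processors", 0.41)}
print(update_user_register(ur, frame))
# {'Subject': 1, 'Subject-Name': 2, 'Degree': 0}
```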

Secondly, we assigned an emotion category to each user utterance. Our main interest was to study negative user emotional states, mainly to detect frustration because of system malfunctions. To do so, the negative emotions tagged were angry, bored and doubtful (in addition to neutral). Nine annotators tagged the corpus twice and the final emotion assigned to each utterance was the one annotated by the majority of annotators. A detailed description of the annotation of the corpus and the intricacies of the calculation of inter-annotator reliability can be found in [50].

Evaluation methodology

To evaluate the proposed model for predicting the user mental state discussed in "New model for predicting the user mental state" section, we have developed an enhanced version of the UAH system in which we have included the module shown in Figure 1.

Additionally, we have modified the dialogue manager to process mental-state information in order to reduce the impact of negative user states on the communication and the user experience, by adapting the system responses to the detected mental state. The dialogue manager tailors the next system answer to the user state by changing the help-providing mechanisms, the confirmation strategy and the interaction flexibility. The conciliation strategies adopted are, following the constraints defined in [51], straightforward and well delimited so as not to make the user lose focus on the task. They are as follows (a sketch of this adaptation policy is given after the list):

  • If the recognized emotion is doubtful and the user has changed his behaviour several times during the dialogue, the dialogue manager changes to a system-directed initiative and adds at the end of each prompt a help message describing the available options. This approach is also selected when the user profile indicates that the user is non-expert (or if there is no profile for the current user), and when his first utterances are classified as doubtful.

  • In the case of anger, if the dialogue history shows that there have been many errors during the interaction, the system apologizes and switches to DTMF (Dual-Tone Multi-Frequency) mode. If the user is assumed to be angry but the system is not aware of any error, the system's prompt is rephrased with more agreeable phrases and the user is advised that they can ask for help at any time.

  • In the case of boredom, if there is information available from other interactions of the same user, the system tries to infer from those dialogues what the most likely objective of the user might be. If the detected objective matches the predicted intention, the system takes the information for granted and uses implicit confirmations. For example, if a student always asks for subjects of a certain degree, the system can directly disambiguate a subject if it is in several degrees.

  • In any other case, the emotion is assumed to be neutral, and the next system prompt is decided only on the basis of the user intention and the user profile (i.e. considering his preferences, previous interactions and expertise level).
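The sketch below condenses this adaptation policy. It is a simplified reading of the four cases above, not the actual dialogue manager: the field names, the error-rate threshold and the returned directives are assumptions.

```python
from typing import Optional

def adapt_dialogue_strategy(emotion: str, intention: str,
                            dialogue_history: dict,
                            user_profile: Optional[dict]) -> dict:
    """Map the recognized mental state to directives used to build the next system prompt."""
    if emotion == "doubtful" and (not user_profile
                                  or user_profile.get("skill_level") == "low"
                                  or dialogue_history.get("behaviour_changes", 0) > 2):
        return {"initiative": "system", "append_help": True}
    if emotion == "angry":
        if dialogue_history.get("error_rate", 0.0) > 0.2:
            return {"apologize": True, "input_mode": "DTMF"}
        return {"rephrase": "agreeable", "offer_help": True}
    if emotion == "bored" and user_profile:
        if user_profile.get("most_frequent_objective") == intention:
            return {"confirmation": "implicit", "assume_objective": intention}
    return {"confirmation": "default"}   # neutral: rely on intention and profile only
```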

In order to evaluate the benefits of including the mental-state prediction in the system, we have employed a user simulator to gather a corpus of new dialogues that allows a more detailed study with a wider range of emotional behaviours. Additionally, we have recorded a corpus of 150 dialogues with six recruited users to evaluate the system in more realistic conditions and to gather subjective judgments about it. Figure 6 presents a schematic representation of the corpora used and the users that recorded the dialogues.

Figure 6 Scheme of the corpora used in the paper.

Evaluation with a user simulator

User simulators make it possible to generate a large number of dialogues while reducing the time and effort that would be needed for a detailed evaluation of the quality of the services provided by a dialogue system [52]. With this aim, we had previously developed a technique which we have successfully applied to the simulation of other systems in the domains of help-desk assistance, railway information, booking facilities and health-care [53, 54]. This simulator carries out the functions of the ASR (Automatic Speech Recognition) and NLU modules. An additional error simulator module is used to perform error generation and the addition of ASR confidence scores [55]. The number of errors that are introduced in the recognized sentence can be modified to adapt the error simulator module to the operation of any ASR and NLU modules.

For these experiments, we have adapted this simulator to generate simulated user intentions following the semantics of the UAH system. As in the intention recognizer, the user simulation works at the intention level, that is, the user simulator provides concepts and attributes that represent the intention of the user utterance. Additionally, we have added as a novel function the simulation of the output of the emotion recognizer. To do so, the set of possible user emotions coincides with the set described for the development of our emotion recognizer for the system (boredom, anger, doubtfulness and neutral).

To generate the emotion label for each turn of the simulated user, we employ the rule-based approach shown in Figure 7, which is based on dialogue information similar to the threshold method employed as a second step in the emotion recognizer described in "New model for predicting the user mental state" section. In each case, the method chooses randomly (with probability 0.5) between an emotion (doubtful, bored or angry) and neutral. The probability of choosing the emotion rises to 0.7 when the same emotion was chosen in the previous turn, which allows simulating moderate changes of the emotional state. Although the simulated users resemble the behaviour of the real users of the UAH corpus (the changes in the emotional state correspond to the same transitions observed in the dialogue states), they are more emotional, as the probability of neutral in the corpus was 0.85. This way, it is possible to obtain different degrees of emotional behaviour with which to evaluate the benefits of our proposal.

Figure 7 Process for emotion generation for each turn of the user simulator. (#genUtt = number of utterances generated so far in the dialogue, #grounding = number of utterances corresponding to grounding actions, avg#turns = average number of turns of the generated dialogues).
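The per-turn random choice can be sketched as follows; the selection of the candidate (non-neutral) emotion from the dialogue state is left outside the sketch and simply passed in, and the seeded random generator is only there to make the example reproducible.

```python
import random

def simulate_emotion(candidate: str, previous_emotion: str,
                     rng: random.Random = random.Random(0)) -> str:
    """Choose between the candidate emotion and neutral (probability 0.5, or 0.7 if repeated)."""
    p_emotional = 0.7 if candidate == previous_emotion else 0.5
    return candidate if rng.random() < p_emotional else "neutral"

history = ["neutral"]
for _ in range(5):
    history.append(simulate_emotion("bored", history[-1]))
print(history)
```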

A user request for closing the dialogue is selected once the system has provided the information defined in the objective(s) of the dialogue. The dialogues that fulfil this condition before a maximum number of turns are considered successful. The dialogue manager considers the dialogue unsuccessful and decides to abort it when any of the following conditions holds (a sketch of this check is given after the list):

  • The dialogue exceeds the maximum number of user turns, specified taking into account real dialogues for the task.

  • The answer selected by the dialogue manager corresponds to a query not requested by the user simulator.

  • The database query module generates an error warning because the user simulator has not provided the mandatory information needed to carry out the query.

  • The oral response generator generates an error when the selected answer involves the use of data not provided by the user simulator.
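The check below is a minimal sketch of these abort conditions; the flags are assumed to be produced by the dialogue manager, the database query module and the response generator during simulation.

```python
def dialogue_failed(n_user_turns: int, max_turns: int,
                    answer_matches_query: bool,
                    db_error: bool, generation_error: bool) -> bool:
    """A simulated dialogue is aborted when any of the four conditions above holds."""
    return (n_user_turns > max_turns
            or not answer_matches_query
            or db_error
            or generation_error)
```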

The user simulation technique was used to acquire a total of 2000 successful dialogues, both including and not including the prediction module of the mental state in the architecture of the system (i.e. 1000 dialogues using the architecture shown in Figure 1, and 1000 dialogues without including the described mental-state prediction module).

A set of 40 scenarios were manually defined to consider the different queries that may be performed by users. Two main types of scenario were specified. Scenarios of type S1 defined only one objective for the dialogue (e.g. to obtain timetable information of a specific subject). Scenarios of type S2 defined two objectives for the dialogue (e.g. to obtain timetables of a specific subject and registration deadlines for the corresponding degree).

Evaluation with real users

Additionally, we evaluated the behaviour of the mental-state version of the UAH system with six recruited users using the same set of type S1 and S2 scenarios designed for the user simulation. Four of them recorded 30 dialogues (15 scenarios with the baseline system and 15 with the mental-state system), and two of them recorded 15 dialogues (15 dialogues with the baseline or the mental-state system only). Thus, as shown in Figure 8, a total of 150 dialogues were recorded in such a way that there were two dialogues recorded per scenario, three in the case of the five most frequent scenarios of each type as observed in the UAH corpus.

Figure 8 Acquisition of dialogues with recruited users for the evaluation of our proposal.

Evaluation metrics

To compare the baseline and mental-state versions of the UAH system (with both the simulated and recruited users) we computed the mean value for the evaluation measures shown in Table 3, which we extracted from different studies [56-58]. We then used two-tailed t tests to compare the means across the different types of scenarios and users as described in [56]. The significance of the results discussed in the "Evaluation results" section was computed using the SPSS software with a significance level of 95% (see note a in the End notes).

Table 3 Evaluation measures based on the interaction parameters gathered from the dialogues of simulated and recruited users
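As an illustration of this comparison for a single interaction parameter (e.g. the number of turns per dialogue), the sketch below runs a two-tailed t test with SciPy instead of SPSS; the sample values are made up for the example.

```python
from scipy import stats

baseline_turns = [12, 15, 11, 14, 16, 13, 15, 12]       # hypothetical samples
mental_state_turns = [10, 11, 9, 12, 10, 11, 13, 10]

t_stat, p_value = stats.ttest_ind(baseline_turns, mental_state_turns)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant at 95%: {p_value < 0.05}")
```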

In addition, we asked the recruited users to complete a questionnaire to assess their subjective opinion about system performance. The questionnaire had five questions:

  • Q1: How well did the system understand you?

  • Q2: How well did you understand the system messages?

  • Q3: Was it easy to obtain the requested information?

  • Q4: Was the interaction rate adequate?

  • Q5: If the system made errors, was it easy for you to correct them?

The possible answers for the questions were: Never, Seldom, Sometimes, Usually, Always. All the answers were assigned a numeric value between one and five (in the same order as they appear in the questionnaire).

Evaluation results

Table 4 shows the comparison of the different high-level measures for the mental-state and baseline systems.

Table 4 Results of the high-level dialogue features defined for the comparison of the mental-state and UAH baseline systems

As can be observed, on the one hand the success rate of the mental-state system is higher than that of the baseline. This difference showed a significance value of 0.025 in the two-tailed t test. On the other hand, although the error correction rates also improved in absolute values with the mental-state system, this improvement was not significant in the t test. Both results are explained by the fact that we have not designed a specific strategy to improve the recognition or understanding processes and decrease the error rate; rather, our proposal for adaptation to the user mental state overcomes these problems during the dialogue once they occur. The absolute numbers in Table 4 indicate that the increment in the success rate is slightly higher for S2 dialogues than for S1 dialogues regardless of the system, but this difference between dialogue types was not significant in the test.

Regarding the number of dialogue turns, the mental-state system produced shorter dialogues (with a 0.000 significance value in the t test when compared to the number of turns of the baseline system). As shown in Table 4, this general reduction in the number of turns also holds for the longest, shortest and most frequently observed dialogues of the mental-state system. This might be because users have to explicitly provide and confirm more information using the baseline system, whereas the mental-state system automatically adapts the dialogue to the user and the dialogue history.

The baseline dialogues have a higher standard deviation (3.80), given that the number of turns per dialogue is more dispersed. The dialogues gathered with the mental-state system have a smaller deviation (3.20), since the successful dialogues are usually those which require the minimum number of turns to achieve the objective(s) predefined for both kinds of scenario.

Also, in the two types of scenario, the dialogues acquired using the simulation technique were shorter than those acquired with real users. This can be due to the restriction on the maximum number of turns per dialogue imposed in the user simulation. In addition, there were more dialogues in which the recruited users asked for more information than strictly required to optimally fulfil their scenarios.

Table 5 sets out the results regarding the percentage of different dialogues obtained. When we considered dialogues to be different only when a different sequence of user intentions was observed, the percentage was lower using the mental-state system, due to an increment in the variability of ways in which the users can provide the different data required. This is consistent with the fact that the number of repetitions of the most observed dialogues is higher for the baseline system. As can be observed in the table, this flexibility has a bigger impact in the case of the S2 scenarios, as the users must convey more information to the system. Also, recruited users seemed to benefit to a greater extent from the flexibility of the mental-state system than simulated users. This can be because of the user profile information stored in the system, which also takes into account the expertise of the user, as explained in "The UAH dialogue system" section.

Table 5 Percentage of different dialogues obtained

When emotions were also taken into account, i.e. when even with the same sequence of intentions two dialogues were considered different if the emotions observed were different, we obtained a higher percentage of different dialogues in the case of the simulated users. This is because of the more varied emotional behaviour with which the simulated users were endowed, which was one of the objectives of the user simulation, as described in the "Evaluation with a user simulator" section. However, this difference was small because our mental-state recognizer tends to classify utterances as emotional rather than neutral, as described in the section "New model for predicting the user mental state".

We have previously described the differences between both systems in terms of number of turns. Figure 9 shows that there is also a slight reduction in the number of actions per turn for the dialogues of the mental-state system (with a 0.000 significance value in the t test). S1 scenarios contain 1.3 actions per user turn instead of the 1.5 actions in the baseline dialogues, whereas for the S2 scenarios the scores are 1.4 and 1.9, respectively. This is again because the users have to explicitly provide and confirm more information using the baseline system.

Figure 9 Average number of turns per dialogue and actions per turn in the mental-state and baseline systems.

Regarding the dialogue participant activity, Figure 10 shows the ratio of user versus system actions. The dialogues of the mental-state system have a higher proportion of system actions due to a reduction of the confirmation turns (0.015 significance). Only a slight difference in the ratio of user/system answers can be observed between recruited and simulated users, and it was not significant in the t test.

Figure 10 Ratio of user versus system actions in the mental-state and baseline systems.

Regarding dialogue style and cooperativeness, the histograms in Figures 11 and 12, respectively, show the frequency of the most dominant user and system dialogue acts in the dialogues collected with the mental-state and baseline systems. On the one hand, Figure 11 shows that users need to provide less information explicitly using the mental-state system; this explains the higher proportion of queries (both differences significant at over the 98% level). It can also be observed that there are only slight differences between the values obtained for both corpora. There was a higher percentage of confirmations and questions in the corpus collected with real users due to the higher average number of turns per dialogue in this corpus.

Figure 11 Histogram of user dialogue acts in the mental-state and baseline systems.

Figure 12 Histogram of system dialogue acts in the mental-state and baseline systems.

On the other hand, Figure 12 shows that there is a reduction in the system requests when the mental-state system was used. This explains the higher proportion of the inform system action in the mental-state system. There was a significant difference between both corpora in the percentage of turns in which the user makes a request to the system. The percentage of this kind of answer was lower in the corpus acquired with real users. This can be explained by the fact that it is less likely that simulated users provide useless information; in fact, there was a lower percentage of user turns classified as "Other answers".

Additionally, we grouped all user and system actions into three categories: "goal directed" (actions to provide or request information), "grounding" (confirmations and negations) and "rest". Figure 13 shows a comparison between these categories. As can be observed, the dialogues provided by the mental-state system are of better quality, as the proportion of goal-directed actions is higher.

Figure 13 Proportion of turns of goal-directed actions, grounding actions and the rest of possible actions in the mental-state and baseline systems.

Table 6 shows the average results of the subjective evaluation carried out by the recruited users. As can be observed, both systems correctly understand the different user queries and obtain a similar evaluation regarding the perceived ease of correcting errors made by the ASR module. However, the mental-state system obtains a higher evaluation regarding the users' perceived ease in obtaining the data required to fulfil the complete set of objectives defined in the scenario, as well as regarding the suitability of the interaction rate during the dialogue.

Table 6 Results of the subjective evaluation of the mental-state and baseline systems with real users (0 = worst, 5 = best evaluation)

Conclusions and future work

In this paper we have presented a method for predicting user mental states in spoken dialogue systems. These states are defined as the combination of the user emotional state and the predicted intention according to their objective in the dialogue. We have proposed an architecture in which our method is implemented as a module comprised of an emotion recognizer and an intention recognizer. The emotion recognizer obtains the user emotional state from the acoustics of their utterance as well as the dialogue history. The intention recognizer decides the next user action and their dialogue goal using a statistical approach that relies on the previous user input and system prompt.

We have evaluated the method with the UAH spoken dialogue system, implementing the mental-state prediction module between the NLU module and the dialogue manager. Additionally, we have enhanced the UAH system to deal with the mental-state information. In order to do so, we have improved the dialogue manager to take this information into account to compute and adapt the system responses.

The evaluation was carried out using a corpus of interactions between the system and an affective user simulator, and also with the interaction of real users with the mental-state version of the system. The results show that the improved version of the system performs better in terms of duration of the dialogues, number of turns needed to succeed in the dialogue and number of confirmations and repetitions needed. Additionally, the users judged the system to be better when it could adapt its behaviour to their mental state.

As a future work we plan to annotate the emotions of the corpus collected with real users interacting with the mental-state version of the system to refine the adaptation strategies of the dialogue manager. Using this corpus we will be able to evaluate the impact of the adapted dialogue management strategies, not only on the performance of the interaction and the subjective experience of the user, but also on the emotional state of the user. This way, we will check whether the adapted strategies can guide the users out of negative emotional states. Also, the annotated corpus, augmented with new dialogues, will offer us the possibility to employ stochastic approaches for optimized dialogue strategies tailored to the user mental states.

Moreover, we are interested in studying how to evaluate and optimize the proposed user simulator. For the research presented in this paper, we have used the simulator to obtain more emotional dialogues with which to better analyze the benefits of our proposal; an evaluation of the simulator itself constitutes a challenging direction for future work.

End notes

a The degrees of freedom that SPSS employs for t tests are N - 1 in case the compared groups have the same number of samples (N), and N1 + N2 - 1 when they differ in the number of samples (N1 and N2). In these experiments, the degrees of freedom were 1,074 when comparing the baseline and mental-state system (N = 1,075) and 2,149 when comparisons were carried out between the simulated and the recruited users (N1 = 2,000 and N2 = 150, respectively).

Abbreviations

ASR: automatic speech recognition/recognizer
DTMF: dual-tone multi-frequency
MLP: multilayer perceptron
NLU: natural language understanding
POMDP: partially observable Markov decision process
UAH: Universidad al Habla (University On the Line)
UR: user register

References

  1. Jokinen K: Natural interaction in spoken dialogue systems. In Proceedings of the Workshop Ontologies and Multilinguality in User Interfaces. Crete, Greece; 2003:730-734.

  2. Ábalos N, Espejo G, López-Cózar R, Callejas Z, Griol D: A Multimodal Dialogue System for an Ambient Intelligent Application in Home Environments. Volume 6231. Lecture Notes in Artificial Intelligence; 2010:484-491.

  3. Ohkawa Y, Suzuki M, Ogasawara H, Ito A, Makino S: A speaker adaptation method for non-native speech using learners' native utterances for computer-assisted language learning systems. Speech Commun 2009,51(10):875-882. 10.1016/j.specom.2009.05.005

  4. Wolters M, Georgila K, Moore JD, Logie RH, MacPherson SE: Reducing working memory load in spoken dialogue systems. Interact Comput 2009,21(4):276-287. 10.1016/j.intcom.2009.05.009

  5. Evanini K, Hunter P, Liscombe J, Suendermann D, Dayanidhi K, Pieraccini R: Caller experience: a method for evaluating dialog systems and its automatic prediction. In Proceedings of the 2008 Spoken Language Technology Workshop (SLT 08). Goa, India; 2008:129-132.

  6. Miesenberger K, Klaus J, Zagler W, Karshmer A: Computers helping people with special needs. In Proceedings of 12th International Conference on Computers Helping People with Special Needs (ICCHP 2010). Lecture Notes in Computer Science 4061; 2010.

  7. Ginzburg J: Dynamics and the semantics of dialogue. In Logic, Language and Computation. Volume 1. Edited by: Seligman J, Westerstahl D. CSLI Publications, Stanford, CA; 1996.

  8. Jokinen K, Mc Tear MF: Spoken Dialogue Systems. Morgan and Claypool Publishers, San Rafael, CA; 2010.

  9. Traum DR: Mental state in the TRAINS-92 dialogue manager. Working notes of the AAAI Spring Symposium on Reasoning about Mental States: Formal Theories and Applications 1993, 143-149.

  10. Nisimura R, Omae S, Kawahara H, Irino T: Analyzing dialogue data for real-world emotional speech classification. In Proceedings of 9th International Conference on Spoken Language Processing (Interspeech 2006 -- ICSLP). Pittsburgh, USA; 2006:1822-1825.

  11. Callejas Z, López-Cózar R, Ábalos N, Griol D: Affective conversational agents: the role of personality and emotion in spoken interactions. In Conversational Agents and Natural Language Interaction: Techniques and Effective Practices. Edited by: Pérez-Martín D, Pascual-Nieto I. IGI Global Publishers, Hershey, PA; 2011.

  12. Sobol-Shikler T: Automatic inference of complex affective states. Comput Speech Lang 2011, 25: 45-62. 10.1016/j.csl.2009.12.005

  13. Schuller B, Batliner A, Steidl S, Seppi D: Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Commun 2011, in press.

  14. Piccinini G: Functionalism, computationalism, and mental states. Stud Hist Philos Sci 2004, 35: 811-833. 10.1016/j.shpsa.2004.02.003

  15. Katoh T, Hara H, Kinoshita T, Sugawara K, Shiratori N: Behavior of Agents Based on Mental States. In Proceedings of the 13th International Conference on Information Networking. Tokyo, Japan; 1998:199-204.

  16. Beun RJ: Mental state recognition and communicative effects. J Pragmat 1994, 21: 191-214. 10.1016/0378-2166(94)90019-1

  17. Dragoni AF: Mental states as multi-context systems. Ann Math Artif Intell 2008, 54: 265-292. 10.1007/s10472-008-9100-y

  18. Jonker CM, Treur J: A dynamic perspective on an agent's mental states and interaction with its environment. In Proceedings of the ACM first international joint conference on Autonomous agents and multiagent systems. Bologna, Italy; 2002:865-872.

  19. Fairclough SH: Fundamentals of physiological computing. Interact Comput 2009, 21: 133-145. 10.1016/j.intcom.2008.10.011

  20. Das K, Rizzuto D, Nenadic Z: Mental state estimation for brain-computer interfaces. IEEE Trans Biomed Eng 2009, 56: 2114-2122.

  21. Sindlar M, Dastani M, Meyer JJ: Mental State Ascription Using Dynamic Logic. In Proceedings of the 19th European Conference on Artificial Intelligence. Lisbon, Portugal; 2010:561-566.

  22. Oztop E, Wolpert D, Kawato M: Mental state inference using visual control parameters. Cogn Brain Res 2005, 22: 129-151. 10.1016/j.cogbrainres.2004.08.004

  23. Lourens T, van Berkel R, Barakova E: Communicating emotions and mental states to robots in a real time parallel framework using Laban movement analysis. Robotics Auton Syst 2010, 58: 1256-1265. 10.1016/j.robot.2010.08.006

  24. Dyer JR, Shatz M, Wellman HM: Young children's storybooks as a source of mental state information. Cogn Dev 2000, 15: 17-37. 10.1016/S0885-2014(00)00017-4

  25. Lee L, Harkness KL, Sabbagh MA, Jacobson JA: Mental state decoding abilities in clinical depression. J Affect Disord 2005, 86: 247-258. 10.1016/j.jad.2005.02.007

  26. Osatuke K, Stiles WB: Relationship between mental states in depression: The assimilation model perspective. Psychiatry Res 2010, in press.

  27. Batliner A, Burkhardt F, van Ballegooy M, Nöth E: A taxonomy of applications that utilize emotional awareness. In Proceedings of the 1st International Language Technologies Conference (IS-LTC 06). Ljubljana, Slovenia; 2006:246-250.

  28. Bickmore T, Giorgino T: Some novel aspects of health communication from a dialogue systems perspective. In Proceedings of AAAI Fall Symposium on Dialogue Systems for Health Communication. Washington DC, USA; 2004:275-291.

  29. Litman DJ, Forbes-Riley K: Recognizing student emotions and attitudes on the basis of utterances in spoken tutoring dialogues with both human and computer tutors. Speech Commun 2006,48(5):559-590. 10.1016/j.specom.2005.09.008

  30. Khalifa OO, Ahmad ZH, Gunawan TD: SMaTTS: Standard Malay Text to Speech System. Int J Comput Sci 2007,2(4):285-293.

  31. Acosta JC, Ward NG: Responding to user emotional state by adding emotional coloring to utterances. In Proceedings of 10th Annual Conference of the International Speech Communication Association (Interspeech 09). Brighton, United Kingdom; 2009:1587-1590.

  32. Boril H, Hansen JHL: Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments. IEEE Trans Audio Speech Lang Process 2010,28(6):1379-1393.

  33. Bosma W, Andre E: Exploiting emotions to disambiguate dialogue acts. In Proceedings of 9th International Conference on Intelligent User Interface. Funchal, Portugal; 2004:85-92.

  34. Wilks Y, Catizone R, Worgan S, Turunen M: Some background on dialogue management and conversational speech for dialogue systems. Comput Speech Lang 2011,25(2):128-139. 10.1016/j.csl.2010.03.001

  35. Riccardi G, Hakkani-Tür D: Grounding emotions in human-machine conversational systems. In Proceedings of the 1st International Conference on Intelligent Technologies for Interactive Entertainment. Madonna di Campiglio, Italy; 2005:144-154.

  36. Boril H, Sadjadi O, Kleinschmidt T, Hansen JHL: Analysis and detection of cognitive load and frustration in drivers' speech. In Proceedings of Interspeech'10. Makuhari, Chiba, Japan; 2010:502-505.

  37. Baker RSJd, D'Mello SKD, Rodrigo MMT, Graesser AC: Better to be frustrated than bored: the incidence, persistence, and impact of learners' cognitive-affective states during interactions with three different computer-based learning environments. Int J Hum-Comput Stud 2010,68(4):223-241. 10.1016/j.ijhcs.2009.12.003

  38. Gnjatovic M, Rösner D: Adaptive dialogue management in the NIMITEK prototype system. Lect Notes Comput Sci 2008, 5078: 14-25. 10.1007/978-3-540-69369-7_3

  39. Delaborde A, Devillers L: Use of non-verbal speech cues in social interaction between human and robot: emotional and interactional markers. In Proceedings of 3rd International Workshop on Affective Interaction in Natural Environments. Firenze, Italy; 2010:75-80.

  40. Pittermann J, Pittermann A, Minker W: Emotion recognition and adaptation in spoken dialogue systems. Int J Speech Technol 2010, 13: 49-60. 10.1007/s10772-010-9068-y

  41. Bui T, Poel M, Nijholt A, Zwiers J: A tractable hybrid DDN-POMDP approach to affective dialogue modeling for probabilistic frame-based dialogue systems. Nat Lang Eng 2009,15(2):273-307. 10.1017/S1351324908005032

  42. Williams JD, Young S: Partially observable Markov decision processes for spoken dialogue systems. Comput Speech Lang 2007, 21: 393-422. 10.1016/j.csl.2006.06.008

  43. López-Cózar R, Callejas Z, Kroul M, Nouza J, Silovský J: Two-level fusion to improve emotion classification in spoken dialogue systems. Lect Notes Comput Sci 2008, 5246: 617-624. 10.1007/978-3-540-87391-4_78

  44. Callejas Z, López-Cózar R: Influence of contextual information in emotion annotation for spoken dialogue systems. Speech Commun 2008,50(5):416-433. 10.1016/j.specom.2008.01.001

  45. Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco; 2005.

  46. Griol D, Hurtado LF, Segarra E, Sanchis E: A statistical approach to spoken dialog systems design and evaluation. Speech Commun 2008,50(8-9):666-682. 10.1016/j.specom.2008.04.001

  47. Griol D, Hurtado LF, Sanchis E, Segarra E: Managing Unseen Situations in a Stochastic Dialog Model. In Proceedings of AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems. Antwerp, Belgium; 2006:25-30.

  48. Callejas Z, López-Cózar R: Implementing modular dialogue systems: a case study. In Proceedings of Applied Spoken Language Interaction in Distributed Environments (ASIDE 05). Aalborg, Denmark; 2005.

  49. Callejas Z, López-Cózar R: Relations between de-facto criteria in the evaluation of a spoken dialogue system. Speech Commun 2008,50(8-9):646-665. 10.1016/j.specom.2008.04.004

  50. Callejas Z, López-Cózar R: Improving acceptability assessment for the labeling of affective speech corpora. In Proceedings of 10 Annual Conference of the International Speech Communication Association (Interspeech 09). Brighton, United Kingdom; 2009:2863-2866.

  51. Burkhardt F, van Ballegooy M, Engelbrecht KP, Polzehl T, Stegmann J: Emotion detection in dialog systems--usecases, strategies and challenges. In Proceedings of International Conference on Affective Computing and Intelligent Interaction (ACII 09). Amsterdam, The Netherlands; 2009.

  52. López-Cózar R, Callejas Z, McTear MF: Testing the performance of spoken dialogue systems by means of an artificially simulated user. Artif Intell Rev 2006,26(4):291-323. 10.1007/s10462-007-9059-9

  53. Griol D, Riccardi G, Sanchis E: A Statistical Dialog Manager for the LUNA Project. In Proceedings of 10th Annual Conference of the International Speech Communication Association (Interspeech 09). Brighton, United Kingdom; 2009:272-275.

  54. Griol D, McTear MF, Callejas Z, López-Cózar R, Ábalos N, Espejo G: A methodology for learning optimal dialog strategies. Lect Notes Artif Intell 2010, 6231: 500-507.

  55. Griol D, Hurtado LF, Sanchis E, Segarra E: Acquiring and evaluating a dialog corpus through a dialog simulation technique. In Proceedings of the 8th Annual SIGdial Meeting on Discourse and Dialogue. Antwerp, Belgium; 2007:29-42.

  56. Ai H, Raux A, Bohus D, Eskenazi M, Litman D: Comparing spoken dialog corpora collected with recruited subjects versus real users. In Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue. Antwerp, Belgium; 2007:124-131.

  57. Griol D, Callejas Z, López-Cózar R: A comparison between dialog corpora acquired with real and simulated users. In Proceedings of the 10th Annual SIGdial Meeting on Discourse and Dialogue. London, United Kingdom; 2009:326-332.

  58. Schatzmann J, Georgila K, Young S: Quantitative evaluation of user simulation techniques for spoken dialogue systems. In Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue. Lisbon, Portugal; 2005:45-54.

  59. Hansen JHL: Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition. Speech Commun 1996,20(2):151-170. 10.1016/S0167-6393(96)00050-7

  60. Ververidis D, Kotropoulos C: Emotional speech recognition: resources, features and methods. Speech Commun 2006, 48: 1162-1181. 10.1016/j.specom.2006.04.003

  61. Morrison D, Wang R, Silva LCD: Ensemble methods for spoken emotion recognition in call-centers. Speech Commun 2007,49(2):98-112. 10.1016/j.specom.2006.11.004

  62. Batliner A, Steidl S, Schuller B, Seppi D, Vogt T, Wagner J, Devillers L, Vidrascu L, Aharonson V, Kessous L, Amir N: Whodunnit--searching for the most important feature types signalling emotion-related user states in speech. Comput Speech Lang 2011,25(1):4-28. 10.1016/j.csl.2009.12.003

Acknowledgements

This research has been funded by the Spanish Ministry of Science and Innovation, under the project ASIES TIN2010-17344: Adaptation in Smart Environments and Social Networks to Assist People with Special Needs.

Author information

Corresponding author

Correspondence to Zoraida Callejas.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Callejas, Z., Griol, D. & López-Cózar, R. Predicting user mental states in spoken dialogue systems. EURASIP J. Adv. Signal Process. 2011, 6 (2011). https://doi.org/10.1186/1687-6180-2011-6

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1687-6180-2011-6

Keywords