1 Introduction

Automated dialogue systems are not new. ELIZA was developed by Joseph Weizenbaum in the mid-1960s, followed by PARRY in 1972, JabberWacky in 1988, and A.L.I.C.E. in 1995 (Bassett 2019). Many early dialogue systems were task-oriented: they enabled users to accomplish particular activities, such as booking tickets or ordering products. However, recent advances in deep learning and the availability of ‘big data’ have facilitated the development of systems that provide reasonable responses to any question or statement a human user might input, regardless of the topic. Consequently, social chatbots and Virtual Personal Assistants (VPAs) such as Siri, Cortana, and Alexa are becoming increasingly popular (Chen et al. 2017), while state-of-the-art dialogue systems such as BlenderBot3 (released in August 2022) and ChatGPT (released in November 2022) are already being used in a wide variety of applications (OpenAI 2022; Shuster et al. 2022). In a closely related development, advances in ‘affective computing’ since the 1990s have focussed attention on the way in which automated systems both interpret and manifest emotions (e.g. Picard 1997)—and this has influenced how dialogue systems are designed and trained. For instance, XiaoIce, which is still one of the most popular social chatbots, is described as ‘an Empathetic Social Chatbot’ that has the personality of an 18-year-old girl who responds in ways that are funny, reliable, sympathetic, and affectionate (Zhou et al. 2020). In a similar manner, popular VPAs have some capacity for responding to user inputs in a manner that can be perceived as empathetic, in an attempt to make their conversational interactions approach human–human interactions in both style and content. Currently, if you tell Alexa ‘I’m feeling anxious’, she responds with:

I’m sorry you’re going through this. I’ve heard that taking your mind off things can help.

Try taking a break and find something that makes you smile

By contrast, Cortana’s reply seems rather less empathetic: ‘Sorry, I’m not able to help with that’. Alexa creates the illusion of understanding something of the psychological or emotional state of the user, while Cortana does not. For convenience, in this article, all systems that are variously referred to in the technical literature as artificially-intelligent language-based dialogue systems, voice user interfaces, smart speakers, conversational agents, social chatbots, VPAs, and the like, will be grouped together as Dialogue Systems (DSs). Essentially, the systems in this category are all autonomous (i.e. they generate their responses without the real-time intervention of human operators working behind the scenes), and they receive sequences of words as inputs and produce sequences of words as outputs. Therefore, any perceived empathy they convey in their conversational turns is produced in an automated manner and is communicated linguistically.

Although DSs are already widely available and increasingly part of many people’s lives, the task of enabling them to use empathetic language more convincingly is still an emerging research topic (see Daher et al. 2022; Ma et al. 2020; Raamkumar and Yang 2022; Yalçın 2019). Such systems generally make use of complex neural networks to learn the patterns of typical human language use, and the interactions in which the systems participate are usually mediated either via interactive text-based or speech-based interfaces. These restrictions mean that most DSs cannot assess the paralinguistic or non-verbal socioemotional cues of their human users (e.g. sympathetic murmurs, arm movements, facial expressions), even though these are known to be fundamental to how humans express empathy (e.g. Poyatos 1993, 306). Nonetheless, since empathy (or its absence) can be conveyed by outputting sequences of words, whether spoken or written (as the Alexa and Cortana examples above indicate), it is possible for users to perceive state-of-the-art DSs as being more or less empathetic.

Unsurprisingly, DSs tend to be perceived as more empathetic when they emulate attested patterns of human linguistic behaviour and associated social practices. Chaves and Gerosa (2020) found that the most frequently cited benefits of social characteristics, including empathy, were the enrichment of ‘interpersonal relationships’ and increased ‘engagement’ and ‘believability’. Users have reported feeling more trusting towards systems that display empathy (e.g. Brave et al. 2005). DSs that evince empathetic behaviours often persuade users that they are engaging with a human-like entity. Consequently, empathetic systems can influence how users interact with the technology (e.g. by persuading them to build rapport with, trust in, and continue engaging with the system).

Assessing the way DSs create perceptions of empathy brings together a range of technological, psychological, and ethical considerations that merit greater scrutiny than they have received so far. Yalçın (2019), Ma et al. (2020), Daher et al. (2022), and Raamkumar and Yang (2022) offer relatively recent summaries of attempts to develop ‘empathetic’ dialogue systems, and they consider how components such as emotion-awareness, personality-awareness, and knowledge-accessibility are central to the task (e.g. Ma et al. 2020). However, there is currently no widely accepted evaluation method for determining the degree of empathy that any given system possesses (or, at least, appears to possess). Instead, different research teams use a variety of automated metrics (e.g. Perplexity, BLEU, ROUGE-L) alongside different forms of subjective human assessment, such as predefined questionnaires, second-person questionnaires, self-assessment measures, narrative engagement scales, and so on (Daher et al. 2022; Raamkumar and Yang 2022, 10–11). This diversity of evaluation practice means that, given two DSs, it is usually impossible to determine which of them conveys the greater degree of empathy in its dialogic exchanges with human users.

Acknowledging this problem, the present article provides an overview of how empathy is measured in human–human interactions and considers some of the ways it is currently measured in human–DS interactions, before presenting a novel third-person analytical framework, called the Empathy Scale for Human–Computer Communication (ESHCC), that can be used to measure perceived empathy in DSs. The scale is adapted from an existing human–human measure, the Therapist Empathy Scale (TES; Decker et al. 2014), that was originally designed for conversations involving a therapist and a patient. The measure has been altered to make it suitable for open-domain human–DS interactions—for instance, the assessment of paralinguistic gestures and non-verbal cues has been removed. It is hoped that the ESHCC will provide a much greater degree of uniformity in how perceived empathy is measured during interactions with state-of-the-art DSs.

2 Defining and measuring empathy

Ever since Edward B. Titchener introduced the word ‘empathy’ in 1909 as a translation of the German term ‘Einfühlung’ (Titchener 1909), its meaning has been discussed and debated by generations of psychotherapists, sociologists, philosophers, social neuroscientists, primatologists, developmental psychologists, clinicians, and others (see Lanzoni 2018). The many definitions vary conspicuously. For Daniel Batson and his co-authors,

[…] empathic concern is not a single, discrete emotion but includes a whole constellation [of] feelings of sympathy, compassion, softheartedness, tenderness, sorrow, sadness, upset, distress, concern, and grief (Batson et al. 2015: 260).

By contrast, Mohammadreza Hojat and his colleagues have influentially defined empathy as:

[…] a predominantly cognitive (rather than an affective or emotional) attribute that involves an understanding (rather than feeling) of experiences, concerns, and perspectives of the patient, combined with a capacity to communicate this understanding, and an intention to help. (Hojat 2016: 74)

This definition is intended to elucidate ‘a distinction between empathy and sympathy’ (Hojat 2016, 74). These two definitions are clearly not equivalent, yet as Heidi L. Maibom has astutely observed, ‘people disagree about how different the different definitions of empathy actually are’ (Maibom 2017: 1). Indeed, Judith A. Hall and Rachel Schwartz have catalogued so-called ‘promiscuous’ uses of the word ‘empathy’, noting a ‘lack of conceptual coherence and clarity’. While they do not seek to impose a single definition on all academic fields, they do recommend bypassing the term whenever possible (Hall and Schwartz 2019: 236–7). Although their study does not consider empathetic DSs specifically, their advocacy of more principled and cautious uses of technical vocabulary is just as relevant in this domain.

Despite the prevailing definitional variations, there is broad agreement that empathy encompasses various cognitive, affective, and physiological phenomena associated with the vicarious experiencing of another individual’s emotional state and/or personal condition. For example, empathetic responses can include processes of affective resonance, perspective‐taking, and emotion regulation (Grondin et al. 2019: 2). In particular, affective empathy is commonly distinguished from cognitive empathy. Essentially, the former is an affective state which arises from observing, imagining, or inferring another person’s emotional or mental state (Singer and Lamm 2009; Vignemont and Singer 2006; Walter 2012), while the latter arises from one individual identifying and understanding another person’s affective state without sharing it in any way. Cognitive empathy is therefore strongly associated with the Theory of Mind (Doherty 2008). Although it has often been suggested that these two subtypes of empathy are separable processes (e.g. Hills 2001), many researchers are convinced that the former leads to the latter (e.g. Hoffman 1987; Marshall et al. 1995; Strayer 1987): the experience of another’s emotions (i.e. affective empathy) produces a cerebral understanding of these emotions (i.e. cognitive empathy). Consequently, over many decades, numerous studies have explored (amongst other things) the evolutionary origins of empathy, its ontogenetic development, the environmental factors that influence it, and the sex- or gender-related differences that characterise its various manifestations in social situations (e.g. the perception that women are more empathetic than men). Regardless of the various theoretical stances taken in such matters, it is evident that by facilitating the sharing of experiences, needs, and desires between individuals, empathy plays a critical interpersonal role in human societies. More specifically, it can promote prosocial behaviour, inhibit aggression, and provide a foundation for care‐based morality (Batson 2009; Batson and Ahmad 2009; Baron-Cohen 2011; Decety and Svetlova 2012; Eisenberg and Eggum 2009; Eisenberg et al. 2015; Decety et al. 2018). However, human societies are associated with a range of different cultures and cultural practices, and therefore the way in which empathy manifests itself in different cultures can vary considerably (e.g. Atkins et al. 2016; Jami Yaghoubi et al. 2019).

Given empathy’s recognised importance in many different cultures, it is no surprise that many different empathy measures have been proposed over the years: the Hogan Empathy Scale (HES; Hogan 1969), the Questionnaire Measurement of Emotional Empathy (QMEE; Mehrabian and Epstein 1972), the Interpersonal Reactivity Index (IRI; Davis 1980), the Consultation and Relational Empathy Measure (CARE; Mercer et al. 2004, 2005), the Therapist Empathy Scale (TES; Decker et al. 2014), and the Jefferson Scale of Physicians’ Empathy (JSPE; Hojat et al. 2018), to name just a few. Most of these take the form of statement-based questionnaires that enable participants, or independent observers, to assess a conversation-based interaction subjectively. For example, in the IRI framework, participants must respond to 28 statements (e.g. ‘Other people’s misfortunes do not usually disturb me a great deal’) using a 5-point Likert scale (Davis 1983). Some psychologists have argued that many of these measures identify affective empathy more successfully than cognitive empathy, which has led to the introduction of alternative measures such as the Basic Empathy Scale (BES; Jolliffe and Farrington 2006). While empathetic responses clearly play an important role in many different human interactions, it has long been acknowledged that they are especially crucial in clinical scenarios where a medical professional is caring for a patient. Accordingly, many studies have examined these kinds of empathetic interactions specifically (e.g. van Dijke et al. 2020; Jütten et al. 2019; Pounds 2011; Wynn and Wynn 2006). In training situations, the aim has sometimes been to enable students of medicine or psychiatry to increase their degree of empathy. Some of the most widely-used measures for this specific task (e.g. the JSPE) are questionnaires that enable the physician or clinician being assessed to respond subjectively to given statements (e.g. ‘I try to think like my patients to render better care’) using a 7-point scale (1 = strongly disagree, 7 = strongly agree; Hojat et al. 2002, 2018). Others are patient-based (e.g. Mercer et al. 2004, 2005), and a good summary can be found in Neumann et al. (2015). In recent years, new methods have been proposed that analyse empathy by means of friending behaviours in social media activity (e.g. Xiao et al. 2016; Otterbacher et al. 2017).
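Most of these instruments are scored by summing Likert responses, with certain items reverse-scored: agreement with ‘Other people’s misfortunes do not usually disturb me a great deal’, for instance, indicates lower rather than higher empathy. The following minimal Python sketch illustrates this scoring logic; the item identifiers and scoring key are hypothetical stand-ins, not the licensed instruments themselves:

```python
# Minimal sketch of Likert-questionnaire scoring with reverse-scored items,
# as used by instruments such as the IRI. The item identifiers and scoring
# key below are hypothetical stand-ins, not the actual instrument.

LIKERT_MAX = 5  # 5-point scale: 1 = does not describe me well ... 5 = describes me very well

# (item_id, reverse_scored): for a reverse-scored item, agreement indicates
# LOWER empathy, so its value is flipped before summing.
SCORING_KEY = [
    ("item_01", False),
    ("item_02", True),   # e.g. "Other people's misfortunes do not usually disturb me..."
    ("item_03", False),
]

def score_questionnaire(responses: dict) -> int:
    """Sum the Likert responses, flipping reverse-scored items."""
    total = 0
    for item_id, reverse in SCORING_KEY:
        raw = responses[item_id]
        if not 1 <= raw <= LIKERT_MAX:
            raise ValueError(f"{item_id}: response {raw} outside 1..{LIKERT_MAX}")
        total += (LIKERT_MAX + 1 - raw) if reverse else raw
    return total

print(score_questionnaire({"item_01": 4, "item_02": 2, "item_03": 5}))  # -> 13
```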

All the proposed measures mentioned above involve subjective assessments, and they acknowledge that there are degrees of empathy: in other words, some people are more empathetic than others. This basic insight is captured by the Empathy Bell Curve (EBC) introduced by the psychopathologist Simon Baron-Cohen (2011) (Fig. 1):

Fig. 1 The Empathy Bell Curve

In this analysis, which is aimed at a non-specialist audience, Baron-Cohen divides the EBC into seven subsections, ranging from low (0) to high (6), which means that some people fall into the ‘zero empathy’ sub-category. More specifically, he places certain types of people, such as psychopaths, in this category, and suggests that they are able to cause significant harm to other people because they are unable to understand the impact of their actions. Although the classification suggests that these individuals have no empathy at all, in reality, they simply have markedly lower levels of empathy (Baron-Cohen 2011). This analysis is supported by other studies: Viding et al. (2014) described psychopathy as ‘a personality disorder characterised by lack of empathy’ (Viding et al. 2014: 871). Baron-Cohen’s placement of psychopaths in the ‘zero-empathy’ category reflects the fact that, since the 1970s at least, it has often been argued that such individuals have empathy deficits which produce their recognised characteristics of callousness, lack of guilt, shallow affect, and impulsive antisocial behaviour (e.g. Cleckley 1976). Some studies have explored the extent to which these deficits relate specifically to the affective or cognitive aspects of empathy, while others have elaborated bio-cognitive approaches (Domes et al. 2013; van Dongen 2020). Pertinently, one line of enquiry has focussed on how psychopaths are often capable of simulating empathetic responses, sometimes to persuade or manipulate others (Robinson and Rogers 2015). Pfabigan et al. (2015) found that only higher psychopathic-trait offenders were able to provide self-reports that made them appear as empathic as the experimental controls. These results indicate that a comparative lack of empathy does not necessarily result in a comparative lack of perceived empathy: some psychopaths may have a lower degree of inherent affective empathy, but they are nonetheless able to behave as if there were no deficits.

The EBC highlights some of the methodological difficulties that beset the study of empathy in humans. While it purports to offer an analytical framework for Inherent Empathy (IE, an individual’s actual empathetic capacity), all such assessments currently rely on subjective judgements, whether of the first-person, second-person, or third-person variety—and from the first-person perspective, there is an important distinction between self-perceived and self-reported empathy. Crucially, a person may perceive themselves to be deficient in affective empathy, but they may claim that they are not deficient in it (perhaps to give a good impression).

Yet even if physical diagnostic tests (e.g. MRI scans or other biometric measures) could one day quantify a person’s degree of IE, subjective assessments would remain essential. While researchers are actively seeking to formulate objective empathy measures (e.g. Bernhardt and Singer 2012; Shamay-Tsoory 2015), none of these is yet sufficiently reliable and comprehensive to be in widespread use (see Frankel 2017). This means that, in human–human interactions, IE can only be estimated indirectly, by means of subjective assessments of the degree of Perceived Empathy (PE). In this article, ‘perceived’ will mean ‘observed by the human performing the second-person or third-person analysis’. Crucially, it will not be used with reference to self-perception or self-reporting, and therefore it will never denote first-person perception. This is primarily because first-person assessments are currently contentious in the context of human–DS interactions: an assessment performed by the human user, or by a third-person human observer, is likely to be of greater analytical value than a DS’s automated self-assessment, since such systems are not yet able to reflect meaningfully upon their own perceptions, and may have been trained to respond to questions about empathy with positive answers. For example, the current version of BlenderBot3 responds (somewhat solecistically) to questions about its own empathetic state as follows:

User: Are you empathetic?

BlenderBot3: Well of course I’m [sic]. And I am also sympathetic, so if you want to chat about something, let me know!

While such responses are of interest in some ways, they are of little analytical value when seeking to quantify the degree of PE that human users associate with DSs. BlenderBot3’s garbled assertion that it is empathetic does not constitute adequate evidence that it is indeed empathetic.

3 ‘Empathy’ in dialogue systems

As the brief summary in Sect. 2 indicates, the study of empathy in humans is complex and contentious. While there are undoubtedly broad areas of agreement (e.g. the distinction between affective and cognitive empathy), there is no consensus about how empathy should be defined and measured—and the distinct conceptualisations of empathy in automated systems only add to the confusion. Therefore, the extensive theoretical work summarised above cannot be easily transferred to the domain of autonomous intelligent language-based systems that can engage in conversations with human users. For instance, in other domains of machine learning research—particularly social robotics—the phrase Artificial Empathy (AE) has been used with increasing frequency over the last decade to refer to automated systems that have been programmed and/or trained to interact socially in a manner that displays the same kinds of empathetic behaviour as humans (Asada 2015a; Stephan 2015; Paiva et al. 2017; James et al. 2018). In particular, Minoru Asada has advocated a conceptual model of AE constructed on the neuroscientific and biobehavioural foundations provided by Affective Developmental Robotics (ADR), a sub-branch of Cognitive Developmental Robotics (Asada 2015a, b, 2019). ADR seeks to replicate human affective developmental processes by means of synthetic or constructive approaches, and it emphasises the importance of physical embodiment. Crucially, it focuses on the social interaction that enables information structuring through interactions with the environment (Asada 2015a: 21). Figure 2 summarises the main stages in the developmental process that the computational models seek to approximate:

Fig. 2 The main stages of empathetic development in the AE framework advocated by Asada

This ambitious research programme is primarily concerned with creating social robots that have actually acquired some kind of IE by means of a protracted developmental process (via analogy with how humans develop empathy); and the phrase ‘Artificial Empathy’ obviously alludes to the time-honoured phrase ‘Artificial Intelligence’ (AI). Although Asada does not discuss language-related technologies overtly, presumably a social robot that had developed AE would be able to express its empathy verbally as well as physically (e.g. hugging someone to console them). However, ‘AE’ and related whimsical phrases such as ‘Heartificial Empathy’ are also used to refer to systems that lack physical embodiment and which have undergone no process of affective and cognitive development, but which are designed to mimic human-like empathy (Dial 2018). In addition, expressions such as ‘Empathy Simulation’ are sometimes used instead to refer to the ‘artificial embodiment and display of empathic behaviours in virtual or robotic agents, which are perceived by human users’ (Xiao et al. 2016: 7). It is important to mention, though, that other lines of research into ‘Artificial Empathy’ extend beyond robotics to include virtual agents of various kinds, and such systems do not necessarily involve embodiment in Asada’s sense, nor do they necessarily include developmental stages such as those outlined in Fig. 2. For example, Liu-Thompkins et al. have recently defined ‘Artificial Empathy’ as ‘the codification of human cognitive and affective empathy through computational models in the design and implementation of AI agents’ (Liu-Thompkins et al. 2022). This formulation enables the authors to consider this subtype of empathy in relation to the social customer experience in AI-driven marketing.

While the research summarised above is of obvious importance, the many different denotations of the term ‘Artificial Empathy’ introduce an unhelpful vagueness. Therefore, it is crucial to re-emphasise that the present article is exclusively concerned with widely available state-of-the-art DSs. Currently, these systems are not physically embodied, and they do not acquire (artificial) empathy during a protracted process of affective and cognitive development. Rather, the most powerful systems (e.g. BlenderBot3 and ChatGPT) are simply neural-based pre-trained transformers that have been trained in sophisticated ways (e.g. using supervised learning and/or reinforcement learning) on vast amounts of human-derived conversational data. During this process, the core mathematical models learn many of the patterns contained in the data, and consequently, the trained systems are able to generate similar patterns in similar conversational contexts. It is, in effect, an elaborate form of parroting. BlenderBot3 may seem to express empathy if you tell it you have a headache, and (if pressed) it may even mention its own experience of headaches, but a well-trained parrot could do the same, without having a personal experiential understanding of your condition. Therefore, to avoid confusion, the phrase ‘Artificial Empathy’ will not be used in this article to refer to the kind(s) of empathy users might perceive in DSs.

The phenomenon of DSs claiming that they have headaches, sleep, own pets, experience pain, and so on requires further consideration. Such responses constitute a subtype of credibility fallacy: a statement is made, yet the conditions of credibility are not satisfied as far as the interlocutor is concerned. In general, this occurs whenever DSs claim experience of a condition or state that they cannot possibly have experienced. Of course, human beings sometimes do this too, so the phenomenon is not confined to human–DS interactions. For instance, a credibility fallacy would occur if a biological male spoke about his own personal experience of period pains, or if a woman spoke about the actual death of her father to an interlocutor who knew for a fact that her father was still alive. In human interactions, this would be either a form of lying or possibly a sign of psychological disorder, but such terminology will be avoided in this article since such classifications raise issues of intentionality that are complex and contentious in relation to DSs. So, in the ensuing discussion, credibility fallacies will be understood to occur whenever a DS (or anything else) outputs a response referring to its own experience that causes the user to think ‘but I know with certainty that can’t be true!’. In the context of empathetic interactions particularly, the consequences of credibility fallacies can be twofold: they create cognitive dissonance, and they serve to trivialise the emotional states and experiences that are being discussed (Concannon et al. 2023). And the distinction between utterances that are credibility fallacies and those that are not can sometimes be quite subtle, depending on the linguistic structures used. For instance, if a DS responds to the user input ‘I can’t sleep’ with the sentence ‘Have you tried chamomile tea? Some people say it can help you sleep’, then there is no glaring credibility fallacy, since the system is simply using reported speech and is not claiming personal experience of the recommended remedy. However, a response such as ‘Have you tried chamomile tea? It often helps me when I can’t sleep’ would introduce a credibility fallacy for most users, since the current generation of state-of-the-art DSs neither sleep nor drink. This issue is important since credibility fallacies can decrease the degree of PE in human–human interactions while reducing the interlocutor’s ability to perceive the other person’s emotional state (Lee et al. 2019). Consequently, they are likely to have at least a similarly negative impact on human–DS interactions.
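The reported-speech distinction just described is concrete enough to be operationalised, at least crudely. The following minimal sketch is our own illustration rather than a validated method from the literature, and its patterns are hypothetical; it flags DS outputs that make first-person experiential claims while letting reported-speech constructions pass:

```python
import re

# Minimal illustrative heuristic (a sketch, not a validated method): flag
# DS outputs that make first-person experiential claims -- candidate
# credibility fallacies -- while letting reported-speech constructions pass.

FIRST_PERSON_EXPERIENCE = re.compile(
    r"\b(helps? me|I (?:often |usually |sometimes )?(?:sleep|drink|eat)"
    r"|my (?:father|mother|headaches?|pets?))\b",
    re.IGNORECASE,
)
REPORTED_SPEECH = re.compile(
    r"\b(some people say|I've heard|it is said)\b", re.IGNORECASE
)

def flags_credibility_fallacy(response: str) -> bool:
    """True if the response claims personal experience a DS cannot have."""
    if REPORTED_SPEECH.search(response):
        return False  # hedged/reported claims are unproblematic
    return bool(FIRST_PERSON_EXPERIENCE.search(response))

print(flags_credibility_fallacy("Some people say it can help you sleep."))  # False
print(flags_credibility_fallacy("It often helps me when I can't sleep."))   # True
```

A serious implementation would require far richer linguistic analysis, but even this crude filter captures the chamomile-tea contrast described above.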

To summarise, therefore, the following bullet points itemise some of the distinctive properties of the current generation of widely available DSs that relate most closely to the topic of empathy:

  • They are not physically embodied in a human-like manner (i.e. they do not have a corporeal form through which perception is mediated, they do not have a central nervous system, they do not have senses of taste, smell, touch, and so on).

  • They have not acquired any kind of empathy as a result of a protracted process of affective and cognitive development that approximates the manner in which humans acquire empathy.

  • They can communicate only using written or spoken inputs and outputs; therefore, the kinds of paralinguistic and non-verbal gestures that are common in face-to-face human conversations and which often convey empathy (e.g. whistling, smiling, frowning, nodding) cannot feature in conversations, other than through rough typed approximations (e.g. ‘lol’, ☺, ☹) or spoken descriptions (e.g. ‘I’m smiling now’, ‘I’m rolling my eyes’).

  • Since they are trained on human-produced data, they tend to output credibility fallacies that risk decreasing the degree of perceived empathy they inculcate in the interlocutor.

These properties place certain constraints on human–DS interactions. This means that none of the existing empathy metrics summarised in Sect. 2 (which were all designed to assess human–human interactions) can be used in an unadapted form to determine the degree of empathy being displayed by a DS. For example, the CARE measure introduced by Mercer et al. (2005) is a patient-focussed metric that requires the patient to assess the doctor in relation to statements such as:

How was the doctor really listening (paying close attention to what you were saying, not looking at the notes or the computer as you were talking)? (Mercer et al. 2005)

Such questions are largely irrelevant when an interaction between a human and a DS is being assessed. This is because the latter interactions are not specifically medical in nature (i.e. the user is not usually a patient speaking to a DS doctor). In addition, as mentioned above, the DS cannot use physical paralinguistic gestures (e.g. looking down at notes, looking at a computer screen), so assessing such things is pointless in this context.

The lack of any widely accepted empathy metric for human–DS interactions has created a scenario in which the degree of ‘empathy’ associated with DSs is quantified in markedly different ways. And this pervasive multiplicity has unfortunately fostered the conviction that ‘measuring the empathy of chatbot replies’ is a task that can be accomplished with reasonable accuracy and effectiveness (Cameron et al. 2017). Yet considerable caution is needed here: if the denotation of ‘empathy’ is uncertain in human–human interactions, it becomes even more nebulous when used to describe human–DS conversations. As mentioned above, Microsoft’s XiaoIce is explicitly described as being an ‘Empathetic Social Chatbot’—but what does that actually mean in practice? At the level of the system’s architecture, it means that an Empathy Computing Module automatically processes a given user’s input statement or query, Q, and (i) rewrites Q to its contextual version Qc by taking the dialogue context C into account, then (ii) encodes the user’s states and feelings in the query empathy vector eQ, and finally (iii) specifies the empathetic aspects of the system’s response R with the response empathy vector eR. The degree of empathy manifested by the system is measured by quantifying the ‘Conversation-turns Per Session’ (CPS) and the Number of Active Users (NAU). As Zhou et al. put it, ‘XiaoIce aims to pass a particular form of the Turing Test’, a socially, rather than functionally, motivated assessment, which they refer to as ‘the time-sharing test, where machines and humans coexist in a companion system […] If a person enjoys its companionship (via conversation), we can call the machine “empathetic”’ (Zhou et al. 2020, 3). This conceptualisation unhelpfully conflates empathy and engagement: a user may engage with the system for a long time, just as they may play a computer game all day, but that does not indicate that either is in any sense ‘empathetic’. The assumption that CPS correlates positively with engagement is not unreasonable, but to suggest that this measure of interaction duration automatically confers an empathetic status upon the DS is inaccurate. Conversation length can vary due to a number of factors, such as user identity (Leino et al. 2020) or discursive quality (Concannon et al. 2015). Similarly, simple interventions, such as asking more questions, could lead to an increase in CPS without having any impact on empathetic quality. It is extremely misleading, therefore, to use CPS and NAU as empathy measures.
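To make this critique concrete, consider how CPS and NAU are actually computed. The sketch below (our illustration, not XiaoIce’s implementation; the log fields are hypothetical) shows that both are simple usage statistics: nothing in the computation inspects the content of the dialogue, let alone its empathetic quality.

```python
from statistics import mean

# Schematic sketch (hypothetical log fields): CPS and NAU are simple usage
# statistics -- neither inspects the content of the dialogue.

sessions = [
    {"user_id": "u1", "turns": 12},
    {"user_id": "u2", "turns": 3},
    {"user_id": "u1", "turns": 30},
]

def conversation_turns_per_session(sessions) -> float:
    """CPS: the average number of conversation-turns per session."""
    return mean(s["turns"] for s in sessions)

def number_of_active_users(sessions) -> int:
    """NAU: the count of distinct users who interacted with the system."""
    return len({s["user_id"] for s in sessions})

print(conversation_turns_per_session(sessions))  # 15.0
print(number_of_active_users(sessions))          # 2
```

A trivial intervention such as asking more questions would inflate the first number without any change in empathetic quality.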

To consider, briefly, an alternative and more representative evaluation framework: Zhu et al. introduce multi-party empathetic dialogue generation (i.e. many-to-many rather than 1-to-1 dialogues), and they determine the quality of their outputs using two different kinds of metrics (Zhu et al. 2022, 303):

(Figures omitted: the automatic metrics and the human evaluation criteria used by Zhu et al. 2022.)

When assessing the ‘Empathy’ attribute in the human evaluation, the annotators must determine whether ‘the speaker of the response understands the feelings of others and fully manifests it’ (Zhu et al. 2022, 303). This guidance is considerably vaguer and less specific than the guidance given to second- and third-person assessors in human–human scenarios when well-defined empathy measures are used (e.g. the CARE measure). It is not clear why the assessment of PE in DSs should be accomplished in a far more parsimonious fashion. The guidance also raises conceptual problems, since a given user may well believe that a DS cannot really ‘understand’ anything at all, and may therefore give low scores for that reason. The problem is that the empathy measure does not clarify whether the focus is supposed to be on determining IE or PE. Even in human–human interactions we can never know for certain what the other person actually understands. We can only try to determine that, indirectly, from the responses we receive. Further, it is not clear that ‘Relevance’ and ‘Fluency’ are useful properties in this context. In human–human interactions, a response that has a high degree of affective empathy might be far from fluent. For instance, the person responding empathetically might be moved to tears, and they may use filled pauses and backchannels extensively: ‘Well, … um … I’m … I don’t know what … um … you need to … to … uh … the most important thing … um … is to … well … look after yourself’. An utterance of this kind is far from fluent, but, in the relevant context, it is highly likely to be interpreted as extremely empathetic.

These two illustrative scoring frameworks have been selected from a huge number of possibilities, but, hopefully, they are sufficient to indicate that there is currently no widely-accepted evaluation method for determining the degree of empathy human users perceive in a DS. While some claims in the published literature about high degrees of empathy are based on crude CPS and NAU counts, others present results obtained from automated metrics (such as BLEU and ROUGE-L) in addition to some kind of questionnaire-based human assessment. This lack of a shared evaluation framework is undesirable since it makes it impossible to compare and contrast, in a convincingly systematic manner, the degree of (perceived) empathy manifested by different DSs.

4 Measuring perceived empathy in human–DS interactions

As mentioned in Sect. 2, there are numerous methods for measuring PE in humans—but these are not fit for purpose when used to assess DSs. Also, as noted above, the most widely-used empathy measures involve subjective reports: participants or observers indicate alignment with a set of statements using a Likert scale, and the measures all quantify PE, whether they take the form of first-person assessments (i.e. a questionnaire completed by the individual being assessed), participant-rating second-person assessments (i.e. a questionnaire completed by the other participant about the empathy of the participant being assessed), or observer-rating third-person assessments (i.e. a questionnaire completed by a non-participant observer about the empathy of the participant being assessed). These measures therefore fall into two broad categories, depending on who completes the report: a participant-observer or a non-participant observer. Since there are no existing metrics for quantifying the degree of IE possessed by a DS, frameworks which determine the extent to which an entity is perceived as possessing empathy seem most appropriate when the performance of DSs is being analysed. Although there are notably fewer frameworks focussing on second- and third-person assessments, Hemmerdinger et al. (2007) concluded that they are more reliable than first-person frameworks—particularly in medical contexts, where the objective of improving empathetic communication is directly tied to improving patient care. In the context of DSs, the perspective of the human interlocutor, or an independent observer, must necessarily be the primary focus when assessing the extent to which an automated system is capable of engendering PE, since the current generation of DSs cannot meaningfully assess their own performance self-reflexively. In addition, since the current generation of DSs are predominantly language-based (i.e. they take typed text or transcribed speech as input and produce text or synthesised speech as output), a given system’s PE will arise almost entirely through its linguistic behaviour. It is true that aspects of its design may contribute to its seeming to be empathetic (e.g. the colours and design of the user interface, the font type used), but it is especially crucial to determine how the linguistic form and content of the system’s responses influence the degree of PE it prompts in humans.

Given the centrality of language in human dialogue, it is surprising that so few existing studies of empathetic human dialogue have focussed primarily on linguistic phenomena. Suchman et al. (1997) explore the ‘interactional sequences that constitute empathy in action’, while Pounds (2011) presents a discourse-pragmatic approach for evaluating empathy in the context of clinical communication, by examining the ‘verbal realisation of empathy’ (Pounds 2011, 139). Discourse-pragmatic approaches use recordings or transcripts as observational data to understand how different forms of empathetic behaviour are conveyed through communication. This method usefully provides more fine-grained analyses than high-level reporting-focussed measures. Also, the emphasis on the interactional consequences of particular response constructions is beneficial. For example, Suchman et al. highlight the importance of more implicit linguistic cues, referred to as ‘potential empathy opportunities’, that enable a clinician to ‘infer an emotion that has not been explicitly expressed’ (Suchman et al. 1997, 679). Doctors who miss such opportunities, directing the dialogue away from the implied emotion rather than inviting the patient to expand, are viewed as less adept or satisfactory. Consequently, how a doctor forms a response to a patient’s statement will influence the degree of PE associated with the dialogue. Pounds’s work looks even more closely at the specific linguistic constructions used to achieve some of the interactional sequences outlined in Suchman et al. (1997). For example, she examines how verbs of acknowledgement (e.g. ‘I understand/see/realise/appreciate that’) and adjectival constructions expressing understanding (e.g. ‘it is clear/apparent to me that…’) are used to demonstrate responsiveness to a potential empathy opportunity, and how uncertainty markers (e.g. hedges, modals) can be used to elicit a patient’s feelings and views (Pounds 2011, 154–155).

Shifting the focus from human–human interactions back to human–DS interactions, it is curious that there have been so few studies of the linguistic structures that DSs use to inculcate PE in users or observers. As mentioned in Sect. 3, most studies have relied on ad hoc processes or crude automatic measures that conflate empathy with other aspects of the interactions (e.g. conversation length). Fitzpatrick et al. (2017) discuss users’ perceptions of Woebot as empathetic, based on comments volunteered in free-form text entries in a questionnaire about the user’s overall experience of interacting with Woebot, while Morris et al. (2018) only asked users to rate their interactions with the automated system as being either good, ok, or bad. Zhou et al.’s problematical use of CPS and NAU has already been discussed in Sect. 3; and Rashkin et al. (2019) adopted automatic measures computed using perplexity and BLEU scores, where a gold-label response (i.e. one given by a human) is compared to that generated by the DS. While such measures have undoubtedly facilitated the development of many different language-based systems, their correlation with human judgements is known to be glaringly weak (Liu et al. 2016).
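The weakness is easy to demonstrate. In the illustrative sketch below (the example sentences are invented), NLTK’s sentence-level BLEU rewards a formulaic near-copy of the gold response far more than an empathetic paraphrase, even though a human judge might well prefer the latter:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative sketch of the gold-label comparison described above, using
# NLTK's sentence-level BLEU (the example sentences are invented).

gold = "i am so sorry to hear that . that must be really hard for you .".split()

# An empathetic paraphrase that shares few n-grams with the gold reply...
paraphrase = "that sounds really tough . i am here if you want to talk .".split()
# ...and a formulaic near-copy of the gold reply.
near_copy = "i am so sorry to hear that . that must be hard .".split()

smooth = SmoothingFunction().method1
for name, hyp in [("paraphrase", paraphrase), ("near-copy", near_copy)]:
    score = sentence_bleu([gold], hyp, smoothing_function=smooth)
    print(f"{name}: BLEU = {score:.3f}")
# The near-copy scores far higher, although a human observer might judge
# the paraphrase to be at least as empathetic.
```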

Clearly, the lack of established measures for assessing PE in DSs makes validating any claim that a given system is ‘empathetic’ extremely challenging. One reasonable response to this is to adapt an existing second- or third-person human-focussed empathy measure to provide a quantitative assessment of PE in interactions with automated systems. As far as we are aware, the only paper that has implemented this to date is Putta et al. (2022), which introduces a second-person questionnaire based on the RoPE scale proposed in Charrier et al. (2019). They use a Likert scale ranging from −3 to 3, and the prompts in the questionnaire include such things as:

Q1: The artificial agent/robot appreciates exactly how the things I experience feel to me.

Q11: The artificial agent/robot comforts me when I am upset.

Q24: The artificial agent/robot’s appearance/audio is pleasant, good, and inviting.

(Putta et al. 2022, 702)

Although this framework provides a useful starting point, there are various limitations to the approach when it is considered in relation to DSs. For instance, the second-person emphasis of the questionnaire means that each conversational interaction can only be assessed once, by the human who participated in it. Capturing second-person responses is undeniably important; the participant involved in an interaction can provide useful assessments of whether an interlocutor was perceived as empathetic. However, there are practical challenges to this in relation to designing DSs. The mathematical models and feedback loops used by many state-of-the-art DSs ensure that the very same prompt will not normally produce exactly the same response from the system, which makes it impossible to obtain multiple assessments of the same interaction, since differences in performance can occur simply by chance. Ji et al. (2022) have recently summarised the various problems that beset the formal evaluation of open-domain DSs, emphasising that this task remains an open problem due to the huge diversity of automated metrics used by different research teams as well as the difficulty of obtaining reliable and consistent human evaluations. Given this scenario, it is potentially beneficial if multiple human assessors can evaluate the same human–DS dialogue from a third-person perspective, since this helps to demonstrate, statistically, that one DS is more empathetic than another. In addition, some of the questions used by Putta et al. support multiple interpretations. In Q24 above, what does it mean for the ‘audio’ to be ‘pleasant, good, and inviting’? Does this simply mean that the signal-to-noise ratio is appropriate? Also, responses to the questions may be informed by various features of a robot’s design, from appearance to audio quality, as well as the dialogue itself. As mentioned earlier, many interactions with DSs take the form of typed inputs, and these involve neither ‘appearance’ nor ‘audio’ (unless the denotation of appearance is sufficiently stretched to include things such as the type, colour, and size of the font).
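The non-repeatability mentioned above stems from stochastic decoding: most neural DSs sample each output token from a probability distribution rather than always selecting the most probable token. A minimal sketch (toy scores and a hypothetical three-token vocabulary) illustrates why identical prompts need not yield identical replies:

```python
import numpy as np

# Minimal sketch of stochastic decoding (toy scores, hypothetical tokens):
# sampling from a softmax distribution means the same prompt can yield
# different continuations on different runs.

rng = np.random.default_rng()

tokens = ["sorry", "sad", "unfortunate"]  # hypothetical candidate tokens
logits = np.array([2.0, 1.5, 0.3])        # toy model scores

def sample_token(logits, temperature=0.9):
    """Softmax with temperature, then draw one token at random."""
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(tokens, p=p)

print([sample_token(logits) for _ in range(5)])
# e.g. ['sorry', 'sad', 'sorry', 'sorry', 'unfortunate']
```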

To overcome these limitations, the proposal in the current article is that a third-person measure for PE in DSs is preferable to a first- or second-person measure since it enables multiple individuals to assess the same human–DS interaction. Also, language-focussed questions are desirable, since language is currently the primary (usually sole) medium that enables DSs to be perceived as being empathetic. While seeking to develop a measure of this kind, it makes sense to take an existing human–human measure as a starting point. For instance, the TES (Decker et al. 2014) is an observer-focussed empathy assessment that was adapted from the Measure of Expressed Empathy (Watson 1999). It is designed to explore ‘the observable and overlapping cognitive, affective, attitudinal, and attunement aspects of therapist empathy’, and it uses high-level descriptors of therapist behaviour as assessment items, for instance: ‘a therapist provides ample opportunities for the client to explore his or her emotional reactions’ (Decker et al. 2014, 344–345). To demonstrate the feasibility of our proposal, we adapted the TES framework so that it can be used to assess the interactions humans have with DSs, enabling a non-participant observer to evaluate the empathy enacted by the system over the course of a dialogue. There are comparatively few third-person scales (as opposed to first- or second-person evaluation measures), so the TES was selected as it utilises the observer perspective and has been evaluated as one of the more reliable empathy measures (Hong and Han 2020).

The Empathy Scale for Human–Computer Communication (ESHCC) is presented in Table 1. Following the TES, each item on the scale is rated by the observer using a 7-point Likert-type scale (1 = not at all, to 7 = extensively). Assessment items were adapted so that the framework is suitable for evaluating general text-based interactions or transcripts of voice-based interactions. Lexical, textual, and syntactic features such as punctuation, emoticons, capitalisation, and words and phrases are therefore more relevant than the communication cues used in verbal and face-to-face interaction (e.g. tone of voice). Consequently, items require the observer to attend to certain linguistic features (referred to as vocabulary and syntax below) rather than qualities of delivery (e.g. ‘the therapist’s voice has a soft resonance’).

Table 1 The Empathy Scale for Human–Computer Communication (ESHCC)
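Because the ESHCC is a third-person instrument, several observers can rate the same transcript, and their ratings can then be aggregated into per-item and overall PE scores for cross-system comparison. A minimal sketch follows; the item labels are abbreviated and partly hypothetical (see Table 1 for the actual scale):

```python
import numpy as np

# Illustrative aggregation of ESHCC ratings from several third-person
# observers (item labels abbreviated and partly hypothetical; see Table 1).
# Each observer rates every item from 1 (not at all) to 7 (extensively).

ITEMS = ["Responsiveness", "Attunement", "Warmth", "Fallacy Avoidance"]

ratings = np.array([  # rows = observers, columns = items
    [5, 4, 6, 2],
    [6, 4, 5, 3],
    [5, 3, 6, 2],
])

for item, m in zip(ITEMS, ratings.mean(axis=0)):
    print(f"{item:18s} {m:.2f}")
print(f"overall PE score: {ratings.mean():.2f}")
```

Because several observers rate the same transcript, such per-item profiles can be compared statistically across systems, which is precisely what a purely second-person measure precludes.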

To give a concrete example of how the scale has been modified, in the TES the item ‘Responsiveness’ is described as follows:

A therapist shows responsiveness to the client by adjusting his or her responses to the client’s statements or nonverbal communications during the conversation. The therapist follows the client’s lead in the conversation instead of trying to steer the discussion to the therapist’s agenda or interests. (Decker et al. 2014, 15)

In the ESHCC this has been modified so that it can be used for general human–DS interactions, which means that the phrases used to denote the participants have been changed, and the reference to ‘nonverbal communications’ has been removed:

The system shows responsiveness to the interlocutor by adjusting its responses to the interlocutor’s statements during the conversation. The system follows the interlocutor’s lead in the conversation instead of trying to steer the discussion to its own agenda or interests.

Additionally, as the ESHCC framework is intended for third-party observers, the emphasis is placed on the perception of empathetic behaviour in the dialogue participants. This is most plainly signalled through the inclusion of inferential evidentials (e.g. ‘the system seems to…’, ‘the response suggests…’), which intentionally focus the evaluation on whether the DS, and the utterances it produces, create a perceptible display of behaviours that are observable in the language used, and are therefore recognisable as empathy.

While the TES is tailored to therapeutic dialogues, the ESHCC has been designed to accommodate non-clinical interactions. We retain references to ‘attunement’, which is predominantly employed in relation to therapeutic interactions, yet we expand the interpretation to include related concepts more commonly used in non-clinical settings, such as ‘alignment’ (Branigan et al. 2010). Finally, an additional item, ‘Fallacy Avoidance’, is introduced in the ESHCC. As outlined above, credibility fallacies can negatively impact PE, and they are more likely to arise in dialogues with a DS because of the inherent asymmetries stemming from the system’s lack of experiences to draw upon (Concannon et al. 2023).

5 Considerations and future work

There are a number of factors to consider when contemplating the design and application of the ESHCC. While items in the TES are designed with an optimal form of empathic interaction in mind, the ESHCC is conceived as an assessment tool for better understanding empathy in human–DS communication—that is, we are not necessarily suggesting that an interaction that scores 7 on each item in the scale denotes a preferred form of PE. Guzman and Lewis (2020) emphasise that human–DS communication is distinct from human–human communication and should be studied in a way that attends to the potential differences in how machines are conceptualised and function as communicative partners, in contrast to humans (Guzman and Lewis 2020, 76). As the ESHCC is an adaptation of a framework designed for human–human interactions, it is necessary to evaluate the extent to which the forms of empathy valued in human–human interaction persist in human–system dialogues. For example, Urakami et al. (2019) found that some forms of empathy (e.g. when the system expressed its own feelings) were more problematic for certain end-users than other forms; but they also found that individuals differed in their preferences. As the authors remarked, ‘[i]ntegrating expressions of empathy in human–machine interaction is a sensitive issue and designers must carefully choose what components of empathy are adequate depending on the situational circumstances and the targeted user group’ (Urakami et al. 2019, 11). Consequently, an empathetic utterance performed by a human may be received differently when performed by a DS. An understanding of these differences, and of the associated implications (e.g. how this influences a user’s trust in a system), is yet to be established.

While it is unclear what elements of empathetic communication users want from their DSs, greater clarity at both the conceptual and the implementation level is necessary. Nonetheless, it is possible to begin establishing the linguistic behaviours that convey PE in the specific context of human–DS interaction. Further, closer integration of discourse-pragmatic accounts of empathetic interactions could provide the foundations for a more fine-grained understanding of how empathy operates in such interactions, and for a more standardised, systematic approach to assessing the empathetic outputs of dialogue agents. The approach taken in the ESHCC requires observers to assess the entirety of a conversation. A complementary framework could be designed for turn-level analysis to offer more granular insights. Nonetheless, the ESHCC offers a form of standardisation that could provide a benchmark for cross-system comparisons.

Applying the ESHCC will inevitably be more labour-intensive than existing automated measures. However, as the inclusion of some form of human evaluation is becoming more common practice, developing a uniform approach should provide more meaningful insights. A focussed evaluation of the ESHCC items with a wider pool of annotators will help to ensure ease of use and consistency in application. Perceptions of empathy vary across individuals, so this variation will need to be reconciled with measures of inter-rater reliability and internal consistency, or addressed through more perspectivist approaches to data annotation, with the attendant limitations acknowledged. A large-scale study of the applied use of the ESHCC is the first step towards addressing these issues. The resulting corpus of conversations annotated for perceived empathy may additionally generate new knowledge to inform novel approaches to the automated measurement of empathy in DSs. While it will first need to be subjected to a rigorous validation study, the ESHCC has the potential to facilitate more informative comparisons between the PE associated with human and automated interlocutors in conversational situations.
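One of the checks mentioned above, internal consistency, is straightforward to compute once a ratings matrix is available. The following sketch computes Cronbach’s alpha over ESHCC items from toy data (observers as rows, items as columns):

```python
import numpy as np

# Sketch of one internal-consistency check mentioned above: Cronbach's alpha
# over ESHCC items, computed from an observers x items matrix of toy ratings.
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: shape (n_observers, n_items)."""
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)      # variance of each item
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of observers' totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

ratings = np.array([  # toy data: five observers, four ESHCC items
    [5, 4, 6, 5],
    [6, 5, 6, 6],
    [3, 3, 4, 3],
    [4, 4, 5, 4],
    [6, 6, 7, 6],
])
print(f"Cronbach's alpha = {cronbach_alpha(ratings):.2f}")  # high: items co-vary
```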

6 Conclusion

Empathy is undoubtedly a problematical term. Multiple non-equivalent definitions of it are regularly used by psychotherapists, sociologists, philosophers, social neuroscientists, primatologists, developmental psychologists, clinicians, computer scientists, and many others. Nonetheless, despite its daunting polysemic tendency, the term remains an important one in analyses of human–human interactions; and, ever since the 1940s, many different empathy measures have been proposed. With extensive reference to this existing body of research, this article has addressed the topic of how best to assess the degree to which automated DSs can be classified as manifesting empathy. Given recent advances in machine learning, this topic is becoming increasingly important since numerous language-based AI systems, ranging from VPAs, to social chatbots, to therapeutic dialogue systems, are described by their creators as being ‘empathetic’. In essence, the task for the system designer is to create a system that a human user, or a third-person observer, perceives as being empathetic. For current state-of-the-art DSs, it is the linguistic responses the system generates that enable PE to be assessed. Despite numerous remarkable technological advances in DS-related research and affective computing in recent years, there is currently still no single standard metric for measuring PE in human–DS interactions. Existing quantification methods either use overly reductive indicators (such as CPS and NAU) that have nothing to do with any accepted definitions of ‘empathy’, or they use automated metrics borrowed from other language technology tasks (e.g. BLEU, ROUGE-L) and supplement them with simple questionnaires (usually second-person ones) that require human assessors to focus on properties such as ‘Empathy’, ‘Relevance’, and ‘Fluency’ (despite the fact that, in human–human interactions, extremely empathetic responses can often be disfluent). This kind of evaluation framework is markedly different from how degrees of PE have been studied in human–human interactions over many decades.

Responding to this anomalous state of affairs, this article has sought to introduce greater precision into the ongoing discussions about this intricate topic by arguing that a third-person measure of the degree of PE conveyed by a system’s linguistic responses during a human–DS interaction is the most desirable kind of metric, and ideally one using a scale that has been adapted from an existing measure originally designed to assess empathy in human–human interactions. This pragmatic emphasis on third-person assessments of PE usefully avoids a considerable number of thorny technological and philosophical debates about consciousness, volition, understanding, and intentionality in relation both to empathy and to automated systems. Accordingly, the measure for human–DS communication proposed here, the ESHCC, is an adapted version of an existing observer-focussed measure, the TES, which is widely used to quantify the degree of empathy in therapeutic human–human interactions. The obvious next step will be to undertake a rigorous validation study of the measure, but, assuming the results of that study are encouraging, it is hoped that the ESHCC will provide a robust framework for assessing the extent to which a DS can be described as being ‘empathetic’, and that this, in turn, will facilitate much more meaningful cross-system comparisons.