Today I’m joined by Vera Liao, Principal Researcher at Microsoft. Vera is part of the FATE (Fairness, Accountability, Transparency, and Ethics of AI) group, and her research centers on the ethics, explainability, and interpretability of AI products. She is particularly focused on how designers design for explainability. Throughout our conversation, we focus on the importance of taking a human-centered approach to rendering model explainability within a UI, and why incorporating users during the design process informs the data science work and leads to better outcomes. Vera also shares some research on why example-based explanations tend to outperform feature-based explanations, and why traditional XAI methods like LIME and SHAP aren’t the solution to every explainability problem a user may have.
- I introduce Vera, who is Principal Researcher at Microsoft and whose research mainly focuses on the ethics, explainability, and interpretability of AI (00:35)
- Vera expands on her view that explainability should be at the core of ML applications (02:36)
- An example of the non-human approach to explainability that Vera is advocating against (05:35)
- Vera shares where practitioners can start the process of responsible AI (09:32)
- Why Vera advocates for doing qualitative research in tandem with model work in order to improve outcomes (13:51)
- I summarize the slides I saw in Vera’s deck on Human-Centered XAI and Vera expands on my understanding (16:06)
- Vera’s success criteria for explainability (19:45)
- The various applications of AI explainability that Vera has seen evolve over the years (21:52)
- Why Vera is a proponent of example-based explanations over model feature ones (26:15)
- Strategies Vera recommends for getting feedback from users to determine what the right explainability experience might be (32:07)
- The research trends Vera would most like to see technical practitioners apply to their work (36:47)
- Summary of the four-step process Vera outlines for Question-Driven XAI design (39:14)
Quotes from Today’s Episode
- “Ultimately, we have to develop explainability features by [asking] can they help people understand and evaluate it? Did it really improve people’s understanding?” — Vera Liao (04:52)
- “A lot of product development challenges come from [the fact that] this user-centered picture came too late. You already develop your model, you already choose your technique, then maybe you bring [a] designer [...] in too late and they’re only doing this very isolated task of UI design. And that really is a huge missed opportunity for UX design to shape your whole experience from the beginning, [to] choose the right model, choose the right data, right techniques.” – Vera Liao (15:25)
- “Don’t assume there’s a one-fits-all solution [to XAI]. You have to start with your user research, get the question right.”– Vera Liao (17:14)
- “Do your user research. [...] [Ask] your user, what the questions are that they want to ask to understand this product? And the question itself [informs] people’s explainability needs. [Then we can] provide the designer and product team guidance in terms of whether people have a ‘why’ question, or a ‘what if’ question and how that can be mapped to an [XAI] algorithm that you can choose from.” – Vera Liao (12:16)
- “The unfortunate news is, we have been working on this topic of Explainable AI for five or more years in research, but in a lot of end-user-facing products, there still hasn’t been a lot of success stories of explanations that [are] actually widely used and adopted.” – Vera Liao (22:11)
- “Another good benefit [that example-based explanation] offers is this design decision that you can show both the ground truth and the prediction. [Often, solutions] don’t choose to show the ground truth, but we actually find showing ground truths provides this very, very useful additional signal.” – Vera Liao (28:59)
- Re: what technical folks need to learn about designing UXs for XAI: “Think about what are the questions that may most likely emerge in your product? And then use that to drive the choices. Don’t jump directly to LIME or SHAP [when you need explainability].” – Vera Liao (39:01)
- “Human-Centered XAI: From Algorithms to User Experiences”: https://www.youtube.com/watch
- “Human-Centered XAI: From Algorithms to User Experiences” slide deck: https://designingforanalytics.com/wp-content/uploads/2023/08/VeraLiao_HCXAI_2023.pdf
- “Human-Centered AI Transparency in the Age of Large Language Models”: https://arxiv.org/abs/2306.01941
- Microsoft Research (MSR): https://www.microsoft.com/en-us/research/lab/microsoft-research-montreal/
- Personal website: https://qveraliao.com
Brian: Welcome back to Experiencing Data. This is Brian T. O’Neill and today I have Vera Liao on the line from Microsoft. How are you?
Vera: I’m great. How are you, Brian?
Brian: Doing well. You’re a principal researcher at Microsoft. Is that specifically in the UX and/or AI or both areas or do you not even get that narrow with your title, really?
Vera: We don’t have narrow titles, but I am in a group called FATE, which stands for Fairness, Accountability, Transparency, and Ethics of AI, and you’ll hear that most of my research is around the letter T, transparency, which includes topics like explainability and interpretability.
Brian: And that’s very much, today, what I wanted to jump in on because you’ve done a lot of work here. I forget how I originally found either your deck or I saw something, maybe it was a webinar, but it was called “Human-Centered XAI: From Algorithms to User Experiences”. And it’s very much mapped to just some of the mission and stuff that I’m on, particularly with my data science audience here, to drive more human-centered solutions in this AI space, not just for business reasons because it helps increase adoption and usage of these solutions, but because we need to be doing these things with a human-centered intent so we’re aware of the effects of the work that we’re doing and all of that. So, I really liked how you were, kind of, seeing that world from both the engineering and data science mindset, as well as how maybe a designer user experience professional might interpret this and actually execute explainability within an experience. So, that’s very much what I wanted to dig into today.
So, you had a slide in there, I think, that said, “Explainability should be at the core of your applications.” Can I argue against that or just give an example? You know, aren’t there examples of things, like, you know, maybe recommenders, or something like this in an e-commerce website where explanations aren’t always necessary? Like, maybe you can talk to me a little bit of why you think that that should be the core of the application?
Vera: Yeah, that’s a great question to start. So, I also come to this question from my background in human-computer interaction. That’s my research training; I did a PhD in that. Whenever we come to the interaction or interface side of an application or system, there is always a side of helping people understand what the system is and what it can do, so that people can take action. I know Explainable AI or interpretability has been sort of a buzzword in AI in the past few years, but as HCI researchers, we have been thinking about this topic for a much longer time under other related topics, like people’s mental models and how we shape users’ mental models. So, I think that component of helping people understand has always been there.
Brian: Got it. And is your mental model for explainability, what—the one that I’ve seen repeat quite frequently is this idea of model explainability versus interpretability.
Brian: I find this confusing because the words are so similar, and I’m not sure the difference is always obvious to audiences, even audiences that really understand it. Like, “Well, what do you mean, what’s the difference there?” Do you model it that way when you talk about this? And maybe you can dive into, double-click on, your interpretation of all that.
Vera: Yeah, yeah. So, that’s a question I get very frequently. I think it is indeed confusing that people use these two words differently, and if you ask different scholars, they may draw from different references, different historical backgrounds, and position these two words differently. I’m actually not a… huge fan of having that debate, because I think it’s not very productive, right? We can define the words in different ways.
Another point is that, for me, it’s really about understanding. So, whenever I start my talk, I have this slide [unintelligible 00:04:39] I’m going to take a very broad view of explainability, which is really about helping people understand, and we also highlight why this is such a human-centered topic. Ultimately, we have to develop explainability features by asking: can they help people understand? And evaluate it: did it really improve people’s understanding?
Brian: Got it. So, it sounds like you’re against that debate, and I tend to be in the same camp. I’m not sure it’s productive when it really comes down to: does someone understand this at the end, and all the human experience there, regardless of whether we’re looking at the gears turning inside the model and trying to understand them, or the thing it spit out at the end and whether or not we can understand [laugh] that. One or both, or a mix of the two, may be required in order to achieve that goal of interpretability. So.
Brian: So, in this presentation, I assume there was a presentation that went along with the deck here that I saw, but—
Brian: You advocate for a human-centered approach to explainability in these AI applications. So, what is the non-human-centered approach that is counter to what you’re trying to address? Let’s start there, and then maybe you can broadly give me the opposite [laugh]?
Vera: Yeah, that’s a great question. I think it might be also helpful to kind of talk a little bit of my personal history, how I kind of got into this topic.
Vera: So, I’m currently a researcher at MSR. Before this, I actually spent five years at IBM Research. Back in 2019, I wasn’t working on this topic at all. I kind of got into it by accident, because at that time there was a group of AI researchers at IBM who wanted to develop a toolkit called AI Explainability 360. You can still find it. I still think it’s very useful if you’re a practitioner: you can use open-source code to plug in your own model and your own data, and leverage state-of-the-art XAI algorithms.
So, I think it’s also a reflection of how the scientific community came to this topic. It started with a lot of technical, algorithmic research, developing a lot of algorithms. If you count, there are probably now more than a hundred algorithms trying to generate different kinds of explanations. I consider that not very human-centered, in the sense that these algorithms are often not driven by specific use cases, they typically don’t have a very comprehensive or rigorous evaluation of how people actually interact with them, and they often come into existence because researchers have a certain intuition that this may be a good form of explanation, right?
So, at that time, when I started working on this toolkit, working as a sort of amateur designer thinking about the user experience side, I realized we actually know very little about how practitioners want to incorporate explanations. With so many different AI applications, what kind of explanation might be useful? And we know even less about how end-users will actually use explanations. And there are two sides. One is: we have many different kinds of algorithms, right? Which one is useful for what situation?
And the other side is: how do people actually interact with them? How do people actually use them? Are there limitations, pitfalls, that we don’t know about because we don’t understand the interaction part well enough? So, that motivated me to get into the space to do what I consider human-centered studies in specific applications, studying how people interact [within the 00:08:26] application. And also, [a thread of 00:08:29] all my research is really about engaging [unintelligible 00:08:31] practitioners, talking to designers of different applications, to have a broad view of the overall design space: what are the different kinds of explanation people might need, and how do we have a user-centered way to make choices from the algorithms and also design them, right? We don’t stop at: oh, there’s this algorithm, there’s this kind of output, and people just interact with that output.
Brian: Got it. So, if I was to summarize that, are you saying the non-human-centered way to do this is basically you find an off-the-shelf or open-source, you know, XAI toolkit that, you know, here’s a way to give explainability for X need and you plug that thing in, and then you’re done [laugh]?
Vera: I think that’s a good way to put it, but of course, I don’t want to blame individual practitioners for not doing the right thing. I think—
Vera: —there are so many different factors that shape why certain algorithms become so popular and why people do it that way.
Brian: Got it. Got it. One of the things I found when I was reading about your work, especially when we talk about responsible AI, is that the number of different facets you need to consider in a solution seems quite high. And we’re talking about something that’s probabilistic; there’s this infinite number of results that can come out. And I guess one concern I would have on behalf of, say, a data product leader who’s listening to this, who may not be the person in the trenches working on how the system is going to work day-to-day, or playing quote, “The role of the designer,” so to speak, is that it seems like this could just go on infinitely.
How would I ever get something out the door that’s even possibly right? It’s kind of like, “I don’t even know where to start because there’s so much work. I can’t just plug in a tool and get the result. And this user is asking these questions, and now this next user asks those questions, and that’s completely different. I need to rejigger my model in order to even explain anything that has to do with that aspect of it.” Do you have any advice for someone who’s getting into this about where to begin, such that you can make some progress, even though it may not be done, so to speak?
Vera: Yeah, that’s a great question. I think you touch on a core challenge I’m hoping to tackle in my research. Coming from more of an academic research background, that’s something I myself am trying to learn. If you think about people working in the ethics domain, of course, there are so many high-level principles, and we encourage people to treat this as a socio-technical problem; we have to really study the domain, study people really carefully. So, coming from a research background, we tend to do things very slowly. When I started doing this kind of work, engaging more with practitioners, I came to understand that a lot of the time, the challenge on the ground is this tension: we want to think about the ethical issues, we want to think about the risks and limitations, but on the other hand, we have to put something out the door; we have time and resource constraints, right?
So, part of my research, where I hope to bridge this, is to do more of what I call translational research, where I think about research principles or more theoretical aspects of things, but ask how we translate them into more practical tools, right? In the domain of Explainable AI, for example, we’ve worked on developing a more human-centered design process. Yes, there are so many different algorithms we need to think about, many different needs, but the idea for that work is that we want to make it a process that’s relatively easy to follow. For example, start with your user research, and you can add something like a lightweight exercise of asking your users: what are the questions you want to ask to understand this product? The question itself becomes a probe, or north star, that grounds people’s explainability needs, and then we also provide the designer and product team guidance in terms of whether people have a “why” question or a “what if” question, and how that can be mapped to an algorithm you can choose from. So, that makes the discussion between designers and data scientists easier when picking the right kind of algorithm.
But still, we hope to really foreground doing your user research first, getting the user questions and user needs right, and using that to drive the whole design and development process. But that’s just one example. I think a lot of people in the field are also thinking about other kinds of design processes and checklists, as well as specific tools or artifacts, to hopefully help practitioners do this kind of work a little more easily. But again, I think that’s just one piece of the whole picture. There are many other challenges, as you mentioned, like organizational challenges. One question I’m interested in is how to encourage organizations to place design in a more central place when we think about responsible AI and mitigating the risks [unintelligible 00:13:49] AI.
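[Editor's note] The question-driven process Vera describes can be sketched as a simple lookup from question types gathered in user research to candidate XAI technique families. This is a minimal, illustrative sketch: the specific question categories and technique pairings below are the editor's rough interpretation, not an official taxonomy from Vera's papers.

```python
# A rough sketch of the question-driven idea: collect the questions users
# actually ask, then map each question type to candidate XAI technique
# families to discuss between designers and data scientists. The pairings
# below are illustrative assumptions, not an official mapping.
QUESTION_TO_TECHNIQUES = {
    "how":         ["global feature importance", "decision rules", "model documentation"],
    "why":         ["local feature importance (e.g., SHAP/LIME)", "example-based explanation"],
    "why not":     ["contrastive explanation", "counterfactual example"],
    "what if":     ["counterfactual / sensitivity analysis"],
    "data":        ["training data summary", "data provenance"],
    "performance": ["accuracy by segment", "known failure modes"],
}

def candidate_techniques(user_questions):
    """Given question types surfaced by user research, return a
    de-duplicated, ordered list of candidate techniques."""
    seen, out = set(), []
    for q in user_questions:
        for t in QUESTION_TO_TECHNIQUES.get(q, []):
            if t not in seen:
                seen.add(t)
                out.append(t)
    return out

# e.g., interviews surfaced mostly "why" and "what if" questions:
techs = candidate_techniques(["why", "what if"])
```

The point of the sketch is the direction of the workflow: the user's questions come first, and the algorithm choice falls out of them, rather than starting from LIME or SHAP and working backwards.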
Brian: Got it. Is it safe to assume that you advocate for doing this type of qualitative research, like asking what questions an end-user would ask about the solution, fairly upstream, ideally before we’ve entirely baked the model—
Brian: —development and all of that? It should be happening in tandem with the model work, is that correct?
Vera: Yeah. So, there are two pieces to it that I think are specific to explainability. There is a whole body of social science and HCI research basically suggesting that the goals people have for an explanation can be expressed as questions, right? A why question, a what if question, and a how question each require a different answer. So, we build on those theories to use questions as a kind of north star representing the different kinds of needs people have for explainability.
So, in some cases, you may realize your users’ primary question is actually a why question, but in other cases, you realize their primary question is about the data instead of the actual model decision or prediction. That gives you a way, without looking at the technical details, without even developing the model or a prototype, to get to the most prominent explanation needs for your particular user and your particular use case. There’s a broader motivation here as well that you touched on: I think a lot of product development challenges come from this user-centered picture coming too late. You already developed your model, you already chose your technique, and then maybe you bring in a designer (when I say designer, I use it broadly: UX practitioner, UX researcher, anyone doing the user experience side of things) too late, and they’re only doing this very isolated task of UI design. And that really is a huge missed opportunity for UX design to shape your whole experience from the beginning: choosing the right model, the right data, the right techniques.
Brian: Got it, I’m going to try to summarize one of the slides that you had. So, you talked about, like, who uses XAI, and you kind of model this into five personas: you have your developers and quote, “Makers” of the solution, decision-makers, impacted groups, business owners, and regulatory bodies. But instead of, like, designing for one of those, it sounds like what you’re recommending is you’re looking at where do specific questions that each of these groups ask overlap, and maybe that’s a hint about where to start. So, if everyone’s asking what are the limitations of the predictions that could come out of this system, and if we find out that everyone is interested in that, then maybe that’s a place where we put design and data science effort into the solution as early as possible. Is that a safe summary of what I interpreted from your slides?
Vera: That’s a really good point that could feed into future research [laugh].
Vera: We haven’t looked at exactly which questions overlap most frequently, but the point was also to highlight that there are different personas; they may want different kinds of explanations, they may ask different questions. So, it’s more of a motivator: don’t assume there’s a one-size-fits-all solution. You have to start with your user research and get the question right. But that’s a really good point. If we get a chance to do this kind of research for different personas, we may discover some overlap, but I’m sure there are also differences among these personas.
Brian: Sure. I mean, I’m again thinking about this from the perspective of the data product manager (I broadly talk about these as data products), somebody who has limited resources and time, and they’re like, if I spent all my time fully baking this thing as explainable, we could be here for 15 years trying to get it right. So, it’s a classic product management issue of prioritization. One input might be how frequently a question is asked, versus, you know, what’s the business impact if we can’t explain this to a regulatory body? Oh well, that’s actually a mega lawsuit, mega money, big-time problems; if we can’t explain it, we’re also breaking the law. So, that one’s going to have to come first, or whatever. Do you see it this way, as a kind of classic product management prioritization challenge to figure out what’s going to get the love [laugh] and win?
Vera: Yeah, that would definitely be super helpful. Again, we haven’t really done the empirical research to look at different kinds of personas. And another point I want to highlight is that thinking about personas alone often isn’t enough. People at different usage points may also have different questions.
A lot of my own work is in this area of decision support; I would say a lot of business AI is about supporting some end-user to make some kind of decision. Even within that kind of end-user, a particular group you consider decision-makers, we constantly see that at the beginning, when they’re just starting to use this kind of system, they often ask sort of how questions: how does the model work, kind of global explanations. They want to ask about the data, they want to ask about performance, so that they can form a general mental model of how this thing works. But as they start to interact with specific decisions, their demand for explanation can drop; it often reappears when the model makes a mistake or makes a recommendation that surprises them. At that point, they may ask why, or why not? Why not the one I expected? So, you see these needs also vary by usage point.
Brian: Got it. Got it. I just wanted to compliment you. I really liked the slide you called, “Explanatory Goals Expressed as Questions.” And it’s just a three-column table. The first column is ‘task goal.’ For example, “My goal is to debug the model.” The second column is, “Well, who does that task?” Well, in that case, domain experts and model developers. And the third column is example questions that that persona might ask about the AI. “Is the AI’s performance good enough? How does the AI make predictions? How might it go wrong?”
I really like this focus on tasks and activities from a human standpoint, and then you work it backwards to something broad, like debugging the model or making an informed decision. Because when that’s a requirement, it’s so vague; you can make an argument that whatever feature development you did sort of helps with making a decision, but it’s so big and vague that it doesn’t really help the team make progress. So, I really liked how you broke this down into very specific persona-based questions. And that becomes success criteria for the explainability, right? Can they understand if the performance is good enough? Well, what metrics do they need to decide about the performance there? It makes it very concrete. So, I really like that mapping.
Vera: Yeah, thank you. So, that’s also an illustration of why questions are useful: the question is really the most fine-grained grounding here, and it’s something you can get from direct user research, right? It may be hard to narrow down, “Oh, I have this group of users. What kind of explanation do they need?” But it’s very easy to go to someone and say, “What kind of questions do you have?” And then you can ask ten people, gather their questions, and do some analysis that really helps you identify which questions your users, at this particular usage point, are most likely to ask.
Brian: No, I think that’s a great way to inform this idea of you’re going out to ask questions about questions [laugh].
Vera: Yeah [laugh].
Brian: Not to get answers—
Brian: —but to get questions. And answers will come later [laugh].
Vera: Yes. That hopefully your explanation will help address.
Brian: Yeah, yeah. In the years you’ve been working on this, are you seeing changes in what end-users may need or want from explainability of applications, or does this tend to just keep continuing, maybe because we’re, I don’t know, fairly at an infancy with some of this? Like, are you seeing any change in that?
Vera: So, I think the unfortunate news is, we have been working on this topic of Explainable AI for five or more years in research, but if you think about it, in a lot of end-user-facing products, there still haven’t been a lot of success stories of explanations that are actually widely used and widely adopted. My view is that, at the current time, it’s probably a more prominent feature in analyst-facing applications, where the explanation can be used as part of the analytical features. It’s also used in some of the debugging tools for data scientists to look at: is the model wrong? Why is the model wrong?
But for end-users, you may think of things like Amazon telling you why it’s giving you a particular recommendation, but I think recommender systems are a special case that has been receptive to explanation for a longer time. Other than that, you actually don’t see a lot of end-user-facing examples. My personal view is that there are several reasons why. In my own research, one thread is actually to show that this very popular feature-importance explanation is not very useful for end-user decision-making. It’s a pretty robust finding, which we and other researchers in the field have found: it has this pitfall of making people over-rely on AI. If the AI gives you a wrong prediction, showing a feature-importance explanation, versus not showing an explanation, actually increases people’s tendency to blindly follow the model.
So, at least showing feature importance as-is, this very precise, quantified information, is not very useful. One reason is that it’s not easy for people to reason about, especially lay users. For example, you may have some intuition that says, “Feature A is important,” but it’s actually hard for you to say, “Feature A should be two shades darker than Feature B,” or, “Feature A should be 10% more important than Feature B.” And when people are confronted with this kind of very precise, quantified explanation, they tend to get confused. So, that’s one reason.
Another reason is that, with this kind of very quantified, precise explanation, a lot of the time people don’t actually engage with it carefully, right? They may say, “Oh, this model is [unintelligible 00:24:49] and I’m just going to follow it.” So, that’s one fundamental challenge: we don’t know if it’s useful. We have some research showing other kinds of explanation might be a bit more helpful, but it’s very much still an active research area. And another challenge is this question of, for example, the most popular algorithms, like SHAP or LIME, being these kinds of post-hoc explanations, right?
So, you have a deep-learning model that’s a total black box. It cannot directly give you an explanation; even if I expose the underlying model process, it’s not directly explainable. So, some of the very popular algorithms actually build a simplified model, let’s say a linear model in a local region, and use that simplified model to approximate the black-box model. That raises the question of whether the explanation itself is faithful. There are ways to quantify this, to give you a counter-example showing that an explanation is not faithful enough. But still, there’s a question of what the threshold of acceptable faithfulness is, and what the impact is in critical decision-making domains: can end-users actually rely on this kind of explanation? These are still pretty much open questions in research, I feel.
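[Editor's note] The local-surrogate idea Vera describes (the approach behind LIME-style methods) can be sketched in a few lines: sample perturbations around one input, query the black box, and fit a proximity-weighted linear model whose coefficients serve as local feature importances. The black-box function and the instance below are made up for illustration; this is a minimal sketch of the concept, not a substitute for the actual LIME library.

```python
import numpy as np

# Hypothetical black-box model: the "true" logic is nonlinear and hidden
# from the explainer (e.g., a score computed from two input features).
def black_box_predict(X):
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 0.5 * X[:, 1] ** 2)))

def local_surrogate_explanation(instance, predict_fn, n_samples=5000, scale=0.1, seed=0):
    """LIME-style sketch: sample perturbations near `instance`, then fit a
    proximity-weighted linear model that approximates the black box locally."""
    rng = np.random.default_rng(seed)
    X = instance + rng.normal(0.0, scale, size=(n_samples, instance.size))
    y = predict_fn(X)
    # Weight samples by closeness to the instance (Gaussian kernel)
    d2 = ((X - instance) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * scale ** 2))
    # Weighted least squares with an intercept column: solve (A^T W A) c = A^T W y
    A = np.hstack([np.ones((n_samples, 1)), X])
    Aw = A * w[:, None]
    coef, *_ = np.linalg.lstsq(Aw.T @ A, Aw.T @ y, rcond=None)
    return coef[1:]  # local feature weights (intercept dropped)

instance = np.array([0.2, 1.0])
weights = local_surrogate_explanation(instance, black_box_predict)
# Locally, feature 0 pushes the score up and feature 1 pushes it down
# (the squared term has a negative local slope at x1 = 1.0).
```

Note that the surrogate's faithfulness depends on the sampling scale and the kernel: widen `scale` and the linear fit can diverge badly from the black box, which is exactly the faithfulness concern Vera raises.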
Brian: Maybe I took this away incorrectly from your slides, but I thought I saw some strong evidence, or even maybe a recommendation, that example-based trumps feature-based in terms of helping users actually understand explanations. So, first of all, can you explain the difference between a feature-based explanation and an example-based one? And then correct me if I’m wrong about your preference, or maybe not preference, but the research data suggesting that it works better?
Vera: Yeah, yeah. That’s a good question, and that’s the one I mentioned that some of our research suggests is something to look at. So, feature importance is straightforward. It tells you the model made this particular prediction because of certain features of the input. To take an example, in XAI we often think about this loan application use case, right? Of course, this is a very [laugh] controversial, problematic case, but I’m going to use it. Say this person’s loan gets rejected because this person had an unpaid loan before. That’s a feature that really contributed to the decision, so that’s a feature-importance explanation.
Example-based explanation comes in different forms, but the general gist is to show you similar examples. “Look, this person got rejected because there are people who look similar to this person, and those people failed to repay.” And there are different algorithms: some of them search for similar examples in the training data, some may actually construct fictional cases. There’s another design decision to make here as well: you can show both the ground truth for the example (did they actually repay or not?) as well as the model’s prediction for it. And there are at least two benefits example-based explanations offer.
One is, they’re actually easier to process, because case-based reasoning is easier for people, sometimes. You’re also not forcing people’s attention to only look at, oh, this person has this particular feature. People can have a more natural process and think about this case: whether this particular person is similar, what the reasons were that this person failed. So, you give people the opportunity to make a more holistic decision, or form their judgment looking at a particular input, without forcing their attention.
The other benefit it offers is that design decision where you can show both the ground truth and the prediction. If those disagree on a similar example, that gives people a signal that the model tends to make mistakes on this kind of case. And a lot of the time, people can use that relatively reliably as a signal that the model is more likely to make a mistake.
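A rough sketch of what Vera describes might look like the following. This is a hypothetical toy, not her actual method or data: for a new loan applicant, it retrieves the most similar past cases from the training data and shows both the model’s prediction and the ground-truth outcome for each, so that disagreement between the two can flag unreliability. All feature names, values, and the stand-in “model” are invented for illustration.

```python
def euclidean(a, b):
    """Simple distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Toy training data: (income, debt_ratio, prior_defaults) -> repaid (1) or not (0)
train_X = [(55, 0.2, 0), (30, 0.6, 1), (80, 0.1, 0), (28, 0.7, 2), (60, 0.3, 0)]
train_y = [1, 0, 1, 0, 1]

def toy_model(x):
    """Stand-in for a trained classifier: predicts default if prior_defaults > 0."""
    return 0 if x[2] > 0 else 1

def explain_by_example(x, k=3):
    """Return the k most similar training cases, with prediction and ground truth."""
    ranked = sorted(range(len(train_X)), key=lambda i: euclidean(x, train_X[i]))
    return [
        {"features": train_X[i],
         "model_prediction": toy_model(train_X[i]),
         "ground_truth": train_y[i]}   # disagreement here flags unreliable regions
        for i in ranked[:k]
    ]

# New applicant: modest income, high debt ratio, one prior default.
examples = explain_by_example((32, 0.65, 1))
```

In a real system the similarity search would run over actual historical applications and a trained model, but the design decision Vera highlights is visible even here: each retrieved case carries both `model_prediction` and `ground_truth`, so the loan officer can check whether the model tends to get cases like this one right.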
Brian: So, to help listeners visualize this, are we talking about something where, if I’m bank loan officer Brian and you’re Vera looking for a mortgage, and I’m talking to you at the office and looking at my screen, if the application I use to, quote, “approve” your loan was example-based, are you saying, “Well, here’s the prediction of Vera’s creditworthiness and whether we should give the loan. The answer is no, and here are some red flags, you know, in the feature detection.” And then I would also see the same feature output for three or four other bank customers like you—
Brian: So, that I can have a better holistic understanding? Like, oh, it’s not just this one thing. It’s also their, I don’t know, whatever it is, income was also, like, not quite as high—like, all these people tend to have income below whatever or—and so you’re hoping the loan officer is kind of seeing it’s not just this one person, this whole cohort—is that kind of the idea here?
Vera: Yeah. That’s exactly the idea. I think the challenge with decision-support use cases is that you’re actually hoping to detect errors from the model explanation.
Vera: And then feature-based explanation is sometimes confusing, because it works more as a justification. It directs people’s attention to, oh, this person has this feature, and that’s why the model thinks the person should be rejected. But then you’re not paying attention to other features that aren’t being highlighted. Versus if I’m looking at an example, I’m looking at other similar people with a more holistic view, and I may consider, “Oh, actually, this person is an exception, because this person is actually a new customer.” So, that gives you the opportunity for a more holistic judgment-making process.
Brian: And in this fictitious design example I gave you, these other examples, like, the three to four other customers that I, the loan officer, am looking at: am I seeing ground truth for each of them, or am I seeing a prediction of whether the system would have given them a loan if they were standing in front of me right now?
Vera: Yeah, so that’s something we showed in our study as well. Again, that’s a design decision you can make. A lot of work on the algorithm side doesn’t choose to show the ground truth, but we actually find showing ground truth provides this very, very useful additional signal. The point is, it’s also a very obvious signal that people really gravitated towards. If you give me this example and you give me the model’s prediction, people directly look at, “Oh, is this consistent?” If it’s not consistent, then the model must not be reliable. So, that’s definitely a recommendation I would add as well.
Brian: Another slide I really liked in your deck was this concept of getting human input and feedback from end users to understand what the right UI and UX for explainability might look like. You gave this example of showing a car crash: what caused the car crash? And you gave some great examples, like, they were going too fast, or the driver was drunk, or there was a cat on the dashboard, or whatever the three things were. Can you talk a little bit about this idea of how you go about getting feedback from users about their questions, to then determine what the right explainability experience might be?
Vera: Yeah. Yeah, sure. So, the caveat is, this is a very new research thread I just started, so things are [laugh] kind of slow, at the exploratory stage. But something I’ve always been interested in is this criticism of current explanations: right, I have this algorithm to identify feature importance, and I just give you this list of 20 feature importances. It’s just not very similar to how people explain.
And there’s a broader literature in social science looking at how people explain, and many of the fundamental properties of human explanations are not there. One property of human explanation is being selective. For example, if you’re in a car accident, there may be dozens of events leading to that point. And when you’re explaining to your friends or a police officer why this accident happened, you’re not giving all 20 [laugh] events that led to this point. You’re going to be selective.
For example, if your friend is very surprised the car had such bad damage, you’re going to pick the cause that’s most relevant to the damage. So, you may say, “Because the speed was very high.” Versus when a police officer is asking you this question, they often want to diagnose the reason, so you might pick something that’s considered very abnormal, like, the car suddenly made a lane change, right? So, I’ve always been interested in how we develop new algorithms or approaches to mimic this kind of selectivity in how humans do explanation. The general idea is: now we have this feature importance, I have these 20 feature importances, and how do I present them in a way that’s more selective and easier for you to understand?
So, we have this framework we call selective explanation. Basically, the algorithm tries to make the selection from those 20 feature importances based on an inferred belief that you have, right? So, the example in the [paper 00:35:03] is kind of a sentiment analysis task: you’re looking at a movie review, and the model may say it’s positive or negative, but not every word would be considered relevant to this decision about the movie review. But we can give you, say, ten reviews, and then you give us some input-slash-annotation of what kind of features, what kind of words, you consider relevant to a movie review. Now, I build another prediction model that can predict your belief on an unseen movie review: what words you would believe are relevant. And then we use that inference model of your belief to augment the output of the feature importance algorithm. So now we only give you the features, out of those 20, that we believe you think are relevant.
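The mechanism Vera describes can be sketched roughly like this. It is a hypothetical simplification of the selective-explanation idea, not her actual implementation: a belief model, stood in for here by a simple predicate, filters a long feature-importance list down to the features a particular user is likely to consider relevant. All words and importance values are made up for illustration.

```python
def selective_explanation(feature_importances, belief_model, top_k=5):
    """Keep only features the inferred user-belief model marks as relevant,
    then return the top_k by absolute importance."""
    selected = {feat: imp for feat, imp in feature_importances.items()
                if belief_model(feat)}          # inferred: would this user care?
    ranked = sorted(selected.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]

# Stand-in belief model: this user only treats humor- and plot-related words
# as relevant to a movie review's sentiment. In the real framework this would
# be a model trained on the user's annotations of example reviews.
user_relevant = {"hilarious", "boring", "plot", "pacing"}
importances = {"hilarious": 0.9, "cinematography": 0.7, "boring": -0.6,
               "runtime": 0.2, "plot": 0.4}
result = selective_explanation(importances, lambda f: f in user_relevant, top_k=3)
```

Note that `cinematography` is dropped despite its high importance, because the (toy) belief model says this user would not find it relevant; that is the selectivity Vera contrasts with dumping all 20 feature importances on the user.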
Vera: Yeah. So, that’s kind of a general idea.
Brian: Right. So, in that movie example—and you told me that this concrete example works; just for our listeners—this could be a situation where you read a movie review about, like, should I watch Spaceballs or not, and the review is just, like, “The sense of humor is, like, first grade. It’s a terrible movie that clearly didn’t spend a lot of money on the special effects. But if you love Star Wars, you won’t want to miss it.” And I love Star Wars.
This is the kind of thing you’re talking about, right, where overall the system might predict this review is saying it’s a terrible film, you won’t like it. But then there’s a super heavy feature in there [laugh] that might be really relevant to me, but not necessarily to somebody else, and so a system designed for this use case is going to be more powerful.
Vera: That’s a good example, yeah.
Brian: I want to just—I know we’re getting up to our time here. You have a great slide with just four steps on how to do question-driven XAI design. I’m going to just kind of summarize that and let you comment on it. But before I do that, I wanted to ask you, are there any general trends or things that you wish very technical people, particularly the data scientists, machine-learning engineers, people that are really working on the technical side of this, things that you wish they would take away from your work in the space, particularly companies where they’re working on internal data products, right? So, they may not have formal product management or any user experience design or HCI expertise on their team. What are the things they need to unlearn or just really start to be aware of if they really understand the nuts and bolts of the technical work?
Vera: That’s a great question. I think, if I summarize: there’s no one-size-fits-all solution. We have to move beyond just thinking about Shapley values or LIME, even though they’re very popular algorithms. There are so many different explainability techniques, from feature importance and example-based to counterfactual and global explanations. And also this point of view of taking a broader view of explainability, not just, “Oh, I have this prediction. How do I explain this prediction?” Right?
Sometimes people want to understand the data more; sometimes people want to understand the overall logic more. So, we should consider all of this in the realm of explainability. And then, once you have this many choices, how do you make the choice? Ideally, you should really be driven by understanding your users’ questions and your users’ explainability needs. Even if you don’t have the resources or time to do full-fledged user research, just talking to a couple of users might be helpful. Or even just think through the different questions people may ask. So, in our research, in that slide deck, we have this thing we call the XAI Question Bank, which has more than 50 questions that we see can be asked across many different AI products. It might be helpful to take a look and then, kind of heuristically, think about what questions are most likely to emerge in your product, and use that to drive the choices. Don’t jump directly to LIME or SHAP.
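As one illustration of driving the technique choice from user questions rather than defaulting to LIME or SHAP, here is a hypothetical sketch. The question categories and the pairings below are my own illustrative guesses, not the actual XAI Question Bank taxonomy.

```python
# Hypothetical mapping from user-question types to explanation-technique
# families, in the spirit of question-driven XAI design. The categories and
# pairings are illustrative guesses, not Vera's XAI Question Bank.
QUESTION_TO_TECHNIQUES = {
    "why":         ["feature importance", "example-based"],
    "why_not":     ["counterfactual", "contrastive explanation"],
    "how_overall": ["global explanation", "model documentation"],
    "what_if":     ["counterfactual", "interactive probing"],
    "data":        ["training-data summary", "data documentation"],
    "performance": ["accuracy and uncertainty reporting"],
}

def candidate_techniques(question_type):
    """Suggest technique families for a user question; an unknown type pushes
    you back to talking with users instead of picking an algorithm first."""
    return QUESTION_TO_TECHNIQUES.get(
        question_type, ["go back and clarify the question with users"])

suggestions = candidate_techniques("why_not")
```

The point of the sketch is the direction of the lookup: the user’s question selects the technique, rather than a pre-chosen algorithm dictating what kind of explanation the user receives.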
Brian: [laugh]. I love it. So, to summarize this, I’m going to just summarize this kind of four-step process you have here. So, Question-Driven XAI design. Step one, identify user questions, right? So, this is the interviews you’re talking about. Typically, people involved with this might be designers and users, but if you don’t have designers, then you got to get your technical people involved in doing that work.
The second step is analyzing the questions. So, this is clustering questions into categories and doing some prioritization about which kinds of questions you’re going to put effort into on the technical side, so the application can answer them. And that’s going to be your product team and designers as well.
The third step is mapping questions to the technical solutions, which is probably where the technical people want to [laugh] jump to pretty quickly, as you just said here, but this is obviously when you start to look into what’s available off the shelf, or what do we need to make here in order to actually begin to bring some XAI into the solution.
And then finally, step four is to iteratively design and evaluate. So, this is where we’re actually creating a design for the first time, incorporating all the input from the previous three steps. And then we’re getting it in front of users and reevaluating, and so on and so forth, as in the normal product design lifecycle. A, did I get that right? B, anything you want to add to it?
Vera: Yes, I think that’s a great summary. I think the only thing we didn’t touch on today is the evaluation part, which is so useful, and something I also want to highlight: explainability, or understanding, is a means to an end. We say there are different use cases, there’s no [unintelligible 00:40:44] solution, because people want explanations for different goals, different use cases.
So, it’s really helpful, while you’re doing the first-step user research, or when you come to make the selection, to really articulate what the user is trying to achieve with the explanation. Is it about making better decisions? Is it about identifying more bugs? Is it about identifying whether there’s a bias issue in the data? Then, when you do the evaluation, use that, rather than some ambiguous notion of, is this explainability? Is this faithfulness? Use the users’ goal as the gold standard for your evaluation.
Brian: I mean, even a question like, “What can this tool do for me? What are its capabilities?” is fairly broad, but it falls into this camp as well. So yeah, I love that analysis of the questions as being critical. Vera, this has been fantastic. And just because this is very timely with where we are right now—and I know this is a very big question—but any general comments about ChatGPT and thinking about explainability and interpret [laugh]—interpretability about what it’s doing? Any thoughts on how you might approach making a system like that, with large language models and generative AI, more explainable?
Vera: Oh, that’s really a question on top of my mind these days for a longer, longer discussion that may require [laugh] another—
Brian: Another episode?
Vera: —podcast episode. But I—my colleague, Jennifer Wortman Vaughan, and I, we just put out a position paper called “Human-Centered AI Transparency in the Age of Large Language Models”. We kind of break down some of our thoughts.
Brian: Oh, okay. Great.
Vera: Yeah. So, that’s maybe a place to look, if anyone is interested in checking it out. One thing I want to point out about explainability per se: the current practice is that people have started asking ChatGPT to give them explanations directly. I want to raise an alarm that those explanations are not faithful. Because again, what ChatGPT does is generate content that looks plausible, even convincing.
A lot of the time, that’s information that kind of justifies the answer, and you may discover, depending on how ChatGPT answers, that it gives you contradicting information within the explanation itself. So, I don’t think we should take ChatGPT’s explanations at face value. How to produce more faithful explanations is very much an actively researched topic, but at this point it’s something we don’t want end users or data scientists to jump directly to and use as the explanation.
Brian: Yeah. Yeah. And it won’t tell you that, as I recall [laugh]. It won’t tell you, like, “Oh, now I’m just going to switch into truth mode and I’m not going to make any shit up, and I’m just going to tell you exactly how I came up with this.” There’s no indication that that’s happening. So anyhow, you’re right. This is a whole—that’s probably another episode.
Vera: Yeah [laugh].
Brian: But in the meantime, if people want to find out about your work and follow your work, is there a best place to find Vera Liao out on the internet? How do they get in touch or follow you?
Brian: Great. Yeah, we’ll put that in the [show notes 00:44:04], and some of these other citations that you mentioned here, we’ll try to dig those up and put those in the [show links 00:44:09]. So, thank you, Vera, so much for coming on. It’s been really great.
Vera: Yeah, it’s a pleasure. Thanks so much.