
082 – What the 2021 $1M Squirrel AI Award Winner Wants You To Know About Designing Interpretable Machine Learning Solutions w/ Cynthia Rudin


Episode Description

As the conversation around AI continues, Professor Cynthia Rudin, computer scientist and director of the Prediction Analysis Lab at Duke University, joins us to discuss interpretable machine learning and her work in this complex and evolving field. She is the most recent (2021) recipient of the $1M Squirrel AI Award for her work on making machine learning more interpretable to users and ultimately more beneficial to humanity.

In this episode, we explore the distinction between explainable and interpretable machine learning and how black boxes aren’t necessarily “better” than more interpretable models. Cynthia offers up real-world examples to illustrate her perspective on the role of humans and AI, and shares takeaways from her previous work, which ranges from predicting criminal recidivism to predicting manhole cover explosions in NYC (yes!). I loved this chat with her because, for one, Cynthia has strong, heavily informed opinions from her concentrated work in this area, and for another, because she is thinking about both the end users of ML applications and the humans who are “out of the loop,” but nonetheless impacted by the decisions made by the users of these AI systems.

In this episode, we cover:

  • Background on the Squirrel AI Award – and Cynthia unpacks the differences between Explainable and Interpretable ML. (00:46)
  • Using real-world examples, Cynthia demonstrates why black boxes should be replaced. (04:49)
  • Cynthia’s work on the New York City power grid project, exploding manhole covers, and why it was the messiest dataset she had ever seen. (08:20)
  • A look at the future of machine learning and the value of human interaction as it moves into the next frontier. (15:52)
  • Cynthia’s thoughts on collecting end-user feedback and keeping humans in the loop. (21:46)
  • The current problems Cynthia and her team are exploring—the Rashomon set, optimal sparse decision trees, sparse linear models, causal inference, and more. (32:33)

Quotes from Today’s Episode

  • “I’ve been trying to help humanity my whole life with AI, right? But it’s not something I tried to earn because there was no award like this in the field while I was trying to do all of this work. But I was just totally amazed, and honored, and humbled that they chose me.”- Cynthia Rudin on receiving the AAAI Squirrel AI Award. (@cynthiarudin) (1:03)
  • “Instead of trying to replace the black boxes with inherently interpretable models, they were just trying to explain the black box. And when you do this, there's a whole slew of problems with it. First of all, the explanations are not very accurate—they often mislead you. Then you also have problems where the explanation methods are giving more authority to the black box, rather than telling you to replace them.”- Cynthia Rudin (@cynthiarudin) (03:25)
  • “Accuracy at all costs assumes that you have a static dataset and you’re just trying to get as high accuracy as you can on that dataset. [...] But that is not the way we do data science. In data science, if you look at a standard knowledge discovery process, [...] after you run your machine learning technique, you’re supposed to interpret the results and use that information to go back and edit your data and your evaluation metric. And you update your whole process and your whole pipeline based on what you learned. So when people say things like, ‘Accuracy at all costs,’ I’m like, ‘Okay. Well, if you want accuracy for your whole pipeline, maybe you would actually be better off designing a model you can understand.’”- Cynthia Rudin (@cynthiarudin) (11:31)
  • “When people talk about the accuracy-interpretability trade-off, it just makes no sense to me because it’s like, no, it’s actually reversed, right? If you can actually understand what this model is doing, you can troubleshoot it better, and you can get overall better accuracy.”- Cynthia Rudin (@cynthiarudin) (13:59)
  • “Humans and machines obviously do very different things, right? Humans are really good at having a systems-level way of thinking about problems. They can look at a patient and see things that are not in the database and make decisions based on that information, but no human can calculate probabilities really accurately in their heads from large databases. That’s why we use machine learning. So, the goal is to try to use machine learning for what it does best and use the human for what it does best. But if you have a black box, then you’ve effectively cut that off because the human has to basically just trust the black box. They can’t question the reasoning process of it because they don’t know it.”- Cynthia Rudin (@cynthiarudin) (17:42)
  • “Interpretability is not always equated with sparsity. You really have to think about what interpretability means for each domain and design the model to that domain, for that particular user.”- Cynthia Rudin (@cynthiarudin) (19:33)
  • “I think there's sometimes this perception that there's the truth from the data, and then there's everything else that people want to believe about whatever it says.”- Brian T. O’Neill (@rhythmspice) (23:51)
  • “Surveys have their place, but there's a lot of issues with how we design surveys to get information back. And what you said is a great example, which is 7 out of 7 people said, ‘this is a serious event.’ But then you find out that they all said serious for a different reason—and there's a qualitative aspect to that. […] The survey is not going to tell us if we should be capturing some of that information if we don't know to ask a question about that.”- Brian T. O’Neill (@rhythmspice) (28:56)

Transcript

Brian: Welcome back to Experiencing Data. This is Brian T. O’Neill. I’m here with Dr. Cynthia Rudin from Duke University. Welcome to the show, Cynthia. How are you?

Cynthia: Oh, thank you. I’m glad to be here.

Brian: Yeah, yeah. We’re going to totally nerd out to interpretable machine learning, we’re going to talk about interpretable and explainable AI, and I also first wanted to start out with a congratulations; you’ve recently won a very big award in this space. Can you tell us about this award that you’ve won? And why?

Cynthia: Yes, so I won the Squirrel AI Award. And the premise of the award is that there’s no Nobel Prize for AI, and there’s certainly no Nobel Prize for AI for benefiting society, or humanity. And so AAAI, with the help of Squirrel AI, designed this new award that, like these big awards in other fields, like the Nobel Prize, would have a million-dollar prize associated with it. And obviously, this is not something I tried to earn. [laugh].

I’ve been trying to help humanity my whole life with AI, right? But it’s not something I tried to earn because there was no award like this in the field while I was trying to do all of this work. But I was just really, totally amazed, and honored, and humbled that they chose me. And so yeah, that’s what it is.

Brian: Yeah. And I think that it’s really great that there’s a focus on an award here that has something to do with connecting the actual people who use these solutions, and the parties that get affected by these solutions, as being something worthy of awarding at all, besides just, like, who got the highest level of accuracy in the model with, you know, parameters and, like, 96.7; winner. You know? Like, that there’s something a little bit beyond that, not to say that’s not important in its own way and its own thing, but this is the kind of stuff we like to talk about on this show. So, the first thing I wanted to tell you is, I’m not a data science and machine learning engineer, which my audience knows, and I just try to use design, and help people use design, to make that work count and to actually produce more human-centered solutions in this space.

And one of the things that I’ve been using, a term I’ve been using interchangeably, was explainable and interpretable. And so you have, if I recall from my research when we first met, you have a hard distinction here about these terms, and they’re not the same thing. And I hadn’t seen that before; I’ve been using these terms interchangeably. Tell me what your position is on this. What’s the difference here? Why does it matter?

Cynthia: Yeah. So, I use these two terms to mean completely different things. I’ve been working on interpretable AI since the beginning of my professional career, right? Interpretable machine learning is where you try to design a predictive model that a human can understand. It’s a formula that’s either very simple, like, something a human can memorize, or it’s something that can go in a PowerPoint slide, or it’s interpretable in other ways, like its constraints, so that humans can visually verify that the model is doing what it’s supposed to do.

So, that’s an interpretable model. But then, about five years ago, there was this explosion of interest in explaining black boxes because people realized that, hey, you know, it’s quite dangerous to use black boxes. It’s like, they finally figured it out, you know, but instead of trying to replace the black boxes with inherently interpretable models, they were just trying to explain the black boxes. And when you do this, there’s a whole slew of problems with it. First of all, the explanations are not very accurate; they often mislead you.

And then you also have problems where the explanation methods are giving, sort of, more authority to the black box, rather than telling you to replace them. Like, it’s okay, if we use a black box because we can explain it. It’s like, well, you think you can explain it, but maybe you can’t. Maybe the explanation model is wrong, you know? So, now you’re left troubleshooting two complicated models, like, the explanation model and the black box, instead of just trying to design a model that’s actually interpretable in the first place.

Brian: Right.

Cynthia: For the kinds of models that I design, some of them are used in very high stakes decisions, like, they’re used for detecting seizures in intensive care unit patients; these are critically ill patients. And the formulas that we give doctors for these problems are so small that they can memorize them. And I would not want to give a doctor a black box with an explanation for that type of problem.

Brian: Can you give a hard example of—I’m going to paraphrase what I think you said; you said something about giving the black box more authority than it deserves, or a particular explanation is giving too much authority. Can you give us a—try to paint a visual example of that?

Cynthia: Sure. Radiology would be a good example. There’s a lot of work right now in radiology where they’re creating black box models to try to analyze things like breast lesions or other types of medical images that you might take. And the way a lot of these explanations work is that they’re just trying to highlight which part of the image the neural network is paying attention to. And unfortunately, you have these situations where the neural network is paying attention to one part of the image and the explanation method says it’s paying attention to the whole image or some other part of the image, which is not what it’s actually doing.

So, there’s this huge amount of work right now, in what’s called saliency maps, where you’re just trying to figure out which pixels is it actually paying attention to, whereas for us, the methods we design, you know exactly which pixels the network is paying attention to because we designed it that way, that it would tell you that information. So, I think maybe that helps make a distinction.
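The saliency maps Cynthia describes are post-hoc approximations of what a network attends to, which is exactly why they can disagree with the model’s actual behavior. For readers who haven’t seen one, here is a minimal sketch of a simple gradient-based saliency map, assuming PyTorch; the untrained stand-in classifier, the random “image,” and the shapes are illustrative assumptions, not anything from her lab’s work.

```python
# A minimal sketch of a gradient-based saliency map, assuming PyTorch.
# The classifier is an untrained stand-in and the "image" is random noise;
# both are illustrative assumptions.
import torch
import torchvision.models as models

model = models.resnet18(weights=None)  # stand-in image classifier
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # fake input image
scores = model(image)
top_class = scores.argmax(dim=1).item()

# Gradient of the top class's score with respect to the input pixels.
scores[0, top_class].backward()
saliency = image.grad.abs().max(dim=1).values  # (1, 224, 224) heatmap

# The heatmap only approximates what the network "looks at," which is why
# such explanations can disagree with the model's true reasoning.
print(saliency.shape)
```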

Brian: Thanks for that explanation. Can you give me another example of this?

Cynthia: Okay. So, another example is the ProPublica article that was very well known. It’s called “Machine Bias” and the idea of the article was to analyze the black box model called COMPAS that’s used in the criminal justice system.

COMPAS is used for predicting whether someone will commit a crime in the future, so it’s used for predicting criminal recidivism. And it’s used throughout the whole US justice system. And it is proprietary, which means that nobody knows what the formula has in it. And ProPublica wrote this article where they took some data from Florida and they approximated COMPAS with another model, a simpler model that they could understand. And that simpler model depended on the person’s age, the number of prior crimes they’ve committed, and the person’s race.

And the analysis by ProPublica showed that race was a statistically significant factor in the model. So, ProPublica wrote this article claiming that COMPAS depends on race in addition to its correlation with age and criminal history. And the correlation of race with age and criminal history comes from systemic racism in society; that doesn’t come from the model itself, right? But they claimed that this model depended on all three factors: age, criminal history, and race.

And when we went back and looked at their analysis, it seems as if they approximated COMPAS incorrectly: they approximated COMPAS with a model that is linear in age, and we don’t think it’s linear. And so in their analysis, when you take into account the non-linearity, the dependence on race goes away. And so they created this huge scandal over approximating a black box incorrectly. It’s very easy to mess that sort of thing up. You’re analyzing a black box, you approximate it, you assume that the approximation has the same important variables as the black box; that’s not necessarily true.

So, that’s another example where—

Brian: Yeah.

Cynthia: Yeah, a huge—this created a huge amount of interest in the field of algorithmic fairness, which obviously algorithmic fairness is super important, but they missed the point that transparency would have solved this problem.
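To make the pitfall concrete, here is a hypothetical, fully synthetic sketch: a “black box” score that depends non-linearly on age and on prior counts, and not on the group label at all, can still show a sizable group coefficient when approximated by a surrogate that is linear in age, because the group label is correlated with age and priors. None of the variables or numbers below come from COMPAS or from ProPublica’s analysis; they are made up for illustration, assuming NumPy and scikit-learn.

```python
# Synthetic illustration of why a linear-in-age surrogate can mislead you
# about what a non-linear black box depends on. All data is made up.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 20_000
group = rng.integers(0, 2, n)                              # stand-in group label
age = np.clip(rng.normal(38 - 6 * group, 10, n), 18, 80)   # correlated with group
priors = rng.poisson(1 + 1.5 * group)                      # correlated with group

# Black-box risk score: non-linear in age, ignores the group label entirely.
black_box = 4 * np.exp(-(age - 18) / 15) + 0.8 * priors

X_linear = np.column_stack([age, priors, group])
X_nonlin = np.column_stack([np.exp(-(age - 18) / 15), priors, group])

coef_linear = LinearRegression().fit(X_linear, black_box).coef_[-1]
coef_nonlin = LinearRegression().fit(X_nonlin, black_box).coef_[-1]

print(f"group coefficient, linear-in-age surrogate:     {coef_linear:+.3f}")
print(f"group coefficient, non-linear-in-age surrogate: {coef_nonlin:+.3f}")
# The linear surrogate attributes weight to the group label even though the
# black box never used it; the better-specified surrogate does not.
```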

Brian: Yeah, yeah. I’m going to put aside Facebook and some of the really big examples we’ve heard about over and over. Do you think this kind of stuff is happening because of the maturity level of where we, the data practitioners, the statisticians, the machine learning engineers, collectively are right now? Is this sloppy work, or is it that the business doesn’t care about these ethical things unless we get caught? Why are we here?

Cynthia: Okay. I think we’re here because the whole field of machine learning grew up on advertising and low-stakes decisions, right? So, people were getting employed for many years doing online advertising, Facebook recommendations, web search engine recommendations; this is all low-stakes stuff; if you make a mistake, it’s not a big deal. So, then people just started designing these bigger and bigger black boxes. Oh, and also computer vision, right? All this computer vision stuff they were doing, identifying cats and dogs in Facebook images and stuff.

So, this was all low-stakes stuff. So, everybody sort of was doing these black boxes. And they liked them. They really liked them because there was some sort of mystique about the idea that you could have a model that would find patterns that were so subtle that no human could understand them. And people love that. They absolutely love that idea.

And so, back a bunch of years ago, when I would go give a talk about, “Oh, look, I can find these patterns that people can understand,” people would come up to me and start yelling at me because I was ruining that mystique. I mean, for me, it’s just so much more useful if you can understand what you’re doing, right? If you’re doing real science, if you’re trying to help people in medicine—I mean, I was working in power grid reliability, so I wanted to help the power grid engineers, and I was working with some detectives, too, helping with trying to prevent house breaks. And for me, working with real people in real domains, the value of interpretability was so much—you know, it was just so much higher. It was so much more useful to be able to show someone what the model was actually doing, what it was depending on, that to me the mystique was completely—I just thought they were completely wrong. And so, yeah, that’s why I decided to work in interpretable machine learning. [laugh].

Brian: I just want to comment on one thing here which I really love that you said. So, for the listeners out there, there’s always a time and a place for playgrounds and for exercising technology to see what it can do, and that’s all fine. But all the reasoning that you gave there was about the work being in service of somebody else. It wasn’t about making Cynthia happy; it was about how is it going to serve the people it’s for, whether it’s a paying client, whether it’s your boss, whether it’s another department, whether it’s society? There’s a meaningful difference here because you’re talking about mystique, and mystique perhaps to the machine learning community or the people on the quote, “Inside,” there’s a mystique there.

And there is a time and place for that, probably. The question is, does that belong when you’re out there doing commercial work, work that affects society, this kind of thing? Is that the time and the place to be caring about that kind of stuff: accuracy at all costs, finding the magic things, you know, all this kind of stuff? I just wanted to call that out. And I like that your reasons were all focused on this work being in service of somebody else because I think that’s really important.

Cynthia: I want to comment on this idea of accuracy at all costs, right? So, this assumes that you have a static dataset, that you just have one dataset, and you’re not going to mess with it, and you’re just trying to get as high accuracy as you can on that dataset. Now, that is what we do in Kaggle competitions, and that’s what we do in machine learning bake-offs, but that is not the way we do data science. In data science, if you look at a standard knowledge discovery process, like the KDD process, or the CRISP-DM process, right—these are processes for discovering knowledge from data—in all of these processes, after you run your machine learning technique, you’re supposed to interpret the results and use that information to go back and edit your data, go back and edit your evaluation metric. And you update your whole process and your whole pipeline based on what you learned.

And so when people say things like, “Accuracy at all costs,” I’m like, “Okay. Well, if you want accuracy for your whole pipeline, maybe you would actually be better off designing a model you can understand.” In my experience working with the power grid data, that was the messiest dataset I’ve ever seen in my life. These were, like, trouble tickets typed by Con Edison dispatchers, these were accounting records dating from the 1880s. I mean, this was a very tough dataset.

And if you throw that thing into a black box and get some result, and you don’t know what’s going on there—which is what we did, right? [laugh] I was a very well-trained machine learning person; that’s what we did. And so we did that, and it was just junk. It was like, garbage in, garbage out. And then when we actually changed the models so that we could understand them, we could show them to the power engineers and be like, “Okay, you know, it’s depending on the number of neutral cables a lot. Does that look right?” And they’re like, “No. [laugh] That does not look right.” So, it’s like, “Okay, let’s take neutral cables out of the model and we’ll try running it again,” and editing it and figuring out what went wrong with our data.

Brian: Let me pause you—

Cynthia: Sure.

Brian: I’m going to interrupt you real quick because I want people to know the context. Is this the one about predicting manholes exploding?

Cynthia: Yeah, that’s what we’re doing.

Brian: Okay. So, that’s the end—that’s the end state, right?

Cynthia: Yeah.

Brian: The human in the loop, the last mile, it’s about can we prevent manholes from exploding? So, I just want to give that context because I don’t think people knew what the thing was that you were working on. So. So, you tried the black box model? You said something about cables, “Is this a logical feature that points to the predictive accuracy?” And they’re like, “No.” So already, we have our users in the loop here, you’re getting feedback. Keep going.

Cynthia: Yeah. So, the interpretability in the process gave us, overall, much better accuracy, right? So, when people talk about accuracy-interpretability trade-off, it just makes no sense to me because it’s like, no, it’s actually reversed, right? If you can actually understand what this model is doing, you can troubleshoot it better, and you can get overall better accuracy.

But just to give you more context on that project, yes, we were predicting manhole explosions. We were predicting failures on the New York City power grid, right? The New York power grid is the oldest and largest underground power system in the world, and these failures happen when insulation on the electrical cables breaks down and then there are short circuits. There’s pressure build-up, smoke, and then… yeah, like I said, there can be an explosion, or a fire, or a seriously smoking manhole. We call them manhole events.

Brian: Manhole events. [laugh].

Cynthia: Manhole events. Yes. That’s right.

Brian: That’s great. I have actually—I lived in New York for about 50% of my life, for seven years, and I remember seeing a cross-section when they were doing construction, a relatively deep hole, and it’s pretty amazing what is going on under the streets of New York when you see, like, a cross-section cut out and the amount of, like, stuff that is underneath your [laugh] feet. It’s just, how does someone keep track of all that? I’m just like, [laugh] you know how old some of this infrastructure is. I can only imagine what that project was like just with the records being so old. And new datasets, old datasets, trying to make sense of all that must have been quite a project.

Cynthia: It was incredible. I mean, in retrospect, it was a super risky career move because it was, like, the dumbest thing I could have done in terms of my own career, which is to devote three years of my life to trying to work with this dataset, having no idea whether I would ever be successful at predicting power failures at all. But I wanted to see how machine learning worked in the real world. It was my first project, I didn’t know anything, and yeah, I’m glad we were able to do it in the end because [laugh] it would have been a pity after so many years if we couldn’t do it.

Brian: [laugh]. Can you tell me a little bit about getting the model right, showing the interpretability, showing the correlated features, but not actually enabling decision-making? So, something breaks down between the user interface and the person sitting there that’s supposed to make a decision—or maybe do nothing; keep the status quo—but being informed enough to know what they should do next, et cetera. Are there breakdowns there between, kind of, that design, the interface layer, and the person that’s supposed to take action? Is that a separate type of work from getting the model further tuned for accuracy or interpretability? Is that a second stage of work? Or do you feel like, no, once you kind of have the features right, then it takes care of itself? Talk to me about that.

Cynthia: If you had asked me, I would give you the same answer for a different question, which is, what is the future of interpretable machine learning? Or what’s the next, the next major thing that machine learning should do, right?

Brian: Milestone or something. Yeah.

Cynthia: Mile—well, I don’t know about milestone, but the next important future direction for the field. And I would say, yes, the idea of having effective human-computer interaction between these machine learning models and the people who use them, I think, is where that frontier is, especially when it comes to things like medical imaging, or other types of very complex discovery problems. I’m working in material science right now, too, so those are very difficult discovery problems. So, I think these interfaces are extremely important.

So, humans and machines obviously do very different things, right? Humans are really good at having a systems-level way of thinking about problems. They can look at a patient and see things that are not in the database and make decisions based on that information, but no human can calculate probabilities really accurately in their heads from large databases, right? That’s why we use machine learning. So, the goal is to try to use machine learning for what it does best, and use the human for what it does best. But if you have a black box, then you’ve effectively cut that off because the human has to basically just trust the black box. They can’t question the reasoning process of it because they don’t know it.

Brian: Sure. But can you have an interpretable machine learning model that has a really bad interface and experience, and so the breakdown is at the interface level? It’s not that the model is not interpretable, it’s something about the presentation of or the experience of understanding that stuff, either the quantity of information, the way it’s laid out to the user, et cetera, I’m curious if that’s a particular challenge, or whether or not if you get the modeling right part it takes care of itself?

Cynthia: No, I think that is a challenge. I mean, in the domains that I work in, we haven’t had that particular problem because, for example, if the model fits on a PowerPoint slide, you could just put it on an index card or have somebody memorize it, and so you don’t really have those kinds of issues. I design a lot of models like that for healthcare domains. But I can imagine that coming up in a lot of other complex problems. For example, I have a colleague who works in neuroscience, and I believe that kind of problem affects their work.

Brian: Yeah. I’ve seen some times where there’s—it’s almost like you’re dumping your debug into the interface and calling that the interface. So, “Here’s 75 features and all the correlations that were there. And that’s the interpretability. There you go.” And that’s the end of the process of designing the interface.

It’s like, well, that’s what happened. That’s why I came up with that prediction. It’s all there. To a user, that’s often not a viable solution, so effectively, the, quote, “interpretable machine learning model” was not interpretable, even if technically it was, if that makes sense. So, I’m curious about that—

Cynthia: You could have—

Brian: —phase.

Cynthia: —you could have really tiny models that are not interpretable, right? Because you could use the wrong factors. For instance, with the power company when we had neutral cables in the model, they said that is not—that does not make any sense. That’s bad. So, that wasn’t interpretable to them.

And then, for instance, for loan decisions, right? It’s possible that if you don’t take into account significant details of the person’s credit history, perhaps you might not find that model interpretable, right? Interpretability is not always equated with, say, sparsity. So, you really have to think about what interpretability means for each domain and design the model to that domain, for that particular user.

Brian: Why does this matter more now than it did? You kind of prefaced your original response to this question around, like, “In the past, I might not have said so, but now I do,” or something like that. I’m just curious, like, what changed, that getting this part right is important?

Cynthia: Did I say that? No, I always believed that.

Brian: I may have misheard, so I apologize if I did. I thought you had said that there needs to be more importance placed on getting this, kind of, last mile human interaction piece correct, such that someone can actually be empowered to make a decision, or whatever it is that the model is suggesting to them to do.

Cynthia: Like I said, I think that’s the next frontier because we’ve actually been able to make huge progress on the different technical problems in interpretable machine learning. So, we can design models that are very sparse, or we can design models that have special constraints, but the question is, did we design the right model? How do we get the human to give their input into the whole design process for these models? What if the human doesn’t like the model that our algorithm constructed? Maybe they want to put in an extra constraint, or maybe they want to look at a bunch of different models that are a little different from each other, so that they could think about what would actually make sense for them?

Brian: Right. Right.

Cynthia: Then you have to basically make your training process interactive, right? You have to make it interactive with the domain expert. That, I think, is quite difficult because you have a computationally hard problem paired with a human-computer interface problem. And if you’re too slow in recomputing your answers, then the human is going to have to wait. And we don’t want that. We don’t want the human to have to wait. So, it’s actually quite a difficult challenge, not only a technical one but a socio-economic one.

Brian: A lot of my audience, I think, that’s listening works on enterprise instances of machine learning, oftentimes not software—it’s probably half—I don’t know, half software companies listening and half internal data teams serving operational uses, things like this. Maybe they’re lower stakes; sometimes they’re probably not. On this question of adoption, getting people to trust and use these things, even when, like, they were the ones that asked for it and we delivered it, and then it didn’t get used—this is a common challenge there—I’m curious about when we talk about making something interpretable. Say a stakeholder tells the data team something about the requirements and what they think are relevant features or data to include, that should be correlated with making something predictive, right, like you said, whether it’s race, gender, age, or whatever, and someone came up with that list and said, “Dr. Rudin, we would like you to include these in your modeling.” What’s the conversation you have when an end user has a different idea of what those things should be, and so the answers coming back are suspicious to them? Because on the one hand, the data that’s available, or the data that maybe came down from a stakeholder, was one set of things, and then there are the people actually on the ground doing the work, using this thing to manage a factory device or decide, like, what should I set my dials at right now, and that data, it’s not believable to them because they weren’t part of the cake-baking, as you talked about earlier. They didn’t have that input there.

Is this normal to have this dance where we’re constantly kind of iterating between what the human’s expectation is when we talk about interpretability, and what they might expect to see, such that they’re not constantly surprised, versus what the data supports, kind of a purist view?

Cynthia: Okay, so first of all, data is not pure. [laugh]. Data is dirty. You know that. But yeah—

Brian: Well, I know that. But from that sense that there’s an absolute right—I think sometimes data, it’s just like, well, the facts, say this, and then there’s these kind of squishy humans over here. And I’m making this very binary, which it’s not. I don’t mean to overly stereotype, but I think there’s sometimes this perception that there’s the truth from the data, and then there’s everything else that people want to believe about whatever it says.

Cynthia: Well, okay.

Brian: And somewhere in the middle—[laugh].

Cynthia: [laugh]. So okay, there’s a couple things going on there, right? The first thing is that you need the cake-bakers to be the correct people, you need the cake-bakers to be the people using the model because otherwise very likely they won’t use it. And the truth is that most models don’t get used in practice, which is the whole reason I went into interpretable machine learning in the first place because I wanted people to use my models.

So, your question also touches on this characteristic of datasets called the Rashomon set. So, the Rashomon set is the set of models that are good. So, you could, for instance, get the model that minimizes some loss function, right? That loss function you chose fairly arbitrarily. So, you could have chosen accuracy, but you also could have chosen the area under the ROC curve, you could have chosen the exponential loss, the logistic loss—you know, whatever loss function you chose, you chose it fairly arbitrarily. Okay.

So, you could find the model that minimizes this loss function, so that is the best model, right? So, the question then becomes, well, okay, how many models—if you change something in your experimental setup really slightly, like the loss function, or your data, right, or if you didn’t ask for the very best model, you just asked for a model that was, like, close to the best—can we think about what that set contains? And the answer is it contains a lot of different models, okay? In general, for many datasets, it contains a lot of different models, so there is no one right model; there are just a lot of different models that are all about equally good. And so the question is, well, which one of these are we going to use? [laugh]. Which one of these is more likely that the domain experts will use? And then that’s where this whole human-in-the-loop thing, all these big questions, come in.
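Here is a minimal sketch of the Rashomon set idea on synthetic data: train a handful of reasonable candidate models and keep every one whose held-out accuracy is within a small tolerance of the best. The candidate list, the tolerance, and the simple search below are illustrative assumptions, not how Cynthia’s lab actually enumerates or explores the set.

```python
# Sketch of a "Rashomon set": all candidate models within epsilon of the best.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = (
    [DecisionTreeClassifier(max_depth=d, random_state=0) for d in range(2, 8)]
    + [LogisticRegression(C=c, max_iter=1000) for c in (0.01, 0.1, 1.0, 10.0)]
)
scores = [m.fit(X_tr, y_tr).score(X_te, y_te) for m in candidates]

epsilon = 0.01                      # "close to the best" tolerance
best = max(scores)
rashomon_set = [m for m, s in zip(candidates, scores) if s >= best - epsilon]

print(f"{len(rashomon_set)} of {len(candidates)} candidate models are "
      f"within {epsilon} of the best accuracy ({best:.3f})")
# If several very different models land in this set, you are free to pick
# the one the domain experts can actually understand.
```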

Brian: Yeah. And thinking about that, where the humans are in the loop, tell me, kind of high-level, how you would approach a greenfield project where someone came to you and said, “Hey, we’d like your team to work on X for us.” How and when are you involving end users in this process? And maybe, how and when is that different for subject matter experts? I’m trying to think of an example; sometimes the SMEs are not necessarily the people who are going to use it, but they may have a lot of input to give—quality input—into a product or application. Where do you start? What kinds of research do you do with end users to make sure that we don’t build something that doesn’t get used?

Cynthia: The first thing is, you have to understand the problem. If you don’t talk to the domain experts, the truth is, you’re not going to understand the problem well enough to actually be able to get anywhere on it.

Brian: Yeah, yeah.

Cynthia: So, you have to engage them right from the beginning because they’re the ones who have to hand you that dataset and you have to understand what it is. And so if you don’t even go to those basic steps, then there’s really no hope for your project, in general. So yeah, and then every step of the way, every time you do anything, at least for us, we show them preliminary models, and we go, “How does this look?” [laugh]. You know, “Here’s what we’re doing. What do you think of that?” And then most of the time they shoot us down, which is great, right? If they shoot us down, that’s another constraint we add. And then eventually, we get to something that they want.

Brian: Do you use formal design or user experience techniques for that? Structured usability studies, things like that? Or is it more just kind of self-reported feedback? How do you collect that? Any suggestions on, kind of, process that you like to use for this?

Cynthia: Well, to help with that, most of the time, we’ve been just using feedback from the domain experts, just to guide us toward what they think is going to be useful in their domain. But we’re starting this new study for our digital mammography project. We have these interpretable neural networks for mammography. They take images of breast lesions and they pick out certain parts of those images and compare them to prototypical mass margins of different types, like spiculated or circumscribed, and we’re trying to figure out whether these new tools can be used by radiologists, and so we’re going to do a user study with radiologists and see whether it can help either train radiologists or help make their process more efficient.

Brian: Is that, like, a qualitative, kind of, open-ended study, or are you asking very discrete questions that have pass-fail criteria or something like that, in terms of how they would use the interface?

Cynthia: Oh, it’ll be both. We would never do a survey without an open-ended box that they could type in their comments. Because you know what? If they hate the whole process, right, we want to know that. So, we want to get their full feedback, of course.

Brian: Mm-hm.

Cynthia: Yeah. I mean, we’ve also done studies with the Con Edison engineers, where we had people look at some of what we did and try to give us some feedback. But to be honest, actually, that work was more—that wasn’t feedback on our work. We actually did a study where we had the Con Edison engineers read the trouble tickets, which are very—these are documents that are very hard to read—and we had them read the documents and say whether they represented a serious event, a non-serious event, or a non-event. And then we checked to see what our machine learning said and whether or not it agreed with the human experts. And whether the human experts agreed with each other because they don’t always agree with each other. [laugh].

Brian: Yeah, I’m a big advocate of qualitative research when I teach data professionals, and of not using surveys. Surveys have their place, but there are a lot of issues with how we design surveys to get information back. And what you said is a great example, which is, well, seven out of seven people said this is a high—what was it—I forget what the scale was; high-priority event, or highly significant event versus significant—

Cynthia: Serious event. [laugh].

Brian: Oh, serious. But then you find out that they all said serious for a different reason, and there’s a qualitative aspect to that, which is why that was. And so you don’t know whether or not well, maybe we could be capturing some of that information there, but the survey is not going to tell us that if we don’t know to ask a question about that. So, there’s a challenge there sometimes with surveys. But I’m here to learn what you’re doing and your process for going through this.

I want to talk about you, not what I do, and so. [laugh]. Could you paint some examples of doing this poorly, when we talk about putting an interpretable model in front of a user? Are there some best practices there, or some definitive not-best practices, like red flags? Or things that you’ve learned, like maybe, “I used to do it this way and then I changed my stance on that. I now do it this way.” Any black-and-white types of recommendations you can give there, things to avoid, things to go for, just in making that kind of last-mile experience better for users?

Cynthia: So, a lot of the work that I do is in healthcare where the model is so tiny that you can put it on an index card.

Brian: Yeah, yeah.

Cynthia: So, the presentation for those doesn’t really matter because they have standard ways of presenting those types of models to users. It’s much more difficult to get people to use more complex models. I think that’s pretty clear. We’ve been trying to design interactive visualizations where people can, like, fiddle with things and see how things change, which is, I think those are cool. I think they’re probably pretty useful to the users.

 

But yeah, I don’t know if there’s any, like, hard and fast rules because the thing is, every domain is different. So, you know, you make one rule in one domain and it totally doesn’t apply to another domain, right? Everything is sort of custom.

Brian: Well, I think, too, sometimes about the entire experience. For example, “Can I override this prediction? But we want to capture that I overrode it so that maybe we can track that or learn something from it, and maybe why I did that.” If spitting the prediction out and the interpretability is kind of the hub, there are spokes that come off of that, that may be kind of ancillary experiences that go along with it. I’m thinking about that kind of whole experience here. Or, as you talked about, using interactive tooling there, where maybe there’s a couple different predictions that came out, there’s three and they all give slightly different scores, and you can lock in this value and then watch these two other parameters change to kind of get a better understanding of what’s going on. Or I want to take, quote, “the average of these three things,” but I want to lock in some values and then see if my, kind of, human prediction lines up with what the machine is telling me, and these kinds of things. So, I’m just curious if there was anything you could generalize there? And if you can’t—I know, it’s a tough question, so—

Cynthia: Yeah, no, well, that’s the goal. That’s the goal of what a lot of this work in the future is for. So, for instance, exploring the whole Rashomon set, right? The Rashomon set, again, is the set of good models, right? And the question is, can you effectively explore that set?

And we’ve been trying to design tools that allow you to explore the set, right? We have some visualization tools that allow you to project models down into the variable importance domain, and you can say, “Well, if this variable is important, then that variable is not,” you know, things like that, so that you can see what’s going on. But the idea is exactly what you just mentioned: that you’d like the user to be able to smoothly walk within the Rashomon set to discover different models that are good.

Brian: This has been a really great conversation. You’ve definitely opened my eyes to some new things here. I wanted to ask you, are there any questions I didn’t ask you about this that I should have, for any closing thoughts?

Cynthia: Sure. I can tell you some of the problems that my lab is working on. We haven’t touched on what my lab is, sort of, doing now. So, some of the problems that we’re working on, first of all, the problem we just mentioned, sort of how do you explore the Rashomon set? How do you explore that set of good models? What is the size of the Rashomon set? How big is it? How many models does it contain? Can you prove that there exists an interpretable model somewhere in the Rashomon set before you find it, right? Can you prove that there is a simple model somewhere that is good, before you go to the trouble of solving a hard optimization problem to find it?

Okay, so that’s one set of projects that we’re working on. Other projects are more on, like, how do you find these interpretable models? So, we have a long-term project on optimal sparse decision trees. And we have the fastest decision tree code in the world that creates provably optimal or approximately optimal decision trees—sparse decision trees.

We also are working on sparse linear models and scoring systems, which are tiny little models that are point scores. Like, you have two points for this and one point for that. These are the kinds of models that have been in development for a hundred years. Even before people had data, they were designing scoring systems for various medical and criminal justice applications. And we’re trying to create them from data.
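A toy example of what such a point-score model looks like in code, with made-up features and point values rather than any of her lab’s actual clinical or criminal justice models:

```python
# A toy "scoring system": a few yes/no features, small integer points, and a
# total you could tally on an index card. Features and thresholds are invented
# for illustration only.
SCORECARD = {
    "age_over_60":        1,
    "prior_event":        2,
    "abnormal_lab_value": 1,
}

def risk_points(patient: dict) -> int:
    """Sum the points for every feature that is present for this patient."""
    return sum(points for feature, points in SCORECARD.items() if patient.get(feature))

patient = {"age_over_60": True, "prior_event": False, "abnormal_lab_value": True}
score = risk_points(patient)
print(f"total points: {score}")                      # 2
print("elevated risk" if score >= 2 else "lower risk")
```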

 

Other things that we’re working on are neural networks that do case-based reasoning, where the neural network looks at part of the image and says, “Well, I think that looks like this other image, and so that’s why I think this lesion is possibly malignant: because this part of the mass margin looks kind of like this malignant mass, or the spiculated margin, that I’ve seen before.”

I also have some projects on causal inference where the goal is to try to figure out what the effect of a treatment is; like, think about what’s the effect of a drug. And we’re doing this by simulating a randomized control trial. So, you have this big dataset and you have some patients that have the drug and some patients that didn’t have the drug, and you try to find sort of identical twins where somebody had the drug and someone didn’t, and then you can compare those people. You can look at all the matched groups and figure out whether the treatment was likely to have an effect on people of different subtypes.
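A minimal, synthetic-data sketch of the matching idea she describes: pair each treated patient with an untreated “near twin” on observed covariates and compare their outcomes. The covariates, the effect size, and the plain nearest-neighbor matching below are illustrative assumptions; a real analysis, including her lab’s methods, involves far more care (overlap checks, calipers, sensitivity analysis, and so on).

```python
# Sketch of matching-based causal effect estimation on synthetic data.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 5000
age = rng.normal(55, 12, n)
severity = rng.normal(0, 1, n)
# Older / sicker patients are more likely to get the drug (confounding).
p_treat = 1 / (1 + np.exp(-(0.03 * (age - 55) + 0.8 * severity)))
treated = rng.random(n) < p_treat
outcome = 0.05 * age + severity - 1.5 * treated + rng.normal(0, 1, n)  # true effect: -1.5

covariates = np.column_stack([age, severity])
covariates = (covariates - covariates.mean(axis=0)) / covariates.std(axis=0)
t_idx, c_idx = np.where(treated)[0], np.where(~treated)[0]

# For each treated patient, find the most similar untreated patient (a "twin").
nn = NearestNeighbors(n_neighbors=1).fit(covariates[c_idx])
_, match = nn.kneighbors(covariates[t_idx])
twins = c_idx[match.ravel()]

att = (outcome[t_idx] - outcome[twins]).mean()
print(f"matched estimate of the treatment effect: {att:+.2f} (true effect: -1.50)")
```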

I’m also working on disentanglement of neural networks. So, right now, signals about a specific concept kind of travel all through a neural network. If you want to know where the signal is, say, for checking whether an image is an image of a bedroom, well, how much of the bed is it using for detecting “bedroom”? Normally, the information about the bed is sort of scattered throughout the network. And so we’re disentangling it and forcing it to go through just one channel in the network so that we can figure out what concepts the network is using and how much of each concept it’s using for its predictions.

We’re also working on dimensionality reduction for data visualization where you take high dimensional data and project it down to low dimensions and try to figure out whether it has cluster structures or manifold structures that might be interesting to explore.
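As a deliberately simple stand-in for this kind of visualization (her lab’s own dimension-reduction methods are more sophisticated than plain PCA), here is a sketch that projects a 64-dimensional dataset down to 2D and does a rough check for cluster structure:

```python
# Project high-dimensional data to 2D with PCA and look for rough clusters.
# PCA is used only as a simple stand-in for more specialized methods.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)          # 64-dimensional handwritten digits
embedding = PCA(n_components=2).fit_transform(X)

# A crude check for cluster structure in the 2D projection:
for digit in (0, 1, 2):
    center = embedding[y == digit].mean(axis=0)
    print(f"digit {digit}: mean 2D position {np.round(center, 1)}")
```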

Brian: Cool. Where can people find out about all this? Where’s the best place to go?

Cynthia: You can go to my website. Got everything on there. [laugh].

Brian: Cool. And tell me what that is just so some people that won’t go to the website, I want them to hear it in case they want to hop over there now.

Cynthia: It’s just users.cs.duke.edu/~cynthia. And that’s it, you can just go there and—

Brian: Okay.

Cynthia: Or you can just, you know, search for my name—

Brian: Probably type your name Cynthia Rudin, Duke, and you’ll probably pop up right?

Cynthia: Oh, yeah. Oh, yeah, definitely. And if you’re interested in learning more about machine learning in general, you can go to my website, and there’s a link for teaching, and there I have all of my course notes for my course, as well as YouTube videos on every single topic that I cover in the course, which might be helpful if you want to learn about a specific topic that’s important to you if you don’t already know it.

Brian: Cool. Well, Cynthia, thank you so much for coming on and talking about this. I know when we first connected, I got the sense that you’ve kind of felt like you had finally gotten some recognition for a lot of hard work that you’ve done, and it hadn’t been recognized. And for what it’s worth, I do know, at least the people that listen to this show and other designers and human factors people I know, et cetera, do care a lot about this stuff, and you’re working on the real backbone of that work. So, it does matter, and there are people that care. So, I’m glad that you came on to talk about this.

Cynthia: Oh, that’s great. That actually means a lot to me to hear that. [laugh]. Thank you.

Brian: Well, great. It’s been really nice. And we’ll put those links into the show notes and all that, and I hope everybody here can go check out Cynthia Rudin’s work. And congratulations again on the Squirrel AI Award.

Cynthia: Thank you.

Brian: Yeah. Take care.

