147 – UI/UX Design Considerations for LLMs in Enterprise Applications (Part 1)

Experiencing Data with Brian T. O’Neill (Designing for Analytics)

Let’s talk about design for AI (which more and more, I’m agreeing means GenAI to those outside the data space). The hype around GenAI and LLMs—particularly as it relates to dropping these in as features into a software application or product—seems to me, at this time, to largely be driven by FOMO rather than real value. In this “part 1” episode, I look at the importance of solid user experience design and outcome-oriented thinking when deploying LLMs into enterprise products. Challenges with immature AI UIs, the role of context, the constant game of understanding what accuracy means (and how much this matters), and the potential impact on human workers are also examined. Through a hypothetical scenario, I illustrate the complexities of using LLMs in practical applications, stressing the need for careful consideration of benchmarks and the acceptance of GenAI's risks.

I also want to note that LLMs are a very immature space in terms of UI/UX design—even if the foundation models continue to mature at a rapid pace. As such, this episode is more about the questions and mindset I would be considering when integrating LLMs into enterprise software more than a suggestion of “best practices.”

Highlights / Skip to:

  • (1:15) Currently, many LLM feature initiatives seem to be driven mostly by FOMO
  • (2:45) UX Considerations for LLM-enhanced enterprise applications
  • (5:14) Challenges with LLM UIs / user interfaces
  • (7:24) Measuring improvement in UX outcomes with LLMs
  • (10:36) Accuracy in LLMs and its relevance in enterprise software
  • (11:28) Illustrating key considerations for implementing an LLM-based feature
  • (19:00) Leadership and context in AI deployment
  • (19:27) Determining UX benchmarks for using LLMs
  • (20:14) The dynamic nature of LLM hallucinations and how we design for the unknown
  • (21:16) Closing thoughts on Part 1 of designing for AI and LLMs

Quotes from Today’s Episode

  • “While many product teams continue to race to deploy some sort of GenAI and especially LLMs into their products—particularly this is in the tech sector for commercial software companies—the general sense I’m getting is that this is still more about FOMO than anything else.” - Brian T. O’Neill (2:07)
  • “No matter what the technology is, a good user experience design foundation starts with not doing any harm, and hopefully going beyond usable to be delightful. And adding LLM capabilities into a solution is really no different. So, we still need to have outcome-oriented thinking on both our product and design teams when deploying LLM capabilities into a solution. This is a cornerstone of good product work.” - Brian T. O’Neill (3:03)
  • “So, challenges with LLM UIs and UXs, right, user interfaces and experiences, the most obvious challenge to me right now with large language model interfaces is that while we’ve given users tremendous flexibility in the form of a Google search-like interface, we’ve also in many cases, limited the UX of these interactions to a text conversation with a machine. We’re back to the CLI in some ways.” - Brian T. O’Neill (5:14)
  • “Before and after we insert an LLM into a user’s workflow, we need to know what an improvement in their life or work actually means.” - Brian T. O’Neill (7:24)
  • "If it would take the machine a few seconds to process a result versus what might take a day for a worker, what’s the role and purpose of that worker going forward? I think these are all considerations that need to be made, particularly if you’re concerned about adoption, which a lot of data product leaders are." - Brian T. O’Neill (10:17)
  • “So, there’s no right or wrong answer here. These are all range questions, and they’re leadership questions, and context really matters. They are important to ask, particularly when we have this risk of reacting to incorrect information that looks plausible and believable because of how these LLMs tend to respond to us with a positive sheen much of the time.” - Brian T. O’Neill (19:00)

Links

Transcript

Brian: Welcome back to Experiencing Data. This is Brian T. O’Neill. Today, I’m talking about designing for AI, and specifically looking at LLM use cases in enterprise applications. And before we jump into this, just a heads-up that at the end of this episode, I’ll be explaining how you can get free access, virtually or in person, to the upcoming 2024 CDOIQ conference happening in Cambridge, Mass, between July 16 and 18th. I’m a speaker this year at the conference, and I’ve been given three virtual passes for online access, as well as one on-site pass to give away, so if you’re interested, act quickly. Stay tuned to the end of the episode to figure out how to get access to those. So, let’s jump in.

Today’s a solo episode, and obviously, there’s been a ton of movement in the last year or so around ChatGPT and the other large language models out there—Claude, Perplexity—and I felt like it was time to talk about this on the podcast. As of now—it’s mid-2024—these are still quite prevalent and being discussed a ton. Many large enterprises have initiatives around generative AI, particularly in the tech space, but I haven’t seen many examples of these making a substantial impact on customer experience or internal user experience, or producing a ton of value. There are some examples; we’ve kicked around a few of those in the DPLC community, but in general, I’m not seeing a ton of them so far.

While many product teams continue to race to deploy some sort of GenAI and especially LLMs into their products—particularly this is in the tech sector for commercial software companies—the general sense I’m getting is that this is still more about FOMO than anything else. So, beyond the ‘Rewrite This With AI’ button that you keep seeing pop up in a friendly text area control near you, on a web page near you—hello, LinkedIn—is there anything more than hype here? Like, can we think beyond yet another terrible customer support Chatbot?

So, today I wanted to just share some ideas that I thought would be helpful and that you should be considering in the design of software that is going to leverage LLMs at some point to improve value and outcomes for the humans in the loop that are going to be using them. So first, let’s talk a little bit about foundations, right? No matter what the technology is, a good user experience design foundation starts with not doing any harm, and hopefully going beyond usable to be delightful. And adding LLM capabilities into a solution is really no different. So, we still need to have outcome-oriented thinking on both our product and design teams when deploying LLM capabilities into a solution.

This, to me, is just a cornerstone of good product work. It’s understanding: what does an improvement to the business or organization look like? And what does an improvement to the life of the end-user of the solution look like? The latter often has a lot to do with whether the former actually sees any result. And you all know this from listening to this podcast and my rants for six years, or however long it’s been.

Now, the human-centered design perspective also means we are ideally thinking beyond the business we work for or serve, and beyond the users, because there may be other humans in the loop affected by our design choices. It’s easy to dismiss this—until it bites you in the ass with your legal team, a newspaper article you hoped your aunt or grandmother wouldn’t see, or some other surprise. And it’s all the more reason why having designers and researchers involved in the creation of these solutions is important. They provide a natural check on all of the technical work—work that may now mean even more engineering and data work relative to what traditional software might involve. That extra burden on your technical team, by definition, probably means user experience is likely to get shortchanged, because the impacts of bad user experience are harder for most people to see and recognize, especially if you don’t have a testing plan to even validate what quality means for the use cases the LLM is supposed to support.

So, the key point here is that you still need to define what a successful outcome for users and for the business or organization looks like before you implement an LLM and consider that deployment successful. So, challenges with LLM UIs and UXs, right, user interfaces and experiences: the most obvious challenge to me right now with large language model interfaces is that while we’ve given users tremendous flexibility in the form of a Google search-like interface, we’ve also in many cases limited the UX of these interactions to a text conversation with a machine. And effectively the CLI, the command line interface, is back. So, some of you Unix dorks and DOS people might like that and think that’s a good thing. I don’t think that’s the best possible experience and interface that we can aspire to, but… people might disagree there.

But effectively, we’ve got a text command-line thing. You’ve got to know what to type into the prompt to get back the juicy stuff. And right now, most of these machines tend to return the same GUI every time: another text prompt that returns more text. And again, this whole episode I’m only focusing on LLMs; I’m not looking at generative video, audio, and some of these other mediums out there, mostly because my work focuses on enterprise use cases, and many of those have to do with text, analytical data, and visual mediums like these—so, just a little aside there.

But many of these systems also don’t store a lot of context, and they don’t know how to help the user refine their prompts and questions further, such that the tool is guiding them to ask better questions that will lead to better results. And even when they do, not all information is best conveyed in written text. So, these UIs are basically immature to me at this point. What’s possible is still amazing given the limitations, but they’re fairly immature in terms of what we can do with the user interface. As such, and as with just about any other technology, I think product managers—the people leading these efforts—need to be thinking about context of use.

As always, what is the outcome and benefit here, and how is progress being measured? Before and after we insert an LLM into a user’s workflow, we need to know what an improvement actually means. And the thing is, LLMs may also come with taxes, right? Many of you know these things: the made-up data they spit out—sometimes silly, sometimes serious, sometimes ridiculous; frustrating filtering or narrowing experiences where you may not know whether you’re not getting results because there’s no data or because you don’t know how to prompt it correctly; and a UI that always returns the same format, which is typically text.

And you might be thinking again, “Oh, we can create images and videos.” That’s true, but in a lot of these enterprise use cases, particularly when thinking about data and information, what we may need are better visualizations of accurate information. You might consider a visualization to be a picture or an image; I really don’t—I think visualization is something else. And I don’t see a lot of that happening right now.

So, this forces us to visit the question of the interaction medium: is it text, is it visual, is it voice, or some other format? This probably goes without saying, but in a lot of enterprise situations, when we’re designing for knowledge workers who are probably sitting at desks, we’ll still probably be looking mostly at visual UIs—or text, as it may be—for the foreseeable future, versus things like voice. And why is that? Well, if we take a problem such as, “I have all this information about X, and I need to figure out what it means, what to do with it, how to instruct my team to move forward, or how to get my boss a recommendation,” a voice interface is generally not a good way to convey that information, just because of the way the eye processes information, and because you may have a lot of different stuff going on on the screen at the same time. So naturally, there’s an inclination here that visual displays of this technology are probably the thing we’re going to be focusing on more.

And when you take the CLI-style interface, the hallucinations and guessing, and the challenge LLMs have with doing calculations or retrieving structured data—such as the tabular data that is often part of knowledge work, which we want queried with the proper tools like SQL—when you add all these complications up, we have what’s known as a complex problem. It’s complex now, and it probably will be for the considerable future. So, if you identify as a product leader, especially a data product leader, and you’re working on these, I think we need to constantly revisit the question: are we making things better for the lives of our customers and users such that the value to the org or your business is likely to go up by including this kind of solution?

And what about the other humans in the loop such as affected third parties? Or what about the people that are currently doing some of this work that the LLM could effectively replace? You know, think about something like an analyst or something like that. If it would take the machine a few seconds to process a result versus what might take a day for a worker, what’s the role and purpose of that worker going forward? I think these are all considerations that need to be made, particularly if you’re concerned about adoption, which a lot of data product leaders are.

So, let’s move on. How much does accuracy matter in terms of the user experience—that is, the accuracy of a large language model and its predictions or generated responses? Anyone who’s played with an LLM such as ChatGPT has, I’m sure, experienced hallucinations and incorrect information. In fact, I’ve noticed you can take almost any response from an LLM, reply to it, and just say, “Sorry, that’s not right,” even if you believe it is correct, and the LLM will accept your criticism as valid and produce another response. So, it’s almost like it’s got a little opinion in there—or it always defers to the human behind the keyboard.

So, this gets to the question of what it means to measure “accurate enough” for the LLM to be useful, safe, and valuable—or whatever the parameters are that matter most in your context. I’m going to go into a quasi-hypothetical scenario here. Hopefully I’m not going to confuse things by using a somewhat meta example, but the topic is using—or not using—an LLM to summarize customer interview transcripts from research panels. Should we be using this technology to take large quantities of text from lots of users or customers who were interviewed, and find actionable trends or insights that could then inform future product development? So, let’s look at this hypothetical use case, right?

Part of why I’m using this example—setting aside privacy concerns, like what the models are trained on and the issues around putting private data into public cloud services like ChatGPT, et cetera—is that I saw a post recently about just how wrong it is to use LLMs to process information such as raw customer interviews and transcripts to do just this: summarize findings. A lot of the concerns were that the wrong types of shortcuts would be taken by the users of customer research information, such as product teams, and would lead to misunderstanding of the primary research data that had been captured, effectively nullifying much of the value of doing primary research in the first place. I also inferred from this post, and from others I’ve been seeing on LinkedIn from researchers, that there is a sense that an LLM or an AI agent simply can’t do as good of a job as a human researcher, and could potentially introduce false information and misleading insights from a research panel. So, in short, the sentiment I’ve been exposed to is primarily that a human researcher has to do this work; they have to be involved. And even if you control for things like data privacy, the LLM simply cannot provide useful, credible information.

My take on this right now—and this is not meant to be an episode specifically about whether to use LLMs for UX research; I’m using this as a hypothetical example of how we might approach building a solution in-house for explicitly this purpose—is that this position seems a little extreme. But it doesn’t really matter for the purpose of this podcast or the article I’m writing on this; I’m not here to take a position on that. However, in our hypothetical, let’s assume an internal team is contemplating building their own custom GPT for this purpose: what questions might they need to ask in order to determine whether to proceed? If I were working with a company considering this, I would probably be asking questions like: how is an accurate human analysis defined today, such that we know we can trust that the researcher is providing good info?

And I’m not saying the issue is that researchers are untrustworthy people, but rather: how do we know we got the best possible analysis and insight out of that information, beyond the fact that it’s their job? How much does speed matter—the time to an insight, and the time to put that insight into a product decision, an operational change, or whatever it may be? Another one: is it better or worse if the recipients of research insights, such as product managers, engineers, or business stakeholders, are now more interested in conducting primary research because the time from when raw research data is collected to when they can extract something practical for their purposes has shrunk, and their ability to ship product informed by actual customer research has gone up? If that interest increases, is that better or worse? And how easy is it to undo a product decision, an operational decision, or whatever it may be, that was misinformed?

So, in other words, let’s say a team got a bad interpretation, or the LLM generated a response that was factually wrong in terms of what the panel said about some question, and they put some product or feature into production—how easy is it to undo something like that if they reacted to that information? If we enabled ten product teams to move more quickly by having faster access to user research, largely through an LLM, and let’s say in the next quarter five of those ten could show positive value in terms of the user experience and the business results of the product they were building, is that a 50% win or a 50% loss? How do we frame that? And what about the other five that didn’t show results—if they shipped, quote, “bad product,” bad features, some kind of improvement that was objectively bad in that it had no measurable impact on the organization’s key objectives (say we’re trying to reduce the cost or the time of something, and no improvement was seen), and we knew in hindsight this was because the team designed some solution or feature based on fabricated information—how much does that incorrect work offset the 50% of teams that saw a gain because they could more readily access customer insights when building or shipping services and products?

Does that negative result—or maybe a null result is a better way to look at it, assuming we didn’t do damage here and simply shipped an increment of work that had no positive impact—does that matter that much? How much does it matter? And what if just one of the ten teams’ wins was a huge outsized success because they just nailed it—they tapped into some really deep insight, addressed that concern from customers, and shipped a feature, product, solution, or improvement that had a huge impact? And likewise, what if one of the failures—a team that made a build decision based on something that was totally wrong—ended up shipping something that was horrible for the brand? How much does that matter? How do a huge success or a huge loss weigh in terms of whether or not we would use a technology like this going forward?

So, this gets into the question: what game are we playing, right? Are our products, or our product teams, a portfolio of bets to us, in Annie Duke’s parlance? Are we playing several hands, where overall we’re trying to win more hands than we lose, or are we looking to do no harm anywhere? Of course, there’s always going to be context here, and I understand that matters depending on the domain you’re in, but these are just ways to think about this at an organizational level, even beyond the product, if you’re going to introduce a technology like this.

And finally, another question. What is the role of the researcher at this org in this use case, right? What is their highest value work, and are they spending the majority of their time doing that type of work, or are they doing other types of work that isn’t necessarily maximizing the experience, and knowledge, and training that they have? So, there’s no right or wrong answer here. These are all range questions, and they’re leadership questions, and context really matters, but I think these things are important to ask, particularly when we have this risk of reacting to incorrect information that looks plausible and believable because of how these LLMs tend to respond to us with a positive sheen all the time, or at least a lot of the time.

So, the team really needs to decide what the benchmarks are, right? What does “safe or good enough to use an LLM” actually mean? Teams of humans have to come up with this. If you’re in health or medical use cases, for example, then using an LLM to summarize information or generate some kind of text is going to be really different than, say, generating copy for 15 different eyeglass advertising creatives—you’re running ads on Facebook or something like that—because the risk of getting advertising copy wrong is obviously much lower than the risk of making an error with medical data. So again, judgment comes into play here.

And from everything I know—and this is outside my technical expertise, as most of you will know—my understanding is that fabrication and hallucination by these LLMs is not going away anytime soon. It may never go away, just because of how humans read information across different languages and all kinds of other factors that go into these systems. This seems like a constant we’re stuck with for the foreseeable future, so I think we just need to accept that it’s a dynamic variable. It’s the new tool, but it comes with some downsides, too. It’s like having a drill that works pretty well but sometimes cuts your hand: well, okay, do I want to manually drive every screw into this new garage I’m building, or do I want to use the drill as a screwdriver because it’s going to let me finish the project ten times faster—knowing that sometimes it’s going to cut my hand? Again, a judgment call has to be made there.

So, what does it mean to do a good job with the solution? This is a range of values; it’s not binary. There’s no single answer that it’s either good enough or not good enough—there’s some range that has to be decided upon by the team. As I was writing this article, one of the things I came across is that one of the key differences of designing for AI, and particularly for these LLMs, might be this idea of misinformed use: usage of the tool that seems like it’s based on factual information but isn’t. The users had a positive intention—they moved forward with something believing what they had read—but that information was entirely or partially wrong.

This is a little different from deterministic systems in software applications, where we know what all the edge cases are, we’ve got error handling in place for all of them, and the application is not going to route you to some different screen or UI at random based on how you filled out the form today versus tomorrow. So, misinformed use is now a new variable that designers and product people need to consider when designing for AI.

So, let’s pretend this internal team has kicked around the pros and cons, broadly accepted that the pros outweigh the cons, and decided it does make sense to build some kind of internal custom GPT. The overall goals: increase usage of primary research across all the product teams, and accelerate the shipment of product that is informed by research, such that we have a greater chance of seeing those outsized wins—maybe they’re one in twenty—happen, because more of our sprints, or whatever your increments of work are, are regularly and readily informed by actual insights instead of guessing. And we obviously want to reduce how much bad product is shipped out the door—less reliance on guessing, or simply not doing research, or just shipping stuff because it was in the backlog, someone told us to build it, and it’s not my problem if it’s not the right thing.

So, let’s put this into a practical example, and I’m going to set this in the insurance industry. But I’m not going to do that right now [laugh]—you’ll have to stay tuned for part two. I’m going to take this whole idea of building an internal LLM solution to help accelerate access to internal customer research data and ask: how might that look in an insurance industry use case? But that’s part two, so stay tuned for that. I really hope these initial ideas about designing for AI, and specifically for LLMs, were useful today. There’s a link to the article I’m writing that’s informing this episode. For that, or for more insights just like this, head over to designingforanalytics.com/podcast, and from there you can hop on my mailing list.

I’ll be sending out part two as well as part one. You can also leave a review for this show, which is always appreciated, and it helps spread it around. In fact, as promised at the top of the episode, that’s actually all you need to do to get one of the free passes to the CDOIQ conference coming up—I think it’s July 16th to 18th, 2024. These are the passes I get to give away as a speaker. To do that, just leave your Apple Podcasts review at designingforanalytics.com/podcast—the same URL I mentioned earlier—and when you’re done, just shoot me an email at brian@designingforanalytics.com, and I’ll send you a coupon code.

And if you’ve already reviewed the show, you can also just share this episode on your LinkedIn post, and just tag me in it. You might need to connect with me first. And once you’ve done that, I’ll enter you in the drawing. I am basically going to take the first four people to contact me. Again, I’ve got three virtual passes, and I believe I have one on-site pass, if you happen to live in the Boston area. So, until next time, stay tuned for part two, and stay healthy.

