UI/UX Design Considerations for LLMs in Enterprise Applications

Rather listen to this? I also recorded a two-part podcast episode about this topic.

Let's look beyond the “rewrite with AI” button and chatbots!

Generative AI in 2024 is all the rage, yet to date I haven’t seen or heard about many great examples of it making a substantial impact on user experience.

While many product teams continue to race to deploy some sort of GenAI, and especially LLMs, into their products, the general sense I get is that this is more about FOMO than anything else. So, beyond the “rewrite this with AI” button you keep seeing pop up (hello LinkedIn), is there anything here beyond hype? Can we think beyond yet another terrible customer support chatbot?

Today I want to share some ideas that I would be considering in the design of software that is going to leverage LLMs to improve value and outcomes for the humans in the loop.

Foundations

No matter what the technology, a good UX design foundation starts with not doing harm and hopefully going beyond usable to being delightful. Adding LLM capabilities is no different. We still need to have outcome-oriented thinking on both our product and design teams when deploying LLM capabilities. This is a cornerstone of good product work: understanding what an improvement to the business and to a user’s life looks like.

Now, the human-centered design perspective also means we are (ideally!) thinking beyond just the business we work for or serve, and beyond the users, because there may be other humans in the loop affected by our design choices. It’s easy to dismiss this until it bites you in the ass via your legal team, a newspaper article you hoped never to read, or some other surprise, which is all the more reason why having designers and researchers involved is important. They provide a natural check on all the technical work, work that may now demand even more engineering and data effort than traditional software. That extra burden on your technical team means UX is likely to get short-changed, because the impacts are harder to see, especially if you don’t have a testing plan that validates what “quality” means in your LLM use cases.

So, the key point here is that you still need to define what a successful outcome looks like, for users and for the business, before you implement an LLM and declare it a success.

Challenges with LLM UI/UXs

The most obvious challenge to me right now with LLMs is that while we’ve given users tremendous flexibility in the form of a Google-search-like interface, we’ve also, in many use cases, limited the UX to a text conversation with a machine. The CLI is back.

The machine tends to return the same GUI every time: another text prompt that returns yet more text.

Many of these systems do not store context either, nor do they know how to help the user refine their question (e.g. “you may want to try asking X.”)

Even if they do, not all information is best conveyed in written format, and so LLM UIs right now are immature. As such, and as with just about any other technology, product managers need to be thinking about context of use.

What is the outcome/benefit here and how is progress being measured before and after we insert an LLM into the workflow?

The thing is, LLMs may also come with these taxes: the made-up data (silly, serious, and ridiculous), frustrating filtering/narrowing experiences, and a UI that tends to always return the same type of response format (more text). You may be thinking “but we can also create images and videos.” Yes, that’s true—but in a lot of enterprise use cases, particularly when thinking about data and information, what we may need are better visualizations of accurate information.

This then forces us to visit the interaction medium: text/visual, voice, or other format?

In many enterprise situations when designing for knowledge workers sitting at desks, we’re probably still looking mostly at visual UIs for the foreseeable future (vs. voice etc.) Why? If we take a problem such as, “I have all this information about X, and I need to figure out what to do/instruct my team to do/give my boss a recommendation on next steps,” a voice interface is generally not a good way to convey this information.

When you combine the CLI-style interface, the hallucinations and guessing, and the trouble LLMs have with doing calculations or retrieving the structured (i.e. tabular) data that is often part of knowledge work, data we really want queried with the proper tools (e.g. SQL), we have what’s known as a “complex problem.”

It’s complex now, and it probably will be for the foreseeable future.
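
To make the “proper tools” point concrete, here is a minimal sketch of one way to route the calculation/tabular part of a question to a database instead of letting the model do the math. Everything in it is hypothetical: `call_llm` stands in for whatever LLM client your team uses, and the `claims` schema is invented purely for illustration.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client your team uses."""
    raise NotImplementedError

def answer_tabular_question(question: str, db_path: str) -> list[tuple]:
    """Have the LLM draft the query, but let the database do the math."""
    schema = "CREATE TABLE claims (id INTEGER, region TEXT, amount REAL, filed_on DATE);"  # invented schema
    sql = call_llm(
        "Given this SQLite schema:\n" + schema + "\n"
        f"Write one read-only SELECT statement that answers: {question}\n"
        "Return only the SQL."
    ).strip().rstrip(";")

    # Guardrail: refuse anything that isn't a plain SELECT before touching the data.
    if not sql.lower().startswith("select"):
        raise ValueError(f"Refusing to run non-SELECT statement: {sql!r}")

    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()  # the numbers come from the database, not the model
```

A real implementation needs far stronger validation than a startswith check, but the shape is the point: free text in, verifiable structured data out, with the arithmetic done by something deterministic.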

Hence, as product leaders we must constantly revisit the question, “Are we making things better for the lives of our customers and users such that the value to the org/business is likely to go up?” What about the other humans in the loop, such as affected third parties? What about the workers doing this work by hand today: what is their role and purpose going forward?

LLMs: How Much Does Accuracy Matter in the UX?

Anyone who has played with an LLM such as ChatGPT has experienced hallucinations and incorrect information. In fact, you can take almost any response from an LLM, reply “sorry, that’s not correct,” and the LLM will accept your criticism as correct and produce a new response.

So this gets to the question: what does it mean to measure “accurate enough” to be useful, safe, valuable, or whatever parameters are most important in your context?
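
One way to start putting a number on “accurate enough” is to compare the model against a human baseline on material you already trust, e.g. transcripts a researcher has already coded. Here is a rough sketch; all names and themes below are made up, and in practice you would need fuzzier matching than exact strings.

```python
# Hypothetical evaluation sketch: compare LLM-extracted themes against
# researcher-coded themes for the same transcript.

def theme_overlap(human_themes: set[str], llm_themes: set[str]) -> float:
    """Jaccard overlap between the two theme sets, from 0.0 (no agreement) to 1.0 (identical)."""
    if not human_themes and not llm_themes:
        return 1.0
    return len(human_themes & llm_themes) / len(human_themes | llm_themes)

# Made-up example data.
human = {"pricing confusion", "onboarding friction", "wants mobile app"}
llm = {"pricing confusion", "onboarding friction", "feature request: reporting"}

print(theme_overlap(human, llm))  # 0.5 -- the team still has to decide what score is "good enough"
```

The score itself doesn’t answer anything; deciding what threshold counts as useful, safe, or valuable is exactly the judgment call the rest of this section is about.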

I’m going to go into a quasi-hypothetical scenario here, hopefully without confusing things by using a somewhat “meta” example. The topic is using (or not!) an LLM to summarize customer interview transcripts from research panels.

Should we use these to take large quantities of text from lots of users and find actionable trends/insights to inform product development?

Hypothetical Use Case: Using LLMs to Summarize Raw Customer Research

I saw a post recently arguing, for reasons beyond privacy concerns, that it is just plain wrong to use LLMs to process information such as raw customer interviews (i.e. transcripts) and summarize the findings.

Much of the concern was that the wrong types of shortcuts would be taken, leading to misunderstandings of the primary research data that had been captured and effectively nullifying much of the value of doing primary research.

I also inferred from this post (and prior ones from other UX researchers I know) a sense that an LLM/AI agent could not do as good a job as a human researcher, and could potentially introduce false information and misleading insights from a panel. In short, the sentiment I’ve been exposed to is primarily that “a human researcher has to be involved,” and that even if you control for things like data privacy, an LLM simply cannot provide useful, credible information.

My take as of right now is that this position seems extreme, but it doesn’t really matter for the purposes of this article; I’m not here to take a position on whether you should or should not use LLMs for this purpose. For our hypothetical, let’s assume an internal team is contemplating building its own custom GPT for exactly this. What questions might they need to ask to determine whether to proceed?

I might suggest asking:

  • How is “accurate” human analysis defined today such that we know we can trust that the researcher is providing “good” info?
  • How much does speed (time to insight/reaction) matter?
  • Is it better or worse if the recipients of research insights (i.e. product managers, engineers, business stakeholders) become more interested in conducting primary user research because easier access to insights reduces time-to-insight (or product experimentation time)?
  • How easy is it to undo a product decision that was misinformed (e.g. a product strategy, feature, or change was implemented based on a bad or incorrect interpretation of research insights from an LLM)?
  • If we enabled 10 product teams to move more quickly by having faster access to user research - but largely through an LLM - and in the next quarter, 5 of them could show positive value (UX+biz) as a result, is that a 50% win or a 50% loss?
  • If the other 5 shipped “bad” product that had no measurable impact on the org’s KPIs, in part because at some later date we found out the customer research insights these teams used were fabricated by the LLM, does that change our 50% win (or loss) score as an org?
  • What if just one win was a huge outsized success, or one of the failures was deemed “horrible for the brand”?
  • What game are we playing? Is our product or products a portfolio of “bets,” to use Annie Duke’s parlance?
  • What is the role of a researcher at this org? What is their highest value and are they spending the majority of their time doing that type of work?

There isn’t an absolute right/wrong answer here; this is a leadership question and the context, industry and domains matter.

The team will have to decide what “safe/good enough to use an LLM” means. Teams of humans will. Logically, health and medical use cases will be different from using an LLM to generate taglines for eyeglass advertising copy.

Fabrication and hallucinations aren’t going away any time soon, so…

As I understand it (and I’m neither a data scientist nor an LLM researcher), we’re not close to guaranteeing that LLMs won’t hallucinate and make mistakes. Whether the model does a “good job” is measured along a range of values; it’s not binary. Because the model is probabilistic, it will predict things we cannot fully control, so teams leveraging LLMs have a new factor to consider that doesn’t exist with traditional software: misinformed use.

This is one of the unique differences between designing for AI and LLMs vs. traditional software: we now have to consider informed use, no use, and misinformed use. That’s part of the price, and challenge, of designing for AI and especially for LLMs.

So, let’s pretend we’ve broadly accepted that the pros here outweigh the cons, and an internal team is going to build its own custom GPT with an overall goal to:

  • Increase usage of primary research across the product teams
  • Accelerate the shipment of product that is informed by research such that outsized wins have a greater chance of emerging
  • Reduce how much “bad product” is shipped as a result of relying upon guessing (no research) or because “we were told to build it”
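
As one example of what a guardrail might look like inside such a tool, here is a minimal sketch that asks the model to attach a verbatim quote to each insight and then drops anything whose quote cannot actually be found in the transcript. The prompt wording, the JSON shape, and `call_llm` are all hypothetical; this is a sketch, not a recommendation of any particular stack.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client the team uses."""
    raise NotImplementedError

def summarize_transcript(transcript: str) -> list[dict]:
    """Extract insights, keeping only those whose supporting quote really exists in the source."""
    raw = call_llm(
        "Summarize the key insights from this customer interview as JSON: "
        '[{"insight": "...", "quote": "verbatim quote from the transcript"}]\n\n'
        + transcript
    )
    insights = json.loads(raw)  # a real pipeline would also handle malformed JSON here

    # Guardrail: drop any insight whose "supporting" quote can't be found verbatim in the transcript.
    grounded = [i for i in insights if i.get("quote") and i["quote"] in transcript]
    dropped = len(insights) - len(grounded)
    if dropped:
        print(f"Flagged {dropped} insight(s) with unverifiable quotes for human review.")
    return grounded
```

A check like this doesn’t make the summaries “correct,” but it does give the researcher a concrete place to intervene rather than taking the model’s word for it.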

Let’s look at a practical example of this, set in the insurance industry. But not right now. You’ll have to stay tuned for part 2 😉

Want Part 2?

To get notified when it comes out, just subscribe to my Insights mailing list below:

 

Photo by Toru Wa on Unsplash

