The past year and a half has demonstrated the impressive capabilities of generative AI (GenAI) systems such as ChatGPT, Bard, and Gemini. Business application vendors have since begun a sprint to build the newly available capabilities (summarizing, drafting text, natural language conversation, etc.) into their products. And organizations across industries have started to deploy generative AI to help serve customers, hoping that GenAI-powered chatbots could provide a better customer experience than the failed and largely useless service chatbots of the past.
The results have started to come in, and they are mixed. Service chatbots at organizations such as Air Canada and DPD have made unsubstantiated offers or even produced rogue poetry. Another customer chatbot, at a Nordic insurance company, was not updated after the latest website reorganization and kept sending customers to outdated and decommissioned web pages.
The popular Microsoft Copilot has hallucinated about recent events, inventing occurrences that never happened. In one case from personal experience, a customer meeting summary written by generative AI closed with an evaluation of the meeting as “largely unproductive due to technical difficulties and unclear statements,” an assessment not echoed by the human participants.
These issues highlight several dilemmas related to using generative AI in software applications:
- Autonomous AI functions versus human-supervised AI. Autonomous AI is attractive to customer service departments because of the cost difference between a chatbot and a human customer service agent. This cost-saving potential must, however, be balanced against the risk of reputational damage and negative customer experiences resulting from chatbot failures and mishaps.
Instead, designing solutions with a “human in the loop” may have multiple benefits. Incorporating employee oversight to guide, validate, or enhance the performance of AI systems may not only improve output accuracy but also increase adoption of GenAI solutions. For example, a customer service agent could have a range of tools, such as automatically drafted chat and email responses, intelligent knowledge bases, and summarization tools that augment productivity without replacing the human.
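As a minimal sketch of what such a flow could look like (the model client and review callback are illustrative assumptions, not any vendor's API), the model only drafts a reply, and nothing reaches the customer until a human agent approves or edits it:

```python
from dataclasses import dataclass

@dataclass
class DraftReview:
    draft: str          # text proposed by the generative model
    approved: bool      # did the human agent accept it?
    final_text: str     # what is actually sent to the customer

def draft_reply(llm, customer_message: str) -> str:
    """Ask the model for a draft only; nothing is sent automatically."""
    prompt = (
        "Draft a short, polite reply to the customer message below. "
        "Do not promise refunds, discounts, or policy exceptions.\n\n"
        f"Customer: {customer_message}"
    )
    return llm.generate(prompt)  # hypothetical LLM client

def human_in_the_loop(llm, customer_message: str, agent_review) -> DraftReview:
    """agent_review is a callback through which a human edits or rejects the draft."""
    draft = draft_reply(llm, customer_message)
    approved, final_text = agent_review(draft)  # the human decision point
    return DraftReview(draft=draft, approved=approved, final_text=final_text)
```

The key design choice is that the model augments the agent's productivity while the agent remains the only party who can actually respond to the customer.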
- At what point is company-specific training enough? In other words, should organizations make extensive training investments in company-specific large language models (LLMs), or rely on out-of-the-box LLMs, such as ChatGPT, for good-enough answers? In some of the generative AI failures described above, it appears that the company-specific training of the AI engine was too superficial and did not cover enough interaction scenarios.
As a result, the AI engine fell back on its foundation LLM, such as GPT or PaLM, which in some cases acted in unexpected and undesired ways. Organizations are understandably eager not to reinvent the wheel with respect to LLMs, but the examples above show that over-reliance on general LLMs is risky.
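One way to limit this unwanted fallback, sketched below under the assumption of a retrieval step over company-approved content (the retriever and model client names are illustrative), is to answer only from company material and hand off to a human when nothing relevant is found, rather than letting the foundation model improvise:

```python
def answer_from_company_content(llm, retriever, question: str) -> str:
    """Answer only when company-approved documents support the answer."""
    documents = retriever.search(question, top_k=3)  # hypothetical retriever
    if not documents:
        # Better to hand off than to let the foundation model improvise.
        return "I can't answer that reliably; let me connect you with an agent."

    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using ONLY the company documents below. "
        "If the documents do not contain the answer, say you do not know.\n\n"
        f"Documents:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)
```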
- Keeping the chat experience simple versus allowing the user to report issues. Such issues include errors, biased information, irrelevant information, offensive language, and incorrect format, and reporting them properly requires some understanding of the sources and training methods behind a response. A good software user experience is helped by a clean user interface; in the context of generative AI, think of the prompt input field in an application. Traditional wisdom suggests keeping this very clean. But what is the user supposed to do in case of errors or other unacceptable AI responses, and how is the user supposed to verify sources and methodologies? (One possible approach is sketched after this point.)
This is linked to the need for “explainable AI”, which refers to the concept of designing and developing AI systems in such a way that their decisions and actions can be easily understood, interpreted, and explained by humans.
The need for explainability has arisen because many advanced machine learning models, especially deep neural networks, are often treated as “black boxes” due to their complexity and the lack of transparency in their decision-making processes.
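One way to reconcile a clean prompt field with the ability to report problems, sketched below with illustrative data structures of our own, is to attach the cited sources and a lightweight feedback record to every response instead of cluttering the input itself:

```python
from dataclasses import dataclass, field
from typing import List

# Issue categories a user can flag; drawn from the list above.
ISSUE_TYPES = ["error", "bias", "irrelevant", "offensive", "wrong_format"]

@dataclass
class AssistantResponse:
    answer: str
    sources: List[str] = field(default_factory=list)       # shown on demand, not inline
    reported_issues: List[str] = field(default_factory=list)

def report_issue(response: AssistantResponse, issue_type: str) -> None:
    """Called when the user flags a response; feeds human review and retraining."""
    if issue_type not in ISSUE_TYPES:
        raise ValueError(f"Unknown issue type: {issue_type}")
    response.reported_issues.append(issue_type)
```

The prompt field stays clean, while the sources support explainability and the flagged issues give the vendor a feedback loop.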
- Using generative AI for very specific and controlled use cases versus general AI scenarios. One way to potentially curb the risk of AI errors is to confine the use of AI to specific, limited application use cases. One example is a “summarize this” button placed next to a field of unstructured text as part of a specific user experience. There is a limit to how wrong this can go, as opposed to an all-purpose prompt-based digital assistant (see the sketch after this item).
This is a difficult dilemma simply because a general-purpose assistant is so attractive, which is why vendors have announced exactly that (e.g., Joule from SAP, Einstein Copilot from Salesforce, Oracle Digital Assistant, and Sage Copilot).
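The contrast is visible even at the prompt level. In the sketch below (function and model client names are illustrative assumptions), the “summarize this” button fixes both the task and the input, while the general assistant passes whatever the user types straight to the model:

```python
def summarize_field(llm, field_text: str, max_sentences: int = 3) -> str:
    """A narrowly framed task: the user never writes the prompt."""
    prompt = (
        f"Summarize the following text in at most {max_sentences} sentences. "
        "Do not add information that is not in the text.\n\n"
        f"{field_text}"
    )
    return llm.generate(prompt)

def general_assistant(llm, user_prompt: str) -> str:
    """An open-ended assistant: anything the user types goes to the model."""
    return llm.generate(user_prompt)
```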
- Charging customers for generative AI value versus wrapping it into existing commercial models. GenAI is known to be expensive in terms of the compute costs and manpower needed to orchestrate and supervise training. This raises the question of whether such new costs should be passed on to customers.
This is a complex dilemma for a number of reasons. Firstly, AI costs are expected to decline over time as this technology matures. Secondly, AI functionality will be embedded into standard software, which is already paid for by customers.
The embedded nature of many AI application use cases will make it very difficult for vendors to charge separately for incremental new AI functions. Mandatory additional AI-related fees on existing SaaS solutions are likely to be met with strong objections from customers.
- Sharing the risk of inaccurate generative AI outputs with customers and partners versus letting customers be fully accountable. Generative AI will increasingly be used to support the decision-making of key personas in organizations. What if it hallucinates and the outputs are misleading? And what if the consequence is a wrong decision with a serious negative impact on the client organization? Who is going to take responsibility for the consequences of those actions? Should customers accept this burden alone, or should accountability be distributed between vendors, their partners (e.g., LLM providers), and end customers?
In any case, vendors should offer full transparency into their solutions (including clear procedures for training, implementing, monitoring, and measuring the accuracy of generative AI models) so that they can immediately provide the required information to the customer when an incident occurs.
Having taken the enterprise technology space by storm, generative AI is likely to progress more slowly than initially expected. As a new technology, GenAI might enter the “phase of disillusionment,” to paraphrase colleagues in the analyst industry.
This slowdown will be driven by a more cautious adoption of AI in enterprise software, as new horror stories instill fear of reputational damage in CEOs across industries. We believe that new generative AI rollouts will have more guardrails, more quality assurance, more iterations, and much better feedback loops compared to earlier experiments.