NL2SQL: When GPT4 doesn’t even answer

June 17, 2024

There has been a lot of emphasis on LLMs hallucinating, or confidently providing an answer that is not accurate. While this remains true, in our testing several models, such as GPT4, categorically fail to provide any answer at all. In particular, GPT4 and many other state-of-the-art natural language to SQL (NL2SQL) models often provide no response for certain types of questions, such as those with domain-specific vocabulary, or complex questions that translate into relatively complex SQL.

A recent paper from POSTECH and Google assessing the current state of NL2SQL, “Natural Language to SQL: Where are we today?”, came to the same conclusion: “The existing [NL2SQL] techniques are still at a basic level and need to be significantly enhanced… Real-world databases may have a lot of rare words which cannot be found in the pre-trained word embedding… Those become even worse on complex queries.”

In our own testing, we consistently found that GPT4 refused to provide an answer. In the video below, we’ve provided the same data schema and information to both GPT4 and to our own hila Conversational Finance application. The results are clear: the techniques implemented in hila Conversational Finance (on top of the LLM) are necessary, given the current maturity of even the best LLMs. These techniques not only ensure that an answer is returned, but also that the answer is reliable.

We’ve also broken out “simple” queries, such as “Total amount by fiscal year between 2019 and 2024 for category Sales Revenue,” from complex questions, such as “What was the total amount for the account category Gains Price Difference on each of the underlying general ledger accounts in fiscal year 2024?” The sketch below illustrates the difference.
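To make the distinction concrete, here is a rough sketch of the SQL these two questions might map to, written against a hypothetical general ledger table (gl_entries, with columns fiscal_year, account_category, gl_account, and amount) rather than our actual test schema.

```sql
-- Simple question: total amount by fiscal year for one account category.
-- Table and column names are illustrative, not our production schema.
SELECT fiscal_year,
       SUM(amount) AS total_amount
FROM gl_entries
WHERE account_category = 'Sales Revenue'
  AND fiscal_year BETWEEN 2019 AND 2024
GROUP BY fiscal_year
ORDER BY fiscal_year;

-- Complex question: total amount per underlying general ledger account
-- for one account category in a single fiscal year.
SELECT gl_account,
       SUM(amount) AS total_amount
FROM gl_entries
WHERE account_category = 'Gains Price Difference'
  AND fiscal_year = 2024
GROUP BY gl_account
ORDER BY gl_account;
```

Even the “complex” question here is modest SQL; the hard part for the models is mapping domain terms like “account category” and “general ledger account” onto the right schema elements.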

It’s important to note that these results do not include cases where the model returned a wrong answer, which does happen, especially for complex questions.

About our results

Our dataset is specific to financial systems. This is also the target for hila Conversational Finance, and the key to the domain knowledge we’ve built into the system. This domain knowledge helps overcome the barrier the paper points out: the “rare words which cannot be found in the pre-trained word embedding.”

This domain knowledge also enables us to handle many questions in our dataset that GPT4 fails on, such as those involving ratios, percentages, CAGR formulas, year-end and quarter contexts, and more.
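As one example of why formula questions trip up generic models, a CAGR question must be grounded both in the right accounts and in the standard formula CAGR = (ending total / beginning total)^(1 / number of years) - 1. A minimal sketch against the same hypothetical gl_entries table (not hila’s actual implementation) might look like:

```sql
-- Hypothetical CAGR for 'Sales Revenue' from fiscal year 2019 to 2024 (5 years).
-- CAGR = (ending_total / beginning_total) ^ (1 / number_of_years) - 1
WITH yearly AS (
    SELECT fiscal_year,
           SUM(amount) AS total_amount
    FROM gl_entries
    WHERE account_category = 'Sales Revenue'
      AND fiscal_year IN (2019, 2024)
    GROUP BY fiscal_year
)
SELECT POWER(
           MAX(CASE WHEN fiscal_year = 2024 THEN total_amount END) /
           MAX(CASE WHEN fiscal_year = 2019 THEN total_amount END),
           1.0 / 5
       ) - 1 AS cagr
FROM yearly;
```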

GPT4, like all advanced LLMs, continues to face challenges in overcoming the last-mile aspects of SQL generation. With hila Conversational Finance, we’ve done this work for our clients, so they can get started with useful generative AI right away.