Chatting with Home Office immigration data

--

Inspired by our recent ChatGPT hackathon, I wondered whether we could use AI to talk to some of the publicly available datasets we often use, such as the Home Office’s immigration statistics.

The answer is: yes, sort of:

This prototype offers a simple web interface where you can ask questions about immigration statistics published by the Home Office. In the animation above, I asked two relatively simple questions and it gave confident answers — but did it get them right?

How many people claimed asylum in 2022? The prototype answers: “89,398 people claimed asylum in 2022,” which matches the official count.

How many people from Afghanistan were granted family reunion visas in 2021 and 2022? The prototype says: 381 people. We’ll have to delve into the Home Office’s data tables to check this… After a minute or two of wrangling with Excel, we get the same answer:

Source: Family reunion visa grants / Home Office.

That’s two out of two correct. Very impressive so far (especially since this prototype is only about 30 lines of code), but how does it fare with nuance and complexity?

It struggles with more complicated questions

I asked it: “How many people claimed asylum last year?” Instead of specifying a particular year, this question requires the AI to figure out what year it is and work backwards. It turns out my fairly simple implementation struggles with this because it thinks we’re in 2021 (possibly because the underlying technology only knows about the world up to 2021), so gives the answer for the year 2020.

A more complex request like “Calculate the annual grant rates for people claiming asylum from Afghanistan over the last five years” causes the prototype to throw up a baroque error message. But for quite an interesting reason (read on to find out why…).

The rest of this post is going to delve into the technical details of how I built this prototype, how it works, and how it fails. If you’re not interested in GPT, LLMs, SQL, or other fun TLAs then look away now.

How to talk to databases

This prototype is a fairly simple Python program that uses LangChain to connect OpenAI’s GPT-3.5 to a custom SQLite database containing the immigration statistics, all wrapped up in a Flask web framework.

When a user submits their question, LangChain cleverly writes a prompt containing details of the database structure, some example data, and the user's question. It sends this prompt to a large language model of your choice and then processes the answer.

Here’s how to roll your own version:

Create a SQLite database of immigration statistics

Many statistics published by the UK Government come in the form of spreadsheets. As far as I know, there's no simple way to get large language models to reason over Excel files, so our first step is wrangling them into a tidy, machine-readable format.

Conveniently, I’d already built an R package for easy access to the UK’s immigration statistics (which is part of my team’s ‘humaniverse’ of R packages for humanitarian data in the UK), so it was just a case of using R to copy the data into a SQLite database:

install.packages(c("asylum", "dplyr", "RSQLite"))

library(asylum)
library(dplyr)
library(RSQLite)

# Create the SQLite database
con <- dbConnect(RSQLite::SQLite(), "asylum.db")

# Some of the column names contain spaces; replace with underscores
applications <- asylum::applications |> rename_with(~ gsub(" ", "_", .x, fixed = TRUE))
decisions_resettlement <- asylum::decisions_resettlement |> rename_with(~ gsub(" ", "_", .x, fixed = TRUE))
family_reunion <- asylum::family_reunion |> rename_with(~ gsub(" ", "_", .x, fixed = TRUE))
awaiting_decision <- asylum::awaiting_decision |> rename_with(~ gsub(" ", "_", .x, fixed = TRUE))
irregular_migration <- asylum::irregular_migration |> rename_with(~ gsub(" ", "_", .x, fixed = TRUE))

# Write the tables
dbWriteTable(con, "applications", applications)
dbWriteTable(con, "decisions_resettlement", decisions_resettlement)
dbWriteTable(con, "family_reunion", family_reunion)
dbWriteTable(con, "awaiting_decision", awaiting_decision)
dbWriteTable(con, "irregular_migration", irregular_migration)

# Save and disconnect
dbDisconnect(con)

That seems to have worked:
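If you want to check for yourself, a quick look at sqlite_master from Python will list the tables that ended up in asylum.db (this is just an optional sanity check, not part of the pipeline):

import sqlite3

# List the tables that ended up in the SQLite database
con = sqlite3.connect("asylum.db")
print(con.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall())
con.close()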

Using LangChain

Sign up for an OpenAI account and create an API key, then run pip install openai langchain before running this code to initialise LangChain:

from langchain import OpenAI, SQLDatabase, SQLDatabaseChain

# Point LangChain at the SQLite database created in R
db = SQLDatabase.from_uri("sqlite:///asylum.db")

# temperature=0 keeps the model's output as deterministic as possible
llm = OpenAI(temperature=0, openai_api_key="<Your OpenAI API key goes here>")

# The chain asks the LLM to write SQL and runs it against the database;
# verbose=True prints the generated SQL so you can check it
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)

Asking questions of our database takes just one line of code:

db_chain.run("How many people claimed asylum last year?")

Which gives this output:

In this example, you can see that GPT-3.5 interprets “last year” as 2019 for the SQL query it generates. (And sometimes it thinks “last year” was 2020.)
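One possible workaround (untested here) would be to anchor relative phrases like “last year” by telling the model today's date as part of the question:

from datetime import date

# Prepend today's date so the model can resolve "last year" correctly
question = "How many people claimed asylum last year?"
db_chain.run(f"Today's date is {date.today().isoformat()}. {question}")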

Wrapping the code in a web framework

The final step is to put this code in the Flask web framework. I won’t go into the details here; see my GitHub repo for the code. Make sure you pip install flask first.
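I won't reproduce the repo here, but a minimal sketch of the wrapping might look something like this (the route and page layout are illustrative, not the repo's actual code):

from flask import Flask, request, render_template_string
from langchain import OpenAI, SQLDatabase, SQLDatabaseChain

app = Flask(__name__)

# Same setup as before: database, model, and the chain connecting them
db = SQLDatabase.from_uri("sqlite:///asylum.db")
llm = OpenAI(temperature=0, openai_api_key="<Your OpenAI API key goes here>")
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)

PAGE = """
<form method="post">
  <input name="question" size="80" placeholder="Ask a question about the data">
  <button type="submit">Ask</button>
</form>
<p>{{ answer }}</p>
"""

@app.route("/", methods=["GET", "POST"])
def index():
    answer = ""
    if request.method == "POST":
        # Pass the user's question straight to the LangChain SQL chain
        answer = db_chain.run(request.form["question"])
    return render_template_string(PAGE, answer=answer)

if __name__ == "__main__":
    app.run(debug=True)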

Why do complex questions fail?

There are two questions I want to explore here because they cause the prototype to fail in interesting but different ways.

“How many people arrived from Libya by small boats in 2021?”

This seemingly simple question gives a confidently incorrect answer:

But why? Here’s the SQL query that LangChain and GPT-3.5 generate behind the scenes:

SELECT SUM(Number_of_detections) 
FROM irregular_migration
WHERE Year = 2021
AND Method_of_entry = 'Small boats'
AND Nationality = 'Libya';

It looks legit, except “Small boats” is not a valid entry in the Method_of_entry column. It should actually say “Small boat arrivals,” but LangChain doesn’t tell GPT this. Fortunately, there’s a way to pass this information on.

When LangChain talks to a large language model about a SQL database, it gives some context for each table: a CREATE TABLE statement giving the structure, alongside some sample data based on the top handful of rows in the table. But if this sample data doesn’t include all possible values for columns such as Method_of_entry then GPT has no way of knowing to include them in the SQL query it generates.

Luckily there’s a fairly easy fix. We just need to give LangChain some custom info about this table:

irreg_table_info = {
    "irregular_migration": """
CREATE TABLE irregular_migration (
    "Date" REAL,
    "Year" REAL,
    "Quarter" INTEGER,
    "Method_of_entry" TEXT,
    "Nationality" TEXT,
    "Region" TEXT,
    "Sex" TEXT,
    "Age_Group" TEXT,
    "Number_of_detections" REAL
)
/*
3 rows from irregular_migration table:
Date Year Quarter Method_of_entry Nationality Region Sex Age_Group Number_of_detections
17532.0 2018.0 1 Inadequately documented air arrivals Afghanistan Asia Central Female 18-24 4.0
17532.0 2018.0 1 Recorded detections in the UK Afghanistan Asia Central Female 25-39 14.0
17532.0 2018.0 1 Small boat arrivals Afghanistan Asia Central Female 40+ 22.0
*/
"""
}

db = SQLDatabase.from_uri("sqlite:///asylum.db", custom_table_info=irreg_table_info)

In the code above, I’ve given an example row of data that contains “Small boat arrivals” in the Method_of_entry column. Note that this example doesn’t exist in the actual data — you can give fake data to illustrate what you need.

Now, asking the question gives:

Which is the correct answer:

Source: “Irregular migration” statistics / Home Office

“Calculate the annual grant rates for people claiming asylum from Afghanistan over the last five years.”

This is a much more complex question which, when asked, throws an error:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such column: Applications
[SQL: SELECT Year, SUM(Decisions) AS Total_Decisions, SUM(Applications) AS Total_Applications, (SUM(Decisions) / SUM(Applications)) * 100 AS Grant_Rate FROM decisions_resettlement WHERE Nationality = 'Afghanistan' GROUP BY Year ORDER BY Year DESC LIMIT 5;]

If you look at the AI-generated SQL statement, it has made up an Applications column that doesn’t exist in the decisions_resettlement table. So, it has the right idea but is struggling a bit. What if we include a definition of ‘grant rate’ in the query?

Grant rate is the proportion of initial decisions which resulted in a grant of protection or other leave. Calculate the annual grant rates for people from Afghanistan over the last five years.

We now get an answer rather than an error:

Unfortunately, these answers are massively wrong. It calculated last year’s grant rate for people claiming asylum from Afghanistan as 4.8%, whereas it was actually 98% (see Figure 4 in the official bulletin).

What’s going wrong? Let’s glance at the AI-generated SQL query:

SELECT Year, (SUM(Case_outcome_group = 'Grant of Protection' OR Case_outcome_group = 'Grant of Other Leave') / SUM(Decisions)) * 100 AS Grant_Rate 
FROM decisions_resettlement
WHERE Nationality = 'Afghanistan'
GROUP BY Year
ORDER BY Year DESC
LIMIT 5;

There are a few things it gets wrong:

  • The numerator in its calculation of Grant_Rate should actually be the sum of decisions where Case_outcome_group was a grant of protection or of other leave, not the number of rows.
  • The denominator should be the sum of decisions where Case_outcome_group was a grant or a refusal (but nothing else).
  • It should filter rows where Case_type is an asylum case.

… but I’m not sure how to fix this (perhaps adding a bit more detail in the prompt?).
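For reference, the query it would need to produce looks roughly like the hand-written sketch below, based on the bullet points above; the Case_outcome_group and Case_type labels are my assumptions and would need checking against the real table before trusting any numbers.

import sqlite3

# Hand-written sketch of a grant rate query; the category labels below are
# assumptions and may not match the actual values in the table
query = """
SELECT Year,
       100.0 *
       SUM(CASE WHEN Case_outcome_group IN ('Grant of Protection', 'Grant of Other Leave')
                THEN Decisions ELSE 0 END) /
       SUM(CASE WHEN Case_outcome_group IN ('Grant of Protection', 'Grant of Other Leave', 'Refused')
                THEN Decisions ELSE 0 END) AS Grant_Rate
FROM decisions_resettlement
WHERE Nationality = 'Afghanistan'
  AND Case_type = 'Asylum Case'  -- assumed label
GROUP BY Year
ORDER BY Year DESC
LIMIT 5;
"""

con = sqlite3.connect("asylum.db")
for row in con.execute(query):
    print(row)
con.close()

The open question is how to get the model to generate something like this by itself.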

Can we trust it?

Despite getting three of my four questions right (after a bit of tweaking), this prototype makes far too many mistakes to be trusted.

I only trust it insofar as I can double-check the SQL it generates and runs. But I definitely wouldn’t give this prototype to non-technical users without first doing a LOT of rigorous testing.

But if you do comprehensive enough testing on relatively simple public datasets such as this, might you end up with an almost complete bestiary of the kinds of questions users want to ask of these datasets? That way you could avoid the stochasticity inherent in AI-generated SQL queries and instead use something like a fuzzy lookup table (e.g. if the user mentions grant rates, execute pre-defined grant rate calculation code) as sketched below.
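Crudely sketched, that lookup idea might look something like this (the keyword and handler are purely illustrative):

def grant_rate_calculation(question):
    # Run a pre-defined, tested grant rate query here instead of AI-generated SQL
    return "grant rates calculated from a hand-written query"

# Map keywords in the user's question to pre-written calculation code
LOOKUP = {
    "grant rate": grant_rate_calculation,
}

def answer(question):
    for keyword, handler in LOOKUP.items():
        if keyword in question.lower():
            return handler(question)
    # Fall back to the AI-generated SQL for anything we haven't pre-defined
    return db_chain.run(question)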

If you know that users will want to ask for common but non-obvious calculations like asylum grant rates, maybe you (the developer) could pass predefined functions to the large language model — though I don’t know whether LangChain supports that (yet).

Final thoughts

It goes without saying that productionising this prototype would require a far, far better user interface, graceful error handling, a guide to asking questions, information on how to check answers, ways to avoid SQL code injection (assuming that’s possible — I haven’t checked), and much more besides. But I’m pleasantly surprised by what was possible with a few hours of weekend hacking.

Hopefully it’s easy to use this code as a basis for other datasets (such as all the public health and health system performance data in our healthyr R package). And I’m looking forward to trying this with GPT-4 one day soon…

--


Matthew Gwynfryn Thomas
Insight and Improvement at British Red Cross

Anthropologist, analyst, writer. Humans confuse me; I study them with science and stories.