turgeon.au

Building a Lightning-Fast Address Search with SQLite and FTS5

Intro

Ever needed to lookup your address when you’re ordering something online? It’s pretty common, I thought it would be cool to implement an address lookup database in SQLite3 to get a better understanding of full-text search.

The Dataset we will use

The Australian Government release a geocoded address dataset for the whole of Australia with quarterly updates. As quoted from data.gov.au:

Geoscape G-NAF is the geocoded address database for Australian businesses and governments. It’s the trusted source of geocoded address data for Australia with over 50 million contributed addresses distilled into 15.4 million G-NAF addresses.

The Geoscape Geocoded National Address File (G-NAF), not to be mistaken for FNAF (Five Nights at Freddy’s). Is basically a zip file with a bunch of PSV files supplied by each state in Australia with physical addresses.

This type of dataset and use-case (address auto completion) is a perfect task for SQLite3 and the FTS5 extension, which adds full-text search functionality to the database engine.

Shoutouts

Thankfully I am not the first person to ingest the G-NAF dataset into SQLite, a huge thanks to Peter’s blog post on Exploring G-NAF with SQLite and this GitHub Repo by Gerard which contains ingestion scripts, which saved me a lot of time ingesting the data into SQLite.

The first couple of sections of this post is a quick rehash taken from the people above, included here for completeness so you can easily follow along. You can skip to my full-text search addition here.

Tools you’ll need

You need SQLite3 installed with the FTS5 extension enabled. Some Linux distributions come with it already enabled, you can check by running this command:

$ sqlite3 :memory: "PRAGMA compile_options;" | grep ENABLE_FTS5

If you’re lucky you’ll see ENABLE_FTS5 as output from the above command. If not, you may need to compile and install SQLite3 from source. Notably passing --enable-fts5 to ./configure.

You’ll also need the following tools

Ingesting the G-NAF into SQLite3

Alright let’s get this data into SQLite!

Download the Dataset

Use this command to get the latest G-NAF dataset URL from data.gov.au:

$ curl -s -o- https://data.gov.au/data/api/3/action/package_show?id=19432f89-dc3a-4ef3-b943-5326ef1dbecc | jq -r '.result.resources[].url | select(contains("gda2020"))'

Once we have that URL we can then download the dataset, it’s almost 2GB:

$ curl -o gnaf.zip 'https://data.gov.au/data/dataset/19432f89-dc3a-4ef3-b943-5326ef1dbecc/resource/fa2f2db3-c7ad-4bc7-b93b-1f6415e180fa/download/g-naf_may25_allstates_gda2020_psv_1019.zip'

Finally we’ll extract the dataset, it’s about 8GB uncompressed:

$ unzip gnaf.zip -d gnaf

The structure of the dataset looks like this:

$ tree gnaf
G-NAF
├── Extras
│   ├── GNAF_TableCreation_Scripts
│   │   ├── add_fk_constraints.sql
│   │   ├── create_tables_ansi.sql
│   │   ├── create_tables_sqlserver.sql
│   │   └── README.txt
│   └── GNAF_View_Scripts
│       ├── address_view.sql
│       └── README.txt
└── G-NAF MAY 2025
    ├── Authority Code
    │   ├── Authority_Code_ADDRESS_ALIAS_TYPE_AUT_psv.psv
    │   ├── Authority_Code_ADDRESS_CHANGE_TYPE_AUT_psv.psv
    │   ├── Authority_Code_ADDRESS_TYPE_AUT_psv.psv
    │   ├── Authority_Code_FLAT_TYPE_AUT_psv.psv
    │   ├── Authority_Code_GEOCODED_LEVEL_TYPE_AUT_psv.psv
    │   ├── Authority_Code_GEOCODE_RELIABILITY_AUT_psv.psv
    │   ├── Authority_Code_GEOCODE_TYPE_AUT_psv.psv
    │   ├── Authority_Code_LEVEL_TYPE_AUT_psv.psv
    │   ├── Authority_Code_LOCALITY_ALIAS_TYPE_AUT_psv.psv
    │   ├── Authority_Code_LOCALITY_CLASS_AUT_psv.psv
    │   ├── Authority_Code_MB_MATCH_CODE_AUT_psv.psv
    │   ├── Authority_Code_PS_JOIN_TYPE_AUT_psv.psv
    │   ├── Authority_Code_STREET_CLASS_AUT_psv.psv
    │   ├── Authority_Code_STREET_LOCALITY_ALIAS_TYPE_AUT_psv.psv
    │   ├── Authority_Code_STREET_SUFFIX_AUT_psv.psv
    │   └── Authority_Code_STREET_TYPE_AUT_psv.psv
    └── Standard
        ├── ACT_ADDRESS_ALIAS_psv.psv
        ├── ACT_ADDRESS_DEFAULT_GEOCODE_psv.psv
        ├── ACT_ADDRESS_DETAIL_psv.psv
        ├── ACT_ADDRESS_FEATURE_psv.psv
        ├── ACT_ADDRESS_MESH_BLOCK_2016_psv.psv
        ├── ACT_ADDRESS_MESH_BLOCK_2021_psv.psv
        ├── ACT_ADDRESS_SITE_GEOCODE_psv.psv
        ├── ACT_ADDRESS_SITE_psv.psv
        ├── ACT_LOCALITY_ALIAS_psv.psv
        ├── ACT_LOCALITY_NEIGHBOUR_psv.psv
        ├── ACT_LOCALITY_POINT_psv.psv
        ├── ACT_LOCALITY_psv.psv
        ├── ACT_MB_2016_psv.psv
        ├── ACT_MB_2021_psv.psv
        ├── ACT_PRIMARY_SECONDARY_psv.psv
        ├── ACT_STATE_psv.psv
        ├── ACT_STREET_LOCALITY_ALIAS_psv.psv
        ├── ACT_STREET_LOCALITY_POINT_psv.psv
        ├── ACT_STREET_LOCALITY_psv.psv
        ...

Create the SQLite Schema

Thankfully we don’t need to create this ourselves, you may have spotted above that the dataset actually includes an ANSI SQL schema which is compatible with SQLite3, even better Gerard has gone through this schema by hand and added all the appropriate constraints to all the tables. You can download that here.

Once we have that schema, we’ll then apply it to our new SQLite3 database gnaf.db.

$ sqlite3 gnaf.db < gnaf_schema.sql

Once created you’ll have something that looks like this (credit):

G-NAF ORM

Once last thing, the dataset also includes a stored query which makes it a little easier to lookup addresses in the database. The stored query supplied by the dataset doesn’t work right out of the box with SQLite as the OR REPLACE syntax is not supported. You can remove it or just download this.

Add it to the database:

$ sqlite3 gnaf.db < address_view.sql

Importing the G-NAF dataset into SQLite3

First we’ll import the Authority Code files into the database, these files contain constant definitions like the street types STREET, COURT, ROAD and many others.

$ find "gnaf/G-NAF MAY 2025/Authority Code" -name "*.psv" -print0 | \
while IFS= read -r -d '' line; do
  fileName=`basename -- "$line"`
  tableName=`echo "$fileName" | sed -n 's/Authority_Code_\(.*\)_psv.psv/\1/p'`

  (
    echo ".mode csv"
    echo ".separator \"|\""
    echo ".import \"$line\" $tableName --skip 1"
  ) | sqlite3 gnaf.db 
done

This will go through all of the PSV files in the Authority Code directory and generate SQLite3 compatible CSV import statements, feeding them directly to the database.

We’ll also do the same for the Standard directory, this holds the physical address data for all the Australian states and will take some time to complete.

$ find "gnaf/G-NAF MAY 2025/Standard" -name "*.psv" -print0 | \
while IFS= read -r -d '' line; do
  fileName=`basename -- "$line"`
  tableName=`echo "$fileName" | sed -n 's/^\(WA\|NT\|VIC\|NSW\|SA\|TAS\|OT\|ACT\|QLD\)_\(.*\)_psv.psv$/\2/p'`

  (
    echo ".mode csv"
    echo ".separator \"|\""
    echo ".import \"$line\" $tableName --skip 1"
  ) | sqlite3 gnaf.db 
done

Finally let’s do some cleanup of the database, we’ll convert empty values in the database tables to NULL values, this helps keep our queries clean. Grab this script and run it with:

sqlite3 gnaf.db < null_cleanup.sql

Once the command completes you’ll now have the G-NAF dataset imported into your SQLite3 database. At the moment it weighs in at around 8GB in size.

But there’s a problem… It’s slow, really slow, try this (credit):

$ sqlite3 -cmd ".timer on" gnaf.db <<SQL

  SELECT COUNT(*)
  FROM ADDRESS_VIEW
  WHERE STREET_LOCALITY_PID = 'VIC1930629'
SQL

Eventually the query completes and we see how long it took, in my case it took a minute to run…

Run Time: real 54.790 user 24.049264 sys 28.518008

That’s not going to cut it. Let’s create some index’s to speed up data retrieval, we’ll create indices on anything ending in _pid or _code. Run this command to generate the required SQL.

$ sqlite3 gnaf.db <<'EOF' > gnaf_indices.sql
    SELECT printf(
      'CREATE INDEX IF NOT EXISTS %s ON %s (%s);',
      LOWER(t.name) || '_' || LOWER(c.name),
      t.name,
      c.name
    )
    FROM sqlite_master t
    LEFT OUTER JOIN pragma_table_info(t.name) c
    WHERE t.type = 'table'
      AND (
        c.name LIKE '%_pid' ESCAPE '\' OR
        c.name LIKE '%_code' ESCAPE '\' OR
        c.name = 'code'
      );
EOF

We’ll then add these indices to the database. This will take some time to complete and double the databases size. As Peter mentioned in his blog post, not all of the indices are required but using the script above is simple.

$ sqlite3 gnaf.db < gnaf_indices.sql

If we run the same select query from before we see it now only takes 62 milliseconds to complete, much better than 54 seconds:

Run Time: real 0.062 user 0.014118 sys 0.021947

Full-Text Search with SQLite3’s FTS5 Extension

Now that we have the G-NAF dataset ingested into our SQLite3 database, let’s get full-text search working using the SQLite3 FTS5 extension.

From the FTS5 extension documentation:

FTS5 is an SQLite virtual table module that provides full-text search functionality to database applications. In their most elementary form, full-text search engines allow the user to efficiently search a large collection of documents for the subset that contain one or more instances of a search term.

Create the Full-Text Search Table

Lets define the FTS5 virtual table for our full-text search queries:

$ sqlite3 gnaf.db <<SQL
    CREATE VIRTUAL TABLE ADDRESS_FTS USING fts5(                                    
      id UNINDEXED,                                                                 
      display,                                                                      
      tokenize='unicode61',                                                         
    );
SQL

In this FTS5 virtual table we define an id column that won’t be indexed for full-text search. Instead, we will be using this column to store a reference to the ADDRESS_DETAIL_PID from the ADDRESS_VIEW stored query. This is so we can easily gather more data about an address when we need to, such as retriving the latitude and longitude of the address to pin on a map for example.

We also define a display column, this is what we will be conducting full-text searches on, as well as what we display. Another common approach is having a separate unindexed formatted column which stores a much prettier formatted address that is shown to users.

Finally we define our tokeniser, in this case unicode61.

FTS5 has numerous tokeniser implementations that are suitable for different search workloads, from the docs:

We don’t need to use the porter tokeniser as there is no need to reduce words to there base form, for example fishingfish since our vocab is small compared to other datasets such as Wikipedia.

trigram is pretty useless to us too as we don’t want to create token’s of three characters in length.

For our use-case which is searching physical addresses, the unicode61 tokeniser is the most appropriate for our needs, it tokenises on whitespace which is perfect for our address format we will be using in our display column, so lets define that, here’s a simple one:

<NUMBER> <STREET_NAME> <STREET_TYPE_CODE> <LOCALITY> <STATE> <POST_CODE>

Using the unicode61 tokeniser the following address:

442 KINGSFORD SMITH DRIVE HAMILTON QLD 4007

will get tokenised as the following list:

  1. 422
  2. KINGSFORD
  3. SMITH
  4. DRIVE
  5. HAMILTON
  6. QLD
  7. 4007

In my research this could be optimised, we actually shouldn’t split on each word, in the above example address the STREET_NAME is two words KINGSFORD SMITH. It would be more search efficient to merge the two words, so queries like kingsford s* quickly match to tokens like KINGSFORD SMITH or KINGSFORD STREET, instead of searching the entire table for tokens with KINGSFORD AND a token starting with a s.

To do this you would be required to create a custom SQLite3 FTS5 tokeniser, I may do a part 2.

Populating the Full-Text Search Table

Now we are ready to populate the table with addresses! To do this we need to create a query that formats all of the addresses according to our address format above. We will use this command to insert all of the formatted addresses into our full-text search table:

$ sqlite3 gnaf.db <<SQL
    INSERT INTO ADDRESS_FTS(id, display)
    SELECT
      ADDRESS_DETAIL_PID,
      TRIM(
        CASE
            WHEN FLAT_NUMBER IS NOT NULL AND LOT_NUMBER IS NOT NULL AND FLAT_NUMBER = LOT_NUMBER THEN
                COALESCE(FLAT_TYPE || ' ', '') || COALESCE(FLAT_NUMBER_PREFIX, '') || FLAT_NUMBER || COALESCE(FLAT_NUMBER_SUFFIX, '')
            WHEN FLAT_NUMBER IS NOT NULL AND LOT_NUMBER IS NOT NULL THEN
                COALESCE(FLAT_TYPE || ' ', '') || COALESCE(FLAT_NUMBER_PREFIX, '') || FLAT_NUMBER || COALESCE(FLAT_NUMBER_SUFFIX || ' ', '') || ' LOT ' || LOT_NUMBER || COALESCE(LOT_NUMBER_SUFFIX, '')
            WHEN FLAT_NUMBER IS NOT NULL THEN
                COALESCE(FLAT_TYPE || ' ', '') || COALESCE(FLAT_NUMBER_PREFIX, '') || FLAT_NUMBER || COALESCE(FLAT_NUMBER_SUFFIX, '')
            WHEN LOT_NUMBER IS NOT NULL AND NUMBER_FIRST IS NULL THEN
                'LOT ' || LOT_NUMBER || COALESCE(LOT_NUMBER_SUFFIX, '')
            ELSE
                ''
        END || ' ' ||
        COALESCE(NUMBER_FIRST_PREFIX, '') ||
        COALESCE(NUMBER_FIRST, '') ||
        COALESCE(NUMBER_FIRST_SUFFIX, '') ||
        CASE
            WHEN NUMBER_LAST IS NOT NULL THEN ' ' ||
                COALESCE(NUMBER_LAST_PREFIX, '') ||
                COALESCE(NUMBER_LAST, '') ||
                COALESCE(NUMBER_LAST_SUFFIX, '')
            ELSE ' '
        END ||

        COALESCE(STREET_NAME || ' ', '') ||
        COALESCE(STREET_TYPE_CODE || ' ', '') ||

        COALESCE(LOCALITY_NAME || ' ', '') ||
        COALESCE(STATE_ABBREVIATION || ' ', '') ||
        COALESCE(POSTCODE, '')
      ) AS display
    FROM ADDRESS_VIEW;
SQL

This query looks pretty complex, but all it is doing is formatting addresses and handling some formatting edge cases for addresses with FLAT_TYPES, FLAT_NUMBERS and LOT_NUMBERS.

Let’s try it out! Run the following query to get all addresses that match 442 KINGSFORD.

$ sqlite3 -cmd ".timer on" gnaf.db <<SQL
    SELECT id, display
    FROM ADDRESS_FTS 
    WHERE display MATCH '442 KINGSFORD*'
    LIMIT 5;
SQL

Nice we get the expected results in about 11 milliseconds on my hardware!

Run Time: real 0.011 user 0.007617 sys 0.002808

What happens if we keep typing the address, such that we query for 442 KINGSFORD S*?

Run Time: real 1.049 user 0.960677 sys 0.086235

Woah it takes about a second to complete, what’s going on here?

FTS5 can quickly lookup the complete tokens 442 and KINGSFORD easily as they are automatically indexed by FTS5, this makes querying complete tokens fast. However when we search for the prefix token S*, FTS5 has to do a range scan on all tokens that match S and greater, such as SMITH, STREET and similar complete tokens.

So what can we do to speed this up? Luckily FTS5 has the ability to define prefix indices, which speed up queries for prefix tokens. For example, optimising a query for prefix token S* requires a prefix index of just one character index. However I also want token prefixes like ST* and STR* to be fast, so we’ll create a prefix index of three-character prefixes.

Let’s add the prefix configuration option to our FTS5 virtual table, we’ll need to re-create the table, so lets drop it and create it again with the new prefix configuration option included:

sqlite3 gnaf.db <<SQL
    DROP TABLE IF EXISTS ADDRESS_FTS;
    CREATE VIRTUAL TABLE ADDRESS_FTS USING fts5(                                    
      id UNINDEXED,                                                                 
      display,                                                                      
      tokenize='unicode61',                                                         
      prefix='1 2 3'
    );
SQL

We’ll then run the same population query from earlier. Then run the select query again:

sqlite3 -cmd ".timer on" gnaf.db <<SQL
    SELECT id, display
    FROM ADDRESS_FTS 
    WHERE display MATCH '442 KINGSFORD S*'
    LIMIT 5;
SQL
Run Time: real 0.028 user 0.014243 sys 0.002874

Nice! Our database is working pretty well and we could stop here, but we won’t.

Ranking Relavent Search Results

On testing this database I found that the ranking for some search results weren’t the best, luckily again, FTS5 has us covered with a hidden built in rank column. We can simply add ORDER BY rank to our search query to have ranked search results sorted by relevance.

sqlite3 -cmd ".timer on" gnaf.db <<SQL
    SELECT id, display
    FROM ADDRESS_FTS 
    WHERE display MATCH '442 KINGSFORD S*'
    ORDER BY rank
    LIMIT 5;
SQL

We do take a little speed penalty for this however, but not by much. For me, it’s a trade-off I’m willing to make for better search results.

FTS5 also has the bm25 auxillary function which is a much more sophisticated ranking method producing more accurate ranked results, with a bigger speed penalty of course.

Wrapping up

Hopefully you can see how easy it is to get full-text search with simple tools. Full-text search engines like Elastic Search and friends have their place don’t get me wrong, but do you really need them for simple use-cases like this? I’ve only scratched the surface.

Reducing Database Size

At the moment our gnaf.db is sitting at around 20GB which is pretty big. If you just want to transform physical addresses to geographical coordinates or similar, you can simply add latitude and longitude columns to the FTS5 virtual table, and insert the latitude and longitude from the ADDRESS_VIEW when populating the virtual table. We can then copy the virtual table data to a new database (with the same virtual table schema) leaving us with a database that can solely transform physical addresses to geographical coordinates. Doing this leaves us with a database around 3GB in size.

Automatically Generate the Full-Text Database

It goes without saying that all of this can be put into a shell script and automated. I may create a GitHub repo for automating this process for the GNAF dataset.