Australian census data is rich, but building a clean time series from it is not straightforward.

At first glance, the task looked simple: take census data for 2011, 2016, and 2021, line the years up, and compare trends by suburb or LGA.

In practice, that approach breaks quickly. Boundaries shift, codes change, and some geographies split, merge, or get renamed between census editions.

I wanted to build a dataset that could support city-level analysis — starting with Melbourne, my home city, and then extending to other Australian cities — without hand-waving away the geography problems.

When I first tried to match historical suburb data by name, I lost roughly half the population. That forced me to rebuild the process properly using Python, TableBuilder extracts, allocation files, and ABS correspondence files.

The Problem

The goal was to build a consistent time series for demographic variables like language spoken at home, English language proficiency, and later birthplace and religion. And to do it at two useful geographic levels: LGA for council and regional analysis, and suburb for more granular local work.

The challenge was that these geographies are not stable across census editions. LGAs are administrative boundaries that can merge, split, or shift. Suburbs and localities change over time. Even the supporting statistical geographies underneath them, like SA1s, are edition-specific.

A direct comparison of 2011, 2016, and 2021 files would produce misleading results.

Why This Was Harder Than It Looked

There were three issues to solve.

Geography changes over time. A 2021 suburb or LGA does not always have a one-to-one match with its 2011 or 2016 version. Some areas split into multiple successor areas. Others merge into a larger boundary.

Parallel geographies do not nest cleanly. LGAs and suburbs are both built from lower-level geographies, but they do not nest neatly inside each other. That makes simple rollups unreliable unless you go far enough down the geographic hierarchy.

Name matching is not enough. A name-based join looks convenient, but it introduces major gaps. In early testing, matching 2011 suburbs to 2021 suburbs by name left roughly half the Australian population unmatched, because the suburb structure had changed too much in the decade between.

The Approach

The key decision was to work at SA1 level within each census year, and then harmonise everything onto 2021 boundaries.

Step 1: Pull the source data at SA1 level

For each census year, I pulled the target variable from ABS TableBuilder at SA1 level. That created a consistent input structure: one row set per year, each containing SA1 codes and demographic counts.

Step 2: Build year-specific SA1 to geography lookups

Within each census year, I used ABS allocation files to map mesh blocks to both SA1 and the target geography (LGA or suburb). From there, I used Python to join the allocation files at mesh block level, derive year-specific SA1-to-LGA and SA1-to-suburb mappings, and turn those raw geography files into reusable lookup tables for aggregation.
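
As a minimal sketch of that join, using toy mesh block allocation tables (the column names and codes here are illustrative, not the ABS originals):

```python
import pandas as pd

# Hypothetical allocation files for a single census year:
# one maps mesh blocks to SA1s, the other maps mesh blocks to LGAs.
mb_to_sa1 = pd.DataFrame({
    "mb_code": ["MB001", "MB002", "MB003"],
    "sa1_code": ["SA1_A", "SA1_A", "SA1_B"],
})
mb_to_lga = pd.DataFrame({
    "mb_code": ["MB001", "MB002", "MB003"],
    "lga_code": ["LGA_X", "LGA_X", "LGA_Y"],
})

# Join at mesh block level, then collapse to a distinct SA1 -> LGA lookup.
lookup = (
    mb_to_sa1.merge(mb_to_lga, on="mb_code")
    .drop_duplicates(["sa1_code", "lga_code"])
    [["sa1_code", "lga_code"]]
)
print(lookup)
```

The same pattern yields the SA1-to-suburb lookup by swapping in the suburb allocation file.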

This meant I was not trying to compare SA1s across time. I only used SA1s as a clean stepping stone inside each census edition.

Step 3: Aggregate each year separately

With the year-specific lookups in place, I aggregated the SA1-level counts up to the geographic level I needed. This produced clean yearly totals for 2011, 2016, and 2021 LGAs and suburbs.
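
The aggregation itself is a straightforward merge-and-groupby. A sketch with toy data (column names are illustrative):

```python
import pandas as pd

# Illustrative SA1-level counts for one census year and one variable.
sa1_counts = pd.DataFrame({
    "sa1_code": ["SA1_A", "SA1_B", "SA1_C"],
    "count": [120, 80, 50],
})

# Year-specific SA1 -> LGA lookup derived from the allocation files.
sa1_to_lga = pd.DataFrame({
    "sa1_code": ["SA1_A", "SA1_B", "SA1_C"],
    "lga_code": ["LGA_X", "LGA_X", "LGA_Y"],
})

# Attach each SA1's parent geography, then sum counts per geography.
lga_totals = (
    sa1_counts.merge(sa1_to_lga, on="sa1_code")
    .groupby("lga_code", as_index=False)["count"].sum()
)
print(lga_totals)  # LGA_X: 200, LGA_Y: 50
```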

At this stage, each year was internally correct, but still not directly comparable across time.

Step 4: Harmonise onto 2021 boundaries

This was the turning point.

To create a usable time series, I rebased older data onto 2021 geography definitions using ABS correspondence files. These map 2016 SA1s to 2021 SA1s, and 2011 SA1s to 2016 SA1s (which I chained forward to 2021). The correspondences contain population-weighted ratios, so older counts could be redistributed onto 2021 SA1 boundaries before aggregating back up to 2021 suburbs and LGAs.
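
A sketch of the rebasing step, with a made-up correspondence table; the real ABS files use their own column names and contain many more rows:

```python
import pandas as pd

# Illustrative 2011 counts on 2011 SA1 boundaries.
counts_2011 = pd.DataFrame({
    "sa1_2011": ["A", "B"],
    "count": [100, 60],
})

# Hypothetical correspondence with population-weighted ratios:
# SA1 "A" splits 70/30 across two 2016 SA1s, "B" maps whole.
corr_2011_2016 = pd.DataFrame({
    "sa1_2011": ["A", "A", "B"],
    "sa1_2016": ["P", "Q", "Q"],
    "ratio": [0.7, 0.3, 1.0],
})

def rebase(counts, corr, from_col, to_col):
    """Redistribute counts onto newer boundaries using weighted ratios."""
    out = counts.merge(corr, on=from_col)
    out["count"] = out["count"] * out["ratio"]
    return out.groupby(to_col, as_index=False)["count"].sum()

counts_on_2016 = rebase(counts_2011, corr_2011_2016, "sa1_2011", "sa1_2016")
# Chaining forward is one more call with a 2016 -> 2021 correspondence.
print(counts_on_2016)  # P: 70.0, Q: 90.0
```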

A simple suburb name match had looked acceptable at first, but it dropped a huge share of 2011 records because the suburb structure had changed too much. Rebuilding the link through SA1 correspondences fixed the problem and restored near-complete coverage.

Step 5: Clean exclusions and edge cases

A few data issues needed handling before the final outputs made sense.

I excluded categories like "No usual address" and "Unincorporated areas" where appropriate. These buckets distort totals and are not useful for suburb or council-level analysis.

I also had a small number of SA1s that straddled multiple target geographies. For those edge cases, I used a practical majority-rule allocation rather than overcomplicating the model for a very small share of records.
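
The majority rule can be sketched as follows, assuming mesh block rows that carry a population weight (all names and numbers here are illustrative):

```python
import pandas as pd

# Hypothetical mesh-block rows for one SA1 that straddles two LGAs.
mb = pd.DataFrame({
    "sa1_code": ["SA1_Z", "SA1_Z", "SA1_Z"],
    "lga_code": ["LGA_X", "LGA_X", "LGA_Y"],
    "pop": [40, 35, 25],
})

# Majority rule: assign each SA1 to the geography holding most of its
# population (LGA_X holds 75 of 100 here, so SA1_Z goes to LGA_X).
majority = (
    mb.groupby(["sa1_code", "lga_code"], as_index=False)["pop"].sum()
    .sort_values("pop", ascending=False)
    .drop_duplicates("sa1_code")
    [["sa1_code", "lga_code"]]
)
print(majority)
```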

Projection to 2026

Once the 2011, 2016, and 2021 series were aligned to 2021 boundaries, I added a 2026 projection layer.

With only three historical points per geography and category combination, the projection method needed to be simple and transparent. I tested compound and log-style growth assumptions, but linear projection was the most sensible choice:

  • easier to explain and audit
  • less likely to produce runaway projections for rapidly growing small groups
  • better aligned with broader ABS population expectations
  • a more conservative extension of the underlying census trend
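
For a single geography and category combination, the linear projection is just a least-squares line through the three harmonised points, evaluated at 2026. A sketch with made-up counts:

```python
import numpy as np

# Three harmonised observations for one geography/category combination.
years = np.array([2011, 2016, 2021])
counts = np.array([500, 650, 800])

# Centre the years for numerical stability, fit a degree-1 polynomial,
# then evaluate it five years past the last census.
x = years - years[0]                      # 0, 5, 10
slope, intercept = np.polyfit(x, counts, 1)
projected_2026 = slope * (2026 - years[0]) + intercept
print(round(projected_2026))  # 950
```

With perfectly linear toy data the fit simply extends the constant step; on real data it averages out the two observed changes.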

The final dataset includes 2011 actual, 2016 actual, 2021 actual, and 2026 projected.

What the Final Dataset Supports

The result is a harmonised demographic dataset that can be analysed consistently at both LGA and suburb level. That means I can use one consistent geographic base to look at things like changing language patterns across Melbourne suburbs, English proficiency trends by council area, comparisons between inner, middle, and outer suburbs, or demographic shifts across Australian cities over time.

This methodology piece is the foundation for those later analyses.

The hardest part of time-series census analysis is not building the chart. It is making sure the geography underneath the chart is genuinely comparable over time.