Parsing the Price Property Register
Part 01: Eircode, County and Dublin Postcode

by Brian Lynch

toa-heftiba-nrSzRUWqmoI-unsplash

Photo by Toa Heftiba on Unsplash

Introduction

The Residential Property Price Register is produced by the Property Services Regulatory Authority (PSRA) pursuant to section 86 of the Property Services (Regulation) Act 2011. It includes Date of Sale, Price and Address of all residential properties purchased in Ireland since the 1st January 2010, as declared to the Revenue Commissioners for stamp duty purposes.

Anyone who has attempted to work with the data from Ireland’s Price Property Register will know that the dataset can be a bit of a mess.

We will attempt to tidy it up by extracting structured data from the semi-structured address string.

By the end of this blog we will have improved almost 50% of the rows in the dataset.

Currently the dataset has approximatly 450,000 rows. The columns in the dataset are:

The ones we are interested in are Address, Postal Code and County.

In many rows in the dataset, the Address can contain County information Postal Code information...or both.

Aim

Our aim is to remove this information from the address string and update the Postcode or County information where necessary.

Examples
AddressCleaned Data
1 MEADOW AVENUE, DUNDRUM, DUBLIN 14address = 1 Meadow Avenue, Dundrum
postcode=Dublin 14
44 ALLEN PARK ROAD, STILLORGAN, COUNTY DUBLINaddress = 44 Allen Park Road, Stillorgan
county = Dublin
1 INFIRMARY RD, DUBLIN 7, DUBLINaddress = 1 Infirmary rd
postcode = Dublin 7
county = Dublin

Tools for the job

Method

Data preperation

PPR data is attained as a csv file. Pandas will be used to load the data, modify the column names and create a uuid for each row.

This will be exported to a sqlite database.

Address cleaning

First we will normalise the eircode, county and postal code substrings in each address. Each will be normalised to an easily extractable pattern.

Testing will be used to create the normalisation and extraction functions

regex will play a major role in creating these functions

Extraction will be done in order of structure

  1. Eircode: Rigid structure of a key followed by 4 alphanumeric characters
  2. Postcode: “Dublin” or “D” followed by a postcode number
  3. County: Different spellings of “county”, followed by County Name

Data Preperation

The PPR dataset can be downloaded here: https://propertypriceregister.ie/Website/NPSRA/pprweb.nsf/page/ppr-home-en

We will use pandas to prepare the data and save it to a sqlite database

Now we have the PPR data in a sqlite database.

Create tempoary data table

We will create a temporary table to iterate over while we are creating the functions to normalise and extract data.

Dummy columns are added to store our extracted data.

To do this, we will again use pandas.

Below is a view of the temp table created.

image-20210425110822447

Eircode Extraction

eircode

Eircodes are quite structured, so there shouldn’t be a need to normalise the address before extraction.

There are 139 eircode routing keys, so we will look for one of these plus 4 succeeding alphanumeric characters.

Eircode routing keys can be found here: https://www.autoaddress.ie/blog/autoaddressblog/2016/09/21/eircode-routing-keys

Extract an Eircode from an address

First we will create some tests.

eircode extraction tests

Now we will write our funciton to pass the tests.

eircode extraction function

Eircode can now be successfully extracted from an address

Remove an Eircode from an address

Again, we will write test first.

eircode removal tests

And now, we will write the function so all the tests pass. This workflow will be common throughout.

eircode removal function

Eircode can now be successfully removed from an address

Extract Eircode from address

Now we will loop over the database and extract/remove eircode from address containing them.

Now after opening the database, we can see we have sucessfully extracted 35 eircodes.

Considering Eircodes were introduced in July 2015, which accounts for over half of the rows in the database, this is a disappointing return.

Its is approximatly 0.0001% of address.

image-20210426105039237

Dublin Postcode Extraction

Leeson_St_-_Hatch_St_nameplates

Dublin postcodes are next in line in terms of structure. They are typically written as “Dublin” followed by the postcode number. This will not be the case for all addresses, so we will need to normalise the address before attempting extraction.

There are 22 Dublin postcodes, and a list can be found here: https://en.wikipedia.org/wiki/List_of_Dublin_postal_districts

Normalising the address for Dublin Postcodes

From scanning through the database, we have extracted some tests for our normalisation function. We will attempt to normalise each postcode to the format of dublin_{postcode_number}

Postcode normalisation tests
Postcode normalisation function

We will run this function through or database to see how many address were normalised and search for outliers that aren’t represented in our tests.

 

Look for outliers

DB Browser for Sqlite allows for regex searching which is extremely useful for this.

Example: Search for misspellings of dublin with a number after (regex: /d[ublin ]*[\dw]+$/)

image-20210427144902431

Here are some examples of rows that were not picked up by out original tests.

addresserror
1 airpark square, rathfarnham, dublin16no space between dublin and number
111 marlborough road, donnybrook, dublin 4double space between dublin and number
11 the alders, carrington, santry dubllin 9misspelling of dublin

We will add all outliers found to our tests

Postcode normalisation tests

Now, we will update the postcode normalise function to make our new tests pass.

Postcode normalisation function

Here is a link to the regex used for Dublin misspellings: https://regex101.com/r/GtTYKh/2

Now we have a robust function for normalising a Dublin Postcode in an address.

Extracting Dublin Postcode from an address

Dublin postcodes now have a consistent pattern in the address string. This makes it easy to extract them.

Postcode extraction tests
Postcode extraction function

Now we can successfully extract a Dublin Postcode from an address

Remove Dublin Postcode from an address

Postcode removal tests
Postcode removal function

Update database

We will loop over the database and extract/remove Dublin Postcodes from address containing them.

Looking at the results in the database, we can see that 69,687 Dublin Postcodes have been extracted.

Considering sales in Dublin account for about 150,000 rows in the database, this is a great return.

Screenshot 2021-05-04 at 10.37.11

County Extraction

shutterstock-32309722-1

Finally, we will attempt to extract the county name from the end on an address. This will be more complicated than the previous two extractions.

There are 26 counties in the database. Initially we will look for one of these these surrounded by a longhand/shorthand naming of county.

Normalise function

1 Some Place, co cork => 1 Some Place, co_cork

From scanning through the database, we have extracted some tests for our normalisation function. We are attempting to normalise county to co_{county_name}

County normalisation tests
County normalisation function

Look for outliers

We will do some investigation into the database to find outliers. Again, we will use DB Browser for Sqlite regex search for this.

Search for County in address (regex: / (county|co\.?) [a-z]+$/)

image-20210426121830279

Seach for county names at the end of the address preceeded by a space

image-20210426122635022

Between these two initial searches, we have found almost 1,000 outliers.

After futher investigation, there are numerous mistakes that we can correct. Here are a few that we missed:

addresserror
8 ash park heath, lucan, cco. dublincounty prefix misspelling
217 galloping green, stillorgan road, cunty dublincounty prefix misspelling
muckalee, ballyfoyle, cuonty kilkennycounty prefix misspelling
ballyrahan, tinahely, coounty wicklowcounty prefix misspelling
st heliers, coast rd, fountainstownco corkcounty prefix missing space
cloghoolia, sixmilebridge, county clqarecounty name misspelling
ranafast, annagry, county donegakcounty name misspelling
donore, multifarnham, west meathcounty name misspelling
ballymacrown, baltimore, west cork'west cork' not needed in address as baltimore is a unique town
liskillea, waterfall, near cork'near' word unnecessary
springvale ballymaw, waterfall, near cork city'near' and 'city' words unnecessary

If we updated our normalisation function to account for all these errors, it would potentially become difficult to read.

What we will do is split normalisation into three functions:

Normalise County Spelling

County spelling normalisation tests

Here is a subset of the tests written.

County spelling normalisation function

There are several ways to normalise incorrect spelling. I have chosen to use regex.

Normalise County Prefix Spelling

County prefix spelling normalisation tests

County prefix spelling normalisation function

Normalise County

Normalise County tests

Normalise County function

Update database

Now we will loop over the database and extract/remove County names from address containing them.

Looking at the database, we can see that 163,637 County names have been extracted.

This is approx 34% of rows in our database affected.

Screenshot 2021-05-04 at 11.20.26

Combine Extracted Data with the origional Dataset

We have extracted either an Eircode, Dublin Postcode or County from 229,338 rows in the database.

We will now combine this data with the original dataset. In some cases, either the extracted County or Postcode won’t match what is in the County or Postcode column. In this case we will combine the two and comma-separate them.

Also, if we have extracted a Dublin Postcode from an address where the County is not listed as Dublin, we will add Dublin to the county column.

Theses rows that contain comma-separated values in the County or Dublin Postcode columns will have to be manually updated.

Get Origional Dataset as a pandas dataframe

We will add the updated address to its own column(address_mod).

It is useful to have the original when manually updating conflicts.

Combine Extracted Data

Eircode
Postcode
County

Export Data to Sqlite DB

Conclusion

We set out to extract Eircode, Postcode and County from the address string in the Property Price Register. To do this we normalised the information in the string to add structure where it was missing.

Through a TDD approach, we were able to successfully extract:

35 Eircodes

69,687 Dublin Postcodes

163,637 Counties

This solution is not without its failings. There are conflicts due to inaccurate input of address, postcode or county in the original dataset.

The number of conflicts are:

2,068 Dublin Postcodes

1,038 Counties

image-20210504150250705

There are ways to reduce these conflicts programatically, but that's for another blog.

All the code for this blog is available to view on Github

This blog can also be read on Medium