Notes on extracting data from Tower Hamlets Council

In December 2013 I started to have a look at some of the financial data on the London Borough of Tower Hamlets website. 10 months later I am still digging.

I am a little bit wiser, and hopefully other residents of my part of London are too. But the reality of access to Open Data in the UK is that it is all down to where you live. You might be lucky, you might not be.

This is the first in a series of notes on extracting data from Tower Hamlets Council that boils down to testing the validity of the statement below from the Council website.

“One of the benefits to the community in publishing this data is that it means that our spending will come under greater scrutiny by local people. We hope it will inform people about what we do, and encourage people to challenge how we spend money. It is your money and we welcome comments.” Tower Hamlets Council

I can safely say I have not been encouraged in my challenges. And some of my comments have not been welcome either.

On to the geek stuff!

I should have been sensible and blogged about the technical aspects as I went along but I was too busy writing about the results on my hyperlocal site, Love Wapping.

And occasionally I became part of the story surrounding my investigations which was no fun at all.

If you want the back story on all this data scrubbing then check out Love Wapping, on here I am just going to describe my own experience of trying to make sense of how one Council in the UK spends its money.

Oops, I meant of course spending our money.

Before I lose my grip on reality..

Just in case I lose the will to live and do not write a series of eloquent blog posts here is a quick and dirty overview of what I have been up to.

The work has been divided into two parts.

1. Making sense of the Tower Hamlets Council Payments to Suppliers over £500 data that sometimes seemed to be at odds with data released under Freedom of Information (FOI) requests.

2. Examining how grants have been distributed by the Council across the Borough. This issue has been the subject of numerous other investigations by others.

1. Payments to Suppliers over £500

In retrospect getting the Payments to Suppliers over £500 data from the Council’s website and then republishing it under the Open Data license was not that complex. Just lots of leg work.

The main problem was that the data was ‘published’ in either Excel or CSV form on the Council site. This meant getting just over 200 different files down from the site then trying to make sense of them.

A lot of this ended up being basic data scrubbing and reconciliation work.

None of this scraper nonsense for me! I should be so lucky. Once the files were safely on my hard drive then each needed to be inspected because there was little consistency in the format. Payments from one department in one CSV file would be in a different logical format to that in another CSV file.

Over time a pattern to the inconsistencies began to emerge and then all I needed to do was convert the format of the files into a consistent structure. Again lots of manual cutting and pasting although I did write some Python scripts to do some of the basic merging of multiple files.

Once i had the data consistent I then had to put it somewhere. In retrospect the bin would have been a good place but being stupid i went down the MySQL route.

I have tinkered with RDBMS over the years but never really taken the plunge but now I was committed and after purchase of more than a few MySQL books and lots of head scratching I had a reasonable system running that would enable me to run reports and finally generate Council data in a way that ordinary residents could understand. So I did.

The learning curve here was steep but still fun – in a geeky masochistic sort  of way. If you are a traditional journalist considering switching to data journalism having a good knowledge of SQL is essential. If this prospect does not appeal then data journalism is not for you. Full stop.

You can only understand the story if you understand the data. ‘Cos that is what the story is.

Sure you can get a lot down with Excel and mastery of this is essential too, but it is only once you have started to interrogate data with the power of a relational database that you can do the real work.

You also need to be able to normalise your data properly. Doing this at the start will prevent many hours if not days of heartbreak down the line (see below).

Apart from publishing the data I also created a Sankey diagram as it was the only way I could find to illustrate the incredibly complex journey that payments took within the Council.

2. Tower Hamlets Grants Payments

After several months of data scrubbing and a Christmas Day and News Years Eve hacking Python any sensible person would have quit while they were ahead. I am not sensible. I might appear very sensible but I am not. So I stuck with it.

The techniques I used for this work where the same – data scrubbing, Excel frolics, MySQL delight – but the big BIG issue here was normalisation of the data. Oh and finding the data in the first place.

(I should hang my head in shame here and admit that it was only on this part of the project that I really began to understand the benefits of Open Refine. Wannabe data journos also note – you need to know and love this tool.)

At the time of writing (October 2014) there are numerous investigations, enquiries and court cases either under way or pending  that relate to Tower Hamlets Council and so I can say nothing about the background to this work apart from what I have already published on Love Wapping.

What I can say is that the subject of  Tower Hamlets Grants Payments has already been the subject of numerous investigations in the national press and a BBC Panorama documentary.

Anyway the only reason I had the source documents (PDF files as ever) was that excellent journalists like Andrew Gilligan had already found the files on the Tower Hamlets site.

So all I had to do was tip my hat to Mr G. and do the data thing.

The original work on Payments to Suppliers was a grind but at least there was some consistency to the original data.

The data in the PDF files containing the grants distribution data was all over the place.  The PDFs were conversions from Excel, no surprise or problem there. But there was no consistency either across the different documents or even within a document.

I will return to this issue in detail in a later post but at the moment it is late and i want my dinner.

Normalisation of the data, especially organisation names and addresses was very problematic.  It was essential of course that the authenticity of the original data was preserved. But at the same time a lot of the original data was wrong. I knew it was wrong partly because it was common sense because I knew the addresses in question.

Tricky. This is not the sort of work that can be given to a Python script to sort out. Too much fuzziness.

Eventually I found the art was to compare all possible data sources for an address and or organisation name and work from there.

These data sources were:

  • Original grant application (dubious at times)
  • Organisation’s website (often lacking basic address information)
  • Opencorporates (fantastic wonderful site!)
  • Openlylocal (another fantastic wonderful site!)
  • Charity Commission (often absolutely useless)

I also made very good use of DuckDuckGo rather than Google.

And on more than one occasions I went for a walk up the road and checked the address myself. More leg work. Of which more later.