
Without identifiers the goals of Open Data are unachievable

By pure coincidence, the day after posting my previous entry, I saw an interesting tweet from the nice ODI people and the equally nice Thomson Reuters people about linking Open Data with identifiers.

The ODI and Thomson Reuters have written a white paper about this incredibly important issue, which you can find here: Open Data Institute and Thomson Reuters, 2014, Creating Value with Identifiers in an Open Data World, retrieved from thomsonreuters.com/site/data-identifiers/

As Dave Weller, Chief Enterprise Architect for Thomson Reuters, points out in this blog post, the human mind links facts to other facts to put the world in context.

If you know that Fact A (the person in front of you with an axe raised) is your Fact B (Uncle Joe), and you know Fact C (you need logs chopped for the fire), all is well. If Fact A (the person in front of you with an axe raised) is Fact B (a complete stranger) who has just, Fact C, broken into your house, circumstances may not be so peachy. It’s basic logic.

In Open Data Digging Around For Interesting Stuff Land it is often not easy to know which facts are which, and consequently how valid any assumptions based on those facts are. Your logic chain may be faulty or just plain broken.

Is Joe Bloggs Acme Widget Limited of 23 Acacia Avenue owned by the same Joe Bloggs who also owns Joe Bloggs Charity Widgets for Africa at 24 Acacia Avenue? The only way to know for certain is to have access to data points that are unique identifiers, so that assumptions rest on facts, not guesses.
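To make that concrete, here is a toy Python sketch (all names and numbers invented) of why a registration number settles the question where a name alone cannot:

```python
# Toy illustration with invented names and numbers: two records that look
# identical by name are distinct once a registration number is attached.
grants = [
    {"beneficiary": "Joe Bloggs Widgets", "amount": 5000},
    {"beneficiary": "Joe Bloggs Widgets", "amount": 8000},
]

companies = {
    "01234567": "Joe Bloggs Widgets",   # the limited company
    "CH987654": "Joe Bloggs Widgets",   # the charity of the same name
}

# Matching on name alone cannot tell these apart:
matches = [no for no, name in companies.items()
           if name == grants[0]["beneficiary"]]
print(matches)  # ['01234567', 'CH987654'] -- ambiguous

# With a registration number on the grant record, the join is exact:
grants[0]["reg_no"] = "01234567"
print(companies[grants[0]["reg_no"]])  # unambiguous
```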

For the last year I have encountered this issue on a daily basis while undertaking data journalism work relating to payments by Tower Hamlets Council.

Proof it is worth writing to your MP

At around the same time an update to the Code of Recommended Practice for Local Authorities on Data Transparency was out for consultation, so I copied my views to Jim Fitzpatrick, my local MP. Text below.

Email to Jim Fitzpatrick MP 15th January 2014

To: Transparencycode@communities.gsi.gov.uk

Re: Code of Recommended Practice for Local Authorities on Data Transparency

“I work in new media and also run a hyperlocal site, Love Wapping, in Wapping, Tower Hamlets. For various reasons I have spent the last two months working on a data journalism project relating to voluntary sector grant allocations by Tower Hamlets Council. This research is completely dependent on being able to identify the recipients of grants with reasonable accuracy. 100% would be nice but I live in the real world.

As a result of spending weeks trying to reconcile what Tower Hamlets Council claims to have given in grants against the organisations that are supposed to be the recipients of those grants, I know from experience that there is a gaping hole in the legislation that needs to be fixed. Here it is:

Appendix A p43-44 Grants to voluntary, community and social enterprise organisations

Information Title

The proposal is that ‘information which must be published’ includes (among other things) the beneficiary of the grant. Well, fine. But – and this is the flaw – information recommended for publication (my emphasis) includes providers’ registration numbers where the provider is from the voluntary or community sector.

‘Recommended’ is wrong. And this is why.

I have spent the last month working through the different organisations that Tower Hamlets Council has made grants to, and I know that if grants or any other monies are given to organisations and the details of those organisations do not include the organisation’s registration number (Company, Charity, Mutual, etc.) it is impossible to accurately map a grant to an organisation.

Additionally I have found at least one instance where two organisations with exactly the same name operate in the same part of Tower Hamlets doing similar work. And without the registration number of the beneficiary of the grant it’s a little tricky to work out which one got the cash.

So all I am saying is that the information which must be published has to include the provider’s registration number. Without this requirement the legislation will be toothless.

Many charities have a corporate identity as well, so without the registration number of the organisation that the funds were given to how is it possible to work out where the money has gone?”

Jim forwarded my views to Brandon Lewis MP, Minister for Local Authorities at the Department for Communities and Local Government (DCLG).

I have to admit that until writing this blog post I had not checked the resulting update to the Code of Recommended Practice for Local Authorities on Data Transparency. Sorry!

So I thought I should. And I was delighted to find this on page 15, relating to grants to voluntary, community and social enterprise organisations.

33. For each identified grant, the following information must be published as a minimum:

  • date the grant was awarded
  • time period for which the grant has been given
  • local authority department which awarded the grant
  • beneficiary
  • beneficiary’s registration number 
  • summary of the purpose of the grant, and
  • amount

Woo hoo! Life just got a lot easier! Thanks Jim and Brandon! Of course this change no doubt had nothing whatsoever to do with my input, but I don’t care, it’s there.

Spot the difference

Now it is possible to tell the difference between a charity that gets a grant and a company with the same name that gets a grant. What fun!

But this is no-brainer stuff. The ODI / Thomson Reuters work promises a much more powerful system, another step towards a semantic web. However, it ain’t going to be easy.

The white paper addresses eight challenges for Open Data that need to be overcome for an open system of linked identifiers to work.

  1. Data is ungrounded
  2. Lack of reconciliation options
  3. Lack of identifier scheme documentation
  4. Proprietary identifier schemes
  5. Rationalising multiple identities
  6. Inability to resolve identifiers
  7. Fragile identifiers
  8. Identifier recycling and evolution

The good news is that, as Thomson Reuters is a big beastie in Data Land, they have done lots of interesting work on this.

Calais? Mon dieu!

Part of this work is something called OpenCalais. This is a web service that automatically creates rich semantic metadata for submitted content.

[Image: OpenCalais schematic]

Oh yes. Me likey! And to be honest what I really like about this is that named entities include people. Now that could be very powerful indeed.

The OpenCalais site is a bit dull and unloved, but there is an OpenCalais WordPress plugin (not updated for a year and well buggy on current WP installs) and a Drupal module too.
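For the terminally curious, here is a rough sketch of calling the service directly over HTTP. Fair warning: the endpoint and header names are assumptions based on the OpenCalais REST documentation of the time, so check the current docs before trusting any of it, and YOUR_API_KEY is obviously a placeholder.

```python
# A rough sketch of calling OpenCalais over HTTP. Endpoint and header
# names are assumptions from the REST docs -- verify before use.
import requests

CALAIS_URL = "https://api.opencalais.com/tag/rs/enrich"  # assumed endpoint

text = "Tower Hamlets Council awarded a grant to Joe Bloggs Widgets."

resp = requests.post(
    CALAIS_URL,
    data=text.encode("utf-8"),
    headers={
        "x-calais-licenseID": "YOUR_API_KEY",  # assumed header name
        "Content-Type": "text/raw",
        "Accept": "application/json",
    },
)
resp.raise_for_status()

# The response is RDF-ish JSON keyed by entity URIs: people, companies,
# places and so on, each with a type and a name.
for uri, entity in resp.json().items():
    if isinstance(entity, dict) and entity.get("name"):
        print(entity.get("_type"), "->", entity["name"])
```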

The bottom line with linked identifiers is that you need legislation to force bodies like local authorities to undertake even the most basic ID tagging, organisations like the ODI and Thomson Reuters to take the lead, some decent software, and Open Data enthusiasts to see what they can do.

Because without identifiers the goals of Open Data are unachievable. Simple as that. We might as well all go home and have a nice cup of tea instead.


Notes on extracting data from Tower Hamlets Council

In December 2013 I started to have a look at some of the financial data on the London Borough of Tower Hamlets website. Ten months later I am still digging.

I am a little bit wiser, and hopefully other residents of my part of London are too. But the reality of access to Open Data in the UK is that it is all down to where you live. You might be lucky, you might not be.

This is the first in a series of notes on extracting data from Tower Hamlets Council that boils down to testing the validity of the statement below from the Council website.

“One of the benefits to the community in publishing this data is that it means that our spending will come under greater scrutiny by local people. We hope it will inform people about what we do, and encourage people to challenge how we spend money. It is your money and we welcome comments.” Tower Hamlets Council

I can safely say I have not been encouraged in my challenges. And some of my comments have not been welcome either.

On to the geek stuff!

I should have been sensible and blogged about the technical aspects as I went along but I was too busy writing about the results on my hyperlocal site, Love Wapping.

And occasionally I became part of the story surrounding my investigations which was no fun at all.

If you want the back story on all this data scrubbing then check out Love Wapping; here I am just going to describe my own experience of trying to make sense of how one council in the UK spends its money.

Oops, I meant of course spending our money.

Before I lose my grip on reality…

Just in case I lose the will to live and do not write a series of eloquent blog posts here is a quick and dirty overview of what I have been up to.

The work has been divided into two parts.

1. Making sense of the Tower Hamlets Council Payments to Suppliers over £500 data that sometimes seemed to be at odds with data released under Freedom of Information (FOI) requests.

2. Examining how grants have been distributed by the Council across the borough. This has also been the subject of numerous investigations by others.

1. Payments to Suppliers over £500

In retrospect, getting the Payments to Suppliers over £500 data from the Council’s website and then republishing it under the Open Data licence was not that complex. Just lots of leg work.

The main problem was that the data was ‘published’ in either Excel or CSV form on the Council site. This meant getting just over 200 different files down from the site and then trying to make sense of them.

A lot of this ended up being basic data scrubbing and reconciliation work.

None of this scraper nonsense for me! I should be so lucky. Once the files were safely on my hard drive, each needed to be inspected because there was little consistency in the format. Payments from one department in one CSV file would be in a different logical format to those in another CSV file.

Over time a pattern to the inconsistencies began to emerge, and then all I needed to do was convert the files into a consistent structure. Again, lots of manual cutting and pasting, although I did write some Python scripts to do some of the basic merging of multiple files.
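To give a flavour, here is a minimal sketch of the sort of merging script I mean. The file layout and column mappings are invented for illustration, not the real Tower Hamlets headings:

```python
# A minimal sketch of merging inconsistently-headed CSV files into one
# structure. File names and column mappings are illustrative only.
import glob
import pandas as pd

# Map each department's column headings onto one consistent structure.
COLUMN_MAP = {
    "Supplier Name": "supplier",
    "Vendor": "supplier",
    "Amount (£)": "amount",
    "Net Amount": "amount",
    "Payment Date": "date",
    "Date Paid": "date",
}

frames = []
for path in glob.glob("payments/*.csv"):
    df = pd.read_csv(path).rename(columns=COLUMN_MAP)
    df["source_file"] = path  # keep provenance for later checking
    # reindex: any column a file lacks becomes NaN, flagged for hand-checking
    frames.append(df.reindex(columns=["date", "supplier", "amount", "source_file"]))

merged = pd.concat(frames, ignore_index=True)
merged.to_csv("payments_merged.csv", index=False)
```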

Once I had the data consistent I then had to put it somewhere. In retrospect the bin would have been a good place, but being stupid I went down the MySQL route.

I have tinkered with RDBMSs over the years but never really taken the plunge. Now I was committed, and after buying more than a few MySQL books and doing lots of head scratching I had a reasonable system running that would enable me to run reports and finally present Council data in a way that ordinary residents could understand. So I did.

The learning curve here was steep but still fun – in a geeky, masochistic sort of way. If you are a traditional journalist considering switching to data journalism, a good knowledge of SQL is essential. If this prospect does not appeal then data journalism is not for you. Full stop.

You can only understand the story if you understand the data. ‘Cos that is what the story is.

Sure, you can get a lot done with Excel, and mastery of this is essential too, but it is only once you have started to interrogate data with the power of a relational database that you can do the real work.

You also need to be able to normalise your data properly. Doing this at the start will prevent many hours if not days of heartbreak down the line (see below).
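For the curious, here is a sketch of the sort of normalised structure I mean. I actually used MySQL; the example uses sqlite3 purely so it runs anywhere, and the table and column names are illustrative:

```python
# A sketch of a normalised payments schema. I used MySQL; sqlite3 is
# used here only so the example is self-contained and runnable.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per supplier, keyed once, so a name typo in a monthly
    -- file cannot create a phantom second supplier.
    CREATE TABLE supplier (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        reg_no TEXT   -- Companies House / Charity Commission number
    );
    CREATE TABLE payment (
        id          INTEGER PRIMARY KEY,
        supplier_id INTEGER NOT NULL REFERENCES supplier(id),
        paid_on     TEXT NOT NULL,
        amount      REAL NOT NULL
    );
""")

conn.execute("INSERT INTO supplier VALUES (1, 'Acme Widgets Ltd', '01234567')")
conn.executemany(
    "INSERT INTO payment (supplier_id, paid_on, amount) VALUES (?, ?, ?)",
    [(1, "2014-01-15", 1200.00), (1, "2014-02-15", 800.00)],
)

# Reporting then becomes a simple aggregate instead of 200 spreadsheets:
rows = conn.execute("""
    SELECT s.name, SUM(p.amount) AS total
    FROM payment p JOIN supplier s ON s.id = p.supplier_id
    GROUP BY s.name
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('Acme Widgets Ltd', 2000.0)]
```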

Apart from publishing the data I also created a Sankey diagram as it was the only way I could find to illustrate the incredibly complex journey that payments took within the Council.
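If you fancy trying one yourself, here is a minimal sketch using Plotly’s Sankey support. It is one of several libraries that can draw these; I am not claiming it is what I used, and the departments and figures below are invented:

```python
# A minimal Sankey of money flowing from a council through departments
# to suppliers. All nodes and values are made up for illustration.
import plotly.graph_objects as go

labels = ["Council", "Directorate A", "Directorate B",
          "Supplier X", "Supplier Y"]

fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 0, 1, 1, 2],           # money flows from these nodes...
        target=[1, 2, 3, 4, 4],           # ...to these nodes
        value=[900, 400, 500, 400, 400],  # £ amount on each flow
    ),
))
fig.write_html("payments_sankey.html")
```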

2. Tower Hamlets Grants Payments

After several months of data scrubbing, and a Christmas Day and New Year’s Eve spent hacking Python, any sensible person would have quit while they were ahead. I am not sensible. I might appear very sensible but I am not. So I stuck with it.

The techniques I used for this work were the same – data scrubbing, Excel frolics, MySQL delight – but the big BIG issue here was normalisation of the data. Oh, and finding the data in the first place.

(I should hang my head in shame here and admit that it was only on this part of the project that I really began to understand the benefits of Open Refine. Wannabe data journos also note – you need to know and love this tool.)
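For the curious, OpenRefine’s documented ‘fingerprint’ key-collision method is simple enough to sketch in a few lines of Python: names that reduce to the same key are probably the same organisation. Example names invented:

```python
# A sketch of OpenRefine-style "key collision" clustering using its
# documented fingerprint method: lowercase, strip accents and
# punctuation, then sort the unique tokens.
import re
import unicodedata
from collections import defaultdict

def fingerprint(name: str) -> str:
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(sorted(set(tokens)))

names = [
    "Acme Widgets Ltd.",
    "ACME WIDGETS LTD",
    "Widgets Acme Ltd",
    "Bloggs Charity Widgets",
]

clusters = defaultdict(list)
for n in names:
    clusters[fingerprint(n)].append(n)

for key, group in clusters.items():
    if len(group) > 1:
        print(key, "->", group)  # candidate duplicates to review by hand
```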

At the time of writing (October 2014) there are numerous investigations, enquiries and court cases either under way or pending that relate to Tower Hamlets Council, and so I can say nothing about the background to this work apart from what I have already published on Love Wapping.

What I can say is that Tower Hamlets grants payments have already been the subject of numerous investigations in the national press and a BBC Panorama documentary.

Anyway, the only reason I had the source documents (PDF files, as ever) was that excellent journalists like Andrew Gilligan had already found the files on the Tower Hamlets site.

So all I had to do was tip my hat to Mr G. and do the data thing.

The original work on Payments to Suppliers was a grind but at least there was some consistency to the original data.

The data in the PDF files containing the grants distribution data was all over the place. The PDFs were conversions from Excel – no surprise or problem there. But there was no consistency, either across the different documents or even within a single document.
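To give a flavour of the extraction step, here is a sketch using pdfplumber, one of several tools that can pull tables back out of PDFs. I am not claiming it is the tool I used, and the file name is made up:

```python
# A sketch of pulling tables back out of PDFs that started life as
# Excel. pdfplumber is one option among several; file name invented.
import pdfplumber

rows = []
with pdfplumber.open("grants_report.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            for row in table:
                # Layouts varied page to page, so every row still needs
                # eyeballing before it can be trusted.
                rows.append([cell.strip() if cell else "" for cell in row])

print(len(rows), "raw rows extracted")
```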

I will return to this issue in detail in a later post, but at the moment it is late and I want my dinner.

Normalisation of the data, especially organisation names and addresses, was very problematic. It was essential, of course, that the authenticity of the original data was preserved. But at the same time a lot of the original data was wrong. I knew it was wrong partly through common sense and partly because I knew the addresses in question.

Tricky. This is not the sort of work that can be given to a Python script to sort out. Too much fuzziness.

Eventually I found the art was to compare all possible data sources for an address and/or organisation name and work from there.

These data sources were:

  • Original grant application (dubious at times)
  • Organisation’s website (often lacking basic address information)
  • Opencorporates (fantastic wonderful site!)
  • Openlylocal (another fantastic wonderful site!)
  • Charity Commission (often absolutely useless)

I also made very good use of DuckDuckGo rather than Google.

And on more than one occasion I went for a walk up the road and checked the address myself. More leg work. Of which more later.