Category Archives: Blog

Data scrubbing always precedes data visualisation (sigh)

How best to analyse and visualise the 1,378 A4 pages of court case transcript?

I know the story within these 1,378 pages better than most. My task is to extract that story and turn it into something that explains it to others.

I consider myself lucky that I have the transcript as it is a précis of a protracted and involved tale. OK, it’s a long précis but it is still a précis.

Each of the 30 days of the court case gives me a base structure as a timeline. The witness testimony gives another level of granularity. The names, places and organisations that each witness mentions provide a way to create links (‘edges’) between the different actors (‘nodes’).

One way to visualise this might be a co-occurrence matrix along the lines of the classic ‘Les Mis’ example, or some form of hierarchical edge bundling or chord diagram. The data will decide that, not me.
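The counting behind a co-occurrence matrix is straightforward once the entities have been extracted. Here is a minimal sketch, using made-up entity names and sessions purely for illustration; the real input would come from the entity extraction described below.

```python
from collections import Counter
from itertools import combinations

# Hypothetical example: entities mentioned in each day's testimony.
testimony = {
    "day_01": ["Smith", "Acme Ltd", "Wapping"],
    "day_02": ["Smith", "Jones", "Acme Ltd"],
    "day_03": ["Jones", "Wapping"],
}

# Count how often each pair of entities appears in the same session.
# Sorting each pair means (A, B) and (B, A) count as the same edge.
cooccurrence = Counter()
for entities in testimony.values():
    for a, b in combinations(sorted(set(entities)), 2):
        cooccurrence[(a, b)] += 1

for pair, count in sorted(cooccurrence.items()):
    print(pair, count)
```

The resulting pair counts map directly onto the edge weights that a matrix, chord diagram or edge-bundling layout would consume.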

I have already undertaken some basic semi-automatic data scrubbing of the original Word documents, converting these into text files, creating a PDF of each file and then combining the resulting PDFs into a single compound document. Huh? Why bother doing that?

The utility of entity analysis in DocumentCloud is clear.

I have been exploring the potential benefits of DocumentCloud. If you give it a single PDF, it will treat each page within the PDF as a separate document and generate timelines for all people, places and other entities using the OpenCalais system.

Default hierarchical view of documents in Overview

I discovered DocumentCloud through using the Knight Foundation’s Overview system. You can automatically import DocumentCloud documents into Overview and so take advantage of the benefits of both. So I urge you to get accounts sorted with both and see what you can do. The staff at Overview and DocumentCloud are super helpful.

Before I uploaded the transcript into these systems I spent some time looking at tagging options and desperately trying to remember the SGML project I undertook a couple of decades back – yes before I even knew HTML existed. Fortunately DocumentCloud does much of the tagging legwork for you so no need to hack another SGML editor.

But I keep coming back to the quality of the data, which is what any data-driven journalism project relies on. Good old-fashioned clean data.

Sure, the results given by Overview and DocumentCloud would have been better if I had uploaded Word files (eek!). But looking at the resulting text I know it could be better still, to the point where it could serve as a foundation on which to build useful computational structures.

Possibly because I know the real story currently hidden in the text, I am anxious to generate the best possible narrative.

I make no apologies for thinking about any dataset as being one big list. That’s what an A.I. degree does for you. And I can see the one big list being nicely parsed and imported into MySQL – then it can be used for all sorts of things.

Patience has never been a virtue of mine, and so I have spent several days trying to dodge the reality that, before the data can become useful, I really need to fire up Python and get the transcript text 100% clean and properly structured.

First off just a simple lexical analysis. Down the road possibly some limited semantic analysis. But the reality is that any elegant data visualisation always needs a clean dataset.
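That first lexical pass can start very simply: tokenise the text and count word frequencies. A minimal sketch, using a made-up snippet of testimony rather than the real transcript:

```python
import re
from collections import Counter

# A first pass at lexical analysis: lowercase the text, pull out
# word-like tokens and count how often each one occurs.
text = "No I did not do that naughty thing, honest. Are you sure?"
tokens = re.findall(r"[a-z']+", text.lower())
frequencies = Counter(tokens)

print(frequencies.most_common(3))
```

Even this crude frequency count is useful for spotting OCR noise and inconsistent spellings before any deeper analysis.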

So that’s me in Python land for a couple of weeks. I know I still have it easy as much of the transcript is of the form:

19 Q: Did you do this naughty thing?

20 A: No I did not do that naughty thing, honest.

21 Q: Are you sure?

22 A: Oh yes, on my life guvnor!

Stripping the line numbers is hardly a big issue, although I am in no way a Regex expert. I do have a book on that very subject that I got for Christmas, though.

But I want to be hacking D3 loveliness right now!

So the workflow at the moment looks like it will be the usual:

Python + Regex -> MySQL -> D3 -> End product(s).
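The middle of that workflow can be sketched end to end. This example uses Python's built-in sqlite3 module as a self-contained stand-in for MySQL (the SQL is near-identical), with a hypothetical `testimony` table; the JSON output at the end is the shape D3 would consume.

```python
import json
import sqlite3

# Stand-in for the MySQL stage: an in-memory SQLite database with a
# hypothetical table of parsed transcript rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testimony (day INTEGER, speaker TEXT, text TEXT)")
rows = [
    (1, "Q", "Did you do this naughty thing?"),
    (1, "A", "No I did not do that naughty thing, honest."),
]
conn.executemany("INSERT INTO testimony VALUES (?, ?, ?)", rows)

# Query the structured data back out and dump it as JSON for D3.
cursor = conn.execute("SELECT day, speaker, text FROM testimony ORDER BY day")
records = [{"day": d, "speaker": s, "text": t} for d, s, t in cursor]
print(json.dumps(records, indent=2))
```

Swapping SQLite for MySQL is mostly a matter of changing the connection call; the parse-store-export shape of the pipeline stays the same.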

[Note: I am now using a very nice clean and simple product called iA Writer for all my writing needs. iA Writer just lets you write stuff with no Mr. Paperclip nonsense and it is also a great way to get up to speed with markdown. Highly recommended.]



Data visualisation – one issue, three representations

For some reason I thought I had done this post some time back – but I hadn’t, so here it is. Sorry!

The distribution of third-sector grants by London Borough of Tower Hamlets council is an issue I have been covering for quite some time on my hyperlocal site Love Wapping.

I don’t want to duplicate all the content of the original blog posts here but simply highlight three different ways of visualising the same issue in different ways.

1. Draw a bar chart.

Tower Hamlets council grants distribution by electoral ward


Original post: Tower Hamlets council grants distribution by electoral ward

2. Draw another bar chart.

Tower Hamlets grants compared to quality of life indicators


Original post: Tower Hamlets grants compared to quality of life indicators

3. Create a nice map

Original post: Child poverty in Tower Hamlets mapped by ward

Different but the same

Why three visualisations?

Simple. The core subject is too complex to explain in one visualisation.

The data for all three is slightly different but overlaps.

Also as I am not, the last time I checked, a national newsroom, I don’t really have the luxury of devoting lots of time to experimenting with sophisticated D3 work.

The story needs to be covered as it breaks and as the data becomes available and/or makes sense.

Most importantly, any data visualisation needs to be appropriate for the intended audience. The people most interested in the grant funding issue are Tower Hamlets residents (‘cos it’s their money), and it does not need extensive user testing to discover that they just want something to look at and get the story.

Down the road, time and circumstances permitting, I may come back to this specific issue and see what else I can do with it.

But for the moment I have other data stories to work on.



Hyperlocals can release the rich seam of data in local chemists

Hyperlocal websites and visualisation of data are made for each other.  A little imagination and a lot of hard work could see hyperlocals become the ideal platform to deliver information derived from data analysis back to the very communities from where it was first gathered.  At the same time monolithic organisations such as the NHS might be provided with the perfect outlet to engage with patients.    

In my part of east London there is no local press coverage to speak of.  Stories that engage and inform are obvious in their omission from those local papers that do still exist.  These papers will soon cease to exist – online or in print – as their owners concentrate on other areas of London where they can still make a profit.

At the same time in E1W there are three hyperlocal sites that provide valuable information for the 12,000 – 15,000 residents of Wapping – What’s in Wapping, Pootling Around and my own Love Wapping.  Wapping might well have one of the highest ratios of hyperlocals to residents in the country and between us we don’t miss much.

Knock knock. Who’s there? A story!

Finding content for Love Wapping is never a problem. It’s the East End. Sometimes stories literally knock on my door. In fact there is a surplus of great stories, most of which don’t get covered due to constraints on time and resources.

Which leads to the crazy situation where the established but dying local papers fail to provide good coverage of local stories because their reporters are stuck behind a desk many miles away, and hyperlocals fail to provide good coverage because their writer/publishers are stuck behind their desks doing their day jobs. Hyperlocals just don’t pay. So the paucity of local news coverage continues.

So in the ‘information age’ it would seem there is not enough information getting to people.

My local NHS general practice in Wapping Lane does great work and is integral to our community, and I have started to attend its patients forum. Trouble is that – and maybe this is partly the curse that goes with any aspect of modern life that uses the ‘forum’ word – it is not as effective as it could be. Meeting every couple of months, there is the feeling that it just does not do what it should.

And the problem the practice has – along with most organisations – is that it cannot reach its patients beyond the surgery.

Disengagement with the NHS

So maybe Wapping residents are not interested in their health. Can this be true? Everyone has an interest in their own well-being whether they like it or not. But there does seem to be a disengagement with issues other than immediate personal illness.

Mulling over the problem I thought that one possible solution was to make my local doctor’s surgery more ‘interesting’.  If only the staff dealt with matters of life and death every day – oh.

Hyperlocal. NHS GP surgeries. And maybe a data angle?

Medicines Use Review (MUR) service

A few days later I read this in the Guardian.  Bingo.

“Data routinely recorded by pharmacies around the UK could provide physicians, commissioners, public health authorities and the pharmaceutical industry with invaluable insights into the effectiveness of medicines and the behaviour of patients. Every day, pharmacies around the country gather data on the use of medicines, outcomes, adherence levels and the progress of symptoms.”

Seems community pharmacies are an untapped gold mine of data.


Everyone but the patients

Notice how the benefits to everyone and their NHS dog are mentioned in that lead paragraph. Apart from patients. Patients are the source of the data and should be included as intended beneficiaries of it. But they seem to be excluded once the initial data harvest is done.

It seems that something called the Medicines Use Review (MUR) service collects shed loads of data and has cost the NHS £85m in the three years ending April 2014.

The basic idea is that this free NHS service is offered by community chemists (or pharmacies if you are in the USA) to patients: a private chat with the pharmacist (or chemist if you live in the UK) about the different medicines being taken, why they have been prescribed and any subsequent issues. The chemist records the conversation and can make recommendations to the patient and/or their GP.

And guess what? It seems that every MUR has an automatic data sharing consent clause. Great. But it would seem only for GPs, clinical commissioning groups and NHS England. Doh.

So does this mean that this incredibly useful – and expensive – data can only be used within the NHS?

The Guardian article quotes a study indicating that ‘data from MURs indicates that 25% of patients with long-term conditions don’t use their medications as directed.’ Why not? How could this be addressed? Do the people with long-term conditions know this?

Turning data into information

An informal conversation with a NHS professional suggests that there may be a way to get the untapped (data) gold mine of MUR out into the world, turn it into information and so make it of use to the communities that provided the data. 

I do not know if this can be done but it is definitely worth finding out. And then finding out whether this data can be sourced centrally for wider comparative work.

Well visualised and relevant data published on a hyperlocal platform such as Love Wapping (or love anywhere for that matter) may in turn help to engage the local community with their doctor’s surgery.

There is no reason why the established media should be better at the analysis and explanation of data than hyperlocals. And of course hyperlocals know the area they are talking about.

No doubt there will be lots of challenges in solving this particular problem. But then that’s the fun. Potential benefits? Unknown. Only one way to find out.

For more information: