Running PROCEDURE ANALYSE() works a treat (see below), although I found that in MySQL Workbench I needed to follow these simple instructions to get the right result. In my version (Community 6.0) you need to look in the ‘SQL Queries’ tab in Preferences, not ‘SQL Editor’.
CHAR(7) NOT NULL
ENUM('10','30','50') NOT NULL
MEDIUMINT(6) UNSIGNED NOT NULL
MEDIUMINT(6) UNSIGNED NOT NULL
ENUM('E92000001') NOT NULL
ENUM('E19000003') NOT NULL
ENUM('E18000007') NOT NULL
CHAR(0) NOT NULL
Anyway it works.
Which means I can now get on with the real business of analysing the data.
How best to analyse and visualise the 1,378 A4 pages of court case transcript?
I know the story within these 1,378 pages better than most. My task is to extract the story and turn it into something that explains the story to others.
I consider myself lucky that I have the transcript as it is a précis of a protracted and involved tale. OK, it’s a long précis but it is still a précis.
Each of the 30 days of the court case gives me a base structure as a timeline. The witness testimony gives another level of granularity. The names, places and organisations that each witness mentions provide a way to create links (‘edges’) between the different actors (‘nodes’).
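To make the nodes-and-edges idea concrete, here is a minimal sketch in Python. It assumes the mentions have already been extracted into a dictionary keyed by witness; the witness and entity names are made-up placeholders, not anything from the transcript, and build_edges and build_nodes are hypothetical helper names.

```python
from collections import Counter

# Hypothetical, hand-made input: each witness mapped to the entities
# (people, places, organisations) they mention. Placeholder names only.
mentions = {
    "Witness A": ["Person X", "Organisation Y"],
    "Witness B": ["Person X", "Place Z"],
}


def build_edges(mentions_by_witness):
    """Count one edge per (witness, entity) mention."""
    edges = Counter()
    for witness, entities in mentions_by_witness.items():
        for entity in entities:
            edges[(witness, entity)] += 1
    return edges


def build_nodes(edges):
    """Collect every actor that appears at either end of an edge."""
    nodes = set()
    for source, target in edges:
        nodes.update((source, target))
    return sorted(nodes)


edges = build_edges(mentions)
nodes = build_nodes(edges)
```

The edge counts double up as weights, so an entity mentioned repeatedly by the same witness would get a heavier link.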
I have already done some basic semi-automatic scrubbing of the original Word documents: converting them into text files, creating a PDF of each file, and then combining all the resulting PDFs into one compound document. Huh? Why bother doing that?
I have been exploring the potential benefits of DocumentCloud. If you give DocumentCloud a single PDF it will treat each page within the PDF as a separate document and generate timelines for all people, places and other entities using the OpenCalais system.
I discovered DocumentCloud through using the Knight Foundation’s Overview system. You can automatically import DocumentCloud documents into Overview and so take advantage of the benefits of both. So I urge you to go and get accounts sorted with both and see what you can do. The staff at Overview and DocumentCloud are super helpful.
Before I uploaded the transcript into these systems I spent some time looking at tagging options and desperately trying to remember the SGML project I undertook a couple of decades back – yes before I even knew HTML existed. Fortunately DocumentCloud does much of the tagging legwork for you so no need to hack another SGML editor.
But I keep coming back to the quality of the data. Which is what any data-driven journalism project relies on. Good old-fashioned clean data.
Sure, the results given by Overview and DocumentCloud are better than if I had uploaded Word files (eek!). But looking at the resulting text I know it could be better still, at a level where it could be used as a data structure foundation on which to build useful computational structures.
Possibly because I know the real story currently hidden in the text I am anxious to generate the best possible narrative.
I make no apologies for thinking about any dataset as being one big list. That’s what an A.I. degree does for you. And I can see the one big list being nicely parsed and imported into MySQL – then it can be used for all sorts of things.
Patience has never been a virtue of mine and so I have spent several days trying to dodge the reality that before the data can become useful I really need to fire up Python and get the transcript text 100% clean and then properly structured.
First off just a simple lexical analysis. Down the road possibly some limited semantic analysis. But the reality is that any elegant data visualisation always needs a clean dataset.
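As a rough idea of what ‘simple lexical analysis’ could mean at its most basic, here is a sketch that lowercases the text, splits it into word tokens and counts them. The token pattern and the word_frequencies name are my own assumptions, not a settled design.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def word_frequencies(text):
    """Lowercase the text, split it into tokens and count each one."""
    return Counter(TOKEN_RE.findall(text.lower()))


sample_text = (
    "Did you do this naughty thing? "
    "No I did not do that naughty thing, honest."
)
frequencies = word_frequencies(sample_text)
```

Calling frequencies.most_common(10) then lists the ten most frequent tokens.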
So that’s me in Python land for a couple of weeks. I know I still have it easy as much of the transcript is of the form:
19 Q: Did you do this naughty thing?
20 A: No I did not do that naughty thing, honest.
21 Q: Are you sure?
22 A: Oh yes, on my life guvnor!
Stripping the line numbers is hardly a big issue although I am in no way a Regex expert. I do have a book on that very subject I got for Christmas though.
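For what it is worth, a minimal sketch of the stripping step follows. It assumes every line of interest has exactly the shape shown above (a number, a space, then ‘Q:’ or ‘A:’ and the text); anything else is simply skipped. LINE_RE and strip_line_numbers are hypothetical names.

```python
import re

# Assumed line shape: line number, whitespace, "Q:" or "A:", then the text.
LINE_RE = re.compile(r"^\s*(\d+)\s+([QA]):\s*(.*)$")


def strip_line_numbers(lines):
    """Return (speaker, text) pairs with the leading line numbers dropped."""
    pairs = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # lines that do not fit the assumed shape are ignored
            _number, speaker, text = match.groups()
            pairs.append((speaker, text.strip()))
    return pairs


sample = [
    "19 Q: Did you do this naughty thing?",
    "20 A: No I did not do that naughty thing, honest.",
    "21 Q: Are you sure?",
    "22 A: Oh yes, on my life guvnor!",
]
pairs = strip_line_numbers(sample)
```

Real transcript pages will have headers, wrapped answers and other lines that break this pattern, so the regex is a starting point rather than the finished cleaner.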
But I want to be hacking D3 loveliness right now!
So the workflow at the moment looks like it will be the usual:
Python + Regex -> MySQL -> D3 -> End product(s).
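A toy end-to-end version of that workflow might look like the sketch below. To keep it self-contained it uses Python’s built-in sqlite3 module as a stand-in for MySQL, and the table name, column names and JSON shape are all assumptions made for illustration.

```python
import json
import re
import sqlite3

# Assumed line shape: line number, whitespace, "Q:" or "A:", then the text.
LINE_RE = re.compile(r"^\s*(\d+)\s+([QA]):\s*(.*)$")


def parse(lines):
    """Python + Regex: turn raw lines into (line_no, speaker, body) rows."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            rows.append((int(match.group(1)), match.group(2), match.group(3).strip()))
    return rows


def load(rows):
    """Database step: sqlite3 in memory stands in for MySQL here."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE utterance (line_no INTEGER, speaker TEXT, body TEXT)")
    conn.executemany("INSERT INTO utterance VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def export_for_d3(conn):
    """D3 step: dump an aggregate as JSON that a D3 script could load."""
    cursor = conn.execute(
        "SELECT speaker, COUNT(*) FROM utterance GROUP BY speaker ORDER BY speaker"
    )
    return json.dumps([{"speaker": s, "count": c} for s, c in cursor])


sample = [
    "19 Q: Did you do this naughty thing?",
    "20 A: No I did not do that naughty thing, honest.",
    "21 Q: Are you sure?",
    "22 A: Oh yes, on my life guvnor!",
]
conn = load(parse(sample))
d3_json = export_for_d3(conn)
```

Swapping sqlite3 for a MySQL driver would change the connection call and the placeholder style, but the shape of the three steps stays the same.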
[Note: I am now using a very nice clean and simple product called iA Writer for all my writing needs. iA Writer just lets you write stuff with no Mr. Paperclip nonsense and it is also a great way to get up to speed with markdown. Highly recommended.]
Simple. The core subject is too complex to explain in one visualisation.
The data for all three is slightly different but overlaps.
Also as I am not, the last time I checked, a national newsroom, I don’t really have the luxury of devoting lots of time to experimenting with sophisticated D3 work.
The story needs to be covered as it breaks and as the data becomes available and/or makes sense.
Most importantly, any data visualisation needs to be appropriate for the intended audience. The people who are most interested in the grant funding issue are Tower Hamlets residents (’cos it’s their money) and it does not need extensive user testing to discover that they just want something to look at and get the story.
Down the road, time and circumstances permitting, I may come back to this specific issue and see what else I can do with it.
But for the moment I have other data stories to work on.