Another data driven journalism project starts

Tired and generally worn out. But I have another data journalism project that I need to get done that might crack open all sorts of things.

To be honest it is more a continuation of previous work that I have been undertaking over the last 18 months or so. Might lead somewhere, might not.

Either way I need to do some basic data analysis to see if what I have been told is true.

If it is then BINGO. Maybe even double BINGO.

With all this talk of data driven this and data driven the other it’s sometimes easy to forget that the best information is almost always what someone tells you.

Funny thing being that what to them might seem a minor occurrence could turn out to be a significant lead for investigative work.

I have had the Excel files sitting on my hard drive for a couple of weeks but life has been a bit busy.

First thing – convert the Excel files with multiple worksheets into individual CSV files. I use the file conversion service at Zamzar.com to do this.

Not any more I don’t – two Excel files I uploaded came back empty. Oops. Probably a fault in the files, not the Zamzar service, but I am not taking any chances at the moment.

This Excel VBA macro seems to do the job. Code below:

Sub ExportSheetsToCSV()
    ' Save each worksheet in the active workbook as its own CSV file
    Dim wSheet As Worksheet
    Dim csvFile As String
    For Each wSheet In Worksheets
        On Error Resume Next
        ' Copy the sheet into a new, single-sheet workbook
        wSheet.Copy
        csvFile = CurDir & "\" & wSheet.Name & ".csv"
        ' Save that workbook as CSV, mark it saved and close it without prompting
        ActiveWorkbook.SaveAs Filename:=csvFile, _
            FileFormat:=xlCSV, CreateBackup:=False
        ActiveWorkbook.Saved = True
        ActiveWorkbook.Close
    Next wSheet
End Sub

Next? Oh joy. Data scrubbing.

I have tried to document a data journalism story before and failed miserably. This time I hope to be able to keep a little diary.

Mixing emotion and statistics to tell the data story

It is rare to get a press release that has everything needed to tell a story, let alone a data story.

A few weeks back I noticed an interesting data story from Barnardo’s UK in a local newspaper in sarf lunnun (or South London if you don’t speak the East End patois).

Getting in touch with Dan at Barnardo’s, I soon had a press release with real facts in, a cracking model-released image and some interesting data.

Even data references! How unusual is that?

Below is the result of my work.

[Barnardo’s graphic 01-02]

The image was a little too tight for my liking so I had to use Content-Aware Fill in Photoshop and more than a little retouching to give myself a few more valuable pixels on the right hand side.

At the same time I wanted the typography and data to intrude onto the boy’s image.

I tried various styles of bar chart, most of them very slight variations on the finished result; the main option was a yellow keyline around the ‘81 families in Tower Hamlets’ bar.

Barnardo’s liked what I had done and it helped to get the reality of life behind statistics out there.

Here is the story on my hyperlocal site Love Wapping.

How to work out best field types from a .csv file for MySQL

Doing data journalism work means I have a common problem popping up most days.

Yet more weird (or not so weird) and wonderful (or not so wonderful) CSV files need to be imported into MySQL for analysis, and that means yet another table.

So more time is spent working out – aka guessing – the optimal types for the fields. After all, we don’t want a nasty, fat, sluggish database, do we? No.
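Just to illustrate the sort of thing I mean – the table, column and file names here are made up – a typical import goes something like this:

-- Rough-and-ready holding table: everything oversized because
-- at this point I am only guessing at the field types
CREATE TABLE lsoa_import (
    lsoa_code      VARCHAR(20),
    age_band       VARCHAR(10),
    households     INT,
    population     INT,
    country_code   VARCHAR(20),
    region_code    VARCHAR(20),
    district_code  VARCHAR(20),
    notes          VARCHAR(255)
);

-- Pull the CSV straight in, skipping the header row
LOAD DATA LOCAL INFILE 'worksheet_one.csv'
INTO TABLE lsoa_import
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;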

I get annoyed having to do anything a machine can do so much better than me, so I had a quick Google and found this very handy bit of information – Determining the optimal Column Widths from a .csv File – over on DBA Stack Exchange, courtesy of Cristian Porta.

Thanks Cristian!

Running PROCEDURE ANALYSE() works a treat (see below), although I found that in MySQL Workbench I needed to follow these simple instructions to get the right result. In my version (Community 6.0) you need to look in the ‘SQL Queries’ tab in Preferences, not SQL Editor.
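The query itself is nothing fancy – just a SELECT with PROCEDURE ANALYSE() bolted on the end, shown here against the same made-up table name as above. The optional arguments limit how many distinct values (and how much memory) ANALYSE() will use per column before it gives up on suggesting an ENUM type.

-- Ask MySQL to suggest an optimal type for every column
SELECT * FROM lsoa_import PROCEDURE ANALYSE();

-- Same thing, but capped so it does not suggest enormous ENUMs
SELECT * FROM lsoa_import PROCEDURE ANALYSE(16, 256);

The Optimal_fieldtype column for one of my tables came back as: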

Optimal_fieldtype
CHAR(7) NOT NULL
ENUM('10','30','50') NOT NULL
MEDIUMINT(6) UNSIGNED NOT NULL
MEDIUMINT(6) UNSIGNED NOT NULL
ENUM('E92000001') NOT NULL
ENUM('E19000003') NOT NULL
ENUM('E18000007') NOT NULL
CHAR(0) NOT NULL

Anyway it works.

Which means I can now get on with the real business of analysing the data.

