More on OpenRefine, Stephanie Falkowski

For my last post on using OpenRefine, I worked with a dataset of the British Library’s comic book holdings found on thomaspadilla.org.

go to thomaspadilla.org

This site provides guidance on some of the ways OpenRefine can be used to clean data. It has walk-throughs of features including the text filter, facet tool, clustering, and transform, all applied to the data provided there. For my post, I used some of these same techniques to answer a different question: how many of the records listed have a less-than-certain place of publication?

For the purposes of this exercise I did not count the records where the place of publication was so uncertain that the field was left blank (there are 2050 of those), only the records that give a place followed by a qualifying question mark. Only 233 records fit that description. With OpenRefine I could also tell that most of them, 183, were given the designation “London?”, followed distantly by the broader placeholder “England?”, which was used in only fourteen instances. The other values were used fewer than four times each.

chart showing facet tool sorting most common questioned places of publication
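For readers who want to see the logic of that filter-and-facet step outside OpenRefine, here is a minimal sketch in Python. It is not how OpenRefine works internally, and it is not the method from the walk-throughs. The function name `questioned_places` and the sample values are hypothetical, made up for illustration rather than taken from the British Library dataset.

```python
from collections import Counter


def questioned_places(places):
    """Count place-of-publication values that end with a question mark.

    Blank or missing values are tallied separately and are not
    treated as "questioned" places.
    """
    counts = Counter()
    blanks = 0
    for place in places:
        value = (place or "").strip()
        if not value:
            blanks += 1          # field left empty: place unknown
        elif value.endswith("?"):
            counts[value] += 1   # place given, but marked as uncertain
    return counts, blanks


# Hypothetical sample values, NOT taken from the real dataset.
sample = ["London", "London?", "", "England?", "London?", None, "Dundee"]

counts, blanks = questioned_places(sample)

# List the questioned places from most to least common, like a facet
# sorted by count.
for place, n in counts.most_common():
    print(place, n)
print("blank:", blanks)
```

Sorting the tallies by count gives the same kind of view as a text facet sorted by count: the dominant value rises to the top and the rarely used ones trail behind it.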

OpenRefine offers many ways of manipulating messy, inconsistent data so that it can be used to answer a variety of questions. It was not necessary to clean the data completely, only the parts relevant to the question being asked. In this case that meant working with the Place of Publication column, though plenty of further cleaning could be done in other, equally inconsistent columns to open the dataset to further questions.