Google Books and labour erasure in the digital humanities

I’m currently prepping to teach an article by Jean-Baptiste Michel and colleagues (paywalled) that presents an early (if not the first) analysis of the content digitised as part of the Google Books project. I ended up going down a rabbit hole trying to understand how the 25+ million books were actually scanned, as it wasn’t immediately clear. This is from the supplementary methods section of the article itself:

We describe the way books are scanned and digitized. For publisher-provided books, Google removes the spines and scans the pages with industrial sheet-fed scanners. For library-provided books, Google uses custom-built scanning stations designed to impose only as much wear on the book as would result from someone reading the book. As the pages are turned, stereo cameras overhead photograph each page, as shown in Figure S15.

‘Google’ is used euphemistically here to mean human labour, but in a way that lets the reader assume the whole process is automated rather than carried out by people. The passage never mentions that humans do the scanning.

From digging further, it is clear that the labour was indeed provided by people and, as Leah Henrickson shows in “The Darker Side of Digitization”, many of these book-scanners seem to be people of colour (which is anecdotally confirmed in an interview Henrickson cites with artist Andrew Norman Wilson). A number of sites on the web collate human glitches in Google Books scanning (see image below from this New Yorker article on the ‘Art of Google Books’), many of which reveal the handiwork of people of colour. Henrickson refers to a source (of questionable reliability) stating that ‘the average Google book-scanner earned $24,000 USD in 2008’. Assuming this is close to being true (and it’s not hard to imagine it is), it’s apparent that Google did not value this work any more highly than minimum-wage labour; Henrickson claims that it must be ‘the most unenjoyable job at Google’.

It is striking that the authors of the article I was reading fail to mention the labour involved in scanning these books, bearing in mind that this is one of the first articles to use Google Books as a subject of analysis (and Google themselves are listed as authors). What’s more, all the authors (as far as I can tell) are men, and the majority are white, senior academics in their field; there is thus a sharp divide between the women of colour whose labour is treated as a resource and the men who put that resource to creative use. I’m reminded of Silvia Federici’s critical analysis that under capitalism ‘women themselves became the commons, as their work was defined as a natural resource, laying outside the spheres of market relations’ (Caliban and the Witch, 2004, p. 97).

The erasure of the labour of women and people of colour is nothing new, nor is the erasure of labour in the digital humanities (or ‘big data’ fields in general). Yet this is one of the starkest examples I’ve seen of both at once, especially as it looks like the authors of the paper went to some lengths to erase this labour and imply that it is simply undertaken by machines. This was undervalued, poorly paid work undertaken largely by marginalised people, and it should have been acknowledged. Please remember this if your academic research relies on such labour, be it via Google Books, Mechanical Turk or student interns.