Measuring the ANZACs Tutorial 1: Classifying pages
Welcome to the tutorials for Measuring the ANZACs. Thanks to the feedback on Talk we’ve identified the issues citizen scientists need to know as we work on this project. Over the next few weeks we’ll be publishing a series of detailed tutorials on the steps involved in working on Measuring the ANZACs. Each tutorial will explain the scientific and research rationale for what we’re doing, and why we’ve set up the workflow in a particular way, and then outline how things work with examples, screen shots, and step-by-step instructions. Please leave comments, and please join us on Talk to ask your questions.
The Measuring the ANZACs data and workflow: Measuring the ANZACs is transforming 4 million pages of soldiers’ personnel files into a structured database. One way to approach this task would be to transcribe the words on every page into a free text database, like a big Word document or blog post.
However a quick look at the pages in the files shows many of the documents have a structure. Take this somewhat random example. It’s a hospital admission form. It even has a standard form number (E.F. 60) suggesting that many of the reports on hospital admissions will collect the same information. The pieces of information that will be collected in this context are pre-printed on the form in bold text. The different responses to these questions are handwritten (and sometimes typewritten). For those of you studied statistics we could refer to the printed text as variables and the handwritten responses as their values. In fact, among the research team that’s often how we do distinguish between the general concepts and the specific answers.
You will also notice that there are lots of different kinds of pages with different questions on them. So the pages themselves are variable. In order to transcribe the information on any page in a structured format we first have to know what kind of page we’re looking at.
When you—as citizen scientists—are looking at a page you can tell what kind of page it is. Often it says exactly what the page is”HISTORY SHEET,” for example. But not always! You’ll see that too, or you’ve seen it, if you’ve been working with us for a while. Early on we had hoped that we would be able to use Optical Character Recognition to identify the types of forms you are looking at. But there are enough pages that don’t have legible titles, or don’t have titles at all, that we soon realized this wouldn’t work at a level of acceptable accuracy. To put it plainly, unless the OCR process was recognizing 90% or more of the pages accurately we’d still be needing human review.
These considerations lead us to order of events in the Measuring the ANZACs workflow
- Classify a page as a particular type of document
- If a page is classified as a document that we are transcribing, mark the fields to be transcribed.
- Transcribe the marked fields.
And that’s all there is to it. Rinse and repeat for 140,000 files and 4 million pages!
The classification of page types will help us set up the database in such a way that other researchers will be able to use it for their research on other topics. If you are researching hospital treatments, for example, it would be great if we had identified all forms relating to hospital admissions, so you could go straight to those pages. That’s what we’re hoping to do. Here’s how we’ll do that, with your help!
Classifying pages is found under the Mark workflow, so start by selecting either link to Mark.
We eventually want to mark variables on the History Sheet, the Statement of Services (which is really page 2 of the History Sheet), the Attestations, and any form of Death Notifications. These are the options for classifying a page, plus “Other”
In this example you can see it’s a HISTORY-SHEET”, so you click on History Sheet on the right hand side of the page. The color of the button will change to indicate you’ve selected it. Then you click on NEXT.
If you’ve classified a page as one of the designated types, the next part of the Mark workflow will ask you to mark the variables or fields on the page. That is, identify the things in printed text that have handwritten answers we’re interested in. We’ll cover that involved process in a future tutorial. For now, let’s consider what about the “Other” pages.
What to do with Other pages
Here’s an image we marked as Other. This is one of the dullest examples — it’s like a place marker in the file. We can probably identify this as a “NEW FILE BEGINS” page using Optical Character Recognition. But stick with me for the example. We mark this as “Other”, and then click “Done”.
If you wanted to describe what this page was (if it was more interesting) you can click on “Transcribe this page now”. This brings up a dialog box that you can type in a free text description of the form. You can move the dialog box around so you can see what’s underneath.
Summary: Classifying what types of pages are in the personnel files is very important to the Measuring the ANZACs workflow. It helps you put the correct variables on the correct pages.
Describing what the “Other” pages are will help other researchers beyond the existing research team use the files in the future by providing links to known pages on known topics. We’re working on ways to make this part of the process go a little smoother and make it more obvious.