Archive | October 2015

Measuring the ANZACs Tutorial 1: Classifying pages

Welcome to the tutorials for Measuring the ANZACs. Thanks to the feedback on Talk we’ve identified the issues citizen scientists need to know as we work on this project. Over the next few weeks we’ll be publishing a series of detailed tutorials on the steps involved in working on Measuring the ANZACs. Each tutorial will explain the scientific and research rationale for what we’re doing, and why we’ve set up the workflow in a particular way, and then outline how things work with examples, screen shots, and step-by-step instructions. Please leave comments, and please join us on Talk to ask your questions.

The Measuring the ANZACs data and workflow: Measuring the ANZACs is transforming 4 million pages of soldiers’ personnel files into a structured database. One way to approach this task would be to transcribe the words on every page into a free text database, like a big Word document or blog post.

However, a quick look at the pages in the files shows that many of the documents have a structure. Take this somewhat random example. It’s a hospital admission form. It even has a standard form number (E.F. 60), suggesting that many of the reports on hospital admissions will collect the same information. The pieces of information that will be collected in this context are pre-printed on the form in bold text. The different responses to these questions are handwritten (and sometimes typewritten). For those of you who have studied statistics, we could refer to the printed text as variables and the handwritten responses as their values. In fact, among the research team that’s often how we distinguish between the general concepts and the specific answers.

Hospital admission form (E.F. 60)

You will also notice that there are lots of different kinds of pages with different questions on them. So the pages themselves are variable. In order to transcribe the information on any page in a structured format we first have to know what kind of page we’re looking at.
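One way to picture the variables-and-values idea is as a lookup from page type to the printed questions on that form. The sketch below is only an illustration: the field names are hypothetical and are not the project's actual database schema.

```python
# Hypothetical field lists per page type. The real forms and the
# project's database schema may use different variables.
FIELDS_BY_PAGE_TYPE = {
    "History Sheet": ["surname", "occupation", "date_of_birth"],
    "Hospital admission form (E.F. 60)": ["date_admitted", "hospital", "disease"],
}


def blank_record(page_type):
    """Return a record with one empty value per printed variable.

    An unknown page type raises KeyError, which mirrors the point in
    the text: we cannot transcribe in a structured way until we know
    what kind of page we are looking at.
    """
    return {variable: None for variable in FIELDS_BY_PAGE_TYPE[page_type]}


record = blank_record("History Sheet")
record["occupation"] = "Farmer"  # a handwritten response becomes the value
print(record)
```

The printed text supplies the keys, and the handwriting supplies the values.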

When you, as citizen scientists, are looking at a page you can tell what kind of page it is. Often it says exactly what the page is: “HISTORY SHEET,” for example. But not always! You’ll see that too, or you’ve seen it, if you’ve been working with us for a while. Early on we had hoped that we would be able to use Optical Character Recognition (OCR) to identify the types of forms you are looking at. But there are enough pages that don’t have legible titles, or don’t have titles at all, that we soon realized this wouldn’t work at an acceptable level of accuracy. To put it plainly, unless the OCR process recognized 90% or more of the pages accurately, we’d still need human review.

These considerations led us to the order of events in the Measuring the ANZACs workflow:

  1. Classify a page as a particular type of document
  2. If a page is classified as a document that we are transcribing, mark the fields to be transcribed.
  3. Transcribe the marked fields.

And that’s all there is to it. Rinse and repeat for 140,000 files and 4 million pages!
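The three steps can be sketched as a simple decision on the page type. This is a minimal illustration, not the project's software: the function name is hypothetical, and the set of transcribed types is taken from the classification options named in this tutorial.

```python
# Page types that are marked and transcribed, per this tutorial.
TRANSCRIBED_TYPES = {
    "History Sheet",
    "Statement of Services",
    "Attestation",
    "Death Notification",
}


def next_step(page_type):
    """Say what happens to a page once it has been classified (step 1)."""
    if page_type in TRANSCRIBED_TYPES:
        # Steps 2 and 3: mark the fields, then transcribe them.
        return "mark fields, then transcribe"
    # "Other" pages can optionally get a free text description.
    return "optionally describe in free text"


for page_type in ["History Sheet", "Other"]:
    print(page_type, "->", next_step(page_type))
```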

The classification of page types will help us set up the database in such a way that other researchers will be able to use it for their research on other topics. If you are researching hospital treatments, for example, it would be great if we had identified all forms relating to hospital admissions, so you could go straight to those pages. That’s what we’re hoping to do. Here’s how we’ll do that, with your help!

Classifying pages

Classifying pages is found under the Mark workflow, so start by selecting either link to Mark.

start marking

We eventually want to mark variables on the History Sheet, the Statement of Services (which is really page 2 of the History Sheet), the Attestations, and any form of Death Notifications. These are the options for classifying a page, plus “Other”.

Mark this as a History Sheet

In this example you can see it’s a “HISTORY-SHEET”, so you click on History Sheet on the right-hand side of the page. The color of the button will change to indicate you’ve selected it. Then you click on NEXT.

Selection of page as History Sheet

If you’ve classified a page as one of the designated types, the next part of the Mark workflow will ask you to mark the variables or fields on the page. That is, identify the things in printed text that have handwritten answers we’re interested in. We’ll cover that involved process in a future tutorial. For now, let’s consider what to do about the “Other” pages.

What to do with Other pages

An "Other" page

Here’s an image we marked as Other. This is one of the dullest examples — it’s like a place marker in the file. We can probably identify this as a “NEW FILE BEGINS” page using Optical Character Recognition. But stick with me for the example. We mark this as “Other”, and then click “Done”.

transcribe other page

If you want to describe what this page is (if it were more interesting), you can click on “Transcribe this page now”. This brings up a dialog box in which you can type a free text description of the form. You can move the dialog box around so you can see what’s underneath.

Summary: Classifying what types of pages are in the personnel files is very important to the Measuring the ANZACs workflow. It helps you put the correct variables on the correct pages.

Describing what the “Other” pages are will help other researchers beyond the existing research team use the files in the future by providing links to known pages on known topics. We’re working on ways to make this part of the process go a little smoother and make it more obvious.


Why are there sticky notes on the history sheets?

The material we are using in Measuring the ANZACs is relatively complex. Across 140,000 different files there are around 4 million page images, so the average file is nearly 30 pages long (4,000,000 ÷ 140,000 ≈ 28.6). The 140,000 distinct personnel files represent a smaller number of distinct individual people. Some people served in both South Africa and World War I, and ended up with distinct files. A more common reason for the existence of multiple files about the same person is that people tried to enlist multiple times. Particularly in the early years of the war, the research team has observed people trying to enlist years before they were of age. These men, teenagers when they first tried to enlist, show up in different attestation files with the same day and month of birth but a different year from the one in the subsequent (successful) enlistment. The New Zealand Army tried to merge files that were clearly the same person, but did not manage to merge them all. One of the outcomes of Measuring the ANZACs, when we have completed marking and transcription for all the files, is that we’ll be able to make some educated guesses about which files really belong to the same person.
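To make the "educated guesses" idea concrete, here is a minimal sketch of how transcribed files might be screened for possible duplicates: group files by name plus day and month of birth, and flag groups where the recorded birth years differ. The records are made up, the function name is hypothetical, and this is not the research team's actual matching method.

```python
from collections import defaultdict
from datetime import date

# Made-up example records: (file id, name, date of birth as written).
files = [
    ("A1", "John Smith", date(1897, 3, 14)),
    ("A2", "John Smith", date(1899, 3, 14)),
    ("A3", "John Smith", date(1899, 7, 2)),
]


def candidate_duplicates(records):
    """Return groups of file ids that share a name and a day and month
    of birth but disagree on the year of birth."""
    groups = defaultdict(list)
    for file_id, name, dob in records:
        groups[(name.lower(), dob.month, dob.day)].append((file_id, dob.year))
    return [
        sorted(file_id for file_id, _ in members)
        for members in groups.values()
        if len({year for _, year in members}) > 1
    ]


print(candidate_duplicates(files))  # [['A1', 'A2']]
```

A match like this is only a candidate. Two different men can share a name and a birthday, so a person would still need to check the files.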

So we have a lot of pages and a lot of different people, some of whom might not be different. The complexity that requires citizen science is apparent again when looking at the pages. The different files have a widely varying number of pages in them, from just a couple to over a hundred. Someone who enlisted late in the war and saw no active service will tend to have a short file. Someone with longer and more eventful service may have a hundred pages or more.

Thus the first task for citizen scientists is to identify what’s in the files. What kind of documents do we have? This is the scientific rationale for the marking process: helping us work out what material we have. We are identifying a small number of the basic documents that most files should have: an attestation form from enlistment, a service history form, and notifications of death. You will notice that on the second page of the history sheets, where the “Statement of Services” is recorded, the second half of the page is a record of Conduct. Or really, misconduct. At the moment we are just asking for your involvement in marking the Conduct incidents, but not transcribing them. We hope that we can use this identification of the conduct incidents to sample these pages later on in the project and enable research on misconduct during the war.

Misconduct was not uncommon among a hundred thousand men serving overseas. Our introductory post, for example, mentioned the Dibble brothers and their service that included medals and conduct citations for drunkenness. But misconduct was uncommon enough that it doesn’t appear in every file. We just won’t know until we’ve worked our way through the files. The National Library of New Zealand has scanned the reports of the Defence Forces, and they report more than 1,000 prosecutions in the Territorial Forces each year during the war out of 30,000 territorials. We aren’t sure about misconduct in the Expeditionary Force when men served overseas. In any case this suggests misconduct may be recorded in around 5% of the files. This is an example of the complexity of the material that we have to work with, but also the richness and potential for telling the life stories of these men.
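As a quick check on the Territorial Forces figures quoted above, the annual prosecution rate can be worked out directly. The sketch uses 1,000 as a floor because the reports say "more than 1,000". The result is per year and for the Territorials only, so it is a rough guide to the share of Expeditionary Force files that record misconduct, not a measurement of it.

```python
prosecutions_per_year = 1_000  # floor: the reports say "more than 1,000"
territorials = 30_000

annual_rate = prosecutions_per_year / territorials
print(f"{annual_rate:.1%}")  # 3.3%
```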

The richness and density of information in the files also gives rise to the “sticky notes” you will see on many history sheets.

An example of a sticky noted History sheet.


You will notice that these are not sticky notes that we would see today: commercially made and sold as sticky notes. These are pieces of paper stuck to the paper with paste. They serve the same function as sticky notes today, both physically and intellectually. Sticky notes today are often used to write summaries or notes of the material on the underlying page—that is their intellectual purpose. Their physical form is meant to keep them with the underlying page, but not permanently obscure it.

You will nearly always see a sticky note on the front page of a History sheet. Let me rephrase that. Having looked at thousands of files I’ve never seen one anywhere else. It’s theoretically possible they exist on other documents. Please tell us in the discussion forums if you find another type of page with a sticky note on it!

Look closely at the content of the sticky notes and the content of the History sheets, and you can begin to see the history of these files themselves as working documents for organizing the administration of soldiers in service and as returned soldiers (what Americans would call “veterans,” New Zealanders and Australians call “returned soldiers”; the organizations of these men are the RSA in New Zealand and the RSL in Australia). Looking closer at the sticky notes you can see that they summarize the length and places of service. They clearly had a role in the Defence Forces’ processing of medals and pensions. They allowed clerks to quickly see how long a man had served, whether a file had been processed and what actions had been taken.

Detail from a sticky note

Like modern day sticky notes these sticky notes contained a summary of information or instructions to action based on the material underneath. It was important not to obscure the information underneath, so only the top of the paper was stuck down. Sticking the paper down with paste also meant that the files were not damaged with pins or staples. These files are now nearly 100 years old in parts, and the paper has not been visibly degraded by the sticky notes. As archivists and historians around the world will tell you pins, staples and rubber bands can do tremendous damage to paper files. In even a slightly damp environment pins, paper clips and staples will start to rust within a decade and damage the paper. So we should be grateful that these files have sticky notes.

The sticky notes presented a technical challenge for both Archives New Zealand (an important partner in Measuring the ANZACs) and the Zooniverse team designing the website. For Archives New Zealand the question was how to scan the material without damaging it, and also present all the information to readers. In the reading room you can lift up the sticky note. There is, as yet, no way to do that in a scanned image! Archives’ solution was to scan the pages with sticky notes multiple times lifting the notes up to reveal the information underneath. Thus, for a History sheet with two sticky notes it may take 3-4 scans to show the whole page including material underneath.


Use the Lightbox feature in the Mark interface to see the next pages. When you see a History sheet with a sticky note, you will see the same page repeated several times to capture all the information. Thus in the Mark workflow you may need to mark several different images to capture the information from what is, in physical form, just one piece of paper. The key to this process is our question to you: “Are there any sticky notes?” This helps us work out what to do with the page when processing the data.
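To illustrate why the sticky-note answer matters when processing the data, here is a minimal sketch that folds repeated scans back into one physical page. It assumes, as a simplification, that consecutive images of the same page type flagged as having sticky notes are scans of the same sheet. The data and the function name are hypothetical, not the project's actual processing code.

```python
from itertools import groupby

# Made-up scan sequence for one file:
# (image number, page type, volunteer answered "sticky notes: yes"?)
scans = [
    (1, "History Sheet", True),
    (2, "History Sheet", True),
    (3, "History Sheet", True),
    (4, "Attestation", False),
]


def group_physical_pages(scan_list):
    """Group consecutive sticky-note scans of one page type together;
    every other image is treated as its own physical page."""
    pages = []
    for (page_type, sticky), run in groupby(scan_list, key=lambda s: (s[1], s[2])):
        images = [s[0] for s in run]
        if sticky:
            pages.append((page_type, images))
        else:
            pages.extend((page_type, [image]) for image in images)
    return pages


print(group_physical_pages(scans))
# [('History Sheet', [1, 2, 3]), ('Attestation', [4])]
```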

Seemingly banal, the sticky notes on the History sheets shed light on the working lives of these files as government documents about people’s lives, and also show some of the issues archivists and the research team have had to think about in bringing Measuring the ANZACs to life. As always, if you have questions, bring them to the discussion forums.

Evan Roberts

Why are we Measuring the ANZACs?

Welcome to Measuring the ANZACs. In our introductory post we mentioned that “The international research team began collecting information about New Zealand soldiers to study changes in the height and weight of New Zealanders over time. Height and weight are some of the best information we have to explore the health and material well being of people in the past.”

The science behind using height and weight to study well being is called anthropometrics, and when applied to the past is called “historical anthropometrics” or “anthropometric history”. Scholars who study height and weight in the past are a mix of historians, economists, archaeologists, and anthropologists. Historians and economists are more likely to use written records that describe height and weight for many people, while archaeologists and anthropologists are more likely to use skeletal records. The research team behind Measuring the ANZACs reflects this mix: Kris Inwood and Les Oxley have a background in economics, and Evan Roberts trained as an historian.

The science of anthropometrics is also biological and physiological (the science of human growth and body dimensions is called auxology). Historical anthropometrics relies on the fact that only 80% of the variation in human height is genetic. That is, most of the variation in whether you’re tall or short is because your parents and grandparents were tall or short. Or somewhere in between, as most of us are. The remaining 20% of variation is environmental, and we understand the word environmental to encompass all aspects of the environment in which people grow up: social and economic as well as natural (e.g. climate).

A fascination with how humans grow, and how that varies across different groups, is long-standing. For those of you who want to read more, James Tanner’s book A History of the Study of Human Growth is a fascinating read. But the interest in using large samples of records about stature in the past is a fairly recent one, dating to the 1970s. Scholars were interested in two seemingly disparate questions: how well off were slaves during American slavery, and what were the living conditions of people in Europe during the Industrial Revolution? Attention to these questions has continued, with much debate about whether the stature of Europeans declined during the Industrial Revolution and whether the heights of American men born in the mid-nineteenth century also declined. These findings are apparent paradoxes, since in other respects people appeared to be getting better off in the nineteenth century: average incomes were rising.

Anthropometrics recognizes that growth is the culmination of competing forces. When young, growing people are getting enough to eat and meeting their basic needs, they have energy left over to grow. The phrase “net nutrition” captures the idea that people grow net of other needs for the calories they are taking in. Three factors can affect whether people have enough energy left over to grow:

  1. Food consumption
  2. Physical exertion
  3. Disease load

All other things being the same, if people eat less (more) food, or do more (less) physical activity, or have more (less) disease, they will have less (more) energy for growing. But things are rarely the same, and seeing that people’s heights have changed doesn’t tell us why their height changed. Anthropometric history is a powerful scholarly tool for uncovering when in the past there were more or less challenging times to be a growing person. We have to supplement it with other evidence to find out why average stature rose or fell. And it is average stature that matters. Because height varies genetically, your height, or my height for that matter, tells us little about whether we were underfed or well fed when we were growing. We need to see how things changed for lots of people to identify the 20% of the variation that is environmental. This, in short, is the science behind why we want to Measure the ANZACs. It can tell us a lot about historical change in New Zealanders’ well-being and health, and set that in international context.
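The two ideas above, net nutrition and the focus on averages, can be illustrated with a small sketch. Everything in it is made up for illustration: the calorie figures, the heights, and the function name are hypothetical, and real anthropometric analysis is far more involved.

```python
from statistics import mean


def net_nutrition(food_kcal, exertion_kcal, disease_kcal):
    """Energy left over for growth once other demands are met
    (a deliberately simplified, hypothetical accounting)."""
    return food_kcal - exertion_kcal - disease_kcal


# Less food, more work, or more disease each leaves less energy to grow.
print(net_nutrition(2500, 1800, 300))  # 400
print(net_nutrition(2500, 2000, 300))  # 200

# Made-up adult heights in cm for two birth cohorts.
cohort_a = [165, 170, 173, 176, 181]
cohort_b = [163, 168, 171, 174, 179]

# Individuals vary widely within each cohort, so one person's height
# says little. The difference in averages is what hints at a change
# in growing conditions.
print(mean(cohort_a) - mean(cohort_b))  # 2
```

Note that the tallest person in the second cohort (179 cm) is taller than most people in the first, even though the second cohort is shorter on average.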

Our research has already found that New Zealanders were tall by international standards, with adult men standing about 68 inches (173 cm) on average at the end of the nineteenth century. This is important because it confirms that New Zealanders were well off on many dimensions. We also know from our existing research that the stature of the indigenous Māori and of Pākehā (European settlers and their descendants) was nearly the same through the late nineteenth century, and only began to diverge in the twentieth century. This is important because it tells us more about how European settlement affected the Māori population. While the Māori population declined, the health and well-being of living Māori, at least as reflected in stature, was not affected until the twentieth century. In the early twentieth century Māori stature fell behind both absolutely (average stature fell slightly) and relatively (Pākehā stature rose, on average). But in the second half of the twentieth century Māori caught up again with Pākehā. Stature isn’t everything, and other things (income, life expectancy) tell us Māori were not as well off as Pākehā. But stature is something: it tells us that children were in some important ways equally well fed.

Research like ours starts with one question, answers it, and opens others. Our research has answered some basic questions about the level and changes of height and weight in New Zealand. But it has thrown up many questions about how men’s health changed over time. How did health in early life affect how long people lived, and what they did later in life? These questions overlap with our questions about Māori and Pākehā differences where generational changes suggest the importance of long-term influences on health. Answering these questions is why we want to Measure the ANZACs. The stories of these soldiers will help us answer these questions, and they will open up more questions for us and others. We look forward to you joining us in this research.

Evan Roberts

Further reading

Richard Steckel, “The formative period of the new anthropometric history