Five tips for digitizing handwritten data
For more than a decade, Christie Bahlai has been part of a long-running survey of ladybirds. Each summer, she and other scientists send students from their laboratories to a site on Gull Lake, Michigan, about 220 kilometres west of Detroit, to monitor 14 beetle species. Their goal: to track how invasive species are affecting native populations.

The ladybird project has been running since 1989, and for 20 years had been directed by Douglas Landis, an entomologist at Michigan State University in East Lansing and Bahlai’s former postdoctoral adviser. But last year, Landis decided to retire. In December, he reached out to Bahlai, a computational ecologist at Kent State University in Ohio, to ask whether she wanted the many boxes of handwritten data sheets stored in his laboratory.
Bahlai already had digital scans of the documents. Still, she wanted the originals. “What if there’s a note on the data that didn’t come through on the scans, or some other crucial context?” Bahlai explains. “And so, I received my inheritance over Christmas vacation.”

Generally, “the easiest and most productive” way to compile data is to type them directly into a spreadsheet, says Miguel Acevedo, an ecologist at the University of Florida in Gainesville. But using a computer, tablet or even a smartphone in the field isn’t always practical. The Puerto Rican rainforest, where Acevedo studies malaria infection in lizards, is prone to unexpected downpours, so his team logs its data in pencil-drawn tables in waterproof notebooks. Bahlai and her colleagues record species counts by hand while working in dusty cornfields, their fingers often covered in sticky insect-trap goo.

To be useful, handwritten data must be digitized into a form that can be analysed. But because this is one step removed from data collection, the process is rife with potential for error. Whether carrying over data manually or using software tools such as optical character recognition (OCR), researchers need to think about how to keep damage to a minimum. Here are five tips to do just that.

Make a digitization integrity plan
One of the first things Acevedo tells his students is that correcting mistakes in data becomes an order of magnitude more difficult with every step towards publication. He calls this the 1/10/100 rule: incorrectly writing down a lizard’s length in millimetres instead of centimetres is easily fixed — a metaphorical $1 mistake. But the cost jumps to $10 if the data point slips through the digitization process, and to $100 after it’s been analysed. Standardized protocols and workflows help to prevent such errors and minimize the cost, he says.

In Bahlai’s lab, a “meticulous and reliable” student volunteer transfers the data from the original paper sheets to Google Docs. They annotate anything that they’re unsure of — a smudged number, for example — and tag Bahlai, who will take a closer look. After a second student double-checks the data, Bahlai transfers them into a spreadsheet, for more in-depth quality checks.

Acevedo’s set-up is different: students work in pairs, with one reading out the data and the other typing them in. He also insists on including in each notebook a metadata page that contains acronym definitions, units of measurement and other elements. “If somebody is looking at notebooks 20 years from now,” he says, “they’ll know exactly what they’re looking at.”

Here’s another must-do, says Joel Correia, a human–environment geographer at Colorado State University in Fort Collins: invest the time and resources upfront to train the people doing the fieldwork. Correia studies the social and ecological effects of long-term land-stewardship practices in three Indigenous nations in the Ecuadorian Amazon. His team teaches members of those communities social-science research methods, such as designing and conducting interviews and surveys in their local language. In such multilingual, multicultural contexts, he says, having shared clarity around the concepts underlying the research is crucial, especially when taking written field notes that will be translated and digitized.

Back up your paper, ASAP
Once you’re back from the field, do some rough digitization as soon as you can, Correia advises. Scanning your notebooks to PDFs, or even photocopying them, will safeguard you from the pain of seeing stacks of interviews destroyed by rain, fire or other unforeseen events. “I have not had that experience, but I have certainly heard of other people who have,” he says.

Use several pairs of eyes
One common way of reducing errors is to have several people input the same data, and then correct any inconsistencies between the versions. How many pairs of eyeballs do you need? Make it an experiment, suggests Acevedo: test your error rate with different numbers of double-checkers, and find the point of diminishing returns.
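The comparison step itself is easy to automate. Below is a minimal sketch in Python that reconciles two independently keyed transcriptions of the same data sheet; the file names and layout are hypothetical, not part of any of the researchers’ actual workflows.

```python
"""Minimal sketch: reconciling double-keyed data entry.

Assumes two CSV files ('entry_a.csv' and 'entry_b.csv') holding the
same records, typed independently by two people. File names and
column layout are hypothetical.
"""
import csv

with open("entry_a.csv", newline="") as fa, open("entry_b.csv", newline="") as fb:
    rows_a = list(csv.reader(fa))
    rows_b = list(csv.reader(fb))

if len(rows_a) != len(rows_b):
    print(f"Row counts differ: {len(rows_a)} vs {len(rows_b)}")

# Flag every cell on which the two transcriptions disagree, so that a
# third person can settle it against the original paper sheet.
for i, (row_a, row_b) in enumerate(zip(rows_a, rows_b), start=1):
    for j, (a, b) in enumerate(zip(row_a, row_b), start=1):
        if a.strip() != b.strip():
            print(f"Row {i}, column {j}: {a!r} vs {b!r} -- check the paper sheet")
```

Each disagreement goes back to the original sheet for a human decision, in the spirit of the annotate-and-double-check workflows described above.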
For her research, climatologist Linden Ashcroft often has to digitize historical sources, such as this page of climate data from 1837. Image courtesy of the National Archives of Australia. NAA: PP430/1, VOLUME 4.
But don’t overlook the human element, says Linden Ashcroft, a climatologist at the University of Melbourne, Australia. Ashcroft has run community-science efforts to digitize handwritten records in farmers’ diaries and other historical sources, some going back 200 years. She says that the World Meteorological Organization recommends that such data be double- or triple-keyed — that is, input independently by two or three people. But she knows of projects involving as many as eight individuals. “Is that really a good use of people’s time?”

A good rule of thumb, Ashcroft suggests, is to calibrate your error rate to your project goals. “If you’re doing a deep dive into the weather and climate of a particular place, you want to lovingly correct the data,” she says. But if your data will be just a few of a million entries in an international database that researchers use to predict weather patterns, a slightly higher error rate probably won’t affect the outcome.

Home in on outliers
If you are working with numbers, you can program your software to flag outliers, improperly formatted values and seemingly illogical data (a minimal sketch of such checks appears below). For instance, Acevedo recalls discovering that lizards measured in one year were an order of magnitude smaller than usual. “I was there, so I knew that the lizards were not particularly small.” After he examined the notebooks from that period, he saw in the metadata that the numbers were recorded in millimetres rather than centimetres, and corrected the data.

But that approach doesn’t always work, Ashcroft cautions. Outliers in her data can reflect unusually heavy rain or aberrations in atmospheric conditions, such as temperature or air pressure — real variation that is simply unexpected. “You don’t want the statistical test to kick those [values] out, because extremes are how we’re going to be affected by climate change,” she says.

Try OCR (and other software)
OCR software can be used to convert scanned images into machine-encoded text. Many such tools are available — and most of them can successfully capture data sets in which handwritten text and numbers are written clearly and do not bleed out of their designated columns.

But off-the-shelf software often falls short when applied to historical handwriting, in which s’s might look like f’s, for example, says Stuart Middleton, a computer scientist at the University of Southampton, UK. It also performs poorly in the face of image noise, such as creases or shadows in a scanned image, or text that spills over from one column of a table into another. In the documents he works with, he says, “there are all sorts of horrors going on.”
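Picking up the automated checks described under ‘Home in on outliers’, here is a minimal sketch in Python. The file name, column name, units and plausible range are hypothetical examples, not values from Acevedo’s project.

```python
"""Minimal sketch: flagging badly formatted values and outliers.

The file 'lizards.csv', the column 'length_cm' and the plausible
range below are hypothetical.
"""
import csv

PLAUSIBLE_RANGE_CM = (3.0, 15.0)  # assumed plausible lizard lengths, in cm

with open("lizards.csv", newline="") as f:
    # Row 1 of the file is the header, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(f), start=2):
        raw = row["length_cm"]
        try:
            length = float(raw)
        except ValueError:
            print(f"Line {line_no}: badly formatted value {raw!r}")
            continue
        if not PLAUSIBLE_RANGE_CM[0] <= length <= PLAUSIBLE_RANGE_CM[1]:
            # Flag, don't delete: as Ashcroft notes, an extreme value
            # may be real variation rather than a transcription error.
            print(f"Line {line_no}: {length} cm is outside the plausible range")
```

A flagged value such as 0.9 cm would prompt exactly the kind of notebook check that led Acevedo to his millimetres-versus-centimetres fix.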
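To give a flavour of the off-the-shelf route, here is a minimal sketch that runs a pre-trained handwriting model from Hugging Face over a scanned image. It assumes the transformers and Pillow libraries and the microsoft/trocr-base-handwritten checkpoint; these exist, but they are offered as an illustration, not as the software any of the researchers quoted here use. Models of this kind transcribe one line of text at a time, so a full page must first be cut into line images.

```python
"""Minimal sketch: zero-shot handwriting recognition with TrOCR.

Assumes the Hugging Face `transformers` library, Pillow, and the
`microsoft/trocr-base-handwritten` checkpoint. The input file name is
hypothetical, and the image should show a single line of text.
"""
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# One cropped line from a scanned data sheet (hypothetical file name).
image = Image.open("notebook_line.png").convert("RGB")

# Encode the image, generate token IDs, then decode them into text.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)  # the transcription still needs the human checks above
```

When the output is garbled, that is the cue to move on to the approaches described next.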
When off-the-shelf tools fall short in this way, researchers with a bit of computer-science savvy can try different OCR models. Those models — available on the open-source machine-learning platform Hugging Face, for instance — are generally pre-trained on a wide array of images, but feeding them training images that are similar to your data could improve their performance, says Middleton. Scientists with advanced skills in coding and artificial intelligence can also modify the networks to better fit their projects. Middleton’s team is developing more-advanced, multistep OCR solutions for working with historical weather data, including new ways of training, as well as image post-processing.

Options exist for digitizing other data types, too. Eliza Grames, an integrative biologist at Binghamton University in New York, uses historical data — largely graphs and charts from studies in the late nineteenth and early twentieth centuries — to map long-term insect-population trends. A program called metaDigitise (and a similar, browser-based program, WebPlotDigitizer) allows her to redraw the plots and calculate the underlying data. She also uses Inkscape, an open-source alternative to Adobe Illustrator, to digitize old species-range data into a format that is readable by geographic-information-system mapping software.

OCR still requires extensive expert oversight to clean up irregularities and check for errors, Ashcroft warns. She prefers to harness the efforts of volunteers around the world. “To me, the historical weather observations are a really valuable opportunity to engage people with climate science in a fun and easy way,” she says. “People get to be a part of the story.”

And OCR is not always worth the trouble. For smaller projects, scanning every page and checking for errors after processing it through software might not yet be more efficient than having students do the job by hand, Acevedo says. But at the rate software is advancing, he says, that could soon change. “Maybe if we have this talk in 2025.”