To help solve the problem of getting its vast inventory of analog holdings online, the Library of Congress has been turning to crowdsourcing, enlisting volunteers to read and transcribe documents. The latest completed project makes available thousands of letters written to Abraham Lincoln while he was president. For details, Federal News Network’s Jared Serbu spoke with the Library’s digital collections specialist, Carlyn Osborn, on the Federal Drive with Tom Temin.
Jared Serbu: Carlyn, I think that before we talk at all about the collection itself, take us through just a little bit, procedurally, how does this crowdsource transcription process work?
Carlyn Osborn: Absolutely. By the People is a volunteer engagement program where we invite the public to transcribe documents on the Library of Congress’s website. Since launching in 2018, we’ve done 18 public releases, which we call campaigns, where we upload materials onto our website, which is crowd.loc.gov, and those represent a wide variety of materials from the library, including the Rosa Parks Papers, letters to Teddy Roosevelt, Spanish legal documents from the 16th and 17th century, and letters to Lincoln, which just finished last month. So technically what happens is we bring material into our website, crowd.loc.gov, and then when all that material has been transcribed, we take those transcriptions and pull them back to the library’s main website. And when they’re back on the library’s main website, they become keyword-searchable and available for reading by screen readers. So it makes our collections both more discoverable and accessible.
Jared Serbu: And the Lincoln question that you just wrapped up that you mentioned, or I’m sorry, the Lincoln collection that you just wrapped up that you mentioned, tell us a little bit about that collection itself. As I understand it, a fair amount of President Lincoln’s papers had already been imaged and transcribed and then there was another subset that was imaged but not yet transcribed. Is that about right?
Carlyn Osborn: Precisely. So the Lincoln collection that we have at the Library of Congress is around 40,000 documents or items–we kind of use those words interchangeably. And there was a previous transcription project with Knox College in Illinois, where they are focusing on making transcriptions for letters that Lincoln had written himself. So our program filled in the gap by focusing on around 28,000 other pages that hadn’t been transcribed previously. So these 28,000 pages or assets, we call them, too, were transcribed by volunteers for the Library of Congress. And how our program works is we have these pages. One transcriber will go through and do the initial transcription for the page. And then we have a second transcriber go through it and they review it. So they either mark that page as complete–like it looks good, it’s ready to go–or they can reopen it to make more edits. And so that way, we have two pairs of eyes, at least, on every single document that gets transcribed. So we were just filling in the gap for what Knox College decided to not focus on.
Jared Serbu: I’ve got to say, looking at some of these images, they are incredibly hard to decipher to my eyes. So I’m certainly not going to criticize anyone’s accuracy. But how much did these volunteers’ transcription accuracy vary from person to person and how important is, you know, letter-perfect accuracy?
Carlyn Osborn: Yeah, it’s a great question. So, the goal of our program is to make these pages more discoverable and accessible for everyone. So we’re not aiming for perfection. We’re looking for stuff that’s good enough. And our program focuses on collections of the library that aren’t susceptible to really good machine reading, like you might have come across the term OCR or optical character recognition. So we pick items and documents that aren’t good for that. So that could include handwritten documents. Some of these documents, you know, they were originally on microfilm and so the scans aren’t really great. So the typeface isn’t read very clearly, some of the documents have bleedthrough so you can actually see, like writing from the other page. So that’s what we focus on. So anything that the transcribers or volunteers for our program, anything they can do to make these pages more discoverable is good enough for us. And in general, studies do suggest that human transcription is more reliable than OCR in the first place.
Jared Serbu: So now that this large Lincoln collection is discoverable, as you’ve looked through it–any personal favorites? Anything that’s particularly interesting to you that we might not have seen before?
Carlyn Osborn: Yes, one of my personal favorites is a letter sent in the 1830s to Abraham Lincoln. But it was sent to him in his capacity when he was serving as postmaster in New Salem, Illinois, and this letter was sent from a Mr. March in Portsmouth, New Hampshire. And it accounts of a tornado hitting down in New Hampshire. And I actually have the line right here if that would be interesting to read out for everybody.
Jared Serbu: Sure.
Carlyn Osborn: OK. So, here it goes. “On the night of 17th August, a tornado passed over this place, laid the fences flat, rooted up the trees, blew down corn and done other damage. The next morn, by daylight, as I was putting up my fence, two great wolves walked along unconcerned within 30 yards with me. I tried to scare them by taking off my hat and running towards them, but they would not click in their gait. They are the only ones I have seen.” So now that we have this transcription, and this transcription is already back on the library’s main website–and what this means is if you type the word “tornado” into the Library of Congress’s catalog or main search function at loc.gov, you’re going to get this transcription that was made by one of our volunteers.
Jared Serbu: How do you determine which sorts of collections are good candidates? I mean, you already mentioned things that are not susceptible to OCR. But in terms of types of content, how do you decide what to prioritize for this crowdsource project?
Carlyn Osborn: So we work very closely with curators in different parts of the Library of Congress. And actually, we rely on them to come up with proposals for content and collections. So we work with curators from the manuscript division, and they’re the ones that come up with proposals for materials from their collections that they think would be good candidates for our website. And it’s important to note that candidates for our website are materials that are already up on the library’s website. So I mentioned the Rosa Parks Papers, letters to Teddy Roosevelt. Those collections are already available at loc.gov. So curators look at what’s already available online, and then they tell us, “Actually, this would be a great collection for your website.”
Jared Serbu: And I imagine also for the process to really work, it needs to be the sort of collection that’s going to generate enough public interest that people are going to care enough to generate large numbers of people to actually do this work.
Carlyn Osborn: Absolutely. And we also like to think of this as an opportunity to focus on some of the smaller collections that are also available on the website. So it’s a good way to highlight some of our smaller collections, but then also get transcriptions for some of our heavy hitters like Teddy Roosevelt, Abraham Lincoln, and like Rosa Parks.
Jared Serbu: So you have done previous collections via crowdsourcing. I’m sure Lincoln is not the end. What else is in the pipeline?
Carlyn Osborn: So just this past week, we published 42,000 images of documents from the records of the National American Woman Suffrage Association. And this association was formed in 1890, and contributed to major suffrage victories out west at the end of the century. So this launch was timed to coincide with the centennial celebrations for the 19th Amendment. And already, or just this past about week and a half, we’ve had 300 people begin transcribing over 5,000 pages. And we’re so excited to see the enthusiasm and uptake on this one collection. And we’re also going to be involved with the National Book Festival this year, which is going to be all virtual. And the National Book Festival runs from Friday, September 25, through Sunday, September 27. And as a part of the National Book Festival, we’re going to be launching additional materials into two of our existing campaigns. So we’re going to be adding documents to Walt Whitman. And we’re also going to be adding additional documents into the historical Spanish legal documents campaign. So that’s coming up just this month. And then in October, we’re also going to be releasing our first campaign from our Rare Book division and we’re really excited about this one. It’s launching during the week of Halloween. And it’s going to include 4,600 pages from the Houdini collection about crystal and mirror gazing. So we have some big things coming up this year and we’re really excited to get these into the hands of everybody.
Tom Temin: Carlyn Osborn, digital collections specialist at the Library of Congress, speaking with Federal News Network’s Jared Serbu.