
October 6, 2017

Espy Project Fall Update

It's well past time for an update on the Espy Project. We've done a lot over the past 6 months, working to make the stories of the over 15,000 people executed by the state accessible to the public. There is still plenty to do, but we're off to a great start. Here is what we've completed so far.

We've successfully digitized almost 33,000 sides of index cards and started on the over 115,000 pages of reference material. Working with an outside vendor, we shipped 21 boxes of index cards and 46 record boxes out for scanning. All of the index cards have been completed and reviewed, as well as 3 of the 46 boxes of reference material. This gives us enough to start moving along, creating metadata, and making everything available to the public.

In the past, once we received the digital images back from the vendor, we would list and link them in a finding aid, or upload everything into a digital asset management system. In this case, we think these records have the potential to be used in many different ways, so we aim to make them available through a modern web application and also expose the raw data through an open API. Expediting this process required some rethinking of how we digitize special collections material.

Some of the index cards documented one individual on only one side, while others used the back side or even multiple cards to show all of the information Espy was able to find. Since digitization required someone to manually handle each card, it made the most sense to record how many card-sides documented each person during the scanning process. Since the scanning was done by an outside vendor, we had to work with them to develop a system where this information could be documented in the filenames for the images. We developed an "a, b, c" numbering system that was simple enough for the vendor, but still told us clearly which image files described which people.
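To illustrate how side letters in filenames can be turned back into per-person groups, here is a minimal Python sketch. The filename pattern (a record identifier followed by a single side letter, such as "AL_0001a.tif") and the function name are assumptions made for this example; the vendor's actual naming scheme may look different.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a record identifier followed by one side letter,
# e.g. "AL_0001a.tif". The project's real filenames may differ.
SIDE_PATTERN = re.compile(
    r"^(?P<record>.+?)(?P<side>[a-z])\.(?:tif|tiff|jpg|png)$"
)


def group_card_sides(filenames):
    """Map each record identifier to its image files, ordered by side letter."""
    groups = defaultdict(list)
    for name in filenames:
        match = SIDE_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        groups[match.group("record")].append((match.group("side"), name))
    # Sort each group by side letter so "a" comes before "b", and so on.
    return {
        record: [name for _, name in sorted(sides)]
        for record, sides in groups.items()
    }
```

Given a folder listing, this returns one entry per person with the card sides in order, which is the shape needed to build one row per individual later on.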

All of these records have connections to the Watt Espy papers finding aid, so we used the ArchivesSpace API to extract the unique IDs for the records' parents so we can maintain these links. We wrote preprocessing scripts using open source tools like ImageMagick and Tesseract-OCR to make CSV files for all the data that can be seeded into a database. We then had a CSV for each series of material that lists the filenames for each individual, OCR text, a state abbreviation, and the ArchivesSpace parent ID. Each CSV has now become a database table. These scripts are available in the Espy Project GitHub repository.
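As a rough sketch of what one of these per-series CSV files could look like, the Python below writes one row per individual with the four pieces of information listed above. This is not the project's actual script: the column names, the "|" separator for multiple filenames, and the sample values are all made up for the example, and the OCR step is passed in as a plain function so that a real Tesseract call could be swapped in.

```python
import csv
import io


def write_series_csv(records, ocr, out):
    """Write one CSV row per individual.

    records: iterable of dicts with "files", "state", and "parent_id" keys
             (hypothetical field names chosen for this sketch).
    ocr:     callable that takes a filename and returns its OCR text,
             for example a wrapper around Tesseract.
    out:     any writable text file object.
    """
    writer = csv.writer(out)
    writer.writerow(["filenames", "ocr_text", "state", "aspace_parent_id"])
    for record in records:
        # Join the OCR text of every card side into a single field.
        text = " ".join(ocr(name).strip() for name in record["files"])
        writer.writerow(
            ["|".join(record["files"]), text, record["state"], record["parent_id"]]
        )


# Usage with placeholder data and a stand-in OCR function.
def fake_ocr(filename):
    return "text from " + filename + "\n"


sample = [
    {"files": ["NY_0001a.tif", "NY_0001b.tif"], "state": "NY", "parent_id": "example-id"}
]
buffer = io.StringIO()
write_series_csv(sample, fake_ocr, buffer)
```

Keeping the OCR step as a parameter means the CSV logic can be checked quickly without running OCR over thousands of images.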

We've partnered with ICPSR to use the Espy File data. The Watt Espy papers were the subject of a University of Alabama study in the 1980s that encoded Espy's records and created a dataset that's now managed by ICPSR. Since there was already detailed metadata, and ICPSR was gracious enough to give us permission to use it, we will be able to provide much more extensive information than we wrote in the original CLIR grant. After consulting with the National Death Penalty Archive Advisory Board, we think there is some possibility of improving the accuracy of the original dataset. We are also hoping the addition of the reference material will provide more meaningful context than lines in a spreadsheet. We are including the ICPSR identifiers in our metadata as well, so it should be easy for researchers to make connections. After the project is complete we will also offer our data back to ICPSR, which will hopefully broaden its use even more.

We've gained institutional skills to support open source web applications. We have some awesome tools here at UAlbany, namely a virtualized data center that can spin up servers for us. However, it's often challenging to get the time and expertise necessary to implement complex tools. These skills can be very costly for academic libraries. We are also a Windows campus which means our authentication and security network uses Windows tools and our Library Systems staff is primarily Windows-trained.

All this poses some large barriers to managing open source projects, which require both skills and labor time. Still, we feel that we get tremendous value from open source communities. They share a lot of the same priorities and values we have as libraries and archives. The transparency of open source helps us maintain provenance, and contributing to a community aligns well with our research and service obligations. The flexibility of open source tools is also important, since the information architecture of archives can be quite complex and there's no single service that can do everything for us. Most importantly, we feel that the capability to fully understand and actively engage with our tools empowers us, and gives us the agency to make technology work for our human needs, rather than passively adjusting our work to more generic systems. We feel that increased capacity to implement and maintain open source tools here at the University Libraries is one of the key benefits to this project.

We've developed a web application for rapidly creating detailed metadata records. We quickly discovered that it was going to be a big challenge to connect up to six different information sources into complete and useful metadata records in a way that didn't rely on large amounts of human labor doing repetitive tasks. To meet our transparency goals, we also needed to document where all the information in our final records came from. The record for William Kemmler, for example, will have front and back images for a small (3x5) index card, a larger (4x6) index card, a large stack of reference material, data from the Espy File, and relationships with both Series 1 and Series 2 in the Watt Espy papers finding aid.

Normally we wouldn't create a custom web application for every digitization project, but creating the Espy Metadata Tool allowed us to ramp up our skills with Ruby on Rails and web applications in general. Now we almost understand how Samvera and Hyrax work, and we have the skills necessary to implement a Hyrax-based repository for our special collections storage and adapt it to our local needs and workflows. The Espy Metadata Tool has some features that save a lot of time, most notably a Redis-based autocomplete that lets us quickly link a random newspaper article with a record by the individual's name or date of execution.
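The idea behind this kind of autocomplete can be shown without Redis at all. The sketch below is an in-memory stand-in written in Python, not the tool's Rails and Redis code: it keeps names and dates in a sorted list and finds everything that starts with the typed prefix, which is similar in spirit to a lexicographic range lookup. The class name, method names, and record IDs are hypothetical.

```python
import bisect


class PrefixIndex:
    """In-memory prefix lookup over names or dates (a stand-in for Redis)."""

    def __init__(self):
        # Sorted list of (normalized term, record id) pairs.
        self._entries = []

    def add(self, term, record_id):
        """Index a term, such as a name or an execution date, for a record."""
        bisect.insort(self._entries, (term.lower(), record_id))

    def complete(self, prefix, limit=10):
        """Return up to `limit` (term, record id) pairs starting with prefix."""
        prefix = prefix.lower()
        # A one-element tuple sorts before any pair with the same first item,
        # so this finds the first entry whose term is >= the prefix.
        start = bisect.bisect_left(self._entries, (prefix,))
        matches = []
        for term, record_id in self._entries[start:]:
            if not term.startswith(prefix):
                break  # entries are sorted, so no later entry can match
            matches.append((term, record_id))
            if len(matches) == limit:
                break
        return matches
```

Indexing both the name and the date of execution for the same record ID means typing either one leads back to the same person.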

It was also really easy to make things like keyboard shortcuts. Here we really got to see the value of using common open source tools. For essentially everything we wanted there was already an existing library or a great example with some awesome documentation. Although these features were really cool and fun to work on, the real benefit is that they save us a lot of repetitive tasks. Now the computer does all of the tedious work, saving our labor for the still really challenging and intellectual processes that are required to make metadata records.

We learned a lot about linked data and data modeling in general. Since the Samvera system we plan to implement alongside this project uses native linked data triples, we looked at how we could expose all the data created over the course of the project using linked data standards. The benefit here is that the data itself would be self-describing so that computers could understand the data's context and relationships.

In practice, this posed some difficult challenges. Our first attempt to use the Portland Common Data Model (PCDM) to model all of this data resulted in a model that was much more complex than useful. Once we started working on the metadata application and putting the model into practice, it became very obvious how much of it was unnecessary and how much easier a simpler model was to work with. The model we're currently using with the metadata application also might not be our final version. While it contains all of the information and makes sense for linking all that information together, it is very unintuitive as a final record. We have to export the data at the end anyway to manipulate the master image files, for things like image rotations, so we decided to rework the model a bit later in the process, once we have more experience working with the data. Since we're envisioning that researchers will work with the data directly, making the model clean and intuitive is a usability concern.

We spent lots of time talking about metadata fields, what is useful to document and how we would present them. There are many cases when the data in the Espy File implicitly demonstrated the priorities, values, and mental framing of the original researchers. Some of the decisions we made are both complex and imperfect, and we will make another update soon that talks about the metadata we're capturing and not capturing and our process of making these decisions.

We've decided not to use linked data vocabularies to encode descriptive metadata. When we started to look at linked data vocabularies for our descriptive metadata, it quickly became clear how problematic they are for describing this type of detailed metadata, which requires really precise meaning. Overall, most vocabularies outside of the most commonly used ones are poorly documented, and vocabularies are often fudged or used imperfectly. This failed to provide the nuance we require for this project. A practical example is that crime was encoded as "Crime Committed" in the Espy File, yet we feel that it's much more appropriate to call that field "Crime Convicted of." Yet, there does not appear to be any existing vocabulary that effectively describes the legal terms associated with capital punishment. We concluded that to effectively encode the descriptive metadata as linked data would require us to create, document, promote, and maintain our own custom vocabulary.

Now there is a case to be made that this is exactly what we should have done. We have a strong institutional commitment to the National Death Penalty Archive here at UAlbany, and some great access to experts on our advisory board and in the School of Criminal Justice. If there is a real need for a linked data vocabulary to describe the legal terms associated with capital punishment, we are strongly situated to create one.

We actually found that developing a linked data vocabulary posed some interesting conflicts with our mission as archivists. Rather than just digitizing, describing, and providing access to information, we would be attempting to create an empirical set of information from our more limited archives. Not only would this be a much higher standard to meet, it would be without precedent for us. We would be going beyond our mission as an archives and on to something new. In itself, this wouldn't necessarily be a bad thing, but it was a commitment we really were not prepared to make.

When we stepped back to look at the big picture, it was clear that for all this effort, encoding the descriptive metadata as linked data really didn't add that much practical value. While we anticipate computational use for this collection, that need can be filled just by exposing the data and describing it well. Researchers typically want to understand the provenance themselves, and are unlikely to merely trust and rely upon linked data vocabularies for this context. Another practical concern is that researchers are likely to interact with this data using methods that they are already comfortable with. When we consider the common ways researchers work with data right now, we see that almost all of these methods lack the ability to leverage the added value of linked data.

This isn't to say that linked data has no value, just that we found that there is a significant cost if you're doing anything more than just using Dublin Core or MODS. The more complex and precise the data is, the higher that cost is. In our case the value added seems to be more theoretical, task-specific, or far off in the future. Many of the use cases for linked data can also be met with a little effort just by exposing open and well-documented data over an API.

We complained about Fedora. Relenting on using vocabularies for descriptive metadata creates some added technical challenges. In asking around we've heard some opinions on Fedora 4's move to native linked data and the relationship between Fedora and Samvera. Many repositories are struggling with the move, and a few have even reluctantly stuck with Fedora 3. Others seem to think that part of the problem is how Samvera uses Fedora. While replacing the database of a web application with Fedora makes integrating it very easy, Fedora isn't necessarily designed to function in this way. Another issue is that swapping a database with Fedora in itself isn't a preservation plan. Fedora still has some awesome preservation features built-in, but it also doesn't quite merge well with how our university currently manages permanent data, for better or worse.

None of these challenges are insurmountable. Generally, it's always a much better idea to stick with what everyone else is doing rather than branching off alone, particularly with our limited resources and expertise. There is also real value to the Samvera/Fedora stack, even if there are some imperfections. Outside of our descriptive metadata example, it makes managing other metadata in linked data really easy. For example, we can merely deposit content and Fedora manages linked data PREMIS events automatically. We plan on storing our descriptive metadata in Fedora as content, as well as in a regular database using ActiveRecord.

We are now linking together all of the complexity of the information we have into detailed, transparent, and effective metadata records. Meanwhile we will continue to test and experiment with Hyrax and work with Library Systems to configure our production servers. We have learned a lot in the past 6 months, and we hope we're on the right track with our thinking. We always welcome feedback, so if you have any useful thoughts, please let us know.