Classics, DH, IA

Digital Humanities, Dead Languages, and Real-World Web Business, or, How Do I Get Ovid on Twitter?

Note: this is cross-posted on the University of Texas Linguistics Research Center blog. After publishing, I had a really interesting Twitter conversation that also pointed me to some new resources.

I spent this weekend in Vancouver, at IA Summit. “IA” is short for information architecture, a discipline whose purpose is organizing web information for humans. That covers everything from making a website’s nav bar useful to creating a user-friendly database out of wacky government information. And the folks who do it aren’t necessarily people who code, by the way. This is its own tech discipline, and if you’re obsessively organized and enjoy thinking about categories, you’ll find your people here.

I was supposed to present a poster about organizing the classical corpus online. Unfortunately, before the conference, I had the kind of week that started with me losing my passport, and got steadily worse. I did get my passport, and made it to the conference, but that was about it; neither I nor my poster was in any shape to present. Luckily, my buddy Todd invited me to write a post for the LRC about my poster topic, with more room to develop my ideas. Of course I said, Yes please!

I have a Classics PhD and taught for a good ten years before I made the switch to software engineering and beyond. I care deeply about Classics, and just as deeply about how we put things on the web. I’m not, however, a Digital Humanities expert. I approach ancient material with the perspective I’ve gained from being in the modern tech biz, which I acquired after making my “conscious uncoupling” from academia. For that reason, and other reasons I’ll discuss, I think it’s worth thinking about how private-sector web development does (and does not) intersect with Digital Humanities (DH).

I do build classically-inspired web things, though. I put the Aristophanes translation I co-authored online, attempting to use appropriate W3C standards. This was difficult because they still haven’t decided what HTML poetry should look like. I used what was, at the time, a cutting-edge, Medium-inspired user interface (UI) for reading footnotes. A couple of years ago, I built a Vergil bot, inspired by the amazing artbots already out there. And I have tried building a few versions of an Ovid database. I like to combine my professional interests that way.

What we talk about when we talk about putting things on the web

Here’s what I’ve come to understand: When we’re talking about ancient materials, there’s a big difference between digitization and publication. Digitization, to me, means “get that stuff online, STAT!” So, we can digitize medieval manuscripts by putting pictures of each page online. Or digitize texts by uploading PDFs somewhere on the web.

That’s not publishing IMHO. While you’ve put the picture, or the PDF online, you haven’t published the actual material that lives inside the document. You could really argue it’s more akin to fancy photocopying (just with mass distribution), rather than to anything inherently digital; as far as the web is concerned, the material inside the photocopy (or PDF, or paper, or papyrus) is unseeable.

To make an artifact speak the native language of the web, we need to extract the text, marginalia, and maybe even illuminations into structured data that computers (or rather, people programming computers) can read. I know it may sound heartless to talk about ancient texts as “data”, but realistically, structured data is what the web is built on. PDFs are shareable, for sure, but they’re not re-mixable, for lack of a better word. If I’m making a website, for example, and I want to use the material inside the PDF, rather than the PDF itself, I’ll have to take steps to extract and transform the data, unless there’s a nice source of structured data already available.
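As a rough illustration of what “structured data” could mean for a poem, here is a minimal Python sketch. The `Poem` and `Line` classes and their field names are assumptions made up for this example, not an existing schema; the point is only that once text and notes live in named fields, any website or app can consume them as JSON.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Line:
    number: int  # position of the line within the poem
    text: str    # the text of the line itself


@dataclass
class Poem:
    # Hypothetical field names, chosen for illustration only.
    author: str
    work: str
    book: int
    poem: int
    lines: list[Line] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)  # e.g. marginalia or commentary


def poem_to_json(poem: Poem) -> str:
    """Serialize a poem, including nested lines, to a JSON string."""
    return json.dumps(asdict(poem), ensure_ascii=False, indent=2)


amores_1_5 = Poem(
    author="Ovid",
    work="Amores",
    book=1,
    poem=5,
    lines=[
        Line(1, "Aestus erat, mediamque dies exegerat horam;"),
        Line(2, "adposui medio membra levanda toro."),
    ],
)

print(poem_to_json(amores_1_5))
```

Unlike a PDF, this output can be filtered, re-ordered, or republished line by line without any extraction step.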

The goal of true digital publication, conceived in this way, is structured, accessible data. In terms of media paradigm shifts, you might think of it as the current equivalent of the printing press.

We want Tibullus and Ovid to corrupt the youth! PDFs probably won’t cut it, though. So, how do we do that? Well, webpages and apps and social media are an interesting place to start. But again, to make that happen, we need a certain kind of information readily available.

Back to the conference. I got the latest info on how Google’s bots read your website. I learned about mathematically improbable taxonomies. And I surprised a few people when I told them that Classics had done pretty well for itself in terms of data structuring, at least in XML. Perseus (a project at Tufts) digitized a great deal of the classical corpus in 1999. How do I know it’s 1999? Because the creation date is in the XML markup, as it should be.

But here’s the thing: in my opinion, Classics DH is still stuck in 1999. It’s great that there’s a web standard – TEI – that is the basis of structuring that data. But it’s not enough, at least coming from a non-DH perspective.

In the first place, web standards are changing rapidly. There’s JSON, which is a new kid in town, but one that a lot of folks like better than XML. There’s also the fact that TEI, as far as I can tell, doesn’t talk to any non-linguistics web schema organizations. In the context of scholarship, this makes sense, but it also biases the entire information structure towards a very, very specific audience.

Perhaps this is why, even as someone familiar with metadata and web development, I often find DH a bit disorienting, like I’ve entered an alternate web development universe. DH projects often use web standards that aren’t quite what I’d expect. It’s like Bizarro world, except less evil.

Different audiences want different things

When building something online, it’s really important to know your user. And there are multiple possible audiences for our humanities projects, with differing expectations. Art historians might want the highest-fidelity photo of a piece. Archaeologists might want the same – and want it in 3D, which is possible, but requires the use of proprietary, weird software that doesn’t run in the browser. Literary scholars want the text, with notes (and again, that might mean the marginalia from the documents, or later, other texts that comment upon the original text). Students might want a dictionary or commentary to help them make their way through the material. Everybody else wants a translation, probably, or something that gives them context enough to understand what’s going on, without wading through purely academic details.

In terms of putting stuff online, DH tends to assume experts. I know that many would argue with that, and I also know that the NEH prefers projects that serve the common good. I believe this is a good faith effort. But one of my points here is that we haven’t really built an architecture that is capable of serving the general public. And even the NEH grant announcements describe data projects as those that are “searching, analyzing, and understanding large bodies of material”, rather than structuring the existing data for public use.

One thing I learned at the IA Summit was that often you can’t organize information just once: two different audiences might require two different information systems. Etsy, for instance, had to use two different tagging systems: one for its makers (who are experts in the craft they practice) and another for its buyers (who are really shopping for an idea, without expert knowledge). An expert might make a beautiful cloche hat and list it for sale without ever using the word “hat”, which made the listing impossible for would-be hat buyers to find. It’s a familiar problem when you let experts run things. Lacking beginner’s mind, they create material that’s impossible to navigate for actual beginners.
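The two-vocabulary idea can be sketched in a few lines of Python. Everything here is hypothetical: the `BUYER_TERMS` mapping, the listings, and the function names are invented for illustration and say nothing about how Etsy actually implemented its tagging.

```python
# Hypothetical mapping from expert (maker) vocabulary to lay (buyer) vocabulary.
BUYER_TERMS = {
    "cloche": {"hat", "vintage hat"},
    "fascinator": {"hat", "headpiece"},
}


def buyer_tags(maker_tags):
    """Expand a listing's expert tags into the words a buyer might search for."""
    tags = set()
    for tag in maker_tags:
        tags |= BUYER_TERMS.get(tag.lower(), set())
    return tags


def search(listings, query):
    """Return titles whose maker tags or derived buyer tags match the query."""
    q = query.lower()
    results = []
    for listing in listings:
        maker = {t.lower() for t in listing["maker_tags"]}
        if q in maker or q in buyer_tags(maker):
            results.append(listing["title"])
    return results


LISTINGS = [
    {"title": "Hand-blocked wool cloche", "maker_tags": {"cloche", "wool felt"}},
    {"title": "Linen apron", "maker_tags": {"apron"}},
]

print(search(LISTINGS, "hat"))     # found via the buyer vocabulary
print(search(LISTINGS, "cloche"))  # found via the maker vocabulary
```

The maker never has to type “hat”; the second vocabulary is layered on top of the first rather than replacing it.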

Easier Parsed Than Published

It seems to me that the current state of DH is still very much focused on the concerns of experts, and especially on parsing what’s already there. There’s a really great Classical Language Toolkit, for instance, which is branched from a widely used Python library for parsing language. And that’s great! It may even interest NLP folks outside of Classics – but they’re also what I’d consider a niche audience.

I’m concerned with publishing the materials for a more general audience. And now, after this conference, I’m surer than ever that we need to think of public humanities as a separate project, with a separate information architecture.

I’m not kidding when I say I want an Ovid API. The only existing Classics API I know about is the Aeneid API, which my Vergil bot depends on. Full disclosure: I’m not a fan of Vergil. But I built the bot anyway, because that was the only Latin poetry API I could find.

One of the first things I’d expect from a discipline that wants to share its stuff is a good, solid API. So perhaps that is my main question, as a private-sector tech person: why aren’t there more Classics APIs?

Actually I’m pretty sure I know why, and here’s where we need to talk about taxonomies (just a fancy way to say “how we choose to organize things”). Classics already has a taxonomy, and as a trained Classicist, I know that if I want to read a certain poem by Ovid, I can look it up under “Ovid, Amores, Book 1, poem 5”. But that taxonomy isn’t any help to a layperson who just wants to read some poetry, or to an interested amateur who might want to do something creative with the data, or to a literary type who wants to easily republish one of the poems on their site.

And that’s a relatively simple example. Add in some other potential queries, like “Vergil, Aeneid, Book 3, line 1” or “Plato, Theaetetus, section 209d”, and we can see the lack of consistency: each work is divided into different kinds of units. Those searches also incorporate terms that aren’t unique. Both those things are bad, from a design perspective.

In the ancient world the “unique key” of a poem was its first line; since there were no titles, that’s how people knew what poem you meant. In the 20th century (quite modern by Classics standards) a taxonomy was developed for Linear B; a text’s location is built into the identifier, creating unique ids, so it’s actually not bad from a web perspective. All of this is to say, how we organize information is mutable, and recreatable, and the very foundation of what we’re doing online.
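One way to picture a more web-friendly identifier is to name every level of the citation explicitly, so that “book, poem”, “book, line”, and “section” can all live in the same scheme. The sketch below is a made-up convention for illustration, not an existing standard; `make_id` and `slug` are hypothetical helpers.

```python
import re


def slug(value: str) -> str:
    """Lowercase a name and collapse anything non-alphanumeric into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def make_id(author: str, work: str, *locators: tuple[str, str]) -> str:
    """Build an identifier from an author, a work, and (unit, value) pairs.

    Each locator names its own unit, e.g. ("book", "1") or ("section", "209d"),
    so works with different internal divisions share one format.
    """
    parts = [slug(author), slug(work)]
    parts += [f"{slug(unit)}-{slug(value)}" for unit, value in locators]
    return "/".join(parts)


print(make_id("Ovid", "Amores", ("book", "1"), ("poem", "5")))
print(make_id("Vergil", "Aeneid", ("book", "3"), ("line", "1")))
print(make_id("Plato", "Theaetetus", ("section", "209d")))
```

Because the unit is part of the key, “1” in one work can never be confused with “1” in another, which is the property the Linear B identifiers already have.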

So these are the things I wonder about. How do we put the Classics corpus online? Is a Classics API possible? Or even an API for all the poems by one author? What would that look like? And: Is the smallest data unit a single line of poetry? What about meter? How do we deal with that? To what degree do we want search to be a thing? And what about translations?
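To make those questions concrete, here is one possible answer sketched in Python, under loud assumptions: the URL shape `/work/book/poem[/line]`, the in-memory `CORPUS`, and the `handle` function are all invented for this example. It treats the single line as the smallest unit and stores meter alongside the text.

```python
import json

# Hypothetical in-memory corpus keyed by (work, book, poem).
# Each value is that poem's lines, in order.
CORPUS = {
    ("amores", 1, 5): [
        {"text": "Aestus erat, mediamque dies exegerat horam;", "meter": "hexameter"},
        {"text": "adposui medio membra levanda toro.", "meter": "pentameter"},
    ],
}


def handle(path: str) -> tuple[int, str]:
    """Resolve /work/book/poem or /work/book/poem/line to (status, JSON body)."""
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) not in (3, 4):
        return 400, json.dumps({"error": "expected /work/book/poem[/line]"})
    work, *numbers = parts
    try:
        book, poem, *rest = (int(n) for n in numbers)
    except ValueError:
        return 400, json.dumps({"error": "book, poem and line must be numbers"})
    lines = CORPUS.get((work.lower(), book, poem))
    if lines is None:
        return 404, json.dumps({"error": "poem not found"})
    base = {"work": work.lower(), "book": book, "poem": poem}
    if rest:
        n = rest[0]
        if not 1 <= n <= len(lines):
            return 404, json.dumps({"error": "line not found"})
        body = {**base, "line": n, **lines[n - 1]}
    else:
        body = {**base, "lines": [{"line": i, **l} for i, l in enumerate(lines, 1)]}
    return 200, json.dumps(body, ensure_ascii=False)


print(handle("/amores/1/5/2"))
```

Translations and search are left out here on purpose; each would be a further design decision rather than a detail.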

It’s a terrible mess. And/or a really interesting information architecture challenge.

The Goal: Public Facing Architecture

I love JSTOR’s Shakespeare app and its related API. I think it’s everything a public-facing humanities project should be. Granted, there are only 38 plays and they’re already in English. But that’s okay! In tech lingo, a great MVP starts small, as an example for other versions moving forward.

I’d like something similar for Classics and other dead languages. I’d like well-documented APIs and truly public web projects. But that won’t happen with the current state of information architecture (as far as I can tell — happy to hear about what I’m missing!). DH could certainly learn from modern web practices, just as modern web practices might learn from Classics – which was, after all, a very early adopter of technology. Even before the Perseus project there was Pandora. Not the music app, but the HyperCard-based dictionary tool. It was an ingenious hack at the time.

Given that the humanities’ survival is threatened more than ever – and yes, we say that every year, but it’s also true – it’s time to focus on tools for non-experts. In the normal web development process that would mean you’d get your subject matter experts (the Classicists) and your tech experts (the information architects/user researchers) and your audience (regular people interested in history, maybe?) together to hash out what, exactly, this thing would look like.

But I don’t see the normal web process happening in DH.

Which is precisely my point: it’s time to think about ancient material with an eye towards modern web development. That first means structuring ancient information in the modern format that’s expected in the private sector. Working with the Linguistics Research Center material, I’ve played with extracting the pages via Python, just so we can deliberately restructure them in a sane database – or maybe even an app! Who knows? I’ve done the same with Ovid: I’ve used Python to turn Perseus’ XML into a simplified data structure, and even a database, just to see what will happen. In both cases I’m trying to take what’s there and make it more generally usable data according to the modern web development process that I understand. I consider this my hobby, though I’d like to see it become a more regular part of DH or tech discussions.
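For readers curious what that kind of experiment might look like, here is a minimal sketch using only Python’s standard library. The `SAMPLE_XML` below is a deliberately simplified, hypothetical stand-in: real Perseus files carry headers, namespaces, and much richer markup. The table layout and function names are likewise assumptions for illustration, not the approach described above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup; real TEI files are considerably richer.
SAMPLE_XML = """
<text>
  <body>
    <div type="book" n="1">
      <div type="poem" n="5">
        <l>Aestus erat, mediamque dies exegerat horam;</l>
        <l>adposui medio membra levanda toro.</l>
      </div>
    </div>
  </body>
</text>
"""


def extract_lines(xml_text):
    """Flatten nested book/poem/line markup into (book, poem, line, text) rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.iter("div"):
        if book.get("type") != "book":
            continue
        for poem in book.iter("div"):
            if poem.get("type") != "poem":
                continue
            for number, line in enumerate(poem.iter("l"), start=1):
                text = "".join(line.itertext()).strip()
                rows.append((int(book.get("n")), int(poem.get("n")), number, text))
    return rows


def load(rows):
    """Store the rows in an in-memory SQLite database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lines (book INTEGER, poem INTEGER, line INTEGER, text TEXT, "
        "PRIMARY KEY (book, poem, line))"
    )
    conn.executemany("INSERT INTO lines VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load(extract_lines(SAMPLE_XML))
for row in conn.execute("SELECT book, poem, line, text FROM lines ORDER BY line"):
    print(row)
```

Once the lines sit in a table with a composite key, questions like “give me poem 1.5” or “find every line containing a word” become ordinary queries.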

I’m heartened by the fact that Sappho is now on Twitter, as is Linear B. I think these are really great goal posts in thinking about Classics as public humanities. But the question is: will Classics (and DH) take up the challenge of architecting its material for further projects like this? I certainly hope so.