
The short-and-sweet TEI handout

by elotroalex on August 12th, 2010

For the past few weeks, I’ve been working with some colleagues to develop a model for incorporating the full digitization of a scholarly work into a literature course. The process goes from scanning to OCR to tagging. The goal is for students to learn some of the basics of digital humanities while producing image-text PDFs and TEI-lite versions of works that might be publishable online in the future. Below is a sample of the student handout. This version of the handout is meant for the tagging of a 200-300 page scholarly work to be done by a team of 4-5 students. You can download and use a full version of the handout here. In the next few weeks, I will be working on developing a similar model meant for primary texts, so stay tuned.

 

XML and books

XML stands for eXtensible Markup Language and, as the word “markup” implies, it is a tool used to describe data. HTML, which may be more familiar to you, shares many similarities with XML, most importantly the use of tags enclosed in angle brackets (< >). The “data” in HTML consists of instructions for browsers; in XML, on the other hand, there is no predefined use or vocabulary for the tags (hence the “eXtensible” in XML). For example, if you own a pink Chihuahua named Pepe, you could “express” it in XML this way:

<dog type="mine" name="pepe">
  <color>pink</color>
  <breed>Chihuahua</breed>
</dog>
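Because the tags follow strict rules, a program can read the example above as structured data. Here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module; the function name describe_dog is made up for this illustration. Note that attribute values must be wrapped in straight quotation marks, since curly quotes are not valid XML.

```python
import xml.etree.ElementTree as ET

# The Pepe example, written with straight quotes so that it parses.
DOG_XML = """
<dog type="mine" name="pepe">
  <color>pink</color>
  <breed>Chihuahua</breed>
</dog>
"""


def describe_dog(xml_text):
    """Return the dog's attributes and child elements as a plain dict.

    describe_dog is a hypothetical helper for this handout, not part of
    any library.
    """
    dog = ET.fromstring(xml_text.strip())
    return {
        "tag": dog.tag,                   # name of the outermost element
        "name": dog.get("name"),          # attribute on <dog>
        "type": dog.get("type"),          # attribute on <dog>
        "color": dog.findtext("color"),   # text inside <color>
        "breed": dog.findtext("breed"),   # text inside <breed>
    }


if __name__ == "__main__":
    print(describe_dog(DOG_XML))
```

The attributes (type, name) describe the dog element itself, while the nested tags (color, breed) hold their own content; both are reachable by name once the file is parsed.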


TIP: For a longer introduction to XML, visit W3Schools.

In a sense, all texts can be said to contain data of a certain kind. Literature and criticism are no exception to this rule. XML helps you name and organize that data. With XML, the possibilities are legion: we could, for example, name the kinds of content we find (<metaphor>, <character>, etc.), describe the physical attributes of a book (<paper>, <ink>, etc.), record how a text is laid out (<column>, <page_break>, etc.) or mark the logical units of a text (<line>, <paragraph>, etc.).
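To make this concrete, here is a small sketch of what naming content buys you: once a passage is tagged, the tagged pieces can be pulled out by name. The sample sentence and the helper names tagged_spans and plain_text are invented for illustration, and the tags are the made-up ones from the paragraph above, not TEI tags.

```python
import xml.etree.ElementTree as ET

# A hypothetical line of text, tagged with the invented tags from above.
SAMPLE_LINE = (
    "<line>The <character>captain</character> was "
    "<metaphor>a lighthouse in the storm</metaphor>.</line>"
)


def tagged_spans(xml_text, tag):
    """Return the text of every element named `tag`, in document order."""
    root = ET.fromstring(xml_text)
    return ["".join(el.itertext()) for el in root.iter(tag)]


def plain_text(xml_text):
    """Return the passage with all tags removed."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


if __name__ == "__main__":
    print(tagged_spans(SAMPLE_LINE, "metaphor"))
    print(tagged_spans(SAMPLE_LINE, "character"))
    print(plain_text(SAMPLE_LINE))
```

The tags add information without changing the words: stripping them gives back the original sentence.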

Because there are so many possibilities, scholars and scientists all over the world have agreed to use standards in their fields. In digital humanities, the most important standard set of predetermined tags, or tag-set, is the one provided by the Text Encoding Initiative (TEI). In this class we will be using a smaller subset of that standard, called TEI-lite, to introduce you to the practice of tagging.

TIP: To deepen your knowledge of TEI and TEI-lite, you can explore TEI by example or read the TEI-lite documentation.

A basic TEI file includes a text (the linguistic content) and metadata (information about the text). The first three tags you will learn already express this basic structure. The topmost tag, <TEI>, contains all other tags. The tag that holds the information about the text is called <teiHeader>; it always precedes the tag that holds the text itself, appropriately named <text>.

The overall structure of the TEI file then looks something like this:

<TEI>
  <teiHeader>
    [information about the text]
  </teiHeader>
  <text>
    [the text itself]
  </text>
</TEI>
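If you want to check that a file follows this outline, a short script can do it for you. The sketch below is only an illustration: the function name check_tei_outline is made up, and it tests nothing beyond the three tags introduced so far. Real TEI files usually declare the TEI namespace (http://www.tei-c.org/ns/1.0) on the <TEI> tag, so the sketch strips any namespace before comparing names.

```python
import xml.etree.ElementTree as ET

# The outline from the handout, with placeholders instead of real content.
TEI_SKELETON = """
<TEI>
  <teiHeader>[information about the text]</teiHeader>
  <text>[the text itself]</text>
</TEI>
"""


def local_name(tag):
    """Drop a leading '{namespace}' prefix, if the parser added one."""
    return tag.split("}")[-1]


def check_tei_outline(xml_text):
    """Return True if the root is <TEI> and it contains exactly
    <teiHeader> followed by <text>.

    Hypothetical helper for this handout; it does not validate a file
    against the TEI schema.
    """
    root = ET.fromstring(xml_text.strip())
    children = [local_name(child.tag) for child in root]
    return local_name(root.tag) == "TEI" and children == ["teiHeader", "text"]


if __name__ == "__main__":
    print(check_tei_outline(TEI_SKELETON))
```

A file that puts <text> before <teiHeader>, or that leaves one of them out, fails the check.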
