Šapinuwa Tablets

Yesterday Alwin Kloekhorst, our professor of Anatolian linguistics, told me about a Web site where a large corpus of previously unpublished Hittite tablets is now available. The tablets come from the excavations in Šapinuwa between 1990 and 2020.

Photo by Klaus-Peter Simon, CC-BY-SA 3.0

The tablets are published as PDF files that can only be downloaded one by one, so I wrote a couple of scripts to download the data, extract the text from the PDFs, and perform some initial analysis. The text extraction was a little less trivial than a regular PDF-to-text conversion, because the transliteration of Hittite texts relies on text formatting (italic and superscript), which needed to be preserved in the extracted text. Thankfully, pdfplumber worked great for extracting the text and layout information.
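
For illustration, here is a minimal sketch of how formatting can be carried over from pdfplumber's per-character data. It is not the actual script. It assumes that italic fonts have "Italic" or "Oblique" in their name and that superscripts are set in a font noticeably smaller than the body text; the `_..._` and `^{...}` markers and all function names are made up for this example. It also ignores line breaks, which a real extraction would have to rebuild from the character positions.

```python
from collections import Counter
from itertools import groupby


def char_style(char, body_size):
    """Classify one pdfplumber-style char dict as 'sup', 'italic' or 'plain'."""
    # Assumption: superscripts are set in a noticeably smaller font size.
    if char["size"] < 0.8 * body_size:
        return "sup"
    # Assumption: italic fonts have "Italic" or "Oblique" in their name.
    name = char["fontname"].lower()
    if "italic" in name or "oblique" in name:
        return "italic"
    return "plain"


def markup_chars(chars):
    """Join chars into text, wrapping italic runs in _..._ and superscripts in ^{...}."""
    if not chars:
        return ""
    # The most common font size is taken to be the body text size.
    body_size = Counter(round(c["size"], 1) for c in chars).most_common(1)[0][0]
    wrappers = {"plain": "{}", "italic": "_{}_", "sup": "^{{{}}}"}
    parts = []
    # Group consecutive characters that share a style into one run.
    for style, run in groupby(chars, key=lambda c: char_style(c, body_size)):
        text = "".join(c["text"] for c in run)
        parts.append(wrappers[style].format(text))
    return "".join(parts)


def extract_page_texts(pdf_path):
    """Yield marked-up text for each page of a PDF (requires pdfplumber)."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # page.chars is a list of dicts with "text", "fontname", "size", ...
            yield markup_chars(page.chars)
```

Keeping the styling logic separate from the PDF reading means it can be tried out on hand-made character dicts before pointing it at real files.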

The total size of the Hittite corpus, not including these tablets, is a little under a million words. The new corpus consists of over 3000 tablets and fragments and over 90 000 words, representing a nearly 10% increase. If my analysis is correct, there are at least 800 word forms in the new corpus that haven’t been previously attested. Hittite is rich in clitics, so some of these forms represent combinations of known elements, but there should still be plenty of genuinely new words and forms.
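
The basic idea behind such a count can be sketched as a comparison against a set of known forms. The snippet below is only an illustration under simple assumptions: forms are separated by whitespace, and bracket marks for damaged signs are dropped so that they do not make a known spelling look new. The function names are hypothetical. It makes no attempt to split clitic chains, so combinations of known elements would still be counted as new forms.

```python
from collections import Counter

# Assumption: damage brackets should not make two spellings count as different forms.
BRACKETS = str.maketrans("", "", "[]⸢⸣")


def word_forms(text):
    """Split transliterated text on whitespace and drop bracket marks."""
    forms = (token.translate(BRACKETS) for token in text.split())
    return [form for form in forms if form]


def unattested_forms(new_text, known_forms):
    """Count the forms in new_text that are absent from the set known_forms."""
    counts = Counter(word_forms(new_text))
    return {form: n for form, n in counts.items() if form not in known_forms}
```

For example, with `known_forms = {"nu", "LUGAL-uš"}`, calling `unattested_forms("nu LUGAL-uš [a]r-ḫa", known_forms)` reports only `ar-ḫa`.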

I’m very much looking forward to all the research that will become possible thanks to the new publication. Maybe I’ll even be able to write a paper or two myself. 🙂


