Tracing Joyce’s Notes for Ulysses

Academic year: 2021


Tracing Joyce’s Notes for Ulysses

AUTOMATION STRATEGIES FOR THE DIGITAL

REPRESENTATION AND SOURCE-TRACING OF JAMES

JOYCE’S COMMONPLACE-NOTEBOOKS FOR ULYSSES

Master Thesis in the programme MA Digital Humanities By Joshua Schäuble (s3155102)

Supervisor: Malvina Nissim Second Reader: Tekla Mecsnóber Abstract:

To write Ulysses (1922), James Joyce used a writing technique called commonplacing. He collected and organized his ideas, together with fragments of his own reading (novels, newspapers, scientific and economic literature), in notebooks. The entries in these notebooks – comma-separated and grouped under headlines – were successively worked into the novel at all stages of its genesis. Together with the notebooks, hundreds of source documents that witness this textual genesis are extant. For years, literary scholars have traced individual notes from Joyce’s commonplace-books through the critical corpus and through Joyce’s reading to figure out how Joyce organized his writing. The interest is bi-directional: both the external source of a note (e.g. a book Joyce read) and the note’s reuse within the extant witnesses are of interest.

This thesis develops strategies to computationally model the Ulysses commonplace-books and their relations to external sources and internal witnesses. A generalizable TEI model for encoding the notebooks is developed and applied to the sample notebook NLI MS 36,639/4 (NLI4 for short). Based on this test data, a first case study attempts to use the Google Books API to identify literary sources for the NLI4 notes. For well-documented reasons, this case study failed. In a second case study, the Apache Lucene search engine is used to automatically retrieve the reappearance of the NLI4 notes in the text-genetic Ulysses corpus, a TEI P5 version of Ulysses: A Critical and Synoptic Edition (1984). These automated results are discussed and successfully tested against manual tracing results provided by Ronan Crowley (Antwerp) and Luca Crispi (Dublin). In a final chapter, a visual representation of the note repositories, embedded in an upcoming digital scholarly Ulysses edition, is outlined.


ACKNOWLEDGMENTS

The success of this thesis depended strongly on the support of a small group of people – scholars who advised me, entrusted me with their data and explained the complex relations of the materials involved.

First, I want to thank my supervisor Malvina Nissim for our inspiring weekly meetings at the beginning of this project and for keeping a digital door open for me after I left Groningen. Many thanks also to my second reader, Tekla Mecsnóber, who supported my ideas and invested time in advising me before I had even consolidated a thesis proposal.

Thank you to Johan Bos and Susan Aasman for organizing the Digital Humanities master course and for offering both administrative and conceptual advice in the weekly “thesis lab”.

A very special thank you goes to Luca Crispi and Ronan Crowley. Without their valuable NLI4 transcript this thesis could not have happened.

Ronan Crowley can’t be named enough in this context – in fact, he will be named 135 times in the following thesis. Thank you for spending numerous hours over the last few years explaining the Ulysses corpus to me and – in regard to this thesis – for sharing your experience and for giving me your personal research records to experiment with.


Contents

1. Introduction: On the Need for a New Digital Approach ... 1

2. Research Design: The Outline of an Embedded Case Study ... 4

2.1. Premises and Preliminary Considerations ... 4

2.2. The Odyssey’s Roadmap ... 4

3. Of Notes, Drafts, Final Episodes and Editions: The Ulysses Corpus ... 8

3.1. A Condensed Genesis of Ulysses ... 8

3.2. The Critical and Synoptic Edition ... 9

3.3. The CSE Goes Digital ... 12

3.4. The Rediscovery of Long-Lost Documents ... 16

4. From an Annotated Transcript in Word to a TEI Encoding... 20

4.1. The NLI Facsimiles and the Crowley Transcript ... 20

4.2. Capturing the Document Structure ... 22

4.3. Distinguishing Individual Notes ... 25

4.4. Crayons ... 29

4.5. Capturing Crowley’s Footnotes ... 31

4.6. Reusability Aspects of the Encoding Model ... 32

5. Automation Strategies to Compile “Joyce’s Library” ... 35

5.1. About Manual Source Tracing Strategies ... 35

5.2. The Curious Case of the Google Books API ... 39

5.3. On Alternative Approaches and the Refocus on the Essential ... 47

6. Automation Strategies to Trace Note Targets ... 50

6.1. About the Limited Exactness of Manual Target Tracing ... 50

6.2. Target Tracing with the DCSE and Apache Lucene Search ... 53

6.3. Evaluation of the Automation Results ... 62

7. The DCSE Notebook Module: A Visual Notebook Representation ... 68

8. Conclusion ... 74


List of Figures


All developments are currently available at https://www.spinning-yarns.com/ulysses

The program code can be accessed via the Oxygen XML Editor on port 8080. A temporary user account has been generated for correction and evaluation purposes (username: xxxxxx, password:


“I’ve been working hard on it all day,” said Joyce. “Does that mean that you have written a great deal?” I said.

“Two sentences,” said Joyce.

I looked sideways but Joyce was not smiling. I thought of Flaubert. “You’ve been seeking the mot juste?” I said.

“No,” said Joyce. “I have the words already. What I am seeking is the perfect order of words in the sentence.”

(James Joyce to Frank Budgen, in Budgen 1971: 19)

1. Introduction: On the Need for a New Digital Approach

Taken solely in the context of the published reading-text editions of James Joyce’s masterpiece Ulysses (1st ed. 1922), the epigraph above might give the impression of a writer who could access a historically unique repertoire of memorized words. Depending on which of the several historical editions of the novel one consults and which rules for tokenization one applies, Ulysses consists of over 260,000 words (Bulson 2014: 1) and has an impressive vocabulary of around 30,000 “word-form units” (Ellegard 1960: 228). This tally includes a high number of non-standard compound words and neologisms coined by the author himself. Numerous corpus-linguistic analyses that compare the vocabulary sizes of English writers support this perspective. Not even the complete works of Shakespeare, with an estimated vocabulary size of over 15,000 words (ibid.: 227), compare to this single monolithic novel1.

Why would an author who possesses such an extraordinary vocabulary spend hours “seeking the mot juste” instead of investing his creative energies into finding the “perfect order of words in the sentence”?

In fact, Joyce did not spend hours trying to find the right words to write Ulysses – he spent years. By the time of the cited conversation with his friend Budgen, he had already filled dozens of notebooks and note sheets with words and phrases he collected in the course of his reading. The earliest extant notebook that fed into the first episode of Ulysses (the so-called “Alphabetical Notebook”) dates back to 1910 (Crispi 2004). When in 1918 Joyce said, “I have the words already”, he was referring not to his undeniably extraordinary memory but to the complex system of note-taking that he had developed and used for the previous few years.

1 An author’s measured or estimated vocabulary size depends on various parameters such as the tokenization

The act of taking, organizing and reusing notes accompanied all stages of Joyce’s writing process. No resource influenced his texts more than his collections of notes. By taking notes, Joyce not only compiled a vocabulary but also stored and organized ideas, knowledge and information. Joyce’s notebooks are, in fact, commonplace-books – compendia of extracts from other works, copied for the express purpose of creative reuse. Unfortunately, only a small number of these compendia survived. And yet these few extant notebooks are the center of focus for many Joyce scholars, namely those who follow the school of genetic literary criticism. By figuring out where Joyce took these notes from and how he worked them into the text, these genetic critics have already made substantial contributions to the decryption of one of the most complex works in literary history. While these achievements validate a text-genetic research perspective, the bespoke and labor-intensive research methods applied to achieve them need to be consolidated computationally.


copy of a candidate volume), and yet traditional humanists still manually key individual notes into a search engine. Candidate sources still have to be evaluated, which typically means they must be read. Literary scholars like Ronan Crowley (University of Antwerp) and Luca Crispi (University College Dublin), among others, have developed their own systems to manually identify both the sources and targets of single notes and to record this information in their own idiosyncratic and project-specific fashion. There is no consistently used data model available that allows researchers to represent digitally the author’s notebooks and their relationships to other documents. Each researcher who works on the notebooks finds her own way of representing and storing the connections discovered and, since these scholars are classical humanists, their skills in data modelling are limited. What they produce is, in the best case, a Word or PDF document with footnotes; in the worst case, it might even be a handwritten transcript of a notebook. In such a format, painstakingly captured data is inaccessible for both computational analysis and further research in literary studies.

Previous research on tracing the notes for Ulysses goes in one or the other (or both) of two directions: ‘downstream’ research looks for Joyce’s reuse of notes in the Ulysses corpus and focuses on tracing the notes’ targets; ‘upstream’ research tries to recover the sources of Joyce’s note-taking and thus to figure out which texts fed into his writing. To date, each direction of scholarly inquiry has faced two major problems, which are solved by this thesis: (1) the absence of a consistently applicable computational model to represent the notebooks and their note relations; and (2) the shortage of automation routines to support tracing the notes within and outside the corpus.

Therefore, the following two research questions will be answered by the work in hand:

(a) How can the Ulysses commonplace-notebooks best be digitally modelled in TEI and enhanced by a relational database to store note sources and targets within the Ulysses corpus?


2. Research Design: The Outline of an Embedded Case Study

2.1. Premises and Preliminary Considerations

The process of tracing the notes of Joyce’s commonplace-books through a corpus of extant Ulysses source documents is determined by three major premises. First, the extant note repositories (notebooks or note sheets) must be consistently digitized in a computationally accessible way. Secondly, the extant text-genetic corpus of drafts, fair copies, typescripts and proof documents, which is to be queried with these notes, must also be compiled and made computationally accessible. Only once these two preconditions are met can a stable search engine be implemented that provides powerful full-text search functionality and uses a reliable reference system to store the hits for subsequent analysis.

The upstream direction of the note-tracing process demands some additional considerations. To identify candidate sources (i.e. works that Joyce might have read and drawn notes from), a huge corpus of historical texts – the bigger the better – must be queried. A corpus that is big enough to promise interesting results cannot be compiled and prepared within a project of the given range. Therefore, a publicly available corpus must be found, which also offers a sufficiently robust application programming interface (API for short). Such an API should support at least two different full-text-search functions: verbatim full-text-search and proximity full-text-search (preferably including word stemming). Additionally, the API must provide filters to narrow down the hits based on metadata features such as publication date and the language of a work.

The stated premises depict only the top level of decisions that had to be made to outline the given research design. Each of them demands the elimination of numerous alternatives, some of which might produce better results. Although these decisions are carefully balanced against alternative approaches, the given research design must not be understood as a comparative analysis of computational methods. Instead, it provides an applied case study in the tradition of the Digital Humanities. It aims to enhance traditional literary scholarship through the application of selected computational methods.

2.2. The Odyssey’s Roadmap


automate querying millions of books and derive an abstract gold standard to identify those works that Joyce might have read without some knowledge of the state of the art in Joyce research. For this reason, chapter 3 provides an overview of the textual genesis of Ulysses and the meaning and role of Joyce’s notebooks. This chapter also discusses the most important resource for manual note tracing: Ulysses: A Critical and Synoptic Edition (1984, ‘CSE’ for short).

For the automation of both the source and the target tracing, the notebooks must be consistently digitized in a computer-readable format. To provide an applicable data model for Joyce’s commonplace-books, two options were considered: a relational database model and a representation in XML. The notebooks are more structured and itemizable than Joyce’s prose and they might, in fact, be representable in a relational database. Yet wherever unique textual phenomena occur within the notebooks, a relational model is inflexible. Therefore, the benefits of an XML representation outweigh those of a rather experimental relational approach. The Text Encoding Initiative (TEI) provides the most widely accepted and applied XML namespace for annotating texts for humanities research. XML-based formats like TEI are system-independent and tailored to the annotation of semi-structured data. In addition, textual scholars and, in particular, digital scholarly editors are familiar with this technology, which makes a TEI-based encoding model more likely to be applied to further Joyce notebooks. Chapter 4 describes the development of a TEI encoding model for the extant Ulysses notebooks and its application to notebook NLI MS 36,639/4 (NLI4 for short) as a proof of concept and to produce a sample dataset for the following steps. The choice of NLI4 as a sample dataset for this case study is motivated by the following arguments:

(a) A full transcript of the notebook in MS Word, produced by Crowley and Crispi – both experts in genetic Joyce studies – was generously provided. Such a resource could not have been produced by non-experts, unfamiliar with Joyce’s handwriting or the ways in which the notebooks organize information.

(b) With this transcript came quantifiable information about the note relations of NLI4, which had been manually compiled by Crowley and Crispi. This information is valuable for assessing the quality of the automation results against manual work.

(c) Since NLI4 is a late stage notebook, most of its notes went directly from the book into the typescript and proof documents of the novel. As explained in section 3.2, these late-stage documents in the making of Ulysses can be systematically searched with the aid of the CSE, which offers a promising basis to automate such queries.
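To make the modelling decision concrete, the following is a minimal sketch of how a single notebook entry might be represented in TEI. The element and attribute choices, identifiers and the crayon value are illustrative assumptions on my part, not the thesis’s actual schema (which chapter 4 develops); only the note text, its page and its proof-level target are taken from the material discussed here.

```xml
<!-- Hypothetical TEI sketch of one NLI4 entry; element and attribute
     choices are illustrative, not the thesis's actual encoding model. -->
<div type="notebookPage" n="10v">
  <list type="notes">
    <!-- the note "Y. M. C. A.", worked into "Lestrygonians" on proof level 1 -->
    <item xml:id="nli4-10v-n1" ana="#crayon">
      <seg type="note">Y. M. C. A.</seg>
      <note resp="#crowley" type="target">Lestrygonians, proof level 1</note>
    </item>
  </list>
</div>
```

Encoding each note as a separately addressable `item` is what later allows sources and targets to be attached to individual notes rather than to whole pages.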


of historical books that has ever been digitized, Google Books also meets most of the premises for source tracing described in section 2.1. It has full-text search functions for verbatim and proximity searches, and it allows the filtering and ordering of results. Unfortunately, there are major functional restrictions inherent in the API, in contrast to the browser-based Google Books search engine. These complicate the automation process and the subsequent analysis of the query results. The case study will show that the Google Books corpus can be queried automatically for Joyce’s notes and that the resulting source candidates could then potentially be ordered by algorithmically calculated probability scores. Unfortunately, both functional and legal restrictions limit the potential of this automation approach, as Google’s terms and conditions prohibit the storing of any information retrieved from Google Books. A user cannot “keep cached copies longer than permitted by the cache header” (Google APIs Terms of Service, §5e(1))2. The chapter describes how, Google’s legal restrictions aside, the limited size and representative meaning of the sample data given in NLI4 prohibit the development of a gold standard for machine-learning approaches.
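To illustrate what such an automated query looks like, the sketch below builds a request URL for the public Google Books volumes endpoint. The helper function, its parameters and the example phrase are my own illustrative assumptions, not the thesis’s actual query code; the absence of a publication-date filter in the public API is, to my knowledge, exactly the kind of functional restriction (relative to the browser interface) discussed above.

```python
from urllib.parse import urlencode

ENDPOINT = "https://www.googleapis.com/books/v1/volumes"

def build_source_query(note: str, lang: str = "en") -> str:
    """Build a Google Books full-text query URL for one notebook entry.

    Wrapping the note in double quotes requests a verbatim phrase match;
    `langRestrict` narrows results to one language. The public API offers
    no publication-date filter, so pre-1922 candidates would still have to
    be filtered afterwards from the returned volume metadata.
    """
    params = {"q": f'"{note.strip()}"', "langRestrict": lang, "maxResults": 10}
    return ENDPOINT + "?" + urlencode(params)

# One of the NLI4 notes as an example query phrase:
url = build_source_query("Y. M. C. A.")
```

Issuing the request and caching its JSON response is precisely what the Terms of Service quoted above constrain, which is why this sketch stops at URL construction.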

From a technical perspective, the tracing of note targets is simpler to automate than the tracing of note sources, because the corpus of target documents is, in theory, fully accessible and controllable. This aspect of the process is documented in chapter 6. To identify target documents within the Ulysses corpus (answering the question “where and when in the textual development did Joyce implement a note into the novel’s text?”), an online version of Ulysses: A Critical and Synoptic Edition that has been in development since 2014 under the working title DCSE will be adjusted and extended3. Manual strategies to search the printed CSE will be applied and automated within the digital version. For this case study, a new note-tracing module was developed and integrated into the DCSE eXist-db4 application. The existing DCSE collection of synoptic episode encodings has been indexed with Apache Lucene, and a Lucene-based search engine has been implemented to let the new notebook module access and search the existing synoptic episode files. The search engine supports five search modes, including a verbatim phrase search and various proximity-search modes. The query results are tested against the manual results that Crowley captured for NLI4 in his MS-Word-based transcript.
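For readers unfamiliar with Lucene’s classic query syntax, the sketch below shows how two of these search modes can be expressed as query strings. The helper function and the default slop value are illustrative assumptions; the thesis’s actual eXist-db/XQuery wiring and its remaining search modes are not reproduced here.

```python
def lucene_queries(note: str, slop: int = 3) -> dict[str, str]:
    """Return Lucene classic-syntax query strings for one notebook entry.

    "..."   matches the note verbatim as a phrase;
    "..."~N is a proximity ("sloppy phrase") search tolerating up to N
    positional moves between the note's words, which helps when Joyce
    reworded or expanded a note on its way into the text.
    """
    phrase = note.strip()
    return {
        "verbatim": f'"{phrase}"',
        "proximity": f'"{phrase}"~{slop}',
    }

queries = lucene_queries("perfect order of words")
```

The proximity mode trades precision for recall: a larger slop retrieves more candidate targets in the synoptic episode files, at the cost of more false positives to evaluate.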

The new eXist-db module for notebook representation is designed to integrate the Ulysses note repositories into the digital CSE data. The module can dynamically visualize any notebooks that are encoded in the same way as NLI4. The long-term aim will be to extend this digital environment to

2 When I was politely reminded of this legal disclaimer by the Google Books support team in a personal email, I had already queried, stored and analyzed data from both Google Books and the Google Books API. For legal reasons, I do not want to use, store or distribute this data, and so I will describe the method and speculate about what could be achieved with it.

3 More information on the development of the DCSE is given in section 3.3.


3. Of Notes, Drafts, Final Episodes and Editions: The Ulysses Corpus

3.1. A Condensed Genesis of Ulysses

The first fourteen episodes of Ulysses were published in infrequent instalments of roughly 6,000 words in the New York literary magazine the Little Review between 1918 and the end of 1920. The London-based Egoist also republished a few episodes. Extant note documents, letters and drafts testify, however, that Joyce had started to conceptualize the novel several years earlier. The promise of serialization and its implicit deadlines played a significant role in Joyce’s decision to finalize the early episodes for print (Groden 1977: 6). From his notebooks and note sheets, he successively ‘accreted’ material to drafts of the individual episodes in copybooks and on loose-leaf drafts. Then he consolidated these drafts and produced a fair copy of the episode in progress. Fair copies were passed on to a typist, who typically sent back three typescripts – one original and two carbon copies – for Joyce to revise. Joyce passed on two of the three revised typescripts to Ezra Pound, who distributed them to the magazines for approval and publication. One copy he kept for himself. Interestingly, his handwritten revisions were not consistently added to all exemplars of a given typescript.

Because of the novel’s controversial contents, Joyce could not find a book publisher in the UK. Although the text had already been bowdlerized before it appeared in the Little Review, an obscenity trial in February 1921 scared off potential publishers, terminated the likelihood of further serialization and killed the chances of an unexpurgated American edition (Johnson 2008: xli). Luckily, this hopeless situation ended after Joyce moved to Paris, where he got in contact with Sylvia Beach. Beach, the owner of the Shakespeare and Company bookshop, offered in 1921 to publish the book under the “auspices of Shakespeare and Company, to have it printed in Dijon by Maurice Darantiere, and to finance it by advance subscription” (Johnson 2008: xlii). Also, she granted Joyce “multiple proof pullings so that [he] could augment earlier episodes” (xliii). Working from the typescript exemplars he had kept, Joyce used this opportunity to rework and revise the early episodes while, at the same time, he was still working on the later episodes of the novel.


After the placard stage, text was set in the designated page layout and arrangement of a printed volume. Textual alteration was more costly at this stage because longer additions could necessitate the realignment (or resetting) of multiple consecutive pages. Nevertheless, this did not stop Joyce from carefully revising each gathering of page proofs that passed his desk. Often, he was obliged to correct the French typesetter, who had misread or misinterpreted his handwritten revisions. Crucially, at this late stage he drew on notes recorded in the notebooks for the additions to the text. Only when this process ran up against Joyce’s birthday did the 1922 first edition go into print – and of course Joyce was still not satisfied with the result. Various correspondences between him and the publisher and between him and his friends document his interest in further revisions (Banta and Silverman 1987).

Such correspondences, in addition to the numerous extant text witnesses that often contain contradictory textual variants, fueled the scholarly discussion about an edition that would represent “Ulysses as Joyce intended it”.

3.2. The Critical and Synoptic Edition

In 1984 Ulysses: A Critical and Synoptic Edition (the CSE) was published. It summarized the genesis of most of the source documents available at the time. Known documents from the period between the first notebook in 1910 and the 1922 first edition included several note documents, some copybooks containing episode drafts, numerous episode fair copies, dozens of typescripts and around 350 proof documents. These documents were (and still are today) heterogeneously distributed across European and US-American institutions. Amongst the most prestigious holders of such Ulysses source documents are the Rosenbach Museum and Library in Philadelphia; the Poetry Collection Library at the State University of New York at Buffalo, New York; the Harry Ransom Center at the University of Texas at Austin; the Department of Rare Books and Special Collections in Princeton; the Beinecke Rare Book & Manuscript Library at Yale; the Houghton Library at Harvard University; and, on the European side, the Department of Manuscripts in the National Library of Ireland, Dublin.

Compiling and digitizing primary sources from so many prestigious holding institutions seems like an unlikely venture, and yet such an enterprise was undertaken in the late 1970s after the appearance of the James Joyce Archive (JJA in short), a series of 63 facsimile volumes of Joyce’s prepublication materials (sixteen of which relate to Ulysses). A team of philologists around Hans Walter Gabler in Munich attempted to extract from the available facsimile editions the authorial writing that produced Ulysses. The resulting CSE was one of the first scholarly editions to be produced with the aid of the Tübingen System of Text Processing Tools (TUSTEP), which was itself first released in 1978. TUSTEP allows a user to collate pairs of texts with advanced options for manual editorial interventions, such as the definition of manual merging points within the two texts. The merged transcripts were collated again in heuristically consecutive document pairs to extract the textual differences and thus to derive the textual genesis. These textual differences between chronologically consecutive versions of the same text passages were annotated in a coherent synoptic text. In the three-volume print edition, this synoptic text is displayed on the left-hand side, while the right-hand side shows a derived reading text in a consecutive lineation. The latter is used today as the standard reference system in Joyce scholarship. Several later reading editions stick to the line fall introduced with the CSE.

Not all of the source documents that were known to be extant in the ’70s and ’80s could be integrated into the synoptic presentation of the CSE. Textual witnesses that could not be processed by TUSTEP in pairs and integrated into the linear textual genesis had to be omitted. The textual variance between two versions of the same text in the early draft stage is too extensive to be annotated within a synoptic text in print. Also, the philological evidence of authorial alterations is of a different quality on earlier drafts. In the draft stage, the textual development happens in the author’s mind while he is recomposing a different version of the same text on a new document. For those documents that chronologically succeed the fair copy the situation is different: after the fair-copy stage, typists and later typesetters simply act as copyists, reproducing Joyce’s writing as faithfully as they can. Joyce revised the typed version and the typesetter integrated Joyce’s hand-written alterations in the next recursion. Therefore, after the fair copy stage, all textual alterations are materially evident on the source document, as long as this document is extant. Due to these factors, Gabler’s team omitted almost all of the drafts and note documents that preceded the fair copy stage (the draft of the “Eumaeus” episode that was then known to be extant forms an important exception to this rule). As a base text, on which all alterations are annotated, the CSE uses the so-called “Rosenbach Manuscript” (RM). The Rosenbach foundation holds a complete fair copy of Ulysses episodes written in Joyce’s own hand. As Gabler argues, textual variance indicates that not all of the surviving typescripts were produced directly from the corresponding portion of the RM. Therefore, for several episodes, another now-lost fair copy or “final working draft” (Gabler 1984: 317) that precedes the RM was the source of the typescript. 
This well-reasoned theory, which was central to the CSE, was the source of considerable controversy in the 1980s and early 1990s (Rossman 1989, Kidd 1990). All the same, the textual genesis after the typescript stage as outlined in the CSE is undisputed.


Apart from the “Eumaeus” copybook, it does not include any notebooks, note sheets, copybooks or episode drafts that precede the fair-copy stage. This is important to know when the CSE is used as a corpus for the retrieval of note targets: in particular, notebooks earlier than NLI4, which were worked into the text before the fair-copy stage, cannot be traced to the precise document of entry, even if the note text is found in the text. To understand how the CSE can be used to trace notes, however, one must first learn to read it.

Figure 1: Ulysses: A Critical and Synoptic Edition (1984), vol. 1, p. 316, section of a synoptic page


In the context of this thesis, line 6 of Figure 1 contains the most interesting alteration5. Here the acronym “Y. M. C. A.” was added on proof level 1. Section 3.1 described the complex segmentation and back-and-forth sending of the proof documents. To reduce the complexity of the corpus, the CSE abstracted gatherings of such proof documents into proof levels. A proof level represents a virtual compilation of documents that together witness a revision campaign of the entire episode. Proof level 1 means that this change was introduced on the placard that belongs to the first “round of proof pullings” of the episode. Chronologically, these revision campaigns or “rounds of proof pullings” are not coherent. The first placard pulling of the first pages of an episode might have passed Joyce’s desk weeks before the first pulling of the end of the same episode, and yet the CSE compiles both documents on the same proof level, because they mark the same abstract level of textual development. A more detailed interlinear annotation of the corresponding documents would have made the CSE even less accessible to scholars. This has to be considered when the CSE is used to trace the exact document on which a change was made. An appendix to the CSE lists the grouped proof documents for all proof levels of all episodes. This helps to narrow the tracing down to a small set of documents, but to know which of them contains the text segment with the searched alteration, the non-digital scholar still has to check the equivalent documents manually in the JJA facsimile volumes.

3.3. The CSE Goes Digital

The three printed volumes of the CSE never lived up to the research potential of the genetic information they contain. Considering the limitations of the print medium, which is static, spatially constrained and by no means dynamic or interactive (outside of the mind), one must applaud the annotation system that the editorial team developed. It is a model of elegance, even compared to the latest version of the TEI. It is capable of annotating the textual development of one of the most complex genetic corpora in modern English literature in a single synoptic text, while remaining legible to humans. Clearly, this scientific representation of a complex research object was years ahead of its time. Only a small community of Joyce scholars ever took the trouble to acquaint themselves with the range of symbols involved: annotated square and angle brackets, carets, superscript characters and numbers.

At the time, the vast majority of Joyce critics were not interested in textual matters. In the late seventies and early eighties, the Joyceans had been discovering ‘theory’ […]. It was therefore inevitable that the most important early reviews [of the CSE] were written either by textual critics who had not worked on Joyce, or by Joyceans without […] training in editorial theory and practice. (Lernout 2006: 230)

5 In fact, the addition “Y. M. C. A.” in the fifth line of the episode Lestrygonians was copied from NLI4 p.10verso.


While the first reviews were very positive (McDowell 1984), the CSE soon became the subject of controversy – more so than any other scholarly edition before it. Even prominent broadsheets like the New York Times and the Washington Post reported regularly on the CSE and devoted column inches to the so-called “Joyce Wars” (e.g. Remnick 1985, McDowell 1988). Negative coverage focused mainly on the (to some) controversial editorial decision to reject the first edition as a base text and instead to constitute a “corrected text” on the basis of the preceding fair copy (Lernout 2006: 231). This illustrates that the edition’s main value – the elucidation of the textual development after the fair-copy stage – was misunderstood. The new French movement of critique génétique (genetic criticism) that Gabler drew on had not then found its way into English literary criticism.

[Genetic criticism] examines tangible documents such as writer’s notes, drafts, and proof corrections, but its real object is something more abstract – not the existing documents but the movement of writing that must be inferred from them. [… It] never posits an ideal text beyond those documents but rather strives to reconstruct, from all available evidence, the chain of events in a writing process. (Deppman, Ferrer and Groden 2004: 2)

The most critical examinations throughout the late 1980s and 1990s focused only on the question of whether Gabler’s new computational method constituted a better reading text than the previous editions. The opportunity to use the edition to study how Joyce – and not his editor – constituted the text was largely ignored.

After the second revised edition of the CSE was published in 1986, Gabler left off working on the edition for ten years. The very public controversies in the 1990s made it unlikely that further funding would be forthcoming to develop the corpus in the digital sphere. In any event, annotation systems that could improve the genetic character of the data did not then exist. Text technological markup languages such as HNML (or HyperNietzsche Markup Language) and the TEI only began to approach the complexity of the printed CSE’s annotation system in the late 1990s. Gabler carefully kept track of these developments6. In 1997, he supervised the Magisterarbeit of Tobias Rischer, who undertook the first major migration of the CSE data, converting the entire corpus to TEI P3. This early version of the TEI was still SGML based (the ‘P’ stands for ‘proposal’) and, while it did contain a model for critical editing, did not support genetic text annotations.

Rischer neither digitized nor retro-digitized the CSE. Essentially, the CSE was never a ‘born-analogue’ print edition. The print volumes were only a reduced, static visualization of what was a much larger, born-digital data repository. This repository took the form of the critically assessed TUSTEP collation results and control structures. Rischer’s achievement, then, was to unify these text files and to represent them in a coherent digital format, just like Gabler’s team had unified them, in print, as a synoptic text. As a markup format, TEI P3 did not make the data more legible to humans, and there was no software environment available or in development to produce more comprehensible visualizations. And yet, both Gabler and Rischer well understood the possibilities for this work:

6 Already in 2000 Gabler conceptualized an interactive digital Ulysses edition in his article “Towards an electronic

For philological reasons, it can be interesting to have available a literary text in an electronic form, e.g. for automated linguistic analyses, to prepare concordances or to provide easier access to texts with complex structures. […] The resulting SGML-document could now be used for example as a data-basis in an interactive Ulysses edition. (Rischer 1997: 15, author’s translation7)

In 1997, however, the “Ithaca” or home-coming for an interactive edition of Ulysses was still many years and many “episodes” away. The TEI P3 version of the data was never published and, only four years later, Rischer again migrated the data to TEI P4, which had by then switched to an XML basis. Proposal 4 still did not support genetic editing in the spirit of the print edition and, although Rischer found ways to work around this problem by encoding the synoptic genesis in a classical critical apparatus, the new results did not match the edition’s philological objective. As much becomes clear when publicly available XSLT transformations that render TEI as HTML are applied to the data8. While such templates can be applied successfully to publish various TEI P4 encodings, the transformation results on the TEI P4 version of the Digital CSE (or ‘DCSE’) have little in common with what the print edition was able to express. In short, a project to produce customized visualizations based on Rischer’s encoding was still not feasible.

For another decade, the DCSE rested on Gabler’s personal computer, until in 2012 the second version of TEI P5 was released. At last, a version of the TEI standard included an encoding model for genetic editing. One year later, an international collaborative project was launched to test the model’s potential for generic analyses and to develop visualization modules for text-genetic corpora9. Led by Brett Barney (Whitman Archive, Nebraska Lincoln), Anne Bohnenkamp-Renken (Freies Deutsches Hochstift, Frankfurt) and Malte Rehbein (Chair of Digital Humanities, Passau), the project had an advisor and coordinator in Gabler. As luck would have it, Gabler received the first results of a TEI P5 migration of DCSE data the night before the first project meeting. The migration was voluntarily written by Gregor Middell in Java to provide a “quick and dirty” proof of concept for the text-genetic TEI model. Despite numerous validation errors, this work was very promising. Moreover, the Passau project partner now had a viable dataset.

7 In the original German: “[F]ür […] philologische Zwecke kann es interessant sein einen literarischen Text in elektronischer Form zur Verfügung zu haben z.B. für automatisierte linguistische Untersuchungen zur Konkordanzerstellung oder als leichteren Zugang zu komplex strukturierten Texten […] Das entstandene SGML-Dokument könnte nun zum Beispiel als Datenbasis in einer interaktiven Ulysses-Edition verwendet werden” (Rischer 1997: 15).

8 For example, two basic transformation scenarios were given as an extension package to the Cladonia XML Editor, see http://www.exchangerxml.com/editor/extensions.html. Only with the release of TEI P5 did the TEI Consortium officially provide basic XSLT stylesheets to publish TEI in HTML.

9 For more information about the project “Diachronic Markup and Presentation Practices for Text Edition in Digital Research Environments” visit the following links: http://gepris.dfg.de/gepris/projekt/236702990 and

The Passau team10 took over the Ulysses encodings as one of their test datasets and, for the first time since the 1980s, the data could be worked on as part of a publicly funded research project. The production of a custom-tailored online edition of Ulysses was not within the scope of the larger project, however. Nevertheless, significant improvements were made to the data during the project runtime. For all sample datasets, a modularized eXist-db application was developed. It contains a module to isolate the individual text stages from a synoptic encoding; a “Diachronic Slider” module that reenacts the textual development of a user-selected span of text; incipient stages of a module for facsimile-transcript mapping; and – particularly for the Ulysses data – a visualization module that digitally reproduces the static synoptic view of the printed CSE. The latter was developed and used to proofread the DCSE data against the print edition in order to catch those instances (and case types) where data migration had corrupted data integrity. It was now possible to detect hundreds of instances of migration loss and to see the errors visualized; many of them could be corrected (semi-)automatically. Without a visualization that allows comparison – or eye collation – with the print edition, such errors remain largely invisible.

By the end of the project’s runtime, in December 2016, the data integrity of major parts of the eighteen Ulysses episodes had been restored. For the first time, the text-genetic perspective of the print edition could be captured by a digital annotation system, and the data could be visualized dynamically. Yet, within this collaborative project, Ulysses was only part of a larger enterprise that had no aspiration to isolate the particular test dataset from the developed eXist-db application as a standalone edition. This opportunity only presented itself once the project had ended, and the gains made seemed to call for a continuation of the work.

A major factor that supported this endeavor and its continuation was the new potential to extend the DCSE data with new information, data that had never been integrated into the print edition. As a private side project, the team had already succeeded in integrating metadata that allowed a user to trace authorial alterations on the abstracted CSE proof levels back to the precise, material proof document (instead of only to a set of such proof documents). This success marked a new milestone in the development of the DCSE. Finally, the digital version no longer dragged behind the print edition, which had been so far ahead of its time; for the first time, the digital enterprise could overtake (and, indeed, overhaul) the CSE. From here on, new Ulysses-specific modules became conceivable, modules that integrate the note repositories and draft documents that never made it into the printed CSE. Interestingly, the number of such documents has increased significantly since the turn of the millennium, when the National Library of Ireland acquired a major collection of Joyce manuscripts that were presumed missing. The work in hand is a first approach to integrating these documents into the upcoming Ulysses: A Digital Critical and Synoptic Edition.

10 The team consisted of Ronan Crowley and Joshua Schäuble with Claus Melchior and Hans Walter Gabler as

3.4. The Rediscovery of Long-Lost Documents

In September 2001, just after the terror attacks in New York, Michael Groden, one of the leading scholars of Joyce’s compositional process, was contacted by the National Library of Ireland and asked to cross the Atlantic in order to evaluate a substantial cache of Joyce’s prepublication materials. The Library had an offer of first refusal on the material from a trusted source. After an initial, unspecific telephone call, in which Groden expressed concerns about the travel, the NLI sent a checklist of the documents on offer via email. His reaction is best described in his own words:

When I saw Noel Kissane's checklist, I nearly fell out of my chair. None of my wildest speculations about what other manuscripts might still be extant could have prepared me for what this list seemed to promise. […] By the time I finished reading the checklist, any reservations I had about flying to London or being inconvenienced by the trip were long gone.

(Groden 2002: 2)

In November, Groden traveled to London in order “to report on the documents’ authenticity, contents, and value” (2002: 4). Provided the documents were authentic, the NLI wanted an expert opinion on whether it should acquire part of the collection or the whole set, and at what price. Groden assessed the documents and confirmed their authenticity. To give just one convincing example: the Joyce Collection at the University at Buffalo already contained a draft of the episode “Oxen of the Sun,” written in a series of nine numbered copybooks, of which numbers 3, 5, and 9 were missing. These missing copybooks were among the documents on offer, as were various other Ulysses witnesses, several of which critics had postulated.


The NLI eventually acquired the entire cache of documents for £8 million (then €12.6 million). Only afterwards were the existence of the collection, the identity of its vendor, and the successful sale to the NLI made public.

The documents were sold by Alexis Léon, son of Paul Léon and Lucie Noël, who were close friends of Joyce during his years in Paris. Paul Léon, a professor of sociology and of Jewish descent, first met Joyce in 1928, six years after the first edition of Ulysses was published. He soon became Joyce’s unpaid secretary. Léon’s knowledge of languages (he spoke seven fluently) made him a valuable helper during the composition of Joyce’s last novel Finnegans Wake, on which the Irish writer labored from 1923 to 1939. After the Fall of France in May 1940, Joyce moved back to Switzerland and left behind in his Parisian apartment many valuable documents and first editions. The Léons temporarily fled to Saint-Gérand-le-Puy but soon moved back to the occupied capital. Here Léon saved numerous boxes and trunks from Joyce’s apartment and distributed them amongst friends of the family for storage. The contents of Joyce’s Paris apartment were publicly auctioned in May 1941, just four months after the writer had died in Zürich of a perforated ulcer, and Léon spent 20,000 francs to buy back first editions and other valuable materials. In August of the same year, he was arrested and imprisoned in the Nazi concentration camp Royallieu. Léon died in 1942 in Auschwitz; Lucie Noël and their son Alexis survived the Holocaust. In James Joyce and Paul L. Leon: The Story of a Friendship (1950), Noël vividly describes the friendship between her husband and Joyce. She died in 1972. Only in the year 2000 did Alexis Léon happen, by chance, to discover the manuscripts among his mother’s papers. The material had been placed in storage for nearly 30 years (Groden 2002: 5). Little is known about why Noël kept precisely this set of documents. Arnold Goldman writes in 2003:

[O]nly months before her death in 1972, I’d interviewed Lucie Noël […] in her Paris apartment […]. Possibly sitting within arm’s length of what by 2002 would be worth £12 million (say $12 million) [sic]11, it didn’t occur to me to ask if she owned any Joyce memorabilia. As she lived very modestly, I suppose that she didn’t know that she had anything of such value as the manuscripts her son would discover thirty years later.

Full lists of the acquired documents, including detailed descriptions, are given by Groden (2002) and in the National Library’s own collection list (Kenny 2004). The acquisitions contain source documents of various types. In particular, there are newly discovered pre-fair-copy drafts of no fewer than nine Ulysses episodes, and the recovery of these caused many literary scholars to refocus their attention “on the composition of Ulysses” (Crowley 2007: 1). Nevertheless, in the context of the work in hand, the notebooks are of primary interest. With the 2002 accession, the NLI acquired four Ulysses notebooks12 and one earlier notebook with accounts, citations and book lists13, which is dated to 1903-1904.

11 Goldman gives a wrong value in pounds; the dollar value is accurate, though (see Groden 2002: 7).
12 MS 36,639/3, MS 36,639/4, MS 36,639/5A and 5B


Crispi categorizes Joyce’s notebooks into first-order and second-order notebooks. First-order notebooks are notebooks that accompanied Joyce’s own reading: “While reading, Joyce usually jotted down words and phrases directly in notebooks that he always kept by his side for this purpose” (Crispi 2016: 76). Of the many first-order notebooks for Ulysses that must have existed, only two are known to be extant14. While Joyce would work some of these first-order notes directly into “whatever text he was currently working on” and carefully crossed them out in colored crayon, most of the notes were copied into second-order notebooks, where they were ordered by categories such as a novel’s characters or episodes. According to these categories, the notes were then later worked into the appropriate text and, again, carefully crossed out on the notebook page in colored crayon. Uncancelled notes were “transferred and resorted once again in yet other notebooks for future use” (Crispi 2016: 76). All four Ulysses notebooks among the Joyce Papers 2002 are second-order notebooks.

Emphasizing the historical timeline, Groden prefers to categorize note repositories according to their place in the writing of the novel; he divides the new NLI notebooks into early and late notes (2002). NLI MS 36,639/3 (NLI3 for short) is the earliest Ulysses notebook in the NLI catalogue. The notes in this notebook are ordered by categories such as “Names and Places” and “Recipes” and by character names such as “Simon”, “Leopold” and “Stephen”. This categorization indicates that, although it is an early notebook in the making of Ulysses, it is not a first-order notebook in Crispi’s sense. The notes were probably copied and reordered from another notebook. The remaining three NLI notebooks are also second-order notebooks but, in contrast to NLI3, they were produced at a late stage in the making of Ulysses. All three are ordered by episode titles, and Groden dates all of them to Joyce’s years in Paris, between 1920 and late 1921.

Around the same time that the new NLI materials appeared, the search giant Google announced it would participate in the mass digitization of physical books. Google Books was launched in 2004 and, of course, it raised scholars’ hopes that they would gain full-text access to this massive corpus. Such a resource would have fundamentally altered the ways in which they trace Joyce’s reading notes and thereby understand Joyce’s reading. This development may explain why NLI4 – the first of the three later-stage notebooks – received more attention from scholars attempting to trace Joyce’s notes than the other notebooks. As explained in section 3.2, alterations that were made to the text after the fair-copy stage (at a late stage) can be traced to the source document with the aid of the CSE. A lot of manual research had already been done on the notebooks previously known to have survived, which made them somewhat less interesting. NLI3, the first notebook of the new set, is an early-stage notebook and thus not a good candidate for target-tracing with the CSE: its notes were integrated into the text before the fair-copy stage.

14 These are the previously mentioned “Alphabetical Notebook” (4609 Bd Ms 1/Cornell 25) at the University


4. From an Annotated Transcript in Word to a TEI Encoding

The following chapter describes how a TEI encoding model for the digital representation of NLI4 was developed and applied. TEI is the de-facto standard for text annotation in the humanities and, in particular, in the field of digital scholarly editing (Przepiórkowski 2011). The model developed is based on two sources: (a) high-resolution facsimiles of the notebook, which are available for research purposes in the online catalogue of the National Library of Ireland15, and (b) a transcript which was produced from these facsimiles by Crowley and Crispi16.

Although no two notebooks in the extant corpus of Ulysses source materials are identically structured, the similarities between the extant notebooks are greater than their differences. Most of them are structured according to episode headlines and, moreover, the individual page layouts are relatively similar. Therefore, the encoding model described here has a high likelihood of working for the other notebooks (see 4.6). As argued in section 5.3, the results of computational methods to trace individual notes within and outside of the Ulysses corpus will improve with every additional data source that is consistently encoded in the same format.

4.1. The NLI Facsimiles and the Crowley Transcript

Before discussing NLI4 as a digital artifact, one must recognize it first as a material document. Covered with a plain blue wrapper, the notebook measures 21.7 x 17.2 cm and contains 24 unruled pages, all of which were inscribed by Joyce. In the NLI catalogue description, the single pages are referred to as recto (front side of a physical page – odd-numbered in traditional pagination) and verso (back side of a physical page – even-numbered in traditional pagination). The front cover (recto and verso) is not inscribed, and the recto of the back cover contains the note “Λabcd” in pencil.

The first 17 pages of NLI4 are ordered and headlined with the novel’s episode titles, from episode one, “Telemachus” (NLI4 p.1recto), to episode 18, “Penelope” (p.9recto). Intriguingly, the notebook does not contain a page marked for episode 11, “Sirens.” Most of the identifying headlines are underlined in blue, green or red crayon. Under each episode headline follows a list of partially comma-separated notes, the majority of which are canceled in blue, green or red crayon. After these initial episode pages, the eighteenth page (p.9verso) is headlined “Eventuali,” and this is followed by five additional pages of material for “Penelope” (p.10recto-p.12recto) and, finally, one last page headed “Circe” (p.12verso). Three of the first seventeen pages contain an additional episode headline elsewhere on the page, most likely added later to use up free space within the notebook after the corresponding episode page had been filled with notes. These three pages are 2v, 5v (both continuing the notes on “Circe”) and 6v (continuing the notes on “Cyclops”). Several pages contain additional notes in the left margin and around the page’s episode headline.

15 See: http://catalogue.nli.ie/Record/vtls000357762/HierarchyTree#page/1/mode/1up. For the work on this thesis, the tiled facsimiles have been scraped and recomposed with the aid of a Python script in the highest available resolution. All references to the notebook in this dissertation – including the transcription – are based on these facsimiles.

16 Dr. Crowley kindly provided his transcript for the work in hand. He fully approved any further processing of

The number of notes per page – and, more importantly, per episode – varies significantly. There is only a single note recorded in the entire notebook for “Telemachus,” the first episode of Ulysses (on 1r); by contrast, the longer Ulysses episodes (in particular, “Circe” and “Penelope”) are represented by several hundred notes. NLI4 is a late-stage notebook. To compile it, Joyce took unused notes from previous notebooks and redistributed them according to the episode rubric of NLI4. The imbalance in the number of notes per episode indicates that the composition of “Circe” and “Penelope” was still very much in Joyce’s mind and that he had made a conscious decision to extend them; “Telemachus” had been completed several years earlier and Joyce had little intention of reworking it to any great extent. This general observation had already been made by Michael Groden in 1977, long before it was known that NLI4 survived. Groden calls this writing technique, which mainly affects episodes 15 to 18, “encyclopedic expansionism” (1977: 52). The notes-per-episode distribution of NLI4 supports Groden’s earlier observations.

In the course of previous text-genetic research on the 2002 NLI acquisitions, Crispi and Crowley collaborated on a page-by-page transcript of the complete notebook in Microsoft Word. Each page of this transcript is headed with the NLI catalogue description of the equivalent facsimile page, followed by a document-oriented transcript that preserves the lineation and alignment of the original notebook. Since Microsoft Office does not allow strike-through colors that differ from the text color, Joyce’s multicolored crayon cancellations have been recorded by coloring the text itself accordingly. Illegible words or phrases are recorded inconsistently: in some instances, potential readings are listed in square or round brackets; in other cases, the illegibility is recorded by single or multiple question marks in brackets or even without any brackets in the continuous text. This inconsistency hints at the human-reader orientation of the transcript. Elsewhere, Joyce corrected himself by overwriting a note or by crossing it out and adding an interlinear correction. Such instances have also been recorded (where legible) with the graphic means of Microsoft Word.


For identified notes, the transcript also records on which CSE level17 the notes entered the novel. For potential matches, the line number of the CSE reading text is captured in the footnotes.

Although Crowley and Crispi worked systematically, the transcript was produced to support their individual research needs and did not feed into a published edition of the notebook. Microsoft Word and the underlying Office Open XML (OOXML) are not designed to support descriptive textual annotations and scholarly text editing. Within the 27 pages of NLI4, there are several page layouts, and the transcript additionally contains various unique representations of textual phenomena and smaller inconsistencies in the data representation. Such inconsistencies include manually added superscripts for footnote references of unidentified notes18, unflagged comments within the transcript (e.g. Crowley sets a [sic!] annotation whenever Joyce misspelled a word) and the already mentioned markings of words that are illegible on the facsimile. The combination of the various page layouts, the named inconsistencies and the complexity of OOXML made it unprofitable to write a fully automated transformation script to produce a TEI representation from the Crowley transcript. Instead, a semi-automated approach proved optimal.

4.2. Capturing the Document Structure

The first step in every text encoding project is to analyze the material’s textual and documentary structure. Traditionally, the Text Encoding Initiative (TEI) is focused on annotating text in a form that abstracts the text from its carrier document. This means that textual and semantic features are prioritized over documentary features such as the document-specific pagination. This is necessary to reduce conflicting hierarchies within the XML-based TEI schema. To give an example: annotating the beginning and the end of each paragraph gives us more information about the semantics of a text than annotating the beginning and the end of each page. Since XML is strictly hierarchical and paragraphs do not necessarily end with the page, it is not possible to model both hierarchies within the same well-formed XML document. In the TEI’s traditional understanding of text, the textual hierarchies are prioritized in such cases, and the documentary features (e.g. page, column and line breaks) are annotated as standalone milestone elements within the textual hierarchy.
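This prioritization can be illustrated with a minimal, schematic TEI fragment (invented for illustration; it is not taken from the NLI4 encoding): the paragraph is encoded as a container element, while the page break that falls inside it is reduced to an empty milestone element.

```xml
<!-- Textual hierarchy prioritized: the paragraph (tei:p) is the
     container; the documentary page break (tei:pb) that occurs
     mid-paragraph is recorded only as an empty, standalone
     milestone element. -->
<body>
  <p>This paragraph begins at the bottom of one page
    <pb n="10"/>
    and continues at the top of the next.</p>
</body>
```

A processor that wants to reassemble the content of page 10 must therefore collect everything between this tei:pb and the next one, across the boundaries of the textual hierarchy.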

In 2012, this rule of thumb was called into question by the introduction of an element set for genetic editing into the TEI P5v2 Guidelines. Rooted in the French critical movement of critique génétique, genetic criticism focuses on the detailed analysis of extant source documents, such as notebooks, drafts, manuscripts and typescripts, in order to derive information about the author’s individual working process and the composition of a literary work. This new element set prioritized documentary features for the first time and thus allowed editors to describe source documents and their genetic constitution in depth. This alternative perspective raised a new question for text encoding projects, however, because editors must now decide between a linear transcription (focusing on textual features), a diplomatic transcription (focusing on documentary and genetic features) or a hybrid of the two.

17 Gabler introduced a level system to cluster the hundreds of extant documents into critically evaluated campaigns of authorial revision. This was necessary to reduce the complexity of the text-genetic corpus for the annotation of the print edition. Unfortunately, these abstracted CSE levels do not directly allow one to infer the actual document (draft, manuscript, typescript, page proof, placard) on which changes were introduced to the text. For more information see section 3.2.

18 E.g.: Crowley added a superscripted X to notes that he could not yet find within the CSE but of which he

One of the first digital scholarly editions to apply the new TEI concepts for genetic editing – even before the model was officially introduced into the TEI – was the Faust-Edition. Brüning et al. argue that, in the case of Faust, only two separate encodings (a textual and a documentary one) for each text witness fulfill the full scholarly objectives of the project. Nevertheless, in their conclusion they argue that it is most important that “all parts of the edition [here: diplomatic and linear transcript] have to be closely connected, and the available connections have to be intuitively and intelligibly visualized” (Brüning et al. 2011: §37). Since very few editorial projects have the necessary resources to produce two separate, yet closely connected, encodings, hybrid encodings, which merge the diplomatic and textual approach, are more likely to become common practice. Documentary zones (rectangle and polygon shapes on the facsimile, e.g. for lines or interlinear additions) and genetic revision campaigns are captured separately in the teiHeader and in a preceding tei:sourceDoc description. Instead of adding a complete documentary transcript within the coordinated zone elements of this preceding tei:sourceDoc, these zones are linked to the corresponding elements of a traditional textual transcript. In some instances, an element in the textual transcript represents a documentary zone very neatly, e.g. when a verse line (tei:l) is linked to a rectangular tei:zone. In other cases, the scope of such links may be dubious, e.g. when a standalone line break (tei:lb), which is common within textual prose transcripts, is linked to a rectangular tei:zone within the tei:sourceDoc description. The standalone line break element contains no information about the extent of the line or about which of the following elements in the DOM tree belong to the same line, while the tei:zone element with given polygon coordinates is meant to represent the physical dimensions of the complete line.
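The linking mechanism can be sketched as follows. Element names follow TEI P5, but the ids, coordinates and the use of @facs as the linking attribute are chosen here for illustration and do not reproduce the actual Ulysses encoding:

```xml
<sourceDoc>
  <surface n="5v">
    <!-- Rectangular zone covering the third written line of the page;
         the coordinates are invented placeholders -->
    <zone xml:id="z_5v_l3" ulx="120" uly="340" lrx="980" lry="372"/>
  </surface>
</sourceDoc>
<text>
  <body>
    <p>
      <!-- The standalone tei:lb only marks where the line begins;
           @facs points to the zone that records the line's
           physical extent on the facsimile -->
      <lb facs="#z_5v_l3"/>first note of the line, second note,
    </p>
  </body>
</text>
```

The sketch makes the asymmetry visible: the zone carries the full spatial information, while the milestone on the textual side only marks a point in the reading sequence.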

Although the current TEI proposal (TEI Consortium 2017) contains no information about best practices for hybrid encodings, the technical difficulties19 that come with such an attempt are solvable. For the Ulysses notebooks, a hybrid encoding is the most efficient, flexible and extensible approach. The hybrid encoding model described in the following paragraphs is optimized to guarantee access to both the textual units (notes and their assignment to an episode) and the documentary units (physical pages).

19 in the example above: it is possible to isolate all the notes between two standalone line break elements from


Moreover, the later addition of further documentary features, such as zone coordinates and the annotation of text-genetic revisions, is taken into account. An actual encoding of such documentary features was applied, as a proof of concept, to the first 8 pages of the notebook.

NLI4 is structured by the Ulysses episodes. The episode titles have been added to the empty pages of the notebook in the same order in which the episodes appear in the novel. Only then did Joyce assign notes from existing note repositories and copy them underneath the most appropriate episode headline. The majority of notes were assigned to episodes 18 (“Penelope”) and 15 (“Circe”) and when the designated pages for these episodes were filled, Joyce headlined and filled additional pages for these two episodes in the final pages of the notebook. Only for these additional pages does an episode unit exceed a physical page unit. In a strictly textual TEI transcript, the episode note block (as a textual feature) would be encoded regardless of the page break (as a documentary feature). The latter would then be captured only as a standalone element, which makes it technically more complicated to access all the notes of a physical page or to tell which physical page a note is written on. For the given document structure, this would complicate a later processing and visualization of the data unnecessarily.

In Joyce’s notebooks, it is the exception rather than the rule that a textual unit (episode note block or individual note) exceeds a documentary unit (the physical page). Although conflicting hierarchies do occur here, they are rare enough to justify preserving the physical page as the main hierarchical structure and subordinating the episode note block, which occasionally exceeds the physical page and then has to be divided into two linked units. The notebook has been analyzed for the relation between episode note blocks and pages. Three different page layouts can be identified:

1. Pages with one episode note block consisting of a headline and notes.

2. Pages with two episode note blocks. The second episode note block (episode headline plus notes) was added later underneath the first episode note block to fill the blank space that remains for episodes with fewer notes.

3. Pages with episode note blocks continued from an adjacent (typically facing) page. Such pages contain only notes but no episode headline.


To mark up the described structural components in TEI, tei:div elements with a qualifying @type attribute have been used. A TEI template was created for each page, and the note blocks from Crowley’s preprocessed transcript were copied and pasted into these templates.

Figure 2 compares the facsimile of NLI4 p.5verso to the corresponding TEI template. The page (encoded as tei:div with @type="page") has two episode note blocks (tei:div with @type="noteblock"): One with the episode headline “Wandering Rocks” and one with the episode headline “Circe” (each encoded with tei:head). The notes of the first episode are organized in a single center block (tei:div with @type="centerblock"). The second episode note block has a center block and additional notes on the left-hand margin. All the margin notes are nested within a tei:div with the attribute @type="marginblock".

Figure 2: Facsimile NLI4 p.5verso and its Basic Document Structure in TEI
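Reduced to the elements named above, the basic TEI structure of this page can be sketched as follows (the @xml:id value is illustrative and the note content is omitted):

```xml
<div type="page" xml:id="p5v">
  <div type="noteblock">
    <head>Wandering Rocks</head>
    <div type="centerblock">
      <!-- notes of the "Wandering Rocks" episode -->
    </div>
  </div>
  <div type="noteblock">
    <head>Circe</head>
    <div type="centerblock">
      <!-- notes of the "Circe" episode -->
    </div>
    <div type="marginblock">
      <!-- additional notes on the left-hand margin -->
    </div>
  </div>
</div>
```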

Episode note blocks which exceed the page boundary are divided to preserve the page as the primary hierarchy. The cohesiveness of such divided blocks is encoded by assigning ids and linking the parts with @next/@prev attribute references. It should be noted that this is a workaround for a phenomenon that occurs only rarely. The complete annotation model, including the NLI4 encoding and documentation, is accessible on GitHub.
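Such a division might be encoded as follows (a sketch with illustrative @xml:id values; in TEI, @next and @prev take pointer values):

```xml
<div type="page">
  <!-- ... preceding note blocks of this page ... -->
  <div type="noteblock" xml:id="nb-circe-1" next="#nb-circe-2">
    <head>Circe</head>
    <!-- notes ... -->
  </div>
</div>
<div type="page">
  <div type="noteblock" xml:id="nb-circe-2" prev="#nb-circe-1">
    <!-- continued notes, without a repeated headline -->
  </div>
</div>
```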

4.3. Distinguishing Individual Notes


Comma separation is the most obvious indicator of note borders, but it is not fully reliable: there are many instances of discrete notes that are not comma separated, and some comma-separated phrases may even belong to the same note.

Ultimately, only Joyce himself knew which units constituted cohesive notes, and yet there is a set of indicators aside from comma separation that can be used to distinguish them. A second indicator that is explicitly visible on the document is crayon color. As Joyce used notes, he struck them through in colored crayon (blue, red, or green). Like comma separation, the crayon color is an unreliable indicator. In many cases, several adjacent notes are crossed out in the same color, which makes it impossible to detect where a given note ends and its immediate neighbor begins. In other instances, where the crayon color changes between two notes, the crayon strikes can be inaccurate (Joyce occasionally colored consecutive words twice in different colors). Moreover, many notes were never marked as used at all, i.e. there is no crayon color, which again makes distinguishing their borders more complicated.
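One common TEI approach to such usage marks (a hypothetical sketch, not necessarily the encoding used in this project) is to wrap each struck-through unit and record the crayon color in a @rend attribute:

```xml
<seg rend="crayon-blue">door always visavis other</seg>
```

A query for all notes marked as used could then simply filter on the @rend value.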

Besides these documentary features to discriminate notes (comma separation and crayon coloring), there are also implicit textual indicators. The first one that comes to mind is of course meaning and context. One may argue that a human reader should be able to distinguish the borders of individual notes simply based on their meaning and context. Unfortunately, in the case of Joyce’s notes this is easier said than done. An individual note often consists of cryptic keywords such as “door always visavis other” (NLI4 p.4recto) rather than coherent text. A verbatim borrowing appears with little to no sense of its signification in its source. If two units of cryptic keywords succeed one another without any explicit separation, it is not necessarily possible to distinguish them without any further knowledge about where these notes come from or where the same words were reused.

Therefore, a final indicator to identify note borders is inter-document relations. Since NLI4 is a second order notebook, most of the notes it contains were copied from previous notebooks. On the other hand, notes that are struck through in crayon have been worked into the text or were recompiled and reorganized in a later notebook. Any knowledge about where the listed phrases might come from or where they reappear helps one distinguish the borders of an individual note. On the human end, this means that a deep knowledge and long working experience with the extant documents is required to distinguish those textual units in the notebook that belong together. On the computational end, it means a reverse approach to the one described here is worth considering: If more documents from the genetic Ulysses corpus were digitized consistently, these documents could be collated computationally to find reoccurring strings. Such collation results could then be used not only to identify what a note is but also – by a text critical evaluation of the collation results – to work out how
