ImpNotes
Implementation notes for the Berkeley Prosopography Services project
Representing the corpus in XML

The HBTIN project corpus is encoded in ATF, a text format that carries a great deal of information but is not widely supported. The first step is to convert it to an XML format for easier machine processing. We are basing the XML on schemas from the Text Encoding Initiative (TEI), which has a large user community, including people doing prosopography (marking up documents by hand). Since we want the tools we develop to be applicable beyond HBTIN, we resolved to use a schema that has broad support among researchers in cultural heritage and the humanities.

To show the current thinking on the TEI-for-BPS syntax, we have a working example of what the markup should look like for a sample tablet. Additional background notes on our thinking are collected below.

Technology Stack

There are three major components to the architecture, each comprising a distinct service bundle and each (currently) built on a different technology stack. Details of these stacks will be added to the TechStackNotes page.

The text pre-processing component is currently built as a series of Perl scripts that process ATF into TEI-for-BPS. It is not yet available as a web service.

The social network analysis component will be a Java-based web service backed by a MySQL DB. This can live initially in a generic Tomcat container, but we will eventually migrate it to one of the OSS ESB/SOA stacks, probably something like the FUSE stack.

The presentation and graphing component will incorporate both simple web services for some basic formats and a graph-support client that provides richer interactive support for the graph visualizations. The packages we explored for use in this component are annotated on the GraphTechStackNotes page.

Current ATF syntax notes

Document dumps are structured as tablets with identifiers. There are some noise tokens we can ignore (#atf: lang etc.), although for other projects and applications the language may be significant, and languages can be mixed within a given corpus file. We should generally expect to see a structure like the following, indentation added for clarity:

&P#### = LL ##, ##
@tablet
  @obverse
    1. text
      {#notes (can be multi-line) - copy and ignore}
    2. text
      {#notes (can be multi-line) - copy and ignore}
    ...
  @reverse
  @top
    @column 1
      1. text
        {#notes (can be multi-line) - copy and ignore}
      $ seal = {string identifier} OR $ (seal largely broken)
        {#notes (can be multi-line) - copy and ignore}
      2. text
        {#notes (can be multi-line) - copy and ignore}
      {3. text may or may not have additional lines of text, also with notes as above}.
    @column 2
      {column production as above}
    ...
  @bottom
    @column 1
      {column production as above}
    @column 2
      {column production as above}
    ...
  @left
    @column 1
      {column production as above}
    @column 2
      {column production as above}
    ...
  @right
    @column 1
      {column production as above}
    @column 2
      {column production as above}
...

The text lines are individually structured and notated, but they actually run together to form a full text. In particular, the markers of parentage run across lines, as do other role markers. So we will build up a full text and parse it running across lines.

Do we need to reference the line number in the onomasticon, or is it sufficient to have the tablet and face? [Laurie: would be nice, but not important for now.] We could keep track of the line number in the parser if need be.

We should be able to mark up with TEI for structure as input, and then consider marking up further as output.

Open issue: how do we make probabilistic associations to persons, and handle multiples? Alternatively, where the identification is not clear, we could create an entity with notes on its associations to others. Something else then builds the prosopography and can present and eventually resolve (i.e., support expert/editorial resolution of) the ambiguities.

Ideas from TEI-Lite docs

<teiCorpus>
  <teiHeader>
    <!--[header information for the corpus]-->
  </teiHeader>
  <TEI>
    <teiHeader>
      <!--[header information for first text]-->
    </teiHeader>
    <text>
      <!--[first text in corpus]-->
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <!--[header information for second text]-->
    </teiHeader>
    <text>
      <!--[second text in corpus]-->
      <name type="foo">kjhskjhaskdjh</name>
    </text>
  </TEI>
</teiCorpus>

Define a corpus of the tablets, with a general header (where we can put some notes on what the corpus is, e.g., from Laurie). Each item is then a TEI block. The tablet number and other metadata go into the header (see also Front Matter), and the rest goes into the text blocks. We will then probably use <group> to represent the different faces (or fragments).
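As a rough illustration of how the ATF structure could be mapped onto this corpus layout, here is a minimal sketch in Python (not the project's Perl scripts). The function name atf_to_tei_corpus, the sample tablet, and the element choices (one <text type="..."> per face inside <group>, <div type="column">, <l n="...">, <note>) are assumptions made for the example, not the TEI-for-BPS schema. The header is also simplified: a real TEI header would nest the identifier and title under <fileDesc>.

```python
import re
import xml.etree.ElementTree as ET

# Face markers from the ATF notes; "@tablet" and "@column N" are handled separately.
FACES = {"obverse", "reverse", "top", "bottom", "left", "right"}
TABLET_RE = re.compile(r"^&(P\d+)\s*=\s*(.*)$")  # e.g. "&P#### = LL ##, ##"
LINE_RE = re.compile(r"^(\d+)\.\s+(.*)$")        # e.g. "1. text"


def atf_to_tei_corpus(atf):
    """Convert an ATF dump into a <teiCorpus> element (hypothetical layout)."""
    corpus = ET.Element("teiCorpus")
    ET.SubElement(corpus, "teiHeader")  # corpus-level header, left empty here
    group = body = container = None
    for raw in atf.splitlines():
        line = raw.strip()
        if not line or line.startswith("#atf:"):
            continue  # blank lines and noise tokens such as "#atf: lang ..."
        tablet = TABLET_RE.match(line)
        if tablet:
            # One <TEI> block per tablet; the identifier goes into its header.
            tei = ET.SubElement(corpus, "TEI")
            header = ET.SubElement(tei, "teiHeader")
            ET.SubElement(header, "idno").text = tablet.group(1)
            ET.SubElement(header, "title").text = tablet.group(2)
            group = ET.SubElement(ET.SubElement(tei, "text"), "group")
            body = container = None
            continue
        if group is None:
            continue  # content before the first tablet identifier
        if line.startswith("@"):
            parts = line[1:].split()
            if parts and parts[0] in FACES:
                # Each face becomes its own <text> inside the tablet's <group>.
                face = ET.SubElement(group, "text", {"type": parts[0]})
                body = container = ET.SubElement(face, "body")
            elif len(parts) == 2 and parts[0] == "column" and body is not None:
                container = ET.SubElement(
                    body, "div", {"type": "column", "n": parts[1]}
                )
            continue  # "@tablet" and unrecognized markers add no structure
        if container is None:
            continue  # text outside any face is skipped in this sketch
        numbered = LINE_RE.match(line)
        if numbered:
            # Keep the ATF line number so it can be referenced later if needed.
            item = ET.SubElement(container, "l", {"n": numbered.group(1)})
            item.text = numbered.group(2)
        else:
            # "#" notes and "$" lines are copied through, not interpreted.
            ET.SubElement(container, "note").text = line
    return corpus


# Made-up sample tablet, following the structure described in the ATF notes.
sample = """&P000001 = AB 01, 02
#atf: lang akk
@tablet
@obverse
1. first line
# a note
2. second line
@reverse
@column 1
1. third line
$ (seal largely broken)
"""

corpus = atf_to_tei_corpus(sample)
print(ET.tostring(corpus, encoding="unicode"))
```

Keeping the line number as an attribute leaves the option open of referencing lines from the onomasticon later, while notes and seal lines are simply copied so that nothing from the ATF source is lost.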
Examples (taken from the UMich TEI site):

<persName key="pn9">
  <forename sort="2">Sergei</forename>
  <forename sort="3" type="patronym">Mikhailovic</forename>
  <surname sort="1">Uspensky</surname>
</persName>

This example also demonstrates the use of the sort attribute common to all members of the personPart class; its effect is to state the sequence in which <forename> and <surname> elements should be combined when constructing a sort key for the name.

<persName key="MRSRO1">
  <addName type="honorific">Mrs</addName>
  <surname>Robinson</surname>
</persName>

<persName key="FRTG1">
  <forename>Frederick</forename>
  <addName type="epithet">the Great</addName>
</persName>

<placeName>
  <settlement key="RNY1" type="city">Rochester</settlement>
</placeName>

<geogName key="MIRI1" type="river">
  <name>Mississippi</name>
  <geog>River</geog>
</geogName>

<placeName key="NEPA1">
  <distance>10 miles</distance>
  <offset>north of</offset>
  <settlement type="city">Paris</settlement>
</placeName>

We can then add new elements as we see fit, especially to define the structure we want. Other tools can ignore them.

UI Notes

The PASE project has a UI help file that shows some screenshots and explains how their UI works. There seem to be some good ideas there.