ImpNotes  
Implementation notes for the Berkeley Prosopography Services project
Updated May 28, 2009 by LudicrousResearcher@gmail.com

Representing the corpus in XML

The HBTIN project corpus is encoded in ATF, a text format that carries a lot of information but is not widely supported. The first step is to convert it to XML for easier machine processing. We are basing the XML on schemas from the Text Encoding Initiative (TEI), which has a large user community, including people doing prosopography (marking up documents by hand). Since we want the tools we develop to apply more broadly than just HBTIN, we chose a schema with broad support among researchers in cultural heritage and the humanities.

To show the current thinking on the TEI for BPS syntax, we have a working example of what the markup should look like for a sample tablet.

Additional background notes on our thinking are collected below.

Technology Stack

There are three major components to the architecture, each comprising a distinct service bundle and each (currently) built on different technology stacks. Details of these stacks will be added to the TechStackNotes page.

The text pre-processing component is currently built as a series of perl scripts that process ATF into TEI-for-BPS. It is not yet available as a web service.

The social network analysis component will be a Java-based web service backed by a MySQL DB. This can live initially in a generic Tomcat container, but we will eventually migrate it to one of the OSS ESB/SOA stacks, probably something like the FUSE stack.

The Presentation and graphing component will incorporate both simple web services for some basic formats as well as a graph-support client to provide richer interactive support for the graph visualizations. The packages we explored for use in this component are annotated on the GraphTechStackNotes page.

Current ATF syntax notes

Document dumps are structured as tablets with identifiers. There are some noise tokens we can ignore (#atf: lang etc.), although for other projects and applications the language may be significant, and languages can be mixed within a given corpus file. We should generally expect to see a structure like the following (indentation added for clarity):

&P#### = LL ##, ##
  @tablet
    @obverse
      1. text
      {#notes (can be multi-line) - copy and ignore}
      2. text
      {#notes (can be multi-line) - copy and ignore}
      ...
    @reverse
    @top
      @column 1
        1. text
        {#notes (can be multi-line) - copy and ignore}
        $ seal = {string identifier} OR $ (seal largely broken)
        {#notes (can be multi-line) - copy and ignore}
        2. text
        {#notes (can be multi-line) - copy and ignore}
        {3. text may or may not have additional lines of text, also with notes as above}
      @column 2
        {column production as above}
      ...
    @bottom
      @column 1
        {column production as above}
      @column 2
        {column production as above}
      ...
    @left
      @column 1
        {column production as above}
      @column 2
        {column production as above}
      ...
    @right
      @column 1
        {column production as above}
      @column 2
        {column production as above}
      ...
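
As a rough illustration of the outline above, here is a minimal Python sketch that reads this structure into nested dictionaries. It is not the project's Perl pre-processor. The function name `parse_atf`, the dictionary layout, the implicit column created for text that sits directly under a face, and the handling of multi-line notes (a `#` line opens a note, and any unrecognized line that follows continues it) are all assumptions made for the example.

```python
import re

# Face markers taken from the outline above.
FACES = {"obverse", "reverse", "top", "bottom", "left", "right"}
TABLET_RE = re.compile(r"^&(P\d+)\s*=\s*(.*)$")
TEXT_RE = re.compile(r"^(\d+)\.\s+(.*)$")


def new_column(number):
    """Hypothetical container for one column of a face."""
    return {"n": number, "lines": [], "notes": [], "seals": []}


def parse_atf(source):
    """Parse ATF text into a list of tablet dicts (illustrative only)."""
    tablets = []
    tablet = face = column = None
    in_note = False
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("#atf:"):
            continue  # blank lines and noise tokens
        header = TABLET_RE.match(line)
        if header:
            tablet = {"id": header.group(1), "label": header.group(2), "faces": []}
            tablets.append(tablet)
            face = column = None
            in_note = False
            continue
        if tablet is None:
            continue  # nothing to attach to before the first tablet header
        if line.startswith("@"):
            words = line[1:].split()
            in_note = False
            if words and words[0] in FACES:
                face = {"name": words[0], "columns": []}
                tablet["faces"].append(face)
                column = None
            elif words and words[0] == "column" and face is not None:
                column = new_column(words[1] if len(words) > 1 else None)
                face["columns"].append(column)
            continue  # @tablet and unknown markers are skipped
        if face is None:
            continue
        if column is None:
            # Assumption: text directly under a face goes in an unnumbered column.
            column = new_column(None)
            face["columns"].append(column)
        text = TEXT_RE.match(line)
        if text:
            column["lines"].append({"n": int(text.group(1)), "text": text.group(2)})
            in_note = False
        elif line.startswith("$"):
            column["seals"].append(line[1:].strip())
            in_note = False
        elif line.startswith("#"):
            column["notes"].append(line[1:].strip())
            in_note = True
        elif in_note:
            column["notes"][-1] += " " + line  # continuation of a multi-line note
    return tablets
```

Notes are copied rather than interpreted, in line with the "copy and ignore" remark in the outline.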

The text lines are individually structured and notated, but they actually run together to form a full text. In particular, the markers of parentage run across lines, as do other role markers, so we will build up a running text and parse it across line boundaries. Do we need to reference line numbers in the onomasticon, or is it sufficient to have the tablet and face? [Laurie: would be nice, but not important for now.] We could keep track of the line number in the parser if need be.
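
One possible way to parse across lines while still being able to recover a line number is to join the lines into a running text and remember the character offset at which each line starts. The sketch below is only an illustration: the names `join_lines` and `line_at` are hypothetical, lines are assumed to be joined with a single space, and the `PN1`-style tokens are placeholders rather than real tablet text.

```python
from bisect import bisect_right


def join_lines(lines):
    """Join (line_number, text) pairs into one running text.

    Returns the text plus a list of (start_offset, line_number) pairs,
    one per input line, in order.
    """
    parts = []
    starts = []
    offset = 0
    for number, text in lines:
        if parts:
            offset += 1  # account for the joining space
        starts.append((offset, number))
        parts.append(text)
        offset += len(text)
    return " ".join(parts), starts


def line_at(starts, position):
    """Return the line number that contains the given character offset."""
    offsets = [start for start, _ in starts]
    index = bisect_right(offsets, position) - 1
    return starts[max(index, 0)][1]


# Placeholder content: a parentage chain that spans three lines.
running, starts = join_lines([
    (1, "PN1 son of"),
    (2, "PN2 descendant of"),
    (3, "PN3"),
])
```

A parser that finds a name at some offset in `running` can then call `line_at(starts, offset)` if the onomasticon ever needs the line number.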

We should be able to mark up the input with TEI for structure, and then consider adding further markup as output. There is an open issue of how to make probabilistic associations to persons, including multiple candidates. Alternatively, where the identity is not clear we could create a separate entity, with notes on its associations to others. Something else then builds the prosopography and can present and eventually resolve (i.e., support expert/editorial resolution of) the ambiguities.

Ideas from TEI-Lite docs

<teiCorpus>
  <teiHeader>
    <!--[header information for the corpus]-->
  </teiHeader>
  <TEI>
    <teiHeader>
      <!--[header information for first text]-->
    </teiHeader>
    <text>
      <!--[first text in corpus]-->
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <!--[header information for second text]-->
    </teiHeader>
    <text>
      <!--[second text in corpus]-->
         <name type="foo">kjhskjhaskdjh</name>
    </text>
  </TEI>
</teiCorpus>

Define a corpus of the tablets, with a general header (we can put in some notes on what it is, e.g., from Laurie). Then, each item is a TEI block. The tablet number and other junk goes into the header (see also Front Matter), and then the rest is in the text blocks. Then probably use <group> to represent the different faces (or fragments).
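
To make the corpus layout concrete, here is a small Python sketch that builds the skeleton with the standard library's `xml.etree.ElementTree`. It is an illustration under several assumptions: the function names and the input dictionary layout are made up, the headers hold only a title (a schema-valid `<teiHeader>` needs more than this), and faces are written as `<div type="...">` inside `<body>` rather than as `<group>`, which is the other option mentioned above.

```python
import xml.etree.ElementTree as ET


def add_header(parent, title):
    """Attach a minimal header holding only a title (not schema-complete)."""
    header = ET.SubElement(parent, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    return header


def build_corpus(corpus_title, tablets):
    """Build a <teiCorpus> with one <TEI> block per tablet."""
    corpus = ET.Element("teiCorpus")
    add_header(corpus, corpus_title)
    for tablet in tablets:
        tei = ET.SubElement(corpus, "TEI")
        add_header(tei, tablet["id"])  # the tablet number goes in the header
        body = ET.SubElement(ET.SubElement(tei, "text"), "body")
        for face in tablet["faces"]:
            div = ET.SubElement(body, "div", {"type": face["name"]})
            for number, content in face["lines"]:
                # <lb/> marks the start of a line; the text follows as its tail.
                lb = ET.SubElement(div, "lb", {"n": str(number)})
                lb.tail = content
    return corpus


# Hypothetical input: one tablet with a two-line obverse.
corpus = build_corpus("Sample corpus", [
    {"id": "P0001", "faces": [
        {"name": "obverse", "lines": [(1, "text a"), (2, "text b")]},
    ]},
])
xml_text = ET.tostring(corpus, encoding="unicode")
```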

  • <div type="foo"> For structuring within a text. Note that while we have front, back, etc. of the tablets, the associated TEI elements have different semantics. We can define whatever vocabulary we want for @type, so that seems to fit our needs.
  • <lb/> (line break) marks the start of a new (typographic) line in some edition or version of a text.
  • <note> contains a note or annotation. This is a useful way to inject our notes.
  • Names and references. However, I prefer more specific features below, even though they are outside the TEI Lite spec (they are part of the full TEI).
    • <rs> (referencing string) contains a general purpose name or referring string.
    • <name> (name, proper noun) contains a proper noun or noun phrase.
    • The @type attribute is used to distinguish amongst (for example) names of persons, places and organizations, where this is possible.
    • The @key attribute provides an alternative normalized identifier for the object being named, like a database record key. It may thus be useful as a means of gathering together all references to the same individual or location scattered throughout a document
    • See also the presentation on TEI and prosop.
    • I actually like none of this, as it is too abstract. I am leaning towards the notion of a <person> element upon which we can put a key when and if we know who it is. Then we can put <persName> within to wrap the names. This takes further children <forename> and <surname>, each of which takes a type so we can distinguish between a father's name and a clan name (we could also use <addName> with a type for the clan name). At least one reference says that patronymics should all use <forename> but with type='patronym'. This is probably more consistent with the fact that the father's name can function logically as a forename when the father is the actual subject. We should do this, and then mark the clan name with either <surname> or <addName>. Where there are honorifics, we can mark these with <roleName> (type values commonly used include: nobility, honorific, office, military, and epithet). For all of the name elements:
      • @key allows us to point to the DB id for this name (if we know it, e.g., when we know the clan name proper and this is a variant orthography).
      • @reg allows us to put in the regularized form (we are unlikely to do this in the input, but we might consider it for some output).
      • Additional markup can indicate age, sex, occupation, nationality, socio-economic status, etc.
  • Dates and Times.
  • There is a bunch of support for place as well. All these take @key, @reg, and @type as for the names above.
    • <placeName> contains an absolute or relative place name.
    • <settlement> contains the name of the smallest component of a place name expressed as a hierarchy of geo-political or administrative units as in "Rochester", New York; "Glasgow", Scotland.
    • <region> in an address, contains the state, province, county or region name; in a place name given as a hierarchy of geo-political units, the <region> is larger or administratively superior to the <settlement> and smaller or administratively less important than the <country>.
    • <country> in an address, gives the name of the nation, country, colony, or commonwealth; in a place name given as a hierarchy of geo-political units, the <country> is larger or administratively superior to the <region> and smaller than the <bloc>.
    • <bloc> a geo-political unit containing one or more nation states.
    • <geogName> a name associated with some geographical feature such as "Windrush Valley" or "Mount Sinai".
    • <geog> contains a common noun identifying some geographical feature contained within a geographic name, such as "valley", "mount" etc.
    • <distance> that part of a relative temporal or spatial expression which indicates the distance between the place or time denoted by it and the place or time referred to within it.
    • <offset> that part of a relative temporal or spatial expression which indicates the direction of the offset between the two place names, dates, or times involved in the expression.
  • There is a proposed element to encode events in someone's life: <persEvent>. See also M. J. Driscoll's paper. The element takes a @type attribute, for which we can define our vocabulary.
  • Another element of possible interest declares relations between people, but it looks pretty funky and too lightweight (if you're going this far, use some RDF).
  • Driscoll describes an @cert attribute to mark uncertainty. TEI defines a <certainty> element that looks awkward to use.
  • TEI has a module for damage indication. However, if some section is completely illegible or gone, they prescribe a separate set of elements (particularly <gap>).
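
The <person>/<persName> idea in the list above could look roughly like the following Python sketch. It is only an illustration: the function name, the `clan` value for @type, the decision to mark the clan name with <addName>, and placing @cert on <person> are assumptions, and the names passed in are placeholders.

```python
import xml.etree.ElementTree as ET


def person_element(forename, patronym=None, clan=None, key=None, cert=None):
    """Build a <person> wrapping a <persName> (hypothetical BPS layout)."""
    person = ET.Element("person")
    if key is not None:
        person.set("key", key)    # only when we know who this is
    if cert is not None:
        person.set("cert", cert)  # assumption: certainty recorded on <person>
    pers_name = ET.SubElement(person, "persName")
    ET.SubElement(pers_name, "forename").text = forename
    if patronym is not None:
        # The father's name is a forename with type='patronym'.
        ET.SubElement(pers_name, "forename", {"type": "patronym"}).text = patronym
    if clan is not None:
        # Assumption: clan name marked with <addName type="clan">.
        ET.SubElement(pers_name, "addName", {"type": "clan"}).text = clan
    return person


# Placeholder names; no key yet because the person is not identified.
unknown = person_element("Name-A", patronym="Name-B", clan="Clan-C", cert="low")
known = person_element("Name-A", patronym="Name-B", key="p42")
```

Leaving @key off until the person is identified keeps the input markup purely structural, so a later stage can add keys as ambiguities are resolved.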

Examples (taken from UMich TEI site):

<persName key="pn9">
  <forename sort="2">Sergei</forename>
  <forename sort="3" type="patronym">Mikhailovic</forename>
  <surname sort="1">Uspensky</surname>
</persName>

This example also demonstrates the use of the sort attribute common to all members of the personPart class; its effect is to state the sequence in which <forename> and <surname> elements should be combined when constructing a sort key for the name.
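
To illustrate how the sort attribute could be consumed, the following Python sketch orders the name parts by their sort value and joins them. The function name `sort_key` and the choice of a single space as separator are assumptions; parts without a sort attribute are simply left out.

```python
import xml.etree.ElementTree as ET


def sort_key(pers_name):
    """Combine the children of a <persName> in the order given by @sort."""
    parts = []
    for child in pers_name:
        position = child.get("sort")
        if position is not None:
            parts.append((int(position), (child.text or "").strip()))
    parts.sort()
    return " ".join(text for _, text in parts)


example = ET.fromstring(
    '<persName key="pn9">'
    '<forename sort="2">Sergei</forename>'
    '<forename sort="3" type="patronym">Mikhailovic</forename>'
    '<surname sort="1">Uspensky</surname>'
    '</persName>'
)
print(sort_key(example))  # prints "Uspensky Sergei Mikhailovic"
```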

<persName key="MRSRO1">
  <addName type="honorific">Mrs</addName>
  <surname>Robinson</surname>
</persName>

<persName key="FRTG1">
  <forename>Frederick</forename>
  <addName type="epithet">the Great</addName>
</persName>

<placeName>
  <settlement key="RNY1" type="city">Rochester</settlement>
</placeName>

<geogName key="MIRI1" type="river">
  <name>Mississippi</name>
  <geog>River</geog>
</geogName>

<placeName key="NEPA1">
  <distance>10 miles</distance>
  <offset>north of</offset>
  <settlement type="city">Paris</settlement>
</placeName>
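
The @key attributes in these examples are what would let us gather all references to the same person or place. As a hedged sketch of that idea, the Python below walks a document and groups the text of every keyed element by its key. The function name `index_by_key` is hypothetical, and the sample string is made up for the illustration (including the second, variant spelling under the same key).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_by_key(root):
    """Map each @key value to the text of the elements that carry it."""
    index = defaultdict(list)
    for element in root.iter():
        key = element.get("key")
        if key is not None:
            # Flatten nested name parts and normalize whitespace.
            text = " ".join("".join(element.itertext()).split())
            index[key].append(text)
    return dict(index)


# Made-up sample: two references share one key, a third has its own.
sample = ET.fromstring(
    '<text>'
    '<persName key="FRTG1"><forename>Frederick</forename> '
    '<addName type="epithet">the Great</addName></persName> and '
    '<persName key="FRTG1"><forename>Friedrich</forename></persName> near '
    '<settlement key="RNY1" type="city">Rochester</settlement>'
    '</text>'
)
references = index_by_key(sample)
```

Elements without a key are skipped, so names we have not yet identified stay out of the index.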

Can then add new elements as we see fit, especially to define the structure we want to. Other tools can ignore.

UI Notes

The PASE project has a UI help file that shows some screenshots and explains how their UI works. There seem to be some good ideas there.
