Updated Mar 24, 2008 by jolonf
GoBibleDataFormat  
The Go Bible Data Format

This page describes the data files that are produced by GoBibleCreator and packaged into the final JAR. It also describes their format.

There are three types of data files produced by GoBibleCreator:

  • Global Index File
  • Book Index Files
  • Verse Data Files

These appear in the JAR file as follows:

  • Bible Data/Index (Global Index File)
  • Bible Data/[Book Name]/Index (Book Index File)
  • Bible Data/[Book Name]/[Book Name] [File Number] (Verse Data File)

As an example, here are the files produced for a collection which only contains the book of Matthew:

  • Bible Data/Index
  • Bible Data/Matthew/Index
  • Bible Data/Matthew/Matthew 0
  • Bible Data/Matthew/Matthew 1
  • Bible Data/Matthew/Matthew 2
  • Bible Data/Matthew/Matthew 3
  • Bible Data/Matthew/Matthew 4
  • Bible Data/Matthew/Matthew 5
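
The layout above can be sketched as a small path helper. This is only an illustration: the class and method names are hypothetical and are not taken from GoBibleCore, and how the resulting path is opened inside the JAR is left to the caller.

```java
/** Builds the resource paths described above. Names are illustrative only. */
final class GoBiblePaths {
    static final String ROOT = "Bible Data";

    /** Path of the Global Index File. */
    static String globalIndex() {
        return ROOT + "/Index";
    }

    /** Path of the Book Index File for one book. */
    static String bookIndex(String bookFileName) {
        return ROOT + "/" + bookFileName + "/Index";
    }

    /** Path of one Verse Data File; the example above numbers files from 0. */
    static String verseData(String bookFileName, int fileNumber) {
        return ROOT + "/" + bookFileName + "/" + bookFileName + " " + fileNumber;
    }
}
```

For the Matthew example, `GoBiblePaths.verseData("Matthew", 3)` gives `Bible Data/Matthew/Matthew 3`.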

The files named "Matthew n" contain the actual verse data. GoBibleCreator splits the verse data into multiple files to speed up file loading and to limit the memory used. GoBibleCreator currently limits each file to 24KB, so a book that exceeds this size is split across multiple verse data files. When a user searches or navigates in Go Bible, GoBibleCore loads each verse data file as it is needed and releases the previous one. This optimisation may not be needed on modern devices, which may be able to load entire books into memory very quickly (this hasn't been tested).

The following sections describe the format of each data file type.

The Global Index File: Bible Data/Index

The file stored at Bible Data/Index is the Global Index File and contains the information needed by GoBibleCore to locate a chapter quickly.

It has the following format:

  • byte - Number of books
  • [For each book]
    • utf - Book display name
    • utf - Book's file name
      • Used for both the book directory and the verse data files. This is an ASCII-safe version of the name, as opposed to the display name, which can contain any Unicode characters. Note: book file names can contain spaces, e.g. "1 Corinthians", and by default GoBibleCreator will output file names with spaces.
    • short - Start chapter (version 2.2.3 and earlier used byte instead of short)
      • Chapter numbers start at 1, or at whatever was read in from the source ThML or OSIS file. Theoretically a chapter could start at 0, but I haven't seen this yet, and it would result in the user seeing a chapter 0 on screen. This field is only larger than 1 if a book must be split across different collections (i.e. JAR files). That was a requirement when devices had JAR size restrictions smaller than the largest book (Psalms). As far as I am aware, no MIDP 2.0 phone has a memory limit which requires books to be split.
    • short - Number of chapters (version 2.2.3 and earlier used byte instead of short)
      • If the book has been split then this will contain only the number of chapters in this collection, but normally it will be the total number of chapters in the book.
    • [For each chapter]
      • byte - The file number which contains this chapter (so that GoBibleCore knows where to look for it)
      • int - Number of characters in the chapter (this is required so that GoBibleCore knows where in the verse data file a chapter begins)
      • byte - Number of verses in the chapter
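
The per-chapter file number and character count are enough to work out where a chapter begins. A minimal sketch of that calculation follows. It assumes chapters are stored in order within each verse data file and that the offset is measured from the start of the verse data in that file. Neither point is stated explicitly on this page, and the names are hypothetical.

```java
/** Derives chapter start offsets from the Global Index fields. Illustrative only. */
final class ChapterOffsets {
    /**
     * Returns the offset of a chapter relative to the start of the verse data
     * in its file, in the same unit as the lengths stored in the index.
     *
     * @param chapterIndex   0-based index of the chapter within the book
     * @param fileNumbers    file number of every chapter in the book
     * @param chapterLengths length of every chapter in the book
     */
    static int offsetOf(int chapterIndex, int[] fileNumbers, int[] chapterLengths) {
        int offset = 0;
        for (int i = 0; i < chapterIndex; i++) {
            // Only earlier chapters stored in the same file push this one back.
            if (fileNumbers[i] == fileNumbers[chapterIndex]) {
                offset += chapterLengths[i];
            }
        }
        return offset;
    }
}
```

The first chapter of each file gets offset 0, and each later chapter in that file starts where the previous one ended.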

Note: the data types used above are the standard Java data types. GoBibleCreator uses java.io.DataOutputStream to create the file. This means that the utf strings are not standard UTF-8 but are in Java's modified UTF-8 format, as written by writeUTF. Also, all primitive types are signed. See the java.io.DataOutputStream class reference in the Java API documentation for more information about its format.
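
As a rough illustration of the layout, the sketch below writes a one-book index with java.io.DataOutputStream and reads it back with java.io.DataInputStream, using the short variant for start chapter and number of chapters. The class, field names and character counts are made up for the example. The one-byte counts are read with readUnsignedByte on the assumption that values up to 255 are intended, which the limits quoted in the comments below suggest but the format description does not state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Round-trips a tiny Global Index. A sketch, not GoBibleCore's actual reader. */
final class GlobalIndexSketch {
    /** One book entry; field names are illustrative. */
    static final class Book {
        String displayName;
        String fileName;
        int startChapter;
        int[] fileNumbers;     // per chapter
        int[] chapterLengths;  // per chapter
        int[] verseCounts;     // per chapter
    }

    /** Writes a one-book sample index with invented chapter lengths. */
    static byte[] writeSample() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(1);               // number of books
            out.writeUTF("Matth\u00e4us");  // display name (any Unicode)
            out.writeUTF("Matthew");        // ASCII-safe file name
            out.writeShort(1);              // start chapter
            out.writeShort(2);              // number of chapters
            out.writeByte(0); out.writeInt(3000); out.writeByte(25); // chapter 1
            out.writeByte(0); out.writeInt(2800); out.writeByte(23); // chapter 2
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads an index in the layout listed above (short variant). */
    static Book[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            Book[] books = new Book[in.readUnsignedByte()];
            for (int b = 0; b < books.length; b++) {
                Book book = new Book();
                book.displayName = in.readUTF();  // modified UTF-8 with length prefix
                book.fileName = in.readUTF();
                book.startChapter = in.readShort();
                int chapters = in.readShort();
                book.fileNumbers = new int[chapters];
                book.chapterLengths = new int[chapters];
                book.verseCounts = new int[chapters];
                for (int c = 0; c < chapters; c++) {
                    book.fileNumbers[c] = in.readUnsignedByte();
                    book.chapterLengths[c] = in.readInt();
                    book.verseCounts[c] = in.readUnsignedByte();
                }
                books[b] = book;
            }
            return books;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `GlobalIndexSketch.read(GlobalIndexSketch.writeSample())` returns a single book whose file name is "Matthew" and whose two chapters have 25 and 23 verses.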

The Book Index File: Bible Data/[Book Name]/Index

The Book Index File simply contains the length of each verse of every chapter. This is used to quickly locate a verse.
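
One way to use the verse lengths for quick lookup is to turn them into running totals once, so that the start of any verse is a single array read. The sketch below takes the lengths as an already-parsed int array, because this page does not specify the data type used in the Book Index File. The class and method names are hypothetical.

```java
/** Converts verse lengths into start offsets. Illustrative only. */
final class VerseStarts {
    /**
     * Returns an array with one more entry than verseLengths:
     * entry i is the start of verse i (0-based) relative to the start of
     * the chapter, and the last entry is the total chapter length.
     */
    static int[] fromLengths(int[] verseLengths) {
        int[] starts = new int[verseLengths.length + 1];
        for (int i = 0; i < verseLengths.length; i++) {
            starts[i + 1] = starts[i] + verseLengths[i];
        }
        return starts;
    }
}
```

With lengths 10, 20 and 30, the third verse starts at offset 30 and the chapter is 60 units long.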

The Verse Data File: Bible Data/[Book Name]/[Book Name] [File Number]

Unlike the Java utf strings stored in the Global Index File, the verse data is stored as true UTF-8. The file has the following format:

Note that the verse data consists of all verses from all chapters for this particular verse data file appended together. There is no separation between verses. ie. There is no end-of-line delimiter or null terminator. The only way to know the length of the verse is to use the verse lengths from the Book Index File.
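
Since the verses are simply appended together, a reader has to cut them apart using the lengths. A minimal sketch follows, with one important assumption: the lengths are taken to count Java chars of the decoded text rather than bytes of the UTF-8 data. This page speaks of a "number of characters" but does not settle the unit, so check it against real files. The names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Cuts concatenated verse data into verses. A sketch under stated assumptions. */
final class VerseSlicer {
    /**
     * Splits verse data into verses using the lengths from the Book Index File.
     * Assumes each length is a count of chars in the decoded text.
     */
    static String[] split(byte[] utf8VerseData, int[] verseLengths) {
        String text = new String(utf8VerseData, StandardCharsets.UTF_8);
        String[] verses = new String[verseLengths.length];
        int start = 0;
        for (int i = 0; i < verseLengths.length; i++) {
            // No delimiter exists, so the length alone marks the verse end.
            verses[i] = text.substring(start, start + verseLengths[i]);
            start += verseLengths[i];
        }
        return verses;
    }
}
```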

The only unusual aspect of the verse data is that style markup can be inserted. Currently this has only been used to indicate Christ's words in red. It could also be used for italics, but the code for that is not yet in GoBibleCreator or GoBibleCore, even though it would be relatively simple to add.

The red style is turned on with the UTF-8 character 0x1 and turned off again with the same character. Since Go Bible has very little need for standard ASCII control characters I envisage that any UTF-8 character below 10 could probably be used without an issue.
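
The toggle behaviour can be sketched as a small parser that splits a verse into plain and red runs. This is not GoBibleCore's rendering code, and the class names are hypothetical. The sketch starts every verse in the plain state, because this page does not say whether the red state carries over from one verse to the next.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a verse into runs of plain and red text. Illustrative only. */
final class RedLetterParser {
    /** The character that turns the red style on and off. */
    static final char RED_TOGGLE = '\u0001';

    /** A stretch of text with a single style. */
    static final class Run {
        final String text;
        final boolean red;

        Run(String text, boolean red) {
            this.text = text;
            this.red = red;
        }
    }

    /** Parses one verse, assuming it starts in the plain (non-red) state. */
    static List<Run> parse(String verse) {
        List<Run> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean red = false;
        for (int i = 0; i < verse.length(); i++) {
            char c = verse.charAt(i);
            if (c == RED_TOGGLE) {
                // Close the current run, then flip the style.
                if (current.length() > 0) {
                    runs.add(new Run(current.toString(), red));
                    current.setLength(0);
                }
                red = !red;
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            runs.add(new Run(current.toString(), red));
        }
        return runs;
    }
}
```

The toggle characters themselves never appear in the returned runs, so a viewer that ignores styling can just concatenate the run texts.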


Comment by ado.hoppo, Mar 24, 2008

Two corrections: start chapter and number of chapters are byte, not short typed.

And I recommend reading these API pages to understand the Java UTF-8 format:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataOutputStream.html#writeUTF(java.lang.String)

http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8

Comment by jolonf, Mar 24, 2008

Actually start chapter and number of chapters are shorts but this change only appeared in GoBibleCreator 2.2.3 (which is the version in SVN). I haven't uploaded the built version of GoBibleCreator 2.2.3 to the Go Bible web page, but I should (it currently has version 2.2.2).

Comment by jolonf, Mar 24, 2008

The Go Bible web site now has GoBibleCreator 2.2.3 which should produce files containing shorts for start chapter and number of chapters:

http://gobible.jolon.org/developer/GoBibleCreator/GoBibleCreator.html

Comment by ado.hoppo, Mar 24, 2008

well I've downloaded KJV 2.2.3 and it has start chapter and chapter count stored in one byte, not in two (so I am a bit confused)

Comment by jolonf, Mar 24, 2008

Actually I think what has happened here is that I made the change to 2.2.3 without bumping the version number. I have then uploaded this version to SVN, however the KJV 2.2.3 on the Go Bible website was produced before I made the change. So really the version in SVN should have a bumped version number.

This is an unfortunate circumstance since it means that some 2.2.3 collections will be using bytes and others will be using shorts. In the past this hasn't been an issue since GoBibleCore and GoBibleCreator are intrinsically linked.

I actually have a GoBibleCreator 2.2.4 which fixes some ThML bugs. The ideal solution would probably be to release 2.2.4 and update KJV 2.2.3 to 2.2.4. We can then say that 2.2.4 and onwards uses shorts and earlier versions use bytes. However, it is possible that there are some collections in the wild that are 2.2.3 and use shorts.

Like I said this sort of thing hasn't been an issue in the past but it will be an issue for viewers other than GoBibleCore which rely on the GoBibleDataFormat. Now that the source is in SVN it should make managing this type of thing a bit easier.

Comment by jolonf, Mar 24, 2008

Okay, I've committed my 2.2.4 version to SVN (which uses shorts) and tagged it as 2.2.4. I've built and packaged a GoBibleCreator 2.2.4 which is now on the Go Bible web site.

So I think we can say that versions earlier than 2.2.4 use bytes, 2.2.4 and later use shorts.
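
For a third-party viewer, the difference comes down to how two fields are read. A hedged sketch follows. The `usesShorts` flag is hypothetical and is supplied by the caller, since nothing on this page says how a reader can detect which variant a collection uses. The one-byte variant is read as unsigned here, on the reasoning that Psalms has 150 chapters, which does not fit in a signed byte.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Reads start chapter and chapter count in either variant. Illustrative only. */
final class ChapterFieldReader {
    /**
     * @param data       the two fields as they appear in the Global Index
     * @param usesShorts true for the short variant (2.2.4 and later, plus
     *                   some 2.2.3 collections), false for the byte variant
     * @return a two-element array: start chapter, number of chapters
     */
    static int[] read(byte[] data, boolean usesShorts) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int startChapter = usesShorts ? in.readShort() : in.readUnsignedByte();
            int chapterCount = usesShorts ? in.readShort() : in.readUnsignedByte();
            return new int[] { startChapter, chapterCount };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```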

Comment by ado.hoppo, Mar 24, 2008

Every problem may be smartly and easily solved but only in case we know there is one...

With this information I'll find a way to work around this small problem =)

Comment by jolonf, Mar 24, 2008

Very true, thanks for pointing out these problems Hoppo.

Comment by DFHMCH, Apr 08, 2008

Please add a summary section to describe the limits for:

  • maximum number of characters allowed in a verse
  • maximum number of verses allowed in a chapter
  • maximum number of chapters allowed in a book
  • maximum number of books allowed in a collection

Comment by DFHMCH, Apr 08, 2008

Please describe how any special constructions such as XML character codes are processed by GoBibleCreator, e.g.

  • &

Please check the facts! cf issue 22.

Comment by jolonf, Apr 08, 2008

  • max characters per verse: 32767
  • max verses per chapter: 255
  • max chapters per book: 32767
  • max books per collection: 255

The limits on verses per chapter and books per collection seem unnecessarily small and can probably be increased to 32767 or even 2^32.
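
These figures line up with the ranges of the Java types involved: 32767 is Short.MAX_VALUE and 255 is the largest value of an unsigned byte. A small sketch of a pre-flight check built on the four limits follows. The class is hypothetical, and whether GoBibleCreator performs any such check itself is not stated here.

```java
/** The limits listed above, plus a simple check. Illustrative only. */
final class FormatLimits {
    static final int MAX_CHARS_PER_VERSE = Short.MAX_VALUE;   // 32767
    static final int MAX_VERSES_PER_CHAPTER = 0xFF;           // 255
    static final int MAX_CHAPTERS_PER_BOOK = Short.MAX_VALUE; // 32767
    static final int MAX_BOOKS_PER_COLLECTION = 0xFF;         // 255

    /** Returns null when all counts fit, else a message for the first one that does not. */
    static String check(int books, int chapters, int verses, int verseChars) {
        if (books > MAX_BOOKS_PER_COLLECTION) return "too many books: " + books;
        if (chapters > MAX_CHAPTERS_PER_BOOK) return "too many chapters: " + chapters;
        if (verses > MAX_VERSES_PER_CHAPTER) return "too many verses: " + verses;
        if (verseChars > MAX_CHARS_PER_VERSE) return "verse too long: " + verseChars;
        return null;
    }
}
```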

Comment by jolonf, Apr 08, 2008

"Please describe how any special constructions such as XML character codes are processed by GoBibleCreator"

All characters are encoded in UTF-8. Special characters like & might be better described in the GoBibleCreator documentation. However, special markup like Christ's words in red should be described here. Christ's words in red begin with the Unicode value 0x01 and end with the same value. It's possible that other values such as 0x02, 0x03, 0x04 and so on may be used for italics, bold, headings, etc. in the future.

Comment by khampathosting, Apr 29, 2008

In future, support for line breaks would be very useful.

Comment by DFHMCH, May 02, 2008

All Go Bible source text has to be encoded as UTF-8 (without BOM), but which Normalization Form is preferred or required? Is it NFC?

See Unicode Standard Annex #15 - Unicode Normalization Forms

Comment by DFHMCH, May 17, 2008

In the minimum requirements for ThML and OSIS files shown in the Go Bible website pages, it is clear from the examples that GoBibleCreator relies on the sequential order of the scripture text items to number the verses, and does not pick up verse numbers from within XML tags.

Thus one assumption in GoBibleDataFormat is that all translations of the Bible are like the KJV in having one unique verse number tag for each scripture text.

Such an assumption is actually an unwarranted presupposition, as there are many Bible translations where the translators have had to combine more than one scripture verse into a sentence or paragraph and tag this with a verse range. This is necessary when the translated text does not have any obvious position in which to place the verse breaks that correspond to KJV verse numbering.

The consequence of this is that users of GoBibleCreator generally need to take extra care when preparing ThML or OSIS source text files. In such circumstances, they would have to adopt a suitable workaround so as to ensure that subsequent verses are numbered correctly.

Such a workaround may be in the form of allocating a continuation device for the subverses in these verse-subverse combinations. I have often used a pair of brackets for this device.

It is important therefore that after building an application with GoBibleCreator, the programmer should check that every chapter contains the correct number of verses, and every book the correct number of chapters.

Comment by DFHMCH, Jan 28, 2009

What is the maximum size for the Info: line in the collections text file?

Is there any way in which this line can be formatted so that it displays better under the Menu | About option?

