GoBibleDataFormat
The Go Bible Data Format
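This page describes the data files that GoBibleCreator produces and packages into the final JAR, along with their format. GoBibleCreator produces three types of data file:
the Global Index File
the Book Index File
the Verse Data Files
These appear in the JAR file as follows:
Bible Data/Index
Bible Data/[Book Name]/Index
Bible Data/[Book Name]/[Book Name] [File Number]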
As an example, here are the files produced for a collection which only contains the book of Matthew:
The files named "Matthew n" contain the actual verse data. GoBibleCreator splits the verse data into multiple files to speed up file loading and to limit the memory used. GoBibleCreator currently has a maximum file size limit of 24 KB, so a book that exceeds this size is split across multiple verse data files. When a user searches or navigates in Go Bible, GoBibleCore loads each verse data file as it is needed and releases the previous one. This optimisation may not be needed on modern devices, which may be able to load entire books into memory very quickly (this hasn't been tested).
The following sections describe the format of each data file type.
The Global Index File: Bible Data/Index
The file stored at Bible Data/Index is the Global Index File and contains the information needed by GoBibleCore to locate a chapter quickly. It has the following format:
Note: the data types used above are the standard Java data types. GoBibleCreator uses java.io.DataOutputStream to create the file. This means that the UTF strings are not standard UTF-8 but Java's modified UTF-8 format. All primitive types are signed. See the java.io.DataOutputStream class reference in the Java API documentation for more information about its format.
The Book Index File: Bible Data/[Book Name]/Index
The Book Index File simply contains the length of each verse of every chapter. This is used to quickly locate a verse.
The Verse Data File: Bible Data/[Book Name]/[Book Name] [File Number]
Unlike the Java UTF strings stored in the Global Index File, the verse data is stored as true UTF-8. The file has the following format:
Note that the verse data consists of all verses from all chapters for this particular verse data file appended together. There is no separation between verses, i.e. there is no end-of-line delimiter or null terminator. The only way to know the length of a verse is to use the verse lengths from the Book Index File.
The only unusual aspect of the verse data is that style markup can be inserted. Currently this has only been used to indicate Christ's words in red. It could also be used for italics, but the code for that is not there in GoBibleCreator or GoBibleCore, even though it would be relatively simple to add. The red style is turned on with the UTF-8 character 0x1 and turned off again with the same character. Since Go Bible has very little need for standard ASCII control characters, I envisage that any UTF-8 character below 10 could probably be used without an issue.
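As a rough illustration of the verse lookup described above (no delimiters, lengths taken from the Book Index File), the following Java sketch walks a list of verse lengths to cut individual verses out of a verse data file. It assumes that the lengths count UTF-8 bytes and have already been read from the Book Index File. The names VerseSlicer and splitVerses are hypothetical and are not taken from GoBibleCore.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class VerseSlicer {
    /**
     * Cuts concatenated verse data into individual verses.
     * Assumption of this sketch: each length is a count of UTF-8 bytes.
     */
    static List<String> splitVerses(byte[] verseData, int[] verseLengths) {
        List<String> verses = new ArrayList<>();
        int offset = 0;
        for (int length : verseLengths) {
            if (length < 0 || offset + length > verseData.length) {
                throw new IllegalArgumentException("Verse lengths do not match the verse data");
            }
            // Decode exactly one verse; any style markers stay in the text.
            verses.add(new String(verseData, offset, length, StandardCharsets.UTF_8));
            offset += length;
        }
        return verses;
    }

    public static void main(String[] args) {
        byte[] first = "First verse.".getBytes(StandardCharsets.UTF_8);
        byte[] second = "Second verse.".getBytes(StandardCharsets.UTF_8);
        // Append the verses with no separator, as the verse data file does.
        byte[] data = new byte[first.length + second.length];
        System.arraycopy(first, 0, data, 0, first.length);
        System.arraycopy(second, 0, data, first.length, second.length);
        for (String verse : splitVerses(data, new int[] {first.length, second.length})) {
            System.out.println(verse);
        }
    }
}
```

To reach a single verse without decoding the ones before it, a reader could instead sum the preceding lengths to get the starting offset.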
Two corrections: start chapter and number of chapters are typed as byte, not short.
I also recommend reading these API pages to understand Java's UTF-8 format:
http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataOutputStream.html#writeUTF(java.lang.String)
http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8
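A small sketch of the difference those two references describe: DataOutputStream.writeUTF prefixes the string with a two-byte length and encodes the NUL character as two bytes, so its output is not byte-identical to standard UTF-8. The class name ModifiedUtf8Demo and its helper methods are hypothetical, used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    /** Encodes a string the way DataOutputStream.writeUTF does. */
    static byte[] writeJavaUtf(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decodes bytes that were produced by writeUTF. */
    static String readJavaUtf(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "Matthew";
        // Two-byte length prefix plus seven ASCII bytes.
        System.out.println(writeJavaUtf(name).length);
        // Standard UTF-8 has no prefix: seven bytes.
        System.out.println(name.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(readJavaUtf(writeJavaUtf(name)));
    }
}
```

A viewer written in another language would need to read the length prefix itself and then decode that many bytes, taking care with the characters that modified UTF-8 encodes differently.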
Actually start chapter and number of chapters are shorts but this change only appeared in GoBibleCreator 2.2.3 (which is the version in SVN). I haven't uploaded the built version of GoBibleCreator 2.2.3 to the Go Bible web page, but I should (it currently has version 2.2.2).
The Go Bible web site now has GoBibleCreator 2.2.3 which should produce files containing shorts for start chapter and number of chapters:
http://gobible.jolon.org/developer/GoBibleCreator/GoBibleCreator.html
Well, I've downloaded KJV 2.2.3 and it has start chapter and chapter count stored in one byte each, not in two (so I am a bit confused).
Actually I think what has happened here is that I made the change to 2.2.3 without bumping the version number. I have then uploaded this version to SVN, however the KJV 2.2.3 on the Go Bible website was produced before I made the change. So really the version in SVN should have a bumped version number.
This is an unfortunate circumstance since it means that some 2.2.3 collections will be using bytes and others will be using shorts. In the past this hasn't been an issue since GoBibleCore and GoBibleCreator are intrinsically linked.
I actually have a GoBibleCreator 2.2.4 which fixes some ThML bugs. The ideal solution would probably be to release 2.2.4 and update KJV 2.2.3 to 2.2.4. We can then say that 2.2.4 and onwards uses shorts and earlier versions use bytes. However, it is possible that there are some collections in the wild that are 2.2.3 and use shorts.
Like I said this sort of thing hasn't been an issue in the past but it will be an issue for viewers other than GoBibleCore which rely on the GoBibleDataFormat. Now that the source is in SVN it should make managing this type of thing a bit easier.
Okay, I've committed my 2.2.4 version to SVN (which uses shorts) and tagged it as 2.2.4. I've built and packaged a GoBibleCreator 2.2.4 which is now on the Go Bible web site.
So I think we can say that versions earlier than 2.2.4 use bytes, 2.2.4 and later use shorts.
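For viewers other than GoBibleCore, the rule above might be handled along these lines. This is only a sketch: how a viewer learns which GoBibleCreator version built a collection is not described on this page, so the version string is simply passed in, and ChapterFieldReader with its methods is a hypothetical name. As mentioned earlier in this thread, some 2.2.3 collections may already use shorts, which a version check alone cannot detect.

```java
import java.nio.ByteBuffer;

class ChapterFieldReader {
    /** Returns true for GoBibleCreator 2.2.4 and later, false for earlier versions. */
    static boolean usesShorts(String creatorVersion) {
        int[] minimum = {2, 2, 4};
        String[] parts = creatorVersion.split("\\.");
        for (int i = 0; i < minimum.length; i++) {
            // Missing components count as zero, so "2.3" is treated as 2.3.0.
            int part = i < parts.length ? Integer.parseInt(parts[i]) : 0;
            if (part != minimum[i]) {
                return part > minimum[i];
            }
        }
        return true;
    }

    /**
     * Reads a start chapter or chapter count field at the given offset.
     * Values are read as signed and big-endian, matching java.io.DataOutputStream.
     */
    static int readChapterField(byte[] index, int offset, boolean shorts) {
        ByteBuffer buffer = ByteBuffer.wrap(index); // big-endian by default
        return shorts ? buffer.getShort(offset) : buffer.get(offset);
    }

    public static void main(String[] args) {
        byte[] asShort = {0, 28};
        byte[] asByte = {28};
        System.out.println(readChapterField(asShort, 0, usesShorts("2.2.4")));
        System.out.println(readChapterField(asByte, 0, usesShorts("2.2.2")));
    }
}
```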
Every problem can be solved smartly and easily, but only if we know there is one...
With this information I'll find a way to work around this small problem =)
Very true, thanks for pointing out these problems Hoppo.
Please add a summary section to describe the limits for:
Please describe how any special constructions such as XML character codes are processed by GoBibleCreator, e.g.
Please check the facts! cf issue 22.
The limit on verses per chapter and books per collection seems unnecessarily small and can probably be increased to 32767 or even 2^32.
All characters are encoded in UTF-8. Special characters like & might be better described in the GoBibleCreator documentation. However, special markup like Christ's words in red should be described here. Christ's words in red begin with the Unicode value 0x01 and end with the same value. It's possible that other values such as 0x02, 0x03, 0x04 and so on may be used for italics, bold, headings, etc. in the future.
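To make the red markup concrete, here is a minimal Java sketch of how a viewer might split one decoded verse into plain and red spans by toggling on 0x01. The names RedLetterParser and Span are hypothetical, and this is not GoBibleCore's actual rendering code. Other control values are simply left in the text, since no meaning is defined for them yet.

```java
import java.util.ArrayList;
import java.util.List;

class RedLetterParser {
    /** The control character that switches the red style on and off. */
    static final char RED_TOGGLE = '\u0001';

    /** A run of verse text together with its style. */
    static final class Span {
        final String text;
        final boolean red;

        Span(String text, boolean red) {
            this.text = text;
            this.red = red;
        }
    }

    /** Splits a verse into spans, flipping the red flag at each toggle character. */
    static List<Span> parse(String verse) {
        List<Span> spans = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean red = false;
        for (int i = 0; i < verse.length(); i++) {
            char c = verse.charAt(i);
            if (c == RED_TOGGLE) {
                // Close the span collected so far, then switch style.
                if (current.length() > 0) {
                    spans.add(new Span(current.toString(), red));
                    current.setLength(0);
                }
                red = !red;
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            spans.add(new Span(current.toString(), red));
        }
        return spans;
    }

    public static void main(String[] args) {
        for (Span span : parse("He said, \u0001Follow me.\u0001")) {
            System.out.println((span.red ? "[red] " : "[plain] ") + span.text);
        }
    }
}
```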
In future, support for line breaks would be very useful.
All Go Bible source texts have to be encoded as UTF-8 (without BOM), but which Normalization Form is preferred or required? Is it NFC?
See Unicode Standard Annex #15 - Unicode Normalization Forms
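Whichever form turns out to be required, a source text can be checked and converted with java.text.Normalizer before it is passed to GoBibleCreator. The sketch below assumes NFC purely for illustration, since this page does not state a requirement, and NormalizationCheck is a hypothetical helper name.

```java
import java.text.Normalizer;

class NormalizationCheck {
    /** Returns true if the text is already in NFC. */
    static boolean isNfc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    /** Converts the text to NFC. */
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "e" followed by a combining acute accent: a decomposed sequence.
        String decomposed = "e\u0301";
        System.out.println(isNfc(decomposed));
        System.out.println(toNfc(decomposed).length());
    }
}
```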
In the minimum requirements for ThML and OSIS files shown in the Go Bible website pages, it is clear from the examples that GoBibleCreator relies on the sequential order of the scripture text items to number the verses, and does not pick up verse numbers from within XML tags.
Thus one assumption in GoBibleDataFormat is that all translations of the Bible are like the KJV in having one unique verse number tag for each scripture text.
Such an assumption is actually an unwarranted presupposition, as there are many Bible translations where the translators have had to combine more than one scripture verse into a sentence or paragraph and tag this with a verse range. This is necessary when the translated text does not have any obvious position in which to place the verse breaks that correspond to KJV verse numbering.
The consequence of this is that users of GoBibleCreator generally need to take extra care when preparing ThML or OSIS source text files. In such circumstances, they would have to adopt a suitable workaround so as to ensure that subsequent verses are numbered correctly.
Such a workaround may be in the form of allocating a continuation device for the subverses in these verse-subverse combinations. I have often used a pair of brackets for this device.
It is important therefore that after building an application with GoBibleCreator, the programmer should check that every chapter contains the correct number of verses, and every book the correct number of chapters.
What is the maximum size for the Info: line in the collections text file?
Is there any way in which this line can be formatted for a better display appearance in the Menu | About option?