DictionariesCharterProposal
Charter Proposal, Dictionaries and Glossaries
Status of this proposal
This charter proposal has been accepted by vote of the IDPF membership as of 1/31/2012 and a Working Group formed. Please see the Dictionaries Working Group Main Page for current activity on this project. The final charter document is available at http://idpf.org/charters/2012/dictionaries/.
Need for this proposal
Dictionaries, glossaries, thesauri, and similar works are ubiquitous published resources that users expect to have available in the EPUB3 ecosystem. The primary use of a dictionary or glossary from a user point of view is the ability to search for a term and quickly retrieve its definition or translation. Currently, EPUB has no mechanism for an author to mark up the needed semantic information to enable such reading system search features, making it impossible to publish a dictionary in EPUB that serves its primary purpose.
While EPUB-based reading systems often bundle dictionaries with devices and offer a word lookup feature, this is achieved by storing the dictionary in a proprietary format and essentially treating it as part of the reading system software, rather than an independent publication. The current situation does not allow users to choose the dictionary content that best suits their needs, and instead limits them to using a single bundled dictionary.
Publishers of EPUB3 content wish to make a broad range of reference resources available to users and to serve needs that cannot be met by a general monolingual dictionary typically bundled with a reading system: children need dictionaries designed for their reading level, language learners need dictionaries that translate from a foreign language to their native tongue, and users reading material in fields such as medicine and law need dictionaries covering a broad specialized vocabulary. Publishers also wish to offer users the ability to look up words in a publication's glossary while reading, thereby enhancing the user's experience of educational and other types of content. Reading system developers wish to utilize and innovate around these types of publications.
This proposal describes the scope, required functionality, and timeline to deliver a standard for producing EPUB3 Publications that meet the use cases that are also included in this proposal.
Scope
In-scope (Deliverables)
The scope of this project is to define a declarative mechanism for the representation of dictionaries and glossaries in EPUB Publications sufficient to enable development of reading system features specific to these publication types. As further detailed in Use Cases and Needed Publication Properties below, the delivered mechanism shall have the following top-level functional properties:
Out of Scope
Integration Constraints
The defined mechanism shall integrate with EPUB 3 as follows:
Timeline and Participation
Project participation is open to IDPF members and invited experts. (Note that invited expert status needs to be renewed for each IDPF project.) The project charter spans one year in total. Once formed, the working group will decide on feature prioritization and possibly also versioning strategies, after which the milestones below can be dated.
This project is intended to be run concurrently with the project on indexes, and so shares the charter span with that project.
Working Group Leads
Suggested Leads of this working group are:
Use Cases
Actors: publishers, users
System: reading system, content
Needed Publication Properties
Package Metadata
Entry Structure
Headwords and Inflections
Other Semantic Markup
Structure and Semantics
N.B.: The following terms are representative of the range of lexical and semantic qualities that will be needed to support stated use cases and also allow for innovation. For the purposes of this charter proposal to initiate a working group, these terms are not intended to be interpreted as a strict requirement for inclusion into a specification.
Glossaries
Lexicographical
Dictionaries
Lexicographical
Bilingual / Multilingual Dictionaries
Lexicographical
Thesauri
Lexicographical Morphological Phonetic Syntax/Grammar
Definitions
affix: A prefix, infix, or suffix that is attached to another form to make a word with a distinct meaning, eg, laugh + ed. (1)
alternate headword: A form related to a primary headword but generally carrying a somewhat different meaning. For example, an entry with the primary headword aestivate might have aestivation as an alternate headword. An alternate headword should be indexed for search purposes along with the primary headword.
antonym: Terms with opposite sense or meaning.
audio pronunciation: An audio file containing a recording of the pronunciation of a particular headword. This feature of many electronic dictionaries can be offered in addition to or in place of the traditional written pronunciation.
case: An inflection of a noun, adjective, or pronoun according to its function in a sentence. German, Russian, and Latin are examples of languages in which words have many different written forms according to case.
cultural note: A note providing detailed cultural context on a headword.
date: The date of the first recorded use in a language of a particular headword.
definition: An explanation of the meaning of a particular sense of a headword.
dictionary resource: A collection of entries that have headwords in a particular source language and that a reading system can access to look up terms a user selects while reading a publication.
displayed inflection: An inflection of a headword that is part of the viewable content of an entry. Irregular inflections are often explicitly printed in entries to provide guidance to the user, eg, the displayed inflection "mice" in "mouse noun, plural mice".
entry: The fundamental organizational unit of a glossary or dictionary, consisting of at least one headword and a definition, translation, or equivalence cross-reference.
equivalence: A statement that a headword or particular sense of a headword is equivalent in meaning to another dictionary headword, typically supplied in lieu of a definition and acting as a cross-reference to the equivalent entry cited. An example would be a short entry for color in a British English dictionary that informs the user this is a US equivalent of colour: 'color noun (US) = colour'.
etymology: An explanation of the historical origin of a headword, eg, a statement that it is derived from a particular Latin word.
example: A sentence or phrase illustrating the usage of a headword in a particular sense.
gender: A label indicating the gender of a noun, generally subsumed in part-of-speech at the beginning of an entry; in bilingual dictionaries, often a stand-alone label associated with a particular translation.
glossary: A glossary section of a publication that a reading system can access to look up a term a user selects while reading that particular publication.
headword: The word occurring at the start of an entry whose meanings the entry covers; in a broader sense, a word whose meanings are discussed at any point in the entry (see alternate headword, variant headword, run-in headword, and run-on headword). In a monolingual dictionary or glossary, the headword is defined, while in a bilingual dictionary the headword is translated, and in a thesaurus synonyms are provided. In most languages, entries are arranged alphabetically according to the spelling of the headword.
holonym: A relation between a whole and a part, eg, a wiki is a holonym of constituent wiki pages; 'has-parts'.
hypernym: A relation between a class and sub-class; 'has-types'.
hyponym: A relation between a sub-class and a class; 'is-type-of'.
idiom: An idiomatic expression that is defined or translated in an entry. For example, an entry for cold might contain the idiom 'to get cold feet'.
inflection: An affixed form of a headword that conveys a specific grammatical meaning; for example, the past tense of a verb (eg, 'ran' is an inflection of 'run') or plural form of a noun (eg, 'mice' is an inflection of 'mouse'). Related to the concept of stemming in indexes.
lookup: A search for a user-selected term in dictionary or glossary headwords (including alternate, variant, run-on, and run-in headwords) and inflections. When a user initiates a glossary lookup, the reading system should search the local publication's embedded glossary, while when a user initiates a dictionary lookup, the reading system should search the user's preferred resources. Matching glossary or dictionary entries are then displayed to the user, typically in a pop-up window.
meronym: A relation between a part and a whole, eg, a wiki page is a meronym of a wiki; 'is-a-part-of'.
quotation: A quotation from a cited source illustrating the usage of a headword in a particular sense.
part-of-speech: A label indicating the grammatical function of the headword (noun, verb, adjective, interjection, transitive verb, reflexive verb, etc.).
phrasal headword: A headword of two or more words typically formed from another headword and listed within that headword's entry. For example, the items 'get out' and 'get up' listed in the entry for 'get' would be phrasal headwords.
preferred resource: An available dictionary resource which a reading system uses during lookup based on a user's indicated preferences.
pronunciation: One or more written phonetic pronunciations given for a headword.
register label: A label indicating usage register of a headword or sense, eg, formal, slang, offensive.
regional label: A label indicating geographic range of a headword or sense, eg, Latin America, Western US, Australia.
run-in headword: A headword occurring in the middle of an entry, generally associated with a particular sense.
run-on headword: A headword occurring at the end of an entry and that is derived from that entry's headword. For example, the adverb softly at the end of the entry for the adjective soft would be a run-on headword.
sense: A particular meaning of a headword, and a unit for organizing information pertaining to this meaning. Sense units are typically distinguished from one another by numeric and/or alphabetic labels.
sense label: A short phrase that restricts and clarifies the meaning of a particular sense.
source language: The language of the term(s) which a user wishes to look up; in bilingual dictionaries, the language of the headwords in a section of the publication.
stylistic label: A label identifying stylistic usage of a headword or sense, eg, literary.
subject label: A label indicating subject area of a headword or sense, eg, biology, architecture.
synonym: Terms with identical or similar meanings. Groups of synonyms are often tied to a particular sense of a headword in a thesaurus or dictionary.
temporal label: A label indicating current usage status of a headword or sense, eg, archaic.
tense: An inflected form of a verb that indicates when the action is taking place.
text entry search: A feature by which a user can directly input text into a search field and select entries with matching headwords from a list. Reading system developers could implement such a feature in a variety of ways, depending on their preference: by displaying matching results only after the user has input a full string and launched the search, or displaying partial matches as the user types, or positioning a highlight in a scrollable, complete list of dictionary headwords (to cite just a few possibilities).
translation: In a bilingual dictionary, the translation of a particular sense of a source language headword into the translation language.
translation language: In bilingual dictionaries, the language in which translations are offered for headwords in the source language.
usage section: A note providing usage information on a headword, or a more extensive section covering the difficult and confusing aspects of a particular headword's usage.
variant headword: An alternative spelling of a primary headword that carries the same meaning and that should be treated as of equal rank to it for search purposes. For example, an entry with the primary headword kabbalah could have numerous variant headwords: 'kabbalah also kabbala or kabala or cabala or ...' (2)
voice: A relationship between the subject and object of a verb that is either active or passive.
References
1. Crystal, David. (1995). The Cambridge Encyclopedia of the English Language (pp. 448-60). Cambridge: Cambridge University Press.
2. Merriam-Webster, Incorporated. (2003). Merriam-Webster's Collegiate Dictionary, Eleventh Edition. Springfield, Massachusetts: Merriam-Webster, Incorporated.
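To make the lookup, variant headword, inflection, and text entry search definitions above more concrete, here is a minimal Python sketch of how a reading system might index an entry's searchable forms. It is an illustration under stated assumptions, not part of the proposal: the `Entry` fields, the function names, and the sample entries are all hypothetical, and a real reading system would derive these forms from whatever markup the working group defines.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One dictionary entry; the field names are illustrative assumptions."""
    headword: str
    definition: str
    # Variant, alternate, run-in, or run-on headwords listed in the entry.
    variants: list[str] = field(default_factory=list)
    # Inflected forms, e.g. "mice" for "mouse".
    inflections: list[str] = field(default_factory=list)


def build_index(entries):
    """Map every searchable form (case-folded) to the entries that list it."""
    index = {}
    for entry in entries:
        for form in [entry.headword, *entry.variants, *entry.inflections]:
            index.setdefault(form.casefold(), []).append(entry)
    return index


def lookup(index, term):
    """Exact-match lookup of a user-selected term."""
    return index.get(term.casefold(), [])


def prefix_search(sorted_forms, prefix):
    """Text entry search: return indexed forms that start with `prefix`."""
    prefix = prefix.casefold()
    start = bisect_left(sorted_forms, prefix)  # first form >= prefix
    matches = []
    for form in sorted_forms[start:]:
        if not form.startswith(prefix):
            break
        matches.append(form)
    return matches


# Made-up sample data, loosely based on the examples in the definitions.
entries = [
    Entry("mouse", "a small rodent", inflections=["mice"]),
    Entry("kabbalah", "a mystical tradition",
          variants=["kabbala", "kabala", "cabala"]),
    Entry("soft", "not hard or firm", variants=["softly"]),
]
index = build_index(entries)
sorted_forms = sorted(index)

print(lookup(index, "Mice")[0].headword)   # prints "mouse"
print(prefix_search(sorted_forms, "ka"))   # prints ['kabala', 'kabbala', 'kabbalah']
```

Keeping the forms in a sorted list is one simple way to support the "partial matches as the user types" behaviour; other index structures would serve equally well.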
I'm posting a note from Ina Gravitz, as she doesn't have access. Regarding the definitions section, she asks: "What about an entry for variant headword as you have for alternate headword?"
This is a well-understood problem space. There should be a requirement that this work is informed by existing mechanisms defined in popular markup languages, such as the Text Encoding Initiative - see here for their dictionary work: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/DI.html
Hi
The web dictionaries mentioned below, given as references, were developed from print sources.
- Pronunciation: is it IPA characters, voice, or both? Not many people understand the phonetic alphabet. In bilingual dictionaries, the pronunciation of the translation can be more important than the pronunciation of the source-language headword. (Example: the pronunciation of "to make", one translation of "faire", in a French-English bilingual reference: http://www.larousse.fr/dictionnaires/francais-anglais/faire/32616)
- Are synonyms linked with senses? In French, "bâtir" is a synonym of "faire" for the sense "Constituer par son action, son travail, quelque chose.." but not for the sense "Fournir un produit agricole", for which "récolter" is a synonym. (reference: http://www.larousse.fr/dictionnaires/francais/faire/32701/synonyme)
- What about quotations? Some dictionaries have quotations from well-known people (writers, politicians ...) attached to the headword. (reference: http://www.larousse.fr/dictionnaires/francais/faire/32701/citation)
- Can we explain some lexicographical difficulties? In Latin languages, some words have particular lexicographical characteristics. (reference: http://www.larousse.fr/dictionnaires/francais/faire/32701/difficulte)
- What about proper nouns? Some dictionaries mix common and proper nouns. From a lexicographical point of view, "proper noun" is not a part-of-speech.
I think that there are a number of important use cases missing:
Is this proposal intended to support authoring of dictionaries? For example, see Dictionaries in P5: Guidelines for Electronic Text Encoding and Interchange
Or, is this proposal concerned about final (probably hard-to-modify) EPUB documents, which are generated from other sources such as TEI?
Does the proposed extension make it difficult to use HTML rendering engines as a basis for EPUB reading systems? Or, is everything intended to be represented by existing HTML5 constructs? If this is the case, why do we need this extension? Aren't in-house conventions by publishers good enough?
The proposed schedule is preposterous.
@eb2mmrt: "Is this proposal intended to support authoring of dictionaries?"
No, that is beyond the scope of this proposal. I think we will add that to "out of scope" to make it clearer.
"Or, is this proposal concerned about final (probably hard-to-modify) EPUB documents, which are generated from other sources such as TEI?"
Yes, we are concerned here with EPUB documents that have been generated from a publisher's internal dictionary data format.
"Does the proposed extension make it difficult to use HTML rendering engines as a basis for EPUB reading systems? Or, is everything intended to be represented by existing HTML5 constructs?"
On the former question, the answer is no. There is nothing particularly tricky about dictionaries in terms of HTML layout and rendering; the difficulty comes in out-of-scope database-type search. On the latter question, I would say that our goal should be to use existing EPUB3/HTML5 constructs to represent the semantic information needed for dictionary search wherever feasible. If a working group is formed, then one of its main tasks will be to determine when essential dictionary semantics can be represented in EPUB3/HTML5, and when we need to make use of relevant outside standards such as TEI.
"If this is the case, why do we need this extension?"
If authors and publishers have a standard for how to represent the essential dictionary entry/headword structure in publications, then that enables reading systems to devise standard ways of indexing this information for search (a separate problem mentioned in "Out of Scope") and providing users with the search capabilities they expect when using dictionaries.
Today, the only thing an author can do with a dictionary in EPUB is to treat it as a standard linear book. Say this publication contains a list of hyperlinks to each headword: a very small dictionary might have 10,000 headwords, while large ones can have 100,000+, making any such navigation approach completely unusable. Moreover, this does not even address the issue of using a dictionary as a system resource for lookup of words while reading other publications.
This last feature is generally present in reading systems, but has been implemented as a firmware-level feature, with a built-in dictionary stored in device memory in proprietary format. There is no way for users to look up words with different dictionaries that might better serve their needs. Devising a standard for publishers to capture dictionary search terms and deliver rich dictionary publications is the necessary first step in changing this state of affairs. There is then further work to be done at a reading system level, but standard semantics are needed as a foundation to build on.
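As a rough illustration of the reading-system side mentioned above, the following Python sketch picks among several installed dictionaries instead of a single bundled one. Everything in it is an assumption made for illustration: the `DictionaryResource` fields, the numeric `priority` used to model user preference, and the sample titles are hypothetical and are not metadata names from EPUB or from this proposal.

```python
from dataclasses import dataclass


@dataclass
class DictionaryResource:
    """An installed dictionary; fields are hypothetical, not EPUB metadata."""
    title: str
    source_language: str  # language of the headwords, e.g. "fr"
    priority: int         # lower number = more preferred by the user


def preferred_resources(resources, term_language):
    """Return dictionaries whose source language matches, most preferred first."""
    matching = [r for r in resources if r.source_language == term_language]
    return sorted(matching, key=lambda r: r.priority)


# Made-up installed dictionaries for a user who reads English and French.
installed = [
    DictionaryResource("General English", "en", priority=2),
    DictionaryResource("French-English Learner's", "fr", priority=1),
    DictionaryResource("Medical English", "en", priority=1),
]

for resource in preferred_resources(installed, "en"):
    print(resource.title)
# prints "Medical English" then "General English"
```

The point of the sketch is only that this kind of selection needs each dictionary publication to declare its source language in a standard way.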
Re schedule: the proposed charter length is one year (through 2012). I will clarify the prose around the milestones table to make clear that the WG once formed can rearrange milestones once feature prioritization has been completed.
The aggressive milestones that are in the table now should be read as a marker to remind us and the coming WG about the general consensus from the workshop that we should focus on producing an initial basic version of this feature, which would not aspire to address every use-case -- and then increase functionality with additional versions moving forward as and when this is deemed the right thing to do.
@janwright: Noted.
@bentrafford: That makes sense, except I might qualify it by saying our work should aim to use and adapt HTML5 first, and be informed by other markup languages such as TEI as well.
BTW, if the proposal seemed like we may have been trying to define an authoring language for dictionaries, that wasn't the intent. I've tried to clarify that point above.
@Gabino.Alonso.Garcia: Pronunciation: I had in mind the written phonetic transcription (IPA characters) when defining pronunciation, but you're quite right to point out the audio component, and that they may be attached to other items than the headword. We'll distinguish the two concepts when we revise it.
Synonyms linked with senses: they certainly can be linked, and I can mention that. I should add that the structure and definition sections of this document were simply meant to describe rather than prescribe.
Quotations: good point, I'll add that to the definitions section.
Lexicographical difficulties: I had meant for "usage note" to cover this kind of information. The example you cite shows how detailed this information can be; I'll try to broaden this definition.
Proper nouns: Excellent point! I think the "entry structure" section could mention that there are different types of entries generally, and that there should be a way of distinguishing high-level types in the publication.
@syeates: Good point about multilingual dictionaries, I'll add that.
On the user overriding the language of the work, are you referring to the language of the dictionary? Or the publication from which the user is looking up words in the dictionary?
On language preferences, could you provide examples of how these preferences might be used in practice? Are you suggesting that whenever a lookup is initiated, the reading system look for available dictionaries in the user's language preference order, regardless of the language of the referring publication? (In theory, that publication's metadata states what language is most appropriate, though in practice, it could be somewhat or entirely inaccurate.)
@janwright (for Ina Gravitz): I have added separate entries for "alternate headword" and "variant headword". It turns out I was not distinguishing the two correctly before, and that is now fixed.