PublicationFormat  
Documenting recommended publication formats
Updated Sep 10, 2009 by dprem...@gmail.com
This page is under construction and subject to significant revision.

Introduction

The GBIF Global Names Architecture work of the ECAT work programme includes extending GBIF indices to include the occurrence of taxonomic names within publications. Our work involves extending the functionality of the uBioRSS application by incorporating advances in Taxonomic Name Recognition, undertaken as part of the GNA work, with advances in processing data indices. In addition we will expand the scope of indexed content. Our goal is to:

  • enable checklist datasets registered in Checklist Bank to serve as profiles for accessing indexed literature
  • allow users to refine profiles by selecting sets of publications for monitoring
  • incorporate links to publications within GBIF data portal taxon pages
  • offer taxonomic name indexing services to publishers to facilitate discovery of and access to literature via taxonomic contexts
  • provide tools and services to Participants for processing publications for taxonomic name recognition.

Rationale

The GBIF Strategic Plan targets the integration of indexed biodiversity data for all groups of organisms, and requires that the amount and richness of the data served via these indexes be sufficient to meet the needs of major user groups. Data types targeted for integration include genetic resources, multimedia data, and literature. The indexing of taxonomic names occurring within publications provides the integrative capacity to meet this target.

Publisher Requirements

Publications to be indexed should preferably be PDFs, but for GBIF indexing purposes we support most common formats, including PDF, HTML, XML and Microsoft Office (doc, xls, ppt).

To ease the discovery of online publications, we recommend that publishers ideally provide two things, each described in more detail below:

  • an RSS feed listing the latest publications; for journals, the table of contents (TOC) of the current issue
  • an archive file listing all available publications in a simple format. For open access journals this can include links to the full PDF; otherwise it can contain just metadata and possibly abstracts of the back catalogue

RSS feeds

Providing an RSS feed for articles or publications makes it possible to include metadata for each feed item (entry), i.e. each article or publication. The downside of classical RSS feeds is that by default they only provide access to the most recently published articles, usually around 10-25. (Extensions exist, for example paging in Atom feeds, but they are not widely used and get tricky.)

Unfortunately there is no single reference/citation standard in use, so there are various ways of expressing publication metadata. The most common approach is simple Dublin Core; the more detailed feeds use the Content module and PRISM, which only exists as RDF and is therefore limited to RSS 1.0.

Best practice guidelines can be found in Good Practice Guidelines for Publishers of TOC RSS feeds: http://web.fumsi.com/go/article/share/3356

The recommendation is to use RDF-based RSS 1.0 with PRISM if possible. As RSS 2.0 is less expressive, it should only be used when the resources to provide RSS 1.0 are not available.

Publisher RSS survey

What do publishers do already? Many publishers support the idea of TOC RSS feeds and also link to PDFs from them. A good review of current practice can be found in Analysing the ticTOCs collection of journal TOC feeds.

We analysed 980 biologically relevant feeds in uBio to see which formats are the most common (the feeds missing from the counts below are broken ones):

rss_0.92 = 3
rss_1.0 = 336
rss_2.0 = 431
rss_0.91U = 6
atom_1.0 = 2

... and a more detailed breakdown by the namespaces and elements used in feeds. Each number is the count of feeds that use the element in their items:

2	atom_1.0
2	atom_1.0::http://purl.org/dc/elements/1.1/
1	atom_1.0::http://purl.org/syndication/thread/1.0total
1	atom_1.0::http://www.w3.org/200
6	rss_0.91U
6	rss_0.91U::http://purl.org/dc/elements/1.1/
2	rss_0.91U::http://rssnamespace.org/feedburner/ext/1.0origLink
3	rss_0.92
3	rss_0.92::http://purl.org/dc/elements/1.1/
336	rss_1.0
1	rss_1.0::http://base.google.com/ns/1.0image_link
1	rss_1.0::http://base.google.com/ns/1.0news_source
1	rss_1.0::http://base.google.com/ns/1.0publication_name
1	rss_1.0::http://base.google.com/ns/1.0publication_volume
1	rss_1.0::http://base.google.com/ns/1.0publish_date
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/byteCount
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/category
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/complianceProfile
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/copyright
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/coverDate
2	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/coverDisplayDate
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/distributor
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/eIssn
182	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/endingPage
173	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/isPartOf
56	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/issn
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/issueIdentifier
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/issueName
177	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/number
57	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/publicationDate
66	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/publicationName
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/publicationYear
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/publisher
26	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/section
238	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/startingPage
1	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/teaser
54	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/versionidentifier
233	rss_1.0::http://prismstandard.org/namespaces/1.2/basic/volume
334	rss_1.0::http://purl.org/dc/elements/1.1/
1	rss_1.0::http://purl.org/dc/terms/created
1	rss_1.0::http://purl.org/dc/terms/issued
1	rss_1.0::http://purl.org/dc/terms/tableOfContents
1	rss_1.0::http://purl.org/rss/1.0/modules/aggregation/source
1	rss_1.0::http://purl.org/rss/1.0/modules/aggregation/sourceURL
1	rss_1.0::http://purl.org/rss/1.0/modules/aggregation/timestamp
1	rss_1.0::http://purl.org/rss/1.0/modules/annotate/reference
38	rss_1.0::http://purl.org/rss/1.0/modules/prism/endingPage
38	rss_1.0::http://purl.org/rss/1.0/modules/prism/number
39	rss_1.0::http://purl.org/rss/1.0/modules/prism/publicationDate
39	rss_1.0::http://purl.org/rss/1.0/modules/prism/section
38	rss_1.0::http://purl.org/rss/1.0/modules/prism/startingPage
38	rss_1.0::http://purl.org/rss/1.0/modules/prism/volume
1	rss_1.0::http://purl.org/rss/1.0/modules/slash/comments
1	rss_1.0::http://purl.org/rss/1.0/modules/slash/department
1	rss_1.0::http://purl.org/rss/1.0/modules/slash/hit_parade
1	rss_1.0::http://purl.org/rss/1.0/modules/slash/section
1	rss_1.0::http://purl.org/syndication/thread/1.0total
3	rss_1.0::http://rssnamespace.org/feedburner/ext/1.0origLink
54	rss_1.0::http://web.resource.org/cc/license
1	rss_1.0::http://www.openurl.info/registry/fmt/xml/rss10/ctxobjects
144	rss_1.0::http://xmlns.com/foaf/0.1/maker
1	rss_1.0::www.refworks.com/xml/created
1	rss_1.0::www.refworks.com/xml/do
1	rss_1.0::www.refworks.com/xml/id
1	rss_1.0::www.refworks.com/xml/jo
1	rss_1.0::www.refworks.com/xml/k1
1	rss_1.0::www.refworks.com/xml/modified
1	rss_1.0::www.refworks.com/xml/ol
1	rss_1.0::www.refworks.com/xml/rwtype
1	rss_1.0::www.refworks.com/xml/sn
1	rss_1.0::www.refworks.com/xml/sr
1	rss_1.0::www.refworks.com/xml/ul
431	rss_2.0
2	rss_2.0::http://prismstandard.org/namespaces/1.2/basic/endingPage
2	rss_2.0::http://prismstandard.org/namespaces/1.2/basic/number
2	rss_2.0::http://prismstandard.org/namespaces/1.2/basic/publicationDate
2	rss_2.0::http://prismstandard.org/namespaces/1.2/basic/section
2	rss_2.0::http://prismstandard.org/namespaces/1.2/basic/startingPage
2	rss_2.0::http://prismstandard.org/namespaces/1.2/basic/volume
398	rss_2.0::http://purl.org/dc/elements/1.1/
1	rss_2.0::http://purl.org/rss/1.0/modules/slash/comments
6	rss_2.0::http://rssnamespace.org/feedburner/ext/1.0origLink
3	rss_2.0::http://search.yahoo.com/mrss/content
3	rss_2.0::http://search.yahoo.com/mrss/credit
1	rss_2.0::http://search.yahoo.com/mrss/description
2	rss_2.0::http://search.yahoo.com/mrss/thumbnail
1	rss_2.0::http://search.yahoo.com/mrss/title
1	rss_2.0::http://search.yahoo.com/mrssthumbnail
2	rss_2.0::http://wellformedweb.org/CommentAPI/commentRss
1	rss_2.0::http://www.itunes.com/dtds/podcast-1.0.dtdduration
1	rss_2.0::http://www.itunes.com/dtds/podcast-1.0.dtdexplicit
1	rss_2.0::http://www.itunes.com/dtds/podcast-1.0.dtdkeywords
1	rss_2.0::http://www.itunes.com/dtds/podcast-1.0.dtdsubtitle
1	rss_2.0::http://www.itunes.com/dtds/podcast-1.0.dtdsummary
1	rss_2.0::http://www.pheedo.com/namespace/pheedoorigLink
1	rss_2.0::http://www.topix.com/partners/rsscomment/comments
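
A breakdown like the one above can be approximated with a short script. The following is only a sketch, not the analysis actually run against uBio: the function names, the format labels and the choice to tally every non-core child element of an item are assumptions made for illustration. It guesses the feed format from the root element and counts, per format, how many feeds use each extension element.

```python
import xml.etree.ElementTree as ET
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS1_NS = "http://purl.org/rss/1.0/"
ATOM_NS = "http://www.w3.org/2005/Atom"


def split_tag(tag):
    """Split an ElementTree tag '{namespace}local' into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def detect_format(root):
    """Guess the feed format from the root element (labels are hypothetical)."""
    namespace, local = split_tag(root.tag)
    if namespace == RDF_NS and local == "RDF":
        return "rss_1.0"
    if namespace == ATOM_NS and local == "feed":
        return "atom_1.0"
    if local == "rss":
        return "rss_" + root.get("version", "unknown")
    return "unknown"


def item_elements(root):
    """Return the namespace-qualified extension elements used inside items."""
    used = set()
    for element in root.iter():
        if split_tag(element.tag)[1] in ("item", "entry"):
            for child in element:
                namespace, local = split_tag(child.tag)
                # Skip core RSS/Atom elements; keep extension namespaces only
                if namespace and namespace not in (RSS1_NS, ATOM_NS):
                    used.add(namespace + local)
    return used


def survey(feed_documents):
    """Count feeds per format and, per format, feeds using each element."""
    counts = Counter()
    for xml_text in feed_documents:
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            continue  # broken feeds are left out of the counts
        fmt = detect_format(root)
        counts[fmt] += 1
        for name in item_elements(root):
            counts[fmt + "::" + name] += 1
    return counts
```

Calling survey() with a list of feed documents (as strings) returns a Counter whose keys follow the "format::namespace+element" pattern used in the listing above.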

RSS 1.0 with PRISM

Recommended Format

taken from fumsi.com

PRISM

The Publishing Requirements for Industry Standard Metadata (PRISM) specification defines a standard for interoperable content description, interchange, and reuse in both traditional and electronic publishing contexts. PRISM recommends the use of certain existing standards, such as XML, RDF, the Dublin Core, and various ISO specifications for locations, languages, and date/time formats. Beyond those recommendations, it defines a small number of XML namespaces and controlled vocabularies of values, in order to meet the goals listed above.

Example

  • http://www.nature.com/ng/current_issue/rss
  • <?xml version="1.0" encoding="UTF-8"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://www.nature.com/ng/current_issue/rss">
    <title>Nature Genetics</title>
    <description>Publishes the very highest quality research in genetics.</description>
    <link>http://www.nature.com/ng/current_issue/</link>
    <dc:publisher>Nature Publishing Group</dc:publisher>
    <dc:language>en</dc:language>
    <dc:rights>&#169; 2009 Nature Publishing Group</dc:rights>
    <prism:publicationName>Nature Genetics</prism:publicationName>
    
    <prism:issn>1061-4036</prism:issn>
    <prism:eIssn>1546-1718</prism:eIssn>
    <prism:copyright>&#169; 2009 Nature Publishing Group</prism:copyright>
    <prism:rightsAgent>permissions@nature.com</prism:rightsAgent>
    <image rdf:resource="http://www.nature.com/includes/rj_globnavimages/ng_logo.gif"/>
    <items>
    <rdf:Seq>
    <rdf:li rdf:resource="http://dx.doi.org/10.1038/ng0609-635"/>
        ...
    </rdf:Seq>
    </items>
    </channel>
    <item rdf:about="http://dx.doi.org/10.1038/ng0609-635">
    <title>The cup half empty</title>
    <link>http://dx.doi.org/10.1038/ng0609-635</link>
    <description>One-sixth of the world's population does not have enough food to sustain life, 
    and the world's food supply needs to double by 2050 without increasing demand for water or fuel. 
    Agricultural genetics is one of the easier parts of the solution.</description>
    <content:encoded><![CDATA[
    
    <p>
    <b>The cup half empty</b>
    </p>
    <p>Nature Genetics 41, 635 (2009). <a href="http://dx.doi.org/10.1038/ng0609-635">doi:10.1038/ng0609-635</a>
    </p>
    <p>One-sixth of the world's population does not have enough food to sustain life, 
    and the world's food supply needs to double by 2050 without increasing demand for water or fuel. 
    Agricultural genetics is one of the easier parts of the solution.</p>
    ]]></content:encoded>
    <dc:title>The cup half empty</dc:title>
    <dc:identifier>doi:10.1038/ng0609-635</dc:identifier>
    <dc:source>Nature Genetics 41, 635 (2009)</dc:source>
    <prism:publicationName>Nature Genetics</prism:publicationName>
    <prism:volume>41</prism:volume>
    
    <prism:number>6</prism:number>
    <prism:section>Editorial</prism:section>
    <prism:startingPage>635</prism:startingPage>
    <prism:endingPage>635</prism:endingPage>
    </item>
        ...

The RSS 1.0 feed is RDF-based, so the list of items in the channel can reference the individual item elements by URI (see rdf:Seq above).
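
To illustrate how a consumer might use this structure, here is a minimal sketch in Python that follows the rdf:Seq references to the matching item elements and reads the Dublin Core and PRISM fields shown in the example. The function name, the output keys and the FIELDS mapping are assumptions made for illustration; only the namespaces and element names come from the feed above.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "prism": "http://prismstandard.org/namespaces/1.2/basic/",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Hypothetical mapping from output keys to namespaced feed elements
FIELDS = {
    "title": "dc:title",
    "identifier": "dc:identifier",
    "journal": "prism:publicationName",
    "volume": "prism:volume",
    "issue": "prism:number",
    "first_page": "prism:startingPage",
    "last_page": "prism:endingPage",
}


def read_rss1_items(xml_text):
    """Return one dict per item, in the order given by the channel's rdf:Seq."""
    root = ET.fromstring(xml_text)
    # Index the item elements by their rdf:about URI
    by_uri = {item.get(RDF_ABOUT): item for item in root.findall("rss:item", NS)}
    records = []
    for li in root.findall("rss:channel/rss:items/rdf:Seq/rdf:li", NS):
        uri = li.get(RDF_RESOURCE)
        item = by_uri.get(uri)
        if item is None:
            continue  # referenced in the channel but not present in the feed
        record = {"uri": uri}
        for key, path in FIELDS.items():
            # Missing elements yield None rather than raising an error
            record[key] = item.findtext(path, namespaces=NS)
        records.append(record)
    return records
```

Because every bibliographic field has its own element, no text parsing of the description is needed to recover volume, issue or page numbers.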

RSS 2.0

  • example feed: http://www.bioone.org/action/showFeed?type=etoc&feed=rss&jc=bitr
  • <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
       <channel>
          <title>BioOne: BIOTROPICA: Table of Contents</title>
          <link>http://www.bioone.org/loi/bitr?ai=tc&amp;af=R</link>
          <description>Table of Contents for BIOTROPICA. List of articles from both the latest and ahead of print issues.</description>
          <language>en-US</language>
    
          <pubDate>Thu, 14 May 2009 04:17:05 GMT</pubDate>
          <docs>http://blogs.law.harvard.edu/tech/rss</docs>
          <generator>Atypon Literatum</generator>
          <managingEditor>helpdesk@allenpress.com</managingEditor>
          <ttl>120</ttl>
          <image>
    
             <title>BIOTROPICA</title>
             <url>http://www.bioone.org/na101/home/literatum/publisher/bioone/journals/covergifs/bitr/2004/00063606-36.4/cover.jpg</url>
             <link>http://www.bioone.org/loi/bitr?ai=tc&amp;af=R</link>
          </image>
          <item>
             <title>Beyond Paradise—Meeting the Challenges in Tropical Biology in the 21st Century</title>
    
             <link>http://www.bioone.org/doi/abs/10.1646/1609?ai=tc&amp;af=R</link>
             <description>BIOTROPICA, Volume 36, Issue 4, Page 437-446, December 2004. 
    		&lt;br/&gt;
    	</description>
             <author>helpdesk@allenpress.com (Kamaljit S. Bawa et al)</author>
             <category>article</category>
             <pubDate>Wed, 14 Jan 2009 16:55:33 GMT</pubDate>
    
             <guid>http://www.bioone.org/doi/abs/10.1646/1609?ai=tc&amp;af=R</guid>
             <comments>http://www.bioone.org/action/showMessage?message=Copyright+%28c%29+2009%2C+Atypon+Systems.+All+rights+reserved&amp;ai=tc&amp;af=R</comments>
          </item>
             ...
  • example feed: http://pubget.com/feed?q=latest%3ANature+Genetics
  • <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <channel>
        <title>Pubget: latest:Nature Genetics</title>
        <link>http://pubget.com/search?q=latest%3ANature+Genetics</link>
        <description>Pubget is like PubMed, except you get the PDFs right away</description>
        <item>
          <title>Mutations in mitochondrial carrier family gene SLC25A38 cause nonsyndromic autosomal recessive congenital sideroblastic anemia.</title>
    
          <link>http://pubget.com/search?highlight=19412178&amp;q=latest%3ANature+Genetics</link>
          <description>
    The sideroblastic anemias are a heterogeneous group of congenital and acquired hematological disorders whose morphological hallmark
     is the presence of ringed sideroblasts-bone marrow erythroid precursors containing pathologic iron deposits within mitochondria. 
    Here, by positional cloning, we define a previously unknown form of autosomal recessive nonsyndromic congenital sideroblastic anemia, 
    associated with mutations in the gene encoding the erythroid specific mitochondrial carrier family protein SLC25A38, and demonstrate that SLC25A38 is important for the biosynthesis of heme in eukaryotes.
     Authors: &lt;a href='/search?q=authors%3A%22Duane L Guernsey%22' &gt;Duane L Guernsey&lt;/a&gt;, 
    &lt;a href='/search?q=authors%3A%22Haiyan Jiang%22' &gt;Haiyan Jiang&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Dean R Campagna%22' &gt;
    Dean R Campagna&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Susan C Evans%22' &gt;Susan C Evans&lt;/a&gt;, 
    &lt;a href='/search?q=authors%3A%22Meghan Ferguson%22' &gt;Meghan Ferguson&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Mark D Kellogg%22' &gt;
    Mark D Kellogg&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Mathieu Lachance%22' &gt;Mathieu Lachance&lt;/a&gt;, 
    &lt;a href='/search?q=authors%3A%22Makoto Matsuoka%22' &gt;Makoto Matsuoka&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Mathew Nightingale%22' &gt;
    Mathew Nightingale&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Andrea Rideout%22' &gt;Andrea Rideout&lt;/a&gt;, 
    &lt;a href='/search?q=authors%3A%22Louis Saint-Amant%22' &gt;Louis Saint-Amant&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Paul J Schmidt%22' &gt;
    Paul J Schmidt&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Andrew Orr%22' &gt;Andrew Orr&lt;/a&gt;, 
    &lt;a href='/search?q=authors%3A%22Sylvia S Bottomley%22' &gt;Sylvia S Bottomley&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Mark D Fleming%22' &gt;
    Mark D Fleming&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Mark Ludman%22' &gt;Mark Ludman&lt;/a&gt;,
     &lt;a href='/search?q=authors%3A%22Sarah Dyack%22' &gt;Sarah Dyack&lt;/a&gt;, &lt;a href='/search?q=authors%3A%22Conrad V Fernandez%22' &gt;
    Conrad V Fernandez&lt;/a&gt; and &lt;a href='/search?q=authors%3A%22Mark E Samuels%22' &gt;Mark E Samuels&lt;/a&gt;</description>
    
          <guid>http://pubget.com/search?highlight=19412178&amp;q=latest%3ANature+Genetics</guid>
          <pdf>http://www.nature.com/ng/journal/v41/n6/pdf/ng.359.pdf</pdf>
        </item>
             ...
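
For comparison, a minimal Python sketch of reading such RSS 2.0 feeds follows. The function name and output keys are hypothetical. Note that the pdf element in the Pubget feed is not part of RSS 2.0 itself, and that volume, issue and page numbers are only available as free text in the description, which this sketch does not attempt to parse.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def read_rss2_items(xml_text):
    """Return one dict per <item> with the few fields RSS 2.0 structures."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.findall("channel/item"):
        pub_date = item.findtext("pubDate")
        records.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link"),
            "guid": item.findtext("guid"),
            # RSS 2.0 dates use the RFC 822 style, e.g. "Wed, 14 Jan 2009 16:55:33 GMT"
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
            # Non-standard element seen in the Pubget example; usually absent
            "pdf": item.findtext("pdf"),
        })
    return records
```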

Atom

Even though Atom is technically a very good standard, its lack of use by current publishers suggests it is better not to use it at this point.

Archive

The archive of all publications should be a list of Dublin Core records. There are two ways of encoding such an archive: a simple CSV text file or XML.

CSV archive

A CSV file with each row representing a single publication. This format is very simple to produce and is compatible with the Darwin Core text guidelines, in particular the ECAT references extension.

It does not allow for line breaks in the metadata, which are common in abstracts. If you don't have abstracts or can replace the line breaks, please consider this format. A simple example file with one record looks like this:

dc:identifier	link	dc:bibliographicCitation	dc:title	dc:creator	dc:date	dc:source	dc:subject	dc:description
doi:10.1038/ng0609-637		Hartge, P., Genetics of reproductive lifespan. Nature Genetics 41, 637 - 638 (2009) 	Genetics of reproductive lifespan	Patricia Hartge	2009-06-01	Nature Genetics 41, 635 (2009)	genomics, epidemiology	Five genome-wide association studies of the timing of menarche and menopause have now taken us beyond the range of candidate gene and linkage studies. The list of new genetic associations identified for these two traits should shed light on the mechanisms of ovarian aging, as well as breast cancer and other diseases associated with reproductive lifespan.
 ...
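
As a rough sketch of how a publisher might produce such a file, the Python snippet below writes records as tab-separated rows with the same header as the example and collapses line breaks and tabs inside values into single spaces. The function names and the dict-based record layout are assumptions for illustration, not part of the recommendation.

```python
# Column order copied from the example header above
COLUMNS = [
    "dc:identifier", "link", "dc:bibliographicCitation", "dc:title",
    "dc:creator", "dc:date", "dc:source", "dc:subject", "dc:description",
]


def flatten(value):
    """Collapse line breaks, tabs and repeated spaces into single spaces."""
    return " ".join(str(value).split())


def format_archive(records):
    """Return the archive text: one header row, then one row per record."""
    lines = ["\t".join(COLUMNS)]
    for record in records:
        # Missing columns are written as empty fields
        lines.append("\t".join(flatten(record.get(column, "")) for column in COLUMNS))
    return "\n".join(lines) + "\n"
```

Flattening the whitespace is what keeps each publication on exactly one row, at the cost of losing paragraph breaks in abstracts.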

XML archive

The same information as the CSV file, but encoded as XML, which allows for line breaks and markup within the abstracts. A simple XML schema is provided to validate resources encoded in Dublin Core alone. Example:

<?xml version="1.0" encoding="UTF-8"?>
<resources xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xsi:noNamespaceSchemaLocation="http://gbif-ecat.googlecode.com/files/publication_archive.xsd">
    <resource>
        <dc:identifier>doi:10.1038/ng0609-637</dc:identifier>
        <dc:identifier>http://www.nature.com/ng/journal/v41/n6/pdf/ng0609-637.pdf</dc:identifier>
        <dc:title>Genetics of reproductive lifespan</dc:title>
        <dc:creator>Patricia Hartge</dc:creator>
        <dc:date>2009-06-01</dc:date>
        <dc:source>Nature Genetics 41, 635 (2009)</dc:source>
        <dc:subject>genomics; epidemiology</dc:subject>
        <dc:language>en</dc:language>
        <dc:rights>Copyright © 2009 Nature Publishing Group</dc:rights>
        <dc:description>
            Five genome-wide association studies of the timing of menarche and menopause have now taken us beyond the range of candidate gene and linkage studies.
            The list of new genetic associations identified for these two traits should shed light on the mechanisms of ovarian aging, as well as breast cancer and other diseases associated with reproductive lifespan.
        </dc:description>
    </resource>
     ...
</resources>
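
A minimal Python sketch of generating an archive in this shape is given below. The function name and the record layout (a dict of Dublin Core term to a string or list of strings) are assumptions for illustration; the element names and the schema location are taken from the example above. The sketch does not validate against the schema, and any element order the schema may require is not enforced.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA_URL = "http://gbif-ecat.googlecode.com/files/publication_archive.xsd"

# Use the same prefixes as the example when serializing
ET.register_namespace("dc", DC_NS)
ET.register_namespace("xsi", XSI_NS)


def build_archive(records):
    """Serialize records (dicts of Dublin Core term -> value(s)) to XML text."""
    root = ET.Element("resources")
    root.set("{%s}noNamespaceSchemaLocation" % XSI_NS, SCHEMA_URL)
    for record in records:
        resource = ET.SubElement(root, "resource")
        for term, values in record.items():
            if isinstance(values, str):
                values = [values]  # a term may repeat, e.g. dc:identifier
            for value in values:
                element = ET.SubElement(resource, "{%s}%s" % (DC_NS, term))
                element.text = value  # line breaks in text are preserved
    return ET.tostring(root, encoding="unicode")
```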
