Introduction to FITS
Updated Apr 25, 2012 by spencer_...@harvard.edu

Introduction

The File Information Tool Set (FITS) identifies, validates and extracts technical metadata for a wide range of file formats. It acts as a wrapper, invoking several other open source tools and managing their output. Output from these tools is converted into a common format, compared against one another and consolidated into a single XML output file. FITS is written in Java and is compatible with Java 1.6 or higher. The external tools currently used are:


How to Use FITS

FITS can be used as a command line tool or within other projects through its API. It relies on an environment variable named FITS_HOME to find the configuration and XSL transform directories it needs.

Command Line

Windows .bat file and Linux/OS X shell launcher scripts are provided. These scripts build the necessary Java classpath and set the FITS_HOME environment variable automatically.

Command Line Options

  • -i The input file you want to examine.
  • -o The destination of the output XML file.
  • -r Processes directories recursively when -i is a directory.
  • -h Prints the usage message.
  • -v Displays the FITS version number.
  • -x Converts FITS output to a standard metadata schema.
  • -xc Outputs using a standard metadata schema and includes the FITS XML.

If -o is not specified then the output is sent to the console window.

The general syntax is:

>fits.[bat|sh] -i input_file -o output_file

API

When using the API, the FITS_HOME location is not read from the environment automatically; its value must be passed to the Fits() constructor. See the Developer Info section.


Overview of FITS Life Cycle

  1. Configuration load
    • FITS_HOME environment variable set up
    • fits.xml configuration file loaded
    • Tool wrappers created
    • Output consolidator configured
  2. For each tool wrapper
    • Each tool is executed on the file, creating a ToolOutput object containing a FITS XML document
      • If necessary, XSLT is applied to the tool output to create the FITS-compatible XML
    • FITS mapping file applied (xml/fits_xml_map.xml)
  3. Consolidation
    • Identities consolidated
      • Format tree (xml/fits_format_tree.xml) consulted
    • Output from tools that were unable to identify the file, or that identified a less specific type, is thrown out
    • Fileinfo sections merged
    • Filestatus sections merged
    • Metadata sections merged
  4. Output
    • The consolidated FITS XML is written to a file or the console
    • If using the API, a FitsOutput object is returned

Output Format

Each tool wrapper must implement the Tool interface and return a ToolOutput object. ToolOutput must contain a valid FITS XML JDOM object. Each tool's output is validated against the local FITS XML schema when the ToolOutput object is created. The schema is located in xml/fits_output.xsd.
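The kind of schema check described above can be sketched with the standard javax.xml.validation API. This is only an illustration, not the FITS implementation: the class name SchemaValidationSketch and the inline toy schema are assumptions, and the toy schema is not xml/fits_output.xsd. It models nothing more than an identification element whose status attribute is restricted to an enumeration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of FITS.
final class SchemaValidationSketch {

    // Toy schema (an assumption): a stand-in for the real fits_output.xsd.
    static final String TOY_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:simpleType name='statusType'>"
      + "<xs:restriction base='xs:string'>"
      + "<xs:enumeration value='SINGLE_RESULT'/>"
      + "<xs:enumeration value='CONFLICT'/>"
      + "<xs:enumeration value='PARTIAL'/>"
      + "</xs:restriction>"
      + "</xs:simpleType>"
      + "<xs:element name='identification'>"
      + "<xs:complexType>"
      + "<xs:attribute name='status' type='statusType'/>"
      + "</xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true when the XML string validates against the toy schema.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(new StreamSource(new StringReader(TOY_XSD)));
        } catch (SAXException e) {
            // The schema itself is broken; that is a programming error here.
            throw new IllegalStateException("toy schema failed to compile", e);
        }
        try {
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Validation errors surface as SAXException by default.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<identification status='CONFLICT'/>"));
        System.out.println(isValid("<identification status='UNKNOWN'/>"));
    }
}
```

With this toy schema, a status value outside the enumeration fails validation, while a listed value passes.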

During consolidation, conflicts between tool outputs are recorded by adding a status attribute to the affected element.

After consolidation, the single FITS XML output file references the online schema located at http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd
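A consumer of the consolidated output might want to list the elements that carry a given status value. The following is a minimal sketch using only the JDK's DOM parser. The class name StatusScanSketch is hypothetical, and the sample input in main is a shortened, namespace-free fragment, so real FITS output may need namespace-aware handling.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of FITS.
final class StatusScanSketch {

    // Returns the tag names of all elements whose status attribute
    // equals the given value, in document order.
    static List<String> elementsWithStatus(String xml, String status) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagName("*");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                // getAttribute returns "" when the attribute is absent.
                if (status.equals(element.getAttribute("status"))) {
                    names.add(element.getTagName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String sample =
            "<filestatus>"
          + "<well-formed toolname='Jhove' status='SINGLE_RESULT'>true</well-formed>"
          + "<valid toolname='Jhove' status='CONFLICT'>false</valid>"
          + "</filestatus>";
        System.out.println(elementsWithStatus(sample, "CONFLICT"));
    }
}
```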

Status Attributes

If multiple tools disagree on an identity or other metadata value, a status attribute is added to the element with a value of "CONFLICT". If only a single tool reports an identity or other metadata value, a status attribute is added to the element with a value of "SINGLE_RESULT". If multiple tools agree on an identity or value, and none disagree, the status attribute is omitted.
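The three cases above can be expressed as a small decision function. This is a sketch of the rule as stated on this page, not the FITS consolidator itself. StatusSketch and statusFor are hypothetical names, and the values are assumed to be already normalized strings, one per reporting tool.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of FITS.
final class StatusSketch {

    // Returns the status attribute value for one element, or null when
    // the attribute should be omitted (several tools, all in agreement).
    static String statusFor(List<String> reportedValues) {
        if (reportedValues.isEmpty()) {
            throw new IllegalArgumentException("at least one tool must report a value");
        }
        if (reportedValues.size() == 1) {
            return "SINGLE_RESULT";
        }
        Set<String> distinct = new LinkedHashSet<>(reportedValues);
        return distinct.size() > 1 ? "CONFLICT" : null;
    }

    public static void main(String[] args) {
        System.out.println(statusFor(Arrays.asList("image/png")));
        System.out.println(statusFor(Arrays.asList("image/png", "image/png")));
        System.out.println(statusFor(Arrays.asList("image/png", "image/x-png")));
    }
}
```

The sketch covers only the two status values described in this section; other values are outside its scope.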

Tool Ordering Preference

The ordering preference of the tools in xml/fits.xml determines the ordering of conflicting values. If the report-conflict configuration option is set to false then only the tool that first reported the element is displayed. The other conflicting values are discarded.
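The effect of the report-conflict option can be pictured as follows. This is a hedged sketch rather than FITS code: ConflictDisplaySketch and valuesToDisplay are hypothetical names, and the input list is assumed to be already sorted by the tool ordering preference from xml/fits.xml.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of FITS.
final class ConflictDisplaySketch {

    // valuesInToolOrder: one value per tool, most preferred tool first.
    // reportConflicts: stands in for the report-conflict option.
    static List<String> valuesToDisplay(List<String> valuesInToolOrder,
                                        boolean reportConflicts) {
        // Drop duplicates but keep the tool preference order.
        List<String> distinct =
            new ArrayList<>(new LinkedHashSet<>(valuesInToolOrder));
        if (reportConflicts || distinct.size() <= 1) {
            return distinct;
        }
        // Conflicts are not reported: keep only the first value.
        return new ArrayList<>(distinct.subList(0, 1));
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("image/png", "image/x-png");
        System.out.println(valuesToDisplay(values, true));
        System.out.println(valuesToDisplay(values, false));
    }
}
```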

Identities and Technical Metadata

All tools that agree on an identity are consolidated into a single <identity> section. Technical metadata is only output (and a part of the consolidation process) for tools that were able to identify the file and that are listed in the first <identity> section. All other output is discarded.

Tool Output Normalization

It’s possible for tools to output conflicting data when they actually mean the same thing. For example, one tool could report the format of a PNG image as “Portable Network Graphics”, while another may report “PNG”. A tool could report a sampling frequency unit of “2”, while another may report the text string “inches”. If left alone, these would cause false positive conflicts to appear in the FITS consolidated output. These differences are converted in the XSLT that converts the native tool output into FITS XML. In general FITS prefers text strings to numeric values (“inches” instead of “2”), and complete format names to abbreviations (“Portable Network Graphics” instead of “PNG”). If new tools or formats are being added to FITS then thorough testing should be done to ensure that any false positive conflicts are resolved.
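The idea behind normalization can be illustrated with a lookup table. FITS performs this step in XSLT, as noted above, so the Java below is only a sketch of the principle. NormalizationSketch and its two maps are hypothetical, and the entries are limited to the two examples from this section.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of FITS (which normalizes in XSLT).
final class NormalizationSketch {

    private static final Map<String, String> FORMAT_NAMES = new HashMap<>();
    private static final Map<String, String> SAMPLING_UNITS = new HashMap<>();

    static {
        // Prefer complete format names to abbreviations.
        FORMAT_NAMES.put("PNG", "Portable Network Graphics");
        // Prefer text strings to numeric codes.
        SAMPLING_UNITS.put("2", "inches");
    }

    // Unknown values pass through unchanged.
    static String normalizeFormat(String raw) {
        String mapped = FORMAT_NAMES.get(raw);
        return mapped != null ? mapped : raw;
    }

    static String normalizeSamplingUnit(String raw) {
        String mapped = SAMPLING_UNITS.get(raw);
        return mapped != null ? mapped : raw;
    }

    public static void main(String[] args) {
        // Both tools now report the same string, so no false conflict arises.
        System.out.println(normalizeFormat("PNG"));
        System.out.println(normalizeFormat("Portable Network Graphics"));
        System.out.println(normalizeSamplingUnit("2"));
    }
}
```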

Comment by awood...@gmail.com, Mar 14, 2011

Minor typo in "Command Line Options" section: "-o The destination fo the output XML file."

Comment by project member spencer_...@harvard.edu, Mar 14, 2011

thanks. fixed!

Comment by spe...@purdue.edu, May 6, 2011

I noticed one more value, identification[@status="PARTIAL"], and I am not sure of its precise meaning. Does it mean that only a subset of the tools identified the object (e.g. 2 out of 4), with no conflict among their identifications?

Comment by johan.va...@kb.nl, Jun 29, 2011

On the possible values of status attributes: if FITS encounters a file that cannot be identified, this results in <identification status="UNKNOWN"> in the output file. This is not mentioned here, and the value is not included in the FITS output file schema either (which means that the output file is not valid according to its own schema!)

The same would apply to the aforementioned "PARTIAL" value (which I haven't encountered myself so far)

Comment by project member spencer_...@harvard.edu, Jun 29, 2011

PARTIAL should be in the fits_output.xsd file included in the 0.5 release. But you are right, UNKNOWN is missing from the latest version of the schema. I just added it and committed the file to SVN. You can get it here: http://code.google.com/p/fits/source/browse/trunk/xml/fits_output.xsd

Comment by johan.va...@kb.nl, Jun 29, 2011

Small follow-up to my previous comment: since FITS also supports XML validation using JHOVE, I ran the output file with <identification status="UNKNOWN"> through FITS. Result:

++++++++

<filestatus>
  <well-formed toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</well-formed>
  <valid toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">false</valid>
  <message toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">cvc-enumeration-valid: Value 'UNKNOWN' is not facet-valid with respect to enumeration '[SINGLE_RESULT, CONFLICT, PARTIAL]'. It must be a value from the enumeration. Line = 3, Column = 36</message>
  <message toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">cvc-attribute.3: The value 'UNKNOWN' of attribute 'status' on element 'identification' is not valid with respect to its type, 'statusType'. Line = 3, Column = 36</message>
</filestatus>

++++++++

So at least FITS/JHOVE correctly detects that these files are not valid.

Comment by project member spencer_...@harvard.edu, Jun 29, 2011

It helps when I update the copy of the schema that we host at http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd

This should fix the validation problem

Comment by johan.va...@kb.nl, Jun 29, 2011

Hi Spencer,

Thanks for clearing that up!

As for "PARTIAL": the 'fits_output.xsd' file that is in the XML directory of FITS 0.5 does not contain this value yet! However, the output files do not refer to this local copy but instead contain a reference to a centrally-stored version of the .xsd file which does include the "PARTIAL" value. So we're having 2 different versions of the schema here, which I think explains the confusion. I'm not sure though if the local version of the file is used at any time by FITS?

Johan

Comment by johan.va...@kb.nl, Jun 29, 2011

BTW previous comment was in reply to your reply to my first comment. Just checked the updated schema and yes that should fix this issue.

Cheers,

Johan

Comment by mcew...@gmail.com, Jun 29, 2011

Ah, you are right. I must have modified my local copy at some point. In any case, both the version in SVN and the copy on our website should now be in sync.

The local copy provided with FITS is used during the file processing. As each tool has its output converted to the FITS format it is validated using the local schema. This can be disabled by setting <validate-tool-output>true</validate-tool-output> in xml/fits.xml to false.

