This is a static archive of the previous Open Grid Forum Redmine content management system saved from host redmine.ogf.org file /dmsf_files/8020?download=12541 at Fri, 04 Nov 2022 15:28:01 GMT
Data Format Description Language (DFDL) Working Group
Global Grid Forum, Data Area
Chairs:
Martin Westhead, M.Westhead@epcc.ed.ac.uk
Guy Rixon,
Alan Chappell, chappella@battelle.org
Secretary(s):
Martin Westhead, M.Westhead@epcc.ed.ac.uk
Email list:
dfdl-discuss@nesc.ac.uk
Web page:
www.epcc.ed.ac.uk/dfdl/
Charter:
Focus/Purpose
XML provides an essential mechanism for transferring data between services in
an application- and platform-neutral format. However, it is not well suited to
large datasets with repetitive structures, such as large arrays or tables.
Furthermore, many legacy systems and valuable data sets exist that do not use
the XML format. The aim of this working group is to define an XML-based
language, the Data Format Description Language (DFDL), for describing the
structure of binary and character-encoded (ASCII/Unicode) files and data
streams so that their format, structure, and metadata can be exposed. This
effort specifically does not aim to create a generic data representation
language. Rather, DFDL endeavors to describe existing formats in an actionable
manner that makes the data in its current format accessible through generic
mechanisms.
The DFDL description would sit in a (logically) separate file from the data
itself. The description would provide a hierarchical description that would
structure and semantically label the underlying bits. It would capture:
- how bits are to be interpreted as parts of low-level data types
(ints, floats, strings)
- how low-level types are assembled into scientifically relevant
forms such as arrays
- how meaning is assigned to these forms through association with
variable names and metadata such as units
- how arrays and the overall structure of the binary file are
parameterized based on array dimensions, flags specifying optional file
components, etc.
Further, if the data file contains highly repetitive structures, such as large
arrays or tables, such a description can be very concise.
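The four points above can be illustrated with a minimal sketch. It is not DFDL syntax, which this charter does not define. The dictionary layout, the field names (rows, cols, temperature), and the parse function are all hypothetical, chosen only to show a separate description driving the interpretation of raw bytes, including an array whose size is parameterized by header values.

```python
import struct

# Hypothetical description (NOT DFDL syntax): a header holding two dimensions,
# followed by a rows x cols array of 32-bit floats, all big-endian.
DESCRIPTION = {
    "byte_order": ">",                          # struct code for big-endian
    "header": [("rows", "I"), ("cols", "I")],   # two unsigned 32-bit integers
    "array": {
        "name": "temperature",                  # variable name (assumed)
        "type": "f",                            # 32-bit IEEE float
        "units": "degrees centigrade",          # metadata label
        "dims": ("rows", "cols"),               # sizes taken from the header
    },
}


def parse(data, desc):
    """Interpret raw bytes according to the separate description."""
    order = desc["byte_order"]
    offset = 0
    header = {}
    # Low-level types: read each header field in declared order.
    for name, code in desc["header"]:
        (header[name],) = struct.unpack_from(order + code, data, offset)
        offset += struct.calcsize(order + code)
    # Parameterized structure: the array size depends on header values.
    arr = desc["array"]
    rows, cols = (header[d] for d in arr["dims"])
    flat = struct.unpack_from(f"{order}{rows * cols}{arr['type']}", data, offset)
    values = [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]
    return {"header": header, "name": arr["name"],
            "units": arr["units"], "values": values}


# Sample data: a 2 x 3 array preceded by its dimensions.
sample = struct.pack(">II6f", 2, 3, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
result = parse(sample, DESCRIPTION)
```

Note that the description stays the same size however large the array grows, which is the conciseness the paragraph above refers to.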
The potential benefits of having such a standard include:
- Transparency of physical binary representation - Preservation of information, independent of low-level format, e.g., bit/byte ordering, block size, etc.
- XML without explicitly representing the tags - An XML representation of the data could be inferred from the description, without actually having to materialize that representation. This could allow the user to treat the data as if it were XML, thus enabling:
  - XSL conversions,
  - XQuery/XPath, and
  - SAX read/write directly to/from DFDL.
- Data file -> database - DFDL makes the structure explicit.
- Vendor-independent bulk transfer of relations between relational databases - That is, it would provide a mechanism for concisely describing binary data relations to allow large transfers of data between databases.
- Generic tools for:
  - Browsing
  - Conversion
  - Manipulation
  - Annotation of binary files (e.g., these bits represent the hurricane in an image)
- Absolute bit preservation in data archiving - The original bits can be kept and still used with new software (that may not be designed for this format) because the format is now explicit.
- Selection and integration of data - Referencing (via XPath) means that you can select individual data objects or groups and combine them from one or more files in any order.
- Basis for standard transformation language - Based on XPath and XSL.
- General semantic labeling - Since individual data objects and groups can be referenced, metadata labels can be associated with them. Such labels could be generic (like physical units, e.g., degrees centigrade) or application-specific.
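The XML-view and selection benefits listed above can be sketched as follows. This is an illustration only: the element names (dataset, value), the to_xml helper, and the two in-memory "files" are hypothetical, and for brevity the sketch builds the XML tree in memory, whereas the charter envisions treating the data as XML without materializing it. The path expressions use the limited XPath subset supported by Python's standard xml.etree.ElementTree module.

```python
import struct
import xml.etree.ElementTree as ET


def to_xml(data, name, count):
    """Build an XML view of `count` big-endian 32-bit integers.

    The layout and element names are assumptions made for this sketch.
    """
    root = ET.Element("dataset", {"name": name})
    for i, v in enumerate(struct.unpack_from(f">{count}i", data)):
        item = ET.SubElement(root, "value", {"index": str(i)})
        item.text = str(v)
    return root


# Two hypothetical binary files, held in memory for the example.
file_a = struct.pack(">3i", 10, 20, 30)
file_b = struct.pack(">2i", 7, 8)

view_a = to_xml(file_a, "a", 3)
view_b = to_xml(file_b, "b", 2)

# Select individual data objects by path expression and combine them
# from more than one file, in an order of our choosing.
selected = [
    view_b.find("value[@index='1']"),
    view_a.find("value[@index='0']"),
]
combined = [int(e.text) for e in selected]
```

Because every data object is addressable by a path, the same mechanism could also carry metadata labels, such as units, as attributes on the selected elements.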
Goals/Milestones
The goals of the group are as follows:
- To develop a proposal for a standard Data Format Description Language (DFDL), which will consist of a general structure description language and an extensible set of ontologies for which we will provide a base.
- To work with other groups within the GGF to ensure that the DFDL proposal conforms with other emerging Grid standards.
- To foster the development of reference implementations of libraries and tools that use the DFDL proposal.
The group aims to be very focused and to leverage existing implementation work
(see references) in the development of reference implementations. As such, our
aim is to complete the work in 18 months. We propose to produce the following
documents:
1. Formal language for DFDL structure description
2. XML representation of this language (XML Schema, including standard APIs to reference it)
3. Requirements for DFDL ontology - what features are required of a DFDL ontology
4. Basic types ontology (floating point, integer, character, etc.)
5. Basic structures ontology (strings, arrays, tables, etc.)
Milestones:
- GGF8 (1) strawman
- GGF9 (1) draft (2) strawman
- GGF10 (1) draft (2) draft (3,4,5) strawman
- GGF11 (1) complete (2,3,4,5) draft
- GGF12 (1,2) complete (3,4,5) draft
- GGF13 All documents complete
References:
- BinX: http://www.epcc.ed.ac.uk/gridserve/WP5/Binx/
- HDF: http://hdf.ncsa.uiuc.edu/HDF5
- BDF/SAM: http://collaboratory.emsl.pnl.gov/docs/collab/sam
- XDR: http://www.faqs.org/rfcs/rfc1014.html