
file listings

Jeremy Faden edited this page Apr 29, 2026 · 33 revisions

See also dataset schema issue tag.

File listings and events lists are similar types of data which can be represented in HAPI. For a while HAPI has been able to represent files, using the "stringType" metadata to identify strings as URIs. File listings were then colloquially just a time and a URI. We would now like to define a specific schema for file listings in HAPI; this page explores what that schema should look like.

Events lists are more generally just a time stamp and a message associated with the time stamp. Often events lists will have a time range with a start and end time.

We start with a base class which is just a "Listing of Times":

  • time isotime

"File Listing" required elements:

  • startDate - start date/time of coverage (isotime, required)
  • uri (string, required, stringType used for base URI)

Optional and recommended elements:

  • modificationDate (isotime, recommended)
  • size (recommended; units must be B and values must be formatted as integers, unless the file size is greater than 9,007,199,254,740,991 bytes (~9 PB), in which case scientific notation should be used, because the exact number of bytes cannot then be communicated)

Optional elements (if present, use these keywords):

  • checksum (stringType used to constrain checkSumAlgorithm)
  • creationDate (isotime)
  • accessDate (isotime)
  • stopDate - (isotime) stop date/time of coverage

Comment on column names: we tried to be consistent with the names used in the info response.

Ordering of "fileListing" columns:

  • required columns must be present in the order given (time, then fileURI)
  • optional columns must follow required columns and can be in any order
  • any number of user-added columns can be present (other than the listed optional columns) and these can be interleaved among the optional columns
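The ordering rules above could be checked by a client along these lines. This is only a sketch: the column names are taken from the lists in these notes (which are not yet settled), the case-insensitive match on the required names is an assumption, and the function names are hypothetical.

```python
# Column names taken from the lists in these notes; the final names are not settled.
REQUIRED = ["time", "fileURI"]
OPTIONAL = {"modificationDate", "size", "checksum",
            "creationDate", "accessDate", "stopDate"}


def check_file_listing_columns(names):
    """Return a list of problems with the column order (empty list means OK)."""
    problems = []
    # Required columns must come first, in the given order (case-insensitive here).
    head = [n.lower() for n in names[:len(REQUIRED)]]
    if head != [r.lower() for r in REQUIRED]:
        problems.append(f"columns must start with {REQUIRED}, got {names[:len(REQUIRED)]}")
    if len(set(names)) != len(names):
        problems.append("column names must be unique")
    return problems


def split_trailing_columns(names):
    """Split the columns after the required ones into (optional, user_added)."""
    rest = names[len(REQUIRED):]
    optional = [n for n in rest if n in OPTIONAL]
    user_added = [n for n in rest if n not in OPTIONAL]
    return optional, user_added
```

Because optional and user-added columns may be interleaved, the sketch identifies them by name rather than by position.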

Need a new stringType for checksum - see ticket #273. There are curated lists of hash algorithm names (for use in HTTP headers, for example):

A long type would be helpful here, but that should be a separate discussion (and maybe we also add float for HAPI 4.0; also complex numbers?).

Question (analysis needed): how many units-processing libraries use the same strings for these file size units?

Examples of standards for prefixes used with file sizes

We will eventually have to specify which standard we use for these prefixes.

Events Lists are also extensions of String Listings:

  • time - start of time coverage (isotime, required)
  • stopDate - stop of time coverage (isotime, required) (Documentation acknowledges that this should be the same as time for an instantaneous event)
  • label (required)

Some example extensions to Event List:

  • latitude
  • longitude

Example proposed output, note x_parameterSchema

{
    "HAPI": "3.2",
    "x_createdAt": "2017-02-21T17:27Z",
    "modificationDate": "2026-01-01T00:00Z",
    "x_parameterSchema": "list>fileList>jpgFileList",
    "parameters": [
        {
            "length": 20,
            "name": "Time",
            "type": "isotime",
            "x_format": "$Y-$m-$dT$H:$M:$SZ",
            "fill": null,
            "units": "UTC",
            "timeStampLocation" : "begin"
        },
        {
            "description": "Picture of the creek, unmodified",
            "fill": null,
            "name": "fileURI",
            "length": 26,
            "type": "string",
            "units": null,
            "stringType": {
                "uri": {
                    "base": "https://cottagesystems.com/data/hapi/pics/",
                    "mediaType": "image/jpeg"
                }
            }
        },
        {
            "description": "File modification time",
            "name": "modificationDate",
            "type": "isotime",
            "fill": null,
            "x_format": "$Y-$m-$dT$H:$MZ",
            "length": 17,
            "units": "UTC"
        },
        {
            "description": "File size in kibibytes",
            "name": "fileSize",
            "fill": null,
            "type": "integer",
            "units": "KiB"
        }
    ],
    "sampleStartDate": "2023-01-01T00:00Z",
    "sampleStopDate": "2023-02-01T00:00Z",
    "startDate": "2022-11-01T00:00Z",
    "stopDate": "2026-03-06T00:00Z",
    "cadence": "PT10M",
    "status": {
        "code": 1200,
        "message": "OK"
    }
}

One issue is how to deal with the units on the file size. We could use IEEE units, which seem to be similar to (the same as?) what is used in VO units and astropy units: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9714443

See also:

Message sent 2026-04-06 to HAPI dev mailing list with status update:

For a summary of where we are now: We would like there to be a schema to indicate that a HAPI response is a listing of files that are available as URIs. (We did not provide this or encourage it so far because we don’t want providers just offering a file listing and saying they made their data available via HAPI.) If people do list files using HAPI, we would prefer that they all use the same format, so that it becomes possible to interpret file listings interoperably from any HAPI service. Therefore, we will offer a schema that, if followed, will allow clients to: a) know that they are getting a file listing, and b) interpret such a listing from any server with computer precision using a single client.

The most basic file listing will be a HAPI dataset that has only 2 required columns:

  1. a time column as the first column (required by HAPI for any dataset); for a file listing, this represents the start time of the data in the file
  2. filename as a URI; this is a string column with the special string sub-type of URI (this URI sub-type is part of the existing HAPI spec as of version 3.2) that gives a link to the file. See here for URI string types: https://github.com/hapi-server/data-specification/blob/master/hapi-3.2.0/HAPI-data-access-spec-3.2.0.md#3616-the-stringtype-object

There can be optional elements after this for: file size, end time of data in the file, file modification time, file creation time, last file access time, and checksum. If any of these items are included, there are constraints that must be followed for them to be recognized by HAPI. Following any of these optional but constrained items, a dataset may include any number of other, additional columns relevant for these files, such as wavelength, frequency range, observed target, DOI, image type, quality flag, data version, processing level, etc. HAPI does not place any restriction on the number or structure of these additional columns. They just need to be valid HAPI parameters. Any “x_” items in these parameters are of course allowed, as always.
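As an illustration of the two required columns, a client might read a CSV file listing and resolve each fileURI against the "base" from the stringType roughly as below. This is a sketch under assumptions: the sample rows and file names are made up, the base is taken from the example info response on this page, and standard relative-URI resolution (Python's urljoin) is assumed to be an acceptable way to combine base and value.

```python
import csv
import io
from urllib.parse import urljoin

# "base" as it appears in the example stringType on this page.
BASE = "https://cottagesystems.com/data/hapi/pics/"

# Hypothetical CSV rows: Time, fileURI, then any optional or extra columns.
SAMPLE = (
    "2023-01-01T00:00:00Z,20230101T0000.jpg,2023-01-01T00:05Z\n"
    "2023-01-01T00:10:00Z,20230101T0010.jpg,2023-01-01T00:15Z\n"
)


def read_listing(text, base):
    """Parse listing rows into dicts with the start time and the full URL."""
    rows = []
    for time, uri, *extra in csv.reader(io.StringIO(text)):
        rows.append({"time": time, "url": urljoin(base, uri), "extra": extra})
    return rows


listing = read_listing(SAMPLE, BASE)
```

Note that urljoin only appends the file name when the base ends with "/"; a base without a trailing slash would have its last path segment replaced, so the exact rule for combining base and value would need to follow the spec.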

2026-04-20

Discussion about fileSize:

  • JavaScript does not even have integers, so what should size be? Pandering to JSON and JavaScript is hard since it doesn't have integers (or comments!)
  • Current thinking: use double and recommend that the value be written as an integer with as much precision as possible so that the exact value is conveyed; this only breaks down when the size has more digits than fit in a double (sizes above ~9 PB)
  • JavaScript: may lose precision for integers larger than 9007199254740991 (2^53 - 1)
  • see this binary representation converter: https://www.binaryconvert.com/result_double.html
  • If an integer is in the range +/- 9,007,199,254,740,991, then write it out exactly; that value will be represented exactly as a double
  • Discussed and abandoned: We could suggest that people add their own x_exactFileSize as a clandestine long that is actually a JSON string, such as "123456789012345" (the quotes make it a string in JSON, so it then requires special parsing, like a BigInt)
  • What about making fileSize a string?
  • Will summarize and clean this up tomorrow.
  • This is useful to show that most file sizes (much bigger than 2GB) would be precisely represented: https://www.binaryconvert.com/convert_double.html
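The precision limit discussed above can be checked directly. The sketch below is not a proposed implementation; the function names are hypothetical, and the fallback to scientific notation simply follows the recommendation stated earlier on this page.

```python
MAX_EXACT = 2**53 - 1  # 9,007,199,254,740,991: largest integer in the exact range quoted above


def is_exact_as_double(n_bytes: int) -> bool:
    """True if the integer survives a round trip through a 64-bit double."""
    return int(float(n_bytes)) == n_bytes


def format_file_size(n_bytes: int) -> str:
    """Write the size as a plain integer when exact, else in scientific notation."""
    if abs(n_bytes) <= MAX_EXACT:
        return str(n_bytes)
    return f"{float(n_bytes):.15e}"
```

Every size up to MAX_EXACT round-trips exactly, so a 2 GB file (about 2 * 10^9 bytes) is far inside the safe range; 2**53 + 1 is the first integer that does not survive the round trip.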

See also: https://github.com/hapi-server/data-specification/issues/218

Sample info response for a file listing

{
   "HAPI": "3.3",
   "status": { "code": 1200, "message": "OK"},
   "$schema": "https://hapi-server.org/schemas/HAPI-3.2.json#info-fileListing",
   "startDate": "1998-001Z",
   "stopDate" : "2017-100Z",
   "parameters": [
       { "name": "time",
         "type": "isotime",
         "units": "UTC",
         "fill": null,
         "length": 24 },
       { "name": "fileURI",
         "type": "string",
         "stringType": {"uri": { "base": "https://sample.com/listing", "mediaType": "image/fits" } },
         "fill": null,
         "description": "solar images at 580 nm",
         "label": "filename"},
       { "name": "checksum",
         "type": "string",
         "length": 32,
         "stringType": {"checksum": { "algorithm": "md5" } },
         "fill": null,
         "description": "pre-calculated checksum using MD5 algorithm"},
       { "name": "stopDate",
         "type": "isotime",
         "length": 24,
         "units": "UTC",
         "fill": null,
         "description": "end date and time when the image was taken; integration times range from 10s to 30s",
         "label": "image stop date"}
   ]
}
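If the checksum stringType is adopted as in the sample above, a client could verify a downloaded file along these lines. This is a sketch only: the "checksum" stringType is still a proposal, the function name is hypothetical, and it assumes the algorithm name in the metadata is one that Python's hashlib recognizes and that the checksum is given as a hex string.

```python
import hashlib


def verify_checksum(data: bytes, expected_hex: str, algorithm: str = "md5") -> bool:
    """Compare the hex digest of data against the checksum from the listing."""
    # hashlib.new raises ValueError if the algorithm name is not supported.
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected_hex.strip().lower()
```

An MD5 hex digest is 32 characters long, which matches the "length": 32 given for the checksum parameter in the sample.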

How to handle duration of files and events

How should we handle the fact that event listings and file listings involve content that has an intrinsic time range? Regular HAPI data content has each row associated with a point in time, at least with respect to the query for data.

We decided to keep the query mechanism and rules the same, and will just add a statement about the need to expand a query time range to include potential edge cases, something like: Because event lists and file listings refer to items with an implied duration, a HAPI query for items in this kind of list may need to be expanded, since the query will return only items whose start time falls in the query range. If a server wants to communicate a duration, the stopDate should be used.
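The edge case can be sketched from the client side as follows. The assumptions here are ours: the client knows an upper bound on item duration (max_duration, a hypothetical parameter), the listing includes stopDate, and the server selects rows whose start time falls in a start-inclusive, stop-exclusive range.

```python
from datetime import datetime, timedelta, timezone


def utc(*args):
    """Build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


def server_query(rows, start, stop):
    """Mimic the rule above: return rows whose start time is in [start, stop)."""
    return [r for r in rows if start <= r["time"] < stop]


def overlapping(rows, start, stop, max_duration):
    """Widen the query by max_duration, then keep rows that overlap [start, stop)."""
    candidates = server_query(rows, start - max_duration, stop)
    return [r for r in candidates if r["stopDate"] > start]


# Hypothetical listing: the first file starts before the query range but extends into it.
rows = [
    {"time": utc(2023, 1, 1, 23, 50), "stopDate": utc(2023, 1, 2, 0, 10), "fileURI": "a.jpg"},
    {"time": utc(2023, 1, 2, 0, 30), "stopDate": utc(2023, 1, 2, 0, 40), "fileURI": "b.jpg"},
]
start, stop = utc(2023, 1, 2), utc(2023, 1, 2, 1)
```

With these rows, the unexpanded query misses a.jpg even though it covers the first ten minutes of the requested range; widening the start by the maximum duration and filtering on stopDate recovers it.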

How to handle duplicate times in file listings or event lists

Repeated time tags are allowed in fileListing or eventList data schemas. Equivalently, we could say that time values must never decrease.

We just noticed that the HAPI spec never actually states that HAPI times must only ever increase. So we need to add that to the spec! The definitions for "monotonically increasing" vary, so we will avoid that language. The spec should say that values can only ever increase, with no duplicates.
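The two rules differ only in whether equal neighbors are accepted, which a validator could express as below. This is a sketch: the function name is hypothetical, and comparing the time strings directly assumes all values in a column use the same isotime format, so that string order matches time order.

```python
def times_in_order(times, allow_duplicates):
    """Check time ordering: non-decreasing if duplicates are allowed, else strictly increasing."""
    pairs = list(zip(times, times[1:]))
    if allow_duplicates:
        return all(a <= b for a, b in pairs)
    return all(a < b for a, b in pairs)


# Hypothetical listing in which two files share a start time.
listing_times = ["2023-01-01T00:00Z", "2023-01-01T00:00Z", "2023-01-01T00:10Z"]
```

Here the sample passes as a file listing (duplicates allowed) but would fail the stricter rule proposed for the general spec.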

Comments on case and capitalization

Three places where we have specific capitalization:

  • HTTP query parameters: we use snake case, such as include_parameters
  • camelCase everywhere else
  • AlertCamelCase for the name of the first column, the Time parameter (sort of, since it's only one word)

Defining the schema for what the parameters are

Like the unitsSchema and coordinateSystemSchema, we will use parameterSchema as the keyword.

Other options: datasetSchema - this means keywords outside the parameters have extra requirements

Could datasetSchema be an array? So far, these potential values are envisioned:

Should it just be called "dataType"? Do we need to worry about other usage of "schema"? We have "stringType" already.
