
Storage Sensor Sampling

Storage Accounting with Sampling/Monitoring

There has been much discussion about the sampling of stored data quantities to obtain usage for UR markup.
We start with these axioms:

Data storage is a continuous usage, i.e. it has a start time and an end time (or a duration).
Two mechanisms for calculating it are being considered:
- record all operations which write to or delete from the storage device
  e.g. a gridftp server modified to record inbound I/O and delete operations
- measure usage periodically
  e.g. on a Linux system, by running du against a disk
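
The two mechanisms can be contrasted with a minimal sketch. The function names and the (timestamp, operation, bytes) event format are assumptions made for illustration; they are not taken from gridftp or from any UR specification. Note also that the sampling function sums apparent file sizes, whereas du by default reports allocated disk blocks, so the two figures can differ.

```python
import os


def replay_events(events, initial_bytes=0):
    """Mechanism 1: derive usage from a log of storage operations.

    events is an iterable of (timestamp, op, nbytes) tuples, where op is
    "write" or "delete" (a hypothetical log format).
    Returns a list of (timestamp, bytes_stored) after each event.
    """
    stored = initial_bytes
    timeline = []
    for timestamp, op, nbytes in sorted(events, key=lambda e: e[0]):
        if op == "write":
            stored += nbytes
        elif op == "delete":
            stored -= nbytes
        else:
            raise ValueError("unknown operation: %r" % (op,))
        timeline.append((timestamp, stored))
    return timeline


def sample_directory_bytes(root):
    """Mechanism 2: measure usage at one instant by summing file sizes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return total
```

The first mechanism gives exact change times but requires every access path to be instrumented; the second needs no changes to the service but only sees the state at the moment of each sample.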

This led to the understanding that any sampling process would introduce inaccuracies, which resulted in two proposals:

URs do not specify a time period
- Any time period would make the record in itself inaccurate.
- Resources instead publish status information: the total size of all files stored (total allocation was also discussed) to the accounting systems via URs
- The UR consumer would then derive an average/max/min value over a period of time for accounting, using whatever algorithm they wished
URs specify a time period
- URs would contain a self-consistent usage value
- URs would not reflect the dynamics of the usage
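
Under the first proposal the consumer has to turn point-in-time status reports into a figure for an accounting period. One possible algorithm is sketched below; it assumes each reported total holds until the next report (step interpolation), which is only one of many choices a consumer could make. The function name and the (timestamp, bytes) sample format are hypothetical.

```python
def summarise(samples, period_start, period_end):
    """Derive average/max/min bytes over a period from status reports.

    samples is an iterable of (timestamp_seconds, total_bytes).
    Assumption: each value stays constant until the next report.
    """
    if period_end <= period_start:
        raise ValueError("period_end must be after period_start")

    value_at_start = None
    inside = []
    for t, size in sorted(samples):
        if t <= period_start:
            value_at_start = size      # latest report before the period
        elif t < period_end:
            inside.append((t, size))

    points = []
    if value_at_start is not None:
        points.append((period_start, value_at_start))
    points.extend(inside)
    if not points:
        raise ValueError("no samples cover the period")

    # Time-weighted average over the part of the period that is covered.
    ends = [t for t, _ in points[1:]] + [period_end]
    weighted = 0.0
    covered = 0.0
    for (t0, size), t1 in zip(points, ends):
        weighted += size * (t1 - t0)
        covered += t1 - t0

    values = [size for _, size in points]
    return {"average": weighted / covered, "max": max(values), "min": min(values)}
```

For example, reports of 100, 300 and 200 bytes at t = 0, 10 and 30 seconds give an average of 225 bytes over the period 0 to 40.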

Storage UR will specify a start and end time
- To be set by the sensor/Resource-Provider
- and subject only to local policy decisions
- (Resource provider or service software is in best place to determine sampling rates)
Storage UR will present a data size value: N (bytes or other similar specified units, cf. UR v1)
- i.e. not a value integrated over time
- UR will not mandate how this value is obtained, so long as the mechanism is reasonable and is described publicly for the user to consume
Storage UR data size value will be interpreted by UR consumers as a constant average value across the time period
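
The interpretation above can be sketched as follows. The StorageUR class and its field names are hypothetical and do not correspond to any UR schema; the sketch also assumes that URs for one resource do not overlap in time and that time not covered by any UR counts as zero usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageUR:
    start: float        # seconds, set by the sensor/resource provider
    end: float          # seconds, set by the sensor/resource provider
    size_bytes: float   # treated as constant between start and end


def byte_seconds(ur, window_start, window_end):
    """Usage contributed by one UR to a window, as size times overlap."""
    overlap = min(ur.end, window_end) - max(ur.start, window_start)
    return ur.size_bytes * max(0.0, overlap)


def average_bytes(urs, window_start, window_end):
    """Average stored bytes over an accounting window."""
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    total = sum(byte_seconds(ur, window_start, window_end) for ur in urs)
    return total / (window_end - window_start)
```

With one UR of 1000 bytes covering 0 to 100 seconds and another of 3000 bytes covering 100 to 200 seconds, the average over the window 50 to 150 is 2000 bytes.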

Allowing resource providers to determine their own UR start/end times could in principle lead to a very large number of very short-period URs in the system.
- Granularity will determine performance, so service providers need to be asked to cut appropriately coarse-grained URs.
- Sampling can still be done as often as accuracy requires, but the published UR should report average usage over a suitably long time baseline.
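
A provider-side sketch of this coarse-graining is given below: frequent samples are grouped into fixed-length intervals and one record is published per interval. It assumes samples are evenly spaced, so that a plain mean approximates the time-weighted average. The function name, the record keys and the sample_count field are illustrative only.

```python
from collections import defaultdict


def coarsen(samples, ur_seconds):
    """Group (timestamp_seconds, bytes) samples into one record per interval.

    Assumption: samples are evenly spaced, so the arithmetic mean of the
    samples in an interval stands in for the average usage over it.
    """
    buckets = defaultdict(list)
    for t, size in samples:
        buckets[int(t // ur_seconds)].append(size)

    records = []
    for index in sorted(buckets):
        sizes = buckets[index]
        records.append({
            "start": index * ur_seconds,
            "end": (index + 1) * ur_seconds,
            "size_bytes": sum(sizes) / len(sizes),
            "sample_count": len(sizes),  # hypothetical sampling metadata
        })
    return records
```

Sampling every 30 minutes and publishing hourly records, for instance, halves the number of URs while still using every sample.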
Would it be sensible to provide metadata in the UR to indicate the sampling process used?
Would it be sensible to provide metadata to indicate whether the value comes from a sampling mechanism?