OGSA Data Design Team telcon on 4 May 2004.

1. Early Discussion
--------------------

Roll Call:

Dave Berry, NeSC (Chair)
Mario Antonioletti, EPCC (Notetaker)
Andrew Grimshaw, University of Virginia
Simon Laws, IBM
Peter Kunszt, CERN
Erwin Laure, CERN
Susan Malaika, IBM
Howie Lang?, University of Virginia
Dave Snelling, Fujitsu
Allen Luniewski, IBM

Agenda Bashing:

Dave Berry asks if everyone has a copy of Simon's and Andrew's slides.
(See https://forge.gridforum.org/docman2/ViewCategory.php?group_id=42&category_id=708)

After some discussion we decided that these two presentations work at different levels. Simon captured what came out of the four use cases. Should check to ensure that nothing important has been missed from the use cases. Should then go over Andrew's slides which have an embryonic architecture.

We looked at Simon's slides bearing in mind what Andrew has already pulled out, trying to map these to Andrew's slides. Andrew shared his desktop and presented both slide sets side by side.

Andrew: comments that not all the use cases were discussed last week. The ones being presented are more traditional science ones. Need to consider the more business orientated ones so as not to lose perspective.

We agreed to work with the ones that are there and discuss the business use cases at a future telcon.

2. Review list of capabilities
--------------------------------

This turned out to take up most of the time available.

* Physics use case:

Peter: also have a large number of users with chaotic access patterns.
Andrew: mostly read only version of the data as well.
Andrew: multiple administrative domains... true of all our use cases...
Andrew: asks about the number of files..
Peter: millions, got 10^12 from the last experiment but trying to cut down on this.
Andrew: how big is a large file for the physics community?
Peter: can be of O(10Gb).
Andrew: for each basic type would have associated attributes. Want to take each type and begin to grow them out.
Simon: discovery versus searching within the file...
Mario: do you see flat files as text or binary?
Andrew: flat file the same as in Unix, could be either ...
Dave: do we need a distinction between queryable and non queryable files?
Peter: have a file vs metadata...
Andrew: captured in the statement "Directories aka namespaces (real and virtual)"
Susan: why are files and directories separate things?
Andrew: directories are just logical name spaces
Susan: then you should do the same drilling for RDBMS....

Animated discussion about whether, if you drill down into files and split them into directories and files, you should not do the same thing for RDBMS.

Andrew: what do you mean by data storage? Does that really mean that you want to manage the low level stuff like in a SAN?
Peter: no, you need to put the files in certain locations ...
Dave: provisioning in OGSA terms ...
Peter: want to manage things like quotas ...
Andrew: so you would need to identify locations..?
Peter: at a high level ... space reservation - you need to book the space if you are going to be running a job producing this... file system proxy is captured in "look and feel".

* Data Distribution

Andrew: captured triggers in transformation and servicing ... did not capture network congestion control ... this goes under provisioning.
Dave: should data movement be listed separately?
Andrew: I was thinking of pre-loading, caching and replication management. Does data movement mean migration or copying?
Peter: it could be copying or migration (move from A to B and not at A any more).
Andrew: migration fits better with replica ... may need a new title. Service orchestration: I captured that as transformation and that they could be chained...
Peter: the same thing - if you have a replication and then a service registers a file in a file catalogue, how do you ensure that the services are updated?
Andrew: we did not capture transactions ... we need to capture transactions.
Susan: INFO-D scope includes DAIS 3rd party movement and also publish/subscribe through WS-Notification - can try to use this to cover the different aspects.

* Life Sciences Case

Andrew: high availability and performance are really QoS properties
Dave S: borderline between the QoS and the properties which are orthogonal ...
Allen: need to also list the currency of the data ...
Susan: also have transactionality included as a QoS ... there are different levels of transaction support
Dave: how much of these things are metadata descriptions ...
Allen: don't think that any of these things are metadata ... these are things that you negotiate with the service provider...

Discussion about attaching metadata to the object.

Dave: does updating imply transaction?
Andrew: not in this use case

Andrew wonders whether literature should be added as a basic type. This is added as "text mining".

*** Andrew leaves at this point. Howie does the slide driving.

* Data Mining slide.

Data mining seems to come under replication and caching ... maybe need a new slide for distributed access - distributing the query

Simon: distributed access could be used as a QoS
Simon: the identity question can also apply to distributed access ... how the data is identified and located ... distributed access is closely related to identification of data.
Dave: Identification, location and naming should be added to this slide
Susan: data extraction is very popular ... Grid only in the aspects of the mining that are CPU intensive and other bits that are IO bound - the Grid is good at handling those kinds of things ... not sure whether scoring would fit under transformations ... this is very I/O bound. Not sure if it is a special kind of indexing where you split things up that are of interest to you ... don't know where indexing comes in either.
Dave S.: what are the processes that are taking place centrally and thus not Grid? Trying to capture the boundary of the application.
Susan: the data sources are very distributed ... could evolve to become a distributed process.
Dave S.: accessing remote information is definitely Grid computing but just because the data sources are distributed does not mean that it is distributed ... the requirement of today's applications may not have that just now but we should not ignore it ...
Simon: should add indexing - that Susan brought out ....
Susan: not sure where it goes .... not sure if it is metadata...
Dave: suggest putting it under a basic type ...
Dave S.: an index is a state of a file ... it is indexed or not...
Dave: legal restriction of metadata should come under security ...
Simon: there are restrictions and compliance.
Dave S.: is a privacy issue as well which is more than security ... could be implemented under security but it has a higher level set of concepts ...
Simon: crosses the audit and provenance ...
Dave: have not put audit anywhere ...

Legal restrictions is classified under "Not discussed".
"Audit logs" is under security.
Privacy and confidentiality are placed under legal restrictions.

Susan: federated identity ...

"Identification representation" is added.

Dave S: I've heard it used in the context where you get access through membership to an organisation and through your identity

* Review

Simon: do we have a framework that we are comfortable with?
Susan: have a set of charts with capabilities ....
Dave: grouping of capabilities under common groupings ... do we have relevant groups that are structured in an appropriate way that can later be cast as services.
Simon: question is how to go from this to the services?

Review Andrew's slides as amended during this discussion.

Dave: Transparency addressed in the OGSA Data Service document. I read that document in querying a structured piece of data ...
Simon: in reality you present a service interface to a piece of data. That data could be anything...
Dave: how does that fit with distributed files of the physics applications?
Susan: OGSA Data Services is ways that you access the files and ways that you manage the file ... that is all it's saying - ways to access, manage and expose metadata ....

...

Dave brings up aspects of performance properties that you may wish to expose. Discussion follows as to whether these characteristics should be exposed. The issue of layering - where different performant layers may be exposed.

...

Under look and feel what this means is that particular types of data service will have to say the different types of data there might be and replication

Simon: how these services are made to fit under the "look and feel" ... difficult to distill the service in a single pass. These come out of the discussion ...

3. Architecture discussion
--------------------------

Simon: should note what seems important under each of these points
Dave B: put an architectural notion under each of the points ...
Simon: is it policy, an api, a relationship between services ...
Susan: but it depends on the context ...
Simon: that is an important point
Allen: need to define what these artifacts are ....
Susan: the architecture is for a Grid fabric ... a sys management thing could be exposed differently at different levels .. a policy could become an api.
Peter: you also want to know what services you have to manage at the service discovery ... concept of containers, grouping, factories .... try to think of it as someone who wants to deploy services ... what kind of containers are available, what services ...
Susan: should the context from which people are coming in from: a Grid fabric person, ...

People describe the interest that they have in this.

4. Draft list of services
-------------------------

- Postponed

5. Plan for FTF meeting
------------------------

- Andrew and Dave will attend. Allen may attend.

6. Wrap up
----------

DONM: Wed 19/5 at same time.

AOB: Dave will send out the amended slides with instructions.