OGSA F2F, London -- 22 May 2005 BES Discussion Minutes Attendees --------------- Steven McGough Chris Smith William Lee Andrew Grimshaw Mark Morgan (Minutes) Jay Unger Hiro Kishimoto Fred Maciel Steven Newhouse Takuya Mori Darren Pulsipher Michel Drescher Andreas Savvas Soonwook Hwang Kazushige Saga Mathias Dalheimer Vesso Novov Ravi Subramaniam Donal Fellows Agenda ------ Agenda Bashing ServiceStructure Submit/Create/Info/Destroy/ interface Recap and Summarize Short lunch and informatl discussions Management aspects of BES containers Integreate and discuss sections Agenda Bashing ------------------------ * Anything along the lines of information? - Something in the slides already. Service Structure ------------------------- * Presentation by Andrew * EMS Scope -- we are just the tiny place * Assume - Placement decision made - Where run decided - Going to be abstract names - Security is out of scope -- In scope is "what is the id of user to opreate as user in environment." -- Difficulty here is, can I esnd a job to an endpoint, that is authenticated wrt to the use of the endpoint, that contains a "run me as id" that has no relationship to the authenticater -- There is information in the JSDL -- There is also info in the exchange between the container and the caller. -- Would suggest that information in that exchange, if you think that that has passed some auth. check, is more valueable than some string in the JSDL * Breakdown - Factory -- Create, status, kill, etc. - Management -- Set policy -- Visible properties - JSDL terms that must be understood -- "Name space" of JSDL in the sense of the way JSDL uses names. * Management port type - is of factory, or of job itself? -- Of job itself - Factory, applied to container or job? -- Container.... * JSDL -- we need to understand this very well in this working gorup. - There is a lot of ambiguity in JSDL that we have to decide, "what does this mean for us?" - Seems like there are places in JSDL where things are mandatory and we are not sure why. - Names in JSDL... -- What are they? -- What do they mean? -- How do you specify some things? - Interop is why we are here... * Is there a plan to discuss WS-Agreement? - We have discussed this and decided not to deal with it right now. - Instead, we will figure out what BES is to do and then later we will decide what to use to do that. - WS-Agreement doesn't just define protocol, but also has API. - There seems to be a high degree of concurrence between what BES is talking about and WS-Agreement. - Really, it looked very similar at that level. - We didn't have to make a decision at that level. - It doesn't matter what these things are called. * When are we going to go back to the WS-Agreement discussion? - How about immediately before GGF-14? * In the end, we think that you are going to end up putting other things around BES. * Factory-like stuff - WS-Name createActivity(JSDL Document) -- Point is, we have a container and we are asking it to start something for us. -- It will return a handle to the created service. -- Should this service also consume a CDDLM document in addition to JSDL? -- Are we going to support different types of documents? -- Can you embed CDDLM terms within a JSDL section? -- Difference is the expected lifetime of the service. --- CDDLM is thought of a system configuration management tool for deploying long-lived services. --- How long a service/resource lives is really a matter of perspective. --- The issue is that we have two spec. that are fundamentally doing the same thing but for interoperabilty reasons, we have to choose one. -- Do we have to merge the purposes of the two? -- CDDLM provides more sophisticated functions. -- BES comes after everything is set up, so isn't this out of scope? --- Starting services is where the overlap is. --- When we first did this, it was imagined that container would have some management, staging, etc. However, the group has been going for something MUCH simpler. - Why have a separate staging service, a configuration service, when those are specializations of the execution service? - Why not make it part of the container that it knows how to do those things? Then BES has no clue what it's doing, what's being passed into it. - Andrew's world view. -- Container -- JSDL comes into it. -- Logically, it creates a new activity (with a different WS-Name) which is going to do what is in the JSDL doc that it says it should be doing. -- That could be --- Run Blast --- Run Service Container --- etc. -- In the original picture, there was a Job Manager dealing with the container. -- The Job Manager looks at the JSDL, sees the staging. -- He sends a job to the BES and asks for a staging execution, then sends a job to run the job, then another one to stage out. - Another view of this is that staging or CDDLM is the provisionng. - It is out of scope for BES, but in the future it will probably be part of EMS. - There is one use case for CDDLM which is very similar to this -- initiating an activity. - Problem is that there are bunches of batch systems out there that do already deal with file staging. - JSDL presumes an IO model that is dominated by file staging, though not limited to that. - However, file staging is not the only model. There are other IO preparatory models. - Have we just pushed the problem into a different working group? - Why does this matter? -- If my container gets a JSDL document with staging, it will just do it. However, this breaks interop. - Two aspects to JSDL. -- You have to parse it. -- And then you have to understand it. - Suppose in addition to createActivity, should we also have a port type for doing staging? -- No - What about submitting a JSDL which contained ONLY staging? -- Yes. - Doing this presumes that the activity has some value to something later. -- But, what is the lifetime of this staged file? -- There has to be some knowledge of the workflow policy on when things get cleaned up. -- No problem with implied dependencies. If we decide to do that in BES, that's fine. If you have something more complicated then that, then go get something else like BPEL or whatever. -- There is a problem though in that...what about the binaries? What about the job container? That's not included in this simple input/output process. --- Yes it is. --- In an earlier version, we had some notion of application binaries. Problem is, in JSDL it's a string and we have to define what that strings means. --- It's OK if we want to model a very simple workflow in BES. --- If you can't do part of a workflow and you have a request to do that part, then you have to have well defined error behavior. --- As long as we fuly understand what the workflow model is, then we bound it. --- Then we say, this is as much as you can do with a Basic JSDL document. If you want more, then go somewhere else. -- CDDLM is not using the sophisticated work flow thing, they introduce lifecycle of the jobs -- Life cycle --- deploying phase --- execution phase --- de-deploying phase - We have to keep in mind that we aren't constrained to use CDDLM. - The problem is that, some would argue that there ought to be no choreography in BES. If you want Choreography, use BPEL or something else. -- Not sure we agree with that, because as said before, there are a lot of systems out there that do handle staging. - We need to make sure that we aren't running into the CDDLM activity, and we need to find out how they fit into our space. - They distinguish themselves from provisioning. - If it's out of scope, then we need to make sure that it's stated as out of scope. Someone needs to go look at CDDLM and come back and say how this all fits togheter. - If we are going to make this decisoin that there is a basic execution model with some choreographed life cycle, that we write it down. That we make it clear what the model is. - If you say stage, execute, de-stage, there needs to be an implied reltaionship between those phases. - This is fundamental. Is initiate activity something that can have multiple steps or stages, OR, do we say, this is REALLY basic, there is no choreography, everything is separate, if you say stage in, then you stage in, if you say execute, then you execute. - But, there are tons of systems out there that already have this basic capability. If you specify stage, execute, stage, then the dependencies are taken care of. - So, as long as you put it in one JSDL document, then you don't have to have any thing else. - Are we assuming that BES has that three stage model -- seems like we are - It bothers some because it's not the simplest sollution, but things should be as simple possible without being simpler. - You can say stage in -- the problem is, how long can I count on the data being around. JSDL has the notion of file system which gives you control over how long the thing lives. - The difficulty is that, as soon as you decide that there is no guaranteed workflow model, then you are relying on blind faith, or relying on something like BPEL. - Suppose you are trying to do one of these jobs that does work across multiple sites. Once they are all staged, then you say go. How does BPEL do this if BES only supports the staging workflow model. - But, JSDL allows you to have staging, and jobs that are empty. You can send something that has empty whatever. - There is presently an implied life cycle that says that the implied duration of the stage activity is the whole job. JSDL has an explicit way of dealing with this. - What are the choices in JSDL? -- Delete on Termination -- boolean - We have to make some life cycle statements about things we create or consume. - In the old days, keep/delete/pass. -- Pass meant that there was some other thing that would consume this thing -- you have to make sure that it stays around at least until the next thing consumed it. The next thing could then also say pass. -- This worked great for 15 years until something failed and then the chain was broken. -- We then had to say pass,delete or pass,pass - But, on a grid you don't have control of the disk storage. - Sure I do -- if I create a disk resource -- if I say it's going to be consumed, then it can record that disposition and know that the next thing that comes along that attempts to access that resource, is the trigger that decides what happens to it finaally. - What if something never shows up? - Also, that resource is not being handled by a single scheduler. - It doesn't matter as long as they are coordinated -- and that coordination can be at a very high level -- a human being. - There seems to be consensus that we stick with the JSDL model where an activity may consist of an activity with up to three stages -- it doesn't have to contain all of thos stages. - We need to have a coorsponding state space that maps to those stages. - There is some sort of atomicity guarantee between those phases. * Now, we are going to have a presentation on JSDL. Seems particularly important that we do. Believe that we have a consensus that we want to have (createActivity). The next two are much less controversial, but there may be hairy work involved. * ActivityStatesDocument getActivityStatus(WS-Name[] -- This is asking for one or more activities) - We need to determine the state space. - We have to worry about staging now and in JSDL you could stage in multiple documents. - What does the ActivityStatesDocument have -- it may include references to the JSDL document. -- Have a parallel structure in the status document to the original JSDL document. -- We need to get people to go off and tell us what this would look like. -- Pulsipher volunteers the JSDL group here --- Chris, Andreas, etc. - Faults -- we always return a document and the document may have fault information. * terminateActivity(WS-Name[]) - When you send a kill, you want it to return immediately. We could have a "terminating" as well as a terminated state. - Termination can get stuck. * Why not ask the job for the status? - Well, bulk operations - Sometimes Activites don't respond * Some things have dissapeared - GetActivities - Would a container be a service group -- ServiceGroup? * Shouldn't we define operations on activies as well - Checkpoint - Send Arbitrary Signals - Change Job Description * Are we assuming that terminate is a hard kill? - Yes. - If you want to have a softer kill, that should probably be on the activity itself * If we go down this route, then we have to define operations that are on the activity and which ones are duplicated in both the container and the activity. - We need to clearly distingish difference between things you do in the container and things you do in the activity. Brief Intro to JSDL ------------------- * Presentation by Steve McGough * Stage-in/Stage-out -- if you already have the file, then does it get copied over? - No, if it is there, it is there and doesn't need to be re-staged-in. * How elastic is this system. If there are two jobs that both use a staged in file, what does that mean? - The job doesn't need to know about sharing of resources, but the manager may need to. * The thing that is interesting about this is that, if you tihnk about Choreography, then all choreography boilds down to understanding what the explicity relationships between usage ane executing are. If you say things like A produces something that B needs, then order and workflow can be figured out and in that situatoin, every intermediate resource has associated with it properties (life cycle, associateability, dimension) and you can learn a lot more about scheduling with that model then with the model of "First Run A, then Run B". * WRT JSDL, if we did make a broader Choreography statement at some point, what would be the properties of those resources. - If you said that you were defining in JSDL that those resources are all modeled as point sequencers, that would be OK. - We have to figure out what those default properties are. - If JSDL gets used with BES, then BES should say here is how JSDL get's used. - BES will make a profile on JSDL. - There is no semantic meaning to the elements in the element in JSDL, but BES can profile this to apply semantic meaning. * Re: . Executable is a string. - Could be like /usr/local/bin/..., - or could be RNS, - or.... * Do we in BES want to deal with different namespaces? * Resources -- in BES, we should use the same names and definitions that JSDL has. * What about GLUE schema, would they agree with these? - Not sure. - GLUE came out of CERN but was originally part of data grid. - From an OMII perspective, alighnment with EGEE is more important than CIM. * If BES uses JSDL, then BES has to talk about the semantics of the resource types. * BES doesn't have to say anything about what get's done with things as they come down. What decision do I make about fault vs. Ignore for things that I don't care about. What parts should BES parse, ignore, fault, do something about? Management aspects of the BES container. --------------------------------------- * Fred to present on RM-DesignTeam of the OGSA-WG. * What do we want to do in BES in terms of the management activites on the container itself. * These meta-data as well as policies that we might want to have, we need to have some definitions of those by June. * Would rather have something and say these are prototypes, rather then have a huge whole. * Two issues are what are the attributes we want to represent, and how do we want to represent and manipulate them. * Most of these things generally boil down to attribute/value pair. How do we render that? * JSDL already has some, then we should use thouse, esp. in June time frame. Most of those however are useful in matchmaking. But, are there also policy things that we want to do on containers. Things like, change the scheduling policy, reduce the maximum number of things you will run at any given time. * Are there any common ones of those where you will want to change the number. * Why not move policy things outside of the container. Have an extermal policy manager. * Our bias should be to use the ones given in JSDL. - But JSDL isn't sufficient in some cases. There may be other metadata that needs to be examined, manipulated, etc. * Why isn't this information out of scope? - Well, making decisions is out of scope, but needing to get the information isn't out of scope. * Should we come up with a meta-model for attributes that is extensible? * There is a premitive resource requirements speicfication in JSDL, some are dimensional like memory, CPU size, etc. There is a hint of a meta-model, but it isn't explicit. It would be nice if someone who was doing an RR. * If we are going to adopt anything, then we should adopt what JSDL is using this moment. * Both groups agree that the RRL is something that another group should do. * If you have defined a container well, then using JSDL to test it is good, but using JSDL to define BES isn't necessarily right. - But JSDL as a starting point seems OK. * Why do we need to worry about advertisement when selection is out of scope. - The container is a service, but it is going to have to have properties that may be used by someone who IS doing selection. * When the container advertised it's ability to start things, but submitted jobs may not have that exact content. When a container gets JSDL that has a bunch of resources specified in it, it's within it's pervue to look at that resource specification and decide not to accept the job creation -- how does it report on why it decided not to do a job. * Is the BES about the container, or just about the job launching port type? * Container has different policies/attributes then the jobs. * Instead of focusing on JSDL, why aren't we focusing on WS-Agreement.? - Because it's out of scope. * Are we defining a port type, or a service. If service, then you have to have everything available here, if port type, it can be composed with other things. * Let's say that the resources that a job advertises, and those that it uses are equal. Votes: -------- Service: 13 Port Type: 7 Don't Care: 0 Attributes Yes: 14 Attributes No: 3 Don't Care: 1 Meta Model ---------- * Can we just say that we start with JSDL, and if a few people can come up with a metamodel in time, can we say, hey, here is something that works, lets use it. If it fails, then we have JSDL and nothing lost. - Yes -- Mark, Jay, and Chris to take point. - June 16th telecon, - Decision -- start with JSDL, some people to go off and try to make a meta model (or find one) by June 16th. Additional Container Port Types ------------------------------- * A query interface for activities -- pass in a document * get state model * save state of activities * save state of container * advertise attributes * Assuming a container supports all states, what are those states? BES would define a set of states. - Do we need an unknown state? - Is there any concept of priority? -- Has nothing to do with state. - Both done and failed are exit states. - Maybe we don't want them to dissapear necessarily. * Do we want a get state model? - No. * We have a state model in BES, should we discuss more? - Yes. * Resource dependency/data dependency models are the only ones that solve the "Missed the train" case...process driven models always have this problem. * The nice thing about a resource driven model, you can query the resource and it's either there, or it isn't. * We need to have a phone call. * There is the container and the activity. * There is a working group (Self Manage Services), aren't they dealing with state saving? - Saving the state of a service is out of scope for the service because if we have protocols for saving state of a service, that's some other thing or it's a property of the container that the container is running in. * Would like to argue that we keep checkpointing, even the container, out of scope for this discussion. * What about start and stop? - Will it have lifetime properties itself. - If I terminate a container, do I terminate all of the activities. * What about policy things? * If there is a state of the container -- clearly the container has very gross state (running, not running). Does the container have sub states -- can it be told to not stop, but don't accept any more work? * Do we need to have a state model for the state container itself. - Yes, but is it more then running and not? * If the reason a container resist a JSDL doc is because it's draining, that's a different fault. If you want to have a different port type that says, "You, drain", then you at least are going to have a different port type. Or is a property? * Content to leave a basic running/not as long as you add a new fault that says, not accepting more drops. * Are we going to worry about Orphans? You will always have orphans. * Motion that activities are decoupled from containers in terms of Life Cycle. * Motion that when a container terminates, we make no statement whatsoever about the contained things. * If you want to stop all of the things on that gateway, then you have to do it explicitly through the container prior to shutting the gateway down. * Should we call this a BES, or factory, instead of container? * Transactionally, if you are attempting to kill all of the subordinates of a BES, then you should say shutdown and then kill all the instances. - You need something that says stop accepting new jobs, and something that says go away. - We need to have a new port type that says, stop accepting new jobs and resume accepting new jobs. Activity Port Types --------------------------- * Should we say anything about this? * Is it a WSRF endpoint? * If so, are the port types: - Check Point - Send Signal - Change Job Description (No) - GetStatus * Shouldn't this group treat activities as things which may, or may not have port types. * Let's say that some types of things, like for example a legacy activity, we hvae a specific port type that can have things like - Send Signal - Terminate (Soft Terminate) - Suspend and Resume * Is there some manual action you can take that would affect state transitions. - If you implement suspend/resume, if you suspend an activity, then it may, or may not, affect the resource consumption of an activity. * There is another assumption that each activity will have a port type that you can access through a web service. * Suppose I am a service in web sphere, I may want to stop it for some period of time * Architecturally, you have to prepare for the worse * We just have to make sure that we are clear in our specification what it means to suspend something. We can't guarantee anything about it's resource usage or resource utilization. * Vote indicates that we don't have to define the Activity Port type. * Given that we are passing JSDL documents in, what are we going to consider to be well formed JSDL documents and expect to be able to handle? * Comes down to Posix application and we can state up front that we support the Posix Application. Wrap up ------- * We're not going to define Activity Port type * container we have about 5 or 6 port types * What is the form of the meta data/attributes -- start JSDL, come up with meta model * Mark to send out minutes to ogsa-bes and ogsa-wg. * A number of decisions have been made * Darren, Steve, and Andrew have the pen to put together the document * More telecons to discuss - What is the state space that things can move through