OGSA F2F, London -- 22 May 2005

BES Discussion Minutes

Attendees
---------------
	Steven McGough
	Chris Smith
	William Lee
	Andrew Grimshaw
	Mark Morgan (Minutes)
	Jay Unger
	Hiro Kishimoto
	Fred Maciel
	Steven Newhouse
	Takuya Mori
	Darren Pulsipher
	Michel Drescher
	Andreas Savvas
	Soonwook Hwang
	Kazushige Saga
	Mathias Dalheimer
	Vesso Novov
	Ravi Subramaniam
	Donal Fellows

Agenda
------
	Agenda Bashing
	ServiceStructure
	Submit/Create/Info/Destroy/ interface
	Recap and Summarize
	Short lunch and informal discussions
	Management aspects of BES containers
	Integrate and discuss sections

Agenda Bashing
------------------------
	* Anything along the lines of information?
		- Something in the slides already.
	
Service Structure
-------------------------
	* Presentation by Andrew
	* EMS Scope -- we are just the tiny place
	* Assume
		- Placement decision made
		- Where run decided
		- Going to be abstract names
		- Security is out of scope
			-- In scope is "what is the id of user to operate as
			   user in environment."
			-- Difficulty here is, can I send a job to an endpoint,
			   that is authenticated wrt the use of the
			   endpoint, that contains a "run me as id" that has no
			   relationship to the authenticator
			-- There is information in the JSDL
			-- There is also info in the exchange between the
			   container and the caller.
			-- Would suggest that information in that exchange, if
			   you think that that has passed some auth. check, is
			   more valuable than some string in the JSDL
	* Breakdown
		- Factory
			-- Create, status, kill, etc.
		- Management
			-- Set policy
			-- Visible properties
		- JSDL terms that must be understood
			-- "Name space" of JSDL in the sense of the way JSDL
			   uses names.
	* Management port type
		- is of factory, or of job itself?
			-- Of job itself
		- Factory, applied to container or job?
			-- Container....
	* JSDL -- we need to understand this very well in this working group.
		- There is a lot of ambiguity in JSDL that we have to decide,
		  "what does this mean for us?"
		- Seems like there are places in JSDL where things are
		  mandatory and we are not sure why.
		- Names in JSDL...
			-- What are they?
			-- What do they mean?
			-- How do you specify some things?
		- Interop is why we are here...
	* Is there a plan to discuss WS-Agreement?
		- We have discussed this and decided not to deal with it right
		  now.
		- Instead, we will figure out what BES is to do and then later
		  we will decide what to use to do that.
		- WS-Agreement doesn't just define protocol, but also has API.
		- There seems to be a high degree of concurrence between what
		  BES is talking about and WS-Agreement.
		- Really, it looked very similar at that level.
		- We didn't have to make a decision at that level.
		- It doesn't matter what these things are called.
	* When are we going to go back to the WS-Agreement discussion?
		- How about immediately before GGF-14?
	* In the end, we think that you are going to end up putting other
	  things around BES.
	* Factory-like stuff
		- WS-Name createActivity(JSDL Document)
			-- Point is, we have a container and we are asking it
			   to start something for us.
			-- It will return a handle to the created service.
			-- Should this service also consume a CDDLM document in
			   addition to JSDL?
			-- Are we going to support different types of
			   documents?
			-- Can you embed CDDLM terms within a JSDL section?
			-- Difference is the expected lifetime of the service.
				--- CDDLM is thought of as a system configuration
				    management tool for deploying long-lived
				    services.
				--- How long a service/resource lives is really
				    a matter of perspective.
				--- The issue is that we have two specs that
				    are fundamentally doing the same thing
				    but for interoperability reasons, we have to
				    choose one.
			-- Do we have to merge the purposes of the
			   two?
			-- CDDLM provides more sophisticated functions.
			-- BES comes after everything is set up, so isn't this
			   out of scope?
				--- Starting services is where the overlap is.
				--- When we first did this, it was imagined
				    that container would have some management,
				    staging, etc.  However, the group has been
				    going for something MUCH simpler.
		- Why have a separate staging service, a configuration service,
		  when those are specializations of the execution service?
		- Why not make it part of the container that it knows how to do
		  those things?  Then BES has no clue what it's doing, what's
		  being passed into it.
		- Andrew's world view.
			-- Container -- JSDL comes into it.
			-- Logically, it creates a new activity (with a
			   different WS-Name) which is going to do what is in
			   the JSDL doc that it says it should be doing.
			-- That could be
				--- Run Blast
				--- Run Service Container
				--- etc.
			-- In the original picture, there was a Job Manager
			   dealing with the container.
			-- The Job Manager looks at the JSDL, sees the staging.
			-- He sends a job to the BES and asks for a staging
			   execution, then sends a job to run the job, then
			   another one to stage out.
		- Another view of this is that staging or CDDLM is the
		  provisioning.
		- It is out of scope for BES, but in the future it will
		  probably be part of EMS.
		- There is one use case for CDDLM which is very similar to this
		  -- initiating an activity.
		- Problem is that there are bunches of batch systems out there
		  that do already deal with file staging.
		- JSDL presumes an IO model that is dominated by file staging,
		  though not limited to that.
		- However, file staging is not the only model.  There are other
		  IO preparatory models.
		- Have we just pushed the problem into a different working
		  group?
		- Why does this matter?
			-- If my container gets a JSDL document with staging,
			   it will just do it.  However, this breaks interop.
		- Two aspects to JSDL.
			-- You have to parse it.
			-- And then you have to understand it.
		- Suppose in addition to createActivity, should we also have a
		  port type for doing staging?
			-- No
		- What about submitting a JSDL which contained ONLY staging?
			-- Yes.
		- Doing this presumes that the activity has some value to
		  something later.
			-- But, what is the lifetime of this staged file?
			-- There has to be some knowledge of the workflow
			   policy on when things get cleaned up.
			-- No problem with implied dependencies.  If we decide
			   to do that in BES, that's fine.  If you have
			   something more complicated than that, then go get
			   something else like BPEL or whatever.
			-- There is a problem though in that...what about the
			   binaries?  What about the job container?  That's not
			   included in this simple input/output process.
				--- Yes it is.
				--- In an earlier version, we had some notion
				    of application binaries.  Problem is, in
				    JSDL it's a string and we have to define
				    what that string means.
				--- It's OK if we want to model a very simple
				    workflow in BES.
				--- If you can't do part of a workflow and you
				    have a request to do that part, then you
				    have to have well defined error behavior.
				--- As long as we fully understand what the
				    workflow model is, then we bound it.
				--- Then we say, this is as much as you can do
				    with a Basic JSDL document.  If you want
				    more, then go somewhere else.
			-- CDDLM is not using the sophisticated work flow
			   thing, they introduce lifecycle of the jobs
			-- Life cycle
				--- deploying phase
				--- execution phase
				--- de-deploying phase
		- We have to keep in mind that we aren't constrained to use
		  CDDLM.
		- The problem is that, some would argue that there ought to be
		  no choreography in BES.  If you want Choreography, use BPEL
		  or something else.
			-- Not sure we agree with that, because as said before,
			   there are a lot of systems out there that do handle
			   staging.
		- We need to make sure that we aren't running into the CDDLM
		  activity, and we need to find out how they fit into our
		  space.
		- They distinguish themselves from provisioning.
		- If it's out of scope, then we need to make sure that it's
		  stated as out of scope.  Someone needs to go look at CDDLM
		  and come back and say how this all fits together.
		- If we are going to make this decision that there is a basic
		  execution model with some choreographed life cycle, that we
		  write it down.  That we make it clear what the model is.
		- If you say stage, execute, de-stage, there needs to be an
		  implied relationship between those phases.
		- This is fundamental.  Is initiate activity something that can
		  have multiple steps or stages, OR, do we say, this is REALLY
		  basic, there is no choreography, everything is separate, if
		  you say stage in, then you stage in, if you say execute, then
		  you execute.
		- But, there are tons of systems out there that already have
		  this basic capability.  If you specify stage, execute, stage,
		  then the dependencies are taken care of.
		- So, as long as you put it in one JSDL document, then you
		  don't have to have any thing else.
		- Are we assuming that BES has that three stage model
			-- seems like we are
		- It bothers some because it's not the simplest solution, but
		  things should be as simple as possible without being simpler.
		- You can say stage in -- the problem is, how long can I count
		  on the data being around.  JSDL has the notion of file system
		  which gives you control over how long the thing lives.
		- The difficulty is that, as soon as you decide that there is
		  no guaranteed workflow model, then you are relying on blind
		  faith, or relying on something like BPEL.
		- Suppose you are trying to do one of these jobs that does work
		  across multiple sites.  Once they are all staged, then you
		  say go.  How does BPEL do this if BES only supports the
		  staging workflow model.
		- But, JSDL allows you to have staging, and jobs that are
		  empty.  You can send something that has empty whatever.
		- There is presently an implied life cycle that says that the
		  implied duration of the stage activity is the whole job.
		  JSDL has an explicit way of dealing with this.
		- What are the choices in JSDL?
			-- Delete on Termination -- boolean
		- We have to make some life cycle statements about things we
		  create or consume.
		- In the old days, keep/delete/pass.
			-- Pass meant that there was some other thing that
			   would consume this thing -- you have to make sure
			   that it stays around at least until the next thing
			   consumed it.  The next thing could then also say
			   pass.
			-- This worked great for 15 years until something
			   failed and then the chain was broken.
			-- We then had to say pass,delete or pass,pass
		- But, on a grid you don't have control of the disk storage.
		- Sure I do -- if I create a disk resource -- if I say it's
		  going to be consumed, then it can record that disposition and
		  know that the next thing that comes along that attempts to
		  access that resource, is the trigger that decides what
		  happens to it finally.
		- What if something never shows up?
		- Also, that resource is not being handled by a single
		  scheduler.
		- It doesn't matter as long as they are coordinated -- and that
		  coordination can be at a very high level -- a human being.  
		- There seems to be consensus that we stick with the JSDL model
		  where an activity may consist of an activity with up to
		  three stages -- it doesn't have to contain all of those
		  stages.
		- We need to have a corresponding state space that maps to those
		  stages.
		- There is some sort of atomicity guarantee between those
		  phases.
	* Now, we are going to have a presentation on JSDL.  Seems particularly
	  important that we do.  Believe that we have a consensus that we want
	  to have (createActivity).  The next two are much less controversial,
	  but there may be hairy work involved.
	* ActivityStatesDocument getActivityStatus(WS-Name[]) -- This is asking
	  for one or more activities.
		- We need to determine the state space.
		- We have to worry about staging now and in JSDL you could
		  stage in multiple documents.
		- What does the ActivityStatesDocument have
			-- it may include references to the JSDL document.
			-- Have a parallel structure in the status document to
			   the original JSDL document.
			-- We need to get people to go off and tell us what
			   this would look like.
			-- Pulsipher volunteers the JSDL group here
				--- Chris, Andreas, etc.
		- Faults -- we always return a document and the document may
		  have fault information.  
	* terminateActivity(WS-Name[])
		- When you send a kill, you want it to return immediately.  We
		  could have a "terminating" as well as a terminated state.
		- Termination can get stuck.
	* Why not ask the job for the status?
		- Well, bulk operations
		- Sometimes Activities don't respond
	* Some things have disappeared
		- GetActivities
		- Would a container be a service group
			-- ServiceGroup?
	* Shouldn't we define operations on activities as well?
		- Checkpoint
		- Send Arbitrary Signals
		- Change Job Description
	* Are we assuming that terminate is a hard kill?
		- Yes.
		- If you want to have a softer kill, that should probably be on
		  the activity itself
	* If we go down this route, then we have to define operations that are
	  on the activity and which ones are duplicated in both the container
	  and the activity.
		- We need to clearly distinguish the difference between things you
		  do in the container and things you do in the activity.
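
The container operations discussed above can be sketched as follows.  This
is a minimal illustration, not an agreed interface: the state names, the
in-memory Container class, the advance() helper, and the use of a plain
dict in place of a real JSDL document are all assumptions made for the
example.  It only shows the points on which there seemed to be consensus:
an activity has up to three stages, status is a bulk call that always
returns a document (with faults embedded), and terminate is a hard kill
that returns immediately.

```python
from dataclasses import dataclass
from enum import Enum
import uuid


class State(Enum):
    # Hypothetical state space; the minutes leave the real one open.
    PENDING = "pending"
    STAGING_IN = "staging-in"
    EXECUTING = "executing"
    STAGING_OUT = "staging-out"
    DONE = "done"
    FAILED = "failed"
    TERMINATING = "terminating"
    TERMINATED = "terminated"


STAGE_ORDER = [State.STAGING_IN, State.EXECUTING, State.STAGING_OUT]
EXIT_STATES = {State.DONE, State.FAILED, State.TERMINATED}


@dataclass
class Activity:
    name: str      # stands in for the WS-Name handle
    stages: list   # the stages this activity actually contains, in order
    state: State = State.PENDING


class Container:
    """In-memory stand-in for a BES container (illustration only)."""

    def __init__(self):
        self._activities = {}

    def create_activity(self, jsdl):
        # jsdl is a plain dict here; a stage is present if its key is set.
        stages = [s for s in STAGE_ORDER if jsdl.get(s.value)]
        name = "urn:example:activity:" + str(uuid.uuid4())
        self._activities[name] = Activity(name, stages)
        return name

    def get_activity_status(self, names):
        # Bulk query: always return a document, faults go inside it.
        doc = {}
        for name in names:
            act = self._activities.get(name)
            if act is None:
                doc[name] = {"fault": "unknown activity"}
            else:
                doc[name] = {"state": act.state.value}
        return doc

    def terminate_activity(self, names):
        # Hard kill request; returns at once and leaves the activity in
        # a "terminating" state, since termination can get stuck.
        for name in names:
            act = self._activities.get(name)
            if act is not None and act.state not in EXIT_STATES:
                act.state = State.TERMINATING

    def advance(self, name):
        # Move to the next stage the activity contains, or to DONE.
        act = self._activities[name]
        if act.state in EXIT_STATES or act.state is State.TERMINATING:
            return act.state
        nxt = 0 if act.state is State.PENDING else act.stages.index(act.state) + 1
        act.state = act.stages[nxt] if nxt < len(act.stages) else State.DONE
        return act.state
```

Under these assumptions, a document that contains only staging yields an
activity whose single stage is stage-in, which matches the answer given
above that submitting a staging-only JSDL should be allowed.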

Brief Intro to JSDL
-------------------
	* Presentation by Steve McGough
	* Stage-in/Stage-out -- if you already have the file, then does it get
	  copied over?
		- No, if it is there, it is there and doesn't need to be
		  re-staged-in.
	* How elastic is this system.  If there are two jobs that both use a
	  staged in file, what does that mean?
		- The job doesn't need to know about sharing of resources, but
		  the manager may need to.
	* The thing that is interesting about this is that, if you think about
	  Choreography, then all choreography boils down to understanding what
	  the explicit relationships between usage and execution are.  If you
	  say things like A produces something that B needs, then order and
	  workflow can be figured out and in that situation, every intermediate
	  resource has associated with it properties (life cycle,
	  associateability, dimension) and you can learn a lot more about
	  scheduling with that model than with the model of "First Run A, then
	  Run B".
	* WRT JSDL, if we did make a broader Choreography statement at some
	  point, what would be the properties of those resources.
		- If you said that you were defining in JSDL that those
		  resources are all modeled as point sequencers, that would be
		  OK.
		- We have to figure out what those default properties are.
		- If JSDL gets used with BES, then BES should say here is how
		  JSDL gets used.
		- BES will make a profile on JSDL.
		- There is no semantic meaning to the elements in the
		  <Application> element in JSDL, but BES can profile this to
		  apply semantic meaning.
	* Re: <POSIXApplication>.  Executable is a string.
		- Could be like /usr/local/bin/...,
		- or could be RNS,
		- or....
	* Do we in BES want to deal with different namespaces?
	* Resources -- in BES, we should use the same names and definitions
	  that JSDL has.
	* What about GLUE schema, would they agree with these?
		- Not sure.
		- GLUE came out of CERN but was originally part of data grid.
		- From an OMII perspective, alignment with EGEE is more
		  important than CIM.
	* If BES uses JSDL, then BES has to talk about the semantics of the
	  resource types.
	* BES doesn't have to say anything about what gets done with things as
	  they come down.  What decision do I make about fault vs. Ignore for
	  things that I don't care about.  What parts should BES parse, ignore,
	  fault, do something about?
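
The parse/ignore/fault question above could be captured by a profile table
along these lines.  This is a sketch under stated assumptions: the Policy
enum, the PROFILE table, and the particular handle/ignore assignments are
hypothetical and were not decided at the meeting, and the element names are
used only as plain labels rather than parsed from a real JSDL document.

```python
from enum import Enum


class Policy(Enum):
    HANDLE = "handle"   # BES understands the element and acts on it
    IGNORE = "ignore"   # BES accepts the document but does nothing with it
    FAULT = "fault"     # BES rejects the document


# Hypothetical profile; the assignments are illustrative only.
PROFILE = {
    "POSIXApplication": Policy.HANDLE,
    "Resources": Policy.HANDLE,
    "DataStaging": Policy.HANDLE,
    "JobAnnotation": Policy.IGNORE,
}


def classify(element_names, default=Policy.FAULT):
    """Group element names by the policy the profile assigns to them."""
    result = {policy: [] for policy in Policy}
    for name in element_names:
        result[PROFILE.get(name, default)].append(name)
    return result


def accept(element_names):
    """A document is accepted only if no element maps to FAULT."""
    return not classify(element_names)[Policy.FAULT]
```

Defaulting unknown elements to FAULT is one possible choice; passing
default=Policy.IGNORE to classify() models the opposite answer to the
"fault vs. ignore" question for things the container does not care about.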

Management aspects of the BES container.
---------------------------------------
	* Fred to present on RM-DesignTeam of the OGSA-WG.
	* What do we want to do in BES in terms of the management
	  activities on the container itself?
	* These meta-data as well as policies that we might want to have, we
	  need to have some definitions of those by June.
	* Would rather have something and say these are prototypes, rather than
	  have a huge hole.
	* Two issues are what are the attributes we want to represent, and how
	  do we want to represent and manipulate them.
	* Most of these things generally boil down to attribute/value pair.
	  How do we render that?
	* JSDL already has some, then we should use those, esp. in June time
	  frame.  Most of those however are useful in matchmaking.  But, are
	  there also policy things that we want to do on containers.  Things
	  like, change the scheduling policy, reduce the maximum number of
	  things you will run at any given time.
	* Are there any common ones of those where you will want to change the
	  number.
	* Why not move policy things outside of the container.  Have an
	  external policy manager.
	* Our bias should be to use the ones given in JSDL.
		- But JSDL isn't sufficient in some cases.  There may be other
		  metadata that needs to be examined, manipulated, etc.
	* Why isn't this information out of scope?
		- Well, making decisions is out of scope, but needing to get
		  the information isn't out of scope.
	* Should we come up with a meta-model for attributes that is
	  extensible?
	* There is a primitive resource requirements specification in JSDL,
	  some are dimensional like memory, CPU size, etc.  There is a hint of
	  a meta-model, but it isn't explicit.  It would be nice if someone who
	  was doing an RR.
	* If we are going to adopt anything, then we should adopt what JSDL is
	  using this moment.
	* Both groups agree that the RRL is something that another group should
	  do.
	* If you have defined a container well, then using JSDL to test it is
	  good, but using JSDL to define BES isn't necessarily right.
		- But JSDL as a starting point seems OK.
	* Why do we need to worry about advertisement when selection is out of
	  scope.
		- The container is a service, but it is going to have to have
		  properties that may be used by someone who IS doing
		  selection.
	* The container advertises its ability to start things, but
	  submitted jobs may not have that exact content.  When a container
	  gets JSDL that has a bunch of resources specified in it, it's within
	  its purview to look at that resource specification and decide not to
	  accept the job creation -- how does it report on why it decided not
	  to do a job?
	* Is the BES about the container, or just about the job launching port
	  type?
	* Container has different policies/attributes than the jobs.
	* Instead of focusing on JSDL, why aren't we focusing on WS-Agreement?
		- Because it's out of scope.
	* Are we defining a port type, or a service.  If service, then you have
	  to have everything available here, if port type, it can be composed
	  with other things.
	* Let's say that the resources that a job advertises, and those that it
	  uses are equal.
	
Votes:
--------
	Service:      13
	Port Type:   7
	Don't Care:  0

	Attributes Yes:	14
	Attributes No:	3
	Don't Care:	1

Meta Model
----------
	* Can we just say that we start with JSDL, and if a few people can come
	  up with a metamodel in time, can we say, hey, here is something that
	  works, lets use it.  If it fails, then we have JSDL and nothing lost.
		- Yes -- Mark, Jay, and Chris to take point.
		- June 16th telecon,
		- Decision -- start with JSDL, some people to go off and try to
		  make a meta model (or find one) by June 16th.

Additional Container Port Types
-------------------------------
	* A query interface for activities -- pass in a document 
	* get state model
	* save state of activities
	* save state of container
	* advertise attributes

	* Assuming a container supports all states, what are those states?
	  BES would define a set of states.
		- Do we need an unknown state?
		- Is there any concept of priority?
			-- Has nothing to do with state.
		- Both done and failed are exit states.
		- Maybe we don't want them to disappear necessarily.
	* Do we want a get state model?
		- No.
	* We have a state model in BES, should we discuss more?
		- Yes.
	* Resource dependency/data dependency models are the only ones that
	  solve the "Missed the train" case...process driven models always have
	  this problem.
	* The nice thing about a resource driven model, you can query the
	  resource and it's either there, or it isn't.
	* We need to have a phone call.
	* There is the container and the activity.
	* There is a working group (Self Manage Services), aren't they dealing
	  with state saving?
		- Saving the state of a service is out of scope for the service
		  because if we have protocols for saving state of a service,
		  that's some other thing or it's a property of the container
		  that the container is running in.
	* Would like to argue that we keep checkpointing, even the container,
	  out of scope for this discussion.
	* What about start and stop?
		- Will it have lifetime properties itself.
		- If I terminate a container, do I terminate all of the
		  activities.
	* What about policy things?
	* If there is a state of the container -- clearly the container has very
	  gross state (running, not running).  Does the container have sub
	  states -- can it be told to not stop, but don't accept any more work?
	* Do we need to have a state model for the state container itself.
		- Yes, but is it more than running and not?
	* If the reason a container rejects a JSDL doc is because it's draining,
	  that's a different fault.  If you want to have a different port type
	  that says, "You, drain", then you at least are going to have a
	  different port type.  Or is it a property?
	* Content to leave a basic running/not as long as you add a new fault
	  that says, not accepting more drops.
	* Are we going to worry about Orphans?  You will always have orphans.
	* Motion that activities are decoupled from containers in terms of
	  Life Cycle.
	* Motion that when a container terminates, we make no statement
	  whatsoever about the contained things.
	* If you want to stop all of the things on that gateway, then you have
	  to do it explicitly through the container prior to shutting the
	  gateway down.
	* Should we call this a BES, or factory, instead of container?
	* Transactionally, if you are attempting to kill all of the
	  subordinates of a BES, then you should say shutdown and then kill all
	  the instances.
		- You need something that says stop accepting new jobs, and
		  something that says go away.
		- We need to have a new port type that says, stop accepting new
		  jobs and resume accepting new jobs.
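
The container behavior discussed above (a gross running/not-running state,
a way to stop and resume accepting new jobs, a distinct fault while
draining, and activities decoupled from the container's life cycle) can be
sketched as below.  The class, method, and exception names are assumptions
made for this example, not names agreed at the meeting.

```python
class ContainerNotRunning(Exception):
    """Hypothetical fault: the container is not running at all."""


class NotAcceptingNewActivities(Exception):
    """Hypothetical fault: the container is running but draining."""


class Container:
    def __init__(self):
        self.running = True      # gross state: running / not running
        self.accepting = True    # False while the container is draining
        self.activities = set()

    def stop_accepting(self):
        self.accepting = False

    def resume_accepting(self):
        self.accepting = True

    def create_activity(self, name):
        if not self.running:
            raise ContainerNotRunning(name)
        if not self.accepting:
            # A different fault from any JSDL-related rejection.
            raise NotAcceptingNewActivities(name)
        self.activities.add(name)
        return name

    def terminate_all(self):
        # Explicit kill of everything the container knows about.
        killed = sorted(self.activities)
        self.activities.clear()
        return killed

    def shutdown(self):
        # Makes no statement about contained activities: they are left
        # untouched, so orphans are possible.
        self.running = False
```

In this sketch, stopping everything cleanly is a three-step sequence done
by the caller: stop_accepting(), then terminate_all(), then shutdown().
Calling shutdown() alone leaves the activities as they were.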

Activity Port Types
---------------------------
	* Should we say anything about this?
	* Is it a WSRF endpoint?
	* If so, are the port types:
		- Check Point
		- Send Signal
		- Change Job Description (No)
		- GetStatus
	* Shouldn't this group treat activities as things which may, or may not
	  have port types.
	* Let's say that some types of things, like for example a legacy
	  activity, we have a specific port type that can have things like
		- Send Signal
		- Terminate (Soft Terminate)
		- Suspend and Resume
	* Is there some manual action you can take that would affect state
	  transitions.
		- If you implement suspend/resume, if you suspend an activity,
		  then it may, or may not, affect the resource consumption of
		  an activity.
	* There is another assumption that each activity will have a port type
	  that you can access through a web service.
	* Suppose I am a service in web sphere, I may want to stop it for some
	  period of time
	* Architecturally, you have to prepare for the worst
	* We just have to make sure that we are clear in our specification what
	  it means to suspend something.  We can't guarantee anything about
	  its resource usage or resource utilization.
	* Vote indicates that we don't have to define the Activity Port type.
	* Given that we are passing JSDL documents in, what are we going to
	  consider to be well formed JSDL documents and expect to be able to
	  handle?
	* Comes down to Posix application and we can state up front that we
	  support the Posix Application.

Wrap up
-------
	* We're not going to define Activity Port type
	* For the container we have about 5 or 6 port types
	* What is the form of the meta data/attributes -- start JSDL, come up
	  with meta model
	* Mark to send out minutes to ogsa-bes and ogsa-wg.
	* A number of decisions have been made
	* Darren, Steve, and Andrew have the pen to put together the document
	* More telecons to discuss
		- What is the state space that things can move through