--------------------------------------------------------------------

SAGA Use Case Template:
=======================

Name of use case:                   GriPPS - Grid Protein Pattern Scanning
Contact (name and address):         Christophe.Blanchet@ibcp.fr 
Authors (if different form contact) ................................


1. General Information:
-----------------------

  This section consists of check-boxes to provide some context in
  which to evaluate the use case.
  
  1.1 Which best describes your organisation:
  
    Industry                     [ ]
    Academic                     [X]
    Other                        [ ]
       Please specify:           ...................................
                       
                       
  1.2 Application area:    
                       
    Astronomy                    [ ]
    Particle physics             [ ]
    Bio-informatics              [X]
    Environmental Sc.            [ ]
    Image analysis               [ ]
    Other                        [ ]
       Please specify:           ...................................
  
  
  1.3 Which of the following apply to or best describe this use case
      Multiple selections are possible, please prioritize with numbers
      from 1 (low) to 5 (high):
  
    Database                     [5]
    Remote steering              [1]
    Visualization                [0]
    Security                     [3]
    Resource discovery           [1]
    Resource scheduling          [4]
    Workflow                     [3]
    Data movement                [5]
    High Throughput Computing    [3]
    High Performance Computing   [4]
    Other                        [ ]
        Please specify:          ...................................
  
  
  1.4 Are you an:
  
    Application user             [ ]
    Application developer        [X]
    System administrator         [ ]
    Service developer            [ ]
    Computer science researcher  [ ]
    Other                        [ ]
        Please specify:          ...................................
  

2. Introduction:
----------------
  
  2.1 Provide a paragraph introduction to your use case.  Background
      to the project is another alternative. (E.g. 100 words).

Genomics acquiring programs such as full genomes sequencing projects
are producing greater amounts of data. The analysis of these raw
biological data require very large computing resources. Functional
sites and signatures of protein are very useful for analyzing these
data or for correlating different kind of existing biological
data. These methods are applied, for example for identification and
characterization of the potential functions of new sequenced proteins,
clusterization in protein family of the sequences contained in
international databanks, and so on.

The Grid Protein Pattern Scanning-GriPPS project (granted by the
french programACI GRID 2002) aims to develop and adapt these
bioinformatic algorithms so that they can exploit the underlying grid
infrastructure. Models of those algorithms will be devised to be able
to foresee their behavior on a grid platform and proposals will be
written to adapt other bioinformatic algorithms to the grid. Within
this context, we propose to study such algorithms to identify the
constraints related to the biological applications and to determine
their granularity and the possible parallelization schemes that can be
applied to them.

  2.2 Is there a URL with more information about the project ?

http://gripps.ibcp.fr
  
  
3. Use Case to Motivate Functionality Within a Simple API:
----------------------------------------------------------  
  
  Provide a scenario description to explain customers' needs.  E.g. "move
  a file from A to B," "start a job."
  
  Please include figures if possible.

  If your use case requires multiple components of functionality, please  
  provide separate descriptions for each component, bullet points of 50
  words per functionality are acceptable.
  
Get fast access to sequence databank repository from worker node
Start scanning of one protein pattern
Download result file: patterns identified
Download sub-base of sequences
Repeat the job with an other protein pattern on the obtained subset of sequences

  
4. Customers:
-------------

  Describe customers of this use case and their needs. In particular,
  where and how the use case occurs "in nature" and for whom it occurs.  
  E.g. max 40 words

Used by biologist for identifying proteins with particular functionnality. Need to be able to put their own protein patterns against world-wide reference databases .

  
5. Involved Resources:
----------------------

  5.1 List all the resources needed: e.g. what hardware, data,
      software might be involved.

Storage:
- sequence database (SWISSPROT, TrEMBL, ..., user ones)
- pattern database (PROSITE, pFAM, ..., user ones)
Software: PattInProt
  
  5.2 Are these resources geographically distributed?

Yes, each database are maintained by different laboratories.
  

  5.3 How many resources are involved in the use case?  E.g. how
      many remote tasks are executing at the same time?

They could be several depending on the pattern bank and the proetin sequence bank to analyze.
  
  5.4 Describe your codes and tools: what sort of license is
      available, e.g. open or closed source license; what sort 
      of third party tools and libraries do you use, and what is 
      their availablility;  do you regularly work from source 
      code, or use pre-compiled applications; what languages 
      are your applications developed in (if relevant), e.g. 
      Fortran, C, C++, Java, Perl, or Python.

Databank: open for acadmemic community
Software: PattInProt, closed source, C/C++
  
  5.5 What information sources do you require, e.g. certificate
      authorities, or registries.

Certificate authorities to satisfy data privacy or access for some of them.
  
  5.6 Do you use any resources other than traditional compute 
      or data resources, e.g. telescopes, microscopes, medical 
      imaging instruments.

 No
  
  5.7 How often is your application used on the grid or grid-like 
      systems?

      [ ] Exclusively
      [ ] Often (say 50-50)
      [X] Ocassionally on the grid, but mostly stand-alone
      [ ] Not at all yet, but the plan is to.


6. Environment:
---------------

  Provide a description of the environment your scenario runs in,
  for example the languages used, the tool-sets used, and the user
  environments (e.g. shell, scripting language, or portal).

Available through web portal to the bioinformatic community.

  
7. How the resources are selected:
----------------------------------

  7.1 Which resources are selected by users, which are inherent 
      in the application, and which are chosen by system
      administrators, or by other means?  E.g. who is specifying 
      the architecture and memory to run the remote tasks?  

  ...
  
  
  7.2 How are the resources selected? E.g. by OS, by CPU power,
      by memory, don't care, by cost, frequency of availability
      of information, size of datasets?

  ...
  
  
  7.3 Are the resource requirements dynamic or static?

  ...
  

8. Security Considerations:
---------------------------

  8.1 What things are sensitive in this scenario: executable code, 
      data, computer hardware?  I.e. at what level are security 
      measures used to determine access, if any?

Data could be sensible if they are from patient or pharmaceutical/agronomic researches.
  
  8.2 Do you have any existing security framework, e.g. Kerberos 5,
      Unicore, GSI, SSH, smartcards?
  
 no 
  
  8.3 What are your security needs: authentication, authorisation,
      message protection, data protection, anonymisation, audit 
      trail, or others?

authentication, authz, data protection, monitoring and log (transfer and access)

  
  8.4 What are the most important issues which would simplify your
      security solution?  Simple API, simple deployment, integration
      with commodity technologies.  

  ...
  

9. Scalability:
---------------

  What are the things which are important to scalability and to what
  scale - compute resources, data, networks ?

data

  
10. Performance Considerations:
-------------------------------

  Explain any relevant performance considerations of the use case.

job could be short: need a low payload time.
these sort jobs may need to be ran on databank of several gigabytes 

  
11. Grid Technologies currently used:
-------------------------------------

  If you are currently using or developing this scenario, which grid
  technologies are you using or considering?

DIET middleware
  
  
12. What Would You Like an API to Look Like?
--------------------------------------------  
  
  Suggest some functions and their prototypes which you would like
  in an API which would support your scenario.

open local access IO (on worker node) to remote file/databank (on storage element)
write subset of the databank, selected by the pattern scan
  
13. References:
---------------

  List references for further reading.

  ...


--------------------------------------------------------------------