--------------------------------------------------------------------

SAGA Use Case Template:
=======================

  Name of use case: GriPPS - Grid Protein Pattern Scanning

  Contact (name and address): Christophe.Blanchet@ibcp.fr

  Authors (if different from contact)
  ................................


1. General Information:
-----------------------

  This section consists of check-boxes to provide some context in
  which to evaluate the use case.

  1.1 Which best describes your organisation:

      Industry [ ]
      Academic [X]
      Other    [ ]   Please specify: ...................................

  1.2 Application area:

      Astronomy         [ ]
      Particle physics  [ ]
      Bio-informatics   [X]
      Environmental Sc. [ ]
      Image analysis    [ ]
      Other             [ ]   Please specify: ...................................

  1.3 Which of the following apply to or best describe this use
      case. Multiple selections are possible, please prioritize
      with numbers from 1 (low) to 5 (high):

      Database                   [5]
      Remote steering            [1]
      Visualization              [0]
      Security                   [3]
      Resource discovery         [1]
      Resource scheduling        [4]
      Workflow                   [3]
      Data movement              [5]
      High Throughput Computing  [3]
      High Performance Computing [4]
      Other                      [ ]   Please specify: ...................................

  1.4 Are you an:

      Application user            [ ]
      Application developer       [X]
      System administrator        [ ]
      Service developer           [ ]
      Computer science researcher [ ]
      Other                       [ ]   Please specify: ...................................


2. Introduction:
----------------

  2.1 Provide a paragraph introduction to your use case. Background
      to the project is another alternative. (E.g. 100 words).

      Genomic data acquisition programs, such as full-genome
      sequencing projects, are producing ever greater amounts of
      data. The analysis of these raw biological data requires very
      large computing resources. Functional sites and signatures of
      proteins are very useful for analyzing these data or for
      correlating different kinds of existing biological data.
      These methods are applied, for example, to identify and
      characterize the potential functions of newly sequenced
      proteins, to cluster the sequences contained in international
      databanks into protein families, and so on. The Grid Protein
      Pattern Scanning (GriPPS) project (funded by the French
      program ACI GRID 2002) aims to develop and adapt these
      bioinformatic algorithms so that they can exploit the
      underlying grid infrastructure. Models of those algorithms
      will be devised to predict their behavior on a grid platform,
      and proposals will be written to adapt other bioinformatic
      algorithms to the grid. Within this context, we propose to
      study such algorithms to identify the constraints related to
      the biological applications and to determine their
      granularity and the possible parallelization schemes that can
      be applied to them.

  2.2 Is there a URL with more information about the project ?

      http://gripps.ibcp.fr


3. Use Case to Motivate Functionality Within a Simple API:
----------------------------------------------------------

  Provide a scenario description to explain customers' needs.
  E.g. "move a file from A to B," "start a job." Please include
  figures if possible. If your use case requires multiple components
  of functionality, please provide separate descriptions for each
  component, bullet points of 50 words per functionality are
  acceptable.

      - Get fast access to the sequence databank repository from
        the worker node
      - Start the scan for one protein pattern
      - Download the result file: the patterns identified
      - Download the sub-base of sequences
      - Repeat the job with another protein pattern on the obtained
        subset of sequences


4. Customers:
-------------

  Describe customers of this use case and their needs. In
  particular, where and how the use case occurs "in nature" and for
  whom it occurs. E.g. max 40 words

      Used by biologists to identify proteins with a particular
      functionality. They need to be able to run their own protein
      patterns against world-wide reference databases.


5. Involved Resources:
----------------------

  5.1 List all the resources needed: e.g. what hardware, data,
      software might be involved.

      Storage:
        - sequence database (SWISSPROT, TrEMBL, ..., user ones)
        - pattern database (PROSITE, Pfam, ..., user ones)
      Software: PattInProt

  5.2 Are these resources geographically distributed?

      Yes, each database is maintained by a different laboratory.

  5.3 How many resources are involved in the use case? E.g. how
      many remote tasks are executing at the same time?

      There could be several, depending on the pattern bank and the
      protein sequence bank to analyze.

  5.4 Describe your codes and tools: what sort of license is
      available, e.g. open or closed source license; what sort of
      third party tools and libraries do you use, and what is their
      availability; do you regularly work from source code, or use
      pre-compiled applications; what languages are your
      applications developed in (if relevant), e.g. Fortran, C,
      C++, Java, Perl, or Python.

      Databank: open to the academic community
      Software: PattInProt, closed source, C/C++

  5.5 What information sources do you require, e.g. certificate
      authorities, or registries.

      Certificate authorities, to satisfy data privacy or access
      requirements for some of the resources.

  5.6 Do you use any resources other than traditional compute or
      data resources, e.g. telescopes, microscopes, medical imaging
      instruments.

      No

  5.7 How often is your application used on the grid or grid-like
      systems?

      [ ] Exclusively
      [ ] Often (say 50-50)
      [X] Occasionally on the grid, but mostly stand-alone
      [ ] Not at all yet, but the plan is to.


6. Environment:
---------------

  Provide a description of the environment your scenario runs in,
  for example the languages used, the tool-sets used, and the user
  environments (e.g. shell, scripting language, or portal).

      Available to the bioinformatics community through a web
      portal.


7. How the resources are selected:
----------------------------------

  7.1 Which resources are selected by users, which are inherent in
      the application, and which are chosen by system
      administrators, or by other means? E.g. who is specifying the
      architecture and memory to run the remote tasks?

      ...

  7.2 How are the resources selected? E.g. by OS, by CPU power, by
      memory, don't care, by cost, frequency of availability of
      information, size of datasets?

      ...

  7.3 Are the resource requirements dynamic or static?

      ...


8. Security Considerations:
---------------------------

  8.1 What things are sensitive in this scenario: executable code,
      data, computer hardware? I.e. at what level are security
      measures used to determine access, if any?

      Data could be sensitive if they come from patients or from
      pharmaceutical/agronomic research.

  8.2 Do you have any existing security framework, e.g. Kerberos 5,
      Unicore, GSI, SSH, smartcards?

      No

  8.3 What are your security needs: authentication, authorisation,
      message protection, data protection, anonymisation, audit
      trail, or others?

      Authentication, authorisation, data protection, monitoring
      and logging (of transfers and accesses)

  8.4 What are the most important issues which would simplify your
      security solution? Simple API, simple deployment, integration
      with commodity technologies.

      ...


9. Scalability:
---------------

  What are the things which are important to scalability and to
  what scale - compute resources, data, networks ?

      Data


10. Performance Considerations:
-------------------------------

  Explain any relevant performance considerations of the use case.

      Jobs can be short, so they need a low overhead time. These
      short jobs may need to be run on databanks of several
      gigabytes.


11. Grid Technologies currently used:
-------------------------------------

  If you are currently using or developing this scenario, which
  grid technologies are you using or considering?

      DIET middleware


12. What Would You Like an API to Look Like?
--------------------------------------------

  Suggest some functions and their prototypes which you would like
  in an API which would support your scenario.

      - open local access I/O (on the worker node) to a remote
        file/databank (on a storage element)
      - write the subset of the databank selected by the pattern
        scan


13. References:
---------------

  List references for further reading.

      ...

--------------------------------------------------------------------
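
The scan-and-refine loop of section 3, together with the "write
subset of the databank" wish of section 12, can be sketched as
follows. This is a minimal illustration under stated assumptions, not
the GriPPS implementation: the names scan, refine and TOY_DATABANK are
hypothetical, the databank is a small in-memory dictionary rather than
a remote file on a storage element, and a Python regular expression
stands in for the PattInProt scanner, whose interface is not described
here.

```python
import re
from typing import Dict, List, Tuple

# One hit: (sequence id, 0-based start offset, matched text).
Hit = Tuple[str, int, str]


def scan(databank: Dict[str, str], pattern: str) -> Tuple[List[Hit], Dict[str, str]]:
    """Scan every sequence of the databank for one pattern.

    Returns the hits (the "result file") and the sub-base, i.e. the
    sequences with at least one hit. Assumption: a regular expression
    replaces the real pattern scanner.
    """
    regex = re.compile(pattern)
    hits: List[Hit] = []
    sub_base: Dict[str, str] = {}
    for seq_id, sequence in databank.items():
        matches = [(seq_id, m.start(), m.group()) for m in regex.finditer(sequence)]
        if matches:
            hits.extend(matches)
            sub_base[seq_id] = sequence
    return hits, sub_base


def refine(databank: Dict[str, str], patterns: List[str]) -> Tuple[List[List[Hit]], Dict[str, str]]:
    """Apply the patterns in order, each one on the sub-base left by the previous scan."""
    current = databank
    all_hits: List[List[Hit]] = []
    for pattern in patterns:
        hits, current = scan(current, pattern)
        all_hits.append(hits)
    return all_hits, current


# Hypothetical toy data, invented for this example only.
TOY_DATABANK = {
    "seq1": "MKTAYIAKQRNGSQ",
    "seq2": "MNGTCCPLLA",
    "seq3": "MAAAAKK",
}

if __name__ == "__main__":
    per_pattern_hits, final_sub_base = refine(TOY_DATABANK, ["N[^P][ST][^P]", "CC"])
    for hits in per_pattern_hits:
        print(hits)
    print(sorted(final_sub_base))
```

Each call to scan returns both the hits and the sub-base, so the next
pattern only runs on the sequences that already matched. On a grid,
the sub-base would be written back to storage between two jobs instead
of being kept in memory.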