GCC
Current situation GCC 1.0
Roche 454
[Diagram: current cluster on the UZ network; computing nodes 8C 16 GB (x3); 2 TB; UZ NAS Storage]
Per run:
~1 million reads
~2 GB raw data
New sequencer: 1000x increase
1.1 TB / run (200 Gbp)
~1000 million reads
8 days per run!
Basic analysis of 1 full run
< 1 week on 3 nodes with 48 GB RAM and 8 CPU cores each (and needs 7 TB of disk space)
Sequencing at full capacity = all 24 CPU cores (3 nodes x 8 cores) at full capacity
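The load estimate above can be checked with a few lines of arithmetic. This is a rough sketch only: it treats "< 1 week" as an upper bound of 7 days and assumes runs arrive back to back every 8 days; the variable names are illustrative.

```python
# Figures taken from the slide; ANALYSIS_DAYS_MAX is an upper bound ("< 1 week").
NODES = 3
CORES_PER_NODE = 8
ANALYSIS_DAYS_MAX = 7   # assumption: "< 1 week" read as at most 7 days
RUN_DAYS = 8            # one sequencing run takes 8 days

total_cores = NODES * CORES_PER_NODE

# Upper bound on CPU time consumed by the basic analysis of one run.
cpu_hours_per_run_max = total_cores * ANALYSIS_DAYS_MAX * 24

# Fraction of the time the 3 nodes are busy if a new run arrives every 8 days.
utilisation_max = ANALYSIS_DAYS_MAX / RUN_DAYS

print(f"cores: {total_cores}")
print(f"cpu-hours per run (upper bound): {cpu_hours_per_run_max}")
print(f"cluster utilisation (upper bound): {utilisation_max:.3f}")
```

Under these assumptions the basic analysis alone leaves little headroom on the three nodes for meta-analyses or post-analyses.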
Meta-analyses & post-analyses
• Needs are several-fold higher than for basic run analyses
• Integrate multiple runs (e.g., patients versus controls, families, etc.)
• Integrate with previous data
• Integrate with publicly available data
– RNA-Seq + gene expression data from GEO
• Integrate with other data sources
– DNA-Seq + RNA-Seq + Methyl-Seq
• Integrate with genome browsers
– Galaxy, UCSC, Ensembl
• Make analysis pipelines available to users as a service
• Custom analyses as a service or in collaboration
Ideal computing setup
[Diagram: three levels, UZ - gbiomed - VSC. Recoverable labels:]
• UZ (UZ-patient data): nodes 8C 16 GB (x3); 2 TB; UZ NAS Storage
– Additional RAM (32 GB or 48 GB per node)
– Additional storage?
– Software: CASAVA, CLCBio, Roche
• gbiomed: flexible computing
– Open-MPI, SGE; distributed computing
– Servers, storage, switches
• VSC: High Performance Computing (HPC)
– Torque/PBS; distributed computing
– ~100 cpu, 6 GB RAM/core; NetApp + DDN Storage
– Computing (0.5 EUR / cpu-hour)
– Storage (750-1500 EUR / TB)
• Labels whose position in the diagram is unclear:
– Storage: DAS or NAS? Dell, NetApp?
– Software: academic tools, CLCBio?
– Software: CASAVA (parall. by user); academic: bowtie, bwa, …; CLCBio?
To be discussed
• How can the HiSeq 2000 choose between the UZ and KU Leuven networks when sending run data to storage?
– 1Gb
– 350 GB / run compressed
• Where to store data after secondary analysis?
– Cheap storage
– External HDD
– Tape
• Who does what?
– Jeroen / Jan for UZ?
– Stein / Gert / Raf for Biomed?
• Can we already buy additional RAM for UZ cluster?
• Can we connect gbiomed servers directly to UZ storage?
– What are the requirements?
• Estimate load over 3 levels
– # users
– # runs
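For the network question above, a back-of-the-envelope transfer time may help the discussion. This sketch rests on several assumptions that the slides do not state: "350 Gb" is read as 350 gigabytes, "1Gb" as a 1 Gbit/s link, units are decimal, and the link runs at full line rate unless an `efficiency` factor is given. The function name and parameters are hypothetical.

```python
def transfer_hours(size_gb: float, link_gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_gb gigabytes over a link of link_gbit_per_s gigabits/s.

    efficiency is the assumed fraction of line rate actually achieved (1.0 = ideal).
    """
    gigabits = size_gb * 8
    seconds = gigabits / (link_gbit_per_s * efficiency)
    return seconds / 3600


# Data sizes from the slides: 350 GB compressed, 1.1 TB (1100 GB) raw per run.
for label, size_gb in [("compressed run", 350), ("raw run", 1100)]:
    for link in (1, 10):
        hours = transfer_hours(size_gb, link)
        print(f"{label}: {size_gb} GB over {link} Gbit/s = {hours:.2f} h")
```

Real throughput on a shared network is usually well below line rate, so passing something like `efficiency=0.5` gives a more cautious estimate.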
What’s next
• Decide on gbiomed hardware
• List of things needed at UZ
• Start testing CASAVA on UZ system and on VSC
• Test CLCBio on UZ system for Illumina data
Storage
• How much do we need?
– 1.1 TB per run
– 7 TB space during analysis
• BUT: keep only runs that are being analyzed
– ~3 at a time?
– 10-15 TB
• After analysis:
– Data delivered to client
– Data compressed and moved to offline storage
   • Cheap HDD array?
   • Tape?
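The sizing above can be sketched as a small calculation. Two assumptions are made here that the slide does not spell out: the 7 TB of analysis space comes on top of the 1.1 TB of raw data, and not every kept run is analysed at the same time. The function name is illustrative.

```python
RAW_TB_PER_RUN = 1.1        # from the slide
ANALYSIS_TB_PER_RUN = 7.0   # from the slide: space needed during analysis


def working_storage_tb(runs_kept: int, concurrent_analyses: int) -> float:
    """Online storage: raw data for every kept run plus scratch space for active analyses."""
    return runs_kept * RAW_TB_PER_RUN + concurrent_analyses * ANALYSIS_TB_PER_RUN


for concurrent in (1, 2, 3):
    total = working_storage_tb(3, concurrent)
    print(f"3 runs kept, {concurrent} analysed at once: {total:.1f} TB")
```

With three runs kept and one analysed at a time, this gives a figure inside the 10-15 TB range quoted on the slide; analysing all three at once would need considerably more.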
Proposal for GCC2.0
(ideas under construction)
[Diagram: proposed setup. Recoverable labels:]
• Sequencers: Illumina HiSeq 2000, Roche 454
• UZ computing nodes (existing): 8C 16 GB (x3); 2 TB; UZ NetApp Storage; patient-related data
• gbiomed computing nodes: 32C 256 GB, 8C 48 GB, 8C 48 GB; fast interconnect, high I/O bandwidth; non-patient-related data (e.g., model organisms, cell lines, …)
• VSC (existing), pay per cpu-hour!: ICTS/VSC NetApp + DDN Storage; non-patient-related data
GCC2.0 features
• Divide and conquer: solution at 3 levels
– UZ: for UZ-patient-related data (protected)
– Gbiomed: ad hoc, flexible computing for research (non-UZ-patient related data)
– VSC: high-performance computing (non UZ-patient related data)
• Storage (too expensive to duplicate)
– VSC storage with Gbiomed access (create 10Gb fast interconnect from ICTS to gbiomed)
– UZ storage with Gbiomed access (create ‘open-access’ policy for non-patient related data)
– Gbiomed ad hoc storage (HDDs in the local servers)
• Computing
– VSC for HPC
– Servers in UZ (patient-related data)
– Servers in gbiomed (for research-related ad hoc analyses, web services, development, software testing, …)
GCC2.0 Cost, Timing & Effort estimates
• Budget from Stichting tegen Kanker
– 200-250 K left for computing
• Solution for the first 3 years should be possible (excluding bioinformatics manpower)
• Budget spread between VSC-gbiomed-UZ: to be decided internally in genomics core
• VSC x%
– Storage (86,400 EUR for 32 TB; ~80 TB is needed for 25 runs per year)
– Computing time (29,594 EUR for 55,000 cpu-hours)
• Gbiomed local servers and local storage y%
• UZ additional storage z%
• Software licenses (CLCBio) (price quote requested)
– More investments needed over time (e.g., new hardware is only for 3 years)
• Timing: 31 August 2010?
• Estimated effort (to be discussed)
– VSC:
• Create 10Gb ethernet link to gbiomed (cost?)
• … mandays for startup and testing (network connections, storage, software)
• Maintenance included in price
– Genomics Core bioinformaticians (VRC, CME)
• … mandays for startup and testing
– Gbiomed IT:
• … mandays for setting up local servers & integration with ICTS storage
• … FTE for maintenance of local servers
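To put the VSC quotes above on a common footing, the unit prices can be derived as follows. This is a sketch, taking the storage quote as 86 400 EUR for 32 TB and the computing quote as 29 594 EUR for 55 000 cpu-hours, and assuming (without confirmation from the slides) that storage pricing scales linearly up to the ~80 TB stated as needed.

```python
STORAGE_QUOTE_EUR = 86_400
STORAGE_QUOTE_TB = 32
COMPUTE_QUOTE_EUR = 29_594
COMPUTE_QUOTE_CPU_HOURS = 55_000
STORAGE_NEEDED_TB = 80   # "~80 TB is needed for 25 runs per year"

# Unit prices implied by the two quotes.
eur_per_tb = STORAGE_QUOTE_EUR / STORAGE_QUOTE_TB
eur_per_cpu_hour = COMPUTE_QUOTE_EUR / COMPUTE_QUOTE_CPU_HOURS

# Assumption: the per-TB price stays the same at 80 TB (no volume discount).
storage_cost_80tb = eur_per_tb * STORAGE_NEEDED_TB

print(f"storage: {eur_per_tb:.0f} EUR / TB")
print(f"computing: {eur_per_cpu_hour:.3f} EUR / cpu-hour")
print(f"80 TB at the quoted rate: {storage_cost_80tb:.0f} EUR")
```

Comparing the last figure with the 200-250 K available for computing shows why the split between VSC, gbiomed and UZ (x%, y%, z%) has to be settled before committing to storage.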
Hurdles to overcome
• 1) 10Gb ethernet link between VSC and gbiomed
– For non-UZ-patient related data
– To transfer Illumina data to VSC
– To run ad hoc analyses on local gbiomed servers, connected to the VSC storage, without the need to duplicate the storage solution and the data (too costly)
– An absolute requirement
– Currently not available
– A necessary investment for future VSC-BMW interactions
• 2) UZ-Patient-related data cannot be transferred to VSC storage, nor computed at VSC
– Can VSC provide a secure transfer, storage and computing environment for UZ-data? If not, data analysis and storage for UZ-data remains in UZ.
• 3) Link between UZ storage and gbiomed for non-patient related data
– Gbiomed-UZ
– 10Gb link is possible in principle. Perhaps during transition period (while waiting for 10Gb link VSC-gbiomed)?
Alternatives
• All-in-one solution
• PSSCLabs
Bioinformatics analyses
• Estimated effort from Genomics Core bioinformatician for basic analysis of 1 run: ~2-3 mandays
– Included in service fee?
– This basic analysis will not be sufficient for most projects
• Fee-based bioinformatics and data analysis service for more advanced analyses?
• Many users have a bioinformatician in the group or already collaborate with bioinformaticians
• Contribution in the service fee for GCC hardware & maintenance cost, and software licenses
• Estimated effort:
– Either only basic analysis services are offered: ½ FTE bioinformatics postdoc
– Or basic plus advanced bioinformatics services are offered: 1 FTE