Condor CacheD: Caching for HTC - Part 1
Every HTC job follows the same basic pattern:
- Stage input files to the worker node.
- Start processing...
- Stage output files back to the submit host.
Let's use a real-world example. The nr BLAST database (non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr) (source) is currently ~15 GB. When running BLAST, each query must scan the entire nr database, so the database is required on every worker node that will run a query.
It is easy to say: well, a 1 Gbps connection can transfer 15 GB in ~120 seconds. Two minutes doesn't seem unreasonable for staging in data, especially if the job runs for 8 hours. But you usually have many queries, so you will want to spread them across many jobs. So let's say you submit 10 jobs, each needing the 15 GB. If they all transfer at the same time, that means 20 minutes of stage-in time before any processing begins. And that is only 10 jobs; what if you are submitting 1,000 jobs, or 10,000? At 1,000 jobs, it takes 33 hours just to transfer the input files. Suddenly a 2-minute database transfer becomes hours, and your submit machine is doing nothing but transferring input files! Also, a little math (or a simple simulation) shows that if the jobs start sequentially, you are limited in the number of jobs that can run simultaneously.
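A quick back-of-the-envelope script makes the scaling obvious. This is just a sketch of the arithmetic above, assuming a 15 GB database, a single shared 1 Gbps uplink from the submit host, and perfect link utilization:

```python
# Back-of-the-envelope stage-in time for N jobs that each pull the full
# nr database over one shared submit-host uplink.
DB_SIZE_GB = 15          # size of the nr database, in gigabytes
LINK_GBPS = 1            # submit-host uplink, in gigabits per second

def stage_in_hours(num_jobs, db_size_gb=DB_SIZE_GB, link_gbps=LINK_GBPS):
    """Hours to push the database to num_jobs workers over one shared link."""
    total_gigabits = num_jobs * db_size_gb * 8
    return total_gigabits / link_gbps / 3600.0

for n in (1, 10, 1_000, 10_000):
    print(f"{n:>6} jobs: {stage_in_hours(n):7.1f} hours of stage-in")
```

Running it reproduces the numbers above: about 2 minutes for a single job, 20 minutes for 10 jobs, and roughly 33 hours for 1,000 jobs.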
These increasing data sizes, combined with static maximum job lengths, have led to compromises and innovations on the part of users and sites.
Innovations
- Bandwidth increases from the storage nodes.
- Caching near the execution host.
Lots O' Bandwidth
OSG Connect's Stash attempts to ease the input file stage-in problem by providing a storage service with lots of bandwidth (10 Gbps) and support for site caching through HTTP. Higher bandwidth is certainly the brute-force method of decreasing transfer times, but 10 Gbps only knocks a factor of 10 off all of the times above, and only if you can actually use all 10 Gbps.

Numerous sites and users have tried to use high-bandwidth storage services to solve the stage-in problem. Nebraska (and many other sites) even have their storage services connected to 100 Gbps network links. But they tend to be limited not by the bandwidth available to the storage device, but by the bandwidth bottleneck at the boundary of each cluster to the outside world, usually a NAT.
Caching
Additionally, with site caching the transfer from the submit host only has to reach each remote caching host, which then serves its own cluster. In the last week, 95% of the OSG VO's CPU hours were provided by ~25 unique sites (source, calculations). This is analogous to increasing the bandwidth of the submit host by 25 times (assuming standard 1 Gbps connections), which is a very cheap way to increase the bandwidth available for transferring input files. But the VO's usage is not split evenly among those 25 sites: 6 sites account for ~50% of the OSG VO usage. On those 6 clusters, transfer bandwidth is limited to what those 6 proxy servers can push to their cluster's nodes. Therefore, for 50% of your processing slots, you are only increasing the transfer speed by a factor of 6.
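To put rough numbers on that (the 25-site and 6-site figures come from the usage breakdown above; the 1 Gbps per-proxy link speed is my assumption), we can extend the earlier back-of-the-envelope script:

```python
# Rough effect of site-level caching on the 1,000-job example above.
# Assumes each site's caching proxy, like the submit host, has a 1 Gbps link.
DB_GB = 15
JOBS = 1_000

def hours(jobs, gbps):
    return jobs * DB_GB * 8 / gbps / 3600.0

print(f"submit host alone (1 Gbps):          {hours(JOBS, 1):5.1f} h")
print(f"spread across 25 site proxies:       {hours(JOBS, 25):5.1f} h")
print(f"the 50% of jobs at the 6 busy sites: {hours(JOBS // 2, 6):5.1f} h")
```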
A New Hope
Bandwidth
BitTorrent was chosen because it has many characteristics that make it ideal for transferring large files. To bypass the network bottlenecks at remote clusters, BitTorrent lets clients transfer data among themselves while also downloading from the original source. This allows every worker node to become a cache for the rest of the cluster's nodes. As we will see in the next post, BitTorrent works well because the vast majority of the traffic stays between nodes inside the cluster rather than going to nodes outside it.
Caching
The CacheD instead uses local caches on each worker node. This local cache allows very fast transfers when the jobs begin. Further, the local cache acts as a seeder for the BitTorrent transfers described above.
When a job begins running on a node, the stage-in step requests a copy of the data files from the local cache. If the data files are not already staged on that node, the local cache pulls them with BitTorrent from the submit host (the cache origin) and from every other node in the cluster. Once the local cache has the stage-in data files, it transfers them into the job's sandbox and processing begins. Subsequent jobs that require the same stage-in data files request and immediately receive them, since they are already cached locally.
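Sketched as code, the flow looks roughly like the following. This is illustrative only, with hypothetical helper names (`fetch_via_bittorrent`, the cache and sandbox paths), not the actual CacheD implementation or API:

```python
import os
import shutil

def stage_in(dataset: str, cache_dir: str, sandbox_dir: str, origin: str) -> None:
    """Illustrative stage-in flow for one job on a worker node (not real CacheD code)."""
    cached_path = os.path.join(cache_dir, dataset)

    # 1. Ask the node's local cache for the data files.
    if not os.path.exists(cached_path):
        # 2. Cache miss: pull the files over BitTorrent from the cache origin
        #    (the submit host) and from peer caches on the other worker nodes.
        fetch_via_bittorrent(dataset, origin, dest=cached_path)

    # 3. Cache hit (or download complete): copy the cached files into the
    #    job's sandbox so processing can start.
    shutil.copytree(cached_path, os.path.join(sandbox_dir, dataset))

def fetch_via_bittorrent(dataset: str, origin: str, dest: str) -> None:
    """Placeholder for the peer-to-peer transfer step."""
    raise NotImplementedError("illustrative sketch only")
```

The second and later jobs on the same node skip step 2 entirely, which is where the time savings come from.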
Up Next
UPDATE: Link to Part 2, Architecture of the CacheD.