Falkon aims to enable the rapid and efficient execution of many tasks on large compute clusters. It integrates (1) multi-level scheduling, which separates resource acquisition from task dispatch, and (2) a streamlined dispatcher. This combination delivers performance not provided by any other system: microbenchmarks show that Falkon's throughput (hundreds to thousands of tasks/sec) and scalability (to 54K executors and 2M queued tasks) are several orders of magnitude better than those of other systems used in production Grids.
We have also extended Falkon to include data management functionality. Scientific and data-intensive applications often require exploratory analysis of large datasets, typically carried out on large-scale distributed resources where data locality is crucial to achieving high system throughput and performance. We propose a “data diffusion” approach that acquires resources for data analysis dynamically, schedules computations as close to the data as possible, and replicates data in response to workloads. As demand increases, more resources are acquired and “cached” to allow faster response to subsequent requests; resources are released when demand drops. Depending on the application workloads and the performance characteristics of the underlying infrastructure, this approach can provide the benefits of dedicated hardware without the associated high costs.
The data diffusion concept is reminiscent of cooperative Web caching and peer-to-peer storage systems. Other data-aware scheduling approaches assume static or dedicated resources, which can be expensive and inefficient if load varies significantly. The challenge in our approach is that storage resources must be co-allocated with computation resources to enable the efficient analysis of possibly terabytes of data, without prior knowledge of the characteristics of application workloads.
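
The acquire-on-demand, release-when-idle behavior described above can be illustrated with a small sizing policy. This is only a sketch under stated assumptions, not Falkon's actual provisioner: the names `ProvisionerPolicy`, `target_executors`, and `resize`, and the rule of sizing the pool from the queue length, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProvisionerPolicy:
    """Hypothetical policy knobs; none of these names come from Falkon."""
    tasks_per_executor: int = 4   # assumed target backlog per executor
    min_executors: int = 0        # pool may shrink to this when idle
    max_executors: int = 64       # assumed cap on acquired resources


def target_executors(queued_tasks: int, policy: ProvisionerPolicy) -> int:
    """Return how many executors the pool should hold for the current queue."""
    # Ceiling division: enough executors to keep the per-executor backlog at target.
    wanted = -(-queued_tasks // policy.tasks_per_executor)
    return max(policy.min_executors, min(policy.max_executors, wanted))


def resize(current: int, queued_tasks: int, policy: ProvisionerPolicy) -> int:
    """Positive result: executors to acquire; negative: executors to release."""
    return target_executors(queued_tasks, policy) - current
```

Because the sizing decision depends only on the queue, it can run independently of task dispatch, which is the separation that multi-level scheduling relies on.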
To explore data diffusion, we have extended Falkon so that compute resources can cache data on local disks and tasks are dispatched by a data-aware scheduler. The integration of Falkon with the Swift parallel programming system gives us access to a large number of applications from astronomy, astrophysics, medicine, and other domains, with varying datasets, workloads, and analysis codes. Large-scale astronomy and medical applications executed under Falkon by Swift achieve up to a 90% reduction in end-to-end run time relative to versions that execute tasks via separate scheduler submissions. Data diffusion can decrease application execution times further, by several factors, and improve overall application scalability.
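
As a rough illustration of data-aware dispatch with executor-side caching, the sketch below sends each task to the idle executor that already holds the most of its input files, then records the newly fetched files in that executor's cache. It is a minimal sketch, not Falkon's scheduler: the `Executor` class, the functions `pick_executor` and `dispatch`, and the "most cached inputs wins" rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Hypothetical executor record; the cache models files on local disk."""
    name: str
    cache: set = field(default_factory=set)
    busy: bool = False


def pick_executor(task_inputs, executors):
    """Choose the idle executor that already caches the most input files."""
    idle = [e for e in executors if not e.busy]
    if not idle:
        return None
    wanted = set(task_inputs)
    # max() returns the first maximal element, so ties go to the earliest idle executor.
    return max(idle, key=lambda e: len(e.cache & wanted))


def dispatch(task_inputs, executors):
    """Assign a task and record that its inputs are now cached on the chosen executor."""
    chosen = pick_executor(task_inputs, executors)
    if chosen is None:
        return None
    # Files not yet cached would have to be fetched from shared storage.
    missing = set(task_inputs) - chosen.cache
    chosen.cache |= missing
    return chosen.name, sorted(missing)
```

With two idle executors caching `{"f1"}` and `{"f1", "f2"}`, a task that reads `f1`, `f2`, and `f3` goes to the second executor and only `f3` needs to be fetched. Later tasks over the same files then find them already cached.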
For more information on the project, please see the main Falkon site at http://dev.globus.org/wiki/Incubator/Falkon.