Cost-Aware Live Migration of Services in the Cloud. David Breitgand -- IBM Haifa Research Lab. Gilad Kutiel, Danny Raz -- Technion, Israel Institute of Technology. The research leading to these results has received funding from the European Union's Seventh Framework Programme under grant agreements no. 257448 and 215605.
Agenda: Introduction; The Cost of Live Migration; Fixed Bandwidth Migration; Variable Bandwidth Migration; Related Work; The CALM (Cost Aware Live Migration) Algorithm; Evaluation Study; Conclusions.
Introduction: We consider the pre-copy live migration model (the results hold for the post-copy approach as well). We consider in-band migration. We focus on network bandwidth as the primary bottleneck (the presented framework is general). We provide an analytical study of our approach. We validate our proposal using trace-driven simulations.
The Cost of Live Migration 1/2: Clearly, no service is available during downtime. If migration is done in-band, some of the bandwidth used to serve clients is now used for the migration. We define the cost as the probability of violating the SLA at a given time. It is a function of the bandwidth available to the service, and we denote it by F(B_s).
The Cost of Live Migration 2/2: [Figure: Quality of Service Degradation. Apache Nutch search engine; workload using a Poisson distribution; SLA is 1 second. X-axis: available bandwidth in KB/sec (1100 down to 820). Y-axis: cost per time unit, the percentage of requests not conforming to the SLA (0 to 0.9).]
Fixed Bandwidth Migration 1/5: We start with a simple case: the bandwidth for the migration is predefined and fixed throughout the migration process. Recall that memory is updated during the migration process, so how much bandwidth should we use? More bandwidth means a faster migration but more service degradation. Less bandwidth means better service while migrating, but we might need to transfer pages again and again. The optimal bandwidth depends on the cost function and other factors.
Fixed Bandwidth Migration 2/5: [Figure: Simulated Cost of a Fixed Bandwidth Migration. Annotations: "Migrating 10,000 pages", "p = 0.001", "B C = 100", "Cost function". X-axis: bandwidth used for the migration (1 to 100). Y-axis: total cost of the migration (0 to 160).]
Fixed Bandwidth Migration 3/5: Formulation: a virtual machine with M pages. The total available bandwidth (service + migration) is B. B_m is the bandwidth used for the migration. B_s is the bandwidth available for the service. p is the probability that a page is updated during a single time unit; we assume it is uniform and independent (q = 1 - p). A clean page is one that was copied and hasn't been updated since. N pages are transferred during the pre-copy phase (and the rest during the copy phase).
Fixed Bandwidth Migration 4/5 The expected cost of the migration process is given by: Where: And: Optimal bandwidth can be found by minimizing the cost function (analytically / numerically).
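A minimal numerical sketch of this minimization, offered as an illustration rather than the authors' formula. It assumes a hypothetical quadratic cost function F, a mean-field model in which B_m pages are copied per time unit and a fraction p of the clean pages is dirtied again, a pre-copy target of N = 9,000 pages (not given in the slides), and a copy phase that moves the remaining pages at the full bandwidth B while the service is down. The names sla_violation_prob and expected_fixed_bw_cost are illustrative.

```python
import math


def sla_violation_prob(service_bw, total_bw):
    """Hypothetical cost function F(B_s): 0 at full bandwidth, 1 at zero bandwidth."""
    return (1.0 - service_bw / total_bw) ** 2


def expected_fixed_bw_cost(M, B, B_m, p, N, max_steps=50_000):
    """Mean-field cost of migrating M pages with a fixed migration bandwidth B_m.

    Assumed model: per time unit, up to B_m dirty pages are copied, then each
    clean page stays clean with probability q = 1 - p.
    """
    q = 1.0 - p
    clean = 0.0
    cost = 0.0
    cost_per_unit = sla_violation_prob(B - B_m, B)
    steps = 0
    # Pre-copy phase: run until N pages are expected to be clean.
    while clean < N:
        if steps >= max_steps:
            return math.inf  # B_m is too low to ever reach N clean pages
        copied = min(B_m, M - clean)
        clean = (clean + copied) * q
        cost += cost_per_unit
        steps += 1
    # Copy phase: the service is down and the rest is sent at full bandwidth B.
    downtime = (M - clean) / B
    return cost + downtime * sla_violation_prob(0.0, B)


# Sweep the migration bandwidth and pick the cheapest setting.
costs = {b: expected_fixed_bw_cost(10_000, 100, b, 0.001, 9_000) for b in range(1, 100)}
best_bw = min(costs, key=costs.get)
```

Under these assumptions, very low bandwidths never reach the target (the dirtying rate outpaces the copying), while very high bandwidths pay a large per-time-unit cost, so the minimum lies in between.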
Fixed Bandwidth Migration 5/5: [Figure: Simulated Cost of a Fixed Bandwidth Migration. Series: Simulated Cost, Calculated Cost. X-axis: bandwidth used for the migration (1 to 100). Y-axis: total cost of the migration (0 to 200).]
Trace Driven Simulation: We found an optimal migration when the bandwidth is fixed and the dirtying probability is uniform. What is the dirtying probability for a real-world application? We generated traces of dirtying patterns for several services and used those traces to simulate the migration of real-world services.
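As an illustration of how a per-page dirtying probability could be estimated from such a trace, here is a small sketch. The trace format, a sequence of (time_unit, page_number) write events, and the function name per_page_dirty_probability are assumptions; the slides do not describe the actual trace format.

```python
from collections import defaultdict


def per_page_dirty_probability(trace, num_time_units):
    """Estimate, per page, the fraction of time units in which it was written.

    trace: iterable of (time_unit, page_number) write events (assumed format).
    Pages that never appear in the trace are omitted (probability 0).
    """
    dirty_units = defaultdict(set)
    for time_unit, page in trace:
        # Several writes to a page within one time unit dirty it only once.
        dirty_units[page].add(time_unit)
    return {page: len(units) / num_time_units for page, units in dirty_units.items()}


example_trace = [(0, 1), (0, 1), (1, 1), (2, 7)]
example_probs = per_page_dirty_probability(example_trace, num_time_units=4)
```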
Dirtying Probability of Real Application: [Figure. X-axis: page number (1 to 9799). Y-axis: number of writes during the trace (0 to 8000).]
Dirtying Probability (top 100): [Figure. X-axis: page number (1 to 99). Y-axis: number of writes during the trace (0 to 8000).]
Dirtying Probability (top 500): [Figure. X-axis: page number (1 to 496). Y-axis: number of writes during the trace (0 to 8000).]
Dirtying Probability (Over Time): [Figure. X-axis: time in seconds (1 to 75). Y-axis: number of writes per second (0 to 12000).]
Migration's Cost With Real Dirtying Trace: [Figure. Series: Calculated Cost, Simulated Cost. X-axis: bandwidth used for the migration process (1 to 100). Y-axis: cost, the requests that were not served within the SLA limitation (0 to 9).]
Variable Bandwidth Migration: In reality we are not limited to a predefined, fixed bandwidth; the bandwidth can be adjusted dynamically during the migration process. Intuitively, we should use a low bandwidth at the beginning of the process and increase it as we proceed. Why?
Related Work 1/2: Clark et al. suggested the following algorithm. The first pre-copy round copies all the pages from the source host to the destination host, using an initial bandwidth defined by the system administrator. Each subsequent round copies only the dirty pages, using a bandwidth equal to the dirtying rate measured in the previous round plus a fixed increment defined by the administrator. Pre-copying continues until the calculated bandwidth exceeds a maximum limit defined by the administrator, or until less than 256KB remains to be transferred.
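The per-round bandwidth rule described for Clark et al. can be sketched as follows. This is a simplified illustration: the function name next_round_bandwidth and its parameters are hypothetical, and the units (any consistent bandwidth unit, plus bytes for the remaining data) are assumptions.

```python
from typing import Optional

STOP_THRESHOLD_BYTES = 256 * 1024  # stop pre-copy when less than 256KB remains


def next_round_bandwidth(prev_dirty_rate: float,
                         increment: float,
                         max_bw: float,
                         remaining_bytes: int) -> Optional[float]:
    """Return the bandwidth for the next pre-copy round, or None to stop pre-copying.

    prev_dirty_rate: dirtying rate measured during the previous round.
    increment: fixed addition chosen by the administrator.
    max_bw: maximum bandwidth limit chosen by the administrator.
    """
    if remaining_bytes < STOP_THRESHOLD_BYTES:
        return None  # little enough data left: move to the copy phase
    bandwidth = prev_dirty_rate + increment
    if bandwidth > max_bw:
        return None  # calculated bandwidth exceeds the limit: stop pre-copying
    return bandwidth
```

Note that the rule reacts only to the dirtying rate and the administrator's limits; it does not look at the cost the migration imposes on the service.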
Related Work 2/2: Xen's migration algorithm uses similar principles. Its stop conditions are defined as follows: fewer than 50 pages were dirtied during the last pre-copy iteration; 29 pre-copy iterations have been carried out; more than 3 times the total amount of RAM allocated to the VM has been copied to the destination host. What are the problems with the above algorithms?
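The three stop conditions listed above can be written as a single predicate. This is an illustrative sketch, not Xen's code: the function name xen_should_stop and its parameters are hypothetical, and RAM is counted in pages for simplicity.

```python
def xen_should_stop(dirtied_last_iter: int,
                    iterations_done: int,
                    pages_sent: int,
                    vm_ram_pages: int) -> bool:
    """Return True when any of the three pre-copy stop conditions holds."""
    few_dirty_pages = dirtied_last_iter < 50          # fewer than 50 pages dirtied
    too_many_iterations = iterations_done >= 29       # 29 iterations carried out
    too_much_copied = pages_sent > 3 * vm_ram_pages   # more than 3x the VM's RAM sent
    return few_dirty_pages or too_many_iterations or too_much_copied
```

All three thresholds are fixed constants, independent of the service's cost function or the available bandwidth.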
The CALM (Cost Aware Live Migration) Algorithm 1/2: The bandwidth for the migration can change over time. The algorithm determines the bandwidth to be used at each phase of the migration process and the end of the pre-copy phase. It works in steps, each step moving from i clean pages to i+1 clean pages, and at each step it decides whether to continue or to move to the copy phase.
The CALM (Cost Aware Live Migration) Algorithm 2/2: During a step the bandwidth is fixed, so we can use the previous results. The cost of the i-th step is given by: where: Find the best B_i (the one with minimal cost). Move to the copy phase when:
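The step cost and the stopping rule are not recoverable from the slide text, so the following is only a stand-in sketch of a step-wise, cost-aware scheme in the same spirit. Everything in it is an assumption: the quadratic cost function, the step cost (cost per time unit divided by the net gain rate b - p*i of clean pages), and the stopping rule (stop when one more pre-copied page costs more than the downtime it saves, 1/B per page). The names sla_violation_prob and calm_like_plan are hypothetical.

```python
def sla_violation_prob(service_bw, total_bw):
    """Hypothetical cost function F(B_s): 0 at full bandwidth, 1 at zero bandwidth."""
    return (1.0 - service_bw / total_bw) ** 2


def calm_like_plan(M, B, p):
    """Greedy per-step bandwidth plan under the assumed model.

    Returns (plan, stop_i): plan is a list of (i, B_i) pairs, and stop_i is the
    number of clean pages at which the copy phase starts.
    """
    # Assumed saving from one more clean page: 1/B time units less downtime.
    downtime_cost_per_page = sla_violation_prob(0.0, B) / B
    plan = []
    for i in range(M):
        best_bw, best_cost = None, float("inf")
        for b in range(1, B):
            net_gain = b - p * i  # clean pages gained per time unit
            if net_gain <= 0:
                continue  # this bandwidth cannot outpace the dirtying
            step_cost = sla_violation_prob(B - b, B) / net_gain
            if step_cost < best_cost:
                best_bw, best_cost = b, step_cost
        if best_bw is None or best_cost > downtime_cost_per_page:
            return plan, i  # cheaper to stop and copy the rest during downtime
        plan.append((i, best_bw))
    return plan, M


plan, stop_i = calm_like_plan(M=5_000, B=100, p=0.01)
```

Under these assumptions the chosen bandwidth starts low and grows with the number of clean pages, which matches the intuition stated for variable bandwidth migration.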
Evaluation Study 1/2: We compare the CALM algorithm against the one suggested by Clark et al. by simulating live migrations of real-world services. Our simulations show that the CALM algorithm outperforms Clark's algorithm even when used to migrate real-world services.
Evaluation Study 2/2:
RAM     Bandwidth      Algorithm   Cost    Total Time (sec)   Down Time (sec)
1GB     1 Gbit/sec     Clark       0.04    105.2              0.0424
1GB     1 Gbit/sec     CALM        0.01    85.78              0.0078
1GB     1 Gbit/sec     Fixed       0.01    19.93              0.0035
512MB   512 Mbit/sec   Clark       0.27    67.99              0.0039
512MB   512 Mbit/sec   CALM        0.01    79.03              0.0078
512MB   512 Mbit/sec   Fixed       0.02    142.55             0.0069
512MB   256 Mbit/sec   Clark       48.71   67.22              0.6412
512MB   256 Mbit/sec   CALM        12.05   56.58              11.940
512MB   256 Mbit/sec   Fixed       12.42   31.90              12.232
256MB   256 Mbit/sec   Clark       36.86   40.38              2.1160
256MB   256 Mbit/sec   CALM        4.05    56.37              0.3940
256MB   256 Mbit/sec   Fixed       4.52    35.60              4.2325
Conclusions: We presented a novel model that accounts for the total cost of migration, covering both the pre-copy and copy phases. The optimal migration strategy depends on various factors (available bandwidth, memory size, type of service, etc.). A cost-aware migration algorithm is beneficial. The CALM algorithm also performs well on real-world applications. The fixed algorithm performs well in certain cases. Future work is needed to better adjust the CALM algorithm to real-world page dirtying patterns.
Thank You.
Bandwidth Usage on Different Scenarios: [Figure. Series: BW Usage and Clean Pages for XEN 512, CALM 512, XEN 256, and CALM 256. X-axis: time in milliseconds (1 to 7211). Y-axis: bandwidth usage (MB/sec) / memory copied (MB) (0 to 600).]
Fixed Bandwidth Migration 4/7: We would like to calculate the expected number of clean pages (N_2) after t time units, given that at time 0 the number of clean pages was N_1. This gives us the following:
Fixed Bandwidth Migration 5/7: Using the formula above, we can calculate the expected time (T) it takes until there are N_2 clean pages, given that at time 0 there were N_1 clean pages. We get:
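The slide formulas are not recoverable from the text, so the sketch below derives both quantities under an assumed discrete-time model, which may differ from the authors' derivation. In each time unit B_m pages are copied and then every clean page stays clean with probability q = 1 - p, giving the recurrence N(t+1) = q * (N(t) + B_m). The cap imposed by the total number of pages M is ignored, which is reasonable only while at least B_m dirty pages remain. The function names are hypothetical.

```python
import math


def expected_clean_pages(n1, t, b_m, p):
    """Expected clean pages after t time units, starting from n1 clean pages.

    Closed form of N(t+1) = q * (N(t) + b_m): the count approaches the fixed
    point q * b_m / p geometrically with ratio q.
    """
    q = 1.0 - p
    fixed_point = q * b_m / p
    return fixed_point + (n1 - fixed_point) * q ** t


def expected_time_to_reach(n1, n2, b_m, p):
    """Time units needed to go from n1 to n2 expected clean pages."""
    q = 1.0 - p
    fixed_point = q * b_m / p
    if not (n1 <= n2 < fixed_point):
        raise ValueError("n2 must satisfy n1 <= n2 < q * b_m / p to be reachable")
    # Solve fixed_point + (n1 - fixed_point) * q**t = n2 for t.
    return math.log((fixed_point - n2) / (fixed_point - n1)) / math.log(q)
```

In this model a target N_2 at or above q * B_m / p is never reached, which is one way to see why too low a migration bandwidth makes the pre-copy phase run indefinitely.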
Fixed Bandwidth Migration 6/7 Finally, the cost of the pre-copy phase is given by: And the total cost by: Optimal bandwidth can be found by minimizing the cost function (analytically / numerically).