Automatic Migration

Automatically migrate the virtual machines on a host based on the host's CPU usage or available memory.

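Use cases

  • When a host's CPU usage or memory crosses the configured threshold, automatically migrate some of the virtual machines on that host to a specified set of hosts
  • When a host's CPU usage or memory crosses the configured threshold, automatically migrate some of the virtual machines on that host to the least-loaded host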

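Monitoring metrics

Every host reports the cpu.usage_active and mem.available metrics, described below:

  • cpu.usage_active: overall CPU usage across all cores; the maximum is 100%, which means every core is busy
  • mem.available: amount of available memory, in bytes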
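
Because mem.available is reported in bytes, a memory threshold has to be expressed in bytes as well. The helper below is only a convenience sketch for that conversion; the function names are illustrative and not part of climc or the monitor service.

```python
GIB = 1024 ** 3  # number of bytes in one GiB


def gib_to_bytes(gib: float) -> int:
    """Convert a size in GiB to bytes, the unit used by mem.available."""
    return int(gib * GIB)


def bytes_to_gib(n: int) -> float:
    """Convert a mem.available reading in bytes back to GiB for display."""
    return n / GIB


print(gib_to_bytes(8))            # 8589934592
print(bytes_to_gib(17179869184))  # 16.0
```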

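How it works

Create a monitoring alert rule for the relevant host metric. When an alert fires on a host, the monitor service selects virtual machines on that host to migrate, based on the difference between the metric's current value and the threshold.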
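
As a rough illustration of selecting virtual machines from the gap between the current value and the threshold, the sketch below picks guests until their combined contribution covers the excess. The greedy largest-first order, the Guest type, and the sample names are assumptions made for this example; the document does not describe the exact strategy the monitor service uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Guest:
    name: str
    score: float  # assumed: the guest's contribution to the host metric, in percent


def pick_guests(current: float, threshold: float, guests: List[Guest]) -> List[Guest]:
    """Pick guests whose combined score covers (current - threshold).

    Assumption: guests are taken largest-first until the excess is covered.
    """
    excess = current - threshold
    if excess <= 0:
        return []  # the host is not over the threshold, nothing to migrate
    picked: List[Guest] = []
    freed = 0.0
    for guest in sorted(guests, key=lambda g: g.score, reverse=True):
        if freed >= excess:
            break
        picked.append(guest)
        freed += guest.score
    return picked


# Hypothetical host at 70% with a 60% threshold: 10 points must be moved away.
sample = [Guest("vm-a", 6.0), Guest("vm-b", 3.0), Guest("vm-c", 5.0)]
print([g.name for g in pick_guests(70.0, 60.0, sample)])  # ['vm-a', 'vm-c']
```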

Usage

CPU

The following example migrates virtual machines to other hosts when a host's CPU usage exceeds the threshold:

  1. Create a migration rule named test-cpu for cpu.usage_active that monitors host test-66-onecloud02, triggers automatic migration when cpu.usage_active rises above 60%, and checks every 2 minutes
$ climc monitor-migrationalert-create \
    --period 2m \
    --source-host test-66-onecloud02 \
    test-cpu cpu.usage_active.gt 60
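  2. Create a virtual machine for testing. Assuming the host has 40 CPU cores, the virtual machine needs 24 cores (40 * 60%) to reach the threshold and trigger a migration. Then use the stress-ng load-testing tool inside the virtual machine to drive all of its cores to 100%
# Create the virtual machine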
$ climc server-create --disk CentOS-7.6.1810-20190430.qcow2 \
    --net your-net \
    --mem-spec 1g \
    --ncpu 24 \
    --allow-delete \
    --auto-start \
    --prefer-host test-66-onecloud02 \
    cpu-test-vm

# Log in to the virtual machine and stress the CPU with stress-ng
$ climc server-ssh cpu-test-vm
(cpu-test-vm)$ yum install -y stress-ng
(cpu-test-vm)$ stress-ng --cpu 24 --timeout 36000s
  3. After 2 minutes, check the migration records
# Optionally log in to influxdb first to check the host's current metrics
$ kubectl exec -ti -n onecloud $(kubectl get pods -n onecloud | grep default-influxdb | awk '{print $1}') -- influx -host 127.0.0.1 -port 30086 -type influxql -ssl  -precision rfc3339 -unsafeSsl
Connected to https://127.0.0.1:30086 version 1.7.7
InfluxDB shell version: 1.7.7
> use telegraf

# climc host-list --search test-66-onecloud02 gives the host_id 6fc10297-eb20-4a96-86a8-4b65260d6016
# The query below shows that this host's cpu.usage_active over the past 2m is already above the 60% threshold
> select usage_active from cpu where host_id = '6fc10297-eb20-4a96-86a8-4b65260d6016' and time > now() - 2m  GROUP BY "host_id"
name: cpu
tags: host_id=6fc10297-eb20-4a96-86a8-4b65260d6016
time                 usage_active
----                 ------------
2022-06-28T03:55:00Z 62.90831581190119
2022-06-28T03:56:00Z 70.15669899594904

# List the migration alert rules
# The rule id is a7a92f4a-fed1-49bb-880b-59eae5185acc
$ climc monitor-migrationalert-list --scope system
+--------------------------------------+----------+------------------+
|                  id                  |   name   |   metric_type    |
+--------------------------------------+----------+------------------+
| a7a92f4a-fed1-49bb-880b-59eae5185acc | test-cpu | cpu.usage_active |
+--------------------------------------+----------+------------------+

# View the migration events of that rule
$ climc monitor-migrationalert-event a7a92f4a-fed1-49bb-880b-59eae5185acc --scope system
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|   id   |          ops_time           |                obj_id                |    obj_type    | obj_name |     user     | tenant |      action      |                                                                                                                                                                         notes                                                                                                                                                                          |
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 428124 | 2022-06-28T07:40:02.000000Z | a7a92f4a-fed1-49bb-880b-59eae5185acc | migrationalert | test-cpu | monitoradmin | system | find_result_fail | find result to migrate: not found target for guest &balancer.cpuCandidate{guestResource:(*balancer.guestResource)(0xc001963c20), usageActive:99.46995000000001, guestCPUCount:24, hostCPUCount:40}: [host:test-69-onecloud01:current(55.313408) + guest:cpu-test-vm:score(59.681970) >= threshold(60.000000), host:a15:current(62.305391) + guest:cpu-test-vm:score(59.681970) >= threshold(60.000000)] |
| 427991 | 2022-06-28T03:56:57.000000Z | a7a92f4a-fed1-49bb-880b-59eae5185acc | migrationalert | test-cpu | sysadmin     | system | create           | {"id":"a7a92f4a-fed1-49bb-880b-59eae5185acc","name":"test-cpu","res_name":"migrationalert"}                                                                                                                                                                                                                                                            |
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
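# There is a find_result_fail record: the alert fired, but no suitable target host was found for the migration
# The reason is that the other two hosts in the cluster are already loaded: test-69-onecloud01 is currently at 55.313408% and a15 at 62.305391%, so moving cpu-test-vm's 59.681970% CPU load to either of them
# would push that host over the threshold as well, and the search fails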


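# If the load on the cluster nodes is reduced, or a new host is added, the heavily loaded virtual machine is expected to migrate there. Below is a record of a successful migration
# Suppose a new rule was created with climc monitor-migrationalert-create for host a15, triggering above 60, with id afc9468c-2cd7-4be8-83c7-92d7535a53cf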
$ climc monitor-migrationalert-event afc9468c-2cd7-4be8-83c7-92d7535a53cf --scope system
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|   id   |          ops_time           |                obj_id                |    obj_type    | obj_name |     user     | tenant |  action   |                                                                                                                                                        notes                                                                                                                                                         |
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 428147 | 2022-06-28T07:56:01.000000Z | afc9468c-2cd7-4be8-83c7-92d7535a53cf | migrationalert | test-cpu | monitoradmin | system | migrating | {"guest":{"host":"a15","host_id":"733b10fa-bd33-4503-836d-2ccd225bf12f","id":"a3107d1f-c46e-43cf-8aa8-55743d1533b1","name":"aisenzhe","score":10.769393333333335,"vcpu_count":8,"vmem_size":8192},"target_host":{"id":"6fc10297-eb20-4a96-86a8-4b65260d6016","name":"test-66-onecloud02","score":22.65071999099409}} |
+--------+-----------------------------+--------------------------------------+----------------+----------+--------------+--------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# The record above shows that virtual machine aisenzhe (cpu.usage_active 10.77%) on a15 is being migrated to target host test-66-onecloud02 (cpu.usage_active 22.65%)

# The virtual machine's status shows that it is migrating
$ climc server-list --search aisenzhe
+--------------------------------------+----------+--------------+-----------+------------+-----------+-----------+-----------------------------+------------+---------+-----------+
|                  ID                  |   Name   | Billing_type |  Status   | vcpu_count | vmem_size | Secgrp_id |         Created_at          | Hypervisor | os_type | is_system |
+--------------------------------------+----------+--------------+-----------+------------+-----------+-----------+-----------------------------+------------+---------+-----------+
| a3107d1f-c46e-43cf-8aa8-55743d1533b1 | aisenzhe | postpaid     | migrating | 8          | 8192      | default   | 2022-01-06T08:34:45.000000Z | kvm        | Linux   | false     |
+--------------------------------------+----------+--------------+-----------+------------+-----------+-----------+-----------------------------+------------+---------+-----------+

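# A live migration takes some time, depending on the virtual machine's memory and disk size. Once it finishes, a migration-success log entry is recorded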
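
The numbers in the two event records can be reproduced with a few lines of arithmetic. The sketch below assumes that a guest's score is its own cpu.usage_active scaled by the ratio of guest cores to host cores, which is inferred from the find_result_fail note (99.46995 * 24 / 40 ≈ 59.68197) rather than taken from the balancer source, and that a target is rejected when its current value plus the guest score reaches the threshold, as the note's `current + score >= threshold` text suggests. Function names are illustrative only.

```python
def guest_score(usage_active: float, guest_cpus: int, host_cpus: int) -> float:
    """Guest CPU load expressed as a share of the whole host, in percent (inferred formula)."""
    return usage_active * guest_cpus / host_cpus


def fits(host_current: float, score: float, threshold: float) -> bool:
    """A target is acceptable only if it stays below the threshold after the move."""
    return host_current + score < threshold


# Failed case: cpu-test-vm (24 of 40 cores, 99.47% busy) against a 60% threshold.
score = guest_score(99.46995000000001, 24, 40)
print(round(score, 5))               # 59.68197
print(fits(55.313408, score, 60.0))  # False: test-69-onecloud01 would exceed 60
print(fits(62.305391, score, 60.0))  # False: a15 is already above 60

# Successful case: aisenzhe (score 10.77) moving to test-66-onecloud02 (22.65).
print(fits(22.65071999099409, 10.769393333333335, 60.0))  # True
```

Under these assumptions, a guest that alone accounts for nearly 60% of a host can only land on an almost idle target, which is why reducing cluster load or adding a host lets the migration proceed.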

Other operations

# Balance CPU load automatically across all hosts in the cluster by omitting the --source-host parameter
$ climc monitor-migrationalert-create --period 5m all-host-cpu cpu.usage_active.gt 80

# Specify target hosts: when a host's cpu.usage_active rises above 80, virtual machines may only be migrated to target hosts host1 and host2
$ climc monitor-migrationalert-create --period 5m --target-host host1 --target-host host2 target-host-cpu cpu.usage_active.gt 80

# Specify the source hosts to monitor: only src-host1 and src-host2 are watched
$ climc monitor-migrationalert-create --period 5m --source-host src-host1 --source-host src-host2 src-host-cpu cpu.usage_active.gt 80

# Specify the source virtual machines: when a host's cpu.usage_active rises above 80, only gst1 and gst2 may be migrated
$ climc monitor-migrationalert-create --period 5m --source-guest gst1 --source-guest gst2 host-gst-cpu cpu.usage_active.gt 80

Notes

This feature is currently an alpha release, may be unstable, and is intended for testing only.

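In addition, inaccurate migration decisions could cause hosts to keep migrating virtual machines back and forth between each other and eventually snowball into a cascading failure. To prevent this, only one alert-triggered migration runs at any given time, and it migrates one batch of virtual machines at once. If another migrationalert rule fires during that time, its migration is abandoned; a rule can start its own migration only when no other migrationalert-triggered migration is running anywhere in the system.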