# Elasticsearch Migration Practice Based on Hwameistor
Because of how Kubernetes works, whether a stateful application can be migrated after deployment depends on the capabilities of the underlying CSI driver. When a cluster runs into unexpected conditions such as uneven resource usage, however, the affected stateful applications may need to be migrated to other nodes.

Taking Elasticsearch as an example and following the official Hwameistor migration guide, this article demonstrates how to migrate data-service middleware across nodes when using Hwameistor.
## Demo Environment

The demo environment is described below in terms of cluster information, the ES installation, and PVCs:
```console
[root@prod-master1 ~]# kubectl get node
NAME           STATUS   ROLES           AGE   VERSION
prod-master1   Ready    control-plane   15h   v1.25.4
prod-master2   Ready    control-plane   15h   v1.25.4
prod-master3   Ready    control-plane   15h   v1.25.4
prod-worker1   Ready    <none>          15h   v1.25.4
prod-worker2   Ready    <none>          15h   v1.25.4
prod-worker3   Ready    <none>          15h   v1.25.4
```
```console
kubectl -n mcamel-system get pvc -l elasticsearch.k8s.elastic.co/statefulset-name=mcamel-common-es-cluster-masters-es-data
NAME                                                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                 AGE
elasticsearch-data-mcamel-common-es-cluster-masters-es-data-0   Bound    pvc-61776435-0df5-448f-abb9-4d06774ec0e8   35Gi       RWO            hwameistor-storage-lvm-hdd   15h
elasticsearch-data-mcamel-common-es-cluster-masters-es-data-1   Bound    pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c   35Gi       RWO            hwameistor-storage-lvm-hdd   15h
elasticsearch-data-mcamel-common-es-cluster-masters-es-data-2   Bound    pvc-955bd221-3e83-4bb5-b842-c11584bced10   35Gi       RWO            hwameistor-storage-lvm-hdd   15h
```
## Demo Goal

Migrate the stateful application mcamel-common-es-cluster-masters-es-data-1 (hereafter the demo application, or esdata-1) from node prod-worker3 to node prod-master3.
## Preparation

### Identify the PV to Migrate

Use the following commands to locate the PV backing the demo application esdata-1 and determine which PV needs to be migrated.
-   Check the PVC bound to the demo application (see the lookup sketch after this list).

-   Check the PV bound to that PVC:

    ```console
    [root@prod-master1 ~]# kubectl -n mcamel-system get pvc elasticsearch-data-mcamel-common-es-cluster-masters-es-data-1
    NAME                                                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                 AGE
    elasticsearch-data-mcamel-common-es-cluster-masters-es-data-1   Bound    pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c   35Gi       RWO            hwameistor-storage-lvm-hdd   17h
    ```
-   Confirm that the application bound to this PV is the one to be migrated, i.e. the demo application esdata-1:

    ```console
    [root@prod-master1 ~]# kubectl -n mcamel-system get pv pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c
    NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                                          STORAGECLASS                 REASON   AGE
    pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c   35Gi       RWO            Delete           Bound    mcamel-system/elasticsearch-data-mcamel-common-es-cluster-masters-es-data-1   hwameistor-storage-lvm-hdd            17h
    ```
The output above confirms that the PV to migrate is pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c.
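For the first step in the list above, tracing from the Pod to its PVC can be done with a jsonpath query. A minimal sketch, using the demo Pod's name (this exact command is not part of the original guide):

```bash
# Print the PVC name(s) referenced by the demo Pod's volumes.
kubectl -n mcamel-system get pod mcamel-common-es-cluster-masters-es-data-1 \
  -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'
```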
### Stop the Application to Be Migrated
-   Check the applications that are currently running:

    ```console
    [root@prod-master1 ~]# kubectl -n mcamel-system get sts
    NAME                                       READY   AGE
    elastic-operator                           2/2     20h
    mcamel-common-es-cluster-masters-es-data   3/3     20h
    mcamel-common-kpanda-mysql-cluster-mysql   2/2     20h
    mcamel-common-minio-cluster-pool-0         1/1     20h
    mcamel-common-mysql-cluster-mysql          2/2     20h
    mysql-operator                             1/1     20h
    rfr-mcamel-common-redis-cluster            3/3     20h
    ```
-   Stop the ES operator (the scale-down commands for this and the next step are sketched after this list).

-   Stop ES.
-   Confirm that ES has stopped:

    ```console
    [root@prod-master1 ~]# kubectl -n mcamel-system get sts
    NAME                                       READY   AGE
    elastic-operator                           0/0     20h
    mcamel-common-es-cluster-masters-es-data   0/0     20h
    mcamel-common-kpanda-mysql-cluster-mysql   2/2     20h
    mcamel-common-minio-cluster-pool-0         1/1     20h
    mcamel-common-mysql-cluster-mysql          2/2     20h
    mysql-operator                             1/1     20h
    rfr-mcamel-common-redis-cluster            3/3     20h
    ```
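The exact scale-down commands are not shown above. Based on the StatefulSet names and the `kubectl scale` command used later to restart ES, they presumably look like the following sketch:

```bash
# Scale the ES operator to zero first, so it does not reconcile
# the ES cluster back up while the volume is being migrated.
kubectl -n mcamel-system scale --replicas=0 sts elastic-operator

# Then scale the ES data StatefulSet itself down to zero.
kubectl -n mcamel-system scale --replicas=0 sts mcamel-common-es-cluster-masters-es-data
```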
## Start the Migration

For a detailed explanation of this procedure, see the official Hwameistor documentation: Migrate Volumes.
-   Create the migration task:

    ```yaml
    [root@prod-master1 ~]# cat migrate.yaml
    apiVersion: hwameistor.io/v1alpha1
    kind: LocalVolumeMigrate
    metadata:
      namespace: hwameistor
      name: migrate-es-pvc                                   # task name
    spec:
      sourceNode: prod-worker3                               # source node; can be obtained with `kubectl get ldn`
      targetNodesSuggested:
      - prod-master3
      volumeName: pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c   # the volume (PV) to migrate
      migrateAllVols: false
    ```
-   Run the migration command (a sketch follows this list).

    At this point a pod is created in the hwameistor namespace to carry out the migration.
-   Check the migration status:

    ```yaml
    [root@prod-master1 ~]# kubectl get localvolumemigrates.hwameistor.io migrate-es-pvc -o yaml
    apiVersion: hwameistor.io/v1alpha1
    kind: LocalVolumeMigrate
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"hwameistor.io/v1alpha1","kind":"LocalVolumeMigrate","metadata":{"annotations":{},"name":"migrate-es-pvc"},"spec":{"migrateAllVols":false,"sourceNode":"prod-worker3","targetNodesSuggested":["prod-master3"],"volumeName":"pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c"}}
      creationTimestamp: "2023-04-30T12:24:17Z"
      generation: 1
      name: migrate-es-pvc
      resourceVersion: "1141529"
      uid: db3c0df0-57b5-42ef-9ec7-d8e6de487767
    spec:
      abort: false
      migrateAllVols: false
      sourceNode: prod-worker3
      targetNodesSuggested:
      - prod-master3
      volumeName: pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c
    status:
      message: 'waiting for the sync job to complete: migrate-es-pvc-datacopy-elasticsearch-data-mcamel'
      originalReplicaNumber: 1
      state: SyncReplica
      targetNode: prod-master3
    ```
-   After the migration completes, check the result (a verification sketch follows this list).
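The migration command itself is not shown above; since the task was written to migrate.yaml, it is presumably a plain apply. A minimal sketch:

```bash
# Submit the LocalVolumeMigrate task defined above.
kubectl apply -f migrate.yaml
```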
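Checking the result is also left implicit in the source. One way to verify, sketched here under the assumption that your Hwameistor version exposes the LocalVolumeReplica CRD, is to confirm that the task has finished and that the volume's replica now sits on prod-master3:

```bash
# Re-inspect the migration task; status.state should no longer be SyncReplica.
kubectl get localvolumemigrates.hwameistor.io migrate-es-pvc -o yaml

# List the replicas of the migrated volume; the node shown should
# now be prod-master3 rather than prod-worker3.
kubectl get localvolumereplicas | grep pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c
```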
## Restore common-es
-   Start the ES operator (a sketch of this command follows this list).

-   Start ES:

    ```console
    [root@prod-master1 ~]# kubectl -n mcamel-system scale --replicas=3 sts mcamel-common-es-cluster-masters-es-data
    ```
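The operator start command is not shown above. Since elastic-operator reported 2/2 READY before the migration, restoring it presumably means scaling back to two replicas; a sketch:

```bash
# Scale the ES operator back to its pre-migration replica count.
kubectl -n mcamel-system scale --replicas=2 sts elastic-operator
```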
## Known Issues
HwameiStor uses rclone to migrate PVs, and rclone may lose file ownership and permissions during the copy (see rclone#1202 and hwameistor#830). If permissions are lost, ES fails to start and keeps restarting, stuck in a crash loop.

When you hit a problem like this, the following steps help you diagnose and fix it.
### Confirm the Problem

Check the Pod logs with the following command:
```bash
kubectl -n mcamel-system logs mcamel-common-es-cluster-masters-es-data-0 -c elasticsearch
```
If the log contains the following error, the failure is confirmed to be caused by lost permissions:
```log
java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
```
### Fix the Problem
-   Edit the ES CR (a sketch of the edit command is given at the end of this section).
-   Add an init container to the ES Pod spec with the following content:

    ```yaml
    - command:
      - sh
      - -c
      - chown -R elasticsearch:elasticsearch /usr/share/elasticsearch/data
      name: change-permission
      resources: {}
      securityContext:
        privileged: true
    ```
The init container sits at the following position in the CR:
```yaml
spec:
  ... ...
  nodeSets:
  - config:
      node.store.allow_mmap: false
    count: 3
    name: data
    podTemplate:
      metadata: {}
      spec:
        ... ...
        initContainers:
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          resources: {}
          securityContext:
            privileged: true
        - command:
          - sh
          - -c
          - chown -R elasticsearch:elasticsearch /usr/share/elasticsearch/data
          name: change-permission
          resources: {}
          securityContext:
            privileged: true
```
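The edit command for the first step is not shown above. ECK names StatefulSets as {cluster}-es-{nodeSet}, so with nodeSet data the Elasticsearch CR here is presumably named mcamel-common-es-cluster-masters; verify the name before editing. A sketch:

```bash
# Confirm the exact name of the Elasticsearch CR.
kubectl -n mcamel-system get elasticsearch

# Open the CR for editing and add the change-permission init container
# under spec.nodeSets[].podTemplate.spec.initContainers as shown above.
kubectl -n mcamel-system edit elasticsearch mcamel-common-es-cluster-masters
```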