Alluxio Enterprise AI on K8s: FIO Testing Tutorial

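The link "Alluxio Enterprise AI on K8s Testing Tutorial" points to the video version of this FIO testing tutorial. fio is a widely used benchmarking tool for disks and file systems. The text below explains how to run fio tests against Alluxio on K8s.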

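1. Test environment

VM instance type: ecs.g3i.16xlarge, with 64 vCPUs, 256 GB of memory, and a 140 GB disk (极速型 SSD FlexPL, an ultra-fast SSD type). See the instance specifications for bandwidth and other details.

Alluxio version: 3.2-5.2.1

Alluxio Operator version: 1.3.0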

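2. Preparing the test environment

Make sure an Alluxio cluster has been deployed on a cloud Kubernetes cluster and that the pods listed below are running. For how to deploy and start the cluster, see the "Alluxio on K8s Deployment Tutorial".

1 Coordinator pod

2 Worker pods

1 FUSE pod: started automatically when the application pod starts, and scheduled onto the same node as the application pod

1 application pod: for how to start it, see FUSE-based POSIX API; a sample YAML file is also given below. The video walkthrough is at 11:30 in the "Alluxio on K8s Deployment Tutorial" video.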

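2.1 Cluster configuration

The YAML configuration for the Alluxio cluster is shown below.

Note 1: etcd requires the storageClass field to be set. Container services from different cloud vendors offer different storage class types; for how to configure this, see the "Alluxio on K8s Deployment Tutorial" and the Alluxio on K8s FAQ. If you are not sure which storage classes your vendor's container service provides, run kubectl get sc to list them. If you are only validating the deployment and have no convenient storage class available for the cluster, you can turn off etcd persistence as shown below. This setting is not suitable for production and is intended for validation testing only.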

etcd:
  persistence:
    enabled: false

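Note 2: By default the cluster deploys 1 coordinator, 2 workers, and one 3-node etcd ensemble, and the FUSE-related pods are created automatically while a pod that mounts the PVC is starting. Set the resource requests of these pods carefully so that all of them can be scheduled.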

apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio
spec:
  image: k8s-alluxio-cn-beijing.cr.volces.com/alluxio-test/alluxio-enterprise
  imageTag: AI-3.2-5.2.1
  user: 0
  group: 0

  worker:
    count: 2
    resources:
      limits:
        cpu: "16"
        memory: "32Gi"
      requests:
        cpu: "0"
        memory: "512Mi"
    jvmOptions:
      - "-Xmx16g"
      - "-Xms16g"
      - "-XX:MaxDirectMemorySize=12g"

  fuse:
    type: csi
    resources:
      requests:
        cpu: "0"
        memory: "2Gi"
      limits:
        cpu: "32"
        memory: "16Gi"
    jvmOptions:
      - "-Xms24g"
      - "-Xmx24g"
      - "-XX:MaxDirectMemorySize=16g"

  etcd:
    enabled: true
    replicaCount: 1
    persistence:
      storageClass: ebs-ssd
      size: 30Gi


    image:
      registry: k8s-alluxio-cn-beijing.cr.volces.com
      repository: alluxio-test/etcd
      tag: 3.5.9-debian-11-r24
    volumePermissions:
      image:
        registry: k8s-alluxio-cn-beijing.cr.volces.com
        repository: alluxio-test/os-shell
        tag: 11-debian-11-r2

  alluxio-monitor:
    enabled: true
    prometheus:
      imageInfo:
        image: k8s-alluxio-cn-beijing.cr.volces.com/alluxio-test/prometheus
        imageTag: v2.52.0
    grafana:
      imageInfo:
        image: k8s-alluxio-cn-beijing.cr.volces.com/alluxio-test/grafana
        imageTag: 11.1.0-ubuntu
  pagestore:
    quota: 10Gi

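2.2 Application pod configuration

The YAML configuration for the application pod is shown below. The image field can point to any image, as long as the later setup steps can run in it (they use apt-get, so a Debian- or Ubuntu-based image is assumed). Users in mainland China only need to make sure the specified image can be pulled by the cluster.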

apiVersion: v1
kind: Pod
metadata:
  name: fuse-test-0
  labels:
    app: alluxio
spec:
  containers:
    - image: k8s-alluxio-cn-beijing.cr.volces.com/alluxio-test/grafana:11.1.0-ubuntu
      imagePullPolicy: IfNotPresent
      name: fuse-test
      command: ["/bin/sh", "-c"]
      args:
        - sleep infinity
      volumeMounts:
        - mountPath: /data
          name: alluxio-pvc
          mountPropagation: HostToContainer
      securityContext:
        runAsUser: 0
        runAsGroup: 0
  volumes:
    - name: alluxio-pvc
      persistentVolumeClaim:
        claimName: alluxio-alluxio-csi-fuse-pvc
    
  nodeSelector:
    kubernetes.io/hostname: 172.31.16.6

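If the FUSE pod and a worker pod are scheduled onto the same node, data does not travel over the network and the test results are not representative. To keep the FUSE pod and the worker pods on different nodes, use the last two lines (the nodeSelector) to pin the FUSE pod and the application pod to a specific node, choosing one that hosts no worker pod. The value is a node name as shown by kubectl get node; here the pod is assigned to the node named 172.31.16.6.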
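
The placement rule above can be checked after the pods are running. The following is a minimal sketch: shared_nodes is a hypothetical helper that reads "pod name, node name" pairs and prints any node that hosts both a worker pod and a FUSE pod. Matching pods by the substrings "worker" and "fuse" is an assumption about pod naming, so adjust the patterns to the names in your cluster. The application pod fuse-test-0 also matches "fuse", which is harmless because it runs on the same node as the FUSE pod.

```shell
# shared_nodes: reads "POD_NAME NODE_NAME" lines on stdin and prints every
# node that hosts both a worker pod and a FUSE pod (one node per line).
# Matching pods by the substrings "worker" and "fuse" is an assumption.
shared_nodes() {
  awk '
    $1 ~ /worker/ { worker[$2] = 1 }
    $1 ~ /fuse/   { fuse[$2] = 1 }
    END { for (n in worker) if (n in fuse) print n }
  ' | sort
}

# Against a live cluster (not run here):
#   kubectl get pods --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | shared_nodes
```

An empty result means no node is shared, which is the layout this test needs.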

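Before running fio on the application pod, complete the following setup:

Update the package index and install the dependencies:

apt-get update && apt-get install -y libaio-dev fio openssh-server

Start the SSH service:

service ssh start

Set up passwordless login from the application pod to the host so that the host's kernel cache can be cleared easily.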
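
The tutorial does not spell out the passwordless login step, so the following is only one common way to do it with standard OpenSSH tools. The function name setup_passwordless_ssh, the ed25519 key type, the default key path, and the SSH_KEYGEN and SSH_COPY_ID override variables are all choices made for this sketch. It also assumes the host accepts password login once for the chosen account so that the public key can be installed.

```shell
# setup_passwordless_ssh TARGET [KEY_FILE]
# Creates a key pair without a passphrase (if KEY_FILE is missing) and
# installs the public key on TARGET, given as user@host.
# SSH_KEYGEN / SSH_COPY_ID can be overridden, e.g. with "echo" for a dry run.
setup_passwordless_ssh() {
  target=$1
  key=${2:-$HOME/.ssh/id_ed25519}
  keygen=${SSH_KEYGEN:-ssh-keygen}
  copy_id=${SSH_COPY_ID:-ssh-copy-id}

  if [ ! -f "$key" ]; then
    mkdir -p "$(dirname "$key")"
    "$keygen" -q -t ed25519 -N "" -f "$key" || return 1
  fi
  # Asks for the host password once; later logins use the key.
  "$copy_id" -i "$key.pub" "$target"
}
```

A call such as setup_passwordless_ssh user@172.31.16.6 would be run inside the application pod, where user stands for a host account that is allowed to run sudo (the account name is not given in this tutorial).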

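3. Test procedure

Use fio to benchmark read performance on the Alluxio file system, as follows:

3.1 Prepare the data

First, log in to any worker pod and use the Alluxio job command to load the test data onto the workers. In this example the test data is tos://tos-k8s-alluxio-test/5G, a 5 GB file generated with dd and uploaded beforehand: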

alluxio job load --path tos://tos-k8s-alluxio-test/5G --submit
alluxio job load --path tos://tos-k8s-alluxio-test/5G --progress

--submit submits the load job and --progress reports its progress. Once --progress returns Job State: SUCCEEDED, the test data is fully loaded, and all subsequent reads through alluxio-fuse are hot reads.
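
Rather than rerunning --progress by hand, the check can be wrapped in a small polling loop. This is a sketch only: wait_for_load is a hypothetical helper, the attempt count and interval are arbitrary defaults, and the only output string it relies on is the Job State: SUCCEEDED line quoted above.

```shell
# wait_for_load PATH [ATTEMPTS] [INTERVAL_SECONDS]
# Polls "alluxio job load --progress" until it reports SUCCEEDED.
# Returns 0 on success, 1 if the job has not finished after ATTEMPTS checks.
wait_for_load() {
  path=$1
  attempts=${2:-60}
  interval=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if alluxio job load --path "$path" --progress | grep -q "Job State: SUCCEEDED"; then
      echo "load finished: $path"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "load not finished after $attempts checks: $path" >&2
  return 1
}
```

Inside a worker pod, wait_for_load tos://tos-k8s-alluxio-test/5G would be called right after the --submit command.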

3.2 (Important) Clear the kernel cache before every test

Clear the kernel cache before each test run so that the Linux kernel cache does not distort the results. Run the following command on the host:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

3.3 Sequential hot-read tests

3.3.1 `-bs=4K` sequential hot read

Use the following command to measure 4K sequential hot-read performance:

fio -iodepth=1 -rw=read -ioengine=libaio -bs=4K -numjobs=1 -group_reporting -size=5G -filename=/data/tos/5G -name=read_test --readonly -direct=1 --invalidate=1

3.3.2 `-bs=256K` sequential hot read

After clearing the kernel cache, use the following command to measure 256K sequential hot-read performance:

fio -iodepth=1 -rw=read -ioengine=libaio -bs=256K -numjobs=1 -group_reporting -size=5G -filename=/data/tos/5G -name=read_test --readonly -direct=1 --invalidate=1

3.4 Random hot-read tests

3.4.1 `-bs=4K` random hot read

Clear the kernel cache again, then run the 4K random hot-read test:

fio -iodepth=1 -rw=randread -ioengine=libaio -bs=4K -numjobs=1 -group_reporting -size=5G -filename=/data/tos/5G -name=read_test --readonly -direct=1 --invalidate=1

3.4.2 `-bs=256K` random hot read

After clearing the kernel cache, run the 256K random hot-read test:

fio -iodepth=1 -rw=randread -ioengine=libaio -bs=256K -numjobs=1 -group_reporting -size=5G -filename=/data/tos/5G -name=read_test --readonly -direct=1 --invalidate=1
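
The four commands in sections 3.3 and 3.4 differ only in -rw and -bs, so they can be generated from one template and run with a cache clear before each. The sketch below does that; fio_cmd, run_hot_reads, and the DRY_RUN switch are names invented here, and user@172.31.16.6 is a placeholder host login. Executing for real assumes the passwordless SSH from section 2.2 and a host account that can run sudo without a prompt.

```shell
# fio_cmd RW BS [NUMJOBS]: prints the fio command line used in this tutorial.
# RW is "read" (sequential) or "randread" (random); BS is e.g. 4K or 256K.
fio_cmd() {
  echo "fio -iodepth=1 -rw=$1 -ioengine=libaio -bs=$2 -numjobs=${3:-1}" \
    "-group_reporting -size=5G -filename=/data/tos/5G -name=read_test" \
    "--readonly -direct=1 --invalidate=1"
}

# run_hot_reads HOST: for each of the four tests, clear the host's kernel
# cache over SSH, then run fio. With DRY_RUN=1 (the default here) the
# commands are only printed; set DRY_RUN=0 to execute them.
run_hot_reads() {
  host=$1
  drop="sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'"
  for rw in read randread; do
    for bs in 4K 256K; do
      if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "ssh $host \"$drop\""
        fio_cmd "$rw" "$bs"
      else
        ssh "$host" "$drop" || return 1
        sh -c "$(fio_cmd "$rw" "$bs")" || return 1
      fi
    done
  done
}

# Placeholder login; prints the eight commands without running them.
run_hot_reads user@172.31.16.6
```

Printing by default keeps the script safe to try anywhere. The optional third argument of fio_cmd sets -numjobs for multi-threaded runs.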

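These steps make it possible to evaluate the Alluxio file system under sequential and random hot-read workloads and to collect the corresponding measurements.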

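In the video, the measured single-threaded fio read throughput reaches 2924MB/s for sequential hot reads with a 256K block size. Raising the thread count (numjobs) to 32 or 64 yields higher fio results. For more test results, see the performance benchmarks on the official website.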
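
To put the throughput figure in terms of requests, it can be divided by the block size. The arithmetic below is a sketch that assumes MB/s is read as MiB/s and 256K as 256 KiB; with decimal megabytes the result is a few percent lower. The helper names are made up for this example.

```shell
# iops_from_throughput MIB_PER_S BLOCK_KIB: requests per second implied by
# a throughput figure and a block size (integer arithmetic).
iops_from_throughput() {
  echo $(( $1 * 1024 / $2 ))
}

# avg_latency_us IOPS: average time per request in microseconds when only
# one request is in flight at a time (numjobs=1, iodepth=1), rounded down.
avg_latency_us() {
  echo $(( 1000000 / $1 ))
}

iops=$(iops_from_throughput 2924 256)
echo "requests per second: $iops"
echo "average microseconds per request: $(avg_latency_us "$iops")"
```

Under those assumptions the single-threaded result corresponds to 11696 requests per second, or roughly 85 microseconds per 256K read. Because one thread at queue depth 1 waits for each read to complete before issuing the next, adding threads through numjobs is the way to push total throughput higher.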

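Summary

### Article summary: Alluxio Enterprise AI on K8s FIO Testing Tutorial
#### Tutorial overview
This tutorial uses video and text to explain how to benchmark Alluxio Enterprise AI with `fio` in a Kubernetes environment. `fio` is a widely used benchmarking tool for disks and file systems and is suited to evaluating Alluxio's performance across different read scenarios.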
#### Test environment
- **VM instance type**: ecs.g3i.16xlarge, with 64 vCPUs, 256 GB of memory, and a 140 GB disk (极速型 SSD FlexPL).
- **Alluxio version**: 3.2-5.2.1
- **Alluxio Operator version**: 1.3.0
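#### Preparing the test environment
1. **Cluster setup**: make sure an Alluxio cluster is deployed and running on a cloud Kubernetes cluster, with 1 Coordinator pod, 2 Worker pods, 1 FUSE pod, and 1 application pod.
2. **Cluster configuration**:
- Write the Alluxio cluster YAML, paying attention to the etcd persistence settings (adjust them to the storage class types your cloud vendor provides).
- Set resource requests and limits for the Worker, FUSE, and other pods so that the cluster can schedule them.
3. **Application pod configuration**:
- Prepare the application pod YAML and make sure its image can be pulled by the cluster.
- Use nodeSelector to place the FUSE pod and the application pod on a node that hosts no worker pod, so that reads actually go over the network and the results are representative.
- In the application pod, update and install the dependencies, start the SSH service, and set up passwordless login to the host.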
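#### Test procedure
1. **Prepare the data**:
- Log in to a worker pod and use the Alluxio `job` command to load the test data onto the workers.
- Use `alluxio job load` to submit the load and monitor its progress.
2. **Clear the kernel cache**:
- Before every test, run `echo 3 > /proc/sys/vm/drop_caches` on the host to clear the kernel cache so that cached data does not affect the results.
3. **Run the sequential hot-read tests**:
- Use `fio` to measure sequential hot-read performance at 4K and at 256K.
- Adjust the test parameters (such as iodepth, rw, and ioengine) to fit other test scenarios.
4. **Run the random hot-read tests**:
- Use `fio` in the same way to measure random hot-read performance at 4K and at 256K.
- Clear the kernel cache before each test to keep the results accurate.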
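#### Results and conclusions
- The steps above make it possible to evaluate the Alluxio file system under sequential and random hot-read workloads.
- In the video, the measured single-threaded fio read throughput reaches 2924MB/s for sequential hot reads with a 256K block size.
- Raising the thread count (numjobs) improves fio results further.
#### Further resources
- For more test results and details, see the performance benchmarks page on the official website.
- For the video version, see the "Alluxio on K8s FIO Testing Video Tutorial".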