Trying Out Clinic on TiDB v6.0.0

Author: 边城元元

Original source: https://tidb.net/blog/6b2cf9a8

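1. Background

The TiDB ecosystem keeps maturing. That brings benefits, but more components also add uncertainty to operations. Clinic lowers the cost of operations and helps locate problems in a cluster quickly and accurately. Out of curiosity and respect for new technology, I recorded my first experience with Clinic here.

Clinic currently supports TiDB Cloud and local TiDB clusters of v4.0 or later deployed with TiUP.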

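2. How Clinic works

Using Clinic requires installing the Diag component.

Diag first obtains the cluster topology and then collects diagnostic data through several different collection methods.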

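Obtaining the cluster topology

Diag obtains the cluster topology from the deployment tool (tiup-cluster or tidb-operator).

Collection method 1: transferring server files with scp

For clusters deployed with TiUP, Diag collects log files and configuration files directly from the target component nodes over scp.

Collection method 2: running commands remotely over ssh

For clusters deployed with TiUP, Diag can ssh into the target component hosts and run commands such as insight to gather system information, including kernel logs, kernel parameters, and basic system and hardware information.

Collection method 3: collecting data through HTTP calls

Diag calls the HTTP interfaces of the TiDB components to obtain the real-time configuration and real-time performance sampling data of TiDB, TiKV, PD, and other components. It also calls the Prometheus HTTP interface to obtain alert information and metrics monitoring data.

Collection method 4: querying database parameters with SQL

Diag uses SQL statements to query system parameters and other information from the TiDB database. This method requires you to supply a username and password for TiDB at collection time.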
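To make collection method 3 more concrete, here is a minimal sketch of the kind of HTTP endpoints involved. This post does not show the exact requests Diag sends, so the paths and default ports below are my assumptions based on the public TiDB and Prometheus documentation, and the function names are made up for illustration.

```shell
#!/bin/sh
# Print the URL from which a component's live configuration can be read.
# Assumption: paths and default ports come from public TiDB docs, not from Diag itself.
config_url() (
  component=$1 host=$2 port=$3
  case $component in
    tidb) path=/config ;;               # TiDB status port, 10080 by default
    tikv) path='/config?full=true' ;;   # TiKV status port, 20180 by default
    pd)   path=/pd/api/v1/config ;;     # PD client port, 2379 by default
    *) echo "unknown component: $component" >&2; exit 1 ;;
  esac
  echo "http://${host}:${port}${path}"
)

# Print the Prometheus HTTP API URL that lists current alerts.
alerts_url() {
  echo "http://$1:$2/api/v1/alerts"
}
```

With a running cluster, something like `curl -s "$(config_url pd 10.0.2.15 2379)"` should return the PD configuration as JSON. The host here is only the test machine that appears later in this post.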

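For TiDB clusters and DM clusters deployed with TiUP, the PingCAP Clinic diagnostic service (PingCAP Clinic below) uses the Diag diagnostic client (Diag below) together with the Clinic Server cloud diagnostic platform (Clinic Server below) to troubleshoot cluster problems remotely and to check cluster status quickly on the local machine.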

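3. Goals of this trial

1) Try Clinic on v6.0.0 installed offline

2) Try Clinic on v5.3.1 installed online

3) Try remote assistance for quickly locating cluster problems

3.1 Install two clusters (not covered in detail here)

For the cluster111.yml topology, see https://tidb.net/blog/af8080f7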

3.1.1 Install v5.3.1

# Online install of v5.3.1
curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh

source /root/.bash_profile
tiup update --self && tiup update cluster
tiup list tidb 
tiup cluster check ./cluster111.yml --user root -p
tiup cluster check ./cluster111.yml --user root -p --apply 
tiup cluster deploy cluster111 v5.3.1 ./cluster111.yml --user root -p
tiup cluster list
tiup cluster start cluster111

# The following two items are not covered here; set them up yourself
# numactl
# Disable the swap partition

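3.1.2 Offline install of v6.0.0

# Offline install of TiDB 6 (installs quickly)

# Download the offline package; installing it over an existing TiUP performs an in-place upgrade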
tar xzvf tidb-community-server-${version}-linux-amd64.tar.gz
sh tidb-community-server-${version}-linux-amd64/local_install.sh
source /root/.bash_profile
tiup update cluster
tiup cluster check ./cluster111.yml --user root -p --apply 
tiup cluster deploy cluster111 v6.0.0 ./cluster111.yml --user root -p

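3.2 Install Clinic

3.2.1 Install Diag

# On the control machine where TiUP is installed, install Diag with one command
tiup install diag

3.2.2 Log in to the Clinic site and get a token

https://clinic.pingcap.com.cn/portal

1. Log in with your community account
2. Set up an organization first
3. Get the token from the lower-right corner

# Set the token used to upload collected data
# The token is only used for uploading data; it is not needed when accessing the data.
tiup diag config clinic.token ${token-value}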

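3.3 Collect data from a TiDB cluster

# Usage
# Run Diag to collect diagnostic data.
# For example, to collect diagnostic data from 4 hours ago to 2 hours ago, run:
tiup diag collect ${cluster-name} -f="-4h" -t="-2h"

# After you run the collect command, Diag does not start collecting right away. It first prints the estimated data size and the storage path, then asks whether to proceed. To start collecting, enter Y.
# When collection finishes, Diag prints the path of the folder that holds the collected data.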
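The -f and -t values are offsets relative to the current time, so it is easy to swap them by mistake. The sketch below is a small helper of my own, not part of Diag, that converts the two offsets into a window length in seconds and rejects an empty window. It only handles the h, m and s suffixes, which is an assumption on my part; Diag itself may accept other forms.

```shell
#!/bin/sh
# Convert a relative offset such as "-4h", "-90m" or "-30s" to seconds ago.
# Assumption: only the h/m/s suffixes are handled here.
offset_to_seconds() (
  value=${1#-}             # drop the leading minus sign
  number=${value%[hms]}    # numeric part
  unit=${value#"$number"}  # trailing unit letter
  case $number in ''|*[!0-9]*) echo "bad offset: $1" >&2; exit 1 ;; esac
  case $unit in
    h) echo $((number * 3600)) ;;
    m) echo $((number * 60)) ;;
    s) echo "$number" ;;
    *) echo "bad offset: $1" >&2; exit 1 ;;
  esac
)

# Print the window length in seconds for FROM and TO; fail if FROM is not earlier than TO.
window_seconds() (
  from=$(offset_to_seconds "$1") || exit 1
  to=$(offset_to_seconds "$2") || exit 1
  if [ "$from" -le "$to" ]; then
    echo "window is empty: from=$1 to=$2" >&2
    exit 1
  fi
  echo $((from - to))
)
```

For the example above, `window_seconds -4h -2h` prints 7200, a two-hour window.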

3.3.1 Collect data from cluster111

Data from 4 hours ago until now

# 1. Collect (the second form adds -y to skip the confirmation prompt)
tiup diag collect cluster111 -f="-4h"
tiup diag collect cluster111 -f="-4h" -y

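3.3.2 Upload the collected data

# Upload the collected data to Clinic Server.
# 2.1 Method 1: upload directly
# The data (the data folder) must not exceed 10 GB, or the upload fails.
# tiup diag upload ${filepath}
tiup diag upload /usr/local0/webserver/tidb/diag-fSk85byRYW6

# This upload method requires Diag v0.7.0 or later.

# 2.2 Method 2: package first, then upload.
tiup install diag
tiup diag collect cluster111 -f="-4h" -y
tiup diag package ${filepath}
# While packaging, Diag also compresses and encrypts the data.
# This produces a .diag file.

# Upload the package from a machine that can reach the internet.

tiup diag upload ${filepath}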
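Since a data folder larger than 10 GB fails to upload, it can save time to check the size first. The following is a rough sketch of such a check. It relies only on du and awk, and it reads "10 GB" as 10 * 1024 * 1024 KiB, which is my assumption about how the limit is counted. The function names are hypothetical.

```shell
#!/bin/sh
# 10 GB expressed in KiB. Assumption: the limit is counted in binary units.
MAX_UPLOAD_KB=$((10 * 1024 * 1024))

# Print the size of a directory in KiB as reported by du.
dir_size_kb() {
  du -sk "$1" | awk '{print $1}'
}

# Succeed when the directory is within the limit (a second argument overrides the limit).
fits_upload_limit() (
  dir=$1
  limit_kb=${2:-$MAX_UPLOAD_KB}
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; exit 2; }
  size_kb=$(dir_size_kb "$dir")
  if [ "$size_kb" -gt "$limit_kb" ]; then
    echo "$dir is ${size_kb} KiB, above the ${limit_kb} KiB limit" >&2
    exit 1
  fi
  echo "$dir is ${size_kb} KiB, within the limit"
)
```

A possible use is `fits_upload_limit ${filepath} && tiup diag upload ${filepath}`, so the upload only starts when the folder is small enough.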

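3.3.3 Log in to Clinic and verify the collected data


# 3. After the upload finishes, use the Download URL in the upload output to open the diagnostic data

The data reported by the two clusters is as follows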

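3.3.4 Problems encountered during the trial

3.3.4.1 If the cluster is stopped, Clinic is unavailable

3.3.4.2 If PD is down, Clinic is unavailable

Note: 1) Clinic needs to get the cluster topology from PD. 2) Cluster information can be collected as long as PD is healthy.

4. The quick local check of cluster status only works on data collected with --include="config"

tiup diag collect ${cluster-name} --include="config"
tiup diag collect cluster111 --include="config"

tiup diag check ${filepath}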

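3.3.4.3 PD is healthy but TiKV is down

1) Collection succeeds

2) tiup diag check fails

3.3.4.4 PD is healthy but TiDB is down

1) Collection succeeds

2) tiup diag check fails

3.3.4.5 PD is healthy but one node is abnormal

The local tiup diag check ${filepath} cannot be used

1) Collection succeeds

2) tiup diag check fails

3) The data can still be uploaded

tiup diag package ${filepath}
tiup diag upload ${filepath}

# If you change the token, delete the old .diag file and run package again
tiup diag config clinic.token  ${token}

3.3.4.6 All nodes are up and running

tiup diag check ${filepath} can be used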
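The cases above suggest a simple routine: try the local check first, and if it fails because some node is abnormal, fall back to uploading the data so it can be analyzed on Clinic Server. Here is a minimal sketch of that routine. The function name is hypothetical, and the TIUP variable exists only so the flow can be dry-run with a stub instead of the real tiup. The fallback uploads the data folder directly, as in section 3.3.2.

```shell
#!/bin/sh
# TIUP can be pointed at a stub for dry runs; it defaults to the real tiup binary.
TIUP=${TIUP:-tiup}

# Run the local check; if it fails, fall back to uploading the data folder.
# Prints which path was taken: "checked" or "uploaded".
check_or_upload() (
  dir=$1
  if "$TIUP" diag check "$dir"; then
    echo "checked"
  elif "$TIUP" diag upload "$dir"; then
    echo "uploaded"
  else
    echo "check and upload both failed for $dir" >&2
    exit 1
  fi
)
```

Note that the real tiup prints its own output as well, so the "checked" or "uploaded" marker will be the last line rather than the only one.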

3.4 Collect data from a DM cluster

# Change the version in the file name to the one you actually need

tar xzvf  dm-v1.9.4-linux-amd64.tar.gz
chmod +x ./tiup-dm && mv ./tiup-dm /root/.tiup/bin/
tiup-dm list


# Standalone collection commands
tiup diag collectdm <dm-cluster-name>
tiup diag package ${filepath}
tiup diag upload ${filepath}.diag

[root@bogon vagrant]# tiup diag collectdm dm-cluster111
tiup is checking updates for component diag ...
Starting component `diag`: /root/.tiup/components/diag/v0.7.0/diag /root/.tiup/components/diag/v0.7.0/diag collectdm dm-cluster111
Detecting metadata of the cluster...

Detecting alert lists from Prometheus node...

Detecting metrics from Prometheus node...

No Prometheus node found in topology, skip.
Detecting basic system information of servers...

Detecting logs of components...

+ Download necessary tools
  - Downloading collecting tools for linux/amd64 ... Done
+ Collect host information
  - Scraping log files on 10.0.2.15:22 ... ⠧ CopyComponent: component=diag, version=, remote=10.0.2.15:/tmp/tiup os=linux, arch=amd64
+ Collect host information
  - Scraping log files on 10.0.2.15:22 ... ⠹ Shell: host=10.0.2.15, sudo=false, command=`/tmp/tiup/bin/scraper --log '/home/tidb/deploy/dm-master-8261/log/*,/hom...
+ Collect host information
  - Scraping log files on 10.0.2.15:22 ... Done
Detecting config files of components...

+ Download necessary tools
  - Downloading collecting tools for linux/amd64 ... Done
+ Collect host information
  - Scraping log files on 10.0.2.15:22 ... ⠋ CopyComponent: component=diag, version=, remote=10.0.2.15:/tmp/tiup os=linux, arch=amd64
+ Collect host information
  - Scraping log files on 10.0.2.15:22 ... Done
Detecting dm audit logs of components...

+ Collect TiUP dm audit log information
  - Scraping TiUP dm audit log ... Done
Time range:
  2022-05-02T07:12:52Z - 2022-05-02T09:12:52Z (Local)
  2022-05-02T07:12:52Z - 2022-05-02T09:12:52Z (UTC)
  (total 7200 seconds)

Estimated size of data to collect:
Host       Size       Target
----       ----       ------
10.0.2.15  392.19 kB  /home/tidb/deploy/dm-worker-8262/log/dm-worker.log
10.0.2.15  106.28 kB  /home/tidb/deploy/dm-worker-8262/log/dm-worker_stderr.log
10.0.2.15  11.08 kB   /home/tidb/deploy/dm-worker-8262/log/dm-worker_stdout.log
10.0.2.15  383.89 kB  /home/tidb/deploy/dm-master-8261/log/dm-master.log
10.0.2.15  1.80 kB    /home/tidb/deploy/dm-master-8261/log/dm-master_stderr.log
10.0.2.15  330 B      /home/tidb/deploy/dm-worker-8262/conf/dm-worker.toml
10.0.2.15  345 B      /home/tidb/deploy/dm-master-8261/conf/dm-master.toml
localhost  2.30 kB    1 TiUP dm audit logs
Total      898.20 kB  (inaccurate)
These data will be stored in /home/vagrant/diag-fSwQn7ZDb6f
Do you want to continue? [y/N]: (default=N) y
Collecting metadata of the cluster...

Error collecting metadata of the cluster: no endpoint available, the data might be incomplete.
Collecting alert lists from Prometheus node...

No monitoring node (prometheus) found in topology, skip.
Collecting metrics from Prometheus node...

No Prometheus node found in topology, skip.
Collecting basic system information of servers...

+ Download necessary tools
  - Downloading check tools for linux/amd64 ... Done
+ Collect host information
+ Collect host information
  - Getting system info of 10.0.2.15:22 ... Done

+ Collect system information
  - Collecting system info of node 10.0.2.15 ... Done
+ Cleanup temp files
  - Cleanup temp files on 10.0.2.15:22 ... Done
  - Cleanup temp files on 10.0.2.15:22 ... Done
Collecting logs of components...

+ Scrap files on nodes
  - Downloading log files from node 10.0.2.15 ... Done
+ Cleanup temp files
  - Cleanup temp files on 10.0.2.15:22 ... Done
Collecting config files of components...

+ Scrap files on nodes
  - Downloading config files from node 10.0.2.15 ... Done
+ Cleanup temp files
  - Cleanup temp files on 10.0.2.15:22 ... Done


+ Query realtime configs
  - Querying configs for tikv 10.0.2.15:8261 ... Error
  - Querying configs for tikv 10.0.2.15:8262 ... Error
Error collecting config files of components: Get "http:?full=true": http: no Host in request URL, the data might be incomplete.
Collecting dm audit logs of components...

+ Scrap TiUP audit logs
  - copy TiUP dm audit log files ... Done
Some errors occurred during the process, please check if data needed are complete:
metadata of the cluster:        no endpoint available

config files of components:     Get "http:?full=true": http: no Host in request URL

Collected data are stored in /home/vagrant/diag-fSwQn7ZDb6f
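In a long log like this one, the steps that came back incomplete are easy to miss. As a small aid, the sketch below filters saved Diag output for them. It depends on the "Error collecting ..., the data might be incomplete." wording seen in the log above; other Diag versions may phrase it differently, and the function name is my own.

```shell
#!/bin/sh
# Read saved Diag output on stdin and print each step that reported incomplete data,
# together with the reason. Relies on the wording seen in the Diag v0.7.0 log above.
incomplete_steps() {
  sed -n 's/^Error collecting \(.*\), the data might be incomplete\.$/\1/p'
}
```

If the output was saved to a file, say collect.log (a hypothetical name), then `incomplete_steps < collect.log` lists only the affected steps, one per line.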

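3.5 Collect TiFlash data

Clinic's collection of TiFlash information is integrated into TiDB cluster collection, so tiup diag collect picks it up.

# Scale out TiFlash
tiup cluster scale-out cluster111 ./scale-out-${nodename}.yml -uroot -p 
# For local testing, the machine needs more than 4 GB of memory

## Create TiFlash replicas for a whole database
ALTER DATABASE db_name SET TIFLASH REPLICA count;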
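After requesting replicas, it helps to confirm they are actually available before collecting TiFlash data. The sketch below only builds the two SQL statements, the ALTER DATABASE shown above and a progress query against information_schema.tiflash_replica, so they can be piped into any MySQL-compatible client. The function names are hypothetical, and the database name is not escaped, so treat this as an illustration only.

```shell
#!/bin/sh
# Print the SQL that requests COUNT TiFlash replicas for every table in a database.
tiflash_replica_sql() {
  case $2 in ''|*[!0-9]*) echo "replica count must be a number: $2" >&2; return 1 ;; esac
  printf 'ALTER DATABASE `%s` SET TIFLASH REPLICA %s;\n' "$1" "$2"
}

# Print the SQL that lists replica availability and progress for that database.
tiflash_progress_sql() {
  printf "SELECT TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = '%s';\n" "$1"
}
```

For example, `tiflash_progress_sql test | mysql -h 127.0.0.1 -P 4000 -u root -p` would show the progress for a database named test, assuming TiDB listens on its default port 4000 on the local machine.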

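3.6 Collect TiCDC data

Clinic's collection of TiCDC information is also integrated into TiDB cluster collection, so tiup diag collect picks it up.

# Scale out TiCDC
tiup cluster scale-out cluster111 ./scale-out-${nodename}.yml -uroot -p 

# Note: TiCDC nodes added by running cdc server directly cannot be collected by Clinic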

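4. Information security

1. The diagnostic data that Clinic collects includes configuration, topology, and logs. For details, see https://docs.pingcap.com/zh/tidb/v6.0/clinic-data-instruction-for-tiup

2. Data collected by PingCAP Clinic from clusters deployed with TiUP is used only for diagnosing and analyzing cluster problems.

3. Clinic uploads data to Clinic Server over an authenticated or encrypted channel. Clinic Server is a cloud service deployed on PingCAP's internal network (within China), and only authorized internal technical staff can access the data.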

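5. Summary

1. Clinic simplifies log collection and assisted analysis. Many thanks to PingCAP for this tool!

2. What I hope to see in Clinic

A way to collect and upload information when PD is down, that is, collection and analysis for clusters in an abnormal state.

3. After this post was published, a TiDB expert told me that tiup diag collect -R=(component) collects information for a specified component only. That is a great feature!

Thanks again to PingCAP and the TiDB community!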

6. References

https://asktug.com/t/topic/272957 [SOP Series 22] TiDB cluster diagnostic information collection: Clinic user guide and resource roundup
https://docs.pingcap.com/zh/tidb/v6.0/clinic-data-instruction-for-tiup Clinic data collection description
https://docs.pingcap.com/zh/tidb/v6.0/quick-start-with-clinic Quick start guide
https://asktug.com/t/topic/664214 User guide
