- First install the NVIDIA driver; if nvidia-smi produces output, the installation succeeded.
- Install docker; any recent version is fine (1.13.1 is used here).
- Install nvidia-docker: https://github.com/NVIDIA/nvidia-docker
- Install kubernetes (1.13.2 here). Pull the required images from the jicki mirror and retag them under the k8s.gcr.io names that kubeadm expects:
docker pull jicki/kube-proxy:v1.13.2
docker pull jicki/kube-controller-manager:v1.13.2
docker pull jicki/kube-scheduler:v1.13.2
docker pull jicki/kube-apiserver:v1.13.2
docker pull jicki/coredns:1.2.6
docker pull jicki/cluster-proportional-autoscaler-amd64:1.3.0
docker pull jicki/kubernetes-dashboard-amd64:v1.10.0
docker pull jicki/etcd:3.2.24
docker pull jicki/node:v3.1.3
docker pull jicki/ctl:v3.1.3
docker pull jicki/kube-controllers:v3.1.3
docker pull jicki/cni:v3.1.3
docker pull jicki/pause:3.1
docker pull jicki/pause-amd64:3.1
docker pull quay.io/coreos/flannel:v0.10.0-arm
docker pull quay.io/coreos/flannel:v0.10.0-ppc64le
docker pull quay.io/coreos/flannel:v0.10.0-s390x
docker tag jicki/kube-proxy:v1.13.2 k8s.gcr.io/kube-proxy:v1.13.2
docker tag jicki/kube-controller-manager:v1.13.2 k8s.gcr.io/kube-controller-manager:v1.13.2
docker tag jicki/kube-scheduler:v1.13.2 k8s.gcr.io/kube-scheduler:v1.13.2
docker tag jicki/kube-apiserver:v1.13.2 k8s.gcr.io/kube-apiserver:v1.13.2
docker tag jicki/coredns:1.2.6 k8s.gcr.io/coredns:1.2.6
docker tag jicki/cluster-proportional-autoscaler-amd64:1.3.0 k8s.gcr.io/cluster-proportional-autoscaler-amd64:1.3.0
docker tag jicki/kubernetes-dashboard-amd64:v1.10.0 k8s.gcr.io/kubernetes-dashboard-amd64:v1.10.0
docker tag jicki/etcd:3.2.24 k8s.gcr.io/etcd:3.2.24
docker tag jicki/node:v3.1.3 k8s.gcr.io/node:v3.1.3
docker tag jicki/ctl:v3.1.3 k8s.gcr.io/ctl:v3.1.3
docker tag jicki/kube-controllers:v3.1.3 k8s.gcr.io/kube-controllers:v3.1.3
docker tag jicki/cni:v3.1.3 k8s.gcr.io/cni:v3.1.3
docker tag jicki/pause:3.1 k8s.gcr.io/pause:3.1
docker tag jicki/pause-amd64:3.1 k8s.gcr.io/pause-amd64:3.1
docker rmi jicki/kube-proxy:v1.13.2
docker rmi jicki/kube-controller-manager:v1.13.2
docker rmi jicki/kube-scheduler:v1.13.2
docker rmi jicki/kube-apiserver:v1.13.2
docker rmi jicki/coredns:1.2.6
docker rmi jicki/cluster-proportional-autoscaler-amd64:1.3.0
docker rmi jicki/kubernetes-dashboard-amd64:v1.10.0
docker rmi jicki/etcd:3.2.24
docker rmi jicki/node:v3.1.3
docker rmi jicki/ctl:v3.1.3
docker rmi jicki/kube-controllers:v3.1.3
docker rmi jicki/cni:v3.1.3
docker rmi jicki/pause:3.1
docker rmi jicki/pause-amd64:3.1
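The pull/tag/rmi sequence above can be collapsed into one loop, a sketch; `mirror_k8s_images` is our own name, and the runner command is a parameter so the loop can be dry-run with `echo` instead of `docker`:

```shell
#!/bin/sh
# Sketch: mirror the jicki images to the k8s.gcr.io names in one loop.
mirror_k8s_images() {
    runner="${1:-docker}"   # pass `echo` for a dry run
    for img in \
        kube-proxy:v1.13.2 kube-controller-manager:v1.13.2 \
        kube-scheduler:v1.13.2 kube-apiserver:v1.13.2 \
        coredns:1.2.6 cluster-proportional-autoscaler-amd64:1.3.0 \
        kubernetes-dashboard-amd64:v1.10.0 etcd:3.2.24 \
        node:v3.1.3 ctl:v3.1.3 kube-controllers:v3.1.3 cni:v3.1.3 \
        pause:3.1 pause-amd64:3.1; do
        "$runner" pull "jicki/$img"                     # fetch from the mirror
        "$runner" tag "jicki/$img" "k8s.gcr.io/$img"    # retag for kubeadm
        "$runner" rmi "jicki/$img"                      # drop the mirror tag
    done
}
# dry run:  mirror_k8s_images echo
# real run: mirror_k8s_images
```

The flannel arch-specific images (arm, ppc64le, s390x) are pulled under their real names and need no retagging, so they stay outside the loop.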
- Set up the cluster
- Enable the kubelet service and initialize the cluster:
systemctl enable kubelet.service
kubeadm init --kubernetes-version=v1.13.2 --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=<your_ip>
- apply flannel:
- Remove the master taint so the master can also schedule pods:
kubectl taint nodes node1 node-role.kubernetes.io/master-
- Install the NVIDIA device plugin. The plugin is deployed as a DaemonSet and uses labels to select the nodes that have GPUs:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml
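Once the plugin's DaemonSet pods are running, each GPU node should report an nvidia.com/gpu capacity. A small helper to pull that line out of `kubectl describe node` output (`gpu_count` is our own name, not a k8s tool):

```shell
#!/bin/sh
# Sketch: extract the nvidia.com/gpu count from a node description on stdin.
gpu_count() {
    grep 'nvidia.com/gpu' | head -1 | awk '{print $2}'
}
# usage: kubectl describe node node1 | gpu_count
```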
- Test with a pod that requests one GPU:
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
spec:
  volumes:
  - hostPath:
      path: /usr/lib64/nvidia
    name: lib
  containers:
  - env:
    - name: TEST
      value: "GPU"
    imagePullPolicy: Always
    name: gpu-container-1
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - mountPath: /usr/local/nvidia/lib64
      name: lib
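A sketch for running the test: save the pod manifest above as gpu-test.yaml (the filename is our choice), apply it, and check the GPU from inside the pod. The kubectl command is a parameter so the sequence can be dry-run with `echo`:

```shell
#!/bin/sh
# Sketch: apply the test pod and verify the allocated GPU is visible inside it.
run_gpu_test() {
    kubectl_cmd="${1:-kubectl}"   # pass `echo` for a dry run
    "$kubectl_cmd" apply -f gpu-test.yaml
    "$kubectl_cmd" wait --for=condition=Ready pod/test-gpu --timeout=120s
    "$kubectl_cmd" exec test-gpu -- nvidia-smi   # should list the allocated GPU
}
```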
- Note: tutorials say to edit docker's daemon.json to add a default runtime; none of that is needed, the latest docker versions already have it.
- Many tutorials also say to set the kubelet flag Accelerators=true. This is not needed either; on 1.13.2, setting it makes startup fail with "unrecognized feature gate: Accelerators", because that gate was deprecated after 1.11.
- Problems:
- Jobs failed with CUDA_ERROR_NO_DEVICE. Searches point to two causes: a bad CUDA_VISIBLE_DEVICES setting, or a broken CUDA driver install. It turned out the program itself set CUDA_VISIBLE_DEVICES=1. CUDA_VISIBLE_DEVICES lists which GPU indices are visible to the program, but when k8s schedules the pod it mounts an arbitrary device from /dev into the container; if the mounted device is not the one CUDA_VISIBLE_DEVICES names, initialization fails. Changing CUDA_VISIBLE_DEVICES to 0,1,2,3 made it run.
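If the variable cannot be changed inside the program, it can also be set from the pod spec; a sketch, with the value mirroring the fix above:

```yaml
# fragment for the container spec (our suggestion, not from the original fix)
env:
- name: CUDA_VISIBLE_DEVICES   # GPU indices visible to the program
  value: "0,1,2,3"
```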

- When running jobs, wanted to use the automatic cleanup feature spec.ttlSecondsAfterFinished, but it did not take effect, and kubectl get job xxx -o json showed the field was never set. It turns out the TTLAfterFinished feature gate must be enabled:
- kubelet needs the feature gate:
- vim /etc/sysconfig/kubelet
- KUBELET_EXTRA_ARGS=--feature-gates=TTLAfterFinished=true
- systemctl daemon-reload
- systemctl restart kubelet.service
- apiserver, controller-manager and scheduler all need --feature-gates=TTLAfterFinished=true (kubeadm generates static Pod manifests for the API server, controller manager and scheduler and places them in /etc/kubernetes/manifests):
- cd /etc/kubernetes/manifests
- Edit kube-apiserver.yaml, kube-controller-manager.yaml and kube-scheduler.yaml to add the flag
- Restart: kubectl get po -n kube-system and delete the corresponding pods; the static pods are recreated automatically
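With the gate enabled everywhere, a job using the cleanup feature looks like this; a sketch, where the name, image and TTL value are our own:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-demo                  # hypothetical name
spec:
  ttlSecondsAfterFinished: 100    # delete the Job 100s after it finishes
  template:
    spec:
      containers:
      - name: demo
        image: busybox
        command: ["echo", "done"]
      restartPolicy: Never
```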