
[k8s] Reinstalling a Kubernetes cluster with reset and init

피타챈 2024. 7. 11. 15:24

A Kubernetes environment that had been running fine stopped responding to kubectl commands after the server was rebooted.

The connection to the server {ip}:6443 was refused - did you specify the right host or port?

Since it was an already-configured k8s environment, I was afraid of tangling things up even more and wasn't sure where to even start.

Here is a record of the init process. The overall flow goes like this! So easy~

  • Master node: reset - init - set up kube config - apply Calico
  • Worker node: reset - stop - join

Checking the Kubernetes status

Kubernetes restarts automatically when the server reboots, so to figure out where the connection refused was coming from, I checked the kubelet service status first.

# Check the kubelet service status
sudo systemctl status kubelet

# Check node status
kubectl get nodes

I also ran describe to see what it reported, but I don't remember what it said. Next I tried restarting kubelet.

# Restart kubelet
sudo systemctl restart kubelet

Even after the restart there was an error telling me to fix some config file, as I recall. I decided it would be better to wipe the config under .kube and start over. I had never done the installation myself, so this was a good chance to learn by reinitializing step by step.
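Before resetting anything, the kubelet logs are usually where the real error shows up. These are generic systemd/journalctl checks I would run in this situation, not something recorded from the original troubleshooting:

# Tail recent kubelet logs to see why the API server isn't coming up
sudo journalctl -u kubelet -n 100 --no-pager

# Check whether anything is actually listening on the API server port (6443)
sudo ss -tlnp | grep 6443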

Preparation before reinstalling

This environment already had a k8s master node and worker nodes in use, so kubeadm, kubectl, and the other tooling should already be installed and runnable.

Master node: kubeadm reset - tearing down

Think of this as removing the Kubernetes components and wiping the cluster data.

$ sudo kubeadm reset

Running the reset command immediately hit a conflict: kubeadm found more than one CRI endpoint on the host.

master@worker-a:~# kubeadm reset
Found multiple CRI endpoints on the host. Please define which one do you wish to use by setting the 'criSocket' field in the kubeadm configuration file: unix:///var/run/containerd/containerd.sock, unix:///var/run/crio/crio.sock
To see the stack trace of this error execute with --v=5 or higher
  • /var/run/containerd/containerd.sock
  • /var/run/crio/crio.sock

These two sockets are what conflict. Our environment uses CRI-O, not containerd, so containerd.sock is the one that isn't needed.

I thought about just deleting it, but since I don't know Kubernetes that well I got nervous and backed it up instead. In hindsight it probably would have been fine to delete.

sudo cp -r /var/run/containerd/containerd.sock /var/run/containerd/containerd.sock.bak
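Looking back, instead of touching the socket files at all, the cleaner way to get past this prompt is probably to tell kubeadm explicitly which CRI socket to use, the same flag the init command uses later. This is a standard kubeadm flag, not what I actually ran at the time:

sudo kubeadm reset --cri-socket unix:///var/run/crio/crio.sock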

 

Once the output looks like the following, the reset is done.

master@worker-a:~$ sudo kubeadm reset
W0711 10:57:28.803530  542527 preflight.go:56] [reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W0711 10:57:29.975477  542527 removeetcdmember.go:106] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d

The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.

If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
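As the output says, reset leaves the CNI config, iptables rules, and kubeconfig behind. For a completely clean slate, the cleanup amounts to something like the commands below (put together from the hints in the messages above; only run them if you are sure you don't need the old settings):

# Remove leftover CNI configuration
sudo rm -rf /etc/cni/net.d

# Flush iptables rules left behind by kube-proxy
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X

# Remove the stale kubeconfig
rm -f $HOME/.kube/config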

Master node: kubeadm init - initializing

Now let's initialize a new Kubernetes master node to set up and start the cluster.

sudo kubeadm init --apiserver-advertise-address {master node IP} --cri-socket /var/run/crio/crio.sock
W0711 13:30:45.559854  647470 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/var/run/crio/crio.sock". Please update your configuration!
I0711 13:30:46.073064  647470 version.go:256] remote version is much newer: v1.30.2; falling back to: stable-1.28
[init] Using Kubernetes version: v1.28.11
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local worker-a] and IPs [10.96.0.1 192.168.15.123]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost worker-a] and IPs [192.168.15.123 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [localhost worker-a] and IPs [192.168.15.123 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 3.502944 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node worker-a as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node worker-a as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]
[bootstrap-token] Using token: jivx2l.et7v3likd765no88
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join {ip}:6443 --token {token} \
	--discovery-token-ca-cert-hash {hash}

The token and hash printed at the end are what the worker nodes use to join, so make a note of them.
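If you lose them (the token also expires after a while), there is no need to re-run init; kubeadm can print a fresh join command. This is a standard kubeadm subcommand, noted here just for reference:

# Run on the master node to generate a new token and print the full join command
kubeadm token create --print-join-command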

And that's the end of the master node setup~!!!!! ...except it isn't. To actually use kubectl, run the following:

master@worker-a:~$ sudo mkdir -p $HOME/.kube
master@worker-a:~$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
master@worker-a:~$ sudo chown $(id -u):$(id -g) $HOME/.kube/config

The Calico pods seem to have gone down with the reboot (and the reset) as well. Let's apply Calico again.

kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/calico.yaml
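Before joining the workers, it's worth confirming that the Calico pods actually reach Running; plain kubectl is enough for that (nothing specific to this environment):

# Watch the kube-system pods until calico-node and calico-kube-controllers are Running
kubectl get pods -n kube-system -w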

Joining the worker nodes

On each worker node, run the join command that was printed at the end of init on the master, with sudo.
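As a rough sketch of the worker-side sequence from the overview at the top (reset - stop - join): the CRI socket matches this environment's CRI-O setup, I'm taking "stop" to mean stopping kubelet, and the token/hash placeholders come from the init output, so adjust everything for your own cluster:

# On each worker node
sudo kubeadm reset --cri-socket unix:///var/run/crio/crio.sock
sudo systemctl stop kubelet
sudo kubeadm join {ip}:6443 --token {token} --discovery-token-ca-cert-hash {hash}

Running the join produces output like the following: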

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Now running kubectl on the master node lists both the master node and the worker node.

master@worker-a:~$ kubectl get nodes
NAME        STATUS   ROLES           AGE    VERSION
worker-a    Ready    control-plane   108m   v1.28.2
workeri9a   Ready    <none>          108m   v1.28.2

Also check that all the kube-system pods are Running normally.

master@worker-a:~$ kubectl get pods -A
NAMESPACE      NAME                                     READY   STATUS    RESTARTS   AGE
kube-system    calico-kube-controllers-77bd7c5b-vsncb   1/1     Running   0          77m
kube-system    calico-node-45sj4                        1/1     Running   0          77m
kube-system    calico-node-4j56r                        1/1     Running   0          77m
kube-system    coredns-5dd5756b68-6l7p6                 1/1     Running   0          112m
kube-system    coredns-5dd5756b68-6qsp2                 1/1     Running   0          112m
kube-system    etcd-worker-a                            1/1     Running   4          112m
kube-system    kube-apiserver-worker-a                  1/1     Running   4          112m
kube-system    kube-controller-manager-worker-a         1/1     Running   5          112m
kube-system    kube-proxy-g685s                         1/1     Running   0          112m
kube-system    kube-proxy-l2dpv                         1/1     Running   0          112m
kube-system    kube-scheduler-worker-a                  1/1     Running   5          112m

That was practically a full reinstall lol. So annoying, and I had plenty of other work to do. Is this going to happen every time the server reboots...?

I never got to do a proper root-cause analysis, but my guess is that the config got modified at reboot time, or after someone updated kubectl, and that this is what can lead to the connection refused error.

The setting to block updates at reboot time should already be in place, and manual updates can't really be prevented, so maybe I just need to check with kubectl version every time I SSH in.
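One way to keep package updates from touching the Kubernetes binaries at all, assuming these nodes installed them via apt (my assumption about this setup), is to pin the versions:

# Pin the Kubernetes packages so apt upgrades don't bump them behind your back
sudo apt-mark hold kubelet kubeadm kubectl

# Quick version check when SSHing in
kubectl version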