
Kubernetes Rolling Upgrade



Background

Kubernetes is an excellent tool for managing clusters of containerized applications. Especially with objects such as the ReplicaSet, which automatically maintain application lifecycle events, it brings the craft of container application management into full play. Among the many features of container application management, the one that best demonstrates Kubernetes' powerful cluster management ability is the rolling upgrade.

The essence of a rolling upgrade is that the service stays continuously available throughout the upgrade, so the process is invisible to the outside world. The whole process passes through three states: all old instances, a mixture of old and new instances, and all new instances. The number of old instances gradually decreases while the number of new instances increases, until finally the old count reaches 0 and the new count reaches the desired target.

[Figure: Kubernetes rolling upgrade]

Kubernetes uses the ReplicaSet (RS for short) to manage Pod instances. If the number of Pod instances currently in the cluster is below the target, the RS starts new Pods; otherwise it deletes the surplus Pods according to its policy. The Deployment exploits exactly this behavior: by controlling the Pods in two RSs, it implements an upgrade.
A rolling upgrade is a smooth, gradual upgrade during which the service remains available; this is a key step for Kubernetes as a manager of applications as services. Services everywhere, consumed on demand: that is the original intent of cloud computing, and for a PaaS platform, abstracting applications into services spread across the cluster and making them available anywhere at any time is its ultimate mission.
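To make the transition concrete, here is a toy Go program (an illustration only, not Kubernetes code) that mimics the three states above: the new instance count grows within a surge budget while the old count shrinks, and the total never drops below the desired value.

package main

import "fmt"

func main() {
    const desired = 5  // target replica count
    const maxSurge = 1 // at most desired+maxSurge pods may exist at once

    oldReplicas, newReplicas := desired, 0
    for oldReplicas > 0 || newReplicas < desired {
        if newReplicas < desired && oldReplicas+newReplicas < desired+maxSurge {
            // Surge budget available: bring up one new pod.
            newReplicas++
        } else if oldReplicas > 0 {
            // Otherwise retire one old pod.
            oldReplicas--
        }
        fmt.Printf("old=%d new=%d total=%d\n", oldReplicas, newReplicas, oldReplicas+newReplicas)
    }
}

Running it prints the progression from all-old (5/0) through the mixed states to all-new (0/5), with the total staying between 5 and 6 throughout.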
1. ReplicaSet
The concept of the RS is familiar to everyone, so let's look at the RS in the Kubernetes source code.

type ReplicaSetController struct {
    kubeClient clientset.Interface
    podControl controller.PodControlInterface

    // internalPodInformer is used to hold a personal informer.  If we're using
    // a normal shared informer, then the informer will be started for us.  If
    // we have a personal informer, we must start it ourselves.   If you start
    // the controller using NewReplicationManager(passing SharedInformer), this
    // will be null
    internalPodInformer framework.SharedIndexInformer

    // A ReplicaSet is temporarily suspended after creating/deleting these many replicas.
    // It resumes normal action after observing the watch events for them.
    burstReplicas int
    // To allow injection of syncReplicaSet for testing.
    syncHandler func(rsKey string) error

    // A TTLCache of pod creates/deletes each rc expects to see.
    expectations *controller.UIDTrackingControllerExpectations

    // A store of ReplicaSets, populated by the rsController
    rsStore cache.StoreToReplicaSetLister
    // Watches changes to all ReplicaSets
    rsController *framework.Controller
    // A store of pods, populated by the podController
    podStore cache.StoreToPodLister
    // Watches changes to all pods
    podController framework.ControllerInterface
    // podStoreSynced returns true if the pod store has been synced at least once.
    // Added as a member to the struct to allow injection for testing.
    podStoreSynced func() bool

    lookupCache *controller.MatchingCache

    // Controllers that need to be synced
    queue *workqueue.Type

    // garbageCollectorEnabled denotes if the garbage collector is enabled. RC
    // manager behaves differently if GC is enabled.
    garbageCollectorEnabled bool
}

This struct lives in pkg/controller/replicaset. Here we can pick out the most important objects of the RS. One is podControl, the object that operates on Pods; as the name suggests, it controls the lifecycle of the Pods under the RS. Let's look at the methods this PodControl contains.

// PodControlInterface is an interface that knows how to add or delete pods
// created as an interface to allow testing.
type PodControlInterface interface {
    // CreatePods creates new pods according to the spec.
    CreatePods(namespace string, template *api.PodTemplateSpec, object runtime.Object) error
    // CreatePodsOnNode creates a new pod according to the spec on the specified node.
    CreatePodsOnNode(nodeName, namespace string, template *api.PodTemplateSpec, object runtime.Object) error
    // CreatePodsWithControllerRef creates new pods according to the spec, and sets object as the pod's controller.
    CreatePodsWithControllerRef(namespace string, template *api.PodTemplateSpec, object runtime.Object, controllerRef *api.OwnerReference) error
    // DeletePod deletes the pod identified by podID.
    DeletePod(namespace string, podID string, object runtime.Object) error
    // PatchPod patches the pod.
    PatchPod(namespace, name string, data []byte) error
}

Here we can see that the RS has full control over its Pods. There are two watches, rsController and podController, which respectively watch etcd for changes to RSs and to Pods. One important object must be mentioned: syncHandler, which every controller has. Each controller watches etcd for changes and uses sync to bring the objects' state in line; note that this handler is only a delegate, and the actual handler is assigned when the controller is created. This pattern is not unique to the RS; other controllers work the same way.
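As an aside, the delegation can be sketched in a few lines of Go (the names here are invented for illustration; only the pattern matches the source): the controller stores a function field, and the constructor binds it to the real method so that tests can substitute a fake.

package main

import "fmt"

// fooController is an illustrative stand-in for a controller: syncHandler
// is a function field rather than a method, so it can be swapped in tests.
type fooController struct {
    syncHandler func(key string) error
}

// newFooController binds the delegate to the real implementation, mirroring
// how the ReplicaSet controller assigns rsc.syncHandler at construction.
func newFooController() *fooController {
    c := &fooController{}
    c.syncHandler = c.syncFoo
    return c
}

func (c *fooController) syncFoo(key string) error {
    // Reconcile the object identified by key against the desired state.
    fmt.Println("syncing", key)
    return nil
}

func main() {
    c := newFooController()
    _ = c.syncHandler("default/my-rs")
}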
The code below shows the watch logic more clearly.

rsc.rsStore.Store, rsc.rsController = framework.NewInformer(
    &cache.ListWatch{
        ListFunc: func(options api.ListOptions) (runtime.Object, error) {
            return rsc.kubeClient.Extensions().ReplicaSets(api.NamespaceAll).List(options)
        },
        WatchFunc: func(options api.ListOptions) (watch.Interface, error) {
            return rsc.kubeClient.Extensions().ReplicaSets(api.NamespaceAll).Watch(options)
        },
    },
    &extensions.ReplicaSet{},
    // TODO: Can we have much longer period here?
    FullControllerResyncPeriod,
    framework.ResourceEventHandlerFuncs{
        AddFunc:    rsc.enqueueReplicaSet,
        UpdateFunc: rsc.updateRS,
        // This will enter the sync loop and no-op, because the replica set has been deleted from the store.
        // Note that deleting a replica set immediately after scaling it to 0 will not work. The recommended
        // way of achieving this is by performing a `stop` operation on the replica set.
        DeleteFunc: rsc.enqueueReplicaSet,
    },
)

Each time a change to an object is watched from etcd, the controller takes the corresponding action, which concretely means putting the key into the work queue, updating it, or popping it from the queue. Pods have corresponding handlers too.

podInformer.AddEventHandler(framework.ResourceEventHandlerFuncs{
        AddFunc: rsc.addPod,
        // This invokes the ReplicaSet for every pod change, eg: host assignment. Though this might seem like
        // overkill the most frequent pod update is status, and the associated ReplicaSet will only list from
        // local storage, so it should be ok.
        UpdateFunc: rsc.updatePod,
        DeleteFunc: rsc.deletePod,
    })
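The handlers above only enqueue keys; worker goroutines drain the queue and invoke syncHandler. A simplified sketch of that consuming side (not the verbatim source; it assumes only the Get/Done API of the workqueue.Type field shown in the struct earlier):

func (rsc *ReplicaSetController) worker() {
    for {
        key, quit := rsc.queue.Get()
        if quit {
            return
        }
        // syncHandler reconciles the ReplicaSet identified by key.
        if err := rsc.syncHandler(key.(string)); err != nil {
            glog.Errorf("error syncing replica set %v: %v", key, err)
        }
        // Done tells the queue that processing of this key has finished.
        rsc.queue.Done(key)
    }
}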

That is the essence of the RS. Above the RS sits the Deployment, which is also a controller.

// DeploymentController is responsible for synchronizing Deployment objects stored
// in the system with actual running replica sets and pods.
type DeploymentController struct {
    client        clientset.Interface
    eventRecorder record.EventRecorder

    // To allow injection of syncDeployment for testing.
    syncHandler func(dKey string) error

    // A store of deployments, populated by the dController
    dStore cache.StoreToDeploymentLister
    // Watches changes to all deployments
    dController *framework.Controller
    // A store of ReplicaSets, populated by the rsController
    rsStore cache.StoreToReplicaSetLister
    // Watches changes to all ReplicaSets
    rsController *framework.Controller
    // A store of pods, populated by the podController
    podStore cache.StoreToPodLister
    // Watches changes to all pods
    podController *framework.Controller

    // dStoreSynced returns true if the Deployment store has been synced at least once.
    // Added as a member to the struct to allow injection for testing.
    dStoreSynced func() bool
    // rsStoreSynced returns true if the ReplicaSet store has been synced at least once.
    // Added as a member to the struct to allow injection for testing.
    rsStoreSynced func() bool
    // podStoreSynced returns true if the pod store has been synced at least once.
    // Added as a member to the struct to allow injection for testing.
    podStoreSynced func() bool

    // Deployments that need to be synced
    queue workqueue.RateLimitingInterface
}

The DeploymentController needs to watch Deployments, RSs, and Pods, as can be seen from how the controller is created.

    dc.dStore.Store, dc.dController = framework.NewInformer(
        &cache.ListWatch{
            ListFunc: func(options api.ListOptions) (runtime.Object, error) {
                return dc.client.Extensions().Deployments(api.NamespaceAll).List(options)
            },
            WatchFunc: func(options api.ListOptions) (watch.Interface, error) {
                return dc.client.Extensions().Deployments(api.NamespaceAll).Watch(options)
            },
        },
        &extensions.Deployment{},
        FullDeploymentResyncPeriod,
        framework.ResourceEventHandlerFuncs{
            AddFunc:    dc.addDeploymentNotification,
            UpdateFunc: dc.updateDeploymentNotification,
            // This will enter the sync loop and no-op, because the deployment has been deleted from the store.
            DeleteFunc: dc.deleteDeploymentNotification,
        },
    )

    dc.rsStore.Store, dc.rsController = framework.NewInformer(
        &cache.ListWatch{
            ListFunc: func(options api.ListOptions) (runtime.Object, error) {
                return dc.client.Extensions().ReplicaSets(api.NamespaceAll).List(options)
            },
            WatchFunc: func(options api.ListOptions) (watch.Interface, error) {
                return dc.client.Extensions().ReplicaSets(api.NamespaceAll).Watch(options)
            },
        },
        &extensions.ReplicaSet{},
        resyncPeriod(),
        framework.ResourceEventHandlerFuncs{
            AddFunc:    dc.addReplicaSet,
            UpdateFunc: dc.updateReplicaSet,
            DeleteFunc: dc.deleteReplicaSet,
        },
    )

    dc.podStore.Indexer, dc.podController = framework.NewIndexerInformer(
        &cache.ListWatch{
            ListFunc: func(options api.ListOptions) (runtime.Object, error) {
                return dc.client.Core().Pods(api.NamespaceAll).List(options)
            },
            WatchFunc: func(options api.ListOptions) (watch.Interface, error) {
                return dc.client.Core().Pods(api.NamespaceAll).Watch(options)
            },
        },
        &api.Pod{},
        resyncPeriod(),
        framework.ResourceEventHandlerFuncs{
            AddFunc:    dc.addPod,
            UpdateFunc: dc.updatePod,
            DeleteFunc: dc.deletePod,
        },
        cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc},
    )

    dc.syncHandler = dc.syncDeployment
    dc.dStoreSynced = dc.dController.HasSynced
    dc.rsStoreSynced = dc.rsController.HasSynced
    dc.podStoreSynced = dc.podController.HasSynced

The core of all this is syncDeployment, because that is where rollingUpdate and rollback are implemented. If the RollbackTo.Revision of a watched Deployment object is not nil, a rollback is performed. This Revision is the version number; note that although this is a rollback, the version number recorded inside Kubernetes always increases.
Some may wonder how rollback is achieved. The principle is actually simple: Kubernetes records the PodTemplate of every revision, so overwriting the new template with the old PodTemplate is all it takes.
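A hedged sketch of that idea (illustrative, not the exact source code): copy the template recorded in the old revision's RS back onto the Deployment and clear the rollback request; the normal reconcile loop then rolls the cluster toward the restored template.

func rollbackToRS(d *extensions.Deployment, oldRS *extensions.ReplicaSet) {
    // Restore the old revision's pod template.
    d.Spec.Template = oldRS.Spec.Template
    // Mark the rollback request as handled; note that the revision number
    // recorded on the Deployment still only ever increases.
    d.Spec.RollbackTo = nil
}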
Kubernetes supports two upgrade strategies: one recreates everything from scratch, the other is the rolling upgrade.

switch d.Spec.Strategy.Type {
case extensions.RecreateDeploymentStrategyType:
    return dc.rolloutRecreate(d)
case extensions.RollingUpdateDeploymentStrategyType:
    return dc.rolloutRolling(d)
}
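For reference, the strategy that selects between these two branches might be declared like this in the internal API types used above (the concrete values here are illustrative):

strategy := extensions.DeploymentStrategy{
    Type: extensions.RollingUpdateDeploymentStrategyType,
    RollingUpdate: &extensions.RollingUpdateDeployment{
        // During the update: total pods <= desired+1 and
        // available pods >= desired-1.
        MaxSurge:       intstr.FromInt(1),
        MaxUnavailable: intstr.FromInt(1),
    },
}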

This rolloutRolling contains all the secrets, as we can see here.

func (dc *DeploymentController) rolloutRolling(deployment *extensions.Deployment) error {
    newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(deployment, true)
    if err != nil {
        return err
    }
    allRSs := append(oldRSs, newRS)

    // Scale up, if we can.
    scaledUp, err := dc.reconcileNewReplicaSet(allRSs, newRS, deployment)
    if err != nil {
        return err
    }
    if scaledUp {
        // Update DeploymentStatus
        return dc.updateDeploymentStatus(allRSs, newRS, deployment)
    }

    // Scale down, if we can.
    scaledDown, err := dc.reconcileOldReplicaSets(allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, deployment)
    if err != nil {
        return err
    }
    if scaledDown {
        // Update DeploymentStatus
        return dc.updateDeploymentStatus(allRSs, newRS, deployment)
    }

    dc.cleanupDeployment(oldRSs, deployment)

    // Sync deployment status
    return dc.syncDeploymentStatus(allRSs, newRS, deployment)
}

This does the following things:
1. Find the new RS and the old RSs, and compute the new revision (the maximum of the existing revisions);
2. Scale up the new RS (within the bounds shown in the sketch after this list);
3. Scale down the old RSs;
4. When finished, clean up the old RSs;
5. Sync the Deployment status to etcd.
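A simplified sketch, under standard maxSurge/maxUnavailable semantics, of the bounds the two reconcile steps respect (the function names here are invented for illustration):

// Scale-up head-room for the new RS: the total pod count across all RSs
// may not exceed desired+maxSurge during the update.
func scaleUpLimit(desired, maxSurge, totalPods int32) int32 {
    return desired + maxSurge - totalPods
}

// Scale-down head-room for the old RSs: at least desired-maxUnavailable
// pods must remain available during the update.
func scaleDownLimit(desired, maxUnavailable, availablePods int32) int32 {
    return availablePods - (desired - maxUnavailable)
}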

At this point we understand how rolling upgrades work in Kubernetes. The approach is in fact quite similar to rolling upgrades behind a traditional load balancer, but in a container environment the RS makes it considerably more convenient.

