Setting ConfigMap parameters for large resource backup

For Portworx Backup version 2.5 and later, refer to the documentation at https://docs.portworx.com/portworx-backup-on-prem/.

Clusters with a large number of Kubernetes resources vary widely in their resource and system configurations. To make backups work across this range of configurations, you can tune the ConfigMap parameters described here.

Add the parameters listed in the table below to the `stork-controller-config` ConfigMap in the `kube-system` namespace, and adjust the values to suit your configuration:

| Parameter | Default value | Usage |
| --- | --- | --- |
| `large-resource-size-limit` | 1 MB | `large-resource-size-limit="819200"` sets the size limit to 800 KB (the value is given in bytes). |
| `resource-count-limit` | 500 | `resource-count-limit="200"` |
| `restore-volume-backup-count` | 25 | `restore-volume-backup-count="22"` |
| `restore-volume-sleep-interval` | 20 s | `restore-volume-sleep-interval="1m"` or `restore-volume-sleep-interval="53s"` |
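As a minimal sketch of how these settings could be applied, the Python snippet below builds a JSON merge patch for the ConfigMap and prints a `kubectl patch` command that uses it. It assumes the ConfigMap already exists and that the parameters are stored as string values under `data`, as the usage column suggests. The helper name `build_patch` is hypothetical, and the values are the illustrative ones from the table, not recommendations.

```python
import json
import shlex


def build_patch(params):
    """Return a JSON merge patch that sets the given keys under ConfigMap `data`."""
    # ConfigMap data values must be strings, so coerce every value.
    return json.dumps({"data": {key: str(value) for key, value in params.items()}})


# Illustrative values taken from the usage column of the table.
params = {
    "large-resource-size-limit": 819200,  # bytes (800 KB)
    "resource-count-limit": 200,
    "restore-volume-backup-count": 22,
    "restore-volume-sleep-interval": "1m",
}

patch = build_patch(params)
command = (
    "kubectl patch configmap stork-controller-config -n kube-system "
    "--type merge -p " + shlex.quote(patch)
)
print(command)
```

Review the printed command before running it against a cluster; a merge patch only touches the listed keys and leaves any other entries in the ConfigMap unchanged.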

The behavior of these parameters is explained below:

Large-resource-size-limit: If etcd's message size in your cluster is configured lower than its default value of 1.5 MB, change this parameter's value to match the cluster-wide setting. Specify the new value in bytes.
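Because the value is expressed in bytes, a small conversion helps avoid off-by-a-factor mistakes. The sketch below assumes 1 KB = 1024 bytes, which is consistent with 819200 bytes being 800 KB; the function name `kb_to_bytes` is hypothetical.

```python
def kb_to_bytes(kilobytes):
    """Convert KB (1 KB = 1024 bytes) to the byte count the parameter expects."""
    return int(kilobytes * 1024)


print(kb_to_bytes(800))   # 819200, the example value for an 800 KB limit
print(kb_to_bytes(1024))  # 1048576, i.e. 1 MB
```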

Resource-count-limit: If the number of resources overloads the Kubernetes API server, you may see the following errors in the Stork log, and the backup operation can eventually time out:

```
time="2023-04-22T04:22:49Z" level=debug msg="Monitoring storage nodes"
time="2023-04-22T04:23:55Z" level=warning msg="gatherResourceInChunks: failed to list resources"
time="2023-04-22T04:23:55Z" level=error msg="Error getting resources: the server was unable to return a response in the time allotted, but may still be processing the request" ApplicationBackupName=<application-backup-name> ApplicationBackupUID=<application-backup-uid> Namespace=<namespace-name> ResourceVersion=<resource-version>
time="2023-04-22T04:23:55Z" level=error msg="Error backing up resources: the server was unable to return a response in the time allotted, but may still be processing the request" ApplicationBackupName=<application-backup-name> ApplicationBackupUID=<application-backup-uid> Namespace=<namespace-name> ResourceVersion=<resource-version>
time="2023-04-22T04:23:55Z" level=error msg="Error backing up volumes: the server was unable to return a response in the time allotted, but may still be processing the request" ApplicationBackupName=<application-backup-name> ApplicationBackupUID=<application-backup-uid> Namespace=<namespace-name> ResourceVersion=<resource-version>
```

To troubleshoot this scenario, lower the number of resources queried at a time from the default of 500 to a smaller value, such as 200 or 300.
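Lowering the limit trades fewer resources per query for more queries overall. The sketch below only illustrates that arithmetic; it is not Stork's implementation, and both the function name `request_count` and the total of 10,000 resources are hypothetical.

```python
import math


def request_count(total_resources, count_limit):
    """Number of list requests needed when each request fetches at most `count_limit` resources."""
    if count_limit <= 0:
        raise ValueError("count_limit must be positive")
    return math.ceil(total_resources / count_limit)


total = 10_000  # hypothetical number of resources in the backup scope
for limit in (500, 300, 200):
    print(f"limit={limit}: {request_count(total, limit)} requests")
```

Under these assumptions, a smaller limit produces more, lighter requests, which is the intended effect when individual requests are timing out.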

Restore-volume-backup-count: This parameter defines the number of volumes restored in a single batch. If the restore process fails with a device busy error, one probable cause is that too many PVCs were submitted in a single batch, which the backend storage system could not handle. Here is a sample error message displayed in the user interface for this scenario:
```
Restore failed for volume: cloudsnap Restore id:<restore_id> for <backup-name> did not succeed: [createRestoreDestinationVol, Failed to create restore vol err:Volume (Name: <pvc-name>)] create failed error: Volume is busy on Node-not-assigned, processingNode <node-name>]
```
As a troubleshooting measure, lower this parameter to a value below the default of 25.

Restore-volume-sleep-interval: This parameter sets the time interval between two consecutive batches of volumes being restored. Increase it above the default of 20 s to allow more time between restore batches.
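To get a rough feel for how the batch count and the sleep interval interact, the sketch below computes the number of batches and the total time spent waiting between them. It assumes the wait happens only between consecutive batches, which is an assumption about the behavior rather than something stated here; the function name `restore_schedule` and the volume count are hypothetical.

```python
import math


def restore_schedule(volume_count, batch_size, sleep_seconds):
    """Return (number of batches, total seconds spent sleeping between batches)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = math.ceil(volume_count / batch_size)
    # Assumption: the sleep happens between consecutive batches only.
    total_sleep = max(batches - 1, 0) * sleep_seconds
    return batches, total_sleep


volumes = 100  # hypothetical number of volumes to restore
print(restore_schedule(volumes, 25, 20))  # defaults: 25 per batch, 20 s interval
print(restore_schedule(volumes, 10, 60))  # smaller batches with a 1m interval
```

Smaller batches and longer intervals reduce pressure on the storage backend at the cost of a longer overall restore, so adjust them only as far as needed to clear the device busy error.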