blob: 7c622ea0fd1a3a6bf94d0f77cb1a9986e125e4a9 (
plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
|
Checkpointing is a technique to reduce the impact of machine failure.
When using Checkpointing, tasks make periodical snapshots of their state.
If a task fails, it can be restarted from the last snapshot instead of starting from the beginning.
A user can define a checkpoint model using the following parameters:
| Variable | Type | Required? | Default | Description |
|---------------------------|--------|-----------|---------|----------------------------------------------------------------------------------------------------------------------|
| checkpointInterval | Int64 | no | 3600000 | The time between checkpoints in ms |
| checkpointDuration | Int64 | no | 300000 | The time to create a snapshot in ms |
| checkpointIntervalScaling | Double | no | 1.0 | The scaling of the checkpointInterval after each successful checkpoint. The default of 1.0 means no scaling happens. |
### Example
```json
{
"checkpointInterval": 3600000,
"checkpointDuration": 300000,
"checkpointIntervalScaling": 1.5
}
```
In this example, a snapshot is created every hour, and the snapshot creation takes 5 minutes.
The checkpointIntervalScaling is set to 1.5, which means that after each successful checkpoint,
the interval between checkpoints will be increased by 50% (for example from 1 to 1.5 hours).
|