Time-Based Replication Monitoring In The Hot_Standby_Delay
This was something that had been a long-standing item on my personal TODO list, and happened to scratch the itch of a couple of clients at the time.
Previously it would only take an integer representing how many bytes of WAL data the master could be ahead of a replica before the threshold is crossed:
check_hot_standby_delay --dbhost=master,replica1 --critical=16777594
This is certainly useful for, say, keeping an eye on whether you're getting close to running over your wal_keep_segments value. Of course it can also be used to indicate whether the replica is still processing WAL, or has become stuck for some reason. But for the (arguably more common) problem of determining whether a replica is falling too far behind determining what byte thresholds to use, beyond simply guessing, isn't easy to figure out.
Postgres 9.1 introduced a handy function to help solve this problem: pg_last_xact_replay_timestamp(). It measures a slightly different thing than the pg_last_xlog_* functions the action previously used. And it's for that reason that the action now has a more complex format for its thresholds:
check_hot_standby_delay --dbhost=master,replica1 --critical="16777594 and 5 min"
For backward compatibility, of course, it'll still take an integer and work the same as it did before. Or alternatively if you only want to watch the chronological lag, you could even give it just a time interval, '5 min', and the threshold only takes the transaction replay timestamp into account. But if you specify both, as above, then both conditions must be met before the threshold activates.
Comments
Post a Comment