PostgreSQL Streaming Replication Monitoring Script
One of the easiest ways to monitor slave lag when using streaming replication is to turn hot standby on your slave and use pg_last_xact_replay_timestamp() and/or the other recovery information functions. Here’s an example query to run on the slave systems to get the number of seconds behind it is:
SELECT extract(epoch from now() - pg_last_xact_replay_timestamp()) AS slave_lag
The issue with this query is that while your slave(s) may be 100% caught up, the time interval being returned is always increasing until new write activity occurs on the master that the slave can replay. This can cause your monitoring to give false positives that your slave is falling behind if you have things set up to ensure your slaves are no more than a few minutes behind. A side affect of this monitoring query can also give you an indication that writes to your master have stopped for some reason.
One of our clients has a smaller sized database that doesn’t get quite as much write traffic as our typical clients do. But it still has failover slaves and still needs to be monitored just like our other larger clients to ensure it doesn’t fall too far behind. So, my coworker introduced me to the pg_stat_replication view that was added in PostgreSQL 9.1. Querying this from the master returns information about streaming replication slaves connected to it.
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid | 16649
usesysid | 16388
usename | replication
application_name | walreceiver
client_addr | xxx.xxx.xxx.xxx
client_hostname | db1-prod-ca
client_port | 58085
backend_start | 2013-10-29 19:57:51.48142+00
state | streaming
sent_location | 147/11000000
write_location | 147/11000000
flush_location | 147/11000000
replay_location | 147/11000000
sync_priority | 0
sync_state | async
-[ RECORD 2 ]----+------------------------------
pid | 7999
usesysid | 16388
usename | replication
application_name | walreceiver
client_addr | yyy.yyy.yyy.yyy
client_hostname | db2-prod
client_port | 54932
backend_start | 2013-10-29 15:32:47.256794+00
state | streaming
sent_location | 147/11000000
write_location | 147/11000000
flush_location | 147/11000000
replay_location | 147/11000000
sync_priority | 0
sync_state | async
He also provided a handy query to get back a simple, easy to understand numeric value to indicate slave lag. The issue I ran into using the query is that this view uses pg_stat_activity as one of its sources. If you’re not a superuser, you’re not going to get any statistics on sessions that aren’t your own (and hopefully you’re not using a superuser role as the role for your monitoring solution). So, instead I made a function with SECURITY DEFINER set, made a superuser role the owner, and gave my monitoring role EXECUTE privileges on the function.
CREATE OR REPLACE FUNCTION streaming_slave_check() RETURNS TABLE (client_hostname text, client_addr inet, byte_lag float)
LANGUAGE SQL SECURITY DEFINER
AS $$
SELECT
client_hostname,
client_addr,
sent_offset - (replay_offset - (sent_xlog - replay_xlog) * 255 * 16 ^ 6 ) AS byte_lag
FROM (
SELECT
client_hostname,
client_addr,
('x' || lpad(split_part(sent_location::text, '/', 1), 8, '0'))::bit(32)::bigint AS sent_xlog,
('x' || lpad(split_part(replay_location::text, '/', 1), 8, '0'))::bit(32)::bigint AS replay_xlog,
('x' || lpad(split_part(sent_location::text, '/', 2), 8, '0'))::bit(32)::bigint AS sent_offset,
('x' || lpad(split_part(replay_location::text, '/', 2), 8, '0'))::bit(32)::bigint AS replay_offset
FROM pg_stat_replication
) AS s;
$$;
Running this query gives back a few handy columns that should be good enough for most monitoring tools. You can easily add more columns from pg_stat_replication or any other tables you need to join against for more info.
postgres=# select * from streaming_slave_check();
client_hostname | client_addr | byte_lag
-----------------+-----------------+----------
db1-prod-ca | xxx.xxx.xxx.xxx | 160
db2-prod | yyy.yyy.yyy.yyy | 160
UPDATE: If you’re running PostgreSQL 9.2+, there is a new, built-in function that avoids needing the above function all together and can just query pg_stat_replication directly.
postgres=# SELECT client_hostname, client_addr,
pg_xlog_location_diff(pg_stat_replication.sent_location, pg_stat_replication.replay_location)
AS byte_lag
FROM pg_stat_replication;
client_hostname | client_addr | byte_lag
-----------------+---------------+----------
db1-prod-ca | xxx.xxx.xx.xx | 0
db2-prod | yy.yyy.yyy.yy | 0
Unfortunately, this function still requires superuser privileges to obtain all relevant data and most monitoring tools do not use a superuser role (I hope). So, in that case you do still need a SECURITY DEFINER function, but it can be a much much simpler one
CREATE OR REPLACE FUNCTION streaming_slave_check() RETURNS TABLE (client_hostname text, client_addr inet, byte_lag numeric)
LANGUAGE SQL SECURITY DEFINER
AS $$
SELECT client_hostname
, client_addr
, pg_xlog_location_diff(pg_stat_replication.sent_location, pg_stat_replication.replay_location) AS byte_lag
FROM pg_stat_replication;
$$;
This can also be useful to monitor slave lag when you don’t have hot standby turned on for your slaves to allow read-only queries.
Combining both of the replication monitoring solutions mentioned in this post should give you a much better overall picture of the status of your master/slave systems.
Comments
Post a Comment