HOW TO RECOVER POSTGRES DATABASE FROM DISK LEVEL CORRUPTION
Ans:
I had to deal with corrupted Postgres database cluster. At the end, we couldn’t able to recover some of the data but managed to recover most part of it. Having experience working with dozens of database systems, I’m pleasantly surprised to experience resiliency of Postgres database.
Kudos to Postgres Development team for building the most resilience database in the world
Here is my Postgres database recovery story
Disclaimer:
I'm posting the steps carried out during the recovery process for information purpose only. This post doesn't provide any guarantee that it will work for your use-case and/or environment.
Note: The actual database name has been replaced with “dbname” and actual table names with “tablename”.
One of the Postgres DB cluster database experienced disk level corruption thus we were hitting this error:
ERROR: 58030:could not read block 36 in file "base/14241237/15856837";input/output error
ERROR: 58030:could not read block 37 in file "base/14241237/15856837";input/output error
(OR)postgres=# \c dbname
FATAL: could not read block 0 in file "base/16389/11930": Input/output error
Uh oh?? Really bad, isn’t it? Fortunately, it wasn’t mission critical system so we managed to take extended outage and work on partial recovery process because we didn’t want to loose all the data!!
Once we received the complaint, we immediately backed up corrupted database and created recovery cluster to bring up on different server so we can go through recovery efforts!!
Trial 1:
As many of you know, the first option is to bring up recovery database cluster with zero_damaged_pages=on . You can set the value in Postgres config file and try to reindex system catalog:
reindexdb -p 5433 --system dbname
reindexdb: could not connect to database dbname: FATAL: index "pg_index_indexrelid_index" contains unexpected zero page at block 0
HINT: Please REINDEX it.
Doh! Still, we could still not be able to connect to database !!
Trial 2:
If you aren’t aware, you should note down that there is a way to ignore indexes at system level. We started up recovery cluster with ignore_system_indexes=true setting:
pg_ctl -D /data -o '-c ignore_system_indexes=true'
restarted
dbname=# \c dbname
Yay! I could able to connect to DB now!
Trial 3:
Let’s try to reindex the database…
dbname=# reindex database "dbname";
NOTICE: table "pg_catalog.pg_class" was reindexed
2016-08-22 15:53:14.179 PDT rhost=[local] app=psql:user=postgres:db=dbname:ERROR: could not create unique index "pg_statistic_relid_att_inh_index"
2016-08-22 15:53:14.179 PDT rhost=[local] app=psql:user=postgres:db=dbname:DETAIL: Key (starelid, staattnum, stainherit)=(2608, 5, f) is duplicated.
2016-08-22 15:53:14.179 PDT rhost=[local] app=psql:user=postgres:db=dbname:STATEMENT: reindex database "dbname";
ERROR: could not create unique index "pg_statistic_relid_att_inh_index"
DETAIL: Key (starelid, staattnum, stainherit)=(2608, 5, f) is duplicated.
As the table is corrupted with duplicate entries, let’s find out and fix them.
dbname=# select starelid, staattnum, stainherit from pg_catalog.pg_statistic where starelid=2608 order by 2;
starelid | staattnum | stainherit
----------+-----------+------------
2608 | 1 | f
2608 | 2 | f
2608 | 3 | f
2608 | 4 | f
2608 | 5 | f
2608 | 5 | f
2608 | 6 | f
2608 | 7 | f
(8 rows)
Let’s remove one of the entry based on XMIN :
dbname=# delete from pg_catalog.pg_statistic where starelid=2608 and staattnum=5 and xmin=1228447;
DELETE 1
Trial 4:
Restart REINDEX but it failed again!!
2016-08-22 16:01:29.698 PDT rhost=[local] app=psql:user=postgres:db=dbname:ERROR: 1 constraint record(s) missing for rel tablename
2016-08-22 16:01:29.698 PDT rhost=[local] app=psql:user=postgres:db=dbname:STATEMENT: reindex database "dbname";
ERROR: 1 constraint record(s) missing for rel tablename
Trial 5:
Let’s try to vacuum analzye the table
dbname=# vacuum analyze tablename;
2016-08-22 16:04:01.282 PDT rhost=[local] app=psql:user=postgres:db=dbname: 1 constraint record(s) missing for rel tablename
2016-08-22 16:04:01.282 PDT rhost=[local] app=psql:user=postgres:db=dbname:STATEMENT: vacuum analyze tablename;
ERROR: 1 constraint record(s) missing for rel tablename
hrm…it’s still complaining about constraint
Trial 6:
let’s disable constraint check….
dbname=# update pg_class set relchecks=0 where relname='tablename';
UPDATE 1
The above update fixed the the constraint error
Trial 7:
Let’s reindex the database again!
dbname =# reindex database "dbname";
Yay, Reindex is successful.
Once the reindex is successfully completed, we restarted recovery cluster without zero_damaged_page and ignore_system_indices settings.
Partial tables recovery through pg_dump process:
As the database is corrupted, it makes sense to kick off the pg_dump on the database … we kicked off the pg_dump but it was still showing some of the sequences with errors!!
/usr/lib/postgresql/9.4/bin/pg_dump dbname -p 5433 -Fc >recovery_dbname.dmp
2016-08-22 16:22:09.517 PDT rhost=[local] app=pg_dump:user=postgres:db=dbname:ERROR: invalid page in block 0 of relation base/16389/2825248
2016-08-22 16:22:09.517 PDT rhost=[local] app=pg_dump:user=postgres:db=dbname:STATEMENT: SELECT sequence_name, start_value, increment_by, CASE WHEN increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE min_value END AS min_value, cache_value, is_cycled FROM XX_id_seq
pg_dump: [archiver (db)] query failed: ERROR: invalid page in block 0 of relation base/16389/2825248
pg_dump: [archiver (db)] query was: SELECT sequence_name, start_value, increment_by, CASE WHEN increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE min_value END AS min_value, cache_value, is_cycled FROM XX_id_seq
We had issue recovering a couple of tables but we managed to recover the most of the tables in the database
Comments
Post a Comment