Backing Up and Restoring Your PostgreSQL Database

Introduction

As a Developer or even a System Administrator, experiencing database corruption is a scary thing. Usually when something you’re responsible fails… it fails hard. It’s important to always make frequent backups of your database(s) your in charge of. The internet is filled with solutions and one liners you can consider. But there is a lot of contradiction everywhere too as each site tells you how it should be done. Personally, I’m not going to tell you the way I do it is the correct way or not. But I will tell you it works, and it’s saved me in numerous occasions.

I’m lazy, and when I look for solutions online, I always hope someone is just going to spell it right out for me. It’s so much easier when someone else does all the work; but usually that’s wasn’t the case for me. Stack Overflow is fantastic for getting the quick 1 liners you need to build your solution, but I prefer automation. I don’t want to remember all of the different options one tool has from another. I want simplicity because when something fails, I don’t want to learn the complicated steps to do a database restoration.

As a strong supporter of PostgreSQL, I want to provide the solution I use to hopefully speed along someone else’s research in the future. I can’t stress enough also that if you aren’t taking regular backups of databases you are responsible for, then you really should reconsider and at use the scripts defined below.

Jun 28th, 2016 Update
I packaged pgnux_backup, pgnux_restore and pgnux_kickusers into an RPM to simplify things:
RPM: pgnux-1.0.3-1.el6.nuxref.noarch.rpm
Source: pgnux-1.0.3-1.el6.nuxref.src.rpm

Those who use CentOS/Red Hat 7 can get them here:
RPM: pgnux-1.0.3-1.el7.nuxref.noarch.rpm
Source: pgnux-1.0.3-1.el7.nuxref.src.rpm

These RPMs can also be fetch from my nuxref repository.

The rest of this blog contains a simplified version of the content found in the attached RPMs but should still suit the needs of most!

Database Backup

Assuming you you are the global system administrator responsible for a PostgreSQL database and have root privileges, here is a simple backup script you can use:

#!/bin/bash
# Author: Chris Caron <lead2gold at gmail.com>
# Name: pgnux_backup
# Description: Preform a backup of one ore more PostgreSQL Database(s)
# Returns: 0 on success, a non-zero value if a failure occurred
# Syntax: pgnux_backup [db1 [db2 [db3 [...]]]
#
# Note: If no database is specified to be backed up, then all of
#       databases hosted by the postgresql instance are used.

##################################
# Define some simple variables
##################################
PGENGINE=/usr/bin
PGPORT=5432
PGHOST=localhost
DUMPDIR=/var/pgbackup

##################################
# System Tweaks
##################################
# if COMPRESSION is enabled, then all database backups are
# gzipped to save space.  The trade off is more cpu power will
# be used to generate your backups and they will take longer
# to create.  But you can save yourself significant amounts of
# disk space.  Set this value to 0 if you do not wish to use
# compression. COMPRESSION can be set to a value between 0 and
# 9 where a setting of 0 is used to disable it.
# Please keep in mind also, that if you choose to use
# compression, the PostgreSQL restores will be a bit slower since
# the data needs to be first uncompressed before it can be
# reloaded.
COMPRESSION=6

# For SELinux we need to use 'runuser' not 'su'
if [ -x /sbin/runuser ]; then
    SU=/sbin/runuser
else
    SU=/bin/su
fi

# Safety Check
if [ $(whoami) != 'root' ]; then
   echo "Error: You must be 'root' to use this script."
   exit 1
fi

# List of databases you wish to back up by default.  If you
# leave this blank, then the system will automatically detect
# all of the databases hosted by the PostgreSQL instance.
#
# You can use Space, Tab or some form of white space delimiter
# between each database you specify if you need to specify more
# than one.
# If nothing is specified on the command line (when this script
# is called, then this is the default list that will be used in
# it's place.
DBLIST=""

# If anything was specified on the command line; we over-ride it here
[ ! -z "$@" ] && DBLIST="$@"

# Capture the current time to make it easy to locate our backups later
SNAPTIME=$(date +'%Y%m%d%H%M%S')

# Build DBLIST if it isn't already defined:
[ -z "$DBLIST" ] && DBLIST=$($SU -l postgres -c "$PGENGINE/psql 
                          -p $PGPORT -h $PGHOST -lt | 
                   awk 'NF >=5 { print $1}' | 
                    egrep -v '^[ t]*(template[0|1]||)')

# Create our backup directory if it doesn't already exist
[ ! -d $DUMPDIR ] && mkdir -p $DUMPDIR

# Ensure our directory is protected from prying eyes
chmod 770 $DUMPDIR
chown root.postgres $DUMPDIR

# Backup our Database Globals (exit on failure)
$SU -l postgres -c "$PGENGINE/pg_dumpall --globals-only 
   -p $PGPORT -h $PGHOST | 
   gzip -c > $DUMPDIR/${SNAPTIME}-globals.sql.gz"
if [ $? -ne 0 ]; then
   echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - ERROR - "
   echo "Failed to dump system globals; backup aborted."
   exit 2
fi

# Protect our backup and ensure the user postgres user can manage
# this file if need be.
chmod ug+rw "$DUMPDIR/${SNAPTIME}-globals.sql.gz" &>/dev/null
chown postgres.root $DUMPDIR/${SNAPTIME}-globals.sql.gz &>/dev/null

FAIL_COUNT=0
# Iterate over our database list and perform our backup:
for DBNAME in $DBLIST; do
   echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - INFO - "
   echo "Dumping database '$DBNAME' ..."
   PGBKUP=
   PGFLAGS="-p $PGPORT -h $PGHOST --create --oids --format=t --verbose"
   if [ $COMPRESSION -gt 0 ]; then
      # Use Compression
      PGBKUP=$SNAPTIME-$DBNAME.db.gz
      $SU -l postgres -c "$PGENGINE/pg_dump $PGFLAGS $DBNAME | 
                 gzip -cq$COMPRESSSION > $DUMPDIR/$PGBKUP"
      RETVAL=$?
   else
      # Do not compres (flag is not set to 1)
      PGBKUP=$SNAPTIME-$DBNAME.db
      $SU -l postgres -c "$PGENGINE/pg_dump $PGFLAGS 
                 --file=$DUMPDIR/$PGBKUP $DBNAME"
      RETVAL=$?
   fi
   if [ $RETVAL -eq 0 ]; then
      echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - INFO - "
      echo "Backup Complete: $PGBKUP"
      chmod ug+rw $DUMPDIR/$PGBKUP &>/dev/null
      chown postgres.root $DUMPDIR/$PGBKUP &>/dev/null
   else
      [ -f $DUMPDIR/$PGBKUP ] && 
          rm -f $DUMPDIR/$PGBKUP
      echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - ERROR - "
      echo "Backup Failed: $PGBKUP"
      let FAIL_COUNT+=1
   fi
done

if [ $FAIL_COUNT -ne 0 ]; then
   exit 1
fi
exit 0

Database Restore

Backups are one thing, but the ability to restore is another. This script pairs with the above and will allow you to restore different points in time.

#!/bin/bash
# Author: Chris Caron <lead2gold at gmail.com>
# Name: pgnux_restore
# Description: Preform a restore of one ore more PostgreSQL Database(s)
# Returns: 0 on success, a non-zero value if a failure occurred
# Syntax: pgnux_restore <database_name> [snapshot_time]
#
# Note: This script will not work if people are connected to the
#       databases in question you are trying to restore.
 
##################################
# Define some simple variables
##################################
PGENGINE=/usr/bin
PGPORT=5432
PGHOST=localhost
DUMPDIR=/var/pgbackup
 
# How many backups in the past to list when no restore
# has taken place
HISTORY=10
 
# Specify a keyword that when in place of the <database_name>
# it triggers all databases for specified snapshot time to be
# restored including database system globals. Set this to
# 'nothing' if you want the restore to do a global restore every
# time (might be dangerous)
# The below allows you to type 'pgnux_restore --' to restore
# everything.
RESTORE_ALL_KEYWORD="--"

# For SELinux we need to use 'runuser' not 'su'
if [ -x /sbin/runuser ]; then
    SU=/sbin/runuser
else
    SU=/bin/su
fi

# Safety Check
if [ $(whoami) != 'root' ]; then
   echo "Error: You must be 'root' to use this script."
   exit 1
fi

# Not worth continuing if there is no dump directory
if [ ! -d $DUMPDIR ]; then
   echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - ERROR - "
   echo "Directory '$DUMPDIR' does not exist."
   exit 2
fi
  
# Capture Database(s) and allow for delimiters to separate more
# then one. Keep in mind that the first database specified will
# be the one used to detect the SNAPTIME if one isn't specified
DBLIST=$(echo "$1" | sed -e 's/[,:+%!&]/ /g')
 
# Take the value specified on the command line (argument 2)
shift
_SNAPTIME=$(echo "$@" | sed -e 's/[^0-9]//g')
SNAPTIME=
if [ -z "$_SNAPTIME" ]; then
   # Fetch most recent backup time
   SNAPTIME=$(find -L $DUMPDIR -maxdepth 1 -mindepth 1 
        -type f -regex ".*/[0-9]+-.*.(db|db.gz)$" 
        -exec basename {} ; | 
        sed -e 's/^([0-9]+)-(.*).(db|db.gz)$/1/g' | 
        sort -r -n | uniq | head -n1)
   if [ -z "$SNAPTIME" ]; then
      echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - ERROR - "
      echo "No backups detected."
      exit 2
   fi
fi
 
# Initialize Globals Loaded Flag 
GLOBALS_LOADED=1
if [ "$DBLIST" == "$RESTORE_ALL_KEYWORD" ]; then
   # Now we build a list of the databases based on the
   # snapshot time specified
   DBLIST=$(find -L $DUMPDIR -maxdepth 1 -mindepth 1 
      -type f -regex ".*/$SNAPTIME-.*.(db|db.gz)$" 
      -exec basename {} ; | 
      sed -e 's/^[0-9]+-(.*).(db|db.gz)$/1/g')
   GLOBALS_LOADED=0
fi
 
FAILLIST=""
RESTORE_COUNT=0
for DBNAME in $DBLIST; do
   # Toggle 'Globals Loaded Flag' if 'Restore All Keyword' found
   # This neat little trick allows you to force the restores of
   # globals associated with a specified date of the backup
   # hence: pgnux_restore --,foobar_database 2014010100000
   # would cause the globals taken at this time to be loaded
   # with the database,
   [ "$DBLIST" == "$RESTORE_ALL_KEYWORD" ] && GLOBALS_LOADED=0

   # If no Snapshot time was specified, we need to
   # keep detecting it based on the database specified
   [ -z "$_SNAPTIME" ] &&
      SNAPTIME=$(find -L $DUMPDIR -maxdepth 1 -mindepth 1 
        -type f -regex ".*/[0-9]+-$DBNAME.(db|db.gz)$" 
        -exec basename {} ; | 
        sed -e 's/^([0-9]+)-(.*).(db|db.gz)$/1/g' | 
        sort -r -n | uniq | head -n1)
   PGBKUP=$DUMPDIR/$SNAPTIME-$DBNAME.db

   if [ $GLOBALS_LOADED -eq 0 ] && 
      [ -f $DUMPDIR/globals-$SNAPTIME.sql.gz ]; then
      echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - INFO - "
      echo -n $"Restoring Database Globals ..."
         $SU -l postgres -c "gunzip 
             -c $DUMPDIR/globals-$SNAPTIME.sql.gz | 
             $PGENGINE/psql -a -p $PGPORT -h $PGHOST"
      # Toggle Flag
      GLOBALS_LOADED=1
   fi
 
   # Detect Compression
   if [ -f $PGBKUP.gz ]; then
      echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - INFO - "
      echo $"Uncompressing Database $DBNAME ($SNAPTIME)..."
      gunzip -c $PGBKUP.gz > $PGBKUP
      if [ $? -ne 0 ]; then
         FAILLIST="$FAILLIST $DBNAME"
         [ -f $PGBKUP ] && rm -f $PGBKUP
      fi
   fi
 
   [ ! -f $PGBKUP ] && continue
 
   echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - INFO - "
   echo $"Restoring Database $DBNAME ($SNAPTIME)..."
   # This action can fail for 2 reasons:
   #  1. the database doesn't exist (this is okay because we
   #     are restoring it anyway.
   #  2. Someone is accessing the database we are trying to
   #     drop.  You need to make sure you've stopped all
   #     processes that are accessing the particular database
   #     being restored or you will have strange results.
   $SU -l postgres -c "$PGENGINE/dropdb 
          -p $PGPORT -h $PGHOST $DBNAME" &>/dev/null
   # use -C & -d 'template1' so we can bootstrap off of the
   # 'template1' database. Note, that this means we are bound
   # by the character encoding of 'template1' (default is UTF-8)
   $SU -l postgres -c "$PGENGINE/pg_restore -e -v -p $PGPORT 
                   -C -d template1 -h $PGHOST $PGBKUP"
   [ $? -ne 0 ] && FAILLIST="$FAILLIST $DBNAME" || 
        let RESTORE_COUNT+=1
   [ -f $PGBKUP.gz ] && [ -f $PGBKUP ] && rm -f $PGBKUP
done
  
# Spit a readable list of databases that were not recovered if
# required.
if [ ! -z "$FAILLIST" ]; then
   for DBNAME in $FAILLIST; do
      echo "Warning: $DBNAME ($SNAPTIME) was not correctly restored."
   done
   exit 1
elif [ $RESTORE_COUNT -eq 0 ]; then
   # Nothing was restored; now is a good time to display to the user
   # their options
   echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - WARNING - "
   echo $"There were no databases restored."
   echo
   # Display last X backups and the databases associated with each
   BACKUPS=$(find -L $DUMPDIR -maxdepth 1 -mindepth 1 
        -type f -regex ".*/[0-9]+-.*.(db|db.gz)$" 
        -exec basename {} ; | 
        sed -e "s/^([0-9]+)-(.*).(db|db.gz)$/1|2/g" | 
        sort -r -n | head -n$HISTORY)
   [ -z "$BACKUPS" ] && exit 1
 
   LAST_SNAPTIME=""
   printf "    %-16s %sn" "TIMESTAMP" "DATABASE(S)"
   for BACKUP in $BACKUPS; do
      SNAPTIME=$(echo $BACKUP | cut -d'|' -f1)
      DBNAME=$(echo $BACKUP | cut -d'|' -f2)
      if [ "$LAST_SNAPTIME" != "$SNAPTIME" ]; then
          printf "n%s: %s" $(echo "$SNAPTIME" | 
              sed -e 's/([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})/1-2-3_4:5:6/g') "$DBNAME"
          LAST_SNAPTIME=$SNAPTIME
      else
         printf ", %s" "$DBNAME"
      fi
   done
   printf "n"
   exit 1
fi
exit 0

Automation

Consider creating a few crontab entries to manage backups during off hours as well as some tool that can keep your backups from growing out of control on your hard drive. Feel free to adjust the numbers as you see fit. The below assumes you copied the 2 scripts above in /usr/bin

cat << EOF > /etc/cron.d/nuxref_postgresql_backup
# Create a backup of all databases every hour
0  * * * * root /usr/bin/pgnux_backup &>/dev/null
# Only keep the last 30 days of backups (check run every hour on the 40 min mark)
40 * * * * root find /var/pgbackup -maxdepth 1 -type f -mtime +30 -delete &>/dev/null
EOF

Usage

Backups are as simple as the following:

# Backup All databases
pgnux_backup

# Backup just the foobar_database:
pgnux_backup foobar_database

# Backup just the foobar_database and the barfoo_database:
pgnux_backup foobar_database barfoo_database

Restoring requires that no one is accessing the database (if it already exists). Otherwise it’s just as easy:

# Call the script with no parameters, to see a nice list of the last 10 backups.
# This list can be really helpful because it also identifies all of the
# snapshot times that they were taken.  You need these times to preform
# restores with.
pgnux_restore

# Restore ALL databases that resided in the last database snapshot
# Caution: If you just backed up a single database on it's own last
#          time, this might be a bit unsafe to run. If you backed
#          up other databases at a different time, then they won't
#          get included in this restore. The -- also forces the
#          system globals to get reloaded as well.
pgnux_restore --

# Restore the last snapshot taken of the foobar_database
pgnux_restore foobar_database

# Restore the last snapshot taken of the foobar_database and
# barfoo_database notice we can't use space as a delimiter (like
# pgnux_backup does) only because the second argument identifies
# the snapshot time (in case you want your restore to be specific).
pgnux_restore foobar_database,barfoo_database

# With respect to the last point, this will restore a database
# snapshot further back in the past (not the latest one). This
# also assumes you have a backup at this point.  You can find
# this out by running pgnux_restore all by itself as the first
# example given above. The script will take care of the formatting
# of the time so long as you provide YYYYMMDDHHmmss in that format.
# You can stick in as many delimiters to make the time readable as
# you want. The below restore a specific database snapshot:
pgnux_restore foobar_database 2014-03-13_15:47:06

# You can restore more then one database at a specific time as
# well using the a comma (,) to delimit your databases and
# specifying the snapshot time at the end. Here is how to restore
# the foobar_database and the barfoo_database at a specific
# instance of time:
pgnux_restore foobar_database,barfoo_database 2014-03-13_15:47:06

# the -- also plays another role, it can be used to force the loading
# of the postgresql globals again when it's paired with databases
# consider the above line but with this small tweak:
pgnux_restore --,foobar_database,barfoo_database 2014-03-13_15:47:06

Catastrophic Recoveries

If worst case scenario happens, and the restore process seems to be failing you, your not out of options. As long as you’ve been taking frequent backups, there is always a way back!

Remember that restores will fail if the database is being accessed by someone else. The below is a dirty little scripted trick that boots everyone accessing your system off. It’s somewhat dangerous to do, but if restoring is your final goal, then the danger is mitigated.

#!/bin/bash
# Author: Chris Caron <lead2gold at gmail.com>
# Name: pgnux_kickusers
# Description: Kicks all users off of a PostgreSQL instance being
#              accessed.
# Syntax: pgnux_kickusers

##################################
# Define some simple variables
##################################
PGENGINE=/usr/bin
PGPORT=5432
PGHOST=localhost

# For SELinux we need to use 'runuser' not 'su'
if [ -x /sbin/runuser ]; then
    SU=/sbin/runuser
else
    SU=/bin/su
fi

# Safety Check
if [ $(whoami) != 'root' ]; then
   echo "Error: You must be 'root' to use this script."
   exit 1
fi

# Get a list of all PIDS accessing our system while kicking
# the ones we can control via the database
# PostgreSQL 9.2 and above
PIDS=$($SU -l postgres \
      -c "$PGENGINE/psql -h $PGHOST -p $PGPORT -t -E --no-align \
      -c \"SELECT pg_stat_activity.pid, pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pid <> pg_backend_pid();\"" 2>/dev/null | egrep -v 't' | cut -d'|' -f1)

# PostgreSQL 9.1 and below (un-comment if this is your version):
# PIDS=$($SU -l postgres \
#       -c "$PGENGINE/psql -h $PGHOST -p $PGPORT -t -E --no-align \
#      -c \"SELECT  pg_stat_activity.procpid,pg_terminate_backend(pg_stat_activity.procpid) FROM pg_stat_activity WHERE procpid <> pg_backend_pid();\"" 2>/dev/null | egrep -v 't' | cut -d'|' -f1)

# Iterate over the left over pids that we couldn't otherwise
# handle and attempt to kick them by sending each of them a SIGTERM.
for PID in $PIDS; do
   RETRY=3
   while [ $RETRY -gt 0 ]; do
      kill $PID &>/dev/null
      ps -p $PID &>/dev/null
      if [ $? -ne 0 ]; then break; fi
      let RETRY-=1
   done
   # All PostgreSQL Documentation will tell you SIGKILL
   # is the absolute 'WORST' thing you can do.
   # The below code should only be uncommented in
   # catastrophic situations where a restore is inevitable
   # ps -p $PID &>/dev/null
   # [ $? -eq 0 ] && kill -9 $PID
done
exit 0

They can also fail if PostgreSQL data directory contents has been corrupted. Corruption can occur:

  • From a system that is experiencing harddrive failures.
  • From another processes accessing /var/lib/pgsql/data/* that shouldn’t be. Only the database itself should be modifying/accessing the contents of this directory unless you know what you’re doing.
  • From just a simple power outage. A busy system that isn’t backed up by a UPS device or redundant power sources can have it’s file system corrupted easily during a loss of power. What usually happens is PostgreSQL is writing content to the database when the power is pulled from the machine. At the end of the day, you’re left with a database that wasn’t correctly written to and in some cases can be seriously damaged from this as a result.

In some cases, the corruption can be so bad the database instance (PostgreSQL) won’t even start. Evidently, you can’t restore a working copy of a database if the service itself isn’t working for you.

Sometimes attempting to repair the write-ahead logging can help you recover. Obviously this completely depends on the type of failure you’re experiencing. /var/lib/pgsql/data/pg_xlog/* can help shed some light on this subject as well. The below script assumes your instance of PostgreSQL is stopped.

# Make sure your database is NOT running if you intend to run the below
# command:
service postgresql stop

Here is the the simple wrapper to the read-ahead write log repair:

#!/bin/bash
# Author: Chris Caron <lead2gold at gmail.com>
# Name: pgnux_repair
# Description: Simple wrapper to pg_resetxlog
# Syntax: pgnux_repair
 
##################################
# Define some simple variables
##################################
PGENGINE=/usr/bin
PGDATA=/var/lib/pgsql/data

# For SELinux we need to use 'runuser' not 'su'
if [ -x /sbin/runuser ]; then
    SU=/sbin/runuser
else
    SU=/bin/su
fi

# Safety Check
if [ $(whoami) != 'root' ]; then
   echo "Error: You must be 'root' to use this script."
   exit 1
fi

# Attempt to repair the WAL logs
$SU -l postgres -c "$PGENGINE/pg_resetxlog -f $PGDATA"

If it still doesn’t work, as a last resort consider these simple commands to bring your system back up again. Note that you should be the root user prior to preforming these steps:

# Stop the database (if it isn't already):
service postgresql stop

# Backup your configuration files
mkdir -p /tmp/postgresql_conf
cp -a /var/lib/pgsql/data/*.conf /tmp/postgresql_conf

# Destroy the database data directory. Unless you changed the
# default directory, the below command will do it:
# This is IRREVERSIBLE, be sure you have a recovery point to
# restore to before doing this otherwise you will make your
# situation worse!
find /var/lib/pgsql/ -mindepth 1 -maxdepth 1 rm -rf {} ;

# Initialize a new empty database; this will rebuild
# the /var/lib/pgsql/data directory for you
service postgresql initdb

# Restore all of your configuration you backed up above:
cp -af /tmp/postgresql_conf/*.conf /var/lib/pgsql/data/

# Just in case, for SELinux Users, make sure we're set okay
restorecon -R /var/lib/pgsql

# Start your database up
service postgresql start

# Now restore your databases using pgnux_restore and make sure you
# include the double dash '--' so that you load the system globals
# back as well:
#   pgnux_restore --,db1,db2,db3,db4,...
#
# If you've always been backing them all together, then you can
# restore them with a single command:
pgnux_restore --

# Your done! You should be back and running now and hopefully with
# minimal downtime.

Disclaimer

I can not be held responsible or liable for anything that goes wrong with your system. This blog is intended to be used as a resource only and will hopefully help you out in the time of need. These tools work for me and have been tested thoroughly using PostgreSQL v8.4 (which ships with CentOS 6.x). I have not tried any of these tools for PostgreSQL v9.x or higher and can not promise you anything, but I honestly can’t see a reason why they wouldn’t work.

Credit

Please note that this information took me several days to put together and test thoroughly. I may not blog often; but I want to re-assure the stability and testing I put into everything I intend share.

If you like what you see and wish to copy and paste this HOWTO, please reference back to this blog post at the very least. It’s really all I ask.

Leave a Reply

Your email address will not be published. Required fields are marked *