Tag Archives: Red Hat

Backing Up and Restoring Your PostgreSQL Database

Introduction

As a Developer or even a System Administrator, experiencing database corruption is a scary thing. Usually when something you’re responsible fails… it fails hard. It’s important to always make frequent backups of your database(s) your in charge of. The internet is filled with solutions and one liners you can consider. But there is a lot of contradiction everywhere too as each site tells you how it should be done. Personally, I’m not going to tell you the way I do it is the correct way or not. But I will tell you it works, and it’s saved me in numerous occasions.

I’m lazy, and when I look for solutions online, I always hope someone is just going to spell it right out for me. It’s so much easier when someone else does all the work; but usually that’s wasn’t the case for me. Stack Overflow is fantastic for getting the quick 1 liners you need to build your solution, but I prefer automation. I don’t want to remember all of the different options one tool has from another. I want simplicity because when something fails, I don’t want to learn the complicated steps to do a database restoration.

As a strong supporter of PostgreSQL, I want to provide the solution I use to hopefully speed along someone else’s research in the future. I can’t stress enough also that if you aren’t taking regular backups of databases you are responsible for, then you really should reconsider and at use the scripts defined below.

Jun 28th, 2016 Update
I packaged pgnux_backup, pgnux_restore and pgnux_kickusers into an RPM to simplify things:
RPM: pgnux-1.0.3-1.el6.nuxref.noarch.rpm
Source: pgnux-1.0.3-1.el6.nuxref.src.rpm

Those who use CentOS/Red Hat 7 can get them here:
RPM: pgnux-1.0.3-1.el7.nuxref.noarch.rpm
Source: pgnux-1.0.3-1.el7.nuxref.src.rpm

These RPMs can also be fetch from my nuxref repository.

The rest of this blog contains a simplified version of the content found in the attached RPMs but should still suit the needs of most!

Database Backup

Assuming you you are the global system administrator responsible for a PostgreSQL database and have root privileges, here is a simple backup script you can use:

#!/bin/bash
# Author: Chris Caron <lead2gold at gmail.com>
# Name: pgnux_backup
# Description: Preform a backup of one ore more PostgreSQL Database(s)
# Returns: 0 on success, a non-zero value if a failure occurred
# Syntax: pgnux_backup [db1 [db2 [db3 [...]]]
#
# Note: If no database is specified to be backed up, then all of
#       databases hosted by the postgresql instance are used.

##################################
# Define some simple variables
##################################
PGENGINE=/usr/bin
PGPORT=5432
PGHOST=localhost
DUMPDIR=/var/pgbackup

##################################
# System Tweaks
##################################
# if COMPRESSION is enabled, then all database backups are
# gzipped to save space.  The trade off is more cpu power will
# be used to generate your backups and they will take longer
# to create.  But you can save yourself significant amounts of
# disk space.  Set this value to 0 if you do not wish to use
# compression. COMPRESSION can be set to a value between 0 and
# 9 where a setting of 0 is used to disable it.
# Please keep in mind also, that if you choose to use
# compression, the PostgreSQL restores will be a bit slower since
# the data needs to be first uncompressed before it can be
# reloaded.
COMPRESSION=6

# For SELinux we need to use 'runuser' not 'su'
if [ -x /sbin/runuser ]; then
    SU=/sbin/runuser
else
    SU=/bin/su
fi

# Safety Check
if [ $(whoami) != 'root' ]; then
   echo "Error: You must be 'root' to use this script."
   exit 1
fi

# List of databases you wish to back up by default.  If you
# leave this blank, then the system will automatically detect
# all of the databases hosted by the PostgreSQL instance.
#
# You can use Space, Tab or some form of white space delimiter
# between each database you specify if you need to specify more
# than one.
# If nothing is specified on the command line (when this script
# is called, then this is the default list that will be used in
# it's place.
DBLIST=""

# If anything was specified on the command line; we over-ride it here
[ ! -z "$@" ] && DBLIST="$@"

# Capture the current time to make it easy to locate our backups later
SNAPTIME=$(date +'%Y%m%d%H%M%S')

# Build DBLIST if it isn't already defined:
[ -z "$DBLIST" ] && DBLIST=$($SU -l postgres -c "$PGENGINE/psql 
                          -p $PGPORT -h $PGHOST -lt | 
                   awk 'NF >=5 { print $1}' | 
                    egrep -v '^[ t]*(template[0|1]||)')

# Create our backup directory if it doesn't already exist
[ ! -d $DUMPDIR ] && mkdir -p $DUMPDIR

# Ensure our directory is protected from prying eyes
chmod 770 $DUMPDIR
chown root.postgres $DUMPDIR

# Backup our Database Globals (exit on failure)
$SU -l postgres -c "$PGENGINE/pg_dumpall --globals-only 
   -p $PGPORT -h $PGHOST | 
   gzip -c > $DUMPDIR/${SNAPTIME}-globals.sql.gz"
if [ $? -ne 0 ]; then
   echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - ERROR - "
   echo "Failed to dump system globals; backup aborted."
   exit 2
fi

# Protect our backup and ensure the user postgres user can manage
# this file if need be.
chmod ug+rw "$DUMPDIR/${SNAPTIME}-globals.sql.gz" &>/dev/null
chown postgres.root $DUMPDIR/${SNAPTIME}-globals.sql.gz &>/dev/null

FAIL_COUNT=0
# Iterate over our database list and perform our backup:
for DBNAME in $DBLIST; do
   echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - INFO - "
   echo "Dumping database '$DBNAME' ..."
   PGBKUP=
   PGFLAGS="-p $PGPORT -h $PGHOST --create --oids --format=t --verbose"
   if [ $COMPRESSION -gt 0 ]; then
      # Use Compression
      PGBKUP=$SNAPTIME-$DBNAME.db.gz
      $SU -l postgres -c "$PGENGINE/pg_dump $PGFLAGS $DBNAME | 
                 gzip -cq$COMPRESSSION > $DUMPDIR/$PGBKUP"
      RETVAL=$?
   else
      # Do not compres (flag is not set to 1)
      PGBKUP=$SNAPTIME-$DBNAME.db
      $SU -l postgres -c "$PGENGINE/pg_dump $PGFLAGS 
                 --file=$DUMPDIR/$PGBKUP $DBNAME"
      RETVAL=$?
   fi
   if [ $RETVAL -eq 0 ]; then
      echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - INFO - "
      echo "Backup Complete: $PGBKUP"
      chmod ug+rw $DUMPDIR/$PGBKUP &>/dev/null
      chown postgres.root $DUMPDIR/$PGBKUP &>/dev/null
   else
      [ -f $DUMPDIR/$PGBKUP ] && 
          rm -f $DUMPDIR/$PGBKUP
      echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - ERROR - "
      echo "Backup Failed: $PGBKUP"
      let FAIL_COUNT+=1
   fi
done

if [ $FAIL_COUNT -ne 0 ]; then
   exit 1
fi
exit 0

Database Restore

Backups are one thing, but the ability to restore is another. This script pairs with the above and will allow you to restore different points in time.

#!/bin/bash
# Author: Chris Caron <lead2gold at gmail.com>
# Name: pgnux_restore
# Description: Preform a restore of one ore more PostgreSQL Database(s)
# Returns: 0 on success, a non-zero value if a failure occurred
# Syntax: pgnux_restore <database_name> [snapshot_time]
#
# Note: This script will not work if people are connected to the
#       databases in question you are trying to restore.
 
##################################
# Define some simple variables
##################################
PGENGINE=/usr/bin
PGPORT=5432
PGHOST=localhost
DUMPDIR=/var/pgbackup
 
# How many backups in the past to list when no restore
# has taken place
HISTORY=10
 
# Specify a keyword that when in place of the <database_name>
# it triggers all databases for specified snapshot time to be
# restored including database system globals. Set this to
# 'nothing' if you want the restore to do a global restore every
# time (might be dangerous)
# The below allows you to type 'pgnux_restore --' to restore
# everything.
RESTORE_ALL_KEYWORD="--"

# For SELinux we need to use 'runuser' not 'su'
if [ -x /sbin/runuser ]; then
    SU=/sbin/runuser
else
    SU=/bin/su
fi

# Safety Check
if [ $(whoami) != 'root' ]; then
   echo "Error: You must be 'root' to use this script."
   exit 1
fi

# Not worth continuing if there is no dump directory
if [ ! -d $DUMPDIR ]; then
   echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - ERROR - "
   echo "Directory '$DUMPDIR' does not exist."
   exit 2
fi
  
# Capture Database(s) and allow for delimiters to separate more
# then one. Keep in mind that the first database specified will
# be the one used to detect the SNAPTIME if one isn't specified
DBLIST=$(echo "$1" | sed -e 's/[,:+%!&]/ /g')
 
# Take the value specified on the command line (argument 2)
shift
_SNAPTIME=$(echo "$@" | sed -e 's/[^0-9]//g')
SNAPTIME=
if [ -z "$_SNAPTIME" ]; then
   # Fetch most recent backup time
   SNAPTIME=$(find -L $DUMPDIR -maxdepth 1 -mindepth 1 
        -type f -regex ".*/[0-9]+-.*.(db|db.gz)$" 
        -exec basename {} ; | 
        sed -e 's/^([0-9]+)-(.*).(db|db.gz)$/1/g' | 
        sort -r -n | uniq | head -n1)
   if [ -z "$SNAPTIME" ]; then
      echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - ERROR - "
      echo "No backups detected."
      exit 2
   fi
fi
 
# Initialize Globals Loaded Flag 
GLOBALS_LOADED=1
if [ "$DBLIST" == "$RESTORE_ALL_KEYWORD" ]; then
   # Now we build a list of the databases based on the
   # snapshot time specified
   DBLIST=$(find -L $DUMPDIR -maxdepth 1 -mindepth 1 
      -type f -regex ".*/$SNAPTIME-.*.(db|db.gz)$" 
      -exec basename {} ; | 
      sed -e 's/^[0-9]+-(.*).(db|db.gz)$/1/g')
   GLOBALS_LOADED=0
fi
 
FAILLIST=""
RESTORE_COUNT=0
for DBNAME in $DBLIST; do
   # Toggle 'Globals Loaded Flag' if 'Restore All Keyword' found
   # This neat little trick allows you to force the restores of
   # globals associated with a specified date of the backup
   # hence: pgnux_restore --,foobar_database 2014010100000
   # would cause the globals taken at this time to be loaded
   # with the database,
   [ "$DBLIST" == "$RESTORE_ALL_KEYWORD" ] && GLOBALS_LOADED=0

   # If no Snapshot time was specified, we need to
   # keep detecting it based on the database specified
   [ -z "$_SNAPTIME" ] &&
      SNAPTIME=$(find -L $DUMPDIR -maxdepth 1 -mindepth 1 
        -type f -regex ".*/[0-9]+-$DBNAME.(db|db.gz)$" 
        -exec basename {} ; | 
        sed -e 's/^([0-9]+)-(.*).(db|db.gz)$/1/g' | 
        sort -r -n | uniq | head -n1)
   PGBKUP=$DUMPDIR/$SNAPTIME-$DBNAME.db

   if [ $GLOBALS_LOADED -eq 0 ] && 
      [ -f $DUMPDIR/globals-$SNAPTIME.sql.gz ]; then
      echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - INFO - "
      echo -n $"Restoring Database Globals ..."
         $SU -l postgres -c "gunzip 
             -c $DUMPDIR/globals-$SNAPTIME.sql.gz | 
             $PGENGINE/psql -a -p $PGPORT -h $PGHOST"
      # Toggle Flag
      GLOBALS_LOADED=1
   fi
 
   # Detect Compression
   if [ -f $PGBKUP.gz ]; then
      echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - INFO - "
      echo $"Uncompressing Database $DBNAME ($SNAPTIME)..."
      gunzip -c $PGBKUP.gz > $PGBKUP
      if [ $? -ne 0 ]; then
         FAILLIST="$FAILLIST $DBNAME"
         [ -f $PGBKUP ] && rm -f $PGBKUP
      fi
   fi
 
   [ ! -f $PGBKUP ] && continue
 
   echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - INFO - "
   echo $"Restoring Database $DBNAME ($SNAPTIME)..."
   # This action can fail for 2 reasons:
   #  1. the database doesn't exist (this is okay because we
   #     are restoring it anyway.
   #  2. Someone is accessing the database we are trying to
   #     drop.  You need to make sure you've stopped all
   #     processes that are accessing the particular database
   #     being restored or you will have strange results.
   $SU -l postgres -c "$PGENGINE/dropdb 
          -p $PGPORT -h $PGHOST $DBNAME" &>/dev/null
   # use -C & -d 'template1' so we can bootstrap off of the
   # 'template1' database. Note, that this means we are bound
   # by the character encoding of 'template1' (default is UTF-8)
   $SU -l postgres -c "$PGENGINE/pg_restore -e -v -p $PGPORT 
                   -C -d template1 -h $PGHOST $PGBKUP"
   [ $? -ne 0 ] && FAILLIST="$FAILLIST $DBNAME" || 
        let RESTORE_COUNT+=1
   [ -f $PGBKUP.gz ] && [ -f $PGBKUP ] && rm -f $PGBKUP
done
  
# Spit a readable list of databases that were not recovered if
# required.
if [ ! -z "$FAILLIST" ]; then
   for DBNAME in $FAILLIST; do
      echo "Warning: $DBNAME ($SNAPTIME) was not correctly restored."
   done
   exit 1
elif [ $RESTORE_COUNT -eq 0 ]; then
   # Nothing was restored; now is a good time to display to the user
   # their options
   echo -n "$(date +'%Y-%m-%d %H:%M:%S') - $$ - WARNING - "
   echo $"There were no databases restored."
   echo
   # Display last X backups and the databases associated with each
   BACKUPS=$(find -L $DUMPDIR -maxdepth 1 -mindepth 1 
        -type f -regex ".*/[0-9]+-.*.(db|db.gz)$" 
        -exec basename {} ; | 
        sed -e "s/^([0-9]+)-(.*).(db|db.gz)$/1|2/g" | 
        sort -r -n | head -n$HISTORY)
   [ -z "$BACKUPS" ] && exit 1
 
   LAST_SNAPTIME=""
   printf "    %-16s %sn" "TIMESTAMP" "DATABASE(S)"
   for BACKUP in $BACKUPS; do
      SNAPTIME=$(echo $BACKUP | cut -d'|' -f1)
      DBNAME=$(echo $BACKUP | cut -d'|' -f2)
      if [ "$LAST_SNAPTIME" != "$SNAPTIME" ]; then
          printf "n%s: %s" $(echo "$SNAPTIME" | 
              sed -e 's/([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})/1-2-3_4:5:6/g') "$DBNAME"
          LAST_SNAPTIME=$SNAPTIME
      else
         printf ", %s" "$DBNAME"
      fi
   done
   printf "n"
   exit 1
fi
exit 0

Automation

Consider creating a few crontab entries to manage backups during off hours as well as some tool that can keep your backups from growing out of control on your hard drive. Feel free to adjust the numbers as you see fit. The below assumes you copied the 2 scripts above in /usr/bin

cat << EOF > /etc/cron.d/nuxref_postgresql_backup
# Create a backup of all databases every hour
0  * * * * root /usr/bin/pgnux_backup &>/dev/null
# Only keep the last 30 days of backups (check run every hour on the 40 min mark)
40 * * * * root find /var/pgbackup -maxdepth 1 -type f -mtime +30 -delete &>/dev/null
EOF

Usage

Backups are as simple as the following:

# Backup All databases
pgnux_backup

# Backup just the foobar_database:
pgnux_backup foobar_database

# Backup just the foobar_database and the barfoo_database:
pgnux_backup foobar_database barfoo_database

Restoring requires that no one is accessing the database (if it already exists). Otherwise it’s just as easy:

# Call the script with no parameters, to see a nice list of the last 10 backups.
# This list can be really helpful because it also identifies all of the
# snapshot times that they were taken.  You need these times to preform
# restores with.
pgnux_restore

# Restore ALL databases that resided in the last database snapshot
# Caution: If you just backed up a single database on it's own last
#          time, this might be a bit unsafe to run. If you backed
#          up other databases at a different time, then they won't
#          get included in this restore. The -- also forces the
#          system globals to get reloaded as well.
pgnux_restore --

# Restore the last snapshot taken of the foobar_database
pgnux_restore foobar_database

# Restore the last snapshot taken of the foobar_database and
# barfoo_database notice we can't use space as a delimiter (like
# pgnux_backup does) only because the second argument identifies
# the snapshot time (in case you want your restore to be specific).
pgnux_restore foobar_database,barfoo_database

# With respect to the last point, this will restore a database
# snapshot further back in the past (not the latest one). This
# also assumes you have a backup at this point.  You can find
# this out by running pgnux_restore all by itself as the first
# example given above. The script will take care of the formatting
# of the time so long as you provide YYYYMMDDHHmmss in that format.
# You can stick in as many delimiters to make the time readable as
# you want. The below restore a specific database snapshot:
pgnux_restore foobar_database 2014-03-13_15:47:06

# You can restore more then one database at a specific time as
# well using the a comma (,) to delimit your databases and
# specifying the snapshot time at the end. Here is how to restore
# the foobar_database and the barfoo_database at a specific
# instance of time:
pgnux_restore foobar_database,barfoo_database 2014-03-13_15:47:06

# the -- also plays another role, it can be used to force the loading
# of the postgresql globals again when it's paired with databases
# consider the above line but with this small tweak:
pgnux_restore --,foobar_database,barfoo_database 2014-03-13_15:47:06

Catastrophic Recoveries

If worst case scenario happens, and the restore process seems to be failing you, your not out of options. As long as you’ve been taking frequent backups, there is always a way back!

Remember that restores will fail if the database is being accessed by someone else. The below is a dirty little scripted trick that boots everyone accessing your system off. It’s somewhat dangerous to do, but if restoring is your final goal, then the danger is mitigated.

#!/bin/bash
# Author: Chris Caron <lead2gold at gmail.com>
# Name: pgnux_kickusers
# Description: Kicks all users off of a PostgreSQL instance being
#              accessed.
# Syntax: pgnux_kickusers

##################################
# Define some simple variables
##################################
PGENGINE=/usr/bin
PGPORT=5432
PGHOST=localhost

# For SELinux we need to use 'runuser' not 'su'
if [ -x /sbin/runuser ]; then
    SU=/sbin/runuser
else
    SU=/bin/su
fi

# Safety Check
if [ $(whoami) != 'root' ]; then
   echo "Error: You must be 'root' to use this script."
   exit 1
fi

# Get a list of all PIDS accessing our system while kicking
# the ones we can control via the database
# PostgreSQL 9.2 and above
PIDS=$($SU -l postgres \
      -c "$PGENGINE/psql -h $PGHOST -p $PGPORT -t -E --no-align \
      -c \"SELECT pg_stat_activity.pid, pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pid <> pg_backend_pid();\"" 2>/dev/null | egrep -v 't' | cut -d'|' -f1)

# PostgreSQL 9.1 and below (un-comment if this is your version):
# PIDS=$($SU -l postgres \
#       -c "$PGENGINE/psql -h $PGHOST -p $PGPORT -t -E --no-align \
#      -c \"SELECT  pg_stat_activity.procpid,pg_terminate_backend(pg_stat_activity.procpid) FROM pg_stat_activity WHERE procpid <> pg_backend_pid();\"" 2>/dev/null | egrep -v 't' | cut -d'|' -f1)

# Iterate over the left over pids that we couldn't otherwise
# handle and attempt to kick them by sending each of them a SIGTERM.
for PID in $PIDS; do
   RETRY=3
   while [ $RETRY -gt 0 ]; do
      kill $PID &>/dev/null
      ps -p $PID &>/dev/null
      if [ $? -ne 0 ]; then break; fi
      let RETRY-=1
   done
   # All PostgreSQL Documentation will tell you SIGKILL
   # is the absolute 'WORST' thing you can do.
   # The below code should only be uncommented in
   # catastrophic situations where a restore is inevitable
   # ps -p $PID &>/dev/null
   # [ $? -eq 0 ] && kill -9 $PID
done
exit 0

They can also fail if PostgreSQL data directory contents has been corrupted. Corruption can occur:

  • From a system that is experiencing harddrive failures.
  • From another processes accessing /var/lib/pgsql/data/* that shouldn’t be. Only the database itself should be modifying/accessing the contents of this directory unless you know what you’re doing.
  • From just a simple power outage. A busy system that isn’t backed up by a UPS device or redundant power sources can have it’s file system corrupted easily during a loss of power. What usually happens is PostgreSQL is writing content to the database when the power is pulled from the machine. At the end of the day, you’re left with a database that wasn’t correctly written to and in some cases can be seriously damaged from this as a result.

In some cases, the corruption can be so bad the database instance (PostgreSQL) won’t even start. Evidently, you can’t restore a working copy of a database if the service itself isn’t working for you.

Sometimes attempting to repair the write-ahead logging can help you recover. Obviously this completely depends on the type of failure you’re experiencing. /var/lib/pgsql/data/pg_xlog/* can help shed some light on this subject as well. The below script assumes your instance of PostgreSQL is stopped.

# Make sure your database is NOT running if you intend to run the below
# command:
service postgresql stop

Here is the the simple wrapper to the read-ahead write log repair:

#!/bin/bash
# Author: Chris Caron <lead2gold at gmail.com>
# Name: pgnux_repair
# Description: Simple wrapper to pg_resetxlog
# Syntax: pgnux_repair
 
##################################
# Define some simple variables
##################################
PGENGINE=/usr/bin
PGDATA=/var/lib/pgsql/data

# For SELinux we need to use 'runuser' not 'su'
if [ -x /sbin/runuser ]; then
    SU=/sbin/runuser
else
    SU=/bin/su
fi

# Safety Check
if [ $(whoami) != 'root' ]; then
   echo "Error: You must be 'root' to use this script."
   exit 1
fi

# Attempt to repair the WAL logs
$SU -l postgres -c "$PGENGINE/pg_resetxlog -f $PGDATA"

If it still doesn’t work, as a last resort consider these simple commands to bring your system back up again. Note that you should be the root user prior to preforming these steps:

# Stop the database (if it isn't already):
service postgresql stop

# Backup your configuration files
mkdir -p /tmp/postgresql_conf
cp -a /var/lib/pgsql/data/*.conf /tmp/postgresql_conf

# Destroy the database data directory. Unless you changed the
# default directory, the below command will do it:
# This is IRREVERSIBLE, be sure you have a recovery point to
# restore to before doing this otherwise you will make your
# situation worse!
find /var/lib/pgsql/ -mindepth 1 -maxdepth 1 rm -rf {} ;

# Initialize a new empty database; this will rebuild
# the /var/lib/pgsql/data directory for you
service postgresql initdb

# Restore all of your configuration you backed up above:
cp -af /tmp/postgresql_conf/*.conf /var/lib/pgsql/data/

# Just in case, for SELinux Users, make sure we're set okay
restorecon -R /var/lib/pgsql

# Start your database up
service postgresql start

# Now restore your databases using pgnux_restore and make sure you
# include the double dash '--' so that you load the system globals
# back as well:
#   pgnux_restore --,db1,db2,db3,db4,...
#
# If you've always been backing them all together, then you can
# restore them with a single command:
pgnux_restore --

# Your done! You should be back and running now and hopefully with
# minimal downtime.

Disclaimer

I can not be held responsible or liable for anything that goes wrong with your system. This blog is intended to be used as a resource only and will hopefully help you out in the time of need. These tools work for me and have been tested thoroughly using PostgreSQL v8.4 (which ships with CentOS 6.x). I have not tried any of these tools for PostgreSQL v9.x or higher and can not promise you anything, but I honestly can’t see a reason why they wouldn’t work.

Credit

Please note that this information took me several days to put together and test thoroughly. I may not blog often; but I want to re-assure the stability and testing I put into everything I intend share.

If you like what you see and wish to copy and paste this HOWTO, please reference back to this blog post at the very least. It’s really all I ask.

Configuring and Installing VSFTPD on CentOS 6

Introduction

FTP Servers have been around for a very long time. On could easily argue that they aren’t always the best option to choose anymore. But for those dealing with legacy systems, the FTP protocol provides one the easiest ways to share and distribute files across external systems. Despite the hate some might give the protocol; it still receives a lot a love from others just because it’s still very compatible with just about everything; heck, even web browsers such as Internet Explorer, Chrome, and Firefox (and many more) have the protocol built right into it.

In my case, I needed to set up an FTP server to help a client with some legacy software they use (and are familiar with). This blog is more or less just the steps I took to make it work in case anyone else is interested.

Security Concerns

FTP preforms all of it’s transactions in ‘plain text’ including it’s authentication. This wasn’t a problem back in 1980 when online security wasn’t an issue. This also isn’t a problem for sites offering anonymous file hosting services. But for everyone else, it pretty much leaves you susceptible to privacy issues and possible intrusions.

FTPS (not to be confused with SFTP) is a way of securing the FTP protocol for the systems that require it allowing you to eliminate the ‘plain text’ problem. But this requires the client uses software that can take advantage of it.

Some additional security concerns I wanted to consider:

  • Separate user accounts from the system ones. We don’t want people trying to guess our root password or access anyone’s home directory unless with specifically configure the server to allow it.

Setup VSFTPD

VSFTPD stands for Very Secure FTP Daemon and provides all the flexibility we need. It’s official website can be found here.

The following will set up VSFTPD into your CentOS/RedHat environment in a isolated manor. You see we want to disconnect the users that currently access your system from the users we create for VSFTPD for more flexibility and control.

# First fetch the required packages
# db4:       The Berkeley Database (Berkeley DB) which is a
#             quick and dirty way of storing our user accounts.
# db4-utils: This provides a tool we'll use to build a small user
#             database with.  Technically you can uninstall (just)
#             this package afterwards for added security after.
# openssl:   This is used to grant our FTP server FTPS support
# vsftpd:    The FTP/FTPS Daemon itself
yum -y install db4 db4-utils openssl vsftpd

# Create a password file: /etc/vsftpd/users.passwd
# This file contains all of the users you want to allow on the
# site in the structure:
# Line: Entry
#  1  | USERNAME1
#  2  | PASSWORD1
#  3  | USERNAME2
#  4  | PASSWORD2
# This below creates a simple user 'foobar' and a password
# of 'barfoo'.  This isn't the safest way to build your password
# file because it leaves a persons password set available in
# the shell history... But for the purpose of this tutorial:
echo foobar > /etc/vsftpd/users.passwd
echo barfoo >> /etc/vsftpd/users.passwd

# Protect our password file now from prying eyes:
chmod 600 /etc/vsftpd/users.passwd
chown root.root /etc/vsftpd/users.passwd

# Prepare a directory we want the foobar user to send it's files
# to. You can also just use a directory you already have in place
# or someone elses home directory.
mkdir -p /var/ftp/foobar

# Set a comfortable permission to this directory granting the ftp
# user (or any user account your later going to assign to this user
# read/write access)
chown root.ftp /var/ftp/foobar
chmod 775 /var/ftp/foobar

# Convert our plain-text password file into the Berkeley Database
# format. This is the only command that requires the db4-utils
# rpm package which you can uninstall if you want for security
# reasons after you run the below command.  You'll need to
# re-install it though if you ever want to add or update accounts
db_load -T -t hash 
   -f /etc/vsftpd/users.passwd /etc/vsftpd/virtual.users.db

# Protect our new (Berkley) database:
chmod 600 /etc/vsftpd/virtual.users.db
chown root.root /etc/vsftpd/virtual.users.db

# Prepare our virtual user directory; this is where we can
# optionally place over-riding configuration for each user
# we create in the Berkley database above.
mkdir -p /etc/vsftpd/virtual.users
chmod 700 /etc/vsftpd/virtual.users
chown root.root /etc/vsftpd/virtual.users

# Create PAM Module that points to our new database.
# Note: you do not provide the '.db' extension when creating
#       this file. The file is valid as you see it below.
cat << _EOF > /etc/pam.d/vsftpd-virtual
auth     required pam_userdb.so db=/etc/vsftpd/virtual.users
account  required pam_userdb.so db=/etc/vsftpd/virtual.users
session  required pam_loginuid.so
_EOF

# Protect our Module
chmod 644 /etc/pam.d/vsftpd-virtual
chown root.root /etc/pam.d/vsftpd-virtual

# Create an empty jail directory.  This is used for default
# configurations only. A well configured system won't even use
# this; but it's still good to have since we'll be referencing
# it in our configuration. This will become the default
# directory a user connects to if they aren't otherwise
# configured to go to another location.
mkdir -p /var/empty/vsftpd/
chown nobody.ftp /var/empty/vsftpd/
chmod 555 /var/empty/vsftpd/

# Now we want to allow FTPS support, we'll need an SSL key to
# do it with.  If you already have one, you can skip this step.
# Otherwise, the following will just generate you a self-signed
# key as a temporary solution.
openssl req -nodes -new -x509 -days 730 -sha256 -newkey rsa:2048 
   -keyout /etc/pki/tls/private/nuxref.com.key 
   -out /etc/pki/tls/certs/nuxref.com.crt 
   -subj "/C=7K/ST=Westerlands/L=Lannisport/O=NuxRef/OU=IT/CN=nuxref.com"

# Protect our Keys
chmod 400 /etc/pki/tls/private/nuxref.com.key; # Private Key
chmod 444 /etc/pki/tls/certs/nuxref.com.crt; # Public Certificate

# Create ourselves a little banner we can use to at least alert
# human intruders that they are in fact being monitored.  This
# scare tactic may or may not work, but if you ever have a breach
# of security, you may need to reference that you gave the user
# ample warning that they were violating someones rights by
# continuing.  Feel free to adjust the banner to your liking.
cat << _EOF > /etc/banner
* - - - - - - W A R N I N G - - - - - - - W A R N I N G - - - - - *
*                                                                 *
* The use of this system is restricted to authorized users. All   *
* information and communications on this system are subject to    *
* review, monitoring and recording at any time, without notice or *
* permission.                                                     *
*                                                                 *
* Unauthorized access or use shall be subject to prosecution.     *
*                                                                 *
* - - - - - - W A R N I N G - - - - - - - W A R N I N G - - - - - *
_EOF

# Protect our banner
chmod 640 /etc/banner

At this point we have our environment set up the way we want it. The next step is to create our VSFTPD configuration.

# Lets first backup original configuration file
mv /etc/vsftpd/vsftpd.conf /etc/vsftpd/vsftpd.conf.orig

# Create new configuration
cat << _EOF > /etc/vsftpd/vsftpd.conf
# --------------------------------------------------------------
# Base Configuration
# --------------------------------------------------------------
anon_world_readable_only=NO
anonymous_enable=NO
chroot_local_user=YES
hide_ids=YES
listen=YES
local_enable=YES
max_clients=10
max_per_ip=3
nopriv_user=ftp
pasv_min_port=64000
pasv_max_port=64100
session_support=NO
user_config_dir=/etc/vsftpd/virtual.users
userlist_enable=YES
use_localtime=YES
xferlog_enable=YES
xferlog_std_format=NO
log_ftp_protocol=YES
pam_service_name=vsftpd-virtual
banner_file=/etc/banner
reverse_lookup_enable=NO
# --------------------------------------------------------------
# Secure Configuration (FTPS)
# --------------------------------------------------------------
ssl_enable=YES
virtual_use_local_privs=NO
allow_anon_ssl=NO
# forcing SSL makes the FTP portion of your site disabled and it
# will only operate as FTPS.  This may or may not be what you
# want.
force_local_data_ssl=NO
force_local_logins_ssl=NO
ssl_tlsv1=YES
ssl_sslv2=NO
ssl_sslv3=NO
# Point to our certificates
rsa_cert_file=/etc/pki/tls/certs/nuxref.com.crt
rsa_private_key_file=/etc/pki/tls/private/nuxref.com.key
require_ssl_reuse=NO
ssl_ciphers=HIGH:!MD5:!ADH
# --------------------------------------------------------------
# FTP Configuration
# --------------------------------------------------------------
async_abor_enable=YES
ftp_data_port=20
connect_from_port_20=YES
# --------------------------------------------------------------
# Default Anonymous Restrictions (over-ride per virtual user)
# --------------------------------------------------------------
guest_enable=NO
guest_username=nobody
# Default home directory once logged in
local_root=/var/empty/vsftpd
# write_enabled is required if the user is to make use of any of
# the anon_* commands below
write_enable=NO
# give the user the ability to make directories
anon_mkdir_write_enable=NO
# give the user the ability delete and overwrite files
anon_other_write_enable=NO
# give the user the ability upload new files
anon_upload_enable=NO
# Give the user permission to do a simple directory listings
dirlist_enable=NO
# Give the user permission to download files
download_enable=NO
# if the user has can upload or make new directories, then this
# will be the umask applied to them
anon_umask=0002
# delete failed uploads (speaks for itself)
delete_failed_uploads=YES
_EOF

# Protect our configuration
chmod 600 /etc/vsftpd/vsftpd.conf
chown root.root /etc/vsftpd/vsftpd.conf

Technically we’re done now, but because we intentionally specified very restrictive user access rights, our foobar user we created will only connect to the /var/empty/vsftpd directory with no access rights. Therefore, our final step is to create an additional configuration file for the foobar account granting him read/write access to /var/ftp/foobar.

# The file you write to in the /etc/vsftpd/virtual.users/ 'must' be the same
# name as the user(s) you created to over-ride their permissions!
cat << _EOF > /etc/vsftpd/virtual.users/foobar
local_root=/var/ftp/foobar
# --------------------------------------------------------------
# User
# --------------------------------------------------------------
guest_enable=YES
# Set this to any system user you want
guest_username=ftp
local_root=/var/ftp/foobar
# --------------------------------------------------------------
# Permissions
# --------------------------------------------------------------
# write_enabled is required if the user is to make use of any of
# the anon_* commands below
write_enable=YES
# give the user the ability to make directories
anon_mkdir_write_enable=YES
# give the user the ability delete and overwrite files
anon_other_write_enable=YES
# give the user the ability upload new files
anon_upload_enable=YES
# Give the user permission to do a simple directory listings
dirlist_enable=YES
# Give the user permission to download files
download_enable=YES
# if the user has can upload or make new directories, then this
# will be the umask applied to them
anon_umask=0002
# delete failed uploads (speaks for itself)
delete_failed_uploads=NO
_EOF

# Protect our foobar permission file
chmod 600 /etc/vsftpd/virtual.users/foobar
chown root.root /etc/vsftpd/virtual.users/foobar

You are now complete, you can start VSFTPD at any time:

# Have VSFTPD start after each system reboot
chkconfig --level 345 vsftpd on

# Start VSFTPD if it isn't already running
service vsftpd status || service vsftpd start

It’s worth noting that if you ever change any of the configuration or add more, you will need to restart the VSFTPD server in order for the changes you made to take effect:

# Restarting the vsftpd server is simple:
service vsftpd restart

Firewall Configuration

The FTP firewall configuration can get complicated especially when the ephemeral ports it chooses to open are random (when operating it in Passive mode). If you scan through the configuration file above, you’ll see that we’ve specified this range to be between 64000 and 64100.

In a nutshell, if you choose to use/enable FTPS (which I strongly recommend you do), the firewall configuration will look like this (/etc/sysconfig/iptables):

#....
#---------------------------------------------------------------
# FTP Traffic
#---------------------------------------------------------------
-A INPUT -p tcp -m state --state NEW --dport 21 -j ACCEPT
# non-encrypted ftp connections (ip_contrack_ftp) module looks
# after these ports, however for the encrypted sessions it can't
# spy so we need to disable ip_contrack_ftp and just open the
# port range himself
-A INPUT -p tcp -m state --state NEW --dport 64000:64100 -j ACCEPT
#...

However, if (and only if) you choose not to use FTPS and strictly operate using FTP only then your configuration will look as follows:

  1. /etc/sysconfig/iptables
    #....
    #---------------------------------------------------------------
    # FTP Traffic
    #---------------------------------------------------------------
    -A INPUT -p tcp -m state --state NEW --dport 21 -j ACCEPT
    #...
  2. /etc/sysconfig/iptables-config
    This file requires you to add ip_conntrack_ftp to the variable IPTABLES_MODULES which is near the top of this file. You may also need to update /etc/sysconfig/ip6tables-config if you are using ip6; the change is the same. This keeps the entire range of 64000 to 64100 ports closed be default and through packet sniffing, they are opened on demand.
#...
IPTABLES_MODULES="ip_conntrack_ftp"
#...

If you’re not running any other modules, you can use the following one liner to update the file:

sed -i -e 's/^IPTABLES_MODULES=.*/IPTABLES_MODULES="ip_conntrack_ftp"/g' 
    /etc/sysconfig/iptables-config
sed -i -e 's/^IPTABLES_MODULES=.*/IPTABLES_MODULES="ip_conntrack_ftp"/g' 
    /etc/sysconfig/ip6tables-config

Be sure to reload your firewall configuration once you have these in place:

# Restart the firewall
service iptables restart

Fail2Ban Brute Force Configuration

The final thing you should consider if your server will be available via the internet is some bruit force prevention. I really recommend you read my blog on Securing Your CentOS 6 System, specifically my blurb on Fail2Ban which I think all systems should always have running. Fail2ban allows you to track all users hitting your FTP server and take immediate action on preventing further access to this potential intruder.

The configuration is as follows (remember to set the email to what you want it as where I’ve specified your@email.goes.here so you can be notified of any intrusions that take place.

cat << _EOF >> /etc/fail2ban/jail.conf
[my-vsftpd-iptables]

enabled  = true
filter   = vsftpd
action   = iptables[name=VSFTPD, port=ftp, protocol=tcp]
           sendmail-whois[name=VSFTPD, dest=your@email.goes.here]
logpath  = /var/log/vsftpd.log
_EOF

Be sure to reload your Fail2Ban configuration once this is done

# Restart Fail2Ban
service fail2ban restart

Test Our Configuration

Make sure you have an ftp tool installed into your environment like lftp or even a GUI based tool like FileZilla that supports both FTP and FTPS. The old Linux tool ‘ftp’ will only allow you to test the un-encrypted connection.

# Install lftp to keep things simple
yum -y install lftp

# Test our server (FTP)
[root@nuxref ~]# lftp ftp://foobar:barfoo@localhost
lftp foobar@localhost:~> pwd
ftp://foobar:barfoo@localhost
lftp foobar@localhost:~> exit

# Test our server (FTPS)
[root@nuxref ~]# lftp ftps://foobar:barfoo@localhost
lftp foobar@localhost:~> pwd
ftps://foobar:barfoo@localhost
lftp foobar@localhost:~> exit

First lets just test the basic FTP portion (plain-text):

# Connect to our server
[root@nuxref ~]# ftp localhost
Connected to localhost (127.0.0.1).
220-* - - - - - - W A R N I N G - - - - - - - W A R N I N G - - - - - *
220-*                                                                 *
220-* The use of this system is restricted to authorized users. All   *
220-* information and communications on this system are subject to    *
220-* review, monitoring and recording at any time, without notice or *
220-* permission.                                                     *
220-*                                                                 *
220-* Unauthorized access or use shall be subject to prosecution.     *
220-*                                                                 *
220-* - - - - - - W A R N I N G - - - - - - - W A R N I N G - - - - - *
220 
Name (localhost:nuxref): foobar
331 Please specify the password.
Password:
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> ls
227 Entering Passive Mode (127,0,0,1,250,83).
150 Here comes the directory listing.
226 Directory send OK.
FileZilla FTPS Configuration
FileZilla FTPS Configuration

If you plan on using FileZilla as your solution, You need to configure it to connect as the FTP protocol with the Encryption set to Requires explicit FTP over TLS similar to the screen shot I provided.

You may or may not have to accept your certificate afterwards that we created earlier in this blog.

FileZilla On-Going FTPS Bug

The FileZilla Client is a pretty sweet application for those who like to work with a GUI instead of a command line. Those who choose to test their configuration with this should just know that there is an outstanding bug with FileZilla and the FTPS protocol. Hence, if you’re using Filezilla to to test your new VSFTPD server and it’s not working, it might not be your configuration at the end of the day. The versions seem to be hit and miss of which cause the bug to surface; reports of v3.5.2 working and v3.5.3 not. That all said, I’m using v3.7.3 and am not having a problem.

Here is the ticket #7873 that identifies the problem. One thing that is mentioned is that an earlier version of FileZilla works perfectly fine (specifically v3.5.2). But I’ve also had no problem with the current version (at the time was v3.7.3). I guess my main point is… don’t panic if you see this error; it’s not necessarily anything you’ve configured incorrectly. If you followed this blog then you shouldn’t have any issue at all.

Credit

I may not blog often; but I want to re-assure the stability and testing I put into everything I intend share.

If you like what you see and wish to copy and paste this HOWTO, please reference back to this blog post at the very least; it’s really all that I ask of you.

Sources:

Configuring and Installing NRPE and NSCA into Nagios Core 4 on CentOS 6

Introduction

About a month ago I wrote (and updated) an article on how to install Nagios Core 4 onto your system. I’m a bit of a perfectionist, so I’ve rebuilt the packages a little to accommodate my needs. Now I thought it might be a good idea to introduce some of the powerful extensions you can get for Nagios.

For an updated solution, you may wish to check out the following:

  • NRDP for Nagios Core on CentOS 7.x: This blog explains how awesome NRDP really is and why it might become a vital asset to your own environment. This tool can be used to replace NSCA’s functionality. The blog also provides the first set of working RPMs (with SELinux support of course) of it’s kind to support it.
  • NRPE for Nagios Core on CentOS 7.x: This blog explains how to set up NRPE (v3.x) for your Nagios environment. At the time this blog was written, there was no packaging of it’s kind for this version.

RPM Solution

RPMs provide a version control and an automated set of scripts to configure the system how I want it. The beauty of them is that if you disagree with something the tool you’re packaging does, you can feed RPMs patch files to accommodate it without obstructing the original authors intention.

Now I won’t lie and claim I wrote these SPEC files from scratch because I certainly didn’t. I took the stock ones that ship with these products (NRPE and NSCA) and modified them to accommodate and satisfy my compulsive needs. 🙂

My needs required a bit more automation in the setup as well as including:

  • A previous Nagios requirement I had was a /etc/nagios/conf.d directory to behave similar to how Apache works. I wanted to be able to drop configuration files into here and just have it work without re-adjusting configuration files. In retrospect of this, these plugins are a perfect example of what can use this folder and work right out of the box.
  • These new Nagios plugins should adapt to the new nagiocmd permissions. The nagioscmd group permission was a Nagios requirement I had made in my previous blog specifically for the plugin access.
  • NSCA should prepare some default configuration to make it easier on an administrator.
  • NSCA servers that don’t respond within a certain time should advance to a critical state. This should be part of the default (optional) configuration one can use.
  • Both NRPE and NSCA should plug themselves into Nagios silently without human intervention being required.
  • Both NRPE and NSCA should log independently to their own controlled log file that is automatically rotated by the system when required.

Nagios Enhancement Focus

The key things I want to share with you guys that you may or may not find useful for your own environment are the following:

  • Nagios Remote Plugin Executor (NRPE): NRPE (officially accessed here) provides a way to execute all of the Nagios monitoring tools on a remote server. These actions are all preformed through a secure (private) connection to the remote server and then reported back to Nagios. NRPE can allow you to monitor servers that are spread over a WAN (even the internet) from one central monitoring server. This is truly the most fantastic extension of Nagios in my opinion.
    NRPE High Level Overview
    NRPE High Level Overview
  • Nagios Service Check Acceptor (NSCA): NSCA (officially accessed here) provides a way for external applications to report their status directly to the Nagios Server on their own. This solution still allows the remote monitoring of a system by taking the responsibility off of the status checks off of Nagios. However the fantastic features of Nagios are still applicable: You are still centrally monitoring your application and Nagios will immediately take action in notifying you if your application stops responding or reports a bad status. This solution is really useful when working with closed systems (where opening ports to other systems is not an option).
    NSCA High Level Overview
    NSCA High Level Overview

Just give me your packaged RPMS

Here they are:

How do I make these packages work for me?

In all cases, the RPMs take care of just about everything for you, so there isn’t really much to do at this point. Some considerations however are as follows:

  • NRPE
    NRPE - Nagios Remote Plugin Executor
    NRPE – Nagios Remote Plugin Executor

    In an NRPE setup, Nagios is always the client and all of the magic happens when it uses the check_nrpe plugin. Most of NRPE’s configuration resides at the remote server that Nagios will monitor. In a nutshell, NRPE will provide the gateway to check a remote system’s status but in a much more secure and restrictive manor than the check_ssh which already comes with the nagios-plugins package. The check_ssh requires you to create a remote user account it can connect with for remote checks. This can leave your system vulnerable to an attack since you can do a lot more damage with a compromised SSH account. However check_nrpe uses the NRPE protocol and can only return what you let it; therefore making it a MUCH safer choice then check_ssh!

    You’ll want to install nagios-plugins-nrpe on the same server your hosting Nagios on:

    # Download NRPE
    wget --output-document=nagios-plugins-nrpe-2.15-1.el6.x86_64.rpm http://repo.nuxref.com/centos/6/en/x86_64/custom/nagios-plugins-nrpe-2.15-4.el6.nuxref.x86_64.rpm
    
    # Now install it
    yum -y localinstall nagios-plugins-nrpe-2.15-1.el6.x86_64.rpm
    

    Again I must stress, the above setup will work right away presuming you chose to use my custom build of Nagios introduced in my blog that went with it.

    Just to show you how everything works, we’ll make the Nagios Server the NRPE Server as well. In real world scenario, this would not be the case at all! But feel free to treat the setup example below on a remote system as well because it’s configuration will be identical! 🙂

    # Install our NRPE Server
    wget --output-document=nrpe-2.15-1.el6.x86_64.rpm http://repo.nuxref.com/centos/6/en/x86_64/custom/nrpe-2.15-4.el6.nuxref.x86_64.rpm
    
    # Install some Nagios Plugins we can configure NRPE to use
    wget --output-document=nagios-plugins-1.5-1.x86_64.rpm http://repo.nuxref.com/centos/6/en/x86_64/custom/nagios-plugins-1.5-5.el6.nuxref.x86_64.rpm
    
    # Now Install it
    yum -y localinstall nrpe-2.15-1.el6.x86_64.rpm 
       nagios-plugins-1.5-1.x86_64.rpm
    # This tool requires xinetd to be running; start it if it isn't
    # already running
    service xinetd status || service xinetd start
    
    # Make sure our system will always start xinetd
    # even if it's rebooted
    chkconfig --level 345 xinetd on
    

    Now we can test our server by creating a test configuration:

    # Create a NRPE Configuration our server can accept
    cat << _EOF > /etc/nrpe.d/check_mail.cfg
    command[check_mailq]=/usr/lib64/nagios/plugins/check_mailq -c 100 -w 50
    _EOF
    
    # Create a temporary test configuration to work with:
    cat << _EOF > /etc/nagios/conf.d/nrpe_test.cfg
    define service{
       use                 local-service
       service_description Check Users
       host_name           localhost
       # check_users is already defined for us in /etc/nagios/nrpe.cfg
    	check_command		  check_nrpe!check_users
    }
    
    # Test our new custom one we just created above
    define service{
       use                 local-service
       service_description Check Mail Queue
       host_name           localhost
       # Use the new check_mailq we defined above in /etc/nrpe.d/check_mail.cfg
    	check_command		  check_nrpe!check_mailq
    }
    _EOF
    
    # Reload Nagios so it sees our new configuration defined in
    # /etc/nagios/conf.d/*
    service nagios reload
    
    # Reload xinetd so nrpe sees our new configuration defined in
    # /etc/nrpe.d/*
    service xinetd reload
    

    We can even test our connection manually by calling the command:

    # This is what the output will look like if everything is okay:
    /usr/lib64/nagios/plugins/check_nrpe -H localhost -c check_mailq
    OK: mailq is empty|unsent=0;50;100;0
    

    Another scenario you might see (when setting on up on your remote server) is:

    /usr/lib64/nagios/plugins/check_nrpe -H localhost -c check_mailq
    CHECK_NRPE: Error - Could not complete SSL handshake.
    

    Uh oh, Could not complete SSL handshake.! What does that mean?
    This is the most common error people see with the NRPE plugin. If you Google it, you’ll get an over-whelming amount of hits suggesting how you can resolve the problem. I found this link useful.
    That all said, I can probably tell you right off the bat why it isn’t working for you. Assuming you’re using the packaging I provided then it’s most likely because your NRPE Server is denying the requests your Nagios Server is making to it.

    To fix this, access your NRPE Server and open up /etc/xinetd/nrpe in an editor of your choice. You need to allow your Nagios Server access by adding it’s IP address to the only_from entry. Or you can just type the following:

    # Set your Nagios Server IP here:
    NAGIOS_SERVER=192.168.192.168
    
    # If you want to keep your previous entries and append the server
    # you can do the following (spaces delimit the servers):
    sed -i -e "s|^(.*only_from[^=]+=)[ t]*(.*)|1 2 $NAGIOS_SERVER|g" 
       /etc/xinetd.d/nrpe
    
    # The below command is fine too to just replace what is there
    # with the server of your choice (you can use either example
    sed -i -e "s|^(.*only_from[^=]+=).*|1 $NAGIOS_SERVER|g" 
       /etc/xinetd.d/nrpe
    
    # When your done, restart xinetd to update it's configuration
    service xinetd reload
    

    Those who didn’t receive the error I showed above, it’s only because your using your Nagios Server as your NRPE Server too (which the xinetd tool is pre-configured to accept by default). So please pay attention to this when you start installing the NRPE server remotely.

    You will want to install nagios-plugins-nrpe on to your NRPE Server as well granting you access to all the same great monitoring tools that have already been proven to work and integrate perfectly with Nagios. This will save you a great deal of effort when setting up the NRPE status checks.

    As a final note, you may want to make sure port 5666 is open on your NRPE Server’s firewall otherwise the Nagios Server will not be able to preform remote checks.

    ## Open NRPE Port (as root)
    iptables -A INPUT -m state --state NEW -m tcp -p tcp --dport 5666 -j ACCEPT
    
    # consider adding this change to your iptables configuration
    # as well so when you reboot your system the port is
    # automatically open for you. See: /etc/sysconfig/iptables
    # You'll need to add a similar line as above (without the
    # iptables reference)
    # -A INPUT -m state --state NEW -m tcp -p tcp --dport 5666 -j ACCEPT
    
  • NSCA
    NSCA - Nagios Service Check Acceptor
    NSCA – Nagios Service Check Acceptor

    Remember, NSCA is used for systems that connect to you remotely (instead of you connecting to them (what NRPE does). This is a perfect choice plugin for systems you do not want to open ports up to unnecessarily on your remote system. That said, it means you need to open up ports on your Monitoring (Nagios) server instead.

    You’ll want to install nsca on the same server your hosting Nagios on:

    # Download NSCA
    wget --output-document=nsca-2.7.2-9.el6.x86_64.rpm http://repo.nuxref.com/centos/6/en/x86_64/custom/nsca-2.7.2-10.el6.nuxref.x86_64.rpm
    
    # Now install it
    yum -y localinstall nsca-2.7.2-9.el6.x86_64.rpm
    
    # This tool requires xinetd to be running; start it if it isn't
    # already running
    service xinetd status || service xinetd start
    
    # Make sure our system will always start xinetd
    # even if it's rebooted
    chkconfig --level 345 xinetd on
    
    # SELinux Users may wish to turn this flag on if they intend to allow it
    # to call content as root (using sudo) which it must do for some status checks.
    setsebool -P nagios_run_sudo on
    

    The best way to test if everything is working okay is by also installing the nsca-client on the same machine we just installed NSCA on (above). Then we can simply create a test passive service to test everything with. The below setup will work presuming you chose to use my custom build of Nagios introduced in my blog that went with it.

    # First install our NSCA client on the same machine we just installed NSCA
    # on above.
    wget http://repo.nuxref.com/centos/6/en/x86_64/custom/nsca-client-2.7.2-10.el6.nuxref.x86_64.rpm
    
    # Now install it
    yum -y localinstall nsca-client-2.7.2-9.el6.x86_64.rpm
    
    # Create a temporary test configuration to work with:
    cat << _EOF > /etc/nagios/conf.d/nsca_test.cfg
    # Define a test service. Note that the service 'passive_service'
    # is already predefined in /etc/nagios/conf.d/nsca.cfg which was
    # placed when you installed my nsca rpm
    define service{
       use                 passive_service
       service_description TestMessage
       host_name           localhost
    }
    _EOF
    
    # Now reload Nagios to it reads in our new configuration
    # Note: This will only work if you are using my Nagios build
    service nagios reload
    

    Now that we have a test service set up, we can send it different nagios status through the send_nsca binary that was made available to us after installing nsca-client.

    # Send a Critical notice to Nagios using our test service
    # and send_nsca. By default send_nsca uses the '<tab>' as a
    # delimiter, but that is hard to show in a blog (it can get mixed up
    # with the space.  So in the examples below i add a -d switch
    # to adjust what the delimiter in the message.
    # The syntax is simple:
    #    hostname,nagios_service,status_code,status_msg
    #
    # The test service we defined above identifies both the
    # 'host_name' and 'service_description' define our first 2
    # delimited columns below. The status_code is as simple as:
    #       0 : Okay
    #       1 : Warning
    #       2 : Critical
    # The final delimited entry is just the human readable text
    # we want to pass a long with the status.
    #
    # Here we'll send our critical message:
    cat << _EOF | /usr/sbin/send_nsca -H 127.0.0.1 -d ','
    localhost,TestMessage,2,This is a Test Error
    _EOF
    
    # Open your Nagios screen (http://localhost/nagios) at this point and watch the
    # status change (it can take up to 4 or 5 seconds or so to register
    # the command above).
    
    # Cool?  Here is a warning message:
    cat << _EOF | /usr/sbin/send_nsca -H 127.0.0.1 -d ',' -c /etc/nagios/send_nsca.cfg
    localhost,TestMessage,1,This is a Test Warning
    _EOF
    
    # Check your warning on Nagios, when your happy, here is your
    # OKAY message:
    cat << _EOF | /usr/sbin/send_nsca -H 127.0.0.1 -d ',' -c /etc/nagios/send_nsca.cfg
    localhost,TestMessage,0,Life is good!
    _EOF
    

    Since NSCA requires you to listen to a public port, you’ll need to know this last bit of information to complete your NSCA configuration. Up until now the package i provide only open full access to localhost for security reasons. But you’ll need to take the next step and allow your remote systems to talk to you.

    NSCA uses port 5667, so you’ll want to make sure your firewall has this port open using the following command:

    ## Open NSCA Port (as root)
    iptables -A INPUT -m state --state NEW -m tcp -p tcp --dport 5667 -j ACCEPT
    
    # consider adding this change to your iptables configuration
    # as well so when you reboot your system the port is
    # automatically open for you. See: /etc/sysconfig/iptables
    # You'll need to add a similar line as above (without the
    # iptables reference)
    # -A INPUT -m state --state NEW -m tcp -p tcp --dport 5667 -j ACCEPT
    

    Another security in place with the NSCA configuration you installed out of
    the box is that it is being managed by xinetd. The configuration can
    be found here: /etc/xinetd.d/nsca. The security restriction in place that you’ll want to pay close attention to is line 16 which reads:

    only_from = 127.0.0.1 ::1

    If you remove this line, you’ll allow any system to connect to yours; this is a bit unsafe but an option. Personally, I recommend that you individually add each remote system you want to monitor to this line. Use a space to separate more the one system.

    You can consider adding more security by setting up a NSCA paraphrase which will reside in /etc/nagios/nsca.cfg to which you can place the same paraphrase in all of the nsca-clients you set up by updating /etc/nagios/send_nsca.cfg.

    Consider our example above; I can do the following to add a paraphrase:

    # Configure Client
    sed -i -e 's/^#*password=/password=ABCDEFGHIJKLMNOPQRSTUVWXYZ/g' 
       /etc/nagios/send_nsca.cfg
    # Configure Server
    sed -i -e 's/^#*password=/password=ABCDEFGHIJKLMNOPQRSTUVWXYZ/g' 
       /etc/nagios/nsca.cfg
    # Reload xinetd so it rereads /etc/nagios/nsca.cfg
    service xinetd reload
    

I don’t trust you, I want to repackage this myself!

As always, I will always provide you a way to build the source code from scratch if you don’t want to use what I’ve already prepared. I use mock for everything I build so I don’t need to haul in development packages into my native environment. You’ll need to make sure mock is setup and configured properly first for yourself:

# Install 'mock' into your environment if you don't have it already.
# This step will require you to be the superuser (root) in your native
# environment.
yum install -y mock

# Grant your normal every day user account access to the mock group
# This step will also require you to be the root user.
usermod -a -G mock YourNonRootUsername

At this point it’s safe to change from the ‘root‘ user back to the user account you granted the mock group privileges to in the step above. We won’t need the root user again until the end of this tutorial when we install our built RPM.

Just to give you a quick summary of what I did, here are the new spec files and patch files I created:

  • NSCA RPM SPEC File: Here is the enhanced spec file I used (enhancing the one already provided in the EPEL release found on pkgs.org). At the time I wrote this blog, the newest version of NSCA was v2.7.2-8. This is why I repackaged it as v2.7.2-9 to include my enhancements. I created 2 patches along with the spec file enhancements.
    nrpe.conf.d.patch was created to provide a working NRPE configuration right out of the box (as soon as it was installed) and nrpe.xinetd.logrotate.patch was created to pre-configure a working xinetd server configuration.
  • NRPE RPM SPEC File: Here is the enhanced spec file I used (enhancing the one already provided in the EPEL release found on pkgs.org). At the time I wrote this blog, the newest version of NRPE was v2.14-5. However v2.15 was available off of the Nagios website so this is why I repackaged it as v2.15-1 to include my enhancements.
    nsca.xinetd.logrotate.patch was the only patch I needed to create to prepare a NSCA xinetd server working out of the box.

Everything else packaged (patches and all) are the same ones carried forward from previous versions by their package managers.

Rebuild your external monitoring solutions:

Below shows the long way of rebuilding the RPMs from source.

# Perhaps make a directory and work within it so it's easy to find
# everything later
mkdir nagiosbuild
cd nagiosbuild
###
# Now we want to download all the requirements we need to build
###
# Prepare our mock environment
###
# Initialize Mock Environment
mock -v -r epel-6-x86_64 --init

# NRPE (v2.15)
wget http://repo.nuxref.com/centos/6/en/source/custom/nrpe-2.15-4.el6.nuxref.src.rpm 
mock -v -r epel-6-x86_64 --copyin nrpe-2.15-1.el6.src.rpm /builddir/build

# NSCA (v2.7.2)
wget http://repo.nuxref.com/centos/6/en/source/custom/nsca-2.7.2-10.el6.nuxref.src.rpm 
mock -v -r epel-6-x86_64 --copyin nsca-2.7.2-9.el6.src.rpm /builddir/build

#######################
### THE SHORT WAY #####
#######################
# Now, the short way to rebuild everything is through these commands:
mock -v -r epel-6-x86_64 --resultdir=$(pwd)/results 
   --rebuild  nrpe-2.15-1.el6.src.rpm  nsca-2.7.2-9.el6.src.rpm

# You're done; You can find all of your rpms in a results directory
# in the same location you typed the above command in.  You can 
# alternatively rebuild everything the long way allowing you to
# inspect the content in more detail and even change it for your
# own liking

#######################
### THE LONG WAY  #####
#######################
# Install NRPE Dependencies
mock -v -r epel-6-x86_64 --install 
   autoconf automake libtool openssl-devel tcp_wrappers-devel

# Install NSCA Dependencies
mock -v -r epel-6-x86_64 --install 
   tcp_wrappers-devel libmcrypt-devel

###
# Build Stage
###
# Shell into our enviroment
mock -v -r epel-6-x86_64 --shell

# Change to our build directory
cd builddir/build

# Install our SRPMS (within our mock jail)
rpm -Uhi nsca-*.src.rpm nrpe-*.src.rpm

# Now we'll have placed all our content in the SPECS and SOURCES
# directory (within /builddir/build).  Have a look to verify
# content if you like

# Build our RPMS
rpmbuild -ba SPECS/*.spec

# we're now done with our mock environment for now; Press Ctrl-D to
# exit or simply type exit on the command line of our virtual
# environment
exit

###
# Save our content that we built in the mock environment
###

#NRPE
mock -v -r epel-6-x86_64 --copyout /builddir/build/SRPMS/nrpe-2.15-1.el6.src.rpm .
mock -v -r epel-6-x86_64 --copyout /builddir/build/RPMS/nrpe-2.15-1.el6.x86_64.rpm .
mock -v -r epel-6-x86_64 --copyout /builddir/build/RPMS/nagios-plugins-nrpe-2.15-1.el6.x86_64.rpm .
mock -v -r epel-6-x86_64 --copyout /builddir/build/RPMS/nrpe-debuginfo-2.15-1.el6.x86_64.rpm .

#NSCA
mock -v -r epel-6-x86_64 --copyout /builddir/build/SRPMS/nsca-2.7.2-9.el6.src.rpm .
mock -v -r epel-6-x86_64 --copyout /builddir/build/RPMS/nsca-2.7.2-9.el6.x86_64.rpm .
mock -v -r epel-6-x86_64 --copyout /builddir/build/RPMS/nsca-client-2.7.2-9.el6.x86_64.rpm .
mock -v -r epel-6-x86_64 --copyout /builddir/build/RPMS/nsca-debuginfo-2.7.2-9.el6.x86_64.rpm .

# *Note that all the commands that interact with mock I pass in 
# the -v which outputs a lot of verbose information. You don't
# have to supply it; but I like to see what is going on at times.

# **Note: You may receive this warning when calling the '--copyout'
# above:
# WARNING: unable to delete selinux filesystems 
#    (/tmp/mock-selinux-plugin.??????): #
#    [Errno 1] Operation not permitted: '/tmp/mock-selinux-plugin.??????'
#
# This is totally okay; and is safe to ignore, the action you called
# still worked perfectly; so don't panic!

So where do I go from here?
NRPE and NSCA are both fantastic solutions that can allow you to tackle any monitoring problem you ever had. In this blog here I focus specifically on Linux, but these tools are also available on Microsoft Windows as well. You can easily have 1 Nagios Server manage thousands of remote systems (of all operating system flavours). There are hundreds of fantastic tools to monitor all mainstream applications used today (Databases, Web Servers, etc). Even if your trying to support a custom application you wrote. If you can interface with your application using the command line interface, well then Nagios can monitor it for you. You only need to write a small script with this in mind:

  • Your script should always have an exit code of 0 (zero) if everything is okay, 1 (one) if you want to raise a warning, and 2 (two) if you want to raise a critical alarm.
  • No matter what the exit code is, you should also echo some kind of message that someone could easily interpret what is going on.

There is enough information in this blog to do the rest for you (as far as creating a Nagios configuration entry for it goes). If you followed the 2 rules above, then everything should ‘just work’. It’s truely that easy and powerful.

How do I decide if I need NSCA or NRPE?

NRPE & NSCA High Level Overview
NRPE & NSCA High Level Overview

NRPE makes it Nagios’s responsibility to check your application where as NSCA makes it your applications responsible to report its status. Both have their pros and cons. NSCA could be considered the most secure approach because at the end of the day the only port that requires opening is the one on the Nagios server. NSCA does not use a completely secure connection (but there is encryption none the less). NRPE is very secure and doesn’t require you to really do much since it just simply works with the nagios-plugins already available. It litterally just extends these existing Nagios local checks to remote ones. NSCA requires you to configure a cron, or adjust your applications in such a way that it frequently calls the send_nsca command. NSCA can be a bit more difficult to set up but creates some what of a heartbeat between you and the system monitoring it (which can be a good thing too). I pre-configured the NSCA server with a small tweak that will automatically set your application to a critical state if a send_nsca call is missed for an extended period of time.

Always consider that the point of this blog was to state that you can use both at the same time giving you total flexibility over all of your systems that require monitoring.

Credit

All of the custom packaging in this blog was done by me personally. I took the open source available to me and rebuilt it to make it an easier solution and decided to share it. If you like what you see and wish to copy and paste this HOWTO, please reference back to this blog post at the very least. It’s really all I ask.

Sources

I referenced the following resources to make this blog possible:

  • The blog I wrote earlier that is recommended you read before this one:Configuring and Installing Nagios Core 4 on CentOS 6
  • Official NRPE download link; I used all of the official documentation to make the NRPE references on this blog possible.
  • A document identifying the common errors you might see and their resolution here.
  • Official NSCA download link; I used all of the official documentation to make the NSCA references on this blog possible.
  • The NRPE and NSCA images I’m reposting on this blog were taking straight from their official sites mentioned above.
  • Linux Packages Search (pkgs.org) was where I obtained the source RPMs as well as their old SPEC files. These would be a starting point before I’d expand them.
  • A bit outdated, but a great (and simple) representation of how NSCA works with Nagios can be seen here.