
Using Ansible to Fix CFEngine (after a trust failure as a result of re-addressing policy hub)

One of the things I have been given (cursed with) in my life in IT is maintenance of CFEngine. CFEngine is one of the oldest configuration management systems for Linux and other Unixes (and, I think, Windows these days), though it is typically left by the wayside. I'd love to drastically refactor and clean up our setup, because it's grown pretty organically.

Anyway, it's on some old equipment and I need to migrate it to a new IP to keep management happy from a risk perspective. How bad could that get? Turns out that CFEngine is sensitive to this, and I got another deluge of email:

!! Not authorized to trust the server's public key (trustkey=false) 
!! Authentication dialogue with failed 
... ad nauseam

So.... I think I just broke all the CFEngine agents, which won't be able to grab policy updates. Let's fix this using Ansible.

I had previously used CFEngine to put some static entries in /etc/hosts on managed servers to cut down on DNS dependency, so in addition to DNS changes (I was keeping the same DNS name for the policy hub), I needed to remove the old entries from /etc/hosts. I did this using CFEngine (and then repeated it using Ansible, just for good measure); an updated CFEngine policy would later put in the new IP.
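For what it's worth, that cleanup is just deleting any line that mentions the hub's name. A minimal sketch of the equivalent edit, done on a scratch copy of the file (my-policy-host and 192.0.2.10 are made-up stand-ins, not the real values):

```shell
# Build a scratch copy of /etc/hosts with a stale hub entry (values made up)
printf '127.0.0.1 localhost\n192.0.2.10 my-policy-host\n' > /tmp/hosts.demo

# Delete every line mentioning the hub, as the lineinfile task below will do
sed -i '/my-policy-host/d' /tmp/hosts.demo   # GNU sed; BSD sed wants -i ''

cat /tmp/hosts.demo   # only the localhost line remains
```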

CFEngine does document how to restore trust after an IP change, but we're on a slightly older version due to some other policies that need updating.

The Ansible playbook was nice to develop, but I did end up having to do a few more things than the CFEngine documentation suggested in order to get it running reliably, particularly around the cf_lastseen.tcdb database, which seems to be very important in this regard. Some actions, such as the sysconfig change, are not particular to this migration; I'm just taking the opportunity to clean up a little.

I'll let the playbook (which is standalone) speak for itself. Roll such things out gradually (a single host at a time at first); broken machines in a dev environment can be precious when developing a playbook to reliably fix the same problem in test/production environments. Enjoy!

# This is for CFEngine version "CFEngine Core 3.3.8"
- hosts: all  # the original pattern excluded a group after 'all:!'
  become: true
  gather_facts: false

  vars:
    # in this scenario, I'm moving the policy hub from its old IP to a new one
    new_ip: ""  # the new policy hub IP goes here

    # Look in /var/cfengine/ppkeys/ for what this should be.
    new_pubkey: |
      -----BEGIN RSA PUBLIC KEY-----
      -----END RSA PUBLIC KEY-----

  tasks:
    - name: stop cfengine, in case it's running
      service: name=cfengine3 state=stopped

    - name: store the public-key associated with the new IP
      copy:
        content: "{{ new_pubkey }}"
        dest: /var/cfengine/ppkeys/root-{{ new_ip }}.pub

      # Note you'll get the following in the cf-agent -IK -b update output
      #  -> Renaming old key from /var/cfengine/ppkeys/ \
      #     to /var/cfengine/ppkeys/

    - name: update the file that sets the policy_server IP
      copy:
        content: "{{ new_ip }}"
        dest: /var/cfengine/policy_server.dat

    - name: clear out any entries in /etc/hosts that mention my-policy-host
      lineinfile:
        path: /etc/hosts
        regexp: my-policy-host
        state: absent

    - name: invalidate nscd cache, if used
      command: /usr/sbin/nscd --invalidate hosts
      ignore_errors: true

    - name: remove old last_seen cache data
      # Without this, the failure rate during testing was annoyingly high.
      # This is removing the cached last-seen data for the old and new IPs
      # the key is specified in hex, because CFEngine seems to have a bug where
      # it includes the terminating NUL in the damn key.
      # The hope is that this will prompt CFEngine to revalidate the public key
      # To determine the key (as in key-value), use
      #   tchmgr list /var/cfengine/cf_lastseen.tcdb
      # and then compare with
      #   tchmgr list -px /var/cfengine/cf_lastseen.tcdb
      # to get the hex equivalent of the key
      # We use the -sx option to input the key in hex, which allows us
      # to input the terminating NUL.
      # Probably I could have just deleted the tcdb file .... but that
      # would introduce more unknowns.
      command: /var/cfengine/bin/tchmgr out -sx /var/cfengine/cf_lastseen.tcdb '{{ item }}'
      with_items:
        - '61 31 30 2E 01 2E 02 2E 02 00' # entry for old IP
        - '61 31 30 2E 01 2E 02 2E 03 00' # entry for new IP
      ignore_errors: true

    - name: give cfengine a test run
      command: /var/cfengine/bin/cf-agent -IK -b update
      register: cfagent
      ignore_errors: true

    - name: see if the problem is gone
      fail: msg="Trust still broken"
      when: "'Not authorized to trust the server' in cfagent.stdout"

    - name: set correct sysconfig for cfengine clients
      copy:
        dest: /etc/sysconfig/cfengine3
        content: |

    - name: start cfengine again
      # restarted is more robust -- doesn't depend on status returning
      # LSB-correct return code
      service: name=cfengine3 state=restarted
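If you need to work out those hex keys for your own hub addresses, the trick is to hex-encode the key string exactly as tchmgr reports it, plus the terminating NUL. A sketch, assuming xxd is available; a10.1.2.3 here is a made-up stand-in, so take the real key names from `tchmgr list`:

```shell
key='a10.1.2.3'   # hypothetical lastseen key; get real ones from `tchmgr list`
# Append the terminating NUL, then hex-encode in the space-separated
# uppercase form that tchmgr's -sx option expects as input
printf '%s\0' "$key" | xxd -p -u | sed 's/../& /g; s/ $//'
# → 61 31 30 2E 31 2E 32 2E 33 00
```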

