
Please don't use dig etc. in reporting scripts... use 'getent hosts' instead (gawk example)

Okay, so the excrement of the day is flying from the fan and you need to get some quick analytics of who/what is being the most-wanted of the day. Perhaps you don't currently have the analytics you need at your fingertips, so you hack up a huge 'one-liner' that perhaps takes a live sample of traffic (tcpdump -p -nn ...), extracts some data using sed, collates the data using awk etc. and then does some sort of reporting step. Yay, that's what we call agile analytics (see 'hyperbole'); it's the all-too-common fallback go-to, but it does prove to be immensely useful.

Okay, so you've got some report, perhaps it lacks a bit of polish, but it contains IP addresses, and we'd like to see something more recognisable (mind you, some IP addresses can become pretty recognisable). So, you scratch your head a bit and have the usual internal debate: "do I do this in bash, or fall up to awk, perl/python". At this point (if you go with bash etc. or awk), you'll perhaps think of using `dig +short -x $IP` to get (you hope) the canonical DNS name associated with it.

Oh, oh. A common point of trouble at this part is that you end up calling `dig` many times... often for the same name in a short period of time. Perhaps you'll request things too fast and lookups might fail (rate controls and limits on DNS servers etc.) As someone who looks after DNS servers, I urge you to stop. There is a much better way, and that way is the cache-friendly `getent hosts`.

When you call `dig`, `host`, or `nslookup`, you are going directly to DNS. The system's normal name-resolution path, and any cache sitting in it, is bypassed. What a regular program does when it calls `gethostbyname(...)` or `getnameinfo(...)` and related functions is to look for the presence of a local cache (e.g. it will look explicitly for the presence of a UNIX-domain socket /var/run/nscd/socket) and it will consult files like /etc/resolv.conf. Both of these can potentially put a cache between you and the DNS servers.
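If you want to see what that lookup path looks like on a given box, something like the following sketch may help. It assumes glibc's usual /etc/nsswitch.conf (whose `hosts:` line lists the lookup sources in order) and the nscd socket path mentioned above; `resolver_summary` is just a name made up for illustration.

```shell
# resolver_summary [NSSWITCH_FILE] [NSCD_SOCKET]
# Hypothetical helper: shows the configured sources for host lookups
# and whether an nscd socket is present. Both paths are assumptions
# (the usual glibc defaults) and can be overridden.
resolver_summary() {
  rs_nsswitch="${1:-/etc/nsswitch.conf}"
  rs_socket="${2:-/var/run/nscd/socket}"

  # The "hosts:" line lists lookup sources in order, e.g. "files dns".
  if [ -r "$rs_nsswitch" ]; then
    awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print "sources: " $0 }' "$rs_nsswitch"
  fi

  # -S is true only if the path exists and is a socket.
  if [ -S "$rs_socket" ]; then
    echo "nscd socket: present"
  else
    echo "nscd socket: absent"
  fi
}

resolver_summary
```

An absent socket just means nscd isn't running; there may still be a caching resolver behind whatever /etc/resolv.conf points at.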

This is why sometimes it can be useful to use 'ping' to check DNS resolution... but if you're doing that, you perhaps don't know about 'getent'.

The `getent` command is part of your standard Linux environment; it comes with glibc. Its purpose is as a lookup/testing tool for things like /etc/passwd, /etc/group, /etc/services, and also name resolution and many others --- see the manual page for getent(1). As `getent` is simply a command that exposes API functionality from glibc, you get that potentially huge benefit from caching.
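To get a feel for `getent` outside of name resolution, here is a small sketch against the passwd and group databases. The `getent_field` wrapper is a made-up convenience, and it assumes the box has a `root` entry in both databases.

```shell
# getent_field DATABASE KEY FIELD_NUMBER
# Hypothetical wrapper: prints one colon-separated field of a
# passwd/group style entry returned by getent.
getent_field() {
  getent "$1" "$2" | cut -d: -f"$3"
}

# Login shell (7th passwd field) and numeric GID (3rd group field) of root.
getent_field passwd root 7
getent_field group root 3
```

The same entries come back whether they live in local files or in some directory service, which is the point of going through glibc rather than grepping /etc/passwd.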

To do a forward lookup, you can use the 'hosts' database. Forward lookups are not as pleasant compared to say `dig +short -t a ...` if you expect an IPv4 address (I'll leave you to play with forward lookups yourself). Reverse lookups are very simple.
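For forward lookups, one option worth trying is the `ahostsv4` database, which prints one line per address and socket type (address, then STREAM/DGRAM/RAW, then the name). A minimal sketch, where `first_address` and `ipv4_for` are names invented here:

```shell
# first_address: reads getent ahosts-style lines on stdin and prints
# the address (first column) of the first line only.
first_address() {
  awk 'NR == 1 { print $1 }'
}

# ipv4_for NAME: hypothetical wrapper, prints the first IPv4 address
# for NAME, or nothing if it does not resolve.
ipv4_for() {
  getent ahostsv4 "$1" | first_address
}

# Example (needs working name resolution):
#   ipv4_for localhost
```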

Let's get a test-case:

$ host is an alias for has address has IPv6 address 2404:6800:4006:801::2009

$ host domain name pointer

Okay, so if we do a reverse lookup on, we expect to see (this will vary depending on where you are in the world; topologically it seems I'm close to India)

$ getent hosts

Hooray. Let's see what a negative result looks like.

$ getent hosts
   ... yup, 0 lines of output

The output is meant to be exactly what /etc/hosts could contain: a single IP address, a canonical name, and a list of aliases.
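Since the second column is the canonical name and a failed lookup prints nothing, a reverse lookup with a fallback is only a few lines of shell. This is a sketch; `canonical_name` and `name_for_ip` are names invented for illustration.

```shell
# canonical_name: reads /etc/hosts-style lines on stdin and prints the
# second column (the canonical name) of the first usable line.
canonical_name() {
  awk 'NF >= 2 { print $2; exit }'
}

# name_for_ip IP: hypothetical wrapper, prints the canonical name for IP,
# or the IP itself when getent returns nothing.
name_for_ip() {
  nfi_name=$(getent hosts "$1" | canonical_name)
  echo "${nfi_name:-$1}"
}

# Example (needs working name resolution):
#   name_for_ip 8.8.8.8
```

Falling back to the IP keeps the report's columns lined up when a client has no PTR record.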


Nice and script-friendly. Let's see how to pull that in AWK (or GAWK), with a simple example first. Let's start off with some input -- perhaps lines with an IP address and some count. I've also included a threshold, just as a reminder that it's good to minimise the number of lookups.

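$ echo -e ' 2\n8.8.8.8 12\n8.8.4.4 25' \
  | awk '
    BEGIN {threshold = 5}
    $2 > threshold {
      cmd = "getent hosts " $1;
      getent_hosts_str = "";    # clear any result left from the previous line
      cmd | getline getent_hosts_str;
      close(cmd);               # so a repeated IP gets looked up again
      split(getent_hosts_str, getent_hosts_arr, " ");
      print $1, getent_hosts_arr[2], $2
    }'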
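If the same IP can turn up on many lines, it may be worth remembering answers inside awk so each address costs at most one `getent` call. The following is a sketch along those lines, not something from the scripts in this post: `resolve_ips`, `name_for` and the `LOOKUP` override are all made up here (the override exists mainly so the lookup command can be swapped out for testing). It also refuses to pass anything that doesn't look like an IP address to the shell, since awk runs the command through sh.

```shell
# resolve_ips: reads "IP count" lines on stdin, prints "IP name count".
# LOOKUP is a hypothetical override for the lookup command
# (default: "getent hosts").
resolve_ips() {
  awk -v lookup="${LOOKUP:-getent hosts}" '
    # name_for(ip): canonical name for ip, or "-" if unknown.
    # Results (including failures) are kept in the cache array.
    function name_for(ip,    cmd, line, parts) {
      if (!(ip in cache)) {
        cache[ip] = "-"
        # Only hand plausible IPv4/IPv6 text to the shell.
        if (ip ~ /^[0-9A-Fa-f.:]+$/) {
          cmd = lookup " " ip
          line = ""
          cmd | getline line
          close(cmd)
          if (split(line, parts, " ") >= 2) cache[ip] = parts[2]
        }
      }
      return cache[ip]
    }
    { print $1, name_for($1), $2 }
  '
}

# Example (needs working name resolution):
#   printf "8.8.8.8 12\n8.8.4.4 25\n8.8.8.8 7\n" | resolve_ips
```

Printing "-" for unknown names also keeps `column -t` happy, as an empty field would otherwise shift the columns.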

Real-world example

Well, the previous example was reasonably real-world, but it's useful to see something a bit more fully-fledged.

I have a script called dns-live-sample-most-frequent-each-second which runs tcpdump, and basically outputs the most frequent client (with a minimum threshold) every second. So this 'one-liner', which is not yet encapsulated into a script, will look at 100 samples (i.e. 100 seconds) and output a table of the common sources of such spikes.

The first script (dns-live-sample-most-frequent-each-second) has output like the following (I've anonymised the IP addresses). The fields are time, IP and count of requests that client made that second.

13:19:36 10
13:19:37 36
13:19:38 30
13:19:39 23

Here's the 'one-liner' in all its glory (?). If you care to expand it, you'll see that it's using the PROCINFO["sorted_in"] functionality of gawk to sort an associative array by its values in descending numerical order. That's certainly a trick worth knowing.

./dns-live-sample-most-frequent-each-second | head -n 100 | gawk '{clients[$2] += 1} END {print "CLIENT_IP CLIENT_DNS %TOP1"; rest = 0; PROCINFO["sorted_in"] = "@val_num_desc"; for (client in clients) {if (clients[client] >= 0.05*NR) {getent_hosts_str = ""; "getent hosts " client | getline getent_hosts_str; split(getent_hosts_str, getent_hosts_arr, " "); print client, getent_hosts_arr[2], int(clients[client] / NR * 100) } else { rest += clients[client] }} print "REST -", int(rest/NR*100)}' | column -t

And its output

CLIENT_IP  CLIENT_DNS         %TOP1   9  18
REST       -                  73

Right, now to go have a chat with whoever looks after

I'll put that into a script soon and maybe do something with cron... maybe.

Here's the more polished form in a script. I called it dns-live-sample-most-frequent-spikers



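#!/bin/bash

# Number of one-second samples to take.
# (Assumed: first argument, defaulting to 100 as in the one-liner above.)
num_seconds="${1:-100}"

echo "Looking for spiking clients; sampling each second for $num_seconds seconds..."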


dns-live-sample-most-frequent-each-second \
  | head -n "$num_seconds" \
  | gawk '
    # Count how many one-second samples each client topped.
    {clients[$2] += 1}
    END {
        print "CLIENT_IP CLIENT_DNS %TOP1";
        rest = 0;
        # Walk the array by value, largest first (gawk-specific).
        PROCINFO["sorted_in"] = "@val_num_desc";
        for (client in clients) {
            if (clients[client] >= 0.05*NR) {
                cmd = "getent hosts " client;
                getent_hosts_str = "";    # do not reuse a previous answer
                cmd | getline getent_hosts_str;
                close(cmd);
                split(getent_hosts_str, getent_hosts_arr, " ");
                print client, getent_hosts_arr[2], int(clients[client] / NR * 100)
            } else {
                rest += clients[client]
            }
        }
        print "REST -", int(rest/NR*100)
    }' \
  | column -t
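PROCINFO["sorted_in"] is a gawk-only feature. Where only a plain awk is around, a rough equivalent (a different technique from what the script above does) is to let sort(1) do the ordering after awk has done the counting. A sketch, with `count_clients` being a name invented here and the input assumed to be the same "time IP count" lines:

```shell
# count_clients: reads "time IP count" lines on stdin and prints
# "IP samples", most frequent first (ties broken by IP text).
count_clients() {
  awk '{ clients[$2] += 1 } END { for (c in clients) print c, clients[c] }' \
    | sort -k2,2nr -k1,1
}

# Example with documentation-range addresses:
printf '13:19:36 192.0.2.7 10\n13:19:37 192.0.2.9 36\n13:19:38 192.0.2.9 30\n' \
  | count_clients
```

The gawk approach keeps everything in one process, which matters if you want to do the lookups and percentages in the same END block; the sort pipeline is simply more portable.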

