Capturing and Analysing DNS Samples (tcpdump meets SQLite3)

A few nights ago, I was on a bit of a mission to determine how much traffic our servers were sending to our DNS servers; I was fairly sure it was a lot, and I wanted to put some improvements in place so we don't create a problem for ourselves.

I didn't want to turn on query logging, because of the rate of traffic, and it's not my job to manage the DNS service; but I can see the queries in flight, so tcpdump was an obvious tool to use. I didn't want to be awking, grepping, sedding, and generally bashing (pun intended) tcpdump output into whatever variety of reports I wanted, so I fed it into SQLite3 instead.


I thought I would share how I did that. I won't show much of the output, as that is less useful here (I'm sure you can think of more interesting SQL to throw at it).

It all starts with a tcpdump capture, which can be read back later. You could do this without saving the file (which can be very useful to avoid creating a potential space issue), but because I wanted to know the duration of the capture, I decided to write it to disk first so I could query it again. In hindsight, I should have just added another query for the last timestamp minus the first. To make the timestamps more useful, I set the timestamp format to raw (tcpdump's -tt option), which is essentially Unix time with microseconds. Here's some sample output from tcpdump alone (sanitised):

$ /usr/sbin/tcpdump -tt -r /tmp/query-sample.pcap -n | head -3
reading from file /tmp/query-sample.pcap, link-type EN10MB (Ethernet)
1405083214.871468 IP ###.56357 > ###.domain: 24707+ PTR? ###.in-addr.arpa. (45)
1405083214.871663 IP ###.34021 > ###.domain: 8029 [1au] PTR? ###.in-addr.arpa. (56)
1405083214.884989 IP ###.60397 > ###.domain: 15430+ A? ###.example.com. (70)
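With the raw timestamps available, the duration of the capture falls out of a one-liner; a minimal sketch, assuming the same pcap path used throughout this post:

```shell
# Capture duration: last raw (-tt) timestamp minus the first.
# Assumes /tmp/query-sample.pcap exists from the capture step.
tcpdump -tt -r /tmp/query-sample.pcap -nn 2>/dev/null \
    | awk 'NR == 1 { first = $1 } { last = $1 }
           END { printf "Duration: %.3f seconds\n", last - first }'
```

The awk keeps only the first and last timestamps, so it works on a capture of any size without buffering it.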

Here's the script. There is much that could be improved (I'd pick a better temporary file location if I were planning on reusing this script). I've taken some pains to normalise the query values so that things aggregate usefully, such as reverse lookups, and some cases where anti-malware software calls out to DNS as a form of distributed database. I've also tried to aggregate some IP address ranges differently (which could be useful for internal/external); pretend that I have a net-block of 123.45.67.

Note that I haven't added any indexes to this. It didn't seem to suffer much from their absence, but at 50,000 requests it's reaching the point where they would perhaps be beneficial.
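If you do reach that point, an index on the columns the report groups by is a one-liner; a sketch against the same database file (the index name is my own invention):

```shell
# Covering index for the GROUP BY querytype, queryval report;
# the name idx_queries_type_name is arbitrary.
sqlite3 /tmp/query-sample.sqlite3 \
    'create index if not exists idx_queries_type_name on queries (querytype, queryval);'
```

Because the table is rebuilt on every run, you would create the index right after the table, before the import.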

# cat dns-live-query-sample
#!/bin/bash

sample_size="50000"
interface="eth0"

echo "Capturing sample of $sample_size packets from interface $interface"

tcpdump -w /tmp/query-sample.pcap -i "$interface" -nn -p -c "$sample_size" -s0 dst port domain

rm -f /tmp/query-sample.sqlite3
sqlite3 /tmp/query-sample.sqlite3 'create table queries (count int,client,querytype,queryval);'

tcpdump -r /tmp/query-sample.pcap -nn \
    | sed -rne 's/^[^ ]+ IP ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+ > ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+: .* ([^?]+)\? ([A-Za-z0-9._-]+) .*/\1 \3 \4/p' \
    | awk '
        { queryval = $3 }
        queryval ~ /\.noisy-domain\.com\.$/ { queryval = "SOMETHING.noisy-domain.com." }
        queryval ~ /\.10\.in-addr\.arpa\.$/ { queryval = "SOMETHING.10.in-addr.arpa." }
        queryval ~ /\.67\.45\.123\.in-addr\.arpa\.$/ { queryval = "SOMETHING.67.45.123.in-addr.arpa." }
        queryval ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\.in-addr\.arpa\.$/ { queryval = "SOMETHING.in-addr.arpa." }  # anchored, so it cannot clobber the rewrites above
        { print $1,$2,queryval }' \
    | sort | uniq -c \
    | sed -re 's/[ \t]+/|/g' -e 's/^\|//' \
    | sqlite3 /tmp/query-sample.sqlite3 '.import /dev/stdin queries'

echo "
    In this report, Cost is a count of such queries received,
    normalised by the number of clients that queried it. Thus,
    something with a Cost greater than 10 (<= are omitted),
    would likely benefit from DNS caching.

    Some queries, namely inverse lookups and things under
    noisy-domain.com (generally some abuse of DNS) are
    aggregated.

"

sqlite3 /tmp/query-sample.sqlite3 <<EOF
.mode column
.header on

.width 4 7 70
select sum(count)/count(count) as Cost, querytype as RRtype, queryval as Name from queries group by RRtype,Name having Cost > 10 order by Cost desc limit 100;

.width 9 6 15
select sum(count) as NumQueries, querytype as RRtype, client as Client from queries group by Client having NumQueries > 10 order by NumQueries desc limit 100;
EOF

tcpdump -tt -r /tmp/query-sample.pcap -nn \
    | grep '\.53: ' \
    | cut -d. -f1 | uniq -c \
    | awk '
        NR == 1 {starttime = $2; next}           # skip the first (partial) second
        {total += $1; last = $1; lasttime = $2}
        END { total -= last                      # drop the last (partial) second too
              print "Queries / second = " total / (lasttime - starttime) }'

Here is a sample of the output (again, sanitised):

# ./dns-live-query-sample
Capturing sample of 5000 packets from interface eth0
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
5000 packets captured
5000 packets received by filter
0 packets dropped by kernel
reading from file /tmp/query-sample.pcap, link-type EN10MB (Ethernet)

    In this report, Cost is a count of such queries received,
    normalised by the number of clients that queried it. Thus,
    something with a Cost greater than 10 (<= are omitted),
    would likely benefit from DNS caching.

    Some queries, namely inverse lookups and things under
    noisy-domain.com (used for ###) are
    aggregated.


Cost  RRtype   Name
----  -------  -----------------------------------------------
195   A        SOMETHING.noisy-domain.com.
130   A        ###.my.domain.
104   A        ###.my.domain.
102   A        ###.my.domain.
94    AAAA     ###.my.domain.
94    AAAA     ###.my.domain.
...
22    PTR      SOMETHING.in-addr.arpa.
...

NumQuerie  RRtype  Client
---------  ------  ---------------
808        TXT     10.###
473        PTR     10.###
348        PTR     10.###
314        PTR     10.###
...
reading from file /tmp/query-sample.pcap, link-type EN10MB (Ethernet)
Queries / second = 134.722

I should point out: only requests are captured, not responses, so this script, as written, cannot tell you about success or failure (but it may let you see some interesting patterns).
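For instance, one more query you could throw at the same table: total queries per record type, which quickly shows whether PTR or TXT traffic dominates the sample (a sketch against the table the script creates):

```shell
# Totals per RR type across the whole sample.
sqlite3 /tmp/query-sample.sqlite3 \
    'select querytype as RRtype, sum(count) as Total
     from queries group by RRtype order by Total desc;'
```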

I hope you find this useful; presumably there is already a tool that does this analysis better, but the general technique of feeding tcpdump into SQLite3 may come in handy elsewhere.

Cheers,
Cameron
