Skip to main content

[Humbledown highlights] Build an inode Lookup Database using SQLite3

Originally published by myself on at Sun Aug 28 19:25:47 NZST 2011
and since recovered to this location. It has not been tested since its original publication.

Occassionally, such as when investigating NFS traffic, you need to determine which filename(s) correspond to a given inode, or set of inodes. Now, you could answer this question using a command such as find /someplace -inum 12345 -print, but that will generally be quite slow, and very cumbersome if you wanted to lookup multiple inodes.

I decided to do this a different way, by creating a database to map between inode and path. Furthermore, I decided to do this using sqlite3, because a) it ought to be faster than grepping through a file looking for a particular inode, b) sqlite3 is pretty useful and I thought this would be a good excuse to practice using it, c) it was easy to do so.
One important thing to note: if you model a table using an inode as a primary key (which must by definition be unique), then you will run into problems. This is because although inodes are unique in a filesystem, the paths that correspond to an inode are not, so you might have multiple rows with the same inode field. So, although we would want the 'inode' field indexed, it is not a primary key. SQLite3 doesn't care though, as each row has an internal row number.

To generate the database (which is held in one file; SQLite3 is very convenient like that), we can do the following:

$ sqlite3 inodes.sqlite3db 'CREATE TABLE inodes (inode INTEGER, path STRING)'

Done. Now we just need to populate it with data. We can use the find command to help with this. Since inodes are unique to a filesystem, I'm using the -xdev option to prevent crossing over into other filesystems. I'm using the vertical bar to separate the fields in the output, as this is the default for sqlite3's .import function, although it can be changed.

$ find / -xdev -printf "%i|%p\n" | head

Okay, so that shows what the import data is like. We would write the whole thing to a file and then import that, but it makes more sence to pipe it directly into sqlite3. Note that .import can actually read from stdin, but you need to specify it as /dev/stdin, as it doesn't understand -. I'm using sudo to prevent a lot of potential Permission denied messages, although I still get one, which is the Gnome Virtual File System.

$ time sudo find / -xdev -printf "%i|%p\n" | sqlite3 inodes.sqlite3db '.import /dev/stdin inodes'
find: `/home/cameron/.gvfs': Permission denied

real    1m32.258s
user    0m3.708s
sys     0m11.265s


$ sqlite3 inodes.sqlite3db 'SELECT * FROM inodes LIMIT 5;'

Good, so the data has been imported, and we are ready to begin querying. Let's step back a little first: note that we have directories as well, as directories are just a special type of file that lists what is in the directory. We can use the debugfs command to output the contents of an inode (in this example, inode 2):

$ sudo debugfs -R 'cat <2>' /dev/sda5 | hexdump -C
debugfs 1.41.11 (14-Mar-2010)
00000000  02 00 00 00 0c 00 01 02  2e 00 00 00 02 00 00 00  |................|
00000010  0c 00 02 02 2e 2e 00 00  0b 00 00 00 14 00 0a 02  |................|
00000020  6c 6f 73 74 2b 66 6f 75  6e 64 00 00 10 00 00 00  |lost+found......|
00000030  14 00 0b 07 76 6d 6c 69  6e 75 7a 2e 6f 6c 64 00  |....vmlinuz.old.|
00000040  01 f4 0b 00 0c 00 03 02  6d 6e 74 00 01 fc 03 00  |........mnt.....|
00000050  0c 00 04 02 62 6f 6f 74  0f 00 00 00 18 00 0e 07  |....boot........|
00000060  69 6e 69 74 72 64 2e 69  6d 67 2e 6f 6c 64 00 00  |initrd.img.old..|
00000070  0e 00 00 00 10 00 05 07  63 64 72 6f 6d 00 00 00  |........cdrom...|
00000080  01 f6 09 00 0c 00 03 02  65 74 63 00 01 f0 0f 00  |........etc.....|
00000090  10 00 05 02 6d 65 64 69  61 00 00 00 04 f0 0f 00  ||

Anyway, back to querying. To query a single inode is pretty easy:

$ time sqlite3 inodes.sqlite3db 'SELECT * FROM inodes WHERE inode = 1044571;'

real    0m0.238s
user    0m0.136s
sys     0m0.092s

To query multiple random inodes is a bit more annoying:

$ time sqlite3 inodes.sqlite3db 'SELECT * FROM inodes WHERE (inode = 2419) or (inode = 1044571);'

real    0m0.313s
user    0m0.108s
sys     0m0.184s

But we can overcome that nicely:

$ time echo -e '784412\n784499\n784499' \
>  | while read inode
>  do
>     echo "SELECT * FROM inodes WHERE inode = $inode;" 
>  done | sqlite3 inodes.sqlite3db

real    0m0.688s
user    0m0.360s
sys     0m0.296s

So, it took about 1m30s to build the database, less than a second to query it multiple times. Compare that to using 'find ... -inum X -print' multiple times, each of which takes about 1m30s (ignoring any filesystem caching).

Limitations of this method include: must be periodically updated; is not real-time, so you might not have all short-lived files; an inode might not always map to something currently in existance (eg. open a file by name, delete the file [entry in directory], then when the file is actually closed, the file will be deleted: this is common for temporary files).

Using this method in conjunction with something like monitoring NFS traffic, you could use it to find out which files are being accessed. When watching/sampling NFS traffic, you will only determine a mapping between inode and pathname when a LOOKUP request is made, not when operating on a file. If you sample 200,000 NFS packets, you might not find any mappings, depending on the nature of the traffic. So this can be very useful to determine what is going on.


Popular posts from this blog

ORA-12170: TNS:Connect timeout — resolved

If you're dealing with Oracle clients, you may be familiar with the error message
ERROR ORA-12170: TNS:Connect timed out occurred I was recently asked to investigate such a problem where an application server was having trouble talking to a database server. This issue was blocking progress on a number of projects in our development environment, and our developers' agile post-it note progress note board had a red post-it saying 'Waiting for Cameron', so I thought I should promote it to the front of my rather long list of things I needed to do... it probably also helped that the problem domain was rather interesting to me, and so it ended being a late-night productivity session where I wasn't interrupted and my experimentation wouldn't disrupt others. I think my colleagues are still getting used to seeing email from me at the wee hours of the morning.

This can masquerade as a number of other error strings as well. Here's what you might see in the sqlnet.log f…

Getting MySQL server to run with SSL

I needed to get an old version of MySQL server running with SSL. Thankfully, that support has been there for a long time, although on my previous try I found it rather frustrating and gave it over for some other job that needed doing.

If securing client connections to a database server is a non-negotiable requirement, I would suggest that MySQL is perhaps a poor-fit and other options, such as PostgreSQL -- according to common web-consensus and my interactions with developers would suggest -- should be first considered. While MySQL can do SSL connections, it does so in a rather poor way that leaves much to be desired.

UPDATED 2014-04-28 for MySQL 5.0 (on ancient Debian Etch).

Here is the fast guide to getting SSL on MySQL server. I'm doing this on a Debian 7 ("Wheezy") server. To complete things, I'll test connectivity from a 5.1 client as well as a reasonably up-to-date MySQL Workbench 5.2 CE, plus a Python 2.6 client; just to see what sort of pain awaits.

UPDATE: 2014-0…

From DNS Packet Capture to analysis in Kibana

UPDATE June 2015: Forget this post, just head for the Beats component for ElasticSearch. Beats is based on PacketBeat (the same people). That said, I haven't used it yet.

If you're trying to get analytics on DNS traffic on a busy or potentially overloaded DNS server, then you really don't want to enable query logging. You'd be better off getting data from a traffic capture. If you're capturing this on the DNS server, ensure the capture file doesn't flood the disk or degrade performance overmuch (here I'm capturing it on a separate partition, and running it at a reduced priority).

# nice tcpdump -p -nn -i eth0 -s0 -w /spare/dns.pcap port domain

Great, so now you've got a lot of packets (set's say at least a million, which is a reasonably short capture). Despite being short, that is still a massive pain to work with in Wireshark, and Wireshark is not the best tool for faceting the message stream so you can can look for patterns (eg. to find relationshi…