Thursday, February 16, 2006

Squashing a Real Bug on Darwin

I previously talked about a really cool bash trick called process substitution, which allows you to use a process almost anywhere you can use a file. For example, diff <(ls dir1) <(ls dir2) would allow me to diff the contents of dir1 and dir2.

The Problem



For the most part, process substitution works great on Mac OS X, but there are cases where it doesn't work. For example:

$ diff <(echo foo) <(echo bar)

produces no output, when clearly the string "foo" differs from the string "bar". But why?


Troubleshooting



Well, let's check out the tools in our toolbox: gdb, gcc, vm_stat, vmmap, nm, otool, stat, etc. Hmm, let's try a few more experiments first.

$ diff <(echo foo) <(echo bar)
$ diff <(echo foo) <(echo barX)
1c1
< foo
---
> barX

$ diff <(echo foo) <(echo bar)
$ diff <(echo foo) <(sleep 1; echo bar)
1c1
< foo
---
> bar

Interesting...

$ stat <(echo foo) <(echo bar)
520093697 0 prw-rw---- 1 jgm jgm 0 4 "Feb 16 21:12:21 2006" "Feb 16 21:12:21 2006" "Feb 16 21:12:21 2006" 512 8 0 /dev/fd/63
520093697 0 prw-rw---- 1 jgm jgm 0 4 "Feb 16 21:12:21 2006" "Feb 16 21:12:21 2006" "Feb 16 21:12:21 2006" 512 8 0 /dev/fd/62

Very, interesting! According to stat(1) the two named pipes created have identical attributes except for the file name. What's more is that the two files have the same inode number!? But what's that you say? How can a two different files (i.e. not hard-links) on the same filesystem have the same inode number? According to POSIX this isn't allowed. So maybe diff is trying to do a quick short circuit and saying "hey, the files have the same inode number and are on the same filesystem, so they must be the same". Maybe.

Well, let's get the source code for diff and check it out.

$ curl http://darwinsource.opendarwin.org/tarballs/other/gnudiff-13.tar.gz > gnudiff-13.tar.gz
$ tar -zxvf gnudiff-13.tar.g
$ cd gnudiff-13
$ make

OK, now let's test our freshly built diff.

$ /tmp/gnudiff/Build/src/diff <(echo foo) <(echo bar)
$

Yep, our new version has the same problem that we want to fix.

OK, so let's take a look at some source. gnudiff/src/diff.c looks like a good place to start. Just search for the word "main" and we can quickly check out the main function to get an idea of how diff starts to do what it does. Around line 713 we see the call
int status = compare_files ((struct comparison *) 0, from_file, argv[optind]);
which looks very promising. We find the definition of this function at line 1047. Now just skim through this function to get an idea what it does. Around line 1214 we see a comment that looks very promising!

if ((same_files = (cmp.file[0].desc != NONEXISTENT
&& cmp.file[1].desc != NONEXISTENT
&& 0 < same_file (&cmp.file[0].stat, &cmp.file[1].stat)
&& same_file_attributes (&cmp.file[0].stat, &cmp.file[1].stat)))
&& no_diff_means_no_output) {
/* The two named files are actually the same physical file.
We know they are identical without actually reading them. */

}

Oh, I bet this has something to do with the problem! What do those two "same_file*" functions do?

Armed with grep we find them defined as macros in gnudiff/src/system.h around line 361. These two macros basically check some file data returned by stat(2) to see if two files are identical. The attributes checked are things like inode number, uid, gid, size, mtime, ctime, etc. All attributes that were identical when we checked the stat output of our two named pipes. Take a second and glance back up at the output from stat <(echo foo) <(echo bar). I'll wait... back? OK. So, it sorta makes sense why that diff may have failed. And it also makes sense why diff <(echo foo) <(sleep 1; echo bar) would have worked. Can you guess why? (hint: think about the modification times for each fifo)


A Fix



What's the best fix for this? One could argue that the problem is that the HFS+ filesystem allows two files on the same filesystem to have the same inode number, but HFS+ really isn't an inode based filesystem. On HFS+, inode numbers are really just the volume's catalog node ID. Plus, it's probably a big pain to modify the Darwin Kernel or the HFS+ filesystem code.

Maybe we can instead fix the problem in diff. According to this technote a CNID of zero is never used and indicates nil, so maybe diff should not shortcut any files with an inode of 0? Let's try it. In gnudiff/src/system.h, make the following modification to the same_files(s, t) macro:

# define same_file(s, t)
((((s)->st_ino == (t)->st_ino)
&& ((s)->st_dev == (t)->st_dev))
&& ((s)->st_ino != 0) && ((t)->st_ino != 0) \
|| same_special_file (s, t))

Then recompile with make (if necessary, type make clean; make). Now, let's see if we fixed the problem:

$ /tmp/gnudiff/Build/src/diff <(echo foo) <(echo bar)
1c1
< foo
---
> bar


YAY! That seems to have fixed the problem!


Conclusion



I possibly skipped the most important first step here, and that is use Google to see if someone else already figured out my problem! I'll probably go do that now! ;-)

In the meantime, I haven't tested this solution thoroughly, but I imagine it's safe. Hopefully, this (or some other) fix will make it into the diff code soon. G'nite.

1 comment:

Anonymous said...

opendiff <(echo foo) <(echo bar) produces: Bus error