Wednesday, August 30, 2006

Pruning Empty Directories

Just a quickie here. I had a fairly large, deep directory structure with many empty directories, where an "empty" directory may still contain subdirectories that are themselves empty (i.e. after mkdir -p foo/bar/baz, foo/ counts as "empty"). The quickest way I found to clean them all up was to have find do a depth-first traversal (-d on BSD find; the POSIX spelling is -depth, given after the path), so children are listed before their parents, then hand everything to rmdir, which only deletes empty directories. rmdir fails on regular files and non-empty directories, and those complaints are sent to /dev/null. Something like:

$ find -d . -print0 | xargs -0 rmdir 2> /dev/null
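
A variation on the same idea, as a minimal sketch: let find do the deleting itself. The function name prune_empty_dirs is made up for illustration, and the sketch assumes a find that supports -mindepth, -empty and -delete (GNU and BSD find do, but none of the three is in POSIX).

```shell
prune_empty_dirs() {
    # $1: root of the tree to clean (defaults to the current directory).
    # -mindepth 1 leaves the root itself alone.
    # -depth visits children before their parents, so a directory is tested
    # for emptiness only after its empty subdirectories are already gone.
    find "${1:-.}" -mindepth 1 -depth -type d -empty -delete
}
```

Calling prune_empty_dirs /some/tree removes foo/bar/baz, then foo/bar, then foo, and never touches regular files, so there are no errors to throw away.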

Sunday, August 20, 2006

APUE2e Acknowledgement

It's no secret that Advanced Programming in the UNIX Environment, by W. Richard Stevens and Stephen A. Rago, is a staple for all developers who write code for any flavor of Unix. If you don't have at least one copy already, I'd strongly encourage you to pick one up.

That said, don't forget to check out the book's website at apuebook.com. And while you're there, check out item 18 on the "Additional Acknowledgements" page ;-) Anyway, here's a little bit more detail about that particular issue.

The SUSv3 (Single Unix Specification version 3) states: "If connect() fails, the state of the socket is unspecified. Conforming applications should close the file descriptor and create a new socket before attempting to reconnect." As an example, retrying connect() doesn't always work on Darwin 8.6.0 and FreeBSD 6.0-RC1 (the only versions of these OSes that I checked).

The case I found where retrying connect() doesn't work is when I try to connect() to a port that's not listening. The client (calling connect()) sends the SYN, a RST is received (as expected) and connect() returns -1 with errno set to ECONNREFUSED. This is all as expected. However, if that same socket is used to attempt the connect() again, no packets are sent and connect() immediately fails with EINVAL. This code illustrates:

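#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <strings.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_in remote_addr;

    bzero(&remote_addr, sizeof(remote_addr));
    remote_addr.sin_family = AF_INET;
    remote_addr.sin_port = htons(3333);
    inet_pton(AF_INET, "127.0.0.1", &remote_addr.sin_addr);

    int sock = socket(AF_INET, SOCK_STREAM, 0);

    /* Non-portable: the same socket is reused after connect() fails. */
    while (connect(sock, (struct sockaddr *)&remote_addr,
                   sizeof(remote_addr)) == -1) {
        perror("failed to connect");
        sleep(2);
    }
    ...
}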


Again, on Darwin and FreeBSD the second time through the while-loop, EINVAL is immediately returned. And since no packets are actually sent, if the port at 127.0.0.1:3333 ever does open up, it will not be detected.

On the 2.4 Linux kernel I tested, the code does what I initially expected and it returns ECONNREFUSED every time.

Since SUSv3 says that a failed connect() leaves the socket in an unspecified state, I don't think this is actually a bug. But it also means that the connect_retry() code in figure 16.9 (of APUE2e) is not portable.

So, to summarize, the issue is that if a connect() call fails for any reason, the state of the socket is unspecified. To be portable, you must close the socket and create a new one before calling connect() again.

When I emailed Stephen Rago about this, he was very responsive and nice. He feels that this bug lies with the sockets implementation, but he added an FAQ on the book's website about it anyway.

Again, if you're reading this blog, and you don't already have a copy of this book, you should probably go get one now.

Thursday, August 17, 2006

Old (but useful) Shell Tricks

I used to have a somewhat long list of somewhat interesting Unix/Shell tips-n-tricks on an old version of my website. A few folks asked me where it went, and since it's no longer available, I figured I'd repost a handful of the tips:




  1. From within an executing script, how do I find the directory where the script lives?


    A simple pwd doesn't work because it gives the current working directory, and we want to know the directory where the script actually resides on disk. I can't remember the previous solution I came up with, but here's another solution that should work:
    #!/bin/bash
    dir=$(dirname "$(echo "$0" | sed -e "s,^\([^/]\),$(pwd)/\1,")")
    echo "I live in $dir"

    Here's how it works. $0 is the name of the script as it was executed, so it may be foo.sh, ./foo.sh, /tmp/foo.sh, etc. This gets sent to sed, whose basic regular expression says: if the first character of $0 is NOT a forward slash, prepend the present working directory ($(pwd)) and a slash, followed by whatever that first character was (\1). If the first character of $0 is a forward slash, we were invoked via an absolute path, so nothing changes. Finally, dirname strips the script's file name from the result, and what remains is assigned to dir. The quotes keep paths that contain spaces in one piece.



  2. How do I copy a directory structure from one machine to another?


    $ tar cf - some_directory | ssh kramer "( cd /path/to/destination && tar xf - )"

    tar cf - some_directory creates a tar file of some_directory, but the dash (-) tells tar to write to STDOUT instead of writing to an actual file on disk. The STDOUT from the first tar command is piped to STDIN of the next command. The right hand side of the pipe says to log into the host kramer using ssh and run the commands cd and tar xf -. The trick is with the commands "( cd /path/to/destination && tar xf - )". The parens create a subshell, in which the current directory is changed to /path/to/destination, and tar xf - reads from STDIN and extracts the tar file. The && makes sure nothing is extracted if the cd fails. This STDIN is the same STDIN that was sent to us over the pipe from STDOUT of the first tar command. Thus the directory structure on the local host (jerry) gets tar'd up, transferred to kramer, then extracted all in one fell swoop.



  3. How do I diff two files on different machines (using Bash)?


    $ diff <(ssh -n george cat /etc/passwd) <(ssh -n kramer cat /etc/passwd)

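    This is just Bash's process substitution trick: each <( ... ) runs its command and is replaced by a file name from which diff can read that command's output. The -n flag redirects ssh's STDIN from /dev/null.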



  4. How can I run a shell script on a remote host without copying the script out?


    One way would be:
    jerry:~$ cat myscript.sh | ssh kramer /bin/sh

    How this one works is pretty simple, but it is often overlooked as an option for getting things done. The shell script is written to STDOUT. /bin/sh is executed on the remote server, and it reads myscript.sh on STDIN, thus executing the local copy of the script. This is way convenient for some things. The only problem I see is that, as written, the script receives no command line arguments.



  5. How can I run long pipelines of commands on a remote host via SSH without escaping all the metacharacters?



    jerry:~$ ssh kramer <<EOF
    ps -ef | grep http | awk '{print \$NF}'
    EOF

    The only tricks here are the use of a here document, and the fact that the command line is fed directly into ssh's STDIN, so there's no need to escape things like pipes and semi-colons. However, notice that you do still need to escape dollar signs: because the EOF delimiter is unquoted, the local shell would otherwise expand them as its own variables before ssh ever sees them.



  6. How do I change every occurrence of a string in multiple files?


    Why, use perl pie!
    perl -p -i -e 's/jerry/george/g' *.txt

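    In short: -p wraps the code in a loop that reads each line and prints it back out, -i edits the files in place, and -e supplies the code on the command line. See perl -h for a description of the flags.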
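
Going back to tip 1, here is a shorter alternative to the sed approach, as a sketch rather than a definitive answer: cd into the script's directory inside a subshell and ask pwd where that is. The function name script_dir is made up for illustration. It does not resolve symlinks, and it assumes $0 really is the path used to start the script.

```shell
script_dir() {
    # $1: the path the script was invoked with, normally "$0".
    # The parentheses run cd in a subshell, so the caller's working
    # directory is left unchanged; pwd then prints an absolute path.
    ( cd -- "$(dirname -- "$1")" && pwd )
}
```

Inside a script this would be used as dir=$(script_dir "$0"). Relative and absolute invocations both end up as an absolute directory, and the quoting keeps paths with spaces intact.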
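
Tips 4 and 5 each have a rough edge: the piped script gets no arguments, and dollar signs in the here document need backslashes. The sketch below addresses both and runs locally so it can be tried without a remote host. The function name run_with_args is made up for illustration. For the remote case I'm assuming you would swap sh for ssh kramer sh, keeping in mind that ssh joins its arguments with spaces and the remote shell re-parses them, so arguments containing whitespace need extra quoting.

```shell
run_with_args() {
    # "$@": arguments for the script that arrives on STDIN.
    # sh -s reads commands from STDIN; the words after -- become $1, $2, ...
    sh -s -- "$@"
}

# Quoting the delimiter ('EOF') stops the local shell from expanding $1, $2
# or $NF, so nothing inside the here document needs a backslash.
run_with_args alpha beta <<'EOF'
echo "first=$1 second=$2" | awk '{print $NF}'
EOF
```

The pipeline inside the here document prints the last whitespace-separated field of the echoed line, which is second=beta.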




Friday, August 11, 2006

A quick read of Mac OS X Internals

I often hear people comment on how thick the Mac OS X Internals book is. Well, it is thick, but not too thick. I've already read it cover-to-cover twice, and will show that it's even possible to tackle in one sitting. Check it out. ;-)

Sunday, August 06, 2006

WWDC 2006

I'm getting ready for WWDC which starts tomorrow. I'm very excited, and hopefully it'll fuel some interesting posts coming up.

Friday, August 04, 2006

Tracing Objective-C Messages

Tools like strace, ltrace, truss, ktrace, etc, are very cool, and necessary if you really want to understand how things work. They allow you to watch what a process is doing by showing you when certain functions are called. It would also be really cool if we could see similar information as Objective-C messages are sent.

So, I read through the Objective-C runtime code and discovered a way. A few days later I found a good blog post by Dave Dribin that outlines the basic idea that I had used. However, his solution requires you to recompile libobjc.dylib, which is undesirable as well as unrealistic in many cases.

Please take a few moments to read his post, then come back and read the rest of this...

...

OK, as he explains, the symbol that we want access to, "_logObjcMessageSends", isn't exported (remember, nm showed it as a little "t"), so he rebuilds the libobjc dylib in order to export the symbol. I'd like to propose an alternate solution that doesn't require touching libobjc.dylib.

Rather than looking up the symbol address using dlsym(), we can use the often overlooked nlist(3) function, which will return the address of "private" symbols too. So, in the dylib that we want to insert with DYLD_INSERT_LIBRARIES, we could have code like:

...
typedef int (*ObjCLogProc)(BOOL, const char *, const char *, SEL);
typedef int (*LogObjcMessageSendsFunc)(ObjCLogProc);

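/* Two entries: the symbol we want, plus a zeroed entry that ends the list. */
struct nlist nl[2];
bzero(&nl, sizeof(struct nlist) * 2);
nl[0].n_un.n_name = "_logObjcMessageSends";

/* nlist() fills in n_type and n_value for each symbol it finds. */
if (nlist("/usr/lib/libobjc.dylib", nl) < 0 || nl[0].n_type == N_UNDF) {
    fprintf(stderr, "nlist(%s, %s) failed\n",
            "/usr/lib/libobjc.dylib",
            nl[0].n_un.n_name);
    return;
}

/* Treat the symbol's address as a function pointer and call through it. */
LogObjcMessageSendsFunc fcn = (LogObjcMessageSendsFunc) nl[0].n_value;
(fcn)(&MyLogObjCMessageSendFunction);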
...

This code uses nlist() to look up the address of _logObjcMessageSends. The symbol it's looking up happens to be "private", but that's OK. Then once it has the address of the symbol, it casts it to a pointer to a function with the correct signature. Once that's done, the new function pointer is used just like any ol' function.

So, this solution works just like Dave Dribin's, but it doesn't require a recompile of the Objective-C runtime.