Sunday, January 21, 2018

Linux: Scraping web output with wget

This is probably a tip that smarter folks than me know already, but I just ran into this today and decided I'd write it down for future reference.

I'm writing a simple Unix shell script that scans the web interfaces of a bunch of internal servers, looking for a specific error message. Very light touch, fairly simple stuff.

Lots of simple stuff like this:
lynx -source "http://servername.int.local" | \
grep "error code"

It turns out that the server I need to install this on doesn't have lynx installed. It does have wget, however. But wget defaults to dumping the output into a file instead of to standard output, which is annoying when you don't actually care about saving the output.

Unless you do it like this:
wget -qO - "http://servername.int.local" | \
grep "error code"

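For this purpose, these two bits of sample code (one with lynx, the other with wget) behave the same way: both write the page source to standard output for grep to read. The "-q" flag quiets wget's normally verbose output (it goes to stderr, so it wouldn't pollute the pipe, but it does clutter the terminal), and "-O -" tells wget to write the downloaded page to stdout ("-") instead of to a file.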
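Here's a rough sketch of how that wget one-liner might look inside a loop over several servers. The function name scan_servers, the hostnames in the usage comment, and the --timeout/--tries values are my own placeholders, not something from the real script; the timeout and retry flags are just there so one dead server doesn't stall the whole scan.

```shell
# scan_servers PATTERN HOST...
# Fetch each host's page with wget and report the hosts whose page
# contains PATTERN. (Hypothetical helper; adjust to taste.)
scan_servers() {
    pattern=$1
    shift
    for host in "$@"; do
        # -q: quiet, -O -: write the page to stdout.
        # --timeout/--tries are assumed values, not from the original script.
        if wget -q -O - --timeout=10 --tries=1 "http://$host" | grep -q "$pattern"; then
            echo "$host: matched \"$pattern\""
        fi
    done
}

# Usage (hostnames are made up):
#   scan_servers "error code" server1.int.local server2.int.local
```

Using grep -q here means the script only reports which servers matched, without printing the matching lines themselves. Drop the -q if you want to see them.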

Which one is faster? In my entirely non-scientific testing, lynx seems to be faster. My suggestion: test on your own system if you can. Otherwise, use lynx if it's installed (or can be installed), and keep wget as a backup only. (And I didn't even really get a chance to test curl. Sorry!)
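The "lynx first, wget as a backup" idea could be wrapped in a small function so the rest of the script doesn't care which tool is installed. This is only a sketch: fetch_page is a name I made up, and the curl branch is an untested assumption on my part (curl -s prints the page to stdout quietly), added as a last resort.

```shell
# fetch_page URL
# Print the page source to stdout using whichever tool is available.
# (Hypothetical helper; the order reflects the preference described above.)
fetch_page() {
    url=$1
    if command -v lynx >/dev/null 2>&1; then
        lynx -source "$url"
    elif command -v wget >/dev/null 2>&1; then
        wget -qO - "$url"
    elif command -v curl >/dev/null 2>&1; then
        # Assumption: curl was not tested in the original post.
        curl -s "$url"
    else
        echo "fetch_page: no lynx, wget, or curl found" >&2
        return 1
    fi
}

# Usage:
#   fetch_page "http://servername.int.local" | grep "error code"
```

The command -v check is a standard way to ask the shell whether a command exists, and it doesn't need administrator rights.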

For now, I've got a solid workaround to get my script to work on a server that doesn't have lynx and where I'm not an administrator.

Thus ends today's lesson in stupid shell script tricks.