Globs#

A common task in shell scripts is to loop over a changing but well-defined set of files, for example to delete all the log files after a successful run of a program. The names of such files often include the date or time they were created, so you can’t simply hard-code a path in the cleanup script. You need some way of referring to all the log files regardless of their actual names: in other words, a pattern in which the unknown part of a path is matched by a wildcard.

Globs are such patterns. By far the most common wildcard in globs is the asterisk, *. It matches any number of characters (including zero) at that point in the path, but never a slash:
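A minimal sketch of the asterisk in action, run in a scratch directory so no real files are touched. The file names and the star_matches variable are made up for illustration.

```shell
# Hypothetical file names in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
touch 2000-12-31.log 2001-01-01.log 2000-12-31.log.tar.gz notes.txt

# The shell replaces ./*.log with the sorted list of matching paths
# before echo ever runs.
star_matches="$(echo ./*.log)"
echo "$star_matches"
# prints: ./2000-12-31.log ./2001-01-01.log

# Return to the previous directory and remove the scratch directory.
cd - > /dev/null && rm -rf "$workdir"
```

Neither notes.txt nor the .tar.gz archive is listed, because the pattern has to account for the whole name.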

If you’re wondering why I don’t just echo *.log, see the start of Including Files and the excellent Back To The Future: Unix Wildcards Gone Wild, which has several examples of why starting an argument with a wildcard character is a bad idea. In short, a file whose name begins with a dash would be parsed by the command as an option rather than as a file.

If we care about the number of characters but not their contents we can use the ? wildcard. It matches any single character:
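As a small sketch (the file names and the question_matches variable are invented for the example), two question marks match a stem of exactly two characters:

```shell
# Hypothetical file names in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
touch a.log ab.log abc.log

# Each ? stands for exactly one character, so only the
# two-character stem matches.
question_matches="$(echo ./??.log)"
echo "$question_matches"
# prints: ./ab.log

# Return to the previous directory and remove the scratch directory.
cd - > /dev/null && rm -rf "$workdir"
```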

Those familiar with regular expressions will recognize that the glob * corresponds to the regex .* and the glob ? corresponds to the regex . (a single dot). But there are important differences:

  • Globs are always anchored. That is, ./*.log is equivalent to ^\./.*\.log$ (with the modifier that . includes newline characters), which is why it does not match “./2000-12-31.log.tar.gz”.

  • A . in a glob is a literal dot, not a wildcard.

In general, it’s best to think of them as two different languages which just happen to have some superficial similarities.
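The anchoring difference can be checked without touching the file system. This sketch assumes grep with -E and -q is available, and uses a case statement because its patterns follow glob syntax; the variable names are illustrative.

```shell
name='./2000-12-31.log.tar.gz'

# case patterns use glob syntax and must match the whole string.
case "$name" in
    ./*.log) glob_result='match' ;;
    *)       glob_result='no match' ;;
esac

# An unanchored regex only needs to match somewhere in the string.
if printf '%s\n' "$name" | grep -Eq '\./.*\.log'; then
    unanchored_result='match'
else
    unanchored_result='no match'
fi

# With explicit anchors the regex rejects the name, just like the glob.
if printf '%s\n' "$name" | grep -Eq '^\./.*\.log$'; then
    anchored_result='match'
else
    anchored_result='no match'
fi

echo "glob: $glob_result, unanchored regex: $unanchored_result, anchored regex: $anchored_result"
# prints: glob: no match, unanchored regex: match, anchored regex: no match
```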

If we care about the specific characters we can specify the characters we want to match at that location, in square brackets:
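For instance, assuming a handful of made-up file names, [12] accepts either a 1 or a 2 at that position and nothing else:

```shell
# Hypothetical file names in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
touch file1.txt file2.txt file3.txt filea.txt

# The bracket expression matches exactly one character from the set.
bracket_matches="$(echo ./file[12].txt)"
echo "$bracket_matches"
# prints: ./file1.txt ./file2.txt

# Return to the previous directory and remove the scratch directory.
cd - > /dev/null && rm -rf "$workdir"
```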

A handy but dangerous shortcut is matching character classes. For example, we can match any lowercase ASCII character using the [[:lower:]] pattern:
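A sketch of the class in use, with invented single-character file names, one of which is a c with a cedilla. It assumes a file system that accepts UTF-8 file names, and it sets LC_ALL to POSIX before globbing.

```shell
# Hypothetical file names in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
touch a B 1 ç

# Pin the locale so the class means the same thing everywhere.
export LC_ALL=POSIX

# Only names consisting of a single lowercase letter can match.
lower_matches="$(echo ./[[:lower:]])"
echo "$lower_matches"
# prints: ./a

# Return to the previous directory and remove the scratch directory.
cd - > /dev/null && rm -rf "$workdir"
```

Without the LC_ALL line the result depends on whichever locale the script happens to inherit, which is what makes the shortcut dangerous.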

Character classes and more are explained in detail in man 7 regex (see also man 7 glob). This is one place where the functionality of globs and regexes overlaps.

Hold on, what’s with LC_ALL? And why isn’t the cedilla treated as a lowercase character? The answer to both comes back to locales, which for the purposes of this chapter can be treated as the mapping from bytes to what is loosely called “characters.” Basically, to give the code a chance of treating strings identically across configurations, you must declare the locale first. LC_ALL is a locale override variable, and the value “POSIX” refers to the POSIX locale, which is the only one guaranteed to be available on all modern *nix installations. In the POSIX locale [[:lower:]] maps to [abcdefghijklmnopqrstuvwxyz].

Globbing tips#

TIP: Use a slash at the end of a glob to match only directories:
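A quick sketch with made-up names: two directories and one regular file, of which only the directories survive the trailing slash.

```shell
# Hypothetical names in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
mkdir cache logs
touch readme.txt

# The trailing slash restricts the matches to directories.
dir_matches="$(echo ./*/)"
echo "$dir_matches"
# prints: ./cache/ ./logs/

# Return to the previous directory and remove the scratch directory.
cd - > /dev/null && rm -rf "$workdir"
```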

TIP: Globs won’t match dotfiles by default:
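To illustrate with two invented files, the plain asterisk skips the dotfile, while a pattern that spells out the leading dot finds it:

```shell
# Hypothetical file names in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
touch .hidden visible

# A wildcard at the start of a name does not match a leading dot.
default_matches="$(echo ./*)"
echo "$default_matches"
# prints: ./visible

# Writing the dot explicitly makes the dotfile match.
explicit_matches="$(echo ./.h*)"
echo "$explicit_matches"
# prints: ./.hidden

# Return to the previous directory and remove the scratch directory.
cd - > /dev/null && rm -rf "$workdir"
```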

TIP: Set the dotglob shell option to match dotfiles:
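A sketch of the option in use. It needs Bash, since shopt is a Bash builtin, and the file names are again made up.

```shell
# Requires Bash: shopt is not available in plain POSIX sh.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
touch .hidden visible

# Turn dotglob on, expand the glob, then restore the default.
shopt -s dotglob
dotglob_matches="$(echo ./*)"
shopt -u dotglob

echo "$dotglob_matches"
# prints: ./.hidden ./visible

# Return to the previous directory and remove the scratch directory.
cd - > /dev/null && rm -rf "$workdir"
```

Even with dotglob set, the special entries . and .. are not matched, so the loop never wanders into the parent directory.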

 

This page is a preview of The newline Guide to Bash Scripting
