Grep Command Tutorial For Unix
Perl-style regular expressions use the Perl-Compatible Regular Expressions (PCRE) library to interpret the pattern and perform searches. As the name implies, this style follows Perl’s regular expression syntax. Perl has an advantage because the language was optimized for text searching and manipulation. As a result, PCRE can be more efficient and far more feature-rich for finding content. The consequence is that patterns can become horribly messy and complex. To put it another way, using PCRE to find information is like using a weed whacker on yourself to do brain surgery: it gets the job done with a minimum of effort, but it is an awful mess.
The specific search features and options available with PCRE do not depend on grep itself, but on the libpcre library that grep was built against. This means that they can vary between machines and operating systems. Usually the pcrepattern or pcre manpages will provide information on the options that are available on your machine. What follows is a general set of PCRE search functions that should be available on most machines.
Also note that Perl-style regular expressions may or may not be present by default on your operating system. Fedora and Red Hat–based systems tend to include them (assuming you install the PCRE library), but Debian, for instance, does not enable Perl-style regular expressions by default in their grep package. Instead, they ship a pcregrep program, which provides very similar functionality to grep -P. Individuals can, of course, compile their own grep binary that does include PCRE support should they be so inclined.
To test whether Perl-style regular expression support is built into your version of grep, run the following command (or something like it):
$ grep -P test /bin/ls
grep: The -P option is not supported
This usually means that when grep was built it could not find the libpcre library or that it was intentionally disabled with the --disable-perl-regexp configuration option when it was compiled. The solution is to either install libpcre and recompile grep or find an applicable package for your operating system.
The general form of using Perl-style grep is:
grep -P options pattern file
It is important to note that, unlike grep -F and grep -E (which have the fgrep and egrep shortcuts), there is no “pgrep” shortcut for grep -P. The pgrep command is used to search for running processes on a machine. All the same command-line options that are present for grep will work with grep -P; the only difference is how the pattern is processed. PCRE provides additional metacharacters and character classes that can be used to enhance search functionality. Other than the additional metacharacters and classes, the pattern is constructed in the same way as a typical regular expression.
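As a rough illustration of both points, the sketch below first probes whether the local grep accepts -P and then, only if it does, uses two PCRE-only constructs (the \d class and a lookahead). The sample input is made up for this example, and the variable names are hypothetical.

```shell
#!/bin/sh
# Probe for PCRE support: grep exits with status 0 on a match, 1 on no match,
# and 2 on an error such as an unsupported -P option.
if printf 'test\n' | grep -q -P 'te(?=st)' 2>/dev/null; then
    pcre_support=yes
else
    pcre_support=no
fi
echo "PCRE support: $pcre_support"

# If -P is available, use PCRE-only syntax: \d+ followed by a lookahead that
# requires " ms" without making it part of the match (-o prints the match only).
if [ "$pcre_support" = yes ]; then
    durations=$(printf 'took 120 ms\ntook 3 s\n' | grep -o -P '\d+(?= ms)')
    echo "$durations"
fi
```

On a grep built without PCRE, only the first message is printed and the second search is skipped.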
Choosing Between grep Types and Performance Considerations
Now that we have gone over all four grep programs, the question is how to determine which one to employ for a given task. For most routine uses, people tend to use the standard grep command (grep -G) because performance isn’t an issue when searching small files and when complex search patterns aren’t necessary. Generally, the basic grep is the default choice for most people, and so the question becomes when it makes sense to use something else.
When to Use grep -E
Although almost everything can be done in grep -G that can be done in grep -E, the latter has the advantage of accomplishing the task in fewer characters, without the counterintuitive escaping discussed earlier. All of the extra functionality in extended regular expressions has to do with quantifiers or subpatterns. Additionally, if any significant use of backreferences is needed, extended regular expressions are ideal.
When to Use grep -F
There is one prerequisite to using grep -F, and if a user cannot meet that requirement, grep -F is simply not an option. Namely, a search pattern for grep -F cannot rely on any metacharacters, escapes, wildcards, or alternations, because every character is matched literally. It is faster, but at the expense of functionality.
That said, grep -F is extremely useful for quickly searching large amounts of data for tightly defined strings, making it the ideal tool to search through immense logfiles quickly. In fact, it is fairly easy to develop a robust “log watching” script with grep -F and a good text file listing of important words or phrases that should be pulled out of logfiles for analysis.
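A minimal sketch of such a log watcher follows. The watch list, the log contents, and the file names are all invented for illustration; a real script would point at actual logfiles.

```shell
#!/bin/sh
# Hypothetical files for illustration only.
workdir=$(mktemp -d)
watchlist="$workdir/watchlist.txt"
logfile="$workdir/app.log"

# One fixed string per line; characters such as [ ] . * are taken literally by -F.
cat > "$watchlist" <<'EOF'
authentication failure
disk full
[error]
EOF

cat > "$logfile" <<'EOF'
Jan 01 10:00:01 host sshd: authentication failure for user bob
Jan 01 10:00:02 host cron: job finished
Jan 01 10:00:03 host app: [error] cannot open socket
Jan 01 10:00:04 host kernel: e r r o r
EOF

# -F: fixed strings, -f: read the strings from a file, -n: prefix line numbers.
hits=$(grep -F -n -f "$watchlist" "$logfile")
hit_count=$(grep -F -c -f "$watchlist" "$logfile")
echo "$hits"
echo "matching lines: $hit_count"

rm -rf "$workdir"
```

Only the first and third log lines are reported. Without -F, the entry [error] would be read as a bracket expression and would match any line containing an e, r, or o.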
Another good use for grep -F is searching through mail logs and mail folders to ensure delivery of emails to users, especially on systems with many mail accounts. This is made possible by assigning every email message a unique Message ID. For instance:
grep -FHr MESSAGE-ID /var/mail
This command will search for the fixed string MESSAGE-ID in all files inside /var/mail (recursing into any subdirectories), and then display the match and also the filename. This is a quick, down-and-dirty way to see which users have a particular message sitting in their mailbox. The real bonus is that this information can be verified without ever having to look inside a user’s mailbox and deal with the privacy issues of reading other people’s mail. In reality, you may wish to search mailbox directories and spam folders, which typically aren’t stored under /var/mail, but you get the point of how this works.
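The same idea can be tried safely on a throwaway directory. In this sketch the spool directory, the mailbox names, and the Message-ID are all hypothetical; it uses -l instead of -H so that only the mailbox names are printed and no message content is shown.

```shell
#!/bin/sh
# Hypothetical mail spool and Message-ID, created only for this illustration.
spool=$(mktemp -d)
msgid='<20240101.1234@mail.example.com>'

printf 'From: a@example.com\nMessage-ID: %s\n\nhello\n' "$msgid" > "$spool/alice"
printf 'From: b@example.com\nMessage-ID: <20240101x1234@mail.example.com>\n\nhi\n' > "$spool/bob"

# -F treats the dots in the ID literally, -l prints only the names of matching
# files, and -r recurses into subdirectories.
mailboxes=$(grep -F -l -r -- "$msgid" "$spool")
echo "$mailboxes"

rm -rf "$spool"
```

Only the alice mailbox is listed. As a regular expression, the dot in the ID would also have matched the x in the second mailbox, which is one reason fixed-string matching suits this job.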
When to Use grep -P
Perl-style regular expressions are hands-down the most powerful of all the styles presented in this book. They are also the most complicated, prone to user error, and potentially capable of bogging down a system’s performance if not done correctly. However, they are clearly the superior style out of all the regular expression formats used in this book.
For this reason, many applications prefer to use PCRE instead of GNU regular expressions. For instance, the popular intrusion detection system snort uses PCRE to match bad packets on the wire. The patterns are written intelligently so that there can be very little packet loss, even though a single machine can search all the packets going through a fully loaded 100 Mbit or gigabit interface. As has been said before, writing a regular expression well tends to be more important than the particular regular expression format you use.
Some people simply prefer to use grep -P as their default (for instance, by specifying -P inside their GREP_OPTIONS environment variable). If searching is going to be done in an “international” way, the PCRE language character classes make this far easier. PCRE comes with many more character classes for finely tuning a regular expression, beyond what is possible with the POSIX definitions, for instance. Most importantly, the ability to use the various PCRE options (e.g., PCRE_MULTILINE) allows searching in more powerful ways than GNU regular expressions.
For simple to moderately complex regular expressions, grep -E suffices. However, there are limitations, and those may push a user toward PCRE. It is a trade-off between complexity and functionality. PCRE also helps users craft regular expressions that can be almost immediately transferred directly into Perl scripts (or transferred from Perl scripts) without having to go through a great deal of translation.
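To make the difference in notation between the two non-Perl styles concrete, here is a small sketch that runs the same search (the string ab repeated two or more times) in basic and extended syntax. The input lines are invented for the example.

```shell
#!/bin/sh
# Made-up input: three lines, two of which contain "ab" at least twice in a row.
input=$(printf '%s\n' 'abab' 'ab' 'xababab')

# Basic syntax (grep -G, the default): grouping and intervals need backslashes.
basic=$(printf '%s\n' "$input" | grep -c '\(ab\)\{2,\}')

# Extended syntax (grep -E): the same metacharacters are written bare.
extended=$(printf '%s\n' "$input" | grep -E -c '(ab){2,}')

echo "basic: $basic, extended: $extended"
```

Both commands count the same two lines; only the amount of escaping differs.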
Advanced Tips and Tricks with grep
As mentioned earlier, grep can be used in very powerful ways to search for content in files or across a filesystem. It is possible to use previous matches to search later strings (called backreferences). There are also a variety of tricks to search for nonpublic personal information and even find binary strings in binary files. The following sections discuss some advanced tips and tricks.
Backreferences
The grep program has the ability to match based on multiple previous conditions. For instance, if you want to find all lines that repeatedly use a particular set of words, a single grep pattern will not work; however, it is possible to do this with the use of backreferences.
Suppose you wish to find any line that has multiple instances of the words “red”, “blue”, or “green”. Imagine the following text file:
The red dog fetches the green ball.
The green dog fetches the blue ball.
The blue dog fetches the blue ball.
Only the third line repeats the use of the same color. A regular expression pattern of '(red|green|blue).*(red|green|blue)' would return all three lines. To overcome this problem, you could use backreferences:
grep -E '(red|green|blue).*\1' filename
This command matches only the third line, as intended. For extended regular expressions, only a single digit can be used to specify a backreference (i.e., you can refer back to at most nine subpatterns, \1 through \9). Using Perl-style regular expressions, theoretically you can have many more (at least two digits).
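The example can be reproduced end to end with the sketch below, which writes the three sample sentences to a temporary file (the file and variable names are hypothetical) and runs both patterns against it.

```shell
#!/bin/sh
# Sample sentences from the text, one per line.
colors=$(mktemp)
cat > "$colors" <<'EOF'
The red dog fetches the green ball.
The green dog fetches the blue ball.
The blue dog fetches the blue ball.
EOF

# Without a backreference, any two colors satisfy the pattern.
any_two=$(grep -E -c '(red|green|blue).*(red|green|blue)' "$colors")

# With \1, the second color must equal whatever the first group matched.
same_color=$(grep -E '(red|green|blue).*\1' "$colors")

echo "lines with any two colors: $any_two"
echo "lines repeating a color: $same_color"
rm -f "$colors"
```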
This could be used to validate XML syntax (i.e., the “opening” and “closing” tags are the same), HTML syntax (match all lines with the various opening and closing “heading” tags, such as <h1>, <h2>, etc.), or even to analyze writing for pointless repetition of buzzwords.
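As a sketch of the HTML heading case, the following uses a backreference to require that a heading's closing tag repeats the opening tag. The HTML fragments are invented, and the check assumes each heading sits on a single line.

```shell
#!/bin/sh
# Hypothetical HTML fragments, one element per line.
html=$(printf '%s\n' '<h1>Intro</h1>' '<h2>Setup</h3>' '<h3>Usage</h3>' '<p>text</p>')

# (h[1-6]) captures the heading tag name; \1 requires the closing tag to repeat it.
matched=$(printf '%s\n' "$html" | grep -E '<(h[1-6])>.*</\1>')
echo "$matched"
```

The mismatched <h2>...</h3> line is not printed. Because grep works line by line, this is a quick consistency check rather than real markup validation.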
It is important to note that backreferences require the use of parentheses to determine reference numbers. grep will read the search pattern from left to right, and starting with the first parenthetical subpattern it finds, it will start numbering from 1.
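The numbering can be seen in a pattern with two subpatterns. In this sketch (the sample lines are made up), \1 refers to the color group and \2 to the animal group.

```shell
#!/bin/sh
lines=$(printf '%s\n' 'red dog chases red' 'red dog chases dog')

# Group 1 is (red|blue) and group 2 is (dog|cat), numbered left to right.
by_first=$(printf '%s\n' "$lines" | grep -E '(red|blue) (dog|cat).*\1')
by_second=$(printf '%s\n' "$lines" | grep -E '(red|blue) (dog|cat).*\2')

printf 'backreference 1 selects: %s\n' "$by_first"
printf 'backreference 2 selects: %s\n' "$by_second"
```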
Typically, backreferences are used when a subpattern contains alternation, as in the previous example. It is not required, however, for a subpattern to actually contain alternation. For instance, assuming there is a large subpattern that you wish to refer back to later in the regular expression, you could use a backreference as an artificial “alias” for that subpattern without having to type out the entire pattern multiple times. For instance:
grep -E '(I am the very model of a modern major general.).*\1' filename
would search for repetitions of the sentence “I am the very model of a modern major general.” separated by any amount of optional content. This certainly reduces the number of keystrokes and makes the regular expression more manageable, but it also raises some performance considerations, as discussed previously. The user needs to weigh the benefits of convenience against performance, depending on what she is trying to accomplish.
Binary File Searching
Up to this point, it seems that grep could only be used to search for text strings in text files. This is what it is most used for, but grep can also search for strings in binary files.
It is important to note that “text” files exist on computers mostly for human readability. Computers talk purely in binary and machine code. The entire ASCII character set consists of 128 characters, of which only 95 are printable and therefore “human-readable” (a single byte can hold 256 different values). However, many computer programs contain text strings as well. For instance, “help” screens, filenames, error messages, and expected user input may appear as text inside binary files.
The grep command does not distinguish to any great extent between searching text or binary files. As long as you feed it patterns (even binary patterns), it will happily search any file for the patterns you tell it to search. It does do an initial check to see if a file is binary and alters the way it displays results accordingly (unless you manually specify other behavior):
bash$ grep help /bin/ls
Binary file /bin/ls matches
This command searches for the string “help” in the binary file ls. Instead of showing the line where the text appears, it simply indicates that a match was found. The reason again relates to the fact that computer programs are in binary and therefore not human-readable. There are no “lines” in programs, for instance. Binary files don’t contain line breaks in the textual sense, because inserting them would alter the code; line breaks are simply a feature to make text files more readable, which is why grep tells you only whether there is a match. To get an idea of the kind of text that is in a binary file, you can use the strings command. For instance, strings /bin/ls would list all the text strings in the ls command.
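If you want to see the matching text itself rather than just the notice, grep's -a option treats a binary file as text. The sketch below builds a small throwaway binary file (its contents are invented) and prints only the matched portion with -o.

```shell
#!/bin/sh
# Hypothetical binary file: text fragments mixed with NUL and control bytes.
binfile=$(mktemp)
printf 'ELF\000\001\002usage: tool [--help]\000\003\004\n' > "$binfile"

# Default behaviour: grep only reports that the binary file matches
# (the exact wording of the notice varies between grep versions).
grep help "$binfile"

# -a treats the file as text; -o prints only the matched part, which keeps
# the surrounding non-printable bytes off the terminal.
found=$(grep -a -o 'help' "$binfile")
echo "$found"

rm -f "$binfile"
```

Using -a without -o prints the whole "line", including any control bytes, which can garble a terminal.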
There is another way to search binary files that is specific to binary data. In this case, you need to rely on some tricks, because you cannot type in binary data directly with a normal keyboard. Instead, you need to use a special form of a regular expression to type in the hexadecimal equivalent of the data you want to search for. For instance, if you wanted to search a binary file for the hexadecimal bytes AB AA, you would type the following command:
bash$ LC_ALL=C grep -P '\xab\xaa' test.hex
Binary file test.hex matches
The general format is to type \x followed by the two hexadecimal digits of each byte you wish to match. In GNU grep, the \x escape is only understood with the -P option, and setting LC_ALL=C makes grep compare raw bytes rather than multibyte characters. There is no real limit to the size of the string you can enter. This type of searching could be useful in malware analysis. For instance, the metasploit framework (http://www.metasploit.org) can generate binary payloads to exploit remote machines. This payload could be used to establish a remote shell, add accounts, or accomplish other malicious activity.
Using hexadecimal searching, it would be possible to determine from binary strings which of the metasploit payloads were being used in an actual attack. Additionally, if you could determine a unique hexadecimal string that was used in a virus, you could create a basic virus scanner using grep. In fact, many older virus scanners did more or less this very thing by searching for unique binary strings in files against a list of known bad hexadecimal signatures.
Many buffer overflows or exploit payloads are written in C, and it is typical in C to write out each byte in hexadecimal with the \x escape.
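A minimal sketch of a signature check along these lines is shown below. The sample file and the two-byte "signature" (0xAB 0xAA) are invented for illustration. It assumes GNU grep, tries the \x form with -P first, and falls back to a fixed-string search built with printf if this grep was compiled without PCRE.

```shell
#!/bin/sh
# Hypothetical sample: the two signature bytes 0xAB 0xAA embedded in other data.
sample=$(mktemp)
printf 'MZ\000\001\253\252\002\n' > "$sample"   # \253 \252 are octal for 0xAB 0xAA

# LC_ALL=C makes grep compare raw bytes instead of decoding multibyte characters.
if LC_ALL=C grep -q -P '\xab\xaa' "$sample" 2>/dev/null; then
    status=0
else
    status=$?
fi

# Exit status 2 signals an error, for example a grep built without -P.
# In that case, build the raw bytes with printf and search for them with -F.
if [ "$status" -eq 2 ]; then
    signature=$(printf '\253\252')
    if LC_ALL=C grep -q -F -- "$signature" "$sample"; then
        status=0
    else
        status=$?
    fi
fi

if [ "$status" -eq 0 ]; then
    echo "signature found"
else
    echo "signature not found"
fi
rm -f "$sample"
```

A scanner in the sense described above would loop over a list of such signatures and a set of files. This sketch checks a single signature against a single file.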