Bash Remove Duplicates From File And Sort
uniq is a command that finds the unique lines in its input (stdin or a file named as a command argument) by eliminating duplicates. It can also be used to report the duplicate lines in the input. uniq compares only adjacent lines, so it gives correct results only on sorted input. It is therefore almost always used together with the sort command, either through a pipe or with an already sorted file as input.
You can print every distinct line exactly once (all lines in the input are printed, but repeated lines are printed only once) as follows:
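The need for sorted input can be seen in a minimal sketch. The sample data here is made up for illustration and is supplied inline with printf instead of a file; "hack" appears twice, but not on adjacent lines:

```shell
# Hypothetical sample data: the two "hack" lines are not adjacent.
unsorted_out=$(printf 'hack\nbash\nhack\nfoss\n' | uniq)
sorted_out=$(printf 'hack\nbash\nhack\nfoss\n' | sort | uniq)

echo "$unsorted_out"   # both "hack" lines survive, since they were never adjacent
echo "$sorted_out"     # after sorting, "hack" is printed only once
```

Without sort, uniq passes the input through unchanged because no two neighbouring lines are equal.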
$ cat sorted.txt
bash
foss
hack
hack
$ uniq sorted.txt
bash
foss
hack
Or:
$ sort unsorted.txt | uniq
Or:
$ sort -u unsorted.txt
Display only the lines that are not repeated anywhere in the input file as follows:
$ uniq -u sorted.txt
bash
foss
Or:
$ sort unsorted.txt | uniq -u
To count how many times each line appears in the file, use the following command:
$ sort unsorted.txt | uniq -c
1 bash
1 foss
2 hack
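The counts can be passed through sort a second time to rank lines by frequency. This is a sketch with made-up inline data; the awk step is only there to normalize the leading padding that uniq -c puts in front of each count:

```shell
# Hypothetical input, supplied inline instead of unsorted.txt.
ranked=$(printf 'hack\nbash\nhack\nfoss\nhack\nbash\n' |
    sort |          # group identical lines together
    uniq -c |       # prefix each distinct line with its count
    sort -rn |      # order by count, highest first
    awk '{print $1, $2}')

echo "$ranked"
```

With this input, "hack" (3 occurrences) is listed first, followed by "bash" (2) and "foss" (1).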
Find duplicate lines in the file as follows:
$ sort unsorted.txt | uniq -d
hack
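The three modes shown so far (no option, -u, and -d) can be compared side by side on the same data. The following is an illustrative sketch; the sample function and its contents are hypothetical stand-ins for unsorted.txt:

```shell
# Hypothetical stand-in for unsorted.txt.
sample() { printf 'hack\nbash\nhack\nfoss\n'; }

distinct=$(sample | sort | uniq)      # every line once
singles=$(sample | sort | uniq -u)    # only lines that occur exactly once
repeated=$(sample | sort | uniq -d)   # only lines that occur more than once

echo "distinct: $(echo $distinct)"
echo "singles:  $(echo $singles)"
echo "repeated: $(echo $repeated)"
```

The -u and -d results together cover the same set of lines as plain uniq.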
To compare only part of each line, we can combine the -s and -w options.
- -s N skips the first N characters of each line
- -w N compares at most N characters
The characters selected this way form the key that uniq compares, as follows:
$ cat data.txt
u:01:gnu
d:04:linux
u:01:bash
u:01:hack
We need to use the two-digit field (01 or 04, that is, characters 3 and 4 of each line) as the uniqueness key.
To do this, skip the first 2 characters (-s 2) and limit the comparison to 2 characters with the -w option (-w 2):
$ sort data.txt | uniq -s 2 -w 2
d:04:linux
u:01:bash
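To make the comparison key visible, the sketch below extracts the same two characters with cut before running uniq. It assumes GNU uniq (the -w option is not part of POSIX) and feeds the data.txt contents inline:

```shell
# Same records as data.txt above, supplied inline.
data='u:01:gnu
d:04:linux
u:01:bash
u:01:hack'

# Characters 3-4 are what "uniq -s 2 -w 2" compares.
keys=$(printf '%s\n' "$data" | sort | cut -c 3-4)

# Keep the first line of each run of equal keys.
deduped=$(printf '%s\n' "$data" | sort | uniq -s 2 -w 2)

echo "$keys"
echo "$deduped"
```

The three lines with key 01 collapse into the first of them after sorting, u:01:bash.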
When the output of one command is used as input to xargs, it is preferable to terminate each item with a zero byte, and this applies to the output of uniq as well. Without a zero byte terminator, xargs splits its input on whitespace (spaces and newlines) by default. For example, the line "this is a line" from stdin is taken as four separate arguments by xargs, although it is really a single item. When a zero byte terminator is used together with xargs -0, \0 is the delimiter, so a line that includes spaces is interpreted as a single argument.
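The difference in argument splitting can be checked with a small sketch that counts how many arguments xargs sees. The variable names are illustrative, and -n 1 is used only so that each argument triggers its own echo:

```shell
# Default splitting: every space-separated word becomes its own argument.
split_count=$(printf 'this is a line\n' | xargs -n 1 echo | wc -l | tr -d ' ')

# Zero byte terminated input with -0: the whole line is one argument.
whole_count=$(printf 'this is a line\0' | xargs -0 -n 1 echo | wc -l | tr -d ' ')

echo "default: $split_count argument(s), with -0: $whole_count argument(s)"
```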
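Zero byte terminated output can be generated from the uniq command with the -z option. Note that in GNU uniq, -z makes the zero byte the line delimiter for the input as well as the output, so the input must already be zero byte terminated. A newline-delimited file can be converted with tr first:
$ sort file.txt | tr '\n' '\0' | uniq -z
The following command removes all the files whose names are listed, one per line, in file.txt:
$ sort file.txt | tr '\n' '\0' | uniq -z | xargs -0 rm
If a filename appears more than once in the file, the uniq command writes it to stdout only once.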
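A self-contained way to try this safely is to work in a temporary directory. The sketch below assumes GNU uniq, where -z applies to the input as well as the output, so the newline-delimited list is converted with tr before it reaches uniq; all file names are made up for illustration:

```shell
# Scratch area with three files; one name contains a space.
workdir=$(mktemp -d)
listfile=$(mktemp)
touch "$workdir/my notes.txt" "$workdir/draft.txt" "$workdir/keep.txt"

# The list names two files to delete, one of them twice.
printf '%s\n' "$workdir/my notes.txt" "$workdir/draft.txt" \
    "$workdir/my notes.txt" > "$listfile"

# Sort, convert newlines to zero bytes, drop duplicates, then delete.
sort "$listfile" | tr '\n' '\0' | uniq -z | xargs -0 rm

remaining=$(ls "$workdir")
echo "$remaining"

# Clean up the scratch area.
rm -rf "$workdir" "$listfile"
```

Only keep.txt should remain, and rm is never asked to delete the same file twice.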
String pattern generation with uniq
Here is an interesting question: we have a string containing repeated characters. How can we count how many times each character appears in the string and print the result in the following format?
Input: ahebhaaa
Output: 4a1b1e2h
Each character appears once in the output, prefixed with the number of times it occurs in the string. We can solve this using uniq and sort as follows:
INPUT="ahebhaaa"
OUTPUT=$(echo "$INPUT" | sed 's/./&\n/g' | sed '/^$/d' | sort | uniq -c | tr -d ' \n')
echo "$OUTPUT"
The pipeline above can be broken down as follows:
- echo "$INPUT": Prints the input to stdout.
- sed 's/./&\n/g': Appends a newline character to each character so that only one character appears per line. This makes the characters sortable, because the sort command works on newline-delimited items.
- sed '/^$/d': The last character also gets a \n appended, and the line's own terminating newline follows it, so a blank line forms at the end. This command removes that blank line.
- sort: Since each character is on its own line, the characters can be sorted so that they can serve as input to uniq.
- uniq -c: Prints each line prefixed with the number of times it is repeated (the count).
- tr -d ' \n': Removes the space and newline characters so that the output is produced in the given format.
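As a variation, the two sed steps can be replaced by grep -o, which prints each match on its own line and so produces no trailing blank line. This is only a sketch: the function name count_chars is hypothetical, and grep -o is a GNU/BSD extension rather than a POSIX feature:

```shell
# count_chars STRING: print each distinct character prefixed by its count.
count_chars() {
    printf '%s\n' "$1" |
        grep -o . |       # one character per line
        sort |            # bring identical characters together
        uniq -c |         # prefix each character with its count
        tr -d ' \n'       # strip padding and newlines
}

count_chars "ahebhaaa"
echo
```

For the sample input this prints 4a1b1e2h. As with the original pipeline, tr also deletes any space characters in the input itself, so the sketch assumes a string without spaces.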