I have a big text file like this example:
chr1 109472560 109472561 -4732 CLCC1
chr1 109472560 109472561 -4732 CLCC1
chr1 109472560 109472561 -4732 CLCC1
chr1 109477498 109477499 206 CLCC1
chr1 109477498 109477499 206 CLCC1
chr1 109477498 109477499 206 CLCC1
There are some repeated lines, and I want to keep only one copy of each. For the above example, the expected output would look like this:
chr1 109472560 109472561 -4732 CLCC1
chr1 109477498 109477499 206 CLCC1
I am trying to do that in awk using the following command:
awk myfile.txt | uniq > uniq_file_name.txt
but the output is empty. Do you know how to fix it?
4 Answers
EDIT: As hek2mgl pointed out, if you only need to remove consecutive duplicate lines, try the following.
Let's say the following is Input_file:
cat Input_file
chr1 109472560 109472561 -4732 CLCC1
chr1 109472560 109472561 -4732 CLCC1
chr1 109477498 109477499 206 CLCC1
chr1 109477498 109477499 206 CLCC1
chr1 109477498 109477499 206 CLCC1
chr1 109472560 109472561 -4732 CLCC1
chr1 109477498 109477499 206 CLCC1
chr1 109472560 109472561 -4732 CLCC1
Run the following code now:
awk 'prev!=$0;{prev=$0}' Input_file
Output will be as follows.
chr1 109472560 109472561 -4732 CLCC1
chr1 109477498 109477499 206 CLCC1
chr1 109472560 109472561 -4732 CLCC1
chr1 109477498 109477499 206 CLCC1
chr1 109472560 109472561 -4732 CLCC1
The following snippet will remove all duplicate lines, not only consecutive ones:
awk '!a[$0]++' Input_file
Append > output_file to the above command if you want to save the output in a separate file.
Explanation: The code below is annotated for explanation purposes only; to actually run it, use the one-liner above.
awk '
!a[$0]++     ##The condition is TRUE only when the current line has not been seen yet, i.e. its count in array a is still 0.
             ##The post-increment then raises the count, so the condition is FALSE for every later copy of the same line.
             ##awk works on condition/action pairs; no action is given here, so the default action (print the current line) runs whenever the condition is TRUE.
' Input_file ##Mentioning Input_file name here.
This is to show the difference between uniq, awk '!a[$0]++' and sort -u.
uniq: removes consecutive duplicate lines, keeps order:
$ printf 'b\nb\na\nb\nb\n' | uniq
b
a
b
awk '!a[$0]++': removes all duplicates, keeps order:
$ printf 'b\nb\na\nb\nb\n' | awk '!a[$0]++'
b
a
sort -u: removes all duplicates and sorts the output:
$ printf 'b\nb\na\nb\nb\n' | sort -u
a
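b
Your command:
$ awk myfile.txt | uniq > uniq_file_name.txt
and more precisely this part:
$ awk myfile.txt
does not work because there is no program or script for awk to execute: awk treats its first non-option argument as the program text, so myfile.txt is parsed as a script rather than read as input. Depending on the awk implementation, this fails with a syntax error or hangs waiting for input. The minimum you need to do to print all the lines is:
$ awk 1 myfile.txt
But since you had no awk script, I assume you don't need awk, so just use uniq (depending on your need, either):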
$ uniq myfile.txt
chr1 109472560 109472561 -4732 CLCC1
chr1 109477498 109477499 206 CLCC1
or
$ sort myfile.txt | uniq
which for that input will produce the same output.
Update:
Regarding the discussion in the comments about why sort: if "repeated lines" means all duplicated records anywhere in the file, use sort. If it means only consecutive duplicated lines, forget the sort.
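As a rough illustration of that distinction, the sketch below runs the approaches on a small sample in which identical lines are not all adjacent. The file name sample.txt and its contents are assumptions made up for this example; they are not the asker's file.

```shell
# Hypothetical sample: the first line reappears after a different line.
printf '%s\n' \
  'chr1 109472560 109472561 -4732 CLCC1' \
  'chr1 109472560 109472561 -4732 CLCC1' \
  'chr1 109477498 109477499 206 CLCC1' \
  'chr1 109472560 109472561 -4732 CLCC1' > sample.txt

# Only adjacent repeats collapse, so three lines remain.
consecutive=$(uniq sample.txt)

# Sorting first makes all identical lines adjacent, so two lines remain
# (in sorted order, which need not match the original order).
sorted=$(sort sample.txt | uniq)

# awk keeps the first occurrence of every line, in the original order.
ordered=$(awk '!seen[$0]++' sample.txt)

printf 'uniq:\n%s\n\nsort | uniq:\n%s\n\nawk:\n%s\n' \
  "$consecutive" "$sorted" "$ordered"
```

If the original line order matters and duplicates can be anywhere in the file, the awk form is usually the one to reach for; sort | uniq is fine when order does not matter.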
Using Perl
> cat user106.txt
chr1 109472560 109472561 -4732 CLCC1
chr1 109472560 109472561 -4732 CLCC1
chr1 109472560 109472561 -4732 CLCC1
chr1 109477498 109477499 206 CLCC1
chr1 109477498 109477499 206 CLCC1
chr1 109477498 109477499 206 CLCC1
> perl -ne ' print if $kv{$_}++ == 0 ' user106.txt
chr1 109472560 109472561 -4732 CLCC1
chr1 109477498 109477499 206 CLCC1
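To see how a per-line counter can single out one occurrence of each line, it may help to print the value the counter holds before it is incremented. This is only an illustrative sketch using awk; the array name seen and the inline input are made up for the example.

```shell
# Print "<count before increment> <line>" for each input line.
# The first occurrence of a line sees 0, the second sees 1, and so on,
# so testing the counter for 0 selects exactly the first occurrence.
trace=$(printf 'b\nb\na\nb\n' | awk '{ print seen[$0]++, $0 }')
printf '%s\n' "$trace"
```

Each distinct line gets its own counter, which is why the order of first appearance is preserved without sorting.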
To remove consecutive repeated lines:
> printf 'a\nb\nb\nb\nc\nc\nd\na\n' | perl -ne ' print if $prev ne $_ ; $prev=$_ ' -
a
b
c
d
a