4

I have a file that looks like this:

    7  C00000002 score:  -41.156 nathvy =  49 nconfs =         2251
    8  C00000002 score:  -39.520 nathvy =  49 nconfs =         3129
    9  C00000004 score:  -38.928 nathvy =  24 nconfs =          150
   10  C00000002 score:  -38.454 nathvy =  49 nconfs =         9473
   11  C00000004 score:  -37.704 nathvy =  24 nconfs =          156
   12  C00000001 score:  -37.558 nathvy =  41 nconfs =           51
    2  C00000002 score:  -48.649 nathvy =  49 nconfs =         3878
    3  C00000001 score:  -44.988 nathvy =  41 nconfs =         1988
    4  C00000002 score:  -42.674 nathvy =  49 nconfs =         6740
    5  C00000002 score:  -42.453 nathvy =  49 nconfs =         4553
    6  C00000002 score:  -41.829 nathvy =  49 nconfs =         7559

The second column contains IDs that are not sorted, and some of them repeat (C00000001, for example). Each line has its own number in the first column, and each ID is followed by score: and a value (which most often starts with -).

What I would like to do is:

1) Read the second column (the unsorted IDs) and always pick the first line in which each ID appears. For C00000001, that is the line with score: -37.558.

2) Once I have one line per unique ID, sort those lines by the number after score:, so that the most negative number comes first and the most positive one comes last.

I would like the output printed in the same format as my input file (same structure).

Ravexina
djordje
  • The first score that appears for C00000001 is -37.558. Or is the order defined by the first column? – Melebius Sep 04 '18 at 05:45
  • Oh, thanks Melebius, my fault, I will edit it now. I wrote the number with the highest score for this particular ID. So, in the first step we don't look at the score; we just pick the first occurrence of each ID and then order the results by the number after score:, from most negative to most positive. – djordje Sep 04 '18 at 05:49

3 Answers

8
$ sort -k2,2 -u < filename | sort -k4,4n

    7  C00000002 score:  -41.156 nathvy =  49 nconfs =         2251
    9  C00000004 score:  -38.928 nathvy =  24 nconfs =          150
   12  C00000001 score:  -37.558 nathvy =  41 nconfs =           51

Explanation:

  1. sort -k2,2 -u: sorts the lines by the second column and keeps only the first line for each ID. Lines that share an ID compare as equal, so their relative order from the input is preserved.
  2. sort -k4,4n: sorts the remaining lines numerically by score. Ascending order already puts the most negative score first, so -r is not needed.
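As a rough check of the two steps above, the sketch below runs the same pipeline on a few lines copied from the question. The file name `sample.txt` and the function name `first_per_id_by_score` are made up for illustration. GNU sort is assumed, and `-s` is added only to make the "keep input order for equal IDs" behaviour explicit.

```shell
# Hypothetical helper: first line per ID (column 2), ordered by score (column 4).
first_per_id_by_score() {
    # -s keeps lines with equal keys in input order; -u keeps the first of each run
    sort -s -u -k2,2 "$1" | sort -k4,4n
}

# A few lines from the question, written to a scratch file (name is an assumption).
cat > sample.txt <<'EOF'
    7  C00000002 score:  -41.156 nathvy =  49 nconfs =         2251
    9  C00000004 score:  -38.928 nathvy =  24 nconfs =          150
   12  C00000001 score:  -37.558 nathvy =  41 nconfs =           51
    2  C00000002 score:  -48.649 nathvy =  49 nconfs =         3878
    3  C00000001 score:  -44.988 nathvy =  41 nconfs =         1988
EOF

first_per_id_by_score sample.txt
```

Because sort copies whole lines, the leading whitespace of each input line is carried through unchanged, which is what keeps the output in the same layout as the input.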
Ravexina
  • You should use angle brackets for filename: <filename>. At the first moment, I thought it’s a sorting option. See http://docopt.org/, for example. – Melebius Sep 04 '18 at 06:11
  • Sure, I'll try to keep it in mind ;). but have you seen this? – Ravexina Sep 04 '18 at 06:15
  • ... or rather a variable reference like $filename, as angle brackets are a confusing syntax in shell scripts. – Grzegorz Oledzki Sep 04 '18 at 08:27
  • @Thor I saw your comment when you first posted it, but I was not able to get your suggestion to work in any form. However, I updated my command (yesterday) to sort -k4,4n, and that is enough to get the highest value in this situation. – Ravexina Sep 05 '18 at 07:28
1

With GNU awk 4.0 or later:

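$ gawk '
    # keep the first line seen for each ID, plus its score (field 4)
    !($2 in line) {line[$2] = $0; score[$2] = $4}
    # walk the IDs in ascending numeric order of score
    END {PROCINFO["sorted_in"] = "@val_num_asc"; for (id in score) print line[id]}
  ' file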
    7  C00000002 score:  -41.156 nathvy =  49 nconfs =         2251
    9  C00000004 score:  -38.928 nathvy =  24 nconfs =          150
   12  C00000001 score:  -37.558 nathvy =  41 nconfs =           51
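Where gawk is not available, a similar result can probably be had with plain awk plus sort. This is only a sketch and not part of the answer above: `!seen[$2]++` is the common first-occurrence idiom, while the function name `dedupe_then_sort` and the file `sample.txt` are placeholders chosen for illustration.

```shell
# Hypothetical helper: portable awk keeps the first line per ID, sort orders by score.
dedupe_then_sort() {
    # seen[$2]++ is 0 (false) the first time an ID appears, so only that line prints
    awk '!seen[$2]++' "$1" | sort -k4,4n
}

# A few lines from the question, written to a scratch file (name is an assumption).
cat > sample.txt <<'EOF'
    7  C00000002 score:  -41.156 nathvy =  49 nconfs =         2251
    8  C00000002 score:  -39.520 nathvy =  49 nconfs =         3129
    9  C00000004 score:  -38.928 nathvy =  24 nconfs =          150
   12  C00000001 score:  -37.558 nathvy =  41 nconfs =           51
    3  C00000001 score:  -44.988 nathvy =  41 nconfs =         1988
EOF

dedupe_then_sort sample.txt
```

Here awk does the de-duplication in input order and sort only handles the numeric ordering, so the approach does not depend on how sort treats lines with equal keys.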
steeldriver
0

Here is an additional single-line command that can easily be adapted:

for id in $(awk '{print $2}' tmp | sort -u); do grep -m 1 -F -- "$id" tmp; done | sort -k4,4n

    7  C00000002 score:  -41.156 nathvy =  49 nconfs =         2251
    9  C00000004 score:  -38.928 nathvy =  24 nconfs =          150
   12  C00000001 score:  -37.558 nathvy =  41 nconfs =           51
toerq