I have very recently started using Linux and know almost nothing about sed. I need to edit a file containing many long lines that start with the character ">". For each such line, I want to keep only the first word and delete the rest, without touching any line that does not start with ">", using sed.

In other words, I need to turn this (only part of the first entry is shown, for demonstration purposes):

>YAL001C TFC3 SGDID:S000000001, Chr I from 151006-147594,151166-151097, Genome Release 64-1-1, reverse complement, Verified ORF, "Largest of six subunits of the RNA polymerase III transcription initiation factor complex (TFIIIC); part of the TauB domain of TFIIIC that binds DNA at the BoxB promoter sites of tRNA and similar genes; cooperates with Tfc6p in DNA binding"
MVLTIYPDELVQIVSDKIASNKGKITLNQLWDISGKYFDLSDKKVKQFVLSCVILKKDIE
VYCDGAIP*

into this:

>YAL001C
MVLTIYPDELVQIVSDKIASNKGKITLNQLWDISGKYFDLSDKKVKQFVLSCVILKKDIE
VYCDGAIP*
Braiam
user300245

1 Answer

I present here four solutions, two using sed, one using awk, and one using perl. To start:

$ sed -r 's/^(>[^ ]+) .*/\1/' inputfile

On your sample input, this produces the output:

>YAL001C
MVLTIYPDELVQIVSDKIASNKGKITLNQLWDISGKYFDLSDKKVKQFVLSCVILKKDIE
VYCDGAIP*

The code uses sed's substitute command s. The substitute command is in the form s/old/new/. In this case, the "old" part consists of these pieces:

  • ^

    This is sed-speak for the start of a line.

  • (>[^ ]+)

    This refers to a group of characters consisting of a > followed by one or more non-space characters. Because this is in parentheses, we will be able to refer to it later as \1.

  •  .* (note the leading space)

    This matches a single space followed by any number of any characters.

When the substitute command is done, the whole of any such line is replaced by just the > and the non-blank characters which immediately follow it.

Any lines not starting with that combination are sent to the output unchanged.
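
To try this without touching the real file, the command can be run on a small throwaway sample first. The sketch below is only an illustration: the file name sample.fa and the shortened header are made up for the demo, and -E is used in place of -r because both GNU and BSD sed accept it.

```shell
# Build a small throwaway sample (file name and contents are made up for this demo).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>YAL001C TFC3 SGDID:S000000001, Chr I from 151006-147594' \
  'MVLTIYPDELVQIVSDKIASNKGKITLNQLWDISGKYFDLSDKKVKQFVLSCVILKKDIE' \
  'VYCDGAIP*' > "$tmpdir/sample.fa"

# Same substitution as above; -E enables extended regular expressions.
result=$(sed -E 's/^(>[^ ]+) .*/\1/' "$tmpdir/sample.fa")
printf '%s\n' "$result"

# Remove the throwaway directory again.
rm -r "$tmpdir"
```

Once the output looks right, sed -i.bak (a form accepted by both GNU and BSD sed) should apply the edit in place and leave the original in a .bak copy.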

Alternate solution

In the comments, steeldriver suggests an alternate approach:

sed '/^>/ s/\s.*//'

In this solution, the substitute command is preceded by the address /^>/, which restricts the substitute command to lines that start with >. Since the line is already known to start with >, it is only necessary to remove the first whitespace character and everything that follows it. This is what the command s/\s.*// does (\s, meaning any whitespace character, is a GNU sed extension).

All other lines are passed through unchanged.
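
If the sed at hand does not understand \s, the POSIX character class [[:space:]] should be a drop-in replacement. A minimal sketch, using a made-up two-line sample whose header is separated by a tab rather than a space:

```shell
# Made-up two-line sample; the header uses a tab instead of a space.
sample=$(printf '>YAL001C\tTFC3 SGDID:S000000001\nMVLTIYPDELVQ\n')

# [[:space:]] is the POSIX character class for whitespace (space, tab, ...).
trimmed=$(printf '%s\n' "$sample" | sed '/^>/ s/[[:space:]].*//')
printf '%s\n' "$trimmed"
```

Because this variant matches any whitespace, it also handles tab-separated headers, which the first solution (written with a literal space) would leave untouched.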

Alternate solution using awk

awk '/^>/ {print $1;next} 1' inputfile

This awk script consists of two pattern-action rules:

  • /^>/ {print $1;next}

    awk supports the same style of modifiers as sed. The initial expression, thus, restricts this command to operate only on lines that start with >. For those lines, the first field is printed. next tells awk to skip to the next line and start over.

  • 1

    1 is awk's cryptic short-hand for print the whole line. This is executed only on lines for which the next command in the preceding expression is not executed, meaning that awk reaches this command only if the line does not start with >.
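
The two rules can be checked on a small sample before running them on the real file. This is only a sketch: the sample text is made up, and the grep comparison is just one possible way to confirm that the sequence lines pass through untouched.

```shell
# Made-up sample with one header line and two sequence lines.
sample=$(printf '>YAL001C TFC3 SGDID:S000000001\nMVLTIYPDELVQ\nVYCDGAIP*\n')

# Rule 1 handles header lines; the bare 1 prints everything else unchanged.
awk_out=$(printf '%s\n' "$sample" | awk '/^>/ {print $1; next} 1')
printf '%s\n' "$awk_out"

# Sequence lines (those not starting with >) should be identical before and after.
before=$(printf '%s\n' "$sample" | grep -v '^>')
after=$(printf '%s\n' "$awk_out" | grep -v '^>')
[ "$before" = "$after" ] && echo "sequence lines unchanged"
```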

Alternate solution using perl

steeldriver also suggests:

perl -anle 'print /^>/ ? $F[0] : $_'

The four options have the following meaning:

  • -n tells perl to implicitly loop over input lines

  • -a tells perl to turn on autosplitting, creating the @F array

  • -l enables automatic line-ending processing

  • -e tells it to run the command which follows, eliminating the need for a perl script file.

The perl command itself is fairly readable:

print /^>/ ? $F[0] : $_

The ternary operator ?: chooses what to print: the first field if the line starts with >, and otherwise the whole line. A trailing "if /^>/ || $_" would not express this, because the condition would govern the print as a whole and $F[0] would be printed for every non-empty line.

John1024