I present here four solutions: two using sed, one using awk, and one using perl. To start:
$ sed -r 's/^(>[^ ]+) .*/\1/' inputfile
On your sample input, this produces the output:
>YAL001C
LTIYPDELVQIVSDKIASNKGKITLNQLWDISGKYFDLSDKKVKQFVLSCVILKKDIE
VYCDGAIP*
The code uses sed's substitute command, s. The substitute command has the form s/old/new/. In this case, the "old" part consists of these pieces:
^
This is sed-speak for the start of a line.
(>[^ ]+)
This refers to a group of characters consisting of an angle bracket followed by one or more non-space characters. Because this is in parentheses, we will be able to refer to it later as \1.
" .*" (the quotes are only there to make the leading space visible)
This refers to a single space followed by any number of any characters.
When the substitute command is done, the whole of any such line is replaced by just the > and the non-space characters which immediately follow it. Any lines that do not match this pattern are sent to the output unchanged.
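To try this without the original data, here is a minimal sketch on a made-up record. The description text after the ID is a placeholder I invented for illustration, and I use -E, which GNU sed accepts as another spelling of -r.

```shell
# Made-up sample input: the description after the ID is a placeholder.
sample='>YAL001C placeholder description text
LTIYPDELVQIVSDKIASNKGKITLNQLWDISGKYFDLSDKKVKQFVLSCVILKKDIE
VYCDGAIP*'

# -E selects extended regular expressions, like -r in GNU sed.
# The group (>[^ ]+) keeps the ID; " .*" drops the first space and the rest.
result=$(printf '%s\n' "$sample" | sed -E 's/^(>[^ ]+) .*/\1/')
printf '%s\n' "$result"
```

Only the header line is shortened; the two sequence lines do not start with > and pass through untouched.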
Alternate solution
In the comments, steeldriver suggests an alternate approach:
sed '/^>/ s/\s.*//'
In this solution, the substitute command is preceded by the address (line selector) /^>/, which restricts the substitute command to lines that start with >. Knowing that the line starts with an angle bracket, it is only necessary to remove the first whitespace character and everything which follows it. This is what the command s/\s.*// does.
All other lines are passed through unchanged.
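Since \s is a GNU extension, a sed without it could use the POSIX class [[:space:]] instead. The sketch below assumes a made-up header in which a tab, rather than a space, follows the ID, to show that any whitespace character ends the part that is kept.

```shell
# Made-up header whose ID is followed by a tab instead of a space.
sample=$(printf '>YAL001C\tplaceholder description\nVYCDGAIP*\n')

# /^>/ limits the substitution to header lines.
# [[:space:]] is the POSIX whitespace class, standing in for GNU's \s.
result=$(printf '%s\n' "$sample" | sed '/^>/ s/[[:space:]].*//')
printf '%s\n' "$result"
```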
Alternate solution using awk
awk '/^>/ {print $1;next} 1' inputfile
This awk script consists of two rules:
/^>/ {print $1;next}
awk supports the same style of line selection as sed. The initial pattern, /^>/, thus restricts this command to operate only on lines that start with >. For those lines, the first field is printed. next tells awk to skip the rest of the script, move to the next line, and start over.
1
1 is awk's cryptic shorthand for print the whole line. This is executed only on lines for which the next command in the preceding rule is not executed, meaning that awk reaches this command only if the line does not start with >.
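To make the shorthand concrete, here is a small sketch that runs the command next to a spelled-out version in which the bare 1 is written as an explicit {print $0}. The sample record is made up for illustration.

```shell
# Made-up sample input: the description after the ID is a placeholder.
sample='>YAL001C placeholder description text
VYCDGAIP*'

# Short form: the bare pattern 1 is always true, so the default action (print) runs.
short=$(printf '%s\n' "$sample" | awk '/^>/ {print $1; next} 1')

# Long form: the same logic with the default action written out.
long=$(printf '%s\n' "$sample" | awk '/^>/ {print $1; next} {print $0}')

printf '%s\n' "$long"
```

Both forms should give identical output; the long one is just easier to read at the cost of a few characters.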
Alternate solution using perl
steeldriver also suggests:
perl -anle 'print $F[0] if /^>/ || $_'
The four options have the following meaning:
-n
tells perl to implicitly loop over input lines
-a
tells perl to turn on autosplitting, creating the @F array
-l
enables automatic line-ending processing
-e
tells it to run the command which follows, eliminating the need for a perl script file.
The perl command itself is fairly readable:
print $F[0] if /^>/ || $_
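This command prints the first field if the line starts with > or if the line is otherwise non-empty. Because the sequence lines contain no whitespace, their first field is the whole line, so in effect they are printed unchanged.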
sed '/^>/ s/\s.*//' (on only those lines starting with >, replace everything from the first whitespace character onwards with nothing) – steeldriver Jul 03 '14 at 20:13
perl -anle 'print $F[0] if /^>/ || $_' maybe? – steeldriver Jul 03 '14 at 21:02