File a.txt

chr:1:10539:A:C 10539 C A 0.545987 0.508902  0 0.36065 + 1
chr:2:13494:A:G 13494 A G 0.330493 0.0264746  0 0.733423 + 1
chr:7:13494:A:G 13494 A G 0.330493 0.0264746  0 0.733423 + 1

File b.txt

1 4972
2 4972
3 4972
7 4970

I am looking for a way to find a partial match between $1 of a.txt and $1 of b.txt (the chromosome number embedded in $1 of a.txt), and to replace $7 in a.txt with the corresponding $2 from b.txt.

The output should look like this:

chr:1:10539:A:C 10539 C A 0.545987 0.508902  4972 0.36065 + 1
chr:2:13494:A:G 13494 A G 0.330493 0.0264746  4972 0.733423 + 1
chr:7:13494:A:G 13494 A G 0.330493 0.0264746  4970 0.733423 + 1

Thank you for any help.

ThePooh
    Please [edit] your question and include an example line for a chromosome number >9 and maybe X or Y? All your lines show single digit chr names, so people won't know that you can have 2-digit and even letters. That changes the approaches we can use for this. – terdon Jan 15 '18 at 09:28

1 Answer

An awk approach:

$ awk 'NR==FNR{a[$1]=$2; next} {split($1,b,/:/); $7=a[b[2]]}1;' b.txt a.txt 
chr:1:10539:A:C 10539 C A 0.545987 0.508902 4972 0.36065 + 1
chr:2:13494:A:G 13494 A G 0.330493 0.0264746 4972 0.733423 + 1
chr:7:13494:A:G 13494 A G 0.330493 0.0264746 4970 0.733423 + 1

Explanation

  • NR==FNR{a[$1]=$2; next} : NR is the number of input lines read so far across all files, and FNR is the line number within the current file. The two are only equal while the 1st file (b.txt) is being read. Therefore, this block saves the information from b.txt in the array a, whose indices are the chromosomes from b.txt and whose values are the associated numbers. The next skips to the next line and ensures that the second block is not run for b.txt.

  • split($1,b,/:/); $7=a[b[2]] : this will only be run for a.txt. First, it splits the 1st field on : into the array b, so the 2nd element of b will be the chromosome. Then, it sets the 7th field of the line to whatever was stored in the array a for the chromosome stored in b[2] (this is what a[b[2]] means: a[ b[2] ]). Assigning to a field makes awk rebuild the line with single spaces between fields, which is why the double space before the 7th field is gone in the output.

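  • 1;: this is awk shorthand for "print this line". The 1 is a condition that is always true, and a condition with no action block triggers awk's default action, which is to print the current line.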
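
One thing to be aware of: with the command above, a chromosome that is missing from b.txt gets an empty 7th field, because a[b[2]] is empty for it. If you would rather keep the original value in that case, a guarded variant could look like the sketch below. The files a_demo.txt and b_demo.txt and the chr:X and chr:22 lines are invented for illustration (they are not from your data), and I am assuming your real files use the same layout. Because the lookup is an exact match on b[2], a key like 22 is never confused with 2.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
tmpdir=$(mktemp -d)

# Made-up lookup file: chromosome name, then the value to insert.
cat > "$tmpdir/b_demo.txt" <<'EOF'
1 4972
7 4970
X 5001
EOF

# Made-up data file: chr 22 deliberately has no entry in b_demo.txt.
cat > "$tmpdir/a_demo.txt" <<'EOF'
chr:1:10539:A:C 10539 C A 0.545987 0.508902  0 0.36065 + 1
chr:X:13494:A:G 13494 A G 0.330493 0.0264746  0 0.733423 + 1
chr:22:13494:A:G 13494 A G 0.330493 0.0264746  0 0.733423 + 1
EOF

out=$(awk '
    # First file: remember the value for each chromosome.
    NR == FNR { a[$1] = $2; next }
    # Second file: replace $7 only when the chromosome is a known key.
    {
        split($1, b, ":")
        if (b[2] in a) $7 = a[b[2]]
    }
    # Print every line, changed or not.
    1
' "$tmpdir/b_demo.txt" "$tmpdir/a_demo.txt")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

The (b[2] in a) test checks for the key without creating it, so lines with an unknown chromosome are printed exactly as they were read, original spacing included.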

terdon