
I'm looking to process an existing text file and write the output of the algorithm to a new file. I thought it would be an easy task, but it's flummoxing me, probably because I don't know ls from cat from awk at this point.

I have an existing very, very large text file that's formatted as:

00:02:00.100 --> 00:02:00.100
BLAH BLAH BLAH

00:02:00.100 --> 00:02:00.100 BLAH BLAH BLAH

I basically am trying to output a txt file with just

BLAH BLAH BLAH BLAH BLAH BLAH

I can probably create a Word macro to do it as well, and even the all caps I could correct.

So far, I have

cat file.vtt | grep -v [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9][[:space:]][[:punct:]][[:punct:]][[:punct:]][[:space:]][0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]

That outputs on the screen the results and it's definitely removed the timecode stamps, but for the life of me I can't figure out how to remove the hard returns between timecodes and just have the text lines flow.

The existing text file also uses >> to indicate a hard return. Is there some way I could incorporate that into the command, to insert a carriage return every time >> appears in the existing file?

And finally, how in the world do I cause the original xyz.txt to be overwritten with the output of the command?

muru

1 Answer

Assuming file.vtt uses Unix/Linux-style \n line endings and not Windows-style \r\n (if it doesn't, run it through dos2unix first), then:

awk '!/-->/ {sub(">>","\n"); printf("%s ", $0)}' file.vtt > xyz.txt 

This matches the lines of file.vtt that do not contain --> and prints each one followed by a space instead of its \n, so the text flows together. On each such line, the first >> is replaced with \n (a new line). The output is then redirected to xyz.txt (overwriting its contents if it exists or creating it if it doesn't).
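If a line can contain more than one >>, or if the blank lines between cues leave stray spaces in the output, a variant along these lines may help. It is only a sketch: the function name vtt_to_text is made up for illustration, and it assumes each timecode sits on its own line. gsub replaces every >> on a line, whereas sub replaces only the first.

```shell
# Hypothetical helper (the name is illustrative): print the caption text
# of the file given as $1 as flowing text.
# Assumes each timecode line contains "-->" and carries no caption text.
vtt_to_text() {
  awk '
    /-->/ { next }          # skip timecode lines
    !NF   { next }          # skip blank lines
    {
      gsub(/>>/, "\n")      # turn every ">>" into a line break
      printf "%s ", $0      # join caption lines with a space
    }
    END { print "" }        # end the output with a newline
  ' "$1"
}
```

It would be used as vtt_to_text file.vtt > xyz.txt.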

Alternatively, to edit the original file file.vtt in place instead (CAUTION: this replaces its contents), use gawk like so:

gawk -i inplace '!/-->/ {sub(">>","\n"); printf("%s ", $0)}' file.vtt
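
Where gawk is not available, roughly the same effect can be had with plain awk by writing to a temporary file and copying the result back. This is a sketch rather than a tested recipe for your files: the function name flow_in_place is made up, and it reuses the awk program from above unchanged.

```shell
# Hypothetical helper (the name is illustrative): overwrite the file
# given as $1 with its flowed text, using plain awk and a temporary file.
flow_in_place() {
  _tmp=$(mktemp) || return 1
  if awk '!/-->/ {sub(">>","\n"); printf("%s ", $0)}' "$1" > "$_tmp"; then
    cat "$_tmp" > "$1"      # copy back only if awk succeeded
    rm -f "$_tmp"
  else
    rm -f "$_tmp"           # leave the original untouched on failure
    return 1
  fi
}
```

It would be used as flow_in_place file.vtt. Copying back with cat rather than mv keeps the original file's permissions.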
Raffa
  • Thanks for that. I was actually incorrect. It's not a .vtt file, but rather a Windows Notepad .txt file.

    I received several awk syntax command line errors when trying the approach above. It may seem like it would be simpler to just manually edit the file, but I have like 600 files, all over 6 hours of text, haha.

    – ecceben Jun 28 '22 at 16:04
  • @ecceben Then run it through dos2unix first. – Raffa Jun 28 '22 at 16:20
  • Thanks. I ran it through dos2unix and it worked perfectly. I used gawk instead of awk. – ecceben Jun 28 '22 at 20:35
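
For the roughly 600 Windows-format files mentioned in the comments, the conversion could be wrapped in a loop. The following is a sketch under assumptions, not something from the thread: the function name flow_all and the .flowed.txt suffix are made up, the files are assumed to end in .txt, and a trailing carriage return is stripped inside awk as a stand-in for the dos2unix step.

```shell
# Hypothetical helper (the name is illustrative): convert every .txt file
# in directory $1, writing NAME.flowed.txt beside each original.
# The originals are left untouched.
flow_all() {
  for f in "$1"/*.txt; do
    [ -e "$f" ] || continue                    # no .txt files: glob stayed literal
    case $f in *.flowed.txt) continue ;; esac  # skip outputs of an earlier run
    awk '
      { sub(/\r$/, "") }                       # drop a trailing CR (Windows line ending)
      /-->/ || !NF { next }                    # skip timecode and blank lines
      { gsub(/>>/, "\n"); printf "%s ", $0 }   # every ">>" becomes a line break
      END { print "" }                         # end the output with a newline
    ' "$f" > "${f%.txt}.flowed.txt"
  done
}
```

It would be used as flow_all /path/to/folder. Writing to new files rather than in place makes it easy to spot-check a few results before deleting anything.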