Splitting a large text file every x pattern repeats

Question

I'm trying to split a large file after every x occurrences of a pattern, without success. How can I achieve that?

The file structure:

> ASDF ABCDEFGHIJKWERQWEWQYASTRDTAYDGAHSFDTS

> QWERT ASJDHASDJHASDHASDHASJDHAJDHJHAD

> ASDF ABCDEFGHIJKWERQWEWQYASTRDTAYDGAHSFDTS

> QTRE AGAHDSJHDASJDHASJDHASJHDAJSDHJASHDJASHDJASHJDHASJDHASJDHAJSHDASJHDJASHDJASHDJASHDJASHDJASJDASHDSUHQYWGEYWGYWGQYWDWBCDEFGHIJKWERQWEWQYASTRDTAYDGAHSFDTS

> ASDF ABCDEFGHIJKWERQWEWQYASTRDTAYDGAHSFDTSASHDJASHDJASDHAJSDHAJDHQUHWUDHUHAWUHASUDHUASDHSUDHSU

It has thousands of lines of different lengths, and there can be multiple lines per ">" header. I want to split that large file into smaller files, each containing 100 ">" headers. Is that possible?

Thanks in advance!

score 1 · Accepted Answer · answered Jan 18 '16 at 23:04

Here is a small Perl script for you. You can save it as split_files.pl and run it as perl split_files.pl input.txt. The output will be stored in files called chunk_0, chunk_1, etc.

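#!/usr/bin/perl
use strict;
use warnings;

my $infile = shift(@ARGV);

my $linecount = 0;    # number of ">" blocks written to the current chunk
my $filecount = 0;
my $outfile   = "chunk_" . $filecount;

open(IN,  '<', $infile)  or die $!;
open(OUT, '>', $outfile) or die $!;
$/ = "\n>";           # read one ">" block at a time instead of one line
while (<IN>)
{
    chomp;            # remove the trailing "\n>" separator
    s/^>//;           # the first block still carries its leading ">"
    s/\n+\z//;        # the last block still carries its trailing newline
    if ($linecount == 100)
    {
        $filecount++;
        $outfile = "chunk_" . $filecount;
        close OUT or die $!;
        open(OUT, '>', $outfile) or die $!;
        $linecount = 0;
    }
    print OUT ">", $_, "\n";
    $linecount++;
}
close OUT or die $!;
close IN or die $!;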

Explanation:
The trick of the script is the line $/="\n>";. It changes the input record separator from the default newline (\n) to a newline followed by ">" (\n>). In the while loop, each block beginning with ">" is therefore read in one go. I used two counting variables ($linecount and $filecount). The blocks (rather than single lines) are counted, and when this count hits 100, a new file is opened for the output.
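
For readers more at home in Python, the same block-at-a-time idea can be sketched roughly as follows. This is only an illustration, not part of the Perl script: the names iter_records and iter_chunks are made up for this example, and it assumes that every record starts with a line beginning with ">".

```python
import io
from itertools import islice


def iter_records(handle, marker=">"):
    """Yield one record at a time: a marker line plus the lines that follow it."""
    record = []
    for line in handle:
        # A new marker line closes the record collected so far.
        if line.startswith(marker) and record:
            yield "".join(record)
            record = []
        record.append(line)
    if record:
        yield "".join(record)


def iter_chunks(records, size):
    """Group an iterator of records into lists of at most `size` records."""
    records = iter(records)
    while True:
        chunk = list(islice(records, size))
        if not chunk:
            return
        yield chunk


# Tiny in-memory demo: five records, two per chunk.
sample = io.StringIO(">a\nAAA\n>b\nBBB\nBBB\n>c\nCCC\n>d\nDDD\n>e\nEEE\n")
chunks = list(iter_chunks(iter_records(sample), 2))
print([len(chunk) for chunk in chunks])  # prints [2, 2, 1]
```

Each chunk could then be written to its own output file, with size set to 100 for the case in the question.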

score 0 · Answer 2 · answered Feb 03 '17 at 18:17

Python approach

The script below splits the file named on the command line into smaller files, starting a new file each time a given number of lines beginning with > has been seen. That number is also given on the command line, so the syntax is as follows:

$ ./split_file.py input.txt 3

Script source

#!/usr/bin/env python
import sys

def write_split_file(count,orig_name,lines):
    split_name = orig_name + '.split.' + str(count)
    with open(split_name,'w') as fd:
        # end the file with a newline so the last line is complete
        fd.write("\n".join(lines) + "\n")

def main():
    counter = 0
    limit = int(sys.argv[2])
    line_list = []
    with open(sys.argv[1]) as fd1:
        for line in fd1:
            if line.startswith('>'):
                # flush before a new header so that each header stays
                # in the same file as the lines that follow it
                if counter and counter % limit == 0:
                    write_split_file(counter,sys.argv[1],line_list)
                    line_list = []
                counter+=1
            line_list.append(line.rstrip('\n'))

    if line_list:
        write_split_file(counter,sys.argv[1],line_list)

if __name__ == '__main__':
    main()

Note: the script is written for Python 2, but is also compatible with Python 3. It can easily be modified to split on a different starting string.
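
As a rough sketch of that modification, the starting string can be passed in as a parameter. The function name split_on_marker, the marker argument and the running .split.N numbering are assumptions made for this example, not part of the script above. Lines that appear before the first marker are skipped.

```python
def split_on_marker(path, limit, marker=">"):
    """Write path.split.1, path.split.2, ... with at most `limit` marker lines each.

    Returns the list of file names written. The naming scheme is an
    assumption of this sketch.
    """
    written = []
    out = None
    seen = 0  # marker lines in the current output file
    with open(path) as src:
        for line in src:
            if line.startswith(marker):
                # Open a new output file for the first marker, and again
                # whenever the current file already holds `limit` markers.
                if out is None or seen == limit:
                    if out is not None:
                        out.close()
                    name = "%s.split.%d" % (path, len(written) + 1)
                    out = open(name, "w")
                    written.append(name)
                    seen = 0
                seen += 1
            if out is not None:  # lines before the first marker are skipped
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Calling split_on_marker('input.txt', 100) would give the 100-headers-per-file split from the question, while passing marker='#' would split on lines beginning with # instead.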
