Extracting certain values from text

Question

I have a text file:

[31/May/2016:11:58:29-0500]/segment?city=london&language=en&x=12345&y=6789&z=1
[31/May/2016:11:59:15-0500]/segment?language=en&city=madrid&x=4589.4583&y=4865.5465&z=3
[31/May/2016:12:05:13-0500]/segment?city=london&language=en&x=12345&y=6789&z=1
[31/May/2016:12:15:13-0500]/segment?city=london&language=en&x=12345&y=6789&z=1
[31/May/2016:12:26:53-0500]/segment?language=en&city=newyork&x=45724.75575&y=424424.77474&z=3

I need to extract certain values: date, name of the city, language, x, y, z, in that order. Please note that some lines list the parameters in a different order, and future files may use yet another order.

Output should look like:

31/May/2016:11:58:29-0500 london en 12345 6789 1
31/May/2016:11:59:15-0500 madrid en 4589.4583 4865.5465 3
31/May/2016:12:05:13-0500 london en 12345 6789 1
31/May/2016:12:15:13-0500 london en 12345 6789 1
31/May/2016:12:26:53-0500 newyork en 45724.75575 424424.77474 3

or, even better, comma-separated, as in standard CSV output:

31/May/2016:11:58:29-0500,london,en,12345,6789,1
31/May/2016:11:59:15-0500,madrid,en,4589.4583,4865.5465,3
31/May/2016:12:05:13-0500,london,en,12345,6789,1
31/May/2016:12:15:13-0500,london,en,12345,6789,1
31/May/2016:12:26:53-0500,newyork,en,45724.75575,424424.77474,3

score 7 · Answer 1 · answered Aug 24 '16 at 17:16

Since the order can change, this will take a little bit of scripting. Here's a Perl version:

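#!/usr/bin/perl -nl

# Capture each field on its own, so the order of the parameters does not matter.
# A field that is missing from the line becomes an empty string.
my $time = /\[(.+?)\]/           ? $1 : '';
my $city = /city=(.*?)(&|$)/     ? $1 : '';
my $lang = /language=(.*?)(&|$)/ ? $1 : '';
my $x    = /\bx=(.*?)(&|$)/      ? $1 : '';
my $y    = /\by=(.*?)(&|$)/      ? $1 : '';
my $z    = /\bz=(.*?)(&|$)/      ? $1 : '';
print join ",", ($time, $city, $lang, $x, $y, $z);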

Save that as foo.pl, make it executable (chmod +x foo.pl) and run it like this:

./foo.pl file.txt

You could also squeeze that into a "one-liner":

perl -lne '$t=$1if/\[(.+?)\]/;$c=$1if/city=(.*?)(&|$)/;$l=$1if/language=(.*?)(&|$)/;$x=$1if/\bx=(.*?)(&|$)/;$y=$1if/\by=(.*?)(&|$)/;$z=$1if/\bz=(.*?)(&|$)/;print join",",($t,$c,$l,$x,$y,$z)' file

Explanation

The -n means "read the input file line by line and apply the script to each line". The -l strips the trailing newline from each input line and adds a newline to each print call.

In each case, we use a regular expression to find the target string and assign it to a variable if a match was found. The first regex, \[(.+?)\], matches anything between a [ and the first ]. The parentheses around the .+? form a capturing group and let us refer to what was captured as $1. So, $time will be whatever was inside the [ ].

The other regexes follow the same idea. The \b is a word boundary: it matches only at a position between a word character and a non-word character (or at the start or end of the line), which ensures that y= will not match inside city= etc. The (&|$) means either a & or the end of the line ($) and is needed for capturing patterns at the very end of the line.

Finally, we join these with commas and print them.
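
For comparison, here is a minimal Python sketch of the same per-field regex idea. It is not part of the Perl answer; the helper names first_group and extract_fields are made up for illustration, and it assumes Python 3. The last line shows what happens when the \b is left out.

```python
import re

def first_group(pattern, line):
    """Hypothetical helper: first capture group, or "" if the pattern is absent."""
    match = re.search(pattern, line)
    return match.group(1) if match else ""

def extract_fields(line):
    """Return [time, city, language, x, y, z] regardless of parameter order."""
    patterns = [
        r"\[(.+?)\]",              # shortest text between [ and ]
        r"city=(.*?)(?:&|$)",
        r"language=(.*?)(?:&|$)",
        r"\bx=(.*?)(?:&|$)",
        r"\by=(.*?)(?:&|$)",       # \b keeps this from matching the "y=" in "city="
        r"\bz=(.*?)(?:&|$)",
    ]
    return [first_group(p, line) for p in patterns]

sample = ("[31/May/2016:11:59:15-0500]/segment?"
          "language=en&city=madrid&x=4589.4583&y=4865.5465&z=3")

print(",".join(extract_fields(sample)))
# 31/May/2016:11:59:15-0500,madrid,en,4589.4583,4865.5465,3

# Without the word boundary, the search finds the "y=" at the end of "city=" first.
print(first_group(r"y=(.*?)(?:&|$)", sample))
# madrid
```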

steeldriver · Accepted Answer · 2016-08-24T23:13:16.477

Since these seem to be essentially structured as URL queries, you might want to look at using a dedicated query parser, such as the one from Python's urlparse module (urllib.parse in Python 3). For example:

#!/usr/bin/python2

import sys,re
from urlparse import urlparse,parse_qs

keys = ['city', 'language', 'x', 'y', 'z']

with open(sys.argv[1],'r') as f:
        for line in f:
                u = urlparse(line.strip('\n'))
                q = parse_qs(u.query)

                # extract the strings we want from the dict-of-lists
                values = ','.join(['-'.join(q[key]) for key in keys])

                # extract the timestamp portion of the path (between `[` and `]`)
                m = re.search('(?<=\[).*?(?=\])', u.path)
                ts = m.group(0)

                # print as a comma-separated list
                print '{},{}'.format(ts, values)

Then

$ ./queryparse.py queries.txt
31/May/2016:11:58:29-0500,london,en,12345,6789,1
31/May/2016:11:59:15-0500,madrid,en,4589.4583,4865.5465,3
31/May/2016:12:05:13-0500,london,en,12345,6789,1
31/May/2016:12:15:13-0500,london,en,12345,6789,1
31/May/2016:12:26:53-0500,newyork,en,45724.75575,424424.77474,3

NOTE: parse_qs returns a dict of lists, i.e. it allows for multiple values for each query key. The '-'.join(q[key]) notionally turns each value list into a hyphen-separated string; in this case, however, we expect only a single value for each key.
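
For readers on Python 3, the following is a rough sketch of the same approach using urllib.parse.parse_qs. It is an adaptation, not the script above: the function name to_csv is made up, the line is split at the first ? by hand instead of going through urlparse, a missing key yields an empty field rather than a KeyError, and every line is assumed to start with a bracketed timestamp.

```python
import re
from urllib.parse import parse_qs

KEYS = ["city", "language", "x", "y", "z"]

def to_csv(line):
    """Hypothetical helper: turn one log line into a comma-separated record."""
    # Everything before the first "?" holds the timestamp; the rest is the query.
    path, _, query = line.strip().partition("?")
    params = parse_qs(query)  # dict of lists, e.g. {"city": ["london"], ...}

    # Assumes the line contains "[...]"; re.search returns None otherwise.
    timestamp = re.search(r"\[(.*?)\]", path).group(1)

    # Join repeated values with "-"; a missing key gives an empty field.
    values = ["-".join(params.get(key, [])) for key in KEYS]
    return ",".join([timestamp] + values)

print(to_csv("[31/May/2016:11:59:15-0500]/segment?language=en&city=madrid&x=4589.4583&y=4865.5465&z=3"))
# 31/May/2016:11:59:15-0500,madrid,en,4589.4583,4865.5465,3
```

If a key appears twice in a line (for example city=london&city=paris), the corresponding field comes out as london-paris, which is the dict-of-lists behaviour described in the note.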

score 0 · Answer 3 · answered Aug 25 '16 at 09:33

Since the order can change, this is slightly harder, but sed can handle it:

s/\[(.*)\](\/segment\?)(.*)/\3,\1/ # Move the text between [ ] to the end of the line and drop /segment?
s/city=([^&,]*)[&,](.*)/\2,\1/     # Match city= followed by any characters except & and ,
s/language=([^&,]*)[&,](.*)/\2,\1/ # (the separators) and append the value to the end of the line
s/x=([^&,]*)[&,](.*)/\2,\1/
s/\by=([^&,]*)[&,](.*)/\2,\1/      # Avoid matching the y= in city= by requiring a word boundary before y
s/z=([^&,]*)[&,](.*)/\2,\1/

Run as: sed -rf scriptfile inputfile (without -n: the script contains no p command, so -n would suppress all output)
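
To see how the substitutions move one field at a time to the end of the line, here is a small Python trace of the same six steps. It is only an illustration, assuming Python 3; the names STEPS and rotate_fields are made up, and re.sub with count=1 stands in for a sed s command without the g flag.

```python
import re

# The six substitutions from the sed script, in the same order.
STEPS = [
    (r"\[(.*)\](/segment\?)(.*)", r"\3,\1"),
    (r"city=([^&,]*)[&,](.*)", r"\2,\1"),
    (r"language=([^&,]*)[&,](.*)", r"\2,\1"),
    (r"x=([^&,]*)[&,](.*)", r"\2,\1"),
    (r"\by=([^&,]*)[&,](.*)", r"\2,\1"),
    (r"z=([^&,]*)[&,](.*)", r"\2,\1"),
]

def rotate_fields(line, trace=False):
    """Apply each substitution once, optionally printing the intermediate lines."""
    for pattern, replacement in STEPS:
        line = re.sub(pattern, replacement, line, count=1)
        if trace:
            print(line)
    return line

rotate_fields("[31/May/2016:11:58:29-0500]/segment?city=london&language=en&x=12345&y=6789&z=1",
              trace=True)
# city=london&language=en&x=12345&y=6789&z=1,31/May/2016:11:58:29-0500
# language=en&x=12345&y=6789&z=1,31/May/2016:11:58:29-0500,london
# x=12345&y=6789&z=1,31/May/2016:11:58:29-0500,london,en
# y=6789&z=1,31/May/2016:11:58:29-0500,london,en,12345
# z=1,31/May/2016:11:58:29-0500,london,en,12345,6789
# 31/May/2016:11:58:29-0500,london,en,12345,6789,1
```

Each step removes one key=value pair from the query part and appends its value after the last comma, so the output order is fixed by the order of the commands, not by the order of the parameters in the input.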
