
I learned regexes for detecting sentences with and without the Oxford comma. For example, for a sentence like the one below

I went to the store and bought eggs, milk, apples, butter, and bread.

I will use

(?:\w+,\s+){2,}and\s+\w+   

and for

I went to the store and bought eggs, milk, apples, butter and bread.

I will use

(?:\w+,\s+){1,}\w+\s+and\s+\w+. 

Both work fine in UltraEdit using its Perl regex engine.

However, I am using SDL Studio, a CAT (translation) tool. It supports regex, but for some reason it does not accept the two regexes above. Can you suggest alternative regexes that would work with a more standard regex engine?


1 Answer


The shorthand character classes (\w, \s and others) are a feature of Perl regexes and are not supported everywhere. The most commonly supported replacement is a bracket expression: [a-zA-Z] for letters, [0-9] for digits, and [ \t\n] for whitespace. (Note that \w also matches digits and the underscore, not just letters.) Of course, that assumes there are no letters other than the 26 English letters in plain ASCII, and it ignores a few rarer whitespace characters too. There are also named character classes, like [[:alpha:]], that work similarly, but some tools don't support them.

(?:...) is also a Perlism, which you can replace with (...) if it doesn't matter that the parentheses now capture the portion they match.

So, I'd try turning the first RE into:

([[:alpha:]]+,[[:space:]]+){2,}and[[:space:]]+[[:alpha:]]+

or the simpler, more straightforward, but less general:

([a-z]+, +){2,}and +[a-z]+ 

Both work with GNU grep with extended regular expressions enabled (the -E command line flag) and are somewhat standard, but of course what your application supports might not be the same. The next construct likely to be a problem is the {N,M} counted repetition, which is rather annoying to replace, since you'd need to repeat the preceding group. (Though note that (...){1,} is exactly the same as (...)+.)
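To make the last point concrete, here is a minimal sketch in Python that checks the simple ASCII-only pattern against the two sentences from the question, next to a version where {2,} is unrolled by repeating the group. Python's re module is only a stand-in here (an assumption, since SDL Studio's engine may behave differently), and the names OXFORD, OXFORD_UNROLLED and has_oxford_comma are made up for illustration.

```python
import re

# Simple ASCII-only pattern from the answer (comma right before "and").
OXFORD = re.compile(r"([a-z]+, +){2,}and +[a-z]+")

# Same pattern with the {2,} counter unrolled by repeating the group,
# for engines that lack counted repetition (hypothetical fallback).
OXFORD_UNROLLED = re.compile(r"([a-z]+, +)([a-z]+, +)+and +[a-z]+")

with_comma = "I went to the store and bought eggs, milk, apples, butter, and bread."
without_comma = "I went to the store and bought eggs, milk, apples, butter and bread."


def has_oxford_comma(sentence, pattern=OXFORD):
    """Return True if the pattern finds a list ending in ', and <word>'."""
    return pattern.search(sentence) is not None


for pattern in (OXFORD, OXFORD_UNROLLED):
    print(pattern.pattern)
    print("  with comma:   ", has_oxford_comma(with_comma, pattern))
    print("  without comma:", has_oxford_comma(without_comma, pattern))
```

Both forms should agree on these two sentences. The unrolled one is longer, but it uses only grouping and +.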

There's a reference on the usual regexes in the regex(7) man page, and if you really want to know gory details and differences between variants, see Why does my regular expression work in X but not in Y? at unix.SE.

ilkkachu
  • Thanks a million, ilkkachu. Perfect solution, and it works: ([a-z]+, +){2,}and +[a-z]+ seems to be working just fine. Can you please give me one that matches the other pattern, i.e. the one without the Oxford comma, for sentences like: I went to the store and bought eggs, milk, apples, butter and bread. I went to the store and bought eggs, milk, apples, garlic, strawberry, butter and cheese. – Sam Mouha Mar 16 '17 at 07:37
  • would this work: ([a-z]+, +)[a-z]+ and. – Sam Mouha Mar 16 '17 at 08:05
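
One way to check the candidate from the last comment is to run it against the example sentences. The sketch below again uses Python's re module as a stand-in (an assumption, since SDL Studio's engine may differ). STRICTER is a hypothetical variant added for comparison and is not from the thread.

```python
import re

# Candidate from the comment above (no comma before "and").
CANDIDATE = re.compile(r"([a-z]+, +)[a-z]+ and")

# Hypothetical stricter variant: allows several list items and
# requires a word after "and".
STRICTER = re.compile(r"([a-z]+, +)+[a-z]+ +and +[a-z]+")

no_oxford = [
    "I went to the store and bought eggs, milk, apples, butter and bread.",
    "I went to the store and bought eggs, milk, apples, garlic, strawberry, butter and cheese.",
]
oxford = "I went to the store and bought eggs, milk, apples, butter, and bread."

for pattern in (CANDIDATE, STRICTER):
    # Each sentence without the Oxford comma should be found...
    hits = [pattern.search(s) is not None for s in no_oxford]
    # ...and the sentence with the Oxford comma should not.
    skipped = pattern.search(oxford) is None
    print(pattern.pattern, hits, skipped)
```

Note that both patterns fire on any "word, word and" sequence, so a sentence with a single comma that is not a list (for example "after that, salt and pepper") would match as well.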