Regular expression to match quoted and nonquoted strings
Posted Wed, 23 Jun 2004
The method I was using to match quoted strings didn't work very well and wasn't very flexible, so I wrote a lamer version that works for everything I've tried:
# The last alternative, [^\s]+, matches an unquoted word. Change that
# character class so it excludes anything that can't be in a word.
# For HTML, set it to [^\s>]
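my $regex = q!(?:"([^"]*)"|'([^']*)'|([^\s]+))!;
while (<>) {
    # Strip one token (plus trailing whitespace) off the front of the line each pass.
    while (s/^$regex\s+//) {
        # Test with defined() rather than ||, so a quoted "0" isn't lost.
        my $string = defined $1 ? $1 : defined $2 ? $2 : $3;
        print "Quoted string: $string\n";
    }
}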
It works, I tested it. Here's some sample output:
nightfall(~) > perl quoteregex.pl
testing
Quoted string: testing
foo bar baz
Quoted string: foo
Quoted string: bar
Quoted string: baz
"hello there" how are 'you doing'
Quoted string: hello there
Quoted string: how
Quoted string: are
Quoted string: you doing
'foo
Quoted string: 'foo
'foo bar baz
Quoted string: 'foo
Quoted string: bar
Quoted string: baz
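If you'd rather not chew up the input line, the same three alternatives can be wrapped in a function that returns the tokens as a list. This is only a sketch under a few assumptions: the name split_quoted is made up for illustration, and it swaps the s/^...\s+// loop for a \G-anchored global match with a lookahead, so treat it as a variation on the idea rather than the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script):
# split a line into quoted and bare tokens without modifying it.
sub split_quoted {
    my ($line) = @_;
    my @tokens;
    # \G resumes where the previous match ended. The lookahead requires
    # whitespace or end-of-string after each token, mirroring the \s+ in
    # the substitution version.
    while ($line =~ /\G\s*(?:"([^"]*)"|'([^']*)'|(\S+))(?=\s|\z)/gc) {
        # Only one group is defined per match; defined() keeps "0" and "".
        push @tokens, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @tokens;
}

print "Quoted string: $_\n"
    for split_quoted(q{"hello there" how are 'you doing' 'foo});
```

As in the sample run, an unterminated quote such as 'foo falls through to the bare-word alternative and keeps its leading quote character.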
For an example of how to get this to work with HTML, here's something that'll pull all the links from a webpage (anchor tags only, and only the 'href' attribute):
#!/usr/bin/perl
use strict;
use HTTP::Handle;
my $hd = HTTP::Handle->new();
my $regex = q!(?:"([^"]*)"|'([^']*)'|([^\s>]+))!;
$hd->url($ARGV[0] || "http://www.google.com");
$hd->connect();
my $fd = $hd->fd();
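undef $/;   # slurp mode: read everything from the handle in one go
my $source = <$fd>;
# Remove one <a ... href=...> tag per pass, capturing the href value.
# /i is there because HTML tag and attribute names are case-insensitive.
while ($source =~ s/<a\s+(?:[^>]+\s+)*href=$regex[^>]*>//is) {
    my $link = defined $1 ? $1 : defined $2 ? $2 : $3;
    print "Link: $link\n";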
}
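To try the HTML variant without touching the network, the same pattern can be run over a string held in memory. A minimal sketch, with some assumptions of mine: extract_links and the sample markup are invented for illustration, the pattern is compiled with qr// instead of being kept in a q!...! string, and a non-destructive /g match stands in for the s/// loop.

```perl
use strict;
use warnings;

# Same three alternatives as above, with '>' excluded from bare values.
my $value = qr/(?:"([^"]*)"|'([^']*)'|([^\s>]+))/;

# Hypothetical helper: return the href value of every <a ...> tag found.
sub extract_links {
    my ($html) = @_;
    my @links;
    # /g walks through the string match by match instead of deleting tags.
    while ($html =~ /<a\s+(?:[^>]+\s+)*href=$value[^>]*>/gis) {
        push @links, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @links;
}

# Made-up sample markup covering double-quoted, single-quoted and bare values.
my $sample = q{<p><a href="http://example.com/">one</a> }
           . q{<A class=x HREF='/two'>two</A> <a href=three.html>three</a></p>};
print "Link: $_\n" for extract_links($sample);
```

A regex like this is fine for quick scraping, but it will be fooled by things like a '>' inside an attribute value or an href= that appears inside another attribute.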