Regular expression to match quoted and nonquoted strings

The method I was using to match quoted strings didn't work all that well and wasn't very flexible, so I wrote a simpler version that has worked for everything I've tried:

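# The last alternative, [^\s]+, matches an unquoted word. Change that
# character class to exclude any other characters that can't be in a word.
# For HTML, set it to [^\s>]
my $regex = q!(?:"([^"]*)"|'([^']*)'|([^\s]+))!;

while (<>) {
	# Each pass strips one token (plus trailing whitespace) off the front of the line.
	while (s/^$regex\s+//) {
		# Only one group is defined per match; || would drop "" and "0".
		my $string = defined $1 ? $1 : defined $2 ? $2 : $3;
		print "Quoted string: $string\n";
	}
}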

It works, I tested it. Here's some sample output:

nightfall(~) > perl quoteregex.pl
testing
Quoted string: testing
foo bar baz
Quoted string: foo
Quoted string: bar
Quoted string: baz
"hello there" how are 'you doing'
Quoted string: hello there
Quoted string: how
Quoted string: are
Quoted string: you doing
'foo
Quoted string: 'foo
'foo bar baz
Quoted string: 'foo
Quoted string: bar
Quoted string: baz
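
If you want the tokens back as a list instead of printed, and without eating the input string, the same regex can be walked with \G and /gc. This is only a sketch: tokenize() is a name I made up for illustration, and the lookahead (?=\s|\z) is my stand-in for the trailing \s+ in the loop above, so that a last token with no newline after it still matches.

```perl
use strict;
use warnings;

# Same three alternatives as above: double-quoted, single-quoted, bare word.
my $regex = qr/(?:"([^"]*)"|'([^']*)'|([^\s]+))/;

# tokenize() is a hypothetical helper name, not part of the original script.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    # \G resumes where the previous match ended; the lookahead requires
    # whitespace or end of string after each token.
    while ($line =~ /\G\s*$regex(?=\s|\z)/gc) {
        # Exactly one capture group is defined per match.
        push @tokens, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @tokens;
}

print "Quoted string: $_\n" for tokenize(q{"hello there" how are 'you doing'});
```

Like the loop above, an unterminated quote such as 'foo falls through to the bare-word alternative and keeps its quote character.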

For an example of how to get this to work with HTML, here's a script that pulls all the links from a web page (anchor tags only, and only the 'href' attribute):

#!/usr/bin/perl

use strict;
use HTTP::Handle;

my $hd = HTTP::Handle->new();
my $regex = q!(?:"([^"]*)"|'([^']*)'|([^\s>]+))!;

$hd->url($ARGV[0] || "http://www.google.com");
$hd->connect();
my $fd = $hd->fd();

undef $/;
my $source = <$fd>;

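# Remove one anchor tag per pass; any attributes before href are skipped.
while ($source =~ s/<a\s+(?:[^>]+\s+)*href=$regex[^>]*>//s) {
	# Only one group is defined per match; || would drop "" and "0".
	my $link = defined $1 ? $1 : defined $2 ? $2 : $3;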
	print "Link: $link\n";
}
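
To try the link regex without touching the network, here's a rough sketch that runs it over a string. extract_links() and the sample markup are made up for illustration. It also differs from the script above in two ways: it uses m//g instead of s/// so the source is left intact, and it adds /i so that upper-case tags and attributes match too.

```perl
use strict;
use warnings;

# Attribute value: double-quoted, single-quoted, or bare (stops at whitespace or >).
my $value = qr/(?:"([^"]*)"|'([^']*)'|([^\s>]+))/;

# extract_links() is a hypothetical helper; it takes the HTML as a string.
sub extract_links {
    my ($html) = @_;
    my @links;
    # /g continues after the previous anchor tag; /i matches <A HREF=...> as well.
    while ($html =~ /<a\s+(?:[^>]+\s+)*href=$value[^>]*>/gis) {
        push @links, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @links;
}

# Sample markup, invented for this example.
my $page = q{<p><a href="http://example.com/one">1</a> }
         . q{<A class=x HREF='two.html'>2</A> <a href=three.html>3</a></p>};
print "Link: $_\n" for extract_links($page);
```

Anchors with no href (for example <a name=top>) are skipped, since the pattern can't cross the closing > of the tag.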