While America threw on its eating pants and combed the Thursday circulars for deals, John Gruber spent Thanksgiving preparing to unveil his regular expression for finding URLs in arbitrary text.
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
Pretty dense. Let’s be that guy and break it out, /x style (ignoring white space, with comments):
\b #start with a word boundary
( #<capture_1>
( #<capture_2>
[\w-]+://? #protocol://, second slash optional
| #OR
www[.] #a literal www.
) #</capture_2>
[^\s()<>]+ #anything except whitespace, parens or angle brackets
#repeated one or more times
(?: #<noncapture_1> (end game)
\([\w\d]+\) #handles weird parentheses in URLs (http://example.com/blah_blah_(wikipedia))
#won't handle a second set, as in foo_(bar)_(and)_again
| #OR
( #<capture_3>
[^[:punct:]\s] #a single character that is NOT punctuation or white space
| #OR
/ #trailing slash
) #</capture_3>
) #</noncapture_1>
) #</capture_1>
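To poke at the expression outside a text editor, here is a rough Python sketch. Python's re module has no [:punct:] class, so the sketch swaps in string.punctuation as an ASCII-only stand-in. That substitution, the names PUNCT, URL_PATTERN and find_urls, and the sample sentence are my assumptions, not part of John's post.

```python
import re
import string

# Assumption: ASCII stand-in for the POSIX class [:punct:], which Python's re lacks.
PUNCT = re.escape(string.punctuation)

# The one-line expression from above, with [:punct:] swapped for PUNCT.
URL_PATTERN = re.compile(
    r"\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^\s" + PUNCT + r"]|/)))"
)


def find_urls(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


sample = "Read http://example.com/blah_blah_(wikipedia), then visit www.example.com/path/."
print(find_urls(sample))
# ['http://example.com/blah_blah_(wikipedia)', 'www.example.com/path/']
```

Note how the end game does its job: the comma after the first URL and the period after the second are left out of the matches, while the closing parenthesis and the trailing slash are kept.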
It’s a great start for a hard problem. I see two places for possible improvement.
- While John accounted for Wikipedia’s weird parentheses in URLs, he didn’t account for multiple sets of parentheses, such as http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh
- It won’t capture URLs that lack both a protocol and a www. So if I were to use example.com in a paragraph, the regular expression would skip over it.
Neither is an easy problem, but the last 10% of any regular-expression-based solution is always the hardest. The multiple-parentheses case can be solved by replacing this
\([\w\d]+\) #handles weird parentheses in URLs (http://example.com/blah_blah_(wikipedia))
with this
(?:\([\w\d)]+\)[^\s()<>]*)+ #handles extra weird parentheses in URLs (http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh)
This still wouldn’t handle something like
http://example.com/blah_blah_(wikipedia
but I’m willing to leave that one on the floor for today.
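The before and after can be compared with a small Python sketch. As an assumption on my part, string.punctuation stands in for [:punct:] (which Python's re module doesn't support), and build_pattern, ORIGINAL, PATCHED and first_url are names made up for the illustration. The parenthesis pieces are copied as written above.

```python
import re
import string

# Assumption: ASCII stand-in for the POSIX class [:punct:], which Python's re lacks.
PUNCT = re.escape(string.punctuation)


def build_pattern(paren_part):
    """Assemble the URL expression around the given parenthesis-handling piece."""
    return re.compile(
        r"\b(([\w-]+://?|www[.])[^\s()<>]+(?:"
        + paren_part
        + r"|([^\s" + PUNCT + r"]|/)))"
    )


ORIGINAL = build_pattern(r"\([\w\d]+\)")
PATCHED = build_pattern(r"(?:\([\w\d)]+\)[^\s()<>]*)+")


def first_url(pattern, text):
    """Return the first match in text, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None


double = "Try http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh today"
print(first_url(ORIGINAL, double))  # http://example.com/blah_blah_(wikipedia)
print(first_url(PATCHED, double))   # http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh

# The unclosed parenthesis is still left on the floor.
unclosed = "Try http://example.com/blah_blah_(wikipedia today"
print(first_url(PATCHED, unclosed))  # http://example.com/blah_blah
```

In the unclosed case the engine backtracks to the last character that isn't punctuation, so the match stops at blah_blah and drops everything from the underscore onward.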
Capturing non-www URLs would seem, at first blush, easy. Replace this
( #<capture_2>
[\w-]+://? #protocol://, second slash optional
| #OR
www[.] #a literal www.
) #</capture_2>
with this
( #<capture_2>
[\w-]+://? #protocol://, second slash optional
| #OR
[\w\d]+[.] #anything followed by a "."
) #</capture_2>
However, you’re going to end up with false positives if the text contains poor punctuation.
What happens is someone forgets to put a space after their sentences.We have an edge case?
This seems an unsolvable problem. It’s perfectly conceivable that “we” could become a top-level country code for domain names in the future. Still, depending on your situation, this may be an acceptable solution.
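The trade-off is easy to reproduce with a short Python sketch of the loosened expression. Again, string.punctuation is my ASCII stand-in for [:punct:], which Python's re module lacks, and LOOSENED and find_urls are hypothetical names chosen for the example.

```python
import re
import string

# Assumption: ASCII stand-in for the POSIX class [:punct:], which Python's re lacks.
PUNCT = re.escape(string.punctuation)

# Same expression as the original, with www[.] loosened to [\w\d]+[.]
LOOSENED = re.compile(
    r"\b(([\w-]+://?|[\w\d]+[.])[^\s()<>]+(?:\([\w\d]+\)|([^\s" + PUNCT + r"]|/)))"
)


def find_urls(text):
    """Return every substring of text that the loosened pattern matches."""
    return [m.group(0) for m in LOOSENED.finditer(text)]


# The bare domain is now picked up...
print(find_urls("I use example.com daily."))  # ['example.com']

# ...but so is a missing space between sentences.
sloppy = "What happens is someone forgets to put a space after their sentences.We have an edge case?"
print(find_urls(sloppy))  # ['sentences.We']
```

One way to live with this, if your situation allows it, would be to check the text after the last dot against a list of known top-level domains, though that list is a moving target for exactly the reason given above.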
I tested both examples (not extensively) with BBEdit’s PCRE engine, so caveat emptor to the Perl and Ruby heads out there.