Gruber’s URL Regular Expression Explained

astorm


Programming Quickies

Quick dispatches from the life of a working programmer.

While America threw on its eating pants and combed the Thursday circulars for deals, John Gruber spent Thanksgiving preparing to unveil his regular expression for finding URLs in arbitrary text.

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

Pretty dense. Let’s be that guy and break it out, /x style (whitespace ignored, comments allowed):

\b                          #start with a word boundary
(                           #<capture_1>
    (                           #<capture_2>
        [\w-]+://?                  #protocol://, second slash optional
        |                           #OR
        www[.]                      #a literal www. 
    )                           #</capture_2>
    [^\s()<>]+                  #anything except whitespace, parens or angle brackets,
                                #repeated one or more times
    (?:                         #<noncapture_1> (end game)
        \([\w\d]+\)                 #handles weird parentheses in URLs (http://example.com/blah_blah_(wikipedia))
                                    #won't handle more than one group, e.g. foo_(bar)_(and)_again
        |                           #OR
        (                           #<capture_3>
            [^[:punct:]\s]              #a single character that is NOT punctuation or whitespace
            |                           #OR
            /                           #trailing slash
        )                           #</capture_3>
    )                           #</noncapture_1>
)                           #</capture_1>
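To try the expression outside a PCRE tool, here is a minimal Python sketch. Python's re module has no POSIX [[:punct:]] class, so the sketch stands in the ASCII characters from string.punctuation. That substitution, the names GRUBER and find_urls, and the sample sentence are my own assumptions for illustration, not part of John's original.

```python
import re
import string

# Python's re has no POSIX [[:punct:]] class, so approximate it with ASCII
# punctuation. This stand-in is an assumption made for illustration.
PUNCT = re.escape(string.punctuation)

# The one-line pattern from above, with [:punct:] swapped for PUNCT.
GRUBER = re.compile(
    r"\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^" + PUNCT + r"\s]|/)))"
)

def find_urls(text):
    """Return every substring the pattern treats as a URL (capture_1)."""
    return [m.group(1) for m in GRUBER.finditer(text)]

sample = "See http://example.com/blah_blah_(wikipedia) and www.example.org/path/, ok."
print(find_urls(sample))
```

The comma after the second URL is left out of the match because the end game only accepts a non-punctuation character or a slash as the final character.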

It’s a great start for a hard problem. I see two places for possible improvement.

While John accounted for Wikipedia’s weird parentheses in URLs, he didn’t account for more than one parenthesized group, such as http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh
It won’t capture URLs that lack both a protocol and a www. So if I were to use example.com in a paragraph, the regular expression would skip over it.

Neither is an easy problem, but the last 10% of any regular-expression-based solution is always the hardest. The multiple-parentheses case can be handled by replacing this

\([\w\d]+\) #handles weird parentheses in URLs (http://example.com/blah_blah_(wikipedia))

with this

(?:\([\w\d)]+\)[^\s()<>]*)+ #handles extra weird parentheses in URLs (http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh)

This still wouldn’t handle something like

http://example.com/blah_blah_(wikipedia

but I’m willing to leave that one on the floor for today.
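A rough Python sketch of the difference, under the same caveat as any port away from PCRE: string.punctuation stands in for [[:punct:]], and the names HEAD, TAIL, ORIGINAL, MULTI_PAREN and first_url are hypothetical helpers of mine, not anything from John's post.

```python
import re
import string

PUNCT = re.escape(string.punctuation)  # stand-in for [[:punct:]] (assumption)

# Shared pieces of the pattern; only the parenthesis branch differs.
HEAD = r"\b(([\w-]+://?|www[.])[^\s()<>]+(?:"
TAIL = r"|([^" + PUNCT + r"\s]|/)))"

ORIGINAL = re.compile(HEAD + r"\([\w\d]+\)" + TAIL)
MULTI_PAREN = re.compile(HEAD + r"(?:\([\w\d)]+\)[^\s()<>]*)+" + TAIL)

def first_url(pattern, text):
    """Return the first URL-like match (capture_1), or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

url = "http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh"
print(first_url(ORIGINAL, url))     # stops after the first (...) group
print(first_url(MULTI_PAREN, url))  # runs to the end of the URL

# The unbalanced case is still cut short, before the dangling "_(".
print(first_url(MULTI_PAREN, "http://example.com/blah_blah_(wikipedia"))
```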

Capturing non-www URLs would seem, at first blush, easy. Replace this

    (                           #<capture_2>
        [\w-]+://?                  #protocol://, second slash optional
        |                           #OR
        www[.]                      #a www. 
    )                           #</capture_2>

with this

    (                           #<capture_2>
        [\w-]+://?                  #protocol://, second slash optional
        |                           #OR
        [\w\d]+[.]                  #anything followed by a "." 
    )                           #</capture_2>

However, you’re going to end up with false positives if the text contains poor punctuation.

What happens is someone forgets to put a space after their sentences.We have an edge case?

With the loosened pattern, sentences.We gets picked up as a URL. This seems an unsolvable problem. It’s perfectly conceivable that “we” could become a top-level country code for domain names in the future. Still, depending on your situation, this may be an acceptable solution.
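The trade-off is easy to see in a small Python sketch. As before, string.punctuation is an assumed stand-in for [[:punct:]], and STRICT, LOOSE, REST and find_urls are names I made up for the illustration.

```python
import re
import string

PUNCT = re.escape(string.punctuation)  # stand-in for [[:punct:]] (assumption)

# Everything after capture_2 is identical in both variants.
REST = r"[^\s()<>]+(?:\([\w\d]+\)|([^" + PUNCT + r"\s]|/)))"

STRICT = re.compile(r"\b(([\w-]+://?|www[.])" + REST)     # protocol or www. required
LOOSE = re.compile(r"\b(([\w-]+://?|[\w\d]+[.])" + REST)  # anything followed by "."

def find_urls(pattern, text):
    """Return every substring the pattern treats as a URL (capture_1)."""
    return [m.group(1) for m in pattern.finditer(text)]

print(find_urls(STRICT, "Visit example.com today."))  # bare domain is skipped
print(find_urls(LOOSE, "Visit example.com today."))   # bare domain is found
print(find_urls(LOOSE, "their sentences.We have an edge case?"))  # false positive
```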

I tested both examples (not extensively) with BBEdit’s PCRE engine, so caveat emptor to the Perl and Ruby heads out there.

Originally published November 27, 2009


Permalink: https://alanastorm.com/url_regex_explained/
