jw@l@nta's weblog: URL Validity

Thursday, December 04, 2008

URL Validity

A few days ago NewsFromNepal asked me for help with some URL validation code. The task was to verify a list of URLs and print the title of each page that is valid/working. Here's a simple Python script to do that. The URLs are stored one per line in urls.txt.


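#! /usr/bin/python
# URL Validity: print the page title of every working URL in urls.txt,
# then list the URLs that could not be fetched. (Python 2)

import urllib2
import re

# One URL per line; strip the trailing newline and skip blank lines.
with open('urls.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]
badurls = []

print "Working URLs:",
for url in urls:
    try:
        handle = urllib2.urlopen(url)
        html = handle.read()
        handle.close()
    except Exception:
        # Anything that stops the fetch (bad host, HTTP error, malformed URL)
        badurls.append(url)
        continue

    # Tags may be written as <TITLE>, so ignore case.
    match = re.search("<title>(.*)</title>", html, re.IGNORECASE)

    print "\n", url, " -> ",
    if match:
        print match.group(1),

if badurls:
    print "\nBad URLs:"
    for bad in badurls:
        print bad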
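
For anyone on Python 3, where urllib2 no longer exists, here is a rough sketch of the same idea. It is not the original script: the names extract_title, check_urls, main and TITLE_PATTERN are made up for illustration, and the 10-second timeout and UTF-8 decoding are assumptions. The timeout is there so that a single unresponsive host cannot keep the run waiting indefinitely.

```python
import http.client
import re
import urllib.request

# Hypothetical default: first <title> element, case-insensitive, may span lines.
TITLE_PATTERN = r"<title\b[^>]*>(.*?)</title>"


def extract_title(html, pattern=TITLE_PATTERN):
    """Return the first captured group of `pattern` in `html`, or None."""
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    if not match:
        return None
    # Collapse runs of whitespace so a multi-line title prints on one line.
    return " ".join(match.group(1).split())


def check_urls(urls, timeout=10, pattern=TITLE_PATTERN):
    """Fetch each URL; return (working, bad), where working is [(url, title)]."""
    working, bad = [], []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines
        try:
            with urllib.request.urlopen(url, timeout=timeout) as handle:
                # Assumption: decode as UTF-8 and replace undecodable bytes.
                html = handle.read().decode("utf-8", errors="replace")
        except (OSError, ValueError, http.client.HTTPException):
            # URLError, HTTPError and socket timeouts are OSError subclasses;
            # ValueError covers malformed URLs such as a missing scheme.
            bad.append(url)
            continue
        working.append((url, extract_title(html, pattern)))
    return working, bad


def main(path="urls.txt"):
    """Print a report in the same spirit as the script above."""
    with open(path) as f:
        working, bad = check_urls(f)
    print("Working URLs:")
    for url, title in working:
        print(url, "->", title or "(no title found)")
    if bad:
        print("Bad URLs:")
        for url in bad:
            print(url)
```

Calling main() reads urls.txt from the current directory. The pattern argument is there because some sites keep the interesting headline outside the title tag; a site-specific expression with one capturing group can be passed in place of the default.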

5 comments:

fireashes, 12/05/2008 10:41 AM
That is more than what I have even imagined. Thanks Jwalanta.
fireashes, 12/05/2008 11:14 AM
FYI, it does not stop by itself after reading all the URLs from the urls.txt file.
Jwalanta Shrestha, 12/05/2008 11:49 AM
Really? It works fine on mine though. What does your urls.txt look like?
fireashes, 12/05/2008 12:21 PM
http://www.nepalnews.com/archive/2008/dec/dec05/news01.php
http://www.nepalnews.com/archive/2008/dec/dec05/news02.php
http://www.nepalnews.com/archive/2008/dec/dec05/news03.php
http://www.nepalnews.com/archive/2008/dec/dec05/news04.php
http://www.nepalnews.com/archive/2008/dec/dec05/news05.php
http://www.nepalnews.com/archive/2008/dec/dec05/news06.php
http://ashesdhakal.tripod.com

had to change "< title>(.*)< /title>" to '< p class="inner_news_title">(.*)< / p>' for nepalnews.com

(added space for posting)
Thank you very much
fireashes, 12/05/2008 12:21 PM
works very well