Few days ago NewsFromNepal was asking me for some help on some URL validation code. The task was to verify a list of URLs and print the title of the page if it's valid/working. Here's a simple Python script to do that. The URLs are stored per line in url.txt
#! /usr/bin/python
# URL Validity
import urllib2
import re
urls = open('urls.txt', 'r').readlines()
badurls = []
print "Working URLs:",
for url in urls:
try:
handle = urllib2.urlopen(url)
html = handle.read()
match = re.search("<title>(.*)</title>",html)
print "\n", url.strip(), " -> ",
if match:
print match.groups()[0],
except:
badurls.append(url)
if badurls:
print "\nBad URLs:"
for bad in badurls:
print bad.strip()
That is more than what I have even imagined. Thanks Jwalanta.
ReplyDeleteFYI it does not stop itself after reading all the urls from urls.txt file
ReplyDeletereally?? works find on mine though.. what's ur urls.txt look like??
ReplyDeletehttp://www.nepalnews.com/archive/2008/dec/dec05/news01.php
ReplyDeletehttp://www.nepalnews.com/archive/2008/dec/dec05/news02.php
http://www.nepalnews.com/archive/2008/dec/dec05/news03.php
http://www.nepalnews.com/archive/2008/dec/dec05/news04.php
http://www.nepalnews.com/archive/2008/dec/dec05/news05.php
http://www.nepalnews.com/archive/2008/dec/dec05/news06.php
http://ashesdhakal.tripod.com
had to change "< title>(.*)< /title>" to '< p class="inner_news_title">(.*)< / p>' for nepalnews.com
(added space for posting)
Thanks you very much
works very well
ReplyDelete