
Thursday, December 04, 2008

URL Validity

A few days ago NewsFromNepal asked me for help with some URL validation code. The task was to go through a list of URLs and print the page title for each one that's valid/working. Here's a simple Python script to do that. The URLs are stored one per line in urls.txt.


#! /usr/bin/python
# URL Validity

import urllib2
import re

urls = open('urls.txt', 'r').readlines()
badurls = []

print "Working URLs:",
for url in urls:
    try:
        # fetch the page and grab whatever sits between the <title> tags
        handle = urllib2.urlopen(url)
        html = handle.read()
        match = re.search("<title>(.*)</title>", html)

        print "\n", url.strip(), " -> ",
        if match:
            print match.groups()[0],
    except:
        # anything that fails to open or read counts as a bad URL
        badurls.append(url)

if badurls:
    print "\nBad URLs:"
    for bad in badurls:
        print bad.strip()
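
If any of the URLs never responds, the script can look like it's hanging, because urllib2 applies no timeout by default. Here's a minimal sketch of a guard you could add near the top of the script, assuming a 10-second global socket timeout is acceptable:

import socket

# Assumption: 10 seconds is a reasonable cutoff; a host that stays silent
# longer raises socket.timeout, which the script's bare except treats as a bad URL.
socket.setdefaulttimeout(10)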

5 comments:

  1. That is more than I had even imagined. Thanks Jwalanta.

  2. FYI, it does not stop itself after reading all the URLs from urls.txt.

  3. Really?? Works fine on mine though.. what does your urls.txt look like??

  4. http://www.nepalnews.com/archive/2008/dec/dec05/news01.php
    http://www.nepalnews.com/archive/2008/dec/dec05/news02.php
    http://www.nepalnews.com/archive/2008/dec/dec05/news03.php
    http://www.nepalnews.com/archive/2008/dec/dec05/news04.php
    http://www.nepalnews.com/archive/2008/dec/dec05/news05.php
    http://www.nepalnews.com/archive/2008/dec/dec05/news06.php
    http://ashesdhakal.tripod.com

    had to change "<title>(.*)</title>" to '<p class="inner_news_title">(.*)</p>' for nepalnews.com

    Thank you very much
