Python – Finding Broken Links In An HTML File

I recently wanted to check an HTML file for broken links and did not want to buy a commercial program, so I wrote a small Python script that lists them. It checks links to local files only; links starting with http:// are skipped. Here is the code:

#!/usr/bin/env python

import os, sys

def usage():
    print("usage: %s <html file>" % sys.argv[0])
    print("checks the html file for broken links")

def fileExists(file):
    # os.stat raises OSError when the path does not exist
    try:
        os.stat(file)
        return True
    except OSError:
        return False

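def extractLink(line, tag):
    # skip past the tag and the opening double quote
    index = line.find(tag)+len(tag)+1
    # search from index (not index+1) so an empty href="" is handled
    end = line.find("\"", index)
    link = line[index:end]
    return link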

def getDirectory(file):
    # everything up to the last "/", e.g. "/a/b/c.html" -> "/a/b"
    return os.path.dirname(file)

######################
# the main program starts here #
######################
if len(sys.argv) < 2:

    usage()
    sys.exit()

file = sys.argv[1]
with open(file, "r") as f:
    text = f.readlines()
linklist = []
tag = "href="
#extract the links from the text
for line in text:
    if tag in line:
        link = extractLink(line, tag)
        if not "\"" in link and not "'" in link:
            linklist.append(link)

if file.startswith("/"):
    directory = os.path.abspath(getDirectory(file))
else:
    directory = os.path.abspath(getDirectory(os.getcwd()+"/"+file))

if not directory.endswith("/"):
    directory = directory+"/"

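print("-"*30)
print("missing file(s): ")
print("-"*30)
for link in linklist:
    if link.startswith("/"):
        # absolute path: check it as given
        fl = link
        val = fileExists(fl)
        if not val:
            print(link)
    # remote links are skipped; https:// is included here so it is not
    # mistaken for a local file name
    elif not link.startswith(("http://", "https://")):
        # relative path: resolve against the directory of the html file
        fl = directory+link
        val = fileExists(fl)
        if not val:
            print(link)

print("-"*30)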

There are many ways to improve this program; for example, it could report the line number on which each broken link appears. Feel free to post improvements or comments. I hope this is helpful.
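
As a rough sketch of the line-number idea, the following Python 3 snippet scans a list of lines, pulls out double-quoted href values with a regular expression, and returns each missing local target together with its line number. The function name find_broken_links, the regular expression, and the use of os.path.exists are assumptions made for illustration; they are not part of the script above.

```python
import os
import re

# Assumption: links are written as href="..." with double quotes.
HREF_RE = re.compile(r'href="([^"]*)"')


def find_broken_links(lines, directory):
    """Return (line_number, link) pairs whose local target does not exist."""
    broken = []
    for number, line in enumerate(lines, start=1):
        for link in HREF_RE.findall(line):
            # Skip remote links; only local files are checked.
            if link.startswith(("http://", "https://")):
                continue
            # Absolute paths are checked as-is, relative ones against directory.
            if link.startswith("/"):
                path = link
            else:
                path = os.path.join(directory, link)
            if not os.path.exists(path):
                broken.append((number, link))
    return broken
```

Called with the lines of the HTML file and the directory that contains it, this returns a list such as [(12, "missing.html")], which can then be printed one entry per line.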

2 Responses to “Python – Finding Broken Links In An HTML File”

  1. fstephens Says:

    I would like to try this script out, but if I cut and paste the code into a file, there are special characters that mess it up. I tried running it through html2text (both the whole page and just the code part) and had no luck.
    Could you provide it in plain ASCII text?
    It would save me having to reconstruct it.
    Thanks

  2. knubbl Says:

    oh yeah .. I am sorry there is a problem with the formatting .. It is always messed up and I can’t use the code tag for more than one line .. it sucks. I’ll email you it ok?

