I recently wanted to check an HTML file for broken links and did not want to buy a commercial program. So I wrote a small Python program that lists the broken local links in an HTML file. Here is the code:
#!/usr/bin/env python
import os, sys

def usage():
    print("usage: %s <html file>" % sys.argv[0])
    print("checks the html file for broken links")

def fileExists(file):
    # os.stat raises OSError if the path does not exist
    try:
        inf = os.stat(file)
        return True
    except OSError:
        return False

def extractLink(line, tag):
    # skip past the tag and the opening double quote
    index = line.find(tag)+len(tag)+1
    end = line.find("\"", index+1)
    link = line[index:end]
    return link

def getDirectory(file):
    # walk forward to the last "/" and cut the path off there
    index = 0
    while file.find("/", index+1) > index:
        index = file.find("/", index+1)
    directory = file[:index]
    return directory

######################
# the main program starts here #
######################
if len(sys.argv) < 2:
    usage()
    sys.exit()
file = sys.argv[1]
# read the whole file and close it again
with open(file, "r") as f:
    text = f.readlines()
linklist = []
tag = "href="
#extract the links from the text
for line in text:
    if tag in line:
        link = extractLink(line, tag)
        # ignore results that still contain quote characters
        if not "\"" in link and not "'" in link:
            linklist.append(link)
# work out the directory of the html file so relative links can be resolved
if file.startswith("/"):
    directory = os.path.abspath(getDirectory(file))
else:
    directory = os.path.abspath(getDirectory(os.getcwd()+"/"+file))
if not directory.endswith("/"):
    directory = directory+"/"
print("-"*30)
print("missing file(s): ")
print("-"*30)
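for link in linklist:
    if link.startswith("/"):
        # absolute path: check it as it is
        fl = link
        val = fileExists(fl)
        if not val:
            print(link)
    elif not link.startswith(("http://", "https://")):
        # relative path: resolve it against the html file's directory
        # (remote links are skipped, only local files are checked)
        fl = directory+link
        val = fileExists(fl)
        if not val:
            print(link)
print("-"*30)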
There are many ways to improve this program; for example, it could report the line number on which each broken link appears. Feel free to post any improvements or comments. I hope this is helpful.
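As a rough idea of how the line-number improvement might look, here is a minimal sketch. It is not part of the script above: the helper name find_links_with_lines and the sample input are made up for illustration, and it uses a regular expression instead of the find() calls, so it only matches double-quoted href values.

```python
import re

# Assumption: links are written as href="..." with double quotes.
HREF = re.compile(r'href="([^"]*)"')

def find_links_with_lines(lines):
    """Return (line number, link) pairs, counting lines from 1."""
    found = []
    for number, line in enumerate(lines, 1):
        for link in HREF.findall(line):
            found.append((number, link))
    return found

# Made-up sample input, standing in for open(file).readlines()
sample = [
    '<a href="a.html">A</a>\n',
    '<p>no link here</p>\n',
    '<a href="/b.html">B</a> <a href="c.html">C</a>\n',
]

for number, link in find_links_with_lines(sample):
    print("line %d: %s" % (number, link))
```

The pairs could then go through the same existence check as before, printing the line number next to each missing file.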


July 31, 2008 at 5:14 pm
I would like to try this script out, but if I cut and paste the code into a file, there are special characters that mess it up. I tried running it through html2text (both the whole page and just the code part), with no luck.
Could you provide it in plain ASCII text?
It would save me having to reconstruct it.
Thanks
August 1, 2008 at 7:31 am
Oh yeah, I am sorry, there is a problem with the formatting. It always gets messed up, and I can't use the code tag for more than one line. It sucks. I'll email it to you, OK?