I recently came across the need to extract links from an HTML file. Of course I wanted to automate the whole procedure. There is an easy way of doing this with the bash shell:
cat file | grep "href=" | cut -d"/" -f3

This still gives you some ugly output, so you can improve it by also grepping for the domain name:

cat file | grep "href=" | cut -d"/" -f3 | grep domain

Here, domain is the domain name that should be contained in the links.
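To see what the pipeline actually produces, here is a small self-contained run on a made-up sample file (the file name and its contents are purely illustrative):

```shell
# create a tiny sample page (name and contents are made up for illustration)
printf '%s\n' '<a href="http://example.com/one">one</a>' \
              '<a href="http://example.com/two">two</a>' > /tmp/sample.html

# keep lines containing links, then cut out the third /-separated field:
# for "http://host/path" that field is the host name
grep "href=" /tmp/sample.html | cut -d"/" -f3
```

This prints the host part of each link, one per matching line.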
Since I still wasn’t satisfied, I wrote a little Python program that does the same thing as the commands above. Here is the code:
import sys

def extractLink(line, tag):
    # jump past the tag and its opening quote, then slice up to the closing quote
    index = line.find(tag) + len(tag) + 1
    end = line.find("\"", index + 1)
    return line[index:end]

if len(sys.argv) < 2:
    print("usage: %s <file>" % sys.argv[0])
    print("prints all the links contained in that file")
    sys.exit(1)

file = sys.argv[1]
text = open(file, "r").readlines()

linklist = []
tag = "href="
for line in text:
    if tag in line:
        link = extractLink(line, tag)
        # skip malformed matches that still contain quote characters
        if "\"" not in link and "'" not in link:
            linklist.append(link)

for link in linklist:
    print(link)
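The core trick is the find-and-slice step in extractLink: skip past the tag and its opening quote, then take everything up to the closing quote. A minimal standalone sketch of that step (the sample line here is my own, not from the post):

```python
def extractLink(line, tag):
    # skip past the tag plus its opening quote
    index = line.find(tag) + len(tag) + 1
    # find the closing quote and slice out what lies between
    end = line.find("\"", index + 1)
    return line[index:end]

# a made-up sample line to show the slicing in action
print(extractLink('<a href="http://example.com/x">t</a>', 'href='))
# → http://example.com/x
```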
This may look like a lot of code, but it really isn’t much, considering it is written in a higher-level language. Keep in mind that Python is indentation-sensitive, so make sure the code is aligned properly if you copy it.
Feel free to post some improvements.
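In that spirit, here is one possible improvement (a sketch of my own, not part of the original program): instead of searching for the href= string by hand, Python’s built-in html.parser module parses the tags and attributes properly, so links with single quotes or extra whitespace are handled too.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# made-up sample markup for demonstration
parser = LinkExtractor()
parser.feed('<a href="http://example.com/page">link</a>')
print(parser.links)
```

Feeding it the contents of a file instead of a literal string gives the same result as the script above, without any manual quote handling.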