Bash/Python – Extracting Links From An HTML File

I recently needed to extract the links from an HTML file, and of course I wanted to automate the whole procedure. There is an easy way of doing this in the bash shell.

grep "href=" file | cut -d"/" -f3

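This gives you some ugly output, because cut simply prints the third slash-separated field of every matching line. For an absolute link such as http://host/path that field is the host name, but for anything else it is junk. You can improve it by grepping for the domain name.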

grep "href=" file | cut -d"/" -f3 | grep domain

Here domain is a placeholder for the domain name that the links should contain.
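
To see why the extra grep helps, here is a rough Python sketch of what the pipeline does to each line. The sample lines, the name third_field and the domain example.com are made up for illustration and are not part of the commands above.

```python
def third_field(line, delimiter="/"):
    """Mimic `cut -d"/" -f3` for a single line of text."""
    fields = line.split(delimiter)
    if len(fields) == 1:
        # cut prints lines that do not contain the delimiter unchanged
        return line
    # cut prints an empty line when the requested field does not exist
    return fields[2] if len(fields) > 2 else ""


# Hypothetical sample input, standing in for the lines of an HTML file.
sample = [
    '<a href="http://example.com/page.html">absolute</a>',
    '<a href="/about/team.html">relative</a>',
    '<p>no link here</p>',
]

# Step 1: grep "href=" keeps only the lines containing that string.
matching = [line for line in sample if "href=" in line]

# Step 2: cut -d"/" -f3 keeps the third slash-separated field.
hosts = [third_field(line) for line in matching]

# Step 3: grep domain drops everything that does not mention the domain.
filtered = [host for host in hosts if "example.com" in host]

for host in filtered:
    print(host)
```

For the absolute link the third field is the host name, while the relative link leaves a fragment of markup behind. That fragment is the kind of output the final grep filters away.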

Since I still wasn’t satisfied, I wrote a little program in Python that does much the same job as the commands above, except that it prints the complete link instead of just the host name. Here is the code:

#!/usr/bin/env python

import sys

def usage():
    print("usage: %s <file>" % sys.argv[0])
    print("prints all the links contained in that file")

def extractLink(line, tag):
    # Skip past the tag and the opening quote, then read up to the closing quote.
    index = line.find(tag) + len(tag) + 1
    end = line.find("\"", index + 1)
    link = line[index:end]
    return link

if len(sys.argv) < 2:
    usage()
    sys.exit(1)

path = sys.argv[1]
with open(path, "r") as f:
    text = f.readlines()

tag = "href="
for line in text:
    # Only the first href= on each line is looked at.
    if tag in line:
        link = extractLink(line, tag)
        # Skip values that still contain a quote character.
        if '"' not in link and "'" not in link:
            print(link)

This may look like a lot of code, but it really isn’t, considering that it is written in a high-level language. If you copy it, make sure the indentation stays intact, since Python relies on it.
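
A more robust variant, offered as a sketch rather than a finished tool, is to let the html.parser module from the Python 3 standard library find the attributes instead of searching each line for href=. This copes with single-quoted values and with several links on one line. The names LinkCollector and extract_links, as well as the sample string, are invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the value of every href attribute seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None
        # for attributes written without a value.
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html_text):
    """Return all non-empty href values found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical sample input with one double-quoted and one single-quoted link.
sample = '<a href="http://example.com/a">one</a> <a href=\'/b\'>two</a>'
for link in extract_links(sample):
    print(link)
```

To run it against a real file, read the file into a string and pass that string to extract_links in place of sample.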

Feel free to post some improvements.


2 Responses to “Bash/Python – Extracting Links From An HTML File”

  1. naveen Says:

    Hi sir, your program was really great. I am really a great fan of yours. If you don’t mind, could you please guide me on how you achieved all these things? I want to be like you. Please guide me, sir.


  2. knubbl Says:

    What do you want to learn? Programming? Hacking? Or everything?

