I recently needed to extract links from an HTML file, and of course I wanted to automate the whole procedure. There is an easy way of doing this in the bash shell.
grep "href=" file | cut -d"/" -f3
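This prints the third "/"-separated field of every line that contains a link, which for an absolute link such as http://domain/page is the host name. The output is still rather ugly, so you can improve it by grepping for the domain name.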
grep "href=" file | cut -d"/" -f3 | grep domain
Here, domain stands for the domain name that the links should contain.
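To make it clearer what the pipeline does, here is a rough Python sketch of the same three steps: keep the lines containing href=, take the third "/"-separated field, and optionally keep only results that contain a domain. The function name third_field_links and the sample lines are made up for illustration, and the handling of short lines follows my understanding of how cut behaves.

```python
def third_field_links(lines, domain=None):
    """Roughly mimic: grep "href=" | cut -d"/" -f3 [| grep domain]."""
    results = []
    for line in lines:
        if "href=" not in line:              # grep "href="
            continue
        fields = line.rstrip("\n").split("/")
        if len(fields) == 1:
            # cut prints a line unchanged when it has no delimiter at all
            field = fields[0]
        elif len(fields) >= 3:
            field = fields[2]                # cut -d"/" -f3
        else:
            # fewer than three fields: cut prints an empty line
            field = ""
        if domain is None or domain in field:    # grep domain
            results.append(field)
    return results


# Hypothetical sample input, one HTML line per list element.
sample = [
    '<a href="http://example.com/page.html">one</a>',
    '<p>no link here</p>',
    '<a href="http://other.org/index.html">two</a>',
]
print(third_field_links(sample))
print(third_field_links(sample, "example"))
```

The first call prints the host part of both links, the second only the one that contains "example". Relative links have no host in the third field, which is one reason the raw output looks rough.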
Since I still wasn't satisfied, I wrote a little program in Python that does the same job as the commands mentioned above, except that it prints the complete link instead of only the host part. Here is the code:
#!/usr/bin/env python
import sys

def usage():
    print("usage: %s <file>" % sys.argv[0])
    print("prints all the links contained in that file")

def extractLink(line, tag):
    # Skip past the tag and the opening double quote.
    index = line.find(tag) + len(tag) + 1
    # The link ends at the next double quote.
    end = line.find("\"", index + 1)
    link = line[index:end]
    return link

if len(sys.argv) < 2:
    usage()
    sys.exit()

file = sys.argv[1]
text = open(file, "r").readlines()
tag = "href="
# Only the first link on each line is extracted.
for line in text:
    if tag in line:
        link = extractLink(line, tag)
        # Skip results that still contain a quote character.
        if "\"" not in link and "'" not in link:
            print(link)
This seems to be a lot of code, but it actually isn't, considering that it was written in a higher-level language. If you want to use this code, make sure the indentation is intact when you copy it, because Python relies on it.
Feel free to post some improvements.
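One possible improvement, sketched here under a few assumptions: instead of searching each line for href=, let the standard library's HTML parser find the attributes. This picks up several links on one line and single-quoted attributes, both of which the line-based script skips. It assumes Python 3 (the html.parser module), and the names LinkCollector and extract_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        # for attributes written without a value.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values of <a> tags in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical sample: two links on one line, one of them single-quoted.
sample = """<a href="http://example.com/a">a</a> <a href='http://example.com/b'>b</a>
<a name="top">no href</a>"""
print(extract_links(sample))
```

To run it on a file as in the script above, read the whole file with open(path).read() and pass the text to extract_links.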


June 10, 2008 at 8:03 pm
Hi sir, your program was really great. I am really a great fan of yours, sir. If you don't mind, can you please guide me on how you achieved all these things? I want to be like you too. Please guide me, sir.
Regards
Naveen
June 11, 2008 at 6:08 am
What do you want to learn? Programming? Hacking? Or everything?