Scraping and parsing HTML

I would like to read J. Krishnamurti books on my Kindle. Unfortunately, no ebooks were available, although I did find that jkrishnamurti.org has an extensive collection of his books on its website. At present there is no full download, only a per-chapter HTML viewer, and some of the books run to over 80 chapters, which is more than I am willing to copy and paste into a text file.

I decided to use Python and the HTMLParser library to write a throwaway parser. Then I realized that turning multi-page websites into text files might be useful for other purposes, so I wrote a simple closure that takes two parse functions and returns a custom scraper function. That scraper walks all the pages of an article or book and saves them as a single plain-text file.

One parse function must locate the URL of the next page to scrape, and the other must locate the HTML of the readable text. Each takes the page HTML and returns a regex match object (or None when nothing is found) whose first group holds the result, since the scraper calls .group(1) on whatever comes back. Basically, pass two parse functions (specific to whatever website you're scraping) into the closure and it will return a scraper that uses them.

from posixpath import basename, dirname
from traceback import print_exc
import urllib, StringIO, HTMLParser, re, sys

def make_scraper(findtext, findnext):

    def scraper(inurl, outfilename):
        base_url = dirname(inurl)
        next_url = basename(inurl)
        chapter = 1
        while True:
            html = urllib.urlopen(base_url + '/' + next_url).read()
            match = findtext(html)
            if match:
                print "chapter %s - %s" % (chapter, next_url)
                f = open(outfilename, 'a')
                f.write('\n\nCHAPTER %s\n\n%s' % (chapter, dehtml(match.group(1))) )
                f.close()
                next_match = findnext(html)
                if next_match:
                    next_url = next_match.group(1)
                    chapter += 1
                else:
                    break
            else:
                break

    return scraper
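
The code above targets Python 2. As a rough sketch of the same closure idea for Python 3, the version below is a port under stated assumptions, not the original: the fetch parameter, the fetch_page helper, the returned chapter count, and the example.invalid pages are all made up for illustration, and the dehtml step is left out to keep it short. Passing in a fake fetch function makes it possible to try the scraper without touching the network.

```python
import os
import re
import tempfile
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_page(url):
    """Download a page and decode it as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def make_scraper(findtext, findnext, fetch=fetch_page):
    """Build a scraper from two site-specific regex search functions."""

    def scraper(inurl, outfilename):
        url = inurl
        chapter = 1
        with open(outfilename, "a", encoding="utf-8") as out:
            while url:
                html = fetch(url)
                match = findtext(html)
                if not match:
                    break
                out.write("\n\nCHAPTER %s\n\n%s" % (chapter, match.group(1)))
                next_match = findnext(html)
                # Resolve the "next" link against the page it was found on.
                url = urljoin(url, next_match.group(1)) if next_match else None
                chapter += 1
        return chapter - 1  # number of chapters written (an addition)

    return scraper


# Hypothetical two-page "book" served from a dict instead of the network.
pages = {
    "http://example.invalid/book/ch1.html":
        '<div class="text">one</div><a class="next" href="ch2.html">',
    "http://example.invalid/book/ch2.html":
        '<div class="text">two</div>',
}


def fake_fetch(url):
    return pages[url]


findtext = re.compile(r'<div class="text">(.*?)</div>', re.S).search
findnext = re.compile(r'<a class="next" href="([^"]*)">').search
scrape = make_scraper(findtext, findnext, fetch=fake_fetch)

outfile = os.path.join(tempfile.mkdtemp(), "book.txt")
count = scrape("http://example.invalid/book/ch1.html", outfile)
with open(outfile, encoding="utf-8") as f:
    saved = f.read()
print(count)
```

Resolving the next link with urljoin instead of dirname and basename is a swapped-in choice; it also copes with absolute links.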

The dehtml function is a very simple HTMLParser implementation that strips out all the HTML tags and maintains line and paragraph breaks.

class _DeHTMLParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = re.sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()

def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        print_exc(file=sys.stderr)
        return text
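
To see what this tag stripping produces, here is a minimal Python 3 sketch of the same parser run on a small sample. It assumes the html.parser module in place of the Python 2 HTMLParser module; the class name DeHTMLParser, the sample markup, and the dropped try/except are illustrative choices, not part of the original.

```python
import re
from html.parser import HTMLParser


class DeHTMLParser(HTMLParser):
    """Collect text content, turning <p> and <br> into line breaks."""

    def __init__(self):
        super().__init__()
        self._chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Collapse runs of whitespace inside the text to one space.
            self._chunks.append(re.sub(r"[ \t\r\n]+", " ", text) + " ")

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._chunks.append("\n\n")
        elif tag == "br":
            self._chunks.append("\n")

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br/>.
        if tag == "br":
            self._chunks.append("\n\n")

    def text(self):
        return "".join(self._chunks).strip()


def dehtml(markup):
    parser = DeHTMLParser()
    parser.feed(markup)
    parser.close()
    return parser.text()


sample = "<p>First   line<br>second line</p><p>Next <b>bold</b> paragraph</p>"
result = dehtml(sample)
print(repr(result))
# 'First line \nsecond line \n\nNext bold paragraph'
```

Each text chunk gets a trailing space, so a space is left before every line break, and inline tags such as b simply disappear. As in the original, a plain br yields one newline while a self-closing br yields two.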

For example, I want to scrape and parse books by J. Krishnamurti, so I will use the following parse functions to create my custom scraper.

## custom parse functions
nextchapter = re.compile('<div id="chapter-forward"><a href=([^>]*)>',
                          re.M | re.S).search
parsetext = re.compile('<div id="chapter-forward">.*<div class="clear">' + 
                       '(.*)<!-- box user preferences //-->\s*<div id="sidebar">',
                       re.M | re.S).search

## create custom parser/scraper
get_jkrish_txt = make_scraper(parsetext, nextchapter)
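
As a quick check of what the next-chapter pattern captures, the sketch below runs it against a few made-up fragments. The markup is hypothetical (the real site's HTML may differ), and the unquoted href is an assumption suggested by the unquoted URLs the scraper prints.

```python
import re

nextchapter = re.compile(r'<div id="chapter-forward"><a href=([^>]*)>',
                         re.M | re.S).search

# Hypothetical fragments, not copied from the real site.
middle_page = ('<div id="chapter-forward">'
               '<a href=view-text.php?tid=1&chid=2&w=>next</a></div>')
last_page = '<div id="chapter-forward"></div>'
quoted_page = '<div id="chapter-forward"><a href="x.php">next</a></div>'

found = nextchapter(middle_page)
next_url = found.group(1) if found else None   # the captured link
end = nextchapter(last_page)                   # None: no forward link
quoted = nextchapter(quoted_page).group(1)     # quotes are captured too

print(next_url)
print(end)
print(quoted)
```

The group takes everything between href= and the closing angle bracket, so a quoted href would come back with its quotes attached and would need stripping before being used as a URL.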

I can now use get_jkrish_txt(). Here it is downloading a shorter (9-chapter) book, Flame of Attention:

>>> get_jkrish_txt(jkrish_url, 'flameofattention.txt')
chapter 1 - view-text.php?tid=29&chid=56860&w=
chapter 2 - view-text.php?tid=29&chid=56861&w=
chapter 3 - view-text.php?tid=29&chid=56862&w=
chapter 4 - view-text.php?tid=29&chid=56863&w=
chapter 5 - view-text.php?tid=29&chid=56864&w=
chapter 6 - view-text.php?tid=29&chid=56865&w=
chapter 7 - view-text.php?tid=29&chid=56866&w=
chapter 8 - view-text.php?tid=29&chid=56867&w=
chapter 9 - view-text.php?tid=29&chid=56868&w=
>>> 