Sunday, April 14, 2013

Handling Invalid HTML with Python Mechanize

Python's Mechanize not recognizing a form or HTML? Try BeautifulSoup


If you are using the Python mechanize module to parse HTML forms and it fails to recognize a form or chokes on the page, you can try using BeautifulSoup's prettify function to clean up the HTML that mechanize retrieves.

If you don't already have it installed, download and install BeautifulSoup 3.2.1 from http://www.crummy.com/software/BeautifulSoup/

I don't recommend BeautifulSoup4 for this. When I tried it I couldn't get it working, and searches on the internet at the time suggested that many other people had been having the same problem for a while.

If you have easy_install you can just type "easy_install -Z beautifulsoup". After it installs, you might have to go to site-packages and move the BeautifulSoup.py file from its own directory into site-packages itself. It wouldn't import for me until I moved it.

Import mechanize and BeautifulSoup
import mechanize
from BeautifulSoup import BeautifulSoup

Put the following class into your python script
class PrettifyHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = mechanize.response_seek_wrapper(response)
        # only use BeautifulSoup if response is html
        # a missing header falls back to '' so the 'in' test is simply False
        if 'html' in response.info().get('content-type', ''):
            soup = BeautifulSoup(response.get_data())
            response.set_data(soup.prettify())
        return response
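
The content-type test inside the handler can also be pulled out into a small helper so it is easy to check on its own. This is only a sketch: the name is_html_content_type is my own and not part of mechanize, and it is slightly stricter than the handler above because it looks only at the media type and ignores parameters such as charset.

```python
def is_html_content_type(content_type):
    """Return True if a Content-Type header value looks like HTML."""
    # A missing header shows up as None or an empty string; treat it as non-HTML.
    if not content_type:
        return False
    # Header values may carry parameters, e.g. "text/html; charset=utf-8",
    # and may differ in case, so compare against the lowercased media type only.
    media_type = content_type.split(';', 1)[0].strip().lower()
    return 'html' in media_type


print(is_html_content_type('text/html; charset=utf-8'))  # True
print(is_html_content_type('application/json'))          # False
```

Because the test is a substring match on "html", it also accepts application/xhtml+xml, which is usually what you want here.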

When you create a browser with mechanize, add the following handler.
br = mechanize.Browser()
br.add_handler(PrettifyHandler())

Now just use mechanize as normal. The browser will run every response whose content type (MIME type) contains "html", such as text/html, through BeautifulSoup's prettify before mechanize parses it. This helped me get mechanize to work with forms.
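
A quick way to see whether the handler helped is to list the forms mechanize can find on the page. The helper below is a rough sketch, not part of mechanize: list_form_names is a made-up name, and it only assumes the browser object has the open() and forms() methods that mechanize.Browser provides.

```python
def list_form_names(br, url):
    """Open url with a mechanize-style browser and return the names of its forms."""
    br.open(url)
    # br.forms() yields one form object per <form> element that could be parsed;
    # each form object exposes its name attribute as .name.
    return [form.name for form in br.forms()]
```

With the browser set up earlier you would call it as list_form_names(br, "http://example.com/login"), where the URL is just a placeholder. If the form you need shows up in the list, br.select_form(name="...") or br.select_form(nr=0) should then be able to select it.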