The CollectingParser is an example of how to use the SGML library (which comes standard with Python) to parse HTML documents. It currently works with Python 1.3 (hasn't been tested against 1.4 yet). See the comments at the head of the file for how to use it.

Here's an example of what it can do:

    % python CollectingParser.py http://www.python.org/sigs/web-sig/
    URL: http://www.python.org/sigs/web-sig/
    Title: Web SIG - Using Python for handling the World Wide Web
    Size: 2939
    Tables: 0
    Frames: 0
    Java: 0
    Forms: 0
    Isindex: 0
    Foregrond color:
    Link color: #0000FF
    Background color: #FFFFFF
    Background image: 1
    Images:
      http://www.python.org/../pics/ArrowLeft.gif
      http://www.python.org/../pics/ArrowRight.gif
      http://www.python.org/sigs/web-sig/HTMLgen_banner.gif
      http://www.python.org/../pics/ArrowLeft.gif
      http://www.python.org/../pics/ArrowRight.gif
    Links:
      http://www.python.org/../ (Home)
      http://www.python.org/../python/ (Software)
      http://www.python.org/../doc/ (Documentation)
      http://www.python.org/../psa/ (PSA)
      http://www.python.org/../workshops/ (Workshops)
      http://www.python.org/../sigs/ (SIGs)
      http://www.python.org/../locator/ (Search)
      http://www.python.org/sigs/web-sig/HTMLgen.beta.tar.gz (download it here)
      http://www.python.org/sigs/web-sig/mission (list mission statement)
      mailto:webmaster@python.org (webmaster@python.org)
      mailto:python-help@python.org (python-help@python.org)
      http://www.python.org/../ (Home)
      http://www.python.org/../python/ (Software)
      http://www.python.org/../doc/ (Documentation)
      http://www.python.org/../psa/ (PSA)
      http://www.python.org/../workshops/ (Workshops)
      http://www.python.org/../sigs/ (SIGs)
      http://www.python.org/../locator/ (Search)
    Words:
      sig, python, handling, wide, software, documentation, psa, workshops,
      sigs, sig, python, handling, wide, august, 15th, robin, friedrich,
      released, beta, version, htmlgen, module, download, 160k, gzip, tar,
      file, containing, test, script, supporting, files, complete,
      documentation, set, created, daniel, larsson, gendoc, package, changes,
      set, classes, flexible, table, generation, rewrite, list, classes,
      support, full, nesting, probably, beta, release, important, addresses,
      list, content, submissions, addr, sig, python, org, subscriptions, addr,
      sig, request, python, org, list, admin, addr, sig, admin, python, org,
      list, owner, addr, sig, owner, python, org, get, instructions, list,
      send, message, containing, word, help, body, subscriptions, address,
      sig, request, python, org, contact, list, owner, need, individual, help,
      click, see, list, mission, statement, comments, send, email, webmaster,
      python, org, questions, python, send, email, python, help, python, org,
      software, documentation, psa, workshops, sigs

Contents of the tar file:

    CollectingParser.py    the CollectingParser class
    Stopwords.py           a list of stopwords to ignore

Comments and suggestions are welcome. I hope this is useful to some.

Tessa Lau
tlau@cs.washington.edu
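
For readers who want a feel for the collecting technique without unpacking the tar file, here is a minimal sketch. It is not the CollectingParser source. It assumes a modern Python 3, where the SGML library is no longer available, so it swaps in the standard html.parser module instead. The class name SketchCollector, the tiny STOPWORDS set (a stand-in for Stopwords.py, whose contents are not shown here) and the example.org URL are all hypothetical, and it gathers only the title, links, images and words.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for Stopwords.py; the real list is not reproduced here.
STOPWORDS = {"a", "an", "and", "for", "of", "the", "to"}


class SketchCollector(HTMLParser):
    """Collect the title, links, images and words of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []          # (absolute URL, anchor text) pairs
        self.images = []         # absolute image URLs
        self.words = []          # lowercased words, stopwords removed
        self._in_title = False
        self._href = None        # URL of the <a> element currently open
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            self._href = urljoin(self.base_url, attrs["href"])
            self._anchor_text = []
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._anchor_text).split())
            self.links.append((self._href, text))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor_text.append(data)
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            if word not in STOPWORDS:
                self.words.append(word)


# Usage: feed the parser a page held in a string (no network access needed).
page = (
    "<html><head><title>Demo page</title></head><body>"
    '<a href="mission">list mission statement</a> '
    '<img src="banner.gif">'
    "</body></html>"
)
parser = SketchCollector("http://www.example.org/sigs/web-sig/")
parser.feed(page)
parser.close()
print("Title:", parser.title)
print("Links:", parser.links)
print("Images:", parser.images)
print("Words:", ", ".join(parser.words))
```

Note that urljoin collapses ".." path segments when it resolves relative URLs, so a sketch like this may report URLs in a different form from the sample output above.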