- Currently Listening to:
- The Strokes — What Ever Happened?
After an interesting meeting today, we’ve each chosen a website to extract data from, to be fed into Construct as RDF. The idea of standardising on Python for all of the newly created sensors was brought up, which is good as I’ve already started working on my Python scraper.
Hidden Data
I didn’t mention this in the meeting, but some very useful data, like currency conversion rates, are generally not shown on public-facing websites. To get at them requires a form submission, and then scraping the resulting HTML page. Things I learned during my fourth year project may be able to help here, since one of the sites I tested on was this currency conversion page.
Access to a feed of realtime data costs $540 a year. With my project, for the cost of an HTTP GET request, you could have up-to-the-minute data on any currency available in their system. This was made possible by a very useful Perl module called HTML::Form, which allowed me to simulate form submissions, and thus retrieve the HTTP response page. Something similar is bound to exist for Python.
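As a rough Python counterpart to what HTML::Form did for me, here is a minimal sketch that builds the GET request a form submission would send, using only the standard library of a modern Python 3. The URL and the field names (`Amount`, `From`, `To`) are hypothetical placeholders, not taken from any real site; the real names would have to be read from the form's input elements.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_form_request(action_url, fields):
    """Build the GET request a browser would send when submitting a form."""
    # urlencode escapes each value and joins the pairs as key=value&key=value.
    query = urlencode(fields)
    separator = "&" if "?" in action_url else "?"
    return Request(action_url + separator + query, method="GET")


def submit_form(action_url, fields, timeout=10):
    """Submit the form and return the response page as text (needs network)."""
    request = build_form_request(action_url, fields)
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical form: the URL and field names are assumptions for illustration.
request = build_form_request(
    "http://example.com/convert",
    {"Amount": "1", "From": "EUR", "To": "USD"},
)
print(request.full_url)
```

Only `build_form_request` runs here, so nothing touches the network; `submit_form` would return the HTML that then gets handed to the scraper. A form that uses POST would need the encoded fields passed as the request body instead.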
Working with Trees
There are two main approaches to screen-scraping: using heavy, regular expression-laden parsing for certain patterns of text in a string, or constructing a treelike representation of a page in memory, and then traversing this tree looking for certain elements. My favoured method is the latter, since it is generally more robust to small cosmetic changes to the underlying HTML page. Scraper rewrites are still required for when a page is reorganised, but this happens less frequently than a site having a few colours changed around.
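To make the difference concrete, here is a small sketch of both approaches run against the same data before and after a cosmetic restyle. It uses a toy tree builder on top of the standard library's `html.parser` (modern Python 3) rather than any particular scraping package; the `Node` and `TreeBuilder` classes and the sample markup are made up for illustration, and the builder assumes reasonably well-formed input.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never have children


class Node:
    """One element in the document tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order

    def find_all(self, tag):
        """Yield every descendant element with the given tag, depth-first."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    yield child
                yield from child.find_all(tag)

    def text(self):
        """Concatenate all text found beneath this element."""
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)


class TreeBuilder(HTMLParser):
    """Turns parser events into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent

    def handle_data(self, data):
        self._current.children.append(data)


def cells_by_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [td.text().strip() for td in builder.root.find_all("td")]


def cells_by_regex(html):
    # Brittle on purpose: only matches a bare <td> with no attributes.
    return re.findall(r"<td>(.*?)</td>", html)


# Made-up sample markup: the same data before and after a cosmetic restyle.
before = "<table><tr><td>EUR</td><td>1.21</td></tr></table>"
after = '<table><tr><td class="hl">EUR</td><td bgcolor="#eee"><b>1.21</b></td></tr></table>'

print(cells_by_regex(before), cells_by_tree(before))
print(cells_by_regex(after), cells_by_tree(after))
```

On the restyled markup the regex finds nothing, while the tree traversal still returns both cells. A more forgiving regex would survive this particular change, but every such patch is one more thing to maintain.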
Beautiful Soup is a very useful package for Python, which will robustly convert even an invalid HTML page into a tree, and then provides you with the methods required to traverse the tree. This way, scrapers can be bashed out pretty quickly. Here’s some code to set it up; after this’ll come the page-specific code that extracts the relevant table rows or whatever is required.
    import urllib, sys, re, BeautifulSoup

    def get_page(url):
        """Fetches an arbitrary page from the web and returns its content."""
        try:
            location = urllib.urlopen(url)
        except IOError, (errno, strerror):
            sys.exit("I/O error(%s): %s" % (errno, strerror))
        content = location.read()
        location.close()
        # Clear out all troublesome whitespace
        content = content.replace("\n", "")
        content = content.replace("\r", "")
        content = content.replace("\t", "")
        content = content.replace("> ", ">")
        content = content.replace(" ", " ")
        return content

    def generate_tree(page):
        """Converts a string of HTML into a document tree."""
        return BeautifulSoup.BeautifulSoup(page)
Once you have this set up, fetching a certain element on a page becomes as easy as writing:
    print generate_tree(get_page('http://www.imdb.com/')).first('table')
Polling Period
We discussed how often the sensors/scrapers should fetch their target webpage to re-parse it. Polling a page too often is likely to get your IP address blocked. Personally I don’t think this is as big a problem as was made out. Most RSS readers are designed to poll a feed once every 30 minutes to an hour. This is a reasonable period. Bar a few examples (stock quotes specifically), very few sites that we’re monitoring will be updating more frequently than that. In fact, the period could likely be increased. It would be relatively simple to set up a cron job to run each of the sensors in order every 30 minutes.
This approach could then be extended. RSS readers are/should be designed to honour various HTTP headers so that they don’t continually re-fetch the same feed over and over again if it’s not changing. HTML pages can be sent with those same headers, so the sensors could first make a conditional request (a HEAD or GET carrying If-Modified-Since or If-None-Match). If we get a 304 response, or if the Last-Modified header shows the page hasn’t changed since our last update cycle, we defer the update until the next cycle.
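A minimal sketch of that check, in modern Python 3 with only the standard library. The function names and the idea of storing the last fetch time as a Unix timestamp are my own assumptions for illustration, not something we agreed on.

```python
from email.utils import formatdate, parsedate_to_datetime
from urllib.request import Request


def build_conditional_request(url, last_fetch_epoch=None, etag=None):
    """Build a GET that lets the server answer 304 Not Modified."""
    request = Request(url)
    if last_fetch_epoch is not None:
        # HTTP dates are always expressed in GMT.
        request.add_header(
            "If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True)
        )
    if etag is not None:
        request.add_header("If-None-Match", etag)
    return request


def should_reparse(status, last_modified, last_fetch_epoch):
    """Decide whether a response warrants re-running the scraper."""
    if status == 304:
        return False  # the server confirmed nothing has changed
    if last_modified:
        try:
            modified = parsedate_to_datetime(last_modified).timestamp()
        except (TypeError, ValueError):
            return True  # unparseable header: play safe and re-parse
        return modified > last_fetch_epoch
    return True  # no validator at all, so assume the page may have changed


# Example: page last modified at 00:10 GMT, our last fetch was at 00:15 GMT.
print(should_reparse(200, "Thu, 01 Jan 1970 00:10:00 GMT", 900))
```

One wrinkle when wiring this up: `urllib.request.urlopen` reports a 304 by raising `HTTPError` with `code` set to 304, so the caller needs to catch that rather than inspect a normal response.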
Ideally, the polling would be adaptive, so we have a single script that takes as input the derived update frequency of each page, and writes a new cron file with modified periodicity for each site. Thus, pages like the Dublin Bus timetables, which I’m working on, will be re-parsed very infrequently, since the site is rarely updated. Conversely, sites that serve constantly-updated information, like stock quotes and currency conversion rates, will be fetched much more often (but never more often than a fixed minimum period, like every 10 minutes).
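Here is a rough sketch of what that script could look like, in modern Python 3. The sensor script paths, the change intervals and the once-a-day ceiling are all made up for illustration; only the 10 minute minimum comes from the discussion above.

```python
def polling_minutes(observed_change_minutes, floor=10, ceiling=24 * 60):
    """Clamp an observed change interval to a sane polling period in minutes."""
    return max(floor, min(ceiling, int(observed_change_minutes)))


def cron_schedule(minutes):
    """Translate a period in minutes into the five cron time fields.

    Cron steps restart every hour (or day), so periods that do not divide
    60 minutes or 24 hours evenly give slightly uneven gaps. Periods of an
    hour or more are rounded down to whole hours.
    """
    if minutes < 60:
        return "*/%d * * * *" % minutes
    hours = minutes // 60
    if hours < 24:
        return "0 */%d * * *" % hours
    return "0 0 * * *"  # once a day, at midnight


def crontab_lines(sensors):
    """Render one crontab line per (command, change_minutes) pair."""
    lines = []
    for command, change_minutes in sensors:
        period = polling_minutes(change_minutes)
        lines.append("%s %s" % (cron_schedule(period), command))
    return lines


# Hypothetical sensors: script paths and change intervals are assumptions.
sensors = [
    ("python /opt/sensors/currency.py", 2),        # changes constantly
    ("python /opt/sensors/showtimes.py", 180),     # changes a few times a day
    ("python /opt/sensors/dublin_bus.py", 20160),  # changes every couple of weeks
]
print("\n".join(crontab_lines(sensors)))
```

The currency sensor asks for a 2 minute period but gets clamped to the 10 minute minimum, while the Dublin Bus sensor is capped at one run a day.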
I’ve begun learning some Python, primarily because Mark found an open source 3D graphics package called Blender, which uses packages written in Python.
So far, it looks like it’s similar in many ways to Perl, which is good because I already have plenty of experience with Perl, having used it for my final year project. Also, Lorcan is talking about doing some screen-scraping on major websites to glean data like movie showtimes and current stock prices, to be fed into Construct as contextual data.
I’ve done some screen-scraping in Perl before, but I’m guessing most of the others won’t want to code their screen-scrapers in Perl too. A mix of languages would lead to serious code maintainability problems, and maintenance is unavoidable, since every time the source website is updated you may have to recode some or all of the corresponding screen-scraper. Such is the life we’ve chosen. It would be best if we didn’t have to have designated caretakers for each module, so standardising on one language for them all would be nice. And I know which way the tide is turning (Joe characterised Perl with “I don’t like any language where my cat can walk across the keyboard and it will still compile”, touché).
Update: I’ve done some work on a scraper in Python.
- Google's Python Class - Google's Python Class - Google Code
Welcome to Google's Python Class -- this is a free class for people with a little bit of programming experience who want to learn Python. The class includes written materials, lecture videos, and lots of code exercises to practice Python coding. These materials are used within Google to introduce Python to people who have just a little programming experience. The first exercises work on basic Python concepts like strings and lists, building up to the later exercises which are full programs dealing with text files, processes, and http connections. The class is geared for people who have a little bit of programming experience in some language, enough to know what a "variable" or "if statement" is. Beyond that, you do not need to be an expert programmer to use this material.
- The SEC and the Python « Prof. Jayanth R. Varma’s Financial Markets Blog
Last week, the SEC put out a 667 page proposal regarding disclosures for asset backed securities. What I found exciting was this:
We are proposing to require that most ABS issuers file a computer program that gives effect to the flow of funds, or “waterfall,” provisions of the transaction. We are proposing that the computer program be filed on EDGAR in the form of downloadable source code in Python. … (page 205)
- What Pythonistas Think of Ruby | Free PeepCode Blog
There was an audible gasp from the audience when Gary posted a slide with this quote from Matz:
Ruby inherited the Perl philosophy of having more than one way to do the same thing. I inherited that philosophy from Larry Wall, who is my hero actually.
I’m not sure if the shock was from hearing someone embrace TIMTOWTDI or from learning that someone considers Larry Wall to be a hero.
If you open a python prompt and type
>>> import this
you’ll see the Zen of Python, 19 guidelines that Python programmers live by. One is “There should be one—and preferably only one—obvious way to do it.”
- Like, Python
Like making computers do your bidding?
Enjoy Python features like lambdas? Indent-grouping? List comprehensions?
Tired of Old Man Python telling you what you can and can't say to your computer?
#!usr/bin/python
# My first Like, Python script!
yo just print like "hello world" bro
- itymbi ...: Scheme, Ruby, Python, Perl
Perl was invented to make a uniform framework to handle what was commonly done in 3 languages before (sh, awk, sed). Unfortunately it borrowed a lot of the ugly baggage from those languages and had a horribly inconsistent feel due to the multiple parents it borrowed from. Reading complex sh/awk/sed programs from before Perl existed shows the need for Perl quite clearly. Larry Wall's Rnmail and Pnews applications can still be downloaded and are good examples of the problems Perl was written to address. Python was a way of simplifying Perl's syntax so everyone (including the original developer) could read it. [Disclaimer, I dislike Python's decision to use whitespace as a control structure. I think this was a mistake in the Makefile syntax, and continues to be a liability in Python.] Ruby was a ground-up design of a scripting language with a stronger theoretical basis and support for techniques like threading and OO and Blocks (theoretically more powerful than perl/python like Lambdas).
- A script for text placeholders in VoodooPad « michael-mccracken.net
- How Do You Look When Merging Fails ;-) « Andi Albrecht
Finally I wrote a little fun script that takes a picture of you exactly at the unique moment when merging fails and it sends it directly and without any further questions to Twitpic and Twitter:
- PEP 8 -- Style Guide for Python Code
This document gives coding conventions for the Python code comprising the standard library in the main Python distribution. Please see the companion informational PEP describing style guidelines for the C code in the C implementation of Python[1].
- Unladen Swallows: Making Python faster
- Unit testing in Coders at Work
I suspect, having done a small amount of TDD myself, that this is actually a pattern that arises when a programmer tries to apply TDD to a problem they just don’t know how to solve. If I was a high-priced consultant/trainer like Jeffries, I’d probably give this pattern a pithy name like “Going in Circles Means You Don’t Know What You’re Doing”. Because he had no idea how to tackle the real problem, the only kinds of tests he could think of were either the very high-level “the program works” kind which were obviously too much of a leap or low-level tests of nitty-gritty code that is necessary but not at all sufficient for a working solver.
- plope - You're the Smartest Guy In The Room
I get it: it's not easy being a genius. So, if you don't mind, I have a request. Given that it would certainly not tax you professionally, because it's all so simple and obvious, do you think that you could contribute something to Python or some Python-related project that demonstrates your immense base of knowledge and helps other people?
- Python Library for Google Translate - good coders code, great reuse
- The Python Paradox is now the Scala Paradox – Martin Kleppmann at Yes/No/Cancel
- Stock Picking using Python - Steve Hanov's Programming Blog
The stock market is a lot different than it was just a few months ago. Once again, I present my stock selections, as found via python script. Comparing it with last time, you will find most of the same names are on there.
- Go deh!: XKCD Knapsack Solution