Of course, in the time it took me to colour this in, I could’ve written ten papers…
Archive for 2005
No, not the as-yet-unknown-quantity that is the paper I’m trying to put together for AVI 2006. I just got word from one of the editors at O’Reilly that the book I contributed to, PHP Hacks, has been published and is in shops. I should be getting my ‘author copy’ in the post over the next few days. Huzzah!
Mark found an interesting video (80 MB) rendered with Blender and OpenGL 2.0, which has a nice zoomable interface. It’d look good running in the Viz lab.
After an interesting meeting today, we’ve each chosen a website to extract data from, to be fed into Construct as RDF. The idea of standardising on Python for all of the newly created sensors was brought up, which is good as I’ve already started working on my Python scraper.
I didn’t mention this in the meeting, but some very useful data, like currency conversion rates, are generally not shown on public-facing websites. To get at them requires a form submission, and then scraping the resulting HTML page. Things I learned during my fourth year project may be able to help here, since one of the sites I tested on was this currency conversion page.
Access to a feed of realtime data costs $540 a year. With my project, for the cost of an HTTP GET request, you could have up-to-the-minute data on any currency available in their system. This was made possible by a very useful Perl module called HTML::Form, which allowed me to simulate form submissions and retrieve the resulting HTTP response page. Something similar is bound to exist for Python.
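I haven’t tracked down the Python equivalent of HTML::Form yet, but the core of a form-based scrape is just POSTing an encoded body to the form’s action URL. A minimal sketch of that encoding step, with hypothetical field names (the real ones would come from inspecting the converter page’s form markup):

```python
# Sketch: simulating a currency-converter form submission by hand.
# The field names ("amount", "from", "to") are hypothetical; the real
# ones come from the target page's <form> markup.

def encode_form(fields):
    """Encode a dict of fields as an application/x-www-form-urlencoded body."""
    safe = ("abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_.~")
    def quote(value):
        out = []
        for ch in str(value):
            if ch in safe:
                out.append(ch)
            elif ch == " ":
                out.append("+")        # spaces become plus signs in form bodies
            else:
                out.append("%%%02X" % ord(ch))  # everything else is percent-encoded
        return "".join(out)
    return "&".join("%s=%s" % (quote(k), quote(v))
                    for k, v in sorted(fields.items()))

body = encode_form({"amount": "100", "from": "EUR", "to": "USD"})
# POST this body to the form's action URL, then scrape the response HTML.
```

The response page that comes back from the POST can then be fed straight into whatever HTML-parsing code the scraper already uses.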
Working with Trees
There are two main approaches to screen-scraping: searching the raw page string for certain patterns of text with heavy, regular-expression-laden parsing, or constructing a tree-like representation of the page in memory and then traversing that tree looking for certain elements. My favoured method is the latter, since it is generally more robust to small cosmetic changes in the underlying HTML. Scraper rewrites are still required when a page is reorganised, but that happens far less often than a site having a few colours changed around.
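As a toy illustration of why I favour the tree approach, using the standard library’s HTMLParser rather than a full scraping package: a regex written against the original markup breaks the moment a cosmetic attribute is added, while a parser-driven extractor keeps working. The class name “rate” is made up for the example:

```python
import re
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

page_v1 = '<td class="rate">1.1723</td>'
page_v2 = '<td align="right" class="rate">1.1723</td>'  # cosmetic change

# Regex written against v1, anchored to the exact attribute layout.
pattern = re.compile(r'<td class="rate">([^<]+)</td>')

class RateFinder(HTMLParser):
    """Remembers the text inside the first <td class="rate"> element."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_rate = False
        self.rate = None
    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "rate") in attrs:
            self.in_rate = True
    def handle_data(self, data):
        if self.in_rate:
            self.rate = data
            self.in_rate = False

def tree_extract(page):
    finder = RateFinder()
    finder.feed(page)
    return finder.rate
```

The regex matches v1 but returns nothing for v2, while the parser-based extractor finds the rate in both, exactly the kind of robustness to cosmetic churn described above.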
Beautiful Soup is a very useful Python package which will robustly convert even an invalid HTML page into a tree, and then provides the methods needed to traverse it. This way, scrapers can be bashed out pretty quickly. Here’s some code to set it up; after this will come the page-specific code that extracts the relevant table rows or whatever else is required.
```python
import urllib, sys, re, BeautifulSoup

def get_page(url):
    """Fetches an arbitrary page from the web and returns its contents."""
    try:
        location = urllib.urlopen(url)
    except IOError, (errno, strerror):
        sys.exit("I/O error(%s): %s" % (errno, strerror))
    content = location.read()
    # Clear out all troublesome whitespace
    content = content.replace("\n", "")
    content = content.replace("\r", "")
    content = content.replace("\t", "")
    content = content.replace("> ", ">")
    content = content.replace("&nbsp;", " ")  # this entity appeared as a plain space in the original post
    location.close()
    return content

def generate_tree(page):
    """Converts a string of HTML into a document tree."""
    return BeautifulSoup.BeautifulSoup(page)
```
Once you have this set up, fetching a certain element on a page becomes as easy as a single call, something like tree.first('td', {'class': 'rate'}) to grab the first table cell with a given class.
We discussed how often the sensors/scrapers should fetch their target webpage to re-parse it. Polling a page too often is likely to get your IP address blocked. Personally, I don’t think this is as big a problem as was made out. Most RSS readers are designed to poll a feed once every 30 minutes to an hour, which is a reasonable period. Bar a few exceptions (stock quotes specifically), very few of the sites we’re monitoring will be updating more frequently than that; in fact, the period could probably be increased. It would be relatively simple to set up a cron job to run each of the sensors in turn every 30 minutes.
This approach could then be extended. RSS readers are (or should be) designed to honour various HTTP headers so that they don’t continually re-fetch the same feed over and over again if it isn’t changing. HTML pages are served with those same headers, so the sensors could first issue a HEAD request, and if we get a 304 response, or the Last-Modified header falls within the last update cycle, defer the update until the next cycle.
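The decision itself is simple enough to sketch. Here the HTTP call is left out (it would be a HEAD request via httplib or similar) and only the skip-or-fetch check is shown, as a hypothetical helper:

```python
# Sketch of the conditional-fetch check described above: skip the
# re-parse on a 304, or when the Last-Modified header hasn't moved
# since the previous cycle. The function name is hypothetical.

def should_refetch(status, last_modified, previous_last_modified):
    """True if the page looks changed since our last fetch."""
    if status == 304:
        return False          # server says Not Modified
    if last_modified is not None and last_modified == previous_last_modified:
        return False          # same Last-Modified as last cycle
    return True
```

A sensor would record the Last-Modified value it saw on each fetch, then pass it back in on the next cycle to decide whether a full GET and re-parse is worth doing.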
Ideally, the polling would be adaptive, so we have a single script that takes as input the derived update frequency of each page, and writes a new cron file with a modified period for each site. Thus, pages like the Dublin Bus timetables, which I’m working on, will be re-parsed very infrequently, since the site is rarely updated. Conversely, sites that serve constantly-updated information, like stock quotes and currency conversion rates, will be fetched much more often (but never more often than a lower bound, say every 10 minutes).
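A rough sketch of how that cron-file generator might look; the site names, script names, observed periods and bounds are all hypothetical:

```python
# Adaptive polling sketch: clamp each site's observed update interval
# (in minutes) to an allowed range, then emit a crontab fragment.

MIN_PERIOD = 10   # never poll more often than every 10 minutes
MAX_PERIOD = 60   # and at least hourly, so nothing goes completely stale

def poll_period(observed_update_minutes):
    """Clamp a site's observed update interval to the allowed polling range."""
    return max(MIN_PERIOD, min(MAX_PERIOD, observed_update_minutes))

def cron_line(script, minutes):
    """A crontab entry running `script` every `minutes` minutes (minutes <= 60)."""
    if minutes >= 60:
        return "0 * * * * %s" % script       # once an hour, on the hour
    return "*/%d * * * * %s" % (minutes, script)

sites = {"dublin_bus.py": 7 * 24 * 60,   # timetables change rarely
         "stock_quotes.py": 1}           # effectively continuous updates

crontab = "\n".join(cron_line(s, poll_period(m))
                    for s, m in sorted(sites.items()))
```

Re-running this script each night and installing the resulting crontab would let the sensors’ schedules drift to match how often their sources actually change.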
So far, Python looks similar in many ways to Perl, which is good because I already have plenty of experience with Perl, having used it for my final year project. Also, Lorcan is talking about doing some screen-scraping of major websites to glean data like movie showtimes and current stock prices, to be fed into Construct as contextual data.
I’ve done some screen-scraping in Perl before, but I’m guessing most of the others won’t want to code their scrapers in Perl too. A mix of languages would lead to serious code-maintainability problems, since every time a source website is updated you may have to recode some or all of the corresponding scraper. Such is the life we’ve chosen. It would be best if we didn’t need a designated caretaker for each module, so standardising on one language for them all would be nice. And I know which way the tide is turning (Joe characterised Perl with “I don’t like any language where my cat can walk across the keyboard and it will still compile”. Touché).
Update: I’ve done some work on a scraper in Python.
IBM Research has some examples of a Weather Visualisation system they have designed.
* “Explore the styles of interaction possible across different devices and a heterogeneous computing environment”
* Support simultaneous multi-user interactions across different displays
I am expecting to demo my simulation on the multiple displays in the Viz lab in the months ahead, so this will be a good introduction to the technology.
From the fact sheet (PDF):
High-end graphical images can be viewed in two visualization modes — SVN (Scalable Visual Networking) to increase screen resolution and multiplicity of physical displays; and RVN (Remote Visual Networking) to allow remote use of the application.
These two modes reflect two of the challenges in my PhD research: creating a visualisation of a large dataset across many displays, and allowing parts of the visualisation to migrate across devices.
After my initial foray into predicting the future was met with puzzlement, I’ve been thinking back over the idea of, as Aaron put it, “marrying the Scientific Visualisation with the Information Visualisation”. This seemed like the logical way to go, but right now it doesn’t look like what’s actually required or even desired for this project. Nonetheless, I want to write down the reasons I originally started thinking along this track.
- Spatial Representation
- First of all, because an autonomic system is made up of a large and fluctuating number of sensors and actuators, it made sense to have some form of spatial representation of where the sensors are located. This lets the person watching the visualisation say “show me activity for the sensors at the rear of the car”, or for those clustered in the engine, for example. This would surely be a useful UI for interacting with the simulation.
- Sensor Grouping
- Beyond these ‘logical groupings’, they could also simply drag a box around the sensors they were interested in, use the usual Shift-click/Ctrl-click interaction to add or remove sensors from their selection, and then generate the visualisation from this selection. Splitting the sensors driving the visualisation into groups like this would simplify the task of focusing on certain parts of the simulation, or moving parts of it onto other display devices (particularly low-power devices without enough processing power to generate the entire visualisation).
- Using a 3D Camera
- When sensors fail, as they are wont to do, the camera in the 3D environment can be positioned to show the location of the failure. This would allow the user to select nearby sensors and get realtime data from just those sensors surrounding the problematic one.
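The grouping interactions described above boil down to simple set operations on sensor IDs. A toy sketch, with hypothetical IDs and coordinates:

```python
# Box-select plus Shift-click (add) / Ctrl-click (toggle), modelled as
# operations on a set of sensor IDs. All names here are illustrative.

def box_select(sensors, x0, y0, x1, y1):
    """Return the ids of sensors whose (x, y) position falls inside the drag box."""
    return set(sid for sid, (x, y) in sensors.items()
               if x0 <= x <= x1 and y0 <= y <= y1)

def click(selection, sensor_id, shift=False, ctrl=False):
    """Update a selection set for a click, mimicking the usual conventions."""
    updated = set(selection)
    if ctrl:
        updated ^= set([sensor_id])   # Ctrl-click toggles membership
    elif shift:
        updated.add(sensor_id)        # Shift-click adds to the selection
    else:
        updated = set([sensor_id])    # a plain click replaces the selection
    return updated
```

The resulting set is what would drive the visualisation, or be handed off to a lower-powered display that only renders that subset.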
Posted by Ross at 6:02 PM under Uncategorized