Scraping Denver Legislation with CasperJS

I hate scraping. There’s so much information out there on the Web, but too much of it’s stored in a totally non-semantic format, perhaps locked behind some clunky form that you have to POST before processing results page by page. Or maybe what you need is spread across thousands of PDFs—perish the thought! Then, of course, months or—if you’re lucky—years later, when the scraped site changes, you have to start anew.

But such is life; information is rarely packed up neatly, and even when it appears to be, there’s usually still plenty of work to be done. Sometimes the information is worth going through the trouble.

As someone who is new to Denver and has a passive interest in politics, I’ve recently been poking around at Denver City Council legislation. Legislative data is an interesting area. A number of folks have for years made hobbies of scraping federal and state data, stitching together networks of crawlers designed for disparate sources, but there are of course far more local governments—and a lot of interesting action happens at the local level.

As an aside, you could probably get at much of this type of public information either through freedom of information or “sunshine” laws, or perhaps through paid services. That won’t take away the effort though—it’s just a different sort of effort (perhaps of the monetary sort). Maybe you’re a business, and this is no problem, but maybe you’re a hobbyist who has no idea where the data will take him. For the record, I did reach out to City Council through an open records request, but it seems like they are ill-equipped to handle this sort of bulk data request, which is a shame.

You can manually search legislation from 2010 onward on the Denver City Council site:

Legislation search page

Note that it wants you to search for something specific. Clearly, a data-harvesting use case is not in mind. Fortunately, though, you can actually just click “Search” for a list of legislation:

List of legislation

When you click an item, the application provides some more details:

Legislation details

Notice that there are multiple versions. These may be different actions or revisions. Also note that the underlying data is not necessarily correct, or at least not straightforward: above, we see a bold red “Failed”, next to text indicating that the measure moved out of committee. Great.

So I built a scraper to automatically collect legislation information, using CasperJS. CasperJS runs on top of PhantomJS or SlimerJS, and provides high-level functionality that makes scraping easier.

The item summary list contains some useful information, and scraping this was the first step. In the process, I discovered that, without selecting a year, the “Next” button would eventually produce an empty set of items. But it’s easy enough to select a specific year, then go from there.

I grabbed the item IDs, in preparation for deeper scraping on the legislation. For the details, I visited each item, then collected just some of the information available. The full code is available as a gist on GitHub.
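The gist has the real CasperJS code. As a rough illustration of the control flow described above (pick a year, keep following “Next” until the results run out, collect item IDs along the way), here is a minimal sketch in plain JavaScript. Note that `collectItemIds`, `fetchPage`, the `{ ids, hasNext }` shape, and the fake IDs are all made up for illustration; `fetchPage` stands in for whatever actually drives the browser, and none of this is CasperJS API.

```javascript
// Sketch only: the paging loop, with the browser work hidden behind a
// hypothetical fetchPage(year, page) that returns { ids: [...], hasNext: bool }.
function collectItemIds(fetchPage, year) {
  const ids = [];
  const seen = new Set();
  let page = 1;
  while (true) {
    const result = fetchPage(year, page);
    // Stop on an empty page rather than trusting "Next" alone, since the
    // site can hand back an empty set of items when paged too far.
    if (!result || result.ids.length === 0) break;
    for (const id of result.ids) {
      if (!seen.has(id)) {
        seen.add(id);
        ids.push(id);
      }
    }
    if (!result.hasNext) break;
    page += 1;
  }
  return ids;
}

// Fake in-memory "site" so the sketch runs on its own; the IDs are invented.
const fakeSite = {
  2010: [["item-1", "item-2"], ["item-3"]],
};

function fakeFetchPage(year, page) {
  const pages = fakeSite[year] || [];
  const ids = pages[page - 1] || [];
  return { ids: ids, hasNext: page < pages.length };
}

console.log(collectItemIds(fakeFetchPage, 2010));
```

Keeping the loop separate from the page-fetching function means the stopping conditions can be exercised against fake data before pointing anything at the live site.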

The result is several JSON files, one for each year, that look like this:

[
  {
    "title": "A bill for an ordinance authorizing the issuance of City and County of Denver, Colorado, Tax-Exempt General Obligation Better Denver Bonds, Series 2010A, the City and County of Denver, Colorado, Taxable General Obligation Better Denver Bonds (Direct Pay Build America Bonds), Series 2010B, and the Tax-Exempt General Obligation Refunding Bonds, Series 2010C, for the purpose of financing and/or refinancing and defraying the cost of acquiring, constructing, installing and improving various civic facilities, together with all necessary, incidental or appurtenant properties, facilities, equipment and costs, and refunding a portion of the City’s outstanding general obligation bonds; providing for the levy of general ad valorem taxes to pay the principal of and interest on the Bonds; authorizing the execution of certain agreements and providing other details in connection therewith; ratifying action previously taken relating thereto; providing other matters relating thereto; and making other provisions relating thereto.",
    "num": "CB10-0331",
    "type": "Council Bill",
    "date": "5/17/2010",
    "version": 2,
    "result": "passed",
    "votes": [
      {
        "member": "(Council Member) Boigon",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Hancock",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Johnson",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Linkhart",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Madison",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Brown",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Faatz",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Lehmann",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Lopez",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Montero",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Nevitt",
        "vote": "Yes"
      },
      {
        "member": "(Council Member) Robb",
        "vote": "Yes"
      }
    ]
  }, ...
]

Of course, this is only the first step. Data cleaning must take place to ensure that the scraper is indeed faithfully harvesting the data, and that the data itself makes sense (e.g., the “Failed” label shown in the legislation details above). If I can think of an interesting use case, I’ll pursue this further, and perhaps make my solution a bit less scrappy.
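A first pass at that kind of sanity check might look something like the sketch below. It takes one record in the shape of the JSON above and returns a list of complaints. The function name `checkRecord` is made up, and two things are assumptions on my part rather than facts about the data: that a dissenting vote is recorded as "No", and that more Yes than No votes ought to mean "passed" (real council rules may well differ).

```javascript
// Sketch only: flag records that are missing fields or whose stated
// result disagrees with a naive tally of the recorded votes.
function checkRecord(record) {
  const problems = [];

  // Fields that appear as strings in the scraped JSON.
  for (const field of ["title", "num", "type", "date", "result"]) {
    if (typeof record[field] !== "string" || record[field].trim() === "") {
      problems.push("missing " + field);
    }
  }

  const votes = Array.isArray(record.votes) ? record.votes : [];
  const yes = votes.filter(function (v) { return v.vote === "Yes"; }).length;
  // Assumption: dissent is recorded as "No".
  const no = votes.filter(function (v) { return v.vote === "No"; }).length;

  // Assumption: a simple majority of Yes votes should mean "passed".
  if (votes.length > 0) {
    const expected = yes > no ? "passed" : "failed";
    if (record.result !== expected) {
      problems.push(
        'result "' + record.result + '" but votes are ' + yes + " Yes / " + no + " No"
      );
    }
  }

  return problems;
}

// Abbreviated sample based on the record above.
const sample = {
  title: "A bill for an ordinance ...",
  num: "CB10-0331",
  type: "Council Bill",
  date: "5/17/2010",
  version: 2,
  result: "passed",
  votes: [
    { member: "(Council Member) Boigon", vote: "Yes" },
    { member: "(Council Member) Hancock", vote: "Yes" },
  ],
};

// Same record with the result flipped, to show what a flag looks like.
const suspicious = Object.assign({}, sample, { result: "failed" });

console.log(checkRecord(sample));
console.log(checkRecord(suspicious));
```

A flagged record is not necessarily a scraper bug; it may just be the source data being odd, which is exactly what needs a human look.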