Parks KML Explained

Yesterday, I demonstrated a KML file of Ontario Parks. Although this approach has some fairly serious limitations compared with a full mashup, the portability and ease of deployment make it an excellent choice in some situations.

In this article, I explain the techniques used to generate the Ontario Parks KML. These are general-purpose tools that could certainly be harnessed to write a map_data file for the book’s Chapter 2 framework.

Developing the Parks KML

The Parks KML file has its origins with an actual mashup at the Ontario Parks website. That mashup reads from a data file (located here) that looks like the following:

<markers>
    <marker lat="49.750000" lng="-92.640000"
        park="Aaron" url="aaro.html" type="op"/>
    <marker lat="45.380000" lng="-79.220000"
        park="Arrowhead" url="arro.html" type="op"/>
    <marker lat="44.784600" lng="-79.991000"
        park="Awenda" url="awen.html" type="op"/>
    ...
</markers>

We need a simple PHP script to grab this data, and generate KML. Each marker node from the original mashup needs to be translated into a KML Placemark, such as the following:

<Placemark>
    <name>Awenda</name>
    <description>
        http://ontarioparks.com/english/awen.html
    </description>
    <Point>
        <coordinates>-79.991000,44.784600</coordinates>
    </Point>
</Placemark>

This conversion is simple enough that we could actually just implement it as an XML transformation, from one format to the next. However, as you could see from the demo, I actually did something slightly more with the Parks example, and that was to grab a blurb and facilities list for each park’s info bubble.
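For reference, here is a minimal sketch of what that plain one-to-one conversion might look like if done with SimpleXML rather than a real XML transformation. The helper name markers_to_placemarks() is made up for this example, and the description only carries the park URL, as in the Placemark sample above.

```php
<?php
// Sketch: convert <marker> nodes straight into KML Placemark strings.
// markers_to_placemarks() is a hypothetical helper, not part of the original script.
function markers_to_placemarks($markers_xml, $url_base)
{
    $xml = new SimpleXMLElement($markers_xml);
    $placemarks = array();

    foreach ($xml->marker as $marker) {
        // Escape attribute values before dropping them into element content.
        $name = htmlspecialchars((string)$marker['park'], ENT_QUOTES | ENT_XML1, 'UTF-8');
        $link = htmlspecialchars($url_base . (string)$marker['url'], ENT_QUOTES | ENT_XML1, 'UTF-8');
        // KML wants longitude first, then latitude.
        $ll = (string)$marker['lng'] . ',' . (string)$marker['lat'];

        $placemarks[] = "<Placemark>\n"
            . "    <name>$name</name>\n"
            . "    <description>$link</description>\n"
            . "    <Point>\n"
            . "        <coordinates>$ll</coordinates>\n"
            . "    </Point>\n"
            . "</Placemark>";
    }
    return $placemarks;
}

$sample = '<markers>
    <marker lat="44.784600" lng="-79.991000"
        park="Awenda" url="awen.html" type="op"/>
</markers>';

$placemarks = markers_to_placemarks($sample, 'http://ontarioparks.com/english/');
echo $placemarks[0], "\n";
```

This covers the name, link and coordinates only; the blurb and facilities list need the extra work described in the rest of the article.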

Spidering

To get the park descriptions, I was going to need to actually visit the page for each one. I knew it would take me a number of tries to get all the parsing working correctly, and I didn’t want to hammer Ontario Parks, so I took the opportunity to put last week’s quickie session-caching CURL class to work.

$url_index = 'http://ontarioparks.com/english/markers.xml';

require('../curl.php');
$curl = new CURL();
$curl->enableCache();

// fetch page, strip HTTP headers
$text_response = strstr($curl->get($url_index), '<markers>');
$xml = new SimpleXMLElement($text_response);

Having got the master list of parks in $xml, I could iterate through them with a foreach loop, and build up an array of data on the parks.

$url_base = 'http://www.ontarioparks.com/english/';
$marker_data = array();

foreach($xml->marker as $marker) {
    $page = (string)$marker['url'];

    // Some parks don't actually have a page.
    if ('sorry.html' == $page) {
        $desc = '<em>(No data available about this park)</em>';
    } else {
        $page_data = $curl->get($url_base . $page);
        // extract info here; build $desc
    }

    $marker_data[] = array(
        'name' => (string)$marker['park'],
        'desc' => $desc,
        'll' => (string)$marker['lng'] . ',' . (string)$marker['lat']
    );
}

That first time running through all the 300+ parks was a pretty slow pageview, but after that, they all come from the session cache: instantaneous *and* guilt-free!
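If you don't have that class handy, the caching idea itself is small enough to sketch. The following is not the CURL class used above; cached_get() and its $fetch callable are hypothetical names, and the fetcher is faked so that the example runs without touching the network.

```php
<?php
// Sketch of the caching idea only; this is NOT the CURL class used above.
// cached_get() and its $fetch callable are hypothetical names.
function cached_get($url, array &$cache, $fetch)
{
    $key = md5($url);
    if (!isset($cache[$key])) {
        // Cache miss: hit the remote server once and remember the body.
        $cache[$key] = call_user_func($fetch, $url);
    }
    return $cache[$key];
}

// Stand-in fetcher so the example runs offline; a real one would use cURL.
$requests = 0;
$fake_fetch = function ($url) use (&$requests) {
    $requests++;
    return "<html>body of $url</html>";
};

$cache = array();   // in a web request this could be a slot in $_SESSION
$first  = cached_get('http://example.com/park.html', $cache, $fake_fetch);
$second = cached_get('http://example.com/park.html', $cache, $fake_fetch);

echo $requests, "\n";   // counts how many calls actually reached the fetcher
```

Keying on a hash of the URL keeps the array keys short and uniform, whatever the URL looks like.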

Parsing

With a blob of page text for each park, the next task was to extract the description from it. Ontario Parks may be on the ball with their classy homepage and WordPress blog, but these old park pages are a markup disaster.

When iterating through rational XML, it makes sense to use tools such as SimpleXML and XPath. But when it’s a dog’s breakfast like this, some decisive regexes can end up being the easier approach. If we want to isolate the first description paragraph, for example, from Quetico, we can look at the surrounding markup for some good parsing hooks:

<td width="56%" valign="top"><img src="quet-mainImage.jpg" alt="Image of Quetico" width="345" height="222" />
    <br />
    <p>Quetico is a protected, pristine wilderness retreat of international acclaim west of Lake Superior on the Canada-U.S. border. The park’s tangled network of lakes once formed water routes travelled by Ojibway and fur traders. Now it is primarily the destination of experienced canoeists seeking solitude and rare glimpses of wildlife by cascading waterfalls, glassy lakes and endless forests. The park is accessible at four points by canoe and two by car (Dawson Trail Campground and Lac la Croix Ranger Station).</p>
    <p><a href="emerald_ash.html" class="fishing">Bringing your own supply of firewood to the park this summer? Please read this.</a></p>
    <p><br />
    </p>
</td>

It’s tempting to try to use the image as an anchor, but smaller parks like Craigleith don’t actually include one, so that’s not much help. It turns out that the string <td width="56%" valign="top"> appears nowhere else but here, which makes it the best candidate we’ve got. This wouldn’t really be acceptable as a permanent solution, but we’re only running this once to generate our file, and we can tweak it as many times as it takes to get it right.

if (preg_match('#<td width="56%" valign="top">.*?<p>(.*?)</p>.*?</td>#is', $page_data, $match)) {
    $paragraph = trim(strip_tags($match[1]));
}

That regex may look scary as night, but the key parts are that it grabs the first paragraph inside the specified td, and that the s modifier lets the dot metacharacter match newlines. I only strip_tags it to get rid of the image, from those parks that have it.
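To see the effect of the s modifier in isolation, here is a small self-contained run of the same pattern. The markup is a trimmed, made-up stand-in in the same shape as the Quetico page, not the real page source.

```php
<?php
// Trimmed, made-up markup in the same shape as the park pages.
$page_data = <<<HTML
<td width="56%" valign="top"><img src="quet-mainImage.jpg" alt="Image of Quetico" />
    <br />
    <p>Quetico is a <strong>protected</strong>,
    pristine wilderness retreat.</p>
    <p><a href="emerald_ash.html">Bringing your own firewood?</a></p>
</td>
HTML;

$pattern = '#<td width="56%" valign="top">.*?<p>(.*?)</p>.*?</td>#is';

$paragraph = null;
if (preg_match($pattern, $page_data, $match)) {
    // The lazy (.*?) stops at the first </p>, so only paragraph one is captured.
    $paragraph = trim(strip_tags($match[1]));
}

// Without the s modifier the dot stops at newlines, so nothing matches here.
$without_s = preg_match('#<td width="56%" valign="top">.*?<p>(.*?)</p>.*?</td>#i', $page_data);

echo $paragraph, "\n";
```

The lazy quantifiers matter as much as the modifier: a greedy .* would run on to the last closing tag it could find instead of the first.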

Scraping out the activities and facilities is slightly more involved, but it’s the same principle at work, only in two stages. The first stage grabs the table surrounding them, and then a preg_match_all snatches each individual one. You can see it in action in the full code link at the bottom of the post. If there’s interest, I’ll put up a future article about regular expressions, specifically tailored to the art of scraping.
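The two-stage idea can be sketched roughly as follows. The table markup, the id and the class names here are invented for illustration, so treat this as the shape of the technique rather than the patterns from the actual script.

```php
<?php
// The markup below is invented for illustration; the real park pages differ.
$page_data = <<<HTML
<table class="layout"><tr><td>Not what we want</td></tr></table>
<table id="facilities">
    <tr><td class="item">Canoeing</td></tr>
    <tr><td class="item">Car Camping</td></tr>
    <tr><td class="item">Hiking</td></tr>
</table>
HTML;

$facilities = array();

// Stage one: isolate the table that surrounds the list.
if (preg_match('#<table id="facilities">(.*?)</table>#is', $page_data, $table)) {
    // Stage two: pull out every individual entry inside that table.
    preg_match_all('#<td class="item">(.*?)</td>#is', $table[1], $items);
    $facilities = array_map('trim', array_map('strip_tags', $items[1]));
}

echo implode(', ', $facilities), "\n";
```

Narrowing to the table first means the second pattern can stay loose without picking up stray cells from elsewhere on the page.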

KML Output

Once I had everything in the big $marker_data array, it was just a matter of actually generating the KML. You can see that I use short tags here, since it’s a template situation. You shouldn’t use them in portable code, since they depend on the short_open_tag setting, but in your own scripts they can really improve readability and help distinguish PHP from markup.

<? echo '<?xml version="1.0" encoding="UTF-8"?>'; ?>
<kml xmlns="http://www.google.com/earth/kml/2">
<Document>
    <name>Ontario Parks</name>
<?  foreach($marker_data as $this_point): extract($this_point); ?>
    <Placemark>
        <name><?= $name ?></name>
        <description><![CDATA[<?= $desc ?>]]></description>
        <Point>
            <coordinates><?= $ll ?></coordinates>
        </Point>
    </Placemark>
<?  endforeach; ?>
</Document>
</kml>

It would have been possible to use SimpleXML or DOM to generate this XML, but it’s a simple enough example that to do so would probably have been overkill.
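For comparison, a DOM version might look something like the sketch below. build_kml() is a hypothetical helper, the namespace is simply copied from the template above, and the input is assumed to have the same name/desc/ll keys as $marker_data.

```php
<?php
// Sketch: the same document built with DOM instead of a template.
// build_kml() is a hypothetical helper; the namespace is copied from the template.
function build_kml(array $marker_data)
{
    $ns  = 'http://www.google.com/earth/kml/2';
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->formatOutput = true;

    $kml = $dom->appendChild($dom->createElementNS($ns, 'kml'));
    $doc = $kml->appendChild($dom->createElementNS($ns, 'Document'));
    $doc->appendChild($dom->createElementNS($ns, 'name'))
        ->appendChild($dom->createTextNode('Ontario Parks'));

    foreach ($marker_data as $point) {
        $pm = $doc->appendChild($dom->createElementNS($ns, 'Placemark'));
        // Text nodes are escaped on output, so "&" in a name is safe.
        $pm->appendChild($dom->createElementNS($ns, 'name'))
           ->appendChild($dom->createTextNode($point['name']));
        $pm->appendChild($dom->createElementNS($ns, 'description'))
           ->appendChild($dom->createCDATASection($point['desc']));
        $pt = $pm->appendChild($dom->createElementNS($ns, 'Point'));
        $pt->appendChild($dom->createElementNS($ns, 'coordinates'))
           ->appendChild($dom->createTextNode($point['ll']));
    }
    return $dom->saveXML();
}

echo build_kml(array(array(
    'name' => 'Awenda',
    'desc' => '<em>(No data available about this park)</em>',
    'll'   => '-79.991000,44.784600',
)));
```

It is noticeably longer than the template, which is the overkill I mean, but it does take care of escaping for you.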

Saving To A File

When I was debugging all this, I’d let it simply output to the browser, where I could make a visual inspection. Once it came time to run it on Google Maps and Google Earth, I had it saved to a file for me, a task most easily accomplished with output buffering:

<?php ob_start(); // start buffering ?>

<?php // do output ... ?>

<?php
  // save contents of output buffer
  $fp = fopen($kml_file, 'w');
  if ($fp) {
      fwrite($fp, ob_get_contents());
      fclose($fp);
  }
?>

Once the script ends, the buffer gets written out to the screen anyway, but it’s also captured and saved to a file. (To suppress the output, you could change ob_get_contents to ob_get_clean.)
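Here is a minimal sketch of that ob_get_clean variant. It writes to a temp file and uses a one-line stand-in for the template, both of which are assumptions made only so the example is safe to run anywhere.

```php
<?php
// Sketch: capture output into a file without echoing it to the browser.
// $kml_file points at a temp file here purely so the example is safe to run.
$kml_file = tempnam(sys_get_temp_dir(), 'kml');

ob_start();                                  // start buffering
echo "<kml><Document></Document></kml>\n";   // stand-in for the real template
$kml = ob_get_clean();                       // grab the buffer AND discard it

if (file_put_contents($kml_file, $kml) === false) {
    // Report the failure rather than ignoring it.
    trigger_error("Could not write $kml_file", E_USER_WARNING);
}
```

Because the buffer is discarded, nothing reaches the browser, so this suits the final run more than the debugging phase.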

Limitations of KML

As stated yesterday, there are some fairly significant limitations on KML, particularly in the Google Maps implementation. The most glaring of these is the filesize limit, which could be overcome in part by just sending the points in batches and withholding infoWindow contents until specifically requested. Other points to consider:

  • You can’t use custom icons. There are 250 common ones that ship with Maps and Earth, which you can reference, although we didn’t bother in this case. This is a limitation that will hopefully be lifted in the near future, especially given that KML can already make external image references for overlays.
  • You can’t control everything. For example, the name field will become a tooltip in Earth, but it’s only a sidebar label in Maps. It is possible to do tooltips in a full mashup, of course, as demonstrated in Chapter 9 of the book (see the example here), but the Maps implementation of KML doesn’t include this yet.
  • You can’t use JavaScript in the information bubbles, nor can you use the Google Maps tabbed infowindow. It would have been cool to also grab each park’s photograph, as well as the information about facilities and activities, but this data simply wouldn’t all fit in a single infowindow.
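As for the filesize limit, the batching idea mentioned earlier could start out as simply as the sketch below. The batch size and the parks-N.kml naming scheme are arbitrary assumptions, and batch_markers() is a hypothetical helper; each chunk would then be fed through the same template to produce its own file.

```php
<?php
// Sketch: split the points into fixed-size batches, one KML file per batch.
// The batch size and the parks-N.kml naming scheme are arbitrary assumptions.
function batch_markers(array $marker_data, $batch_size)
{
    $batches = array();
    foreach (array_chunk($marker_data, $batch_size) as $i => $chunk) {
        $batches['parks-' . ($i + 1) . '.kml'] = $chunk;
    }
    return $batches;
}

// Dummy data standing in for the real $marker_data array.
$marker_data = array();
for ($n = 1; $n <= 7; $n++) {
    $marker_data[] = array('name' => "Park $n", 'desc' => '', 'll' => '0,0');
}

foreach (batch_markers($marker_data, 3) as $filename => $chunk) {
    echo $filename, ': ', count($chunk), " points\n";
}
```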

These limitations may be frustrating, but they’re hardly deal-breakers. For those dozens of mashups where you only need a few simple points plotted, why not just publish KML?

Full source here: Generate.php

