[GIS] Parse XML files in Python (ElementTree)

python xml

I am trying to parse some XML and RSS feeds, extract some of their data, and store it in a PostGIS database.

The file I want to parse is here: http://earthquake.usgs.gov/earthquakes/catalogs/1hour-M1.xml and looks like this:

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss">
  <updated>2012-03-05T19:57:55Z</updated>
  <title>USGS M 1+ Earthquakes</title>
  <subtitle>Real-time, worldwide earthquake list for the past hour</subtitle>
  <link rel="self" href="http://earthquake.usgs.gov/earthquakes/catalogs/1hour-M1.xml"/>
  <link href="http://earthquake.usgs.gov/earthquakes/"/>
  <author><name>U.S. Geological Survey</name></author>
  <id>http://earthquake.usgs.gov/</id>
  <icon>/favicon.ico</icon>
  <entry><id>urn:earthquake-usgs-gov:ak:10425830</id><title>M 2.5, Alaska Peninsula</title><updated>2012-03-05T19:37:00Z</updated><link rel="alternate" type="text/html" href="http://earthquake.usgs.gov/earthquakes/recenteqsww/Quakes/ak10425830.php"/><summary type="html"><![CDATA[<img src="http://earthquake.usgs.gov/images/globes/60_-155.jpg" alt="58.558&#176;N 155.673&#176;W" align="left" hspace="20" /><p>Monday, March  5, 2012 19:37:00 UTC<br>Monday, March  5, 2012 10:37:00 AM at epicenter</p><p><strong>Depth</strong>: 185.70 km (115.39 mi)</p>]]></summary><georss:point>58.5578 -155.6727</georss:point><georss:elev>-185700</georss:elev><category label="Age" term="Past hour"/></entry>
  <entry><id>urn:earthquake-usgs-gov:nc:71742560</id><title>M 1.5, San Francisco Bay area, California</title><updated>2012-03-05T19:31:16Z</updated><link rel="alternate" type="text/html" href="http://earthquake.usgs.gov/earthquakes/recenteqsus/Quakes/nc71742560.php"/><summary type="html"><![CDATA[<img src="http://earthquake.usgs.gov/images/globes/40_-120.jpg" alt="37.988&#176;N 122.455&#176;W" align="left" hspace="20" /><p>Monday, March  5, 2012 19:31:16 UTC<br>Monday, March  5, 2012 11:31:16 AM at epicenter</p><p><strong>Depth</strong>: 0.20 km (0.12 mi)</p>]]></summary><georss:point>37.9882 -122.4550</georss:point><georss:elev>-200</georss:elev><category label="Age" term="Past hour"/></entry>
  <entry><id>urn:earthquake-usgs-gov:ak:10425819</id><title>M 2.2, Central Alaska</title><updated>2012-03-05T19:28:00Z</updated><link rel="alternate" type="text/html" href="http://earthquake.usgs.gov/earthquakes/recenteqsus/Quakes/ak10425819.php"/><summary type="html"><![CDATA[<img src="http://earthquake.usgs.gov/images/globes/65_-150.jpg" alt="63.217&#176;N 150.524&#176;W" align="left" hspace="20" /><p>Monday, March  5, 2012 19:28:00 UTC<br>Monday, March  5, 2012 10:28:00 AM at epicenter</p><p><strong>Depth</strong>: 112.30 km (69.78 mi)</p>]]></summary><georss:point>63.2167 -150.5241</georss:point><georss:elev>-112300</georss:elev><category label="Age" term="Past hour"/></entry>
  <entry><id>urn:earthquake-usgs-gov:nc:71742550</id><title>M 1.8, Northern California</title><updated>2012-03-05T19:18:22Z</updated><link rel="alternate" type="text/html" href="http://earthquake.usgs.gov/earthquakes/recenteqsus/Quakes/nc71742550.php"/><summary type="html"><![CDATA[<img src="http://earthquake.usgs.gov/images/globes/40_-125.jpg" alt="38.818&#176;N 122.821&#176;W" align="left" hspace="20" /><p>Monday, March  5, 2012 19:18:22 UTC<br>Monday, March  5, 2012 11:18:22 AM at epicenter</p><p><strong>Depth</strong>: 2.40 km (1.49 mi)</p>]]></summary><georss:point>38.8177 -122.8205</georss:point><georss:elev>-2400</georss:elev><category label="Age" term="Past hour"/></entry>
  <entry><id>urn:earthquake-usgs-gov:ak:10425806</id><title>M 2.1, Southern Alaska</title><updated>2012-03-05T19:14:44Z</updated><link rel="alternate" type="text/html" href="http://earthquake.usgs.gov/earthquakes/recenteqsus/Quakes/ak10425806.php"/><summary type="html"><![CDATA[<img src="http://earthquake.usgs.gov/images/globes/60_-145.jpg" alt="60.501&#176;N 145.118&#176;W" align="left" hspace="20" /><p>Monday, March  5, 2012 19:14:44 UTC<br>Monday, March  5, 2012 10:14:44 AM at epicenter</p><p><strong>Depth</strong>: 17.10 km (10.63 mi)</p>]]></summary><georss:point>60.5011 -145.1175</georss:point><georss:elev>-17100</georss:elev><category label="Age" term="Past hour"/></entry>
</feed>

I am quite new to Python and to XML but I believe that the right direction to choose is ElementTree. Thus I have started with the following code:

#-*- coding: utf-8 -*-

import os
import urllib
import xml.etree.ElementTree as ET

def main():
  feed = urllib.urlopen("http://earthquake.usgs.gov/earthquakes/catalogs/1hour-M1.xml")

  try:
    tree = ET.parse(feed)
    print "Download ok"
    root = tree.getroot()
    print root
    event = root.find("entry")
    for e in event:
      print e.attrib
  except Exception, inst:
    print "Unexpected error opening %s: %s" % (tree, inst)

if __name__ == "__main__":
  main()

but an error is thrown…

Can anyone point me in the right direction? I do not really understand how to extract the data from the various tags. And what is the ideal strategy to "store" this data: an array of dictionaries?

Thanks in advance!

Best Answer

Before I try to answer, a tip. Your exception handler covers up the nature of the problem. Just let the original exception rise up and you'll have more information to share with people who are interested in helping you.
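
With the handler out of the way, the likely culprit (assuming the download itself succeeds) is that every element in this feed lives in the Atom namespace, so a bare "entry" matches nothing and find() returns None, which the for loop then fails to iterate. Here is a minimal sketch of the difference, in Python 3 syntax rather than the Python 2 of the question. SAMPLE is a trimmed copy of the feed above so that it runs offline, and the names ATOM, unqualified, qualified and titles are just illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed from the question, so the sketch runs offline.
SAMPLE = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss">
  <title>USGS M 1+ Earthquakes</title>
  <entry><id>urn:earthquake-usgs-gov:ak:10425830</id><title>M 2.5, Alaska Peninsula</title><georss:point>58.5578 -155.6727</georss:point></entry>
  <entry><id>urn:earthquake-usgs-gov:nc:71742560</id><title>M 1.5, San Francisco Bay area, California</title><georss:point>37.9882 -122.4550</georss:point></entry>
</feed>"""

# ElementTree stores namespaced tags as "{namespace-uri}localname".
ATOM = "{http://www.w3.org/2005/Atom}"

root = ET.fromstring(SAMPLE)

# A bare tag name does not match a namespaced element, so this is None.
unqualified = root.find("entry")

# Qualifying the tag with the Atom namespace finds the entries.
qualified = root.findall(ATOM + "entry")
titles = [entry.find(ATOM + "title").text for entry in qualified]

print(root.tag)
print(unqualified)
print(titles)
```

Note also that find() returns a single element, and iterating over an element walks its children, so findall() is what you want for looping over entries.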

I like to use feedparser to parse Atom feeds. It does indeed give you dict-like objects. I submitted a patch to feedparser 4.1 to parse the GeoRSS elements into GeoJSON-style dicts. See https://code.google.com/p/feedparser/issues/detail?id=62 and the blog post at http://sgillies.net/blog/566/georss-patch-for-universal-feedparser/. You'd use it like this:

>>> import feedparser
>>> feed = feedparser.parse("http://earthquake.usgs.gov/earthquakes/catalogs/1hour-M1.xml")
>>> feed.entries[0]['where']
{'type': 'Point', 'coordinates': (-122.8282, 38.844700000000003)}

My patched version of 4.1 is in my Dropbox and you can get it using pip.

$ pip install http://dl.dropbox.com/u/10325831/feedparser-4.1-georss.tar.gz

Or just download and install with "python setup.py install".
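
If you would rather stay with ElementTree, a list of dicts is a reasonable in-memory shape for this data. The following is only a sketch under a few assumptions: it uses Python 3 syntax, the function name parse_entries and the dict keys are made up for illustration, and SAMPLE_FEED is a trimmed copy of the feed in the question. It mirrors the GeoJSON-style 'where' dict shown above, so the "lat lon" order of georss:point is swapped to (lon, lat).

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the prefixes are local to this script.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
}

# Trimmed copy of the feed from the question, so the sketch runs offline.
SAMPLE_FEED = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss">
  <entry><id>urn:earthquake-usgs-gov:ak:10425830</id><title>M 2.5, Alaska Peninsula</title><updated>2012-03-05T19:37:00Z</updated><georss:point>58.5578 -155.6727</georss:point><georss:elev>-185700</georss:elev></entry>
  <entry><id>urn:earthquake-usgs-gov:nc:71742560</id><title>M 1.5, San Francisco Bay area, California</title><updated>2012-03-05T19:31:16Z</updated><georss:point>37.9882 -122.4550</georss:point><georss:elev>-200</georss:elev></entry>
</feed>"""


def parse_entries(xml_text):
    """Return one dict per Atom entry (hypothetical helper, not a library API)."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", NS):
        # georss:point holds "lat lon"; GeoJSON-style coordinates are (lon, lat).
        lat, lon = (float(v) for v in entry.findtext("georss:point", namespaces=NS).split())
        entries.append({
            "id": entry.findtext("atom:id", namespaces=NS),
            "title": entry.findtext("atom:title", namespaces=NS),
            "updated": entry.findtext("atom:updated", namespaces=NS),
            "where": {"type": "Point", "coordinates": (lon, lat)},
            # Elevation as given in the feed (metres, judging by the sample depths).
            "elev": float(entry.findtext("georss:elev", namespaces=NS)),
        })
    return entries


entries = parse_entries(SAMPLE_FEED)
for e in entries:
    print(e["title"], e["where"]["coordinates"], e["elev"])
```

To run it against the live feed you would read the response body (for example with urllib.request.urlopen in Python 3) and pass the text to parse_entries; entries that lack a georss:point would need a guard that this sketch leaves out.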
