I have a shapefile whose attribute information is stored in an XML document. For example, the COUNTY attribute is stored as follows:
<attr>
  <attrlabl>COUNTY</attrlabl>
  <attrdef>County abbreviation</attrdef>
  <attrtype>Text</attrtype>
  <attwidth>1</attwidth>
  <atnumdec>0</atnumdec>
  <attrdomv>
    <edom>
      <edomv>C</edomv>
      <edomvd>Clackamas County</edomvd>
      <edomvds/>
    </edom>
    <edom>
      <edomv>M</edomv>
      <edomvd>Multnomah County</edomvd>
      <edomvds/>
    </edom>
    <edom>
      <edomv>W</edomv>
      <edomvd>Washington County</edomvd>
      <edomvds/>
    </edom>
  </attrdomv>
</attr>
When converting the shapefile to a PostGIS/PostgreSQL table, I also want to create a second PostgreSQL/PostGIS table that describes the attributes. The new table would have these columns: attribute (attrlabl), definition (attrdef), type (attrtype), width (attwidth), and categories (attrdomv).
I appreciate any suggestions.
Best Answer
Parsing XML by hand is awkward, especially if the input may vary in format (e.g. tags in upper case, mixed case, or lower case). I would therefore recommend using a parser such as BeautifulSoup to read the data into Python structures. From these structures you can write the data out in whatever format your database requires.
This should be the least painful way of doing it, as developing your own regexes for such a task is always much more work than you think it will be.
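To make the idea concrete, here is a rough sketch of the parsing step. Note that it swaps in Python's standard-library xml.etree.ElementTree for BeautifulSoup so that it runs without extra installs; the approach is the same either way. The names SAMPLE, text_of and parse_attributes, and the "code=description" layout of the categories column, are my own assumptions rather than anything from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, copied from the <attr> element in the question.
SAMPLE = """<attr>
  <attrlabl>COUNTY</attrlabl>
  <attrdef>County abbreviation</attrdef>
  <attrtype>Text</attrtype>
  <attwidth>1</attwidth>
  <atnumdec>0</atnumdec>
  <attrdomv>
    <edom><edomv>C</edomv><edomvd>Clackamas County</edomvd><edomvds/></edom>
    <edom><edomv>M</edomv><edomvd>Multnomah County</edomvd><edomvds/></edom>
    <edom><edomv>W</edomv><edomvd>Washington County</edomvd><edomvds/></edom>
  </attrdomv>
</attr>"""


def text_of(parent, tag):
    """Return the stripped text of the first child named `tag`, or None."""
    child = parent.find(tag)
    if child is None or child.text is None:
        return None
    return child.text.strip()


def parse_attributes(xml_text):
    """Return one (attribute, definition, type, width, categories) tuple per <attr>."""
    root = ET.fromstring(xml_text)
    # Normalise tag case so <ATTR>, <Attr> and <attr> are all treated alike.
    for elem in root.iter():
        elem.tag = elem.tag.lower()

    rows = []
    for attr in root.iter("attr"):
        width = text_of(attr, "attwidth")
        # Flatten the enumerated domain into "code=description" pairs (assumed layout).
        categories = "; ".join(
            f"{text_of(edom, 'edomv')}={text_of(edom, 'edomvd')}"
            for edom in attr.iter("edom")
        )
        rows.append((
            text_of(attr, "attrlabl"),
            text_of(attr, "attrdef"),
            text_of(attr, "attrtype"),
            int(width) if width is not None else None,
            categories or None,
        ))
    return rows


rows = parse_attributes(SAMPLE)
print(rows)
```

Each tuple lines up with the columns listed in the question, so the rows could then be handed to whatever PostgreSQL driver you use, for example a parameterised INSERT via psycopg2's cursor.executemany. If the categories are better kept in a separate lookup table, the same loop over the edom elements can emit one row per code instead of a joined string.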