Extracting map information from an SVG

This post explains how I converted an SVG map from Wikipedia into a map in the Khan Academy Computer Science environment. At the bottom of the post is the Python script I wrote to do the conversion.

The final map on Khan Academy

A while ago I used some JSON data I found to create a world map on Khan Academy. More recently I was asked whether I could make a map of Europe to be used for the game Diplomacy. Initially I thought I could just get a list of European countries and then rerun the program that drew the world map using just those countries, scaled to fit the screen. However, I realised that Diplomacy is set in the early 1900s, when Europe's borders were very different. Luckily, there was an SVG map of the board on Wikipedia, so I was able to just use that.

I had thought that I would be able to use the same program I used for the world map, but I'd forgotten that it took JSON data as its input and created an SVG and Processing.js map from that, so I had to write a new Python script. I started trying to scrape the data with regex, but soon realised that I'd forgotten how to use Python's regex module and that using regex to get data from an XML-like data structure is a silly idea.

Python XML parsers

In the past I've used both Beautiful Soup and lxml to parse XML. I used Beautiful Soup when I first started learning about SVG and remember finding it very complicated. It could well be that I just didn't know very much about SVG or Python at the time, as in general people seem to like it. However, I went for lxml, which I used more recently for my SVG optimiser (the inspiration for the optimiser came in part from all the trouble I had making my first choropleth maps with such messy SVGs).

I couldn't remember how to use lxml and found the documentation not all that helpful. Fortunately I was able to look through the code for my SVG optimiser and find what I wanted. The main reason for writing this blog post is in case I need to parse an SVG again (and to help anyone else who might want to). It turns out lxml is very easy to use and I only needed a few commands.

Using lxml

First you parse the SVG, which generates an element tree object:

from lxml import etree

# etree.parse accepts a filename directly, so there is no file handle to close
tree = etree.parse(filename)

Then you can loop through the elements in the tree searching for the tags you want:

for element in tree.iter():
    # print() with a single argument works in both Python 2 and 3
    print(element.tag)

The code above will print out the names of the SVG tags (such as line or text). Unfortunately, the tag includes the namespace (which is probably for good reason), so the tags look more like: {http://www.w3.org/2000/svg}text. That means if you want to get the actual tag name you have to first strip off the namespace. I did this by simply splitting at the closing brace, which may not be the most robust way to do things but is quick and easy:

for element in tree.iter():
    # Everything after the closing brace is the bare tag name
    print(element.tag.split("}")[1])
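
A slightly more defensive version of the same idea might look like the sketch below. The helper name local_name is my own invention rather than anything lxml provides. It leaves tags without a namespace untouched and returns None for nodes such as comments, whose tag is not a string in lxml.

```python
def local_name(tag):
    """Return the tag name without its '{namespace}' prefix.

    Returns None for nodes such as comments, whose tag is not a string.
    """
    if not isinstance(tag, str):
        return None
    # rpartition puts the whole string in the last slot when "}" is absent
    return tag.rpartition("}")[2]


print(local_name("{http://www.w3.org/2000/svg}text"))
print(local_name("polygon"))
```

With this, the loop body becomes local_name(element.tag) and no longer fails on an element that has no namespace.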

Getting the properties you want

The only other thing I needed to know was how to extract properties and this is done using element.get(). For example to get the "points" values from all the polygon elements:

for element in tree.iter():
    if element.tag.split("}")[1] == 'polygon':
        # get() returns None if the element has no such attribute
        print(element.get("points"))
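
As a small self-contained illustration of get(), here is a sketch that runs on a made-up two-polygon SVG string instead of a file. The sample data is purely hypothetical. It also passes the fully qualified tag name to iter() so that only polygons are visited, and falls back to the standard library's ElementTree (which offers the same calls) if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's ElementTree offers the same
    # fromstring/iter/get calls used here, so it works as a stand-in.
    import xml.etree.ElementTree as etree

# A made-up SVG, only for illustration
SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<polygon class="l" points="0,0 10,0 10,10"/>'
    '<polygon points="5,5 6,6 7,5"/>'
    '</svg>'
)

root = etree.fromstring(SVG)

# Passing a qualified tag name to iter() yields only matching elements
for element in root.iter("{http://www.w3.org/2000/svg}polygon"):
    # get() takes an optional default for attributes that are missing
    print(element.get("points"), element.get("class", "no class"))
```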

Finally, to split the points property into individual coordinates, I did use regex. I initially tried splitting at spaces to get each coordinate and at commas to get the x- and y-values, because it looked like the map used those consistently, but it turned out there were a couple of places where it didn't. So I used the following to split the points at either whitespace or a comma to get a single list of values:

import re

# A raw string keeps the backslash in \s from being treated as a string escape
re_split = re.compile(r'\s+|,')
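
One thing to watch with this pattern is that a comma followed by a space, as in "1, 2", counts as two separators and leaves an empty string in the result. The sketch below is one way around that, assuming the inconsistent entries simply mix commas and whitespace. The names SEPARATORS and parse_points are my own, and converting to floats is an extra step that the script in this post does not do.

```python
import re

# Any run of whitespace and/or commas counts as a single separator
SEPARATORS = re.compile(r"[\s,]+")


def parse_points(points):
    """Turn an SVG points string into a list of (x, y) float pairs."""
    values = [float(v) for v in SEPARATORS.split(points.strip()) if v]
    # Pair up alternate values: x0, y0, x1, y1, ...
    return list(zip(values[0::2], values[1::2]))


print(parse_points("0,0 10.5,0 10.5 , 20"))
```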

The SVG was quite nicely put together so it was relatively easy to extract the data. Each territory was in a group element which had a title property giving its name, followed by one or more polygon elements containing either a land or water territory. The land territories had a class "l", while the water territories had the class "w". There were also labels which were text elements containing a three letter code (extracted with element.text). I extracted the data into a dictionary of land territories and one of water territories, plus a list of labels.

from lxml import etree
import re

re_split = re.compile(r'\s+|,')

land = {}
water = {}
labels = []

def getCountryData(filename):
  tree = etree.parse(filename)
  name = None

  for el in tree.iter():
    # Strip the namespace, e.g. {http://www.w3.org/2000/svg}g -> g
    tag = el.tag.split("}")[1]

    if tag == 'g':
      # Remember the group's title for the polygons that follow
      name = el.get("title")

    elif tag == 'polygon':
      if el.get("class") == "w":
        water[name] = re_split.split(el.get("points"))
      else:
        land[name] = re_split.split(el.get("points"))

    elif tag == 'text':
      labels.append([el.text, el.get("x"), el.get("y")])
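
Because a dictionary keyed by name keeps only the last polygon when a group contains several, a variation that stores a list of polygons per territory might be useful. The sketch below shows that idea on a made-up sample following the structure described above. The territory names, coordinates and the function name read_territories are all hypothetical, and it falls back to the standard library's ElementTree if lxml is not installed.

```python
import re
from collections import defaultdict

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stand-in with the same calls

SVG_NS = "{http://www.w3.org/2000/svg}"
SPLIT = re.compile(r"[\s,]+")

# Made-up sample: groups with a title, polygons with class l or w, a label
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <g title="Alpha">
    <polygon class="l" points="0,0 4,0 4,3"/>
    <polygon class="l" points="5,5 6,5 6,6"/>
  </g>
  <g title="Beta Sea">
    <polygon class="w" points="10,10 20,10 20,20"/>
  </g>
  <text x="2" y="1">Alp</text>
</svg>"""


def read_territories(svg_text):
    """Return (land, water, labels); each name maps to a list of polygons."""
    land, water, labels = defaultdict(list), defaultdict(list), []
    name = None
    for el in etree.fromstring(svg_text).iter():
        if el.tag == SVG_NS + "g":
            name = el.get("title")
        elif el.tag == SVG_NS + "polygon":
            target = water if el.get("class") == "w" else land
            target[name].append(SPLIT.split(el.get("points").strip()))
        elif el.tag == SVG_NS + "text":
            labels.append([el.text, el.get("x"), el.get("y")])
    return land, water, labels


land, water, labels = read_territories(SAMPLE)
print(len(land["Alpha"]), list(water), labels)
```

Returning the three collections instead of filling module-level variables also makes the function easier to try out on small samples like this one.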


Writing the data for a Khan Academy program

The Khan Academy computer science environment uses Processing.js. I wrote a function to generate a text file I could paste into a program. It first set the fill colour to SEACOLOUR, which I later defined and put at the top of the program. Then it drew all the water territories using beginShape() followed by a list of vertex(x, y) commands and ending with endShape(). Since the SVG polygon element automatically closes paths, I had to add the first coordinate to the end of each list to ensure things were closed. I did the same with the land territories, before finally drawing the labels.

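def writedata(f, data):
  # .items() works in both Python 2 and 3
  for name, c in data.items():
    f.write("\n//%s\n" % name)
    f.write("beginShape();\n")

    # c is a flat list of values: x0, y0, x1, y1, ...
    for i in range(0, len(c) - 1, 2):
      f.write("vertex(%s, %s);\n" % (c[i], c[i + 1]))

    # Repeat the first point to close the shape
    f.write("vertex(%s, %s);\n" % (c[0], c[1]))
    f.write("endShape();\n")

def writeKACSCode(filename):
  with open(filename, 'w') as f:
    f.write("stroke(0, 0, 0);\n")
    # SEACOLOUR is defined at the top of the Khan Academy program
    f.write("fill(SEACOLOUR);\n")
    writedata(f, water)
    # (set the fill colour for the land here before drawing it)
    writedata(f, land)
    for data in labels:
      f.write("text('%s', %s, %s);\n" % (tuple(data)))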
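
To check the generated code before pasting it into Khan Academy, it can help to build the text for a single shape as a string first. The sketch below does that for one polygon. The function name shape_code and the sample coordinates are hypothetical.

```python
def shape_code(name, coords):
    """Return Processing.js source that draws one closed polygon.

    coords is a flat list: [x0, y0, x1, y1, ...].
    """
    lines = ["// %s" % name, "beginShape();"]
    for i in range(0, len(coords) - 1, 2):
        lines.append("vertex(%s, %s);" % (coords[i], coords[i + 1]))
    # Repeat the first point so the outline is closed
    lines.append("vertex(%s, %s);" % (coords[0], coords[1]))
    lines.append("endShape();")
    return "\n".join(lines)


print(shape_code("Example", ["0", "0", "10", "0", "10", "10"]))
```

An alternative to repeating the first vertex would be to end each shape with endShape(CLOSE), which asks Processing to close the outline itself.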

The full code can be downloaded below.

A few rough edges

Despite the SVG generally being quite well put together, there were a few inconsistencies which made things harder. Firstly, there was one sea which was drawn as a path rather than a polygon, so I had to rewrite that by hand (luckily it was quite short and easy). Then there were several polyline elements which I also converted to polygons, but that left some extraneous lines across St. Petersburg and Spain. These turned out to be created from two sets of polylines each, so I manually combined them into a single polygon. Then some of the water masses had to be moved so they were drawn over the land, as the land had been drawn as a solid block with the water on top.
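
The polyline-to-polygon step is the one part of this clean-up that could plausibly be scripted. The sketch below shows one way to do it by renaming the tag in place, using a made-up sample. It would not fix outlines that are split across two polylines, which still need combining by hand. It falls back to the standard library's ElementTree if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stand-in with the same calls

SVG_NS = "{http://www.w3.org/2000/svg}"

# Made-up sample containing one polyline and one polygon
SAMPLE = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<polyline class="w" points="0,0 5,0 5,5"/>'
    '<polygon class="l" points="1,1 2,1 2,2"/>'
    '</svg>'
)

root = etree.fromstring(SAMPLE)

# list() first, so renaming tags cannot disturb the iteration
for el in list(root.iter(SVG_NS + "polyline")):
    el.tag = SVG_NS + "polygon"  # attributes such as points are kept

print([el.get("points") for el in root.iter(SVG_NS + "polygon")])
```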

Ideally I would have drawn the water as a single rectangle covering the whole board, then drawn the land on top with some lines showing the boundaries of the water territories. That would have saved a lot of intensive polygon drawing, but it's rather too much effort, especially as some land masses, such as Ireland, are not drawn but only exist as places where there is no water.

readDiplomacyMap.txt (1.55 KB)
