
After exporting pages from Confluence, the next step was to convert them from the Confluence wiki format to the MediaWiki format. The differences are sometimes quite amusing: in a link whose displayed name differs from the page it points to, Confluence puts the name first and the page second, while MediaWiki puts the page first and the name second.
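
To make the link difference concrete, here is a minimal sketch in Python 3 (the script further down is Python 2). The names `ALIASED_LINK`, `swap_link` and `convert_links` are made up for this illustration, and treating `https:` as external is an assumption that goes beyond what the original script checks.

```python
import re

# Confluence writes [label|target]. MediaWiki writes [[target|label]] for
# internal pages and [url label] for external links.
ALIASED_LINK = re.compile(r"\[([^|\]]+)\|([^\]]+)\]")


def swap_link(match):
    """Return the MediaWiki form of one aliased Confluence link."""
    label, target = match.group(1), match.group(2)
    # Assumption: anything starting with these schemes is an external link.
    if target.startswith(("http:", "https:", "ftp:")):
        return "[%s %s]" % (target, label)
    return "[[%s|%s]]" % (target, label)


def convert_links(line):
    """Rewrite every aliased link on a line, deciding per link."""
    return ALIASED_LINK.sub(swap_link, line)


print(convert_links("See [the guide|Install Guide] or [our site|http://example.org/]."))
```

Because the replacement is a function, each link on the line is classified on its own rather than once per line.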

The script below is not really my best code: I think I was writing Python in a very PHP-esque way. But it worked well enough to convert our pages.

#!/usr/bin/env python

from cStringIO import StringIO
import os.path
import codecs
import re

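def append_page(all_files, name, filenames):
    # os.path.walk callback: collects bare file names, so this only works
    # when orig-pages has no subdirectories.
    for f in filenames:
        all_files.append(f)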

link_re = re.compile(r"\[([^\]]+)\]")
linktext_re = re.compile(r"\[([^\|\]]+)\|([^\]]+)\]")
bold_re = re.compile(r"\*([^ *]+)\*")
italic_re = re.compile(r"\b_([^ _]+)_\b")

def mangle(contents):
    utf8_w = codecs.getwriter('utf-8')
    utf8_r = codecs.getreader('utf-8')
    output = utf8_w(StringIO())
    # Number of {noformat} markers seen so far; odd means we are inside a block.
    noformat_count = 0
    for line in contents.split("\n"):
        if "h1." in line:
            line = line.replace("h1.", "==")
            line = line + " =="

        if "h2." in line:
            line = line.replace("h2.", "===")
            line = line + " ==="

        if "h3." in line:
            line = line.replace("h3.", "====")
            line = line + " ===="

        if "h4." in line:
            line = line.replace("h4.", "=====")
            line = line + " ====="

        ltm = linktext_re.search(line)
        if ltm:
            if 'http:' in ltm.group(2) or 'ftp:' in ltm.group(2):
                line = re.sub(linktext_re, r"[\2 \1]", line)
            else:
                line = re.sub(linktext_re, r"[\2|\1]", line)

        if (noformat_count % 2) == 0:
            lm = link_re.search(line)
            if lm:
                line = re.sub(link_re, r"[[\1]]", line)

        bm = bold_re.search(line)
        if bm:
            line = re.sub(bold_re, r"'''\1'''", line)

        im = italic_re.search(line)
        if im:
            line = re.sub(italic_re, r"''\1''", line)

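        # Replace one marker per pass, so that two markers on the same
        # line become an opening and a closing tag.
        while '{noformat}' in line:
            if (noformat_count % 2) == 0:
                line = line.replace('{noformat}', '<pre>', 1)
            else:
                line = line.replace('{noformat}', '</pre>', 1)
            noformat_count += 1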

        output.write(line + "\n")

    value = output.getvalue()
    a = utf8_r(StringIO(value)).read()
    return a

files = []
os.path.walk('orig-pages', append_page, files)

print files

for f in files:
    contents = codecs.open('orig-pages/%s' % (f,), 'r', 'utf-8').read()
    contents = mangle(contents)
    codecs.open('conv-pages/%s' % (f,), 'w', 'utf-8').write(contents)
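
For comparison, here is a shorter sketch of the heading, bold, italic and `{noformat}` handling in Python 3, calling `sub` directly on the compiled patterns. It is not a drop-in replacement for the script above: the names `HEADING`, `convert_markup` and `convert` are invented for this example, headings are only recognised at the start of a line, and all markup inside `{noformat}` is left untouched, which is an assumption about the desired behaviour (the script above still converts bold and italic there).

```python
import re

HEADING = re.compile(r"^h([1-4])\.\s*(.*)$")
BOLD = re.compile(r"\*([^ *]+)\*")
ITALIC = re.compile(r"\b_([^ _]+)_\b")
NOFORMAT = "{noformat}"


def convert_markup(part):
    """Convert headings, bold and italic in text outside {noformat}."""
    m = HEADING.match(part)
    if m:
        # h1. maps to "==", h2. to "===", and so on, as in the script above.
        marks = "=" * (int(m.group(1)) + 1)
        part = "%s %s %s" % (marks, m.group(2).strip(), marks)
    part = BOLD.sub(r"'''\1'''", part)
    return ITALIC.sub(r"''\1''", part)


def convert(text):
    """Convert a whole page, toggling <pre> state at each {noformat}."""
    out = []
    in_pre = False
    for line in text.split("\n"):
        pieces = []
        # Splitting on the marker means every occurrence flips the state,
        # even when several markers share one line.
        for i, part in enumerate(line.split(NOFORMAT)):
            if i > 0:
                in_pre = not in_pre
                pieces.append("<pre>" if in_pre else "</pre>")
            pieces.append(part if in_pre else convert_markup(part))
        out.append("".join(pieces))
    return "\n".join(out)
```

A boolean flag replaces the marker counter here, which avoids the modulo checks and keeps the inside/outside state explicit.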

2 Responses

  1. Nicholas Riley, July 21, 2006 at 08:00 PM.

    We're just thinking about switching _to_ Confluence from a home-grown, rather feature-poor wiki written in Python (price is not an issue, we already have licenses for Confluence). MediaWiki is an option too, I guess, but I really don't like to deal with a gigantic PHP application. What kinds of problems did you find with Confluence that triggered the switch? And yeah, you might try using the methods on the compiled regex objects next time, it's shorter and clearer to read. :-)
  2. Neil Blakey-Milner, July 21, 2006 at 08:20 PM.

    There were two reasons. One was cost-related - we couldn't use the for-Open-Source-projects license to discuss our commercial offerings and it didn't seem worth it to pay for something solved in so many other systems. Second was familiarity/popularity - hopefully more people are familiar with it, and so barriers to contributing would be low. Would hopefully also be easier to modify to our needs - that's already proven to be the case. Neil
