Initially I thought I'd have to write a VBScript that iterated through a Word document and converted things on the fly. Fortunately - thank God for Google - it turned out that there's a much easier way - using Word's .NET programmability support, we can import some COM DLLs, automatically generating .NET wrappers on the fly, and Voila!
The reference article I found is this one: Export customized XML from Microsoft Word with VB.NET.
The thing that stumped me at first was that whenever I added the reference to Microsoft Office 11.0 Object Library, it wouldn't find the Word automation DLL. It turns out that .NET programmability isn't installed by default for Office components - you have to select it manually for each Office application.
With that out of the way, I thought things would proceed smoothly. Nope.
I hit the first wall when I was trying to call the function that opens a Word document for processing. Here's the generated .NET wrapper function signature:
Document Open(ref object FileName, ref object ConfirmConversions, ref object ReadOnly, ref object AddToRecentFiles, ref object PasswordDocument, ref object PasswordTemplate, ref object Revert, ref object WritePasswordDocument, ref object WritePasswordTemplate, ref object Format, ref object Encoding, ref object Visible, ref object OpenAndRepair, ref object DocumentDirection, ref object NoEncodingDialog, ref object XMLTransform);
Good stuff, huh?
First of all, all but the first parameter are optional - knew that from the documentation. That is, they have default values. However, C# (at least the version I'm using) does not support parameters with default values - it opts for overloading. Unfortunately, the generated wrapper doesn't take that into consideration - i.e. it doesn't generate 15 overloads.
So now I have 2 options:
1. Call directly, passing all parameters.
2. Use reflection to call the function, supplying only the parameters I care about and using default values for the rest.
The first option didn't look that bad. After all, I'm a self-proclaimed typing ninja. After half an hour of fiddling trying to get this to work, I discovered that not only am I a self-proclaimed typing ninja, but I'm R.E.T.A.R.D.E.D too.
The function takes all its parameters by reference to an object. So, say you want to pass a string for the first parameter. You can't do any of the following:
// Argument '1': cannot convert from 'string' to 'ref object'
wordApplication.Documents.Open(fileName, ...);
// Argument '1': cannot convert from 'ref string' to 'ref object'
wordApplication.Documents.Open(ref fileName, ...);
// A ref or out argument must be an assignable variable
wordApplication.Documents.Open(ref ((object)fileName), ...);
So, to the best of my admittedly rather shallow knowledge, you're basically left with:
object name = (object)fileName;
wordApplication.Documents.Open(ref name, ...);
Then it struck me that there are fifteen other parameters.
For a single function call, out of many to-be-made function calls.
At this point, I decided I'd give reflection a try - maybe it'd look better and be easier to maintain. At least it'd allow me to use the default values. So:
MethodInfo methodInfo = typeof(Word.Documents).GetMethod("Open");
ParameterInfo[] parameterInfo = methodInfo.GetParameters();
// One slot per declared parameter: the file name first, Type.Missing for the rest
object[] parameters = new object[parameterInfo.Length];
parameters[0] = (object)fileName;
for (int i = 1; i < parameterInfo.Length; i++)
    parameters[i] = Type.Missing;
Word.Document document = (Word.Document)typeof(Word.Documents).InvokeMember("Open",
    BindingFlags.Public | BindingFlags.Instance | BindingFlags.InvokeMethod | BindingFlags.OptionalParamBinding,
    null, wordApplication.Documents, parameters);
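To try the Type.Missing trick without Word in the loop, here's a minimal sketch against a made-up class. FakeDocuments and its Open method are hypothetical stand-ins, not part of the Word wrapper, and the stand-in uses the optional-parameter syntax of newer C# compilers (4.0 and later) purely so that it has some defaults to fall back on. The reflection calls themselves are the same ones used above.

```csharp
using System;
using System.Reflection;

// Hypothetical stand-in for Word.Documents: one required and two optional parameters.
public class FakeDocuments
{
    public string Open(string fileName, bool readOnly = false, string password = "none")
    {
        return fileName + ", readOnly=" + readOnly + ", password=" + password;
    }
}

public static class ReflectionDemo
{
    public static string OpenWithDefaults(FakeDocuments documents, string fileName)
    {
        MethodInfo methodInfo = typeof(FakeDocuments).GetMethod("Open");
        ParameterInfo[] parameterInfo = methodInfo.GetParameters();

        // Fill the first slot, and mark every other slot as "use the declared default".
        object[] parameters = new object[parameterInfo.Length];
        parameters[0] = fileName;
        for (int i = 1; i < parameterInfo.Length; i++)
            parameters[i] = Type.Missing;

        return (string)typeof(FakeDocuments).InvokeMember("Open",
            BindingFlags.Public | BindingFlags.Instance | BindingFlags.InvokeMethod | BindingFlags.OptionalParamBinding,
            null, documents, parameters);
    }

    public static void Main()
    {
        Console.WriteLine(OpenWithDefaults(new FakeDocuments(), "notes.doc"));
    }
}
```

The point of the sketch is only that the caller never has to spell out the defaults; whether the COM wrapper reacts the same way for every parameter is something I'd still verify against Word itself.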
At this point, I was unsure which of the 2 methods is lamer. So I decided I'd implement the first method completely and see how they compare in lameness.
Clickety clickety click, clickety clickety click, and a while later I have this:
object falseObj = (object)false;
object trueObj = (object)true;
object nullObj = null;
object strObj = (object)"";
object formatObj = (object) Word.WdOpenFormat.wdOpenFormatAuto;
object encodingObj = (object) Microsoft.Office.Core.MsoEncoding.msoEncodingAutoDetect;
object directionObj = (object) Word.WdDocumentDirection.wdLeftToRight;
document = wordApplication.Documents.Open(ref name, ref trueObj, ref trueObj, ref falseObj, ref strObj, ref strObj, ref falseObj, ref strObj, ref strObj, ref formatObj, ref encodingObj, ref falseObj, ref falseObj, ref directionObj, ref falseObj, ref nullObj);
"Phew, done! Let's run this baby", thought Retardo. Only to get bombed to death by a type-mismatch COM Exception.
Now, all the events so far almost brought a tear to my eye. In a sad way. Sad, frustrated, and bewildered. There's no information about which parameter it is that is causing the mismatch, and there's nothing wrong as far as I can see. Let the experimentation begin!
Use integers for true and false...Nope.
Pass nullObj instead of empty string objects for all the password stuff...Nope.
Pass strObj instead of nullObj for the last parameter, XMLTransform...OMGLOL!!@@ Works!
So, basically, this parameter is a string. I didn't know that. The documentation says:
Optional Object. Specifies a transform to use.
So, now that this method works, I can move on. But now I've learnt my lesson - life is hard. Always. I've not seen the worst yet...
Armed with this honorable spirit, I proceeded with the mucking. As expected, I bumped into more issues, examples of which are:
1. When iterating through the document paragraphs, I first used a foreach loop on the Document.Paragraphs collection. The collection implements IEnumerable, so this worked fine. Later, I needed a one paragraph look-ahead, so I switched to an index-based loop. Only to be greeted by an exception. To let the BOTCHED thing speak for itself:
An unhandled exception of type 'System.Runtime.InteropServices.COMException' occurred in WordToWiki.exe
Additional information: The requested member of the collection does not exist.
By the requested member, it means the indexer. Fo'real, m'nizzle?
2. I rely on style names for formatting. For example, paragraphs with the style named "C#" are converted to C# source tags. To retrieve the style, I used the Style property, only to discover that C# doesn't support this kind of property - I had to call the generated get_Style() and set_Style() accessor methods instead.
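Regarding the look-ahead in item 1: Office collections are 1-based, so an index of 0 is a likely (though here unconfirmed) cause of that exception. Since foreach works fine on the collection, another option is to build the look-ahead on top of the enumerator and skip the indexer entirely. The sketch below does that for any IEnumerable; the Pairs helper and the sample strings are made up for illustration and stand in for the real paragraph objects.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class LookAhead
{
    // Yields (current, next) pairs from any IEnumerable; next is null for the last item.
    public static IEnumerable<KeyValuePair<object, object>> Pairs(IEnumerable items)
    {
        IEnumerator e = items.GetEnumerator();
        if (!e.MoveNext())
            yield break;                      // empty collection: nothing to pair up

        object current = e.Current;
        while (e.MoveNext())
        {
            yield return new KeyValuePair<object, object>(current, e.Current);
            current = e.Current;
        }
        yield return new KeyValuePair<object, object>(current, null);
    }

    public static void Main()
    {
        // Plain strings standing in for the document's paragraphs.
        string[] paragraphs = { "Title", "Body", "Code" };
        foreach (KeyValuePair<object, object> pair in Pairs(paragraphs))
            Console.WriteLine(pair.Key + " -> " + (pair.Value ?? "(end)"));
    }
}
```

With the real collection, each Key and Value would need a cast back to the paragraph type before use.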
Now, not all issues stem from the tools. After working on the thing for a couple of hours more - surprisingly, it was going smoother than I expected - I noticed that there's a lot of swapping going on, and that Windows has slowed down considerably. Additionally, Opera kept crashing. A quick look at the task manager revealed that I had like 30 running instances of Winword.exe, at memory consumption of 1GB+.
I forgot to add code to close the Word application instance used internally. It took me a couple of minutes to clean this mess up, then I added the code.
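The general shape of the fix is a try/finally around the conversion, so the instance gets closed even when something throws halfway through. Here's a minimal sketch of that shape; FakeWordApplication is a hypothetical stand-in that just counts live instances, and its parameterless Quit() replaces the real Quit call on the application object, which in the generated wrapper takes its own ref object arguments.

```csharp
using System;

// Hypothetical stand-in for Word.Application; counts "running" instances.
public class FakeWordApplication
{
    public static int RunningInstances;

    public FakeWordApplication() { RunningInstances++; }

    public void Quit() { RunningInstances--; }
}

public static class QuitDemo
{
    public static void ConvertDocument(bool fail)
    {
        FakeWordApplication wordApplication = new FakeWordApplication();
        try
        {
            // The actual conversion work would go here.
            if (fail)
                throw new InvalidOperationException("conversion blew up");
        }
        finally
        {
            // Runs on success and on failure, so no instance is left behind.
            wordApplication.Quit();
        }
    }

    public static void Main()
    {
        ConvertDocument(false);
        try { ConvertDocument(true); } catch (InvalidOperationException) { }
        Console.WriteLine(FakeWordApplication.RunningInstances);
    }
}
```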
I struggled a bit with bold and italics after this - iterating through the document character by character takes ages. Iterating it sentence by sentence is cool, except that I get repeated sentences for some reason - I'll look into that later, God-willing. For now, I've used Word's Search/Replace functionality to do bold and italics. It works, but not neatly. For example, say you have the word "Huzzah" in bold. You'd get tags surrounding each character - i.e. the search/replace isn't greedy.
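One possible cleanup for the per-character tags, sketched under the assumption that the output uses distinct opening and closing tags: delete every closing tag that is immediately followed by an opening tag, which merges neighbouring runs into one. The `<b>`/`</b>` strings and the CollapseAdjacent helper are placeholders, not what the converter actually emits or contains.

```csharp
using System;

public static class TagCleanup
{
    // Removes every "close tag immediately followed by open tag" seam,
    // so runs of per-character tags merge into a single tagged span.
    public static string CollapseAdjacent(string text, string openTag, string closeTag)
    {
        return text.Replace(closeTag + openTag, "");
    }

    public static void Main()
    {
        string perCharacter = "<b>H</b><b>u</b><b>z</b><b>z</b><b>a</b><b>h</b>";
        Console.WriteLine(CollapseAdjacent(perCharacter, "<b>", "</b>")); // prints "<b>Huzzah</b>"
    }
}
```

Note that this also merges two separately bolded runs that happen to touch, which looks the same when rendered but is worth knowing about.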
This journal entry was written in Word then converted using the alpha software. I'll release it when it's done under a non-restrictive license, God-willing - Probably ZLib or something similar.