Converting .doc to .html

4 comments, last by ciroknight 18 years, 3 months ago
Hey,

I'm wondering if any of you know how .doc (Word) files are formatted. More precisely, how could I take the styles from a .doc and convert them into CSS? I googled the thread's title and found a lot of pages, but they all seem to talk about using MS Word to do the conversion.

Basically, I'd like to load myDoc.doc in my program and have it return myDoc.html with valid HTML. This is for a newsletter written at work. I'd like to make my job easier by having the conversion done automatically instead of copy-pasting the text and then formatting it with HTML tags.

(If there's a "proper" term for this, let me know and I could probably do the research on my own. I don't really know what to call it, since I've never done anything like this before.)

Thanks,
Seb
You could try opening it in Microsoft Word itself and saving it as .html.
Although this probably doesn't use CSS (I haven't tried recently), it will usually give a decent translation.
As for reading .doc files, try googling for the format; it is well documented.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Some programs can do this job.
You can search for them on the www.download.com site.

--Mojtaba--
You could also try OpenOffice. It would do the same as MS Word, though (open the .doc and Save As... .html). It does not use CSS; it encodes everything with HTML tags.
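For the "done automatically" part of the question, the Save As step can be scripted rather than clicked through. Below is a minimal sketch, assuming a current LibreOffice (the successor to OpenOffice.org) whose `soffice` binary is on the PATH. The function names are made up for illustration, and the quality of the generated HTML is whatever the office suite produces.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Return the argument list for a headless .doc -> .html conversion."""
    return [
        soffice,            # assumed to be LibreOffice's launcher binary
        "--headless",       # run without opening a window
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_html(doc_path, out_dir=".", soffice="soffice"):
    """Run the conversion and return the path where the HTML should land."""
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice!r} was not found on PATH")
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    # The converter keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(doc_path).stem + ".html")
```

Calling `convert_doc_to_html("myDoc.doc", "out")` would then be the whole "load myDoc.doc, get myDoc.html back" step. Keeping the command builder separate makes it easy to check the arguments without having the office suite installed.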
Quote:Original post by swiftcoder
You could try opening it in Microsoft Word itself and saving it as .html.
Although this probably doesn't use CSS (I haven't tried recently), it will usually give a decent translation.[...]
Whatever you do, don't do this. It will generate HTML that is around 100 times larger than the text itself, and everything is specified in absolute units, so anybody who needs it larger or smaller is screwed. Changing the font size by editing the HTML isn't an option because it's VERY messy HTML, filled with non-standard tags. Even 'tidy' (a program designed to clean up HTML files, with an option to deal with Word's junk) chokes on HTML generated by anything newer than Word 2000.
"Walk not the trodden path, for it has borne its burden." -John, Flying Monk
Copy->Paste, reproduce format.

But if you can't do it manually, OO.o does the best job of anything I've found (and yes, I've had to look; would you BELIEVE a company would force me to produce webpages from .docs automatically when I'm a fully qualified web developer?), but I still recommend delving into the generated code and removing some of the fat.
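As a rough illustration of "removing some of the fat", here is a sketch that uses only Python's standard `html.parser`. The choice of what counts as fat (inline `style`, `class` and `lang` attributes, `<style>` blocks, namespaced tags such as `<o:p>`, and comments) is an assumption for this example, not a complete cleaner, and `FatStripper` and `strip_fat` are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

# Attributes treated as presentational "fat" in this sketch (an assumption).
DROP_ATTRS = {"style", "class", "lang"}


class FatStripper(HTMLParser):
    """Re-emit HTML while dropping styling attributes and non-standard tags."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_style = False

    def _open_tag(self, tag, attrs, closer):
        kept = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name not in DROP_ATTRS
        )
        self.out.append(f"<{tag}{kept}{closer}")

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True      # skip the whole <style> block
        elif ":" not in tag:          # drop namespaced tags like <o:p>
            self._open_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        if tag != "style" and ":" not in tag:
            self._open_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        elif ":" not in tag:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_style:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    # Comments are dropped: the inherited handle_comment does nothing.


def strip_fat(html_text):
    parser = FatStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Running the generated file through `strip_fat` leaves the structural tags and links in place, so the newsletter's own stylesheet can decide the fonts and sizes instead of hard-coded absolute units.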

This topic is closed to new replies.
