Word Processors: Data Organisation

Started by
6 comments, last by Emmanuel Deloget 15 years, 9 months ago
Can anyone suggest how a program like MS Word would arrange document data (to keep it simple lets just assume blocks of text and not images, tables etc) in memory? How would it maintain paragrpahs in memory? What about dividing this text in to 'pages' - how does it calculate that? What if it needs to split a paragraph of text over two pages? I assume the entire document would be loaded in to some form of tree structure but how would this be organised in to pages? Does each page somehow index in to a single "document tree" (perhaps maintaining "start" and "end" points)? Does each page maintain a subtree of the document tree? Can anyone offer any suggestions of how this is done?
Advertisement
First the obligatory sanity check: Why?

Quote:Can anyone suggest how a program like MS Word would arrange document data


Office file formats. Note that most complexity comes from 20 years of legacy.
Open Office formats.

Quote:to keep it simple lets just assume blocks of text and not images, tables etc) in memory? How would it maintain paragrpahs in memory?


You need to differentiate between storage and layout. Storage is just that. When a document is displayed, the layout is performed based on current size.

HTML is a good example of such representation.

Quote:I assume the entire document would be loaded in to some form of tree structure but how would this be organised in to pages?


Technical details. Depends on your available memory, etc... And documents generally aren't organized in pages, since layouts can change (A4 to A3, or similar). That's part of layout process.

Quote:Does each page somehow index in to a single "document tree" (perhaps maintaining "start" and "end" points)?


A more practical version uses more or less SGML based layout using tags. OpenOffice uses XML + embedded_files within a zip.

Quote:Can anyone offer any suggestions of how this is done


It's not, plain and simple. Already all the possible high-fidelity editors are available (Google apps, Office and OpenOffice suites, tex).


But if you want to find out how this works - look at web browser and HTML. It's exactly the same concept. Everything else is just technical aspects.
File formats say nothing about how the data is stored in memory. For example, it would be silly to build a XML document in memory (operations are guaranteed to be too slow).

The Design Pattern: Elements of Reusable Object Oriented Software book cites a few interesting pattern. One of them is the flyweight pattern that allow to store graphical representation of characters used in a word processor.

This document talks about the creation of a small word processor software and describes a few data structures.

Speaking of the needed data structure, they are not that easy to devise. Think about it:
  • text shall be easily transformed into an image
  • but you shall be able to search through the text easily
  • you shall also be able to support multiple level of undo/redo
  • it should not take too much memory
  • but remember that each character can have its own distinctive style
  • you shall also support special features (lists, tables, ...)
And so on.

Fortunately, most changes are local. Here is how I'd do it:
  • A paragraph class, to which you can attach a style object. This will allow you to handle basic styles at the paragraph level.
  • In each paragraph, a list of character instances (again, a style object is attached to each character; if no style is attached, we inherit the portions of the paragraph style).
.
class character_style {   character_flyweight flyweight;};class paragraph_style {   character_style* base_style;};class character_information{  wchar_t char_value;};class character {  character_style* style;  character_information* character_info;};class paragraph {  paragraph_style* style;  list<character> characters;};
Then, how so I support the needed features?
  1. undo/redo: while the action is the same as before, store the action into the current action. If the action changes, creates a new action and store it in the action LIFO. Everything the user does is an action (typing a character, changing the style of a set of characters, ...). Some actions can't be stacked up into a single action instance. For instance, typing the enter key twoce shall create two actions, but typing a character twice shall create only one action. Undo is done by playing the action backward.
  2. display: maintain a flyweight with the graphical representations of the characters; build a boxed graphical representation of the text (à la LaTeX). Render.
  3. search: you shall be able to search directly in the main text database
  4. lists: that's a paragraph style.

Memory-wise: a 500,000 character document would take a bit of memory (but then, it's also quite a big document); at 5 characters per word, 50 words per paragraph, that's 2000 paragraphs (basic version: 3 pointers - a paragraph object, a list of character and a style object) = roughly 25KB. The text itself needs roughly 4MB (500,000 chars x 2 pointers (character description + style)). The rendering information is made of a tree and should take a few more MB.

Of course, you can add more complexity to this scheme.
I seriously doubt any word processor is using any per-character object, despite what the GoF Flyweight pattern might imply. More likely, they store attributes that refer to a block of text, in much the same way HTML tags apply formatting to any contained text nodes. There is so much coherence between one character and an adjacent character on average that it would be a drastic waste of storage to do it any other way.
Interesting responses, thanks guys. That 'Design Patterns' book looks good I will need to order a copy of that. Don't worry I haven't taken leave of my senses and embarked on writing the next big thing in Word Processors :)

I do have an idea for a WYSIWYG application for composing guitar tablature and a word processor is most analogous to what I am thinking about. I will need to check out LaTeX too.
Also take a look at gap buffers, if you need efficient insertion.
Quote:Original post by Codejack

I do have an idea for a WYSIWYG application for composing guitar tablature and a word processor is most analogous to what I am thinking about. I will need to check out LaTeX too.


Did you try putting "guitar tablature editor" into google? There's even an open source application on front page.
Quote:Original post by Kylotan
I seriously doubt any word processor is using any per-character object, despite what the GoF Flyweight pattern might imply. More likely, they store attributes that refer to a block of text, in much the same way HTML tags apply formatting to any contained text nodes. There is so much coherence between one character and an adjacent character on average that it would be a drastic waste of storage to do it any other way.

Although I haven't checked the source code, I'm pretty sure that LaTeX is going down to the character level. But then, LaTeX is not a word processor (it's more a text compiler).

Using per-character objects have a few benefits. But one must veryfy whether it's a good idea or not (speed-wise). A WP have to be fast.

This topic is closed to new replies.

Advertisement