I have to disagree in part. Although it isn't mentioned explicitly in the OP, I assume we're speaking about a format for the game and not one for the development stage. Here are my 2 cents:
File formats like XML in general, Collada, OpenEXR, ... are good for development, file exchange, and (in the case of XML) in my opinion also for configuration. They are application file formats, so to speak. They are less suited as file formats for games, although they can be used there, of course. The reason is that they tend to be bloated and are needlessly complicated to parse, and hence have a negative impact on memory footprint and execution time.
But we want to get rid of e.g. load screens, don't we? Well, to overcome this issue, games usually have their own file formats. These are binary, compact, need as little interpretation as possible, and are perhaps even packed to enable better streaming performance. A good example is BitSquid. There is also an article at Gamasutra; unfortunately I can't find it ATM. There are also a dozen or so threads here on GDnet that discuss this topic. I mean, would anyone use the NetPBM P3 format just because it is human readable? A big No. Instead, we want textures to be compressed already on disk, and we complain that Khronos let so much time pass until compressed textures could finally be loaded directly.
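To make "binary, compact, little interpretation" a bit more concrete for text, here is a minimal sketch in Python of a string table with a fixed-size index in front of a UTF-8 blob. The layout, the `DLG0` tag, and the function names are all made up for illustration; this is not BitSquid's format or any existing one.

```python
import struct

MAGIC = b"DLG0"  # hypothetical 4-byte file tag, invented for this sketch


def pack_strings(strings):
    """Pack {id: text} as: magic, count, (id, offset, length) index, UTF-8 blob."""
    blob = bytearray()
    index = bytearray()
    for frag_id, text in sorted(strings.items()):
        data = text.encode("utf-8")
        # Each index entry is three little-endian 32-bit unsigned integers.
        index += struct.pack("<III", frag_id, len(blob), len(data))
        blob += data
    return MAGIC + struct.pack("<I", len(strings)) + bytes(index) + bytes(blob)


def lookup(buf, frag_id):
    """Fetch one fragment by id; only the index is scanned, the blob is not parsed."""
    if buf[:4] != MAGIC:
        raise ValueError("not a dialog string table")
    (count,) = struct.unpack_from("<I", buf, 4)
    blob_start = 8 + count * 12  # 8 header bytes, 12 bytes per index entry
    for i in range(count):
        fid, offset, length = struct.unpack_from("<III", buf, 8 + i * 12)
        if fid == frag_id:
            start = blob_start + offset
            return bytes(buf[start:start + length]).decode("utf-8")
    raise KeyError(frag_id)


table = pack_strings({1: "Hello there.", 2: "Who are you?"})
print(lookup(table, 2))
```

With a layout like this, one table per locale could be loaded (or memory-mapped) as a single block, and the only per-fragment work at runtime is a lookup and a UTF-8 decode.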
Back to dialogs! Dialog is a topic I'm currently dealing with in depth, because I'm just now developing an interactive, non-linear graphic novel system with, of course, locale support. Text itself is just one side of the problem. Embedding it is another.
The OP mentioned that the dialog is pre-defined and linear for now, so we don't need to speak about dynamic text substitution, handling of articles, pronouns, and proper names, and similar things. Although, IMHO, features like an inventory would profit from such a mechanism, too. If you're interested, there is an article at Gamasutra that sheds some light on this problem and proposes a possible solution (don't get scared off by the term "MMOG" in its title). It is a kind of scripting, but with an emphasis on the textual character rather than on programming.
Coming now to the embedding of text. Text needs to be fragmented to allow its handling, and the fragments need to be identified. Fragmentation is to be expected due to a change of speaker, a dramatic pause, avoiding text overflow on screen, waiting until the player has finished reading, whatever. Text needs to be put on screen somewhere, at some point in time, in a defined sequence. This introduces three layers: the text fragments themselves, each one with a unique identifier (preferably numeric) and an identifier of the speaker (if pre-defined); the transitions from one fragment to another (this is where pausing, waiting on user action, sequencing, branching, parallelism, and the like come in); and finally the decision where to render. The first two layers have an impact on the file format.
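As a rough illustration of the first two layers, here is a minimal Python sketch for the linear case the OP described. The class names, field names, and transition kinds ("wait_input", "pause") are my own assumptions for the example, not a fixed scheme; the rendering layer is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    frag_id: int      # unique numeric identifier of the text fragment
    speaker_id: int   # identifier of the pre-defined speaker
    text: str         # would come from the locale-specific string table


@dataclass(frozen=True)
class Transition:
    source: int         # fragment id this transition leaves
    target: int         # fragment id it leads to
    kind: str           # hypothetical kinds: "wait_input" or "pause"
    delay: float = 0.0  # seconds, only meaningful for "pause"


def play_order(fragments, transitions, start):
    """Walk a linear dialog and return (speaker_id, text, how_it_ends) per step."""
    by_id = {f.frag_id: f for f in fragments}
    # Linear dialog: at most one outgoing transition per fragment.
    next_of = {t.source: t for t in transitions}
    order, seen, current = [], set(), start
    while current is not None and current not in seen:  # guard against cycles
        seen.add(current)
        frag = by_id[current]
        step = next_of.get(current)
        order.append((frag.speaker_id, frag.text, step.kind if step else "end"))
        current = step.target if step else None
    return order


fragments = [
    Fragment(10, 1, "Halt! Who goes there?"),
    Fragment(11, 2, "Just a traveller."),
    Fragment(12, 1, "Hm. Pass, then."),
]
transitions = [
    Transition(10, 11, "wait_input"),
    Transition(11, 12, "pause", delay=1.5),
]
for speaker, text, ends_with in play_order(fragments, transitions, start=10):
    print(speaker, text, ends_with)
```

Branching would mean allowing several outgoing transitions per fragment, each with some condition attached, which is why it seems worth keeping transitions separate from the fragments from the start.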
Edited by haegarr, 18 October 2013 - 02:08 AM.