Databases, how they work?

Started by
7 comments, last by gdunbar 17 years, 1 month ago
Hello people, Instead of fully dedicating myself in producing a demo of my own ideas, like you all do so nicely :-D, I tend to 'waste' my time by figuring out the technical side existing programs. After abandoning a lot of projects where I gained the understanding of a working mechanism, I arrived at the position where I want to know how databases work. Now a database is merely a mapped-file, right? So not much to learn about that, since it could be implemented in every way you want. But there are applications which use such database-files, the hosting-application. I wish to know how they work, so that will be my general question for this thread. Narrowing down the subject down into sub-questions, my question might becomes maybe easier to answer. 1) What does exactly 'host' a database-file? Is this a single process (application) that is being started every time the OS is being started? Or is it somewhat integrated into the OS? 2) How is the interfacing with a 'hosted' database-file actually being done? Socketwise, like with a TCP/IP connection (or UDP, connection-less)? Or is the being done else way, making it possible to send data parallel? 3) How is 'database-host' using the database-file? Does this go via the use of a single filestream, or does it use multiple filestreams to the same database-file? Does it even use a filestream, or has it a more advanced way of using the file? 4) A more technical question that maybe be more depending on what database type is being used (like SQL, Oracle), how do they map new records? Is this in a linked-list style or more like linking blocks of records? 5) Are there any protocols on how a database-file should be mapped, or is that just database-type dependent? 6) Whats up with ADO (in ActiveX Data Objects)? Is this a protocol or more a sort of API like OpenGL and DirectX or interface itself? Are there any equivalents, like Direct3D is toward OpenGL and visa-versa? These are the questions I would really like to know the answer to. When I have the majority of the question answered, I will most likely start an other project focussed upon building a very simplified application which will host a database-file and supports an sort of interface toward it. Can anyone enlighten me? Thanks in advance, Regards, Xeile - "Mr. what-to-know-it-all"
Advertisement
Quote:Original post by Xeile
Now a database is merely a mapped-file, right?


Not necessarily. Some databases can actually spread out data over several files (I think BerkeleyDB does it, for instance).

Quote:1) What does exactly 'host' a database-file? Is this a single process (application) that is being started every time the OS is being started? Or is it somewhat integrated into the OS?


Database files are hosted by the filesystem, just like any other file. They have nothing special about the, and do not necessarily exist in RAM unless a process loads the there. That process is usually the database server.

Quote:2) How is the interfacing with a 'hosted' database-file actually being done? Socketwise, like with a TCP/IP connection (or UDP, connection-less)? Or is the being done else way, making it possible to send data parallel?


A database file is a disk representation of a database. Reading and writing is handled by a database server, which may be contacted in any number of ways: sockets, COM, libraries with C linkage... a database server is ultimately just a way of storing data in a file with a nifty query system, it can be done in any number of ways.

Quote:3) How is 'database-host' using the database-file? Does this go via the use of a single filestream, or does it use multiple filestreams to the same database-file? Does it even use a filestream, or has it a more advanced way of using the file?


Depends. You may map a portion of the disk to memory, or read relevant portions of it every time a query is made, with caching. It depends upon the database size, the number of tables and the access frequency for each of these.

Quote:4) A more technical question that maybe be more depending on what database type is being used (like SQL, Oracle), how do they map new records? Is this in a linked-list style or more like linking blocks of records?


Depends on the indexing mode. For instance, they may use an unordered vector for storage (constant-time insertion and removal) along with a tree or hashtable that points into that vector (constant or logarithmic time updates and searches).

Quote:5) Are there any protocols on how a database-file should be mapped, or is that just database-type dependent?


Database-dependent, although some of these publish their file formats, which can then be used elsewhere.

Quote:Original post by Xeile
1) What does exactly 'host' a database-file? Is this a single process (application) that is being started every time the OS is being started? Or is it somewhat integrated into the OS?
Usually a single process on the OS, running as a service.

Quote:Original post by Xeile
2) How is the interfacing with a 'hosted' database-file actually being done? Socketwise, like with a TCP/IP connection (or UDP, connection-less)? Or is the being done else way, making it possible to send data parallel?
It depends on the DB. I know MySQL tries to use Windows pipes (Which are slightly more efficient that sockets on Windows), and can also use sockets.

Quote:Original post by Xeile
3) How is 'database-host' using the database-file? Does this go via the use of a single filestream, or does it use multiple filestreams to the same database-file? Does it even use a filestream, or has it a more advanced way of using the file?
Under Windows, DBMS's will memory map [parts of] the file and let the OS handle it. I presume other OSs will do similar things.

Quote:Original post by Xeile
4) A more technical question that maybe be more depending on what database type is being used (like SQL, Oracle), how do they map new records? Is this in a linked-list style or more like linking blocks of records?
Almost certainly DB-specific, but I'm not really sure.

Quote:Original post by Xeile
5) Are there any protocols on how a database-file should be mapped, or is that just database-type dependent?

6) Whats up with ADO (in ActiveX Data Objects)? Is this a protocol or more a sort of API like OpenGL and DirectX or interface itself? Are there any equivalents, like Direct3D is toward OpenGL and visa-versa?
Again, not sure.

Quote:
1) What does exactly 'host' a database-file? Is this a single process (application) that is being started every time the OS is being started? Or is it somewhat integrated into the OS?


I'm unaware of any databases that require kernel-level drivers. Maybe some really advanced DBs that run on high-end systems and mainframes, but every single one I've come across is a userland application. Just like a web server, file server

Quote:2) How is the interfacing with a 'hosted' database-file actually being done? Socketwise, like with a TCP/IP connection (or UDP, connection-less)? Or is the being done else way, making it possible to send data parallel?


Every modern database I've come across allow multiple simultaneious connectiosn over TCP.

Quote:
3) How is 'database-host' using the database-file? Does this go via the use of a single filestream, or does it use multiple filestreams to the same database-file? Does it even use a filestream, or has it a more advanced way of using the file?


I believe the data structure is maintained in memory, and reads/writes from/to disk are performed only when necessary. At the very least the indices to the tables will be loaded - but that ties into you're next question. Thats probably very implementation dependent though.

Quote:
4) A more technical question that maybe be more depending on what database type is being used (like SQL, Oracle), how do they map new records? Is this in a linked-list style or more like linking blocks of records?


If what I learned in my CS class some 7 years ago :-( is correct, then most good DBs will store their records in B trees(also called B+ trees). The indices will be stored in memory, so insertion and deletion of records will be very fast. In a good, balanced B tree, you won't have to search more than about 4 levels before arriving at your node. I believe all the basic operations - search, insertion, deletion are constant time or very near that.

If you really want to find out DB internals, I would download a version of PostgreSQL (. It is a very robust DB, and the source code is totally open.
1) A DBMS is just an application, really. It may be a single process or may be several processes - this is implementation dependent. (On Windows typically it'll be one process, on Unix frequently multiple processes.)

2) Again implementation dependent and usually configurable. Typically over TCP/IP as "normally" the DBMS will reside on a dedicated machine with client applications being remote. (Depends on configuration of course!) Using TCP/IP to connect locally (loop-back adaptor) shouldn't "hit the wire" but be rerouted by the TCP/IP software stack (i.e. comms remain on the machine itself).

3) Implementation and configuration dependent. For example, IBM DB2 supports two types of tablespace (or dbspace - terminology is again implementation dependent) (logical construct for locating tables) - "System Managed" and "Database Managed". System managed spaces result in individual filesystem files for each table; database managed space uses pre-sized ("cooked") files or raw devices and manages which parts of the files/devices belong to which tables internally. IBM Informix SE uses only files (like system managed spaces); IBM Informix Dynamic Server uses only cooked files or raw devices for its dbspaces.

4) All DBMSs I know of use "pages". A page is a defined size block. Inside the block the space is managed in "slots" with each slot holding one data record. (For user data at any rate, indices and LOBs are different...)

A "row id" is usually used to identify (internally) a specific row - the row id will tell you the page and slot. As a user this is not typically useful information - you'd normally define a primary key for your records allowing you to locate just the record you're after - for example, an employee number.

5) Implementers choice.

6) ADO is just a "tool set" (term used loosly) used to interface with a database to extract/store/manipulate the data. You can use ODBC directly (or other interfaces provided by the DBMS) to access the data too.

You can get fairly detailed info from the IBM DB2 or IBM Informix Dynamic Server on-line help pages:

http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=/com.ibm.db2.udb.admin.doc/doc/c0004125.htm

http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.adref.doc/adref223.htm

Hope this wasn't too much info :)
Quote:1) What does exactly 'host' a database-file? Is this a single process (application) that is being started every time the OS is being started? Or is it somewhat integrated into the OS?


Database is abstract entity. There is no need for database to have a file - there exist in-memory databases. The term static database is used in some fields to determine extensive configuration templates which are represented by individual files. But ultimately, a set of java hash maps can be with all right called a database.

More commonly, the term database refers to an application that can execute structured query language (or SQL).

Quote:2) How is the interfacing with a 'hosted' database-file actually being done? Socketwise, like with a TCP/IP connection (or UDP, connection-less)? Or is the being done else way, making it possible to send data parallel?


Either through application specific API, or through SQL statements. If database process is running on the same machine as user application, or if user application embeds the database via proprietary API, then this will be done through database specific API.

More commonly, the database will be accessed via means of most suitable inter-process communication, TCP or pipes. The communication protocol here is simply SQL statements, possibly encoded to reduce space.

Quote:3) How is 'database-host' using the database-file? Does this go via the use of a single filestream, or does it use multiple filestreams to the same database-file? Does it even use a filestream, or has it a more advanced way of using the file?


A database can be fully in memory, or it can be distributed across a cluster. DHT (distributed hash table) for example is used in peer-to-peer applications and the data resides on thousands, if not millions of hosts.

Common SQL databases will cache some data in memory, page large blocks of data to and from disk, and maintain several indexes in memory to speed up searches. In addition, some forms of files can be used to replicate operations across multiple databases, perform logging, backup, etc.

Quote:4) A more technical question that maybe be more depending on what database type is being used (like SQL, Oracle), how do they map new records? Is this in a linked-list style or more like linking blocks of records?


SQL is a language. MySQL and Oracle Database are two applications that use it.

SQL based databases will commonly use some form of trees to maintain indexes mapping keys into underlying storage. It depends a lot on the size and structure of the data.
Whichever algorithm is used, it must consider that if something is stored on file, sequential block access will generally perform better than full random access, while random access is optimal for data stored in RAM. So it will be a combination of aproaches.

Quote:5) Are there any protocols on how a database-file should be mapped, or is that just database-type dependent?


Database file is any file stored on disk. The file system itself is a database. It comes down to how you access it.

SQL databases use proprietary formats which are irrelevant to the user. All access is done solely through SQL. But the real value of using existing implementations and products comes from the optimizations and design done by the database implementation. Having lots of data is easy. Accessing it effectively is where the value lies.

The structure of these files is tied closely to how each database is implemented, to the system it runs on, and to ammount of data. But structure, even if it's published somewhere, is a matter of decades of studies and optimizations, and has little value for re-use outside of that particular database application.

Quote:6) Whats up with ADO (in ActiveX Data Objects)? Is this a protocol or more a sort of API like OpenGL and DirectX or interface itself? Are there any equivalents, like Direct3D is toward OpenGL and visa-versa?


ADO is one of many abstractions for access to various databases.


Anyway, sorry for some generalizations, but I feel it's important to understand that databases go well beyond MySQL, and that some concepts are applicable outside of SQL world.

As a side note, given the nature of the board, simplified versions of SQL are used in some MMORPG implementations to route the messages, manage areas of interest, or, in some experimental systems, to provide fully distributed peer-to-peer based world, that can run on arbitrary number of hosts.
So many different answers - perhaps because there's so many different implementations. I'd recommend you start here.
Here's a page about the internals of data pages for MS SQLServer: http://www.akadia.com/services/sqlsrv_data_structure.html

--- krez ([email="krez_AT_optonline_DOT_net"]krez_AT_optonline_DOT_net[/email])
There are 2 main reasons that a database is _not_ just a map backed by a file.

The first is that a database has to satisfy transactional properties. The main point here is that if you perform a complex operation on a database, the operation needs to completely succeed, or completely fail. For instance, if I want to transfer money from my savings account to my checking account, but the computer crashes in the middle, it would be quite undesirable for my savings account to be lowered but with no effect on my checking account.

The second point is that a database needs to be able to do its work in a performant manner when accessed by a large number of users concurrently. (While still maintaining correctness).

Take a database course at your local university for the basics. Get a PhD in Computer Science if you want to understand all the intricacies. :-)

Geoff

This topic is closed to new replies.

Advertisement