[web] Database Design - Stock Exchange Data

Started by
10 comments, last by Elovoid 12 years, 10 months ago
I'm currently trying to bend my mind around the database design for a stock exchange database.

I'll go with a table for the actual stocks (ID/ISIN, Name, Details, etc) and a table for the current price and time of the last update. The thing that I'm unsure about is the layout of the price data history.



a) My current approach would be to store the price data divided by period. One Table for intraDay values (maybe from the last 5 - 7 days) and one table for older data as I think I don't need to store all the intraDay data for the last n* years. Do you think a table with stockID, date, dayHighest, dayLowest and dayAverage would be sufficient? In my simulation, there won't be a time where the stock exchange is closed, so I won't have a starting price and "close of the exchange" price to store. I'll use an average price for each day as a substitute.

By caching subsets (last week, last month, 3 months, 6 months, last year, 5 years) it shouldn't matter if the history data table gets a bit bigger but I need a good performance for the intraDay price table. I only have a basic knowledge of database design. Therefor I have no idea about the layout of the supplied data.



B) Alternatively I'd keep several data entries per day in the history table and calculate highest and lowest price on demand (while creating the subset views as mentioned above). This would be more precise but I'm unsure about the effect on the performance of the cronjob that creates the subsets.




Do you see downfalls with my approach (A and B) and possible performance problems I might run into?
There will be about 15 - 25 sets of stocks ranging from 10 to 50 stocks per set to start with but the numbers might increase in the future.

Thanks in advance,
Elovoid
Advertisement
As always, do the simplest thing first, then optimize IF NECESSARY. It is much easier to optimize a correct design than it is to correct a broken 'optimized' design.

Do you think a table with stockID, date, dayHighest, dayLowest and dayAverage would be sufficient? [/quote] If you defer the optimization decision, you can decide whether to implement it this way or not when you actually KNOW whether it's sufficient or not. I'd start with what you actually need right now, which is a table recording stock price changes.

By caching subsets (last week, last month, 3 months, 6 months, last year, 5 years) it shouldn't matter if the history data table gets a bit bigger but I need a good performance for the intraDay price table. [/quote] As you familiarize yourself with moden SQL database implementations, you'll learn that the performance of queries on a dataset is basically independent of its size if you can create an appropriate index. if you've created a structure more or less like this:


create table stock (
id integer primary key,
name string not null
);

create table stock_price (
stock_id integer references stock (id);
stamp timestamp not null,
value integer not null,
constraint stock_price_pk primary key (stock_id, stamp)
);

create index stock_price_stamp_idx on stock_price (stamp);


your sql query optimizer (assuming it was programmed after the first world war and is not MS Access) will consider a backward index scan for queries like "select * from stock_price where stamp > $yesterday". This means you can dump millions of rows in stock_price and they won't noticeably affect that query's runtime.

SQL implementations put a lot of work into performing optimizations themselves so you can be free to focus on more important stuff.

Later, if your aggregate calculations are too slow (if you use internal sql mechanisms like groupby they can be surprisingly quick) you can build additional tables for optimization, and deal with the headaches that safely synchronizing your cache to your raw data usually brings, and the consequent loss of flexibility. But until (or if) that becomes necessary, why bother?
I think you are right.

As it's for a fun simulation I'm building more for myself right now, it wouldn't be so bad if I had to redesign a few loose ends in the future. I'll listen to something a wise person once told me. "Start small buddy and make it work. Don't worry about what's next, before you got the first step right, or you'll fall running before you learned to walk!" or something like that.

Thank you for your time and answer.

If you want to keep current and historical data separate for performance reasons, look into partitioning the table by date. This allows the data to be split by day/week/month etc. - a query referring to a particular date range will only have to hit the relevant partitions, which can also be indexed individually. Partitioning can also make your job easier if you need to archive or purge old data - partitions can be dealt with as almost self-contained units (e.g. dropping the oldest monthly partition at the end of each month).

Traditionally your live, transactional database will not contain many pre-calculated aggregations of data, but data in a normalized form - minimizing redundancy, not storing repeated information or data which can be calculated/rebuilt from other data. De-normalized data, pre-calculated summaries and so on are usually found in a data warehouse.

A primary motivation for keeping these things separate is that it is much easier to tune one database for a highly transactional workload (OLTP) and one for analytics/reporting than it is to tune a single database for both. You want your OLTP database to be as responsive as possible, which is difficult if heavy reporting/analytical work may be going on at the same time. I mention all of this as you are talking about performance - it may well be the case that initially the load on the database is not enough to become a problem, but it can't hurt to keep principles such as normalization in mind from the outset.

The "just wing it to get it working" mindset doesn't really make sense with a database. In the vast majority of a situations, apps are made to interact with databases instead of the other way around. Changing your database structure can cause a massive ripple effect in your application code so you really should put some forethought into the meat of what you're trying to do first.
Always strive to be better than yourself.
What if I don't access the database directly with my classes but map them out and use those alias names to access the database. If I happen to change the database, I only have to adjust the mapping I created earlier to match the new table layout. If I happen to change a data type of a column, I can search my project for possible conflicts by searching the whole solution for the mapped name I created.

With a little bit of foresighted planning and good naming conventions it should be possible to avoid at least a good bit of problems while "winging it out to get it working" (which - as I realize - stands in contrast to foresighted planning).
I'm not sure you need much detail of "intraday" prices and trades. What do you do with these prices? Do you use opening, closing, maximum, minimum prices? Do you care for traded amounts? Are you keeping semi-official permanent and complete records, or only data you need temporarily for a specific purpose?

I see only three plausible tables: security descriptions (key: ISIN; rarely edited), daily prices of a security (key: ISIN & date; containing all prices you care for; records added daily; never deleted or modified); and maybe all trades (key: synthetic; containing ISIN, price, amount, maybe parties; populated from some stock exchange firehose in real time; records possibly deleted when you're done).

There's nothing to "wing" in a simple database like this, except for the black arts of optimizing indexing and partitioning; adding or removing columns is a perfectly normal change.

Omae Wa Mou Shindeiru


I'm not sure you need much detail of "intraday" prices and trades.


I want those "intraDay" prices to use them for charts for the current day. When checking the charts at 13:15 I want all the prices of the current day until (now) in the chart(s), possible with a 15min interval. As mentioned before, I'm not sure if I need them later on. I think it's save to assume that I won't look at Friday the 13. in September 1929 and check the price of wood at 13:15.

As I mentioned, I won't have an opening and closing time which negates the open and close price. Therefor I planned to use several values of the day instead (every 3 or 6 hours for example) + min and max.



As a side note, I won't be working with real stock data. The system generates the data based on made up stocks and commodities with an algorithm I'm designing right now. I'll try to mirror the real world exchange as much as possible but I have a certain freedom in my implementation.


Thanks for your insight on a possible database layout. I'll keep that in mind when I start with the implementation.

What if I don't access the database directly with my classes but map them out and use those alias names to access the database. If I happen to change the database, I only have to adjust the mapping I created earlier to match the new table layout. If I happen to change a data type of a column, I can search my project for possible conflicts by searching the whole solution for the mapped name I created.

With a little bit of foresighted planning and good naming conventions it should be possible to avoid at least a good bit of problems while "winging it out to get it working" (which - as I realize - stands in contrast to foresighted planning).

If all you're doing is renaming a column or adding/removing one that's fine. However, if you change your superstructure at all (as in decide a few tables need to be reworked from the ground up) you're going to have massive rewrites. That's a problem with the wing it strategy. What works now may not work an hour from now. If you're using .Net look into the entity framework. It's a pretty robust data layer built into the .Net Framework.
Always strive to be better than yourself.

If you're using .Net look into the entity framework. It's a pretty robust data layer built into the .Net Framework.


Thanks for the tip. I'll be indeed working with ASP.Net (either 3.5 or 4) and C#.

This topic is closed to new replies.

Advertisement