Free Crash DB Course for Pierre and everyone :-)
Posted: Tue Jun 25, 2013 7:55 am
Because nowadays we speak a lot about DB optimisation, I thought it would be good to gather in a small compendium some things tied to databases. Feel free to comment & improve. And if one of the admins thinks this is useful, please make it sticky.
About the page size
The disk is slow, so any DB engine tries to minimize the time spent accessing it. Because every IO request carries a fixed overhead besides the useful time spent actually reading/writing data, since the most ancient ages of HDDs the file systems (and other high-performance storage systems - for ex. DB engines) and the HDD's firmware itself have stored and read the data in chunks called pages, clusters, sectors etc.
But how big should the page size be?
Things are more complicated than this, but let's take a simplistic approach. Our thumbnails are ~15 kB (if we use high-quality JPGs - with WebP we will reach ~10 kB). With a page size of 1024 bytes (1k) we need ~15 IO operations to bring one in. Way too much. If we raise the page size to 4k, we need only 4 IO operations. Much better. With an 8k page size we need only 2 operations, and with a page size of 16384 bytes (16k) we bring the entire thumb in with just one IO operation.
With greater page sizes (e.g. 32k) we reach the other extreme: each page is much bigger than what we need, and the time we lose reading useless data outweighs the time gained from reduced IO overhead.
The same holds for the other data which we must read from the disk: indexes, tables etc. Nowadays page sizes are usually around 8-16k.
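To make the arithmetic above concrete, here is a minimal sketch (Python; the ~15 kB thumbnail is just the example figure from above, not a measurement):

Code:
import math

THUMB_SIZE = 15 * 1024  # ~15 kB high-quality JPG thumbnail (example figure from the text)

for page_size in (1024, 4096, 8192, 16384, 32768):
    io_ops = math.ceil(THUMB_SIZE / page_size)   # page reads needed to fetch the whole thumb
    waste = io_ops * page_size - THUMB_SIZE      # bytes read from disk but never needed
    print(f"{page_size:>6} bytes/page -> {io_ops:>2} IO op(s), {waste:>5} wasted bytes")

Running it shows the trade-off from the text: 1k needs 15 reads (no waste), 16k needs a single read, and 32k still needs one read but wastes far more bytes. (In SQLite, for example, the page size can be set with PRAGMA page_size before the database is first written, or changed later with VACUUM.)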
About the cache
It is also worth mentioning here that the data read from the disk goes into a cache, which is why it is important to have a big cache. Lightroom is very greedy in this regard, and SQL Server will happily eat all the memory it can get. I mention this because if you try to do a "benchmark" to measure the benefits of the page size, make sure the data isn't already in the cache (i.e. you didn't touch it before) and be sure to actually fetch the records (for ex. to display the thumbs).
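As an illustration of why a cold cache matters when benchmarking, here is a minimal sketch assuming SQLite (the file name 'catalog.db' and table 'T1' are hypothetical): the second run of the same query reads the same pages from the cache (logical reads) instead of the disk (physical reads), so timing only warm runs says nothing about the page size.

Code:
import sqlite3, time

con = sqlite3.connect('catalog.db')           # hypothetical catalog database
con.execute('PRAGMA cache_size = -64000')     # ask SQLite for ~64 MB of page cache (negative value = kB)

def timed_fetch():
    t0 = time.perf_counter()
    rows = con.execute('SELECT * FROM T1').fetchall()   # really fetch the records, don't just execute
    return time.perf_counter() - t0, len(rows)

print('first run  (cold-ish, physical reads):', timed_fetch())
print('second run (warm, logical reads)     :', timed_fetch())

Note that the OS file cache can also keep the pages warm, so a truly cold measurement needs a freshly copied file or a reboot.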
About the queries
If you read carefully above, I stressed that you need to fetch the records in order to see what's really happening. Why? Because when we send a text like "SELECT * from T1" to the DB engine, a whole bunch of things happens (I pray I haven't forgotten something):
- Parsing & syntax checking. Rather Fast
- Generating the expression tree. So and so. OK, let's say that's rather fast too.
- Analysis. Well, this is a huge, complicated thing. The Query Optimizer does its best to return the first row of your query in the smallest amount of time (there are other types of optimizers - let's forget them for now). It reads a bunch of things in the DB (and not only - for ex. DB2's excellent optimizer also benchmarks the speed of your disks) and with complicated logic chooses the PLAN of the query (a small sketch after this list shows how to ask the engine which plan it chose). For more info (if you have courage) see
https://www.simple-talk.com/sql/sql-tra ... optimizer/
http://www.codeproject.com/Articles/555 ... ideo-expla
The above steps are called "query preparation". It is very important to make as few calls as possible to the preparation stage, because in loops it becomes a significant overhead. That's why almost all modern DB engines have the feature of "prepared queries".
- Databinding
- Allocating the internal data structures (Field Buffers, etc.)
- Execution. Slooow. The engine scans the indexes and the tables, checks whether the desired page(s) are in the cache - if yes, it fetches the data from there (a Logical Read), otherwise from the pages on disk (a Physical Read) - assembles the first record, fills the Field Buffers, sets the pointers in all joined tables and indexes for the next record, and yells outside: "Hey! The first record is ready!".
Usually this is by far the slowest single part of the entire process. That's why the biggest optimization one can do is to move from singleton queries to set queries. For example, from
Code:
for i:=1 to 100000 do
{
    cField:=SQLExec(Format('Select Foo from T1 where FolderID=888 and Filename=''%s''', [Filenames[i]])); // 100000 calls. Costs 100000 x preparation+execution time, 100000 x fetch time.
    //....process cField
}
to

Code:
myDataSet:=SQLExec('Select Foo from T1 where FolderID=888'); // one call. One preparation+execution time.
while not myDataSet.EOF do                                   // advance through the 100000 records
{
    //....scan & process the current record of myDataSet     // costs only 100000 x fetch time
    myDataSet.Next;
}
- Fetch. Now, with all the indexes read (cached) and all the pointers set, the DB engine advances through the data and sends the records, one by one, upon request to the caller (your program). This part is fast, and anyway (see above) you cannot live without it. Also, singletons (queries which return only one record) usually have one more condition (for example "WHERE Folder=888 and FILENAME='blah.jpg'" vs "WHERE Folder=888"), which usually raises the execution time significantly (it doubles it or more, because string comparison is slower than integer comparison).
- Unprepare. When the query is closed, the internal data structures (Field Buffers, different pointers, join lists etc.) are freed (deallocated) from memory. This is also a fast part, but if you do it 100000 times in a loop, one will start noticing.
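Regarding the Analysis step above: most engines let you peek at the plan the optimizer chose. A minimal sketch assuming SQLite (the table, column and index names are hypothetical, chosen to mirror the example queries):

Code:
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE T1 (FolderID INTEGER, Filename TEXT, Foo BLOB)')
con.execute('CREATE INDEX idx_folder ON T1 (FolderID)')

# Ask the optimizer which plan it chose, without executing the query itself.
for row in con.execute('EXPLAIN QUERY PLAN SELECT Foo FROM T1 WHERE FolderID = 888'):
    print(row)   # e.g. (..., 'SEARCH T1 USING INDEX idx_folder (FolderID=?)')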
As a conclusion: do as many set queries as you can, and when you cannot (nowadays this is usually limited to Stored Procedure calls), use prepared queries (the well-known Prepared Stored Procedure calls).
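To put the conclusion in runnable form, here is a hedged sketch with Python's sqlite3 module (reusing the hypothetical T1/FolderID/Filename names from the pseudocode above). The first function pays the whole prepare/execute/fetch/unprepare cycle once per file name; the second executes once and then only fetches; the '?' placeholder keeps the statement parameterized so the driver can reuse the prepared statement instead of re-parsing the SQL text on every call.

Code:
import sqlite3

con = sqlite3.connect('catalog.db')   # hypothetical catalog database

# Singleton style: one full query cycle per file name.
def get_foo_one_by_one(filenames):
    results = []
    for name in filenames:
        row = con.execute('SELECT Foo FROM T1 WHERE FolderID = 888 AND Filename = ?',
                          (name,)).fetchone()          # parameterized, so the statement stays reusable
        results.append(row)
    return results

# Set style: one execution, then only cheap fetches.
def get_foo_as_set():
    return [foo for (foo,) in
            con.execute('SELECT Foo FROM T1 WHERE FolderID = 888')]   # one execution time in total

Python's sqlite3 module keeps a small cache of prepared statements, so repeating the same parameterized SQL text avoids the preparation cost on each iteration; other drivers expose an explicit Prepare/Execute pair for the same purpose.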
----
Something is unclear?
Did I forget something?
Something is wrong?
PS: When I said "Free Crash DB Course" in the title, I didn't mean a way to crash the DB, but the crash-course, draft character of this paper. Forgive me for any mistakes.
