
Free Crash DB Course for Pierre and everyone :-)

Posted: Tue Jun 25, 2013 7:55 am
by m.Th.
Because nowadays we speak a lot about DB optimisation, I thought it would be good to gather in a small compendium a few things related to databases. Feel free to comment & improve. And if some admin thinks this is useful, please make it sticky.


About the page size

The disk is slow, so every DB engine tries to minimize disk access time. Every IO request carries a certain overhead besides the useful time actually spent reading/writing the data. That is why, since the most ancient ages of HDDs, the file systems (and other high-performance storage systems, for ex. DB engines) and the HDD's firmware itself store and read data from disk in chunks called pages, clusters, sectors etc.

But how big should the page size be?

Things are more complicated than that, but let's take a simplistic approach. We have thumbnails of ~15 KB (if we use high-quality JPEGs; with WebP we would reach ~10 KB). With a page size of 1024 bytes (1 KB) we need ~15 IO operations to bring one in. Way too many. If we raise the page size to 4 KB, we need only 4 IO operations. Much better. With an 8 KB page size we need only 2 operations, and with a page size of 16384 bytes we bring the entire thumbnail in just one IO operation.

With even greater page sizes (e.g. 32 KB) we reach the other extreme: each page is much bigger than what we need, and the time lost on useless reads outweighs the time gained from the reduced IO overhead.

The same holds for the other data we must read from disk: indexes, tables etc. Nowadays page sizes are usually around 8-16 KB.
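
In case you want to check this on a real catalog: assuming an SQLite-style engine (the PRAGMA below is SQLite's; other engines have their own setting), you can read the current page size like this. The arithmetic in the comments just restates the thumbnail example above.

Code: Select all

-- A sketch, assuming an SQLite-style engine (the PRAGMA name is SQLite's).
PRAGMA page_size;  -- reads the current page size, in bytes
-- IO operations per ~15 KB thumbnail = ceil(15360 / page_size):
-- 1024 -> 15 reads, 4096 -> 4, 8192 -> 2, 16384 -> 1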

About the cache

It is also worth mentioning that data read from disk goes into the cache; that's why it is important to have a big cache. Lightroom is very greedy in this regard. SQL Server eats all the memory it can get. I mention this because if you try to do a "benchmark" to measure the benefits of the page size, make sure the data is not already in the cache (i.e. you didn't touch it before) and be sure to actually fetch the records (for ex. by displaying the thumbs).
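
If the engine is SQLite (just an assumption on my part; the value below is an example, not a recommendation), the cache can be raised per connection with a PRAGMA:

Code: Select all

-- A sketch, assuming SQLite: positive values mean pages, negative values mean KiB.
PRAGMA cache_size = -65536;  -- request a ~64 MB page cache for this connection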

About the queries

If you read carefully above, I stressed that you need to fetch the records in order to see what's really happening. Why? Because when we send a text like "SELECT * from T1" to the DB engine, a whole bunch of things actually happens (I pray I don't forget something):

- Parsing & syntax checking. Rather fast.
- Generating the expression tree. So-so. Ok, let's say it is rather fast.
- Analysis. Well, this is a huge, complicated thing. The Query Optimizer does its best to return the first row of your query in the smallest amount of time (there are other types of optimizers; let's forget them for now). It reads a bunch of things in the DB (and not only; for ex. DB2's excellent optimizer also benchmarks the speed of your disks) and with complicated logic chooses the PLAN of the query. You can peek at the chosen plan yourself (see the sketch right after the links below). For more info (if you have courage) see

https://www.simple-talk.com/sql/sql-tra ... optimizer/
http://www.codeproject.com/Articles/555 ... ideo-expla
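
A minimal sketch of peeking at the plan, assuming SQLite (the T1/FolderID/Foo names come from the loop example further below):

Code: Select all

-- A sketch, assuming SQLite: ask which plan the optimizer chose.
EXPLAIN QUERY PLAN
Select Foo from T1 where FolderID = 888;
-- Typical output: "SEARCH T1 USING INDEX ..." (good) or "SCAN T1" (a full table scan).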

The steps above are called "Query preparation". It is very important to have as few calls as possible to the preparation stage, because it is a significant overhead in loops. That's why almost all modern DB engines offer "prepared queries".
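
To make that concrete, here is a minimal sketch of a prepared query using the SQLite C API (assuming SQLite; the T1/Foo/FolderID/Filename names are the ones from the loop example below). The expensive preparation runs once; inside the loop only bind, execute and fetch run:

Code: Select all

#include <stdio.h>
#include <sqlite3.h>

/* A sketch, assuming SQLite and the T1/Foo/FolderID/Filename names
   used in the loop example below. */
int lookup_all(sqlite3 *db, const char **filenames, int count)
{
    sqlite3_stmt *stmt;

    /* Preparation (parsing, expression tree, analysis) happens ONCE, here. */
    if (sqlite3_prepare_v2(db,
            "SELECT Foo FROM T1 WHERE FolderID = 888 AND Filename = ?",
            -1, &stmt, NULL) != SQLITE_OK)
        return -1;

    for (int i = 0; i < count; i++) {
        /* Only bind + execute + fetch inside the loop; no re-preparation. */
        sqlite3_bind_text(stmt, 1, filenames[i], -1, SQLITE_STATIC);
        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("%s\n", (const char *)sqlite3_column_text(stmt, 0));
        sqlite3_reset(stmt);  /* rewind the statement, keep the compiled plan */
    }
    sqlite3_finalize(stmt);   /* the "Unprepare" step, done once */
    return 0;
}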

- Databinding
- Allocating the internal data structures (Field Buffers, etc.)
- Execution. Slooow. The engine scans the indexes and the tables, checks whether the desired page(s) are in the cache (if yes, it fetches the data from there: a Logical Read; if not, from the pages on disk: a Physical Read), assembles the first record, fills the Field Buffers, sets the pointers in all joined tables and indexes for the next record and yells outside: "Hey! The first record is ready!".
Usually this is by far the slowest single part of the entire process. That's why by far the biggest optimization one can do is to move from singleton queries to set queries. For example, from

Code: Select all

for i=1 to 100000 do
{
  cField:=SQLExec(Format('Select Foo from T1 where FolderID=888 and Filename=''%s''', Filenames[i])); // 100000 calls: costs 100000 x preparation + 100000 x execution + 100000 x fetch time
  //....process cField
}
to

Code: Select all

myDataSet:=SQLExec('Select Foo from T1 where FolderID=888'); //one call. One execution time.
for i=1 to 100000 do //ok, this should be while not EOF...
{
//....scan & process myDataSet // it costs 100000 x fetch time
}
- Fetch. Now, with all the indexes read (cached) and all the pointers set, the DB engine advances through the data and sends the records, one by one, upon request to the caller (your program). This part is fast, and anyway (see above) you cannot live without it. Also, singletons (queries which return only one record) usually have one more condition (for example "WHERE Folder=888 and FILENAME='blah.jpg'" vs. "WHERE Folder=888"), which usually raises the execution time significantly (doubling it or more, because string comparison is slower than integer comparison).

- Unprepare. When the query is closed, the internal data structures (Field Buffers, different pointers, join lists etc.) are freed (deallocated) from memory. Also a fast part, but if you do it 100000 times in a loop, you will start noticing.


As a conclusion: use set queries as much as you can, and when you cannot (nowadays this is usually limited to Stored Procedure calls), use prepared queries (the well-known prepared Stored Procedure calls).

----

Is something unclear?
Did I forget something?
Is something wrong?

PS: The "Crash" in the title doesn't mean I'll show a way to crash the DB; it expresses the draft character of this paper. Forgive me for any mistakes. :D

Re: Free Crash DB Course for Pierre and everyone :-)

Posted: Tue Jun 25, 2013 8:33 am
by xnview
Thanks for this course :)
m.Th. wrote:

Code: Select all

for i=1 to 100000 do
{
  cField:=SQLExec(Format('Select Foo from T1 where FolderID=888 and Filename=''%s''', Filenames[i])); // 100000 calls: costs 100000 x preparation + 100000 x execution + 100000 x fetch time
  //....process cField
}
to

Code: Select all

myDataSet:=SQLExec('Select Foo from T1 where FolderID=888'); //one call. One execution time.
for i=1 to 100000 do //ok, this should be while not EOF...
{
//....scan & process myDataSet // it costs 100000 x fetch time
}
Is it what you call a table scan? I need to run a benchmark...

Re: Free Crash DB Course for Pierre and everyone :-)

Posted: Tue Jun 25, 2013 8:48 am
by m.Th.
xnview wrote: Thanks for this course :)
m.Th. wrote:

Code: Select all

for i=1 to 100000 do
{
  cField:=SQLExec(Format('Select Foo from T1 where FolderID=888 and Filename=''%s''', Filenames[i])); // 100000 calls: costs 100000 x preparation + 100000 x execution + 100000 x fetch time
  //....process cField
}
to

Code: Select all

myDataSet:=SQLExec('Select Foo from T1 where FolderID=888'); //one call. One execution time.
for i=1 to 100000 do //ok, this should be while not EOF...
{
//....scan & process myDataSet // it costs 100000 x fetch time
}
Is it what you call a table scan? I need to run a benchmark...
Yes. Re: the benchmark: do it and report back. But first I'd advise you to raise the memory cache (if you didn't already) and change the page size of your DB. I've posted an answer to you on how to change the page size there. For testing purposes, if you want to quickly change the page size, you can do it in an admin tool for the time being. Also, keep the admin tool handy in order to cross-check that your code changed the page size correctly.
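
For reference, assuming SQLite (just my assumption about the engine), changing the page size from code looks like this; note that on SQLite the new size only takes effect once the file is rebuilt:

Code: Select all

-- A sketch, assuming SQLite.
PRAGMA page_size = 16384;  -- request a 16 KB page size
VACUUM;                    -- rebuild the DB file so the new size takes effect
PRAGMA page_size;          -- cross-check: should now report 16384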

Re: Free Crash DB Course for Pierre and everyone :-)

Posted: Tue Jun 25, 2013 2:10 pm
by xnview
m.Th. wrote: Yes. Re: the benchmark: do it and report back. But first I'd advise you to raise the memory cache (if you didn't already) and change the page size of your DB. I've posted an answer to you on how to change the page size there. For testing purposes, if you want to quickly change the page size, you can do it in an admin tool for the time being. Also, keep the admin tool handy in order to cross-check that your code changed the page size correctly.
The main problem is that the file list to load often doesn't come from a single folder (categories, ...). So this tweak only applies when we browse one folder...

First test: loading the Images table (19000 files): one query per file => 4400 ms; a single query => 400 ms

Re: Free Crash DB Course for Pierre and everyone :-)

Posted: Wed Jun 26, 2013 6:09 am
by m.Th.
xnview wrote:
m.Th. wrote: Yes. Re: the benchmark: do it and report back. But first I'd advise you to raise the memory cache (if you didn't already) and change the page size of your DB. I've posted an answer to you on how to change the page size there. For testing purposes, if you want to quickly change the page size, you can do it in an admin tool for the time being. Also, keep the admin tool handy in order to cross-check that your code changed the page size correctly.
The main problem is that the file list to load often doesn't come from a single folder (categories, ...). So this tweak only applies when we browse one folder...

First test: loading the Images table (19000 files): one query per file => 4400 ms; a single query => 400 ms

Yes! (to the benchmark result)

No! (to "only when we browse one folder") We can tweak the tweak in order to make it work with any other field(s).

Pierre, users: where do you want to do your search(es)? Ratings, Colors, Categories, ImagesInfo (needs refactoring), EXIF (which fields?), IPTC (which fields?) ...something else?

I think we need feedback on the most-searched EXIF and IPTC fields in order to apply some tricks to boost search performance on them.
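
To give an idea of the kind of trick I mean, the first step is usually a plain index on the searched field. A sketch, with a hypothetical Exif table and DateTaken column (the real table/column names may differ):

Code: Select all

-- A sketch; the Exif table and DateTaken column are hypothetical names.
CREATE INDEX idx_exif_datetaken ON Exif (DateTaken);
-- A search like "WHERE DateTaken BETWEEN x AND y" can then use the index
-- instead of scanning the whole table.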

Whoever can answer the above questions, raise your hand.

Re: Free Crash DB Course for Pierre and everyone :-)

Posted: Thu Jun 27, 2013 8:21 am
by m.Th.
While waiting for the most searched fields, I will answer to...
We can tweak the tweak in order to work with any other field(s).
For example (the selected fields are just examples; of course you can change them):

How to select Rated and/or Colored photos:

Code: Select all

Select ImageID, Filename From Images where Rating in (1,2,3,4,5) and Color in (1,2,3);

How to select categories:

Code: Select all

Select ImageID, Filename from Images where ImageID in (Select TT.ImageID from TagsTree TT where TT.TagID in (888, 999, 1111));
Same for tagged files:

Code: Select all

Select ImageID, Filename from Images where ImageID in (Select AL.ImageID from AlbumsLink AL where AL.AlbumID = -1); -- "-1" is the magic value for the tagged files
...and same for any other album:

Code: Select all

Select ImageID, Filename from Images where ImageID in (Select AL.ImageID from AlbumsLink AL where AL.AlbumID in (7,8,9));
I leave it as homework to combine the WHERE clauses in order to make an aggregate search :D...

Re: Free Crash DB Course for Pierre and everyone :-)

Posted: Thu Jun 27, 2013 12:26 pm
by JohnFredC

Code: Select all

Select ImageID, Filename from Images where ImageID in (Select TT.ImageID from TagsTree TT where TT.TagID in (888, 999, 1111));
Just curious: why not a join here?

Re: Free Crash DB Course for Pierre and everyone :-)

Posted: Fri Jun 28, 2013 5:45 am
by m.Th.
JohnFredC wrote:

Code: Select all

Select ImageID, Filename from Images where ImageID in (Select TT.ImageID from TagsTree TT where TT.TagID in (888, 999, 1111));
Just curious: why not a join here?
As I said: "I leave as homework to combine..." :D

Hint: not a join, because with a join it is much harder (impossible?) to combine the WHERE clauses.
Another hint: you can have cBaseQuery == 'Select ImageID, Filename from Images ' and, according to what your user selects, concatenate the WHERE clauses with an AND (or OR) between them. :)
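
Just to sketch where that leads (reusing the example values from my earlier posts), a combined search could end up looking like this:

Code: Select all

Select ImageID, Filename from Images
 where Rating in (1,2,3,4,5)
   and ImageID in (Select TT.ImageID from TagsTree TT where TT.TagID in (888, 999, 1111));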