Is table Union very expensive?

jwu1023

Member
Joined
Jan 9, 2006
Messages
13
Location
Memphis, TN
Programming Experience
3-5
Dear all,

I have a VB.Net application, which allows users to enter lots of data into a event table. Now the table is not larger. But later on it will become a huge table. My project now is creating a exactly same table as the current event table, and we will move more than 2-year-old data from current event table to the history table. I have to alter all the related programs to allow searching data in either one table. I have two strategies about the searching as following:

1. First search the current event table, if nothing found, then search the history event table.
I thought it is faster because most time our users only care about the recent data. And also it doesn't matter which year data is history table.
The drawback is when users search using wildcard, I have to union the searching results from both table.

2. Union the current event table and history event table, and then search the union table.
Because of the same reason that users most likely care about the recent data, most time searching the union table is costly and unnecessary. There is no problem for wildcard search issue.

Do you think my thought is right or not? I prefer to use the first strategy. Or do you have other better solution? Thanks a lot in advance.

jwu1023
 
Your table doesnt seem to be updated, so I recommend that you do the following:


Assuming your table is called lookup_data


Move the old data to a separate table with the same structure and the name lookup_data_aged
Identify the columns used for lookup and index them
Ensure the same columns are indexed in the "new" table too
Rename the table with the current data to lookup_data_current
Create a view that union-alls the two tables, the view will take the name of the original table
VB.NET:
  CREATE OR REPLACE VIEW lookup_data AS
  SELECT * FROM lookup_data_current
  UNION ALL
  SELECT * FROM lookup_data_aged

Youre done!
You dont need to alter the program; it will carry on SELECTing from lookup_data as it always did. Be careful to avoid duplicate rows (same row in both tables) -> UNION ALL is faster than UNION because it doesnt remove duplicate rows.

The execution plan for such a statement when I tested it on a pair of 100000 row suitably indexed (with a unique index) tables on my oracle instance, looked like:

VB.NET:
SELECT STATEMENT
  VIEW
    UNION-ALL PARTITION
      TABLE ACCESS BY INDEX ROWID
        INDEX UNIQUE SCAN
      TABLE ACCESS BY INDEX ROWID
        INDEX UNIQUE SCAN
this means that each row was probed by index scan to find matching rows; this operation is very fast. The limtied results were then unioned together.

Probing each table for matching rows using the index, and combining the low number of results is faster than combining the tables then searching 200 000 rows. If your rdbms of choice prefers to union the tables first, then search them you should return here and seek further assistance on ho0w to get your rdbms to behave, or upgrade the rdbms
 
I must ask though, why you are splitting the tables? What do you think "a lot of rows" actually is? 10 000? 100 000?
I have a few properly indexed table in my Oracle instance with more than a million rows, and they still return the results I want in fractions of a second because the indexes are set up correctly. If you think your rdbms cannot handle the data sizes (because it is something like MS Access) then you should consider upgrading to a better rdbms rather than prouring effort into shuffling round to keep with the existing point of failure. Oracle and SQL Server have free versions and are competent with greater than million-row tables.
 
Reply to cjard's answer

Thanks for your answers, cjard. It is very helpful.

I will use your "union all with SQL" strategy. The database I am using is IBM DB2. There are several indices built on the table "lookup_data", according to your strategy I will build the same indices on the table "lookup_data_aged".

Since in my SQL I will do search in both tables first by using certain criteria and then union all of these two tables, does it mean the indices that are used in the execution plan are those in the searching criteria instead of "rowid" in your example? When will the "rowid" index be used?

The reason why we split the table:
The application related with the table is used by more and more users across our different locations, which means that table size will become larger and larger in few years. Even though now we only have around 70,000 rows in the current "lookup_data" table. What I will do is for our future.

Thank you very much, cjard.
 
I doubt DB2 will have difficulty in performance with as small a table as 70,000 rows, and even growing it a hundredfold to 7 million shouldnt pose much of a problem; it's an enterprise class database system after all :)


Oracle's ROWID is the fastest way to access a table; every row has an automatic hidden ID number. The indexes in oracle are built so as to reference the rowid. If we have a table of 50 columns and 2 are indexed, several things can happen when we search.

The index is not very selective and hits a wide range of rows
The index is selective and returns few rows
The index is unique and returns a single row

I set up my tables so that they had unique indexes - i.e. i was replicating primary key functionality (but the PK is other columns in my case and im not allowed to change that) so oracle will use unique index scans

From the index it retrieves the rowid, then using the rowid it calculates the position on disk and directly retrieves the row. This is one of the fastest queries you can write. Table access in oracle is either by rowid or by full - there isnt an index called rowid, its just oracle's way of saying "directly accessing this row"


So rowid is an intermediary step peculier to oracle. The plan shown is read by indent level. most indented items are performed first. equal level indents are performed in parallel.

To cope with a query of the form:

CREATE VIEW blah AS
SELECT * FROM tbl1 WHERE x = c AND y = c
UNION ALL
SELECT * FROM tbl2 WHERE z = c

which is then used l;ike:

SELECT * FROM blah WHERE q = ?


x and y in table1 should be part of the index
z in table2 should be indexed
the indexes should also take on a participatory entity q in both cases

t1 index x y q
t2 index z q


the dbms optimizer should then be able to work out through internal query rewriting that both tables can be accessed in parallel using indexes


I still think splitting it isnt necessary! :)
 
Reply to cjard's answer - 2

Thank you so much for the detail answers. Now I totally understand what you said. Thanks again:)

Regarding whether to split the table or not, after reading your answers, I agree with you that we may not need to split table. If one year we have around 30,000 rows in that table with 46 columns, it may take us about 33 years to reach million rows in the table. As what you said, million rows in a table won't a problem for an enterprise class database.

Wish you have a nice weekend,

jwu1023
 
Back
Top