Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internal & E-business suite.

Correlation between column predicates

Posted by Riyaj Shamsudeen on March 21, 2008

Multi column correlation

We all have encountered this situation many times before: Cost based optimizer assumes no correlation between two columns (until 11g) and this has the effect of reducing cardinality of a row source erroneously. Incorrect cardinality estimates
are one of the many root causes for SQL performance issues.

Consider following example.

Following code creates a table and populates data with four columns. First two columns n1 and n2
has

create table t_vc as
select mod(n, 100) n1, mod(n, 100) n2 ,
mod(n, 50) n3 , mod(n, 20) n4
from (select level n from dual connect by level <= 10001);

There is a strong correlation between n1 and n2 above. N1 is always equals to N2.
Let’s collect statistics with histograms on all columns.
begin
dbms_stats.gather_Table_stats( user, ‘T_VC’, estimate_percent => null, method_opt => ‘for all columns size 254′);
end;
/

Let’s explain plan for query specifying one predicate. Exactly 100 rows with a value of 10.

explain plan for select count(*) from t_vc where n1=10;


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T_VC |   100 |   300 |     9   (0)| 00:00:01 |
---------------------------------------------------------------------------

Yes, optimizer is able to estimate cardinality to be 100 from row source T_VC at step 2.

Now, let’s add another predicate n2=10. Since there is a strong correlation between n1 and n2,
adding a predicate such as n2=10 should not alter row source cardinality. Right?

explain plan for select count(*) from t_vc where n1=10 and n2=10;

select * from table(dbms_xplan.display);


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     6 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_VC |     1 |     6 |     9   (0)| 00:00:01 |
--------------------------------------------------------------------------

Nope, Optimizer cardinality is far off and 100 times off the correct cardinality.

Notice rows column at step 2 above and it is just 1. Essentially, CBO estimate boils down to
simplified formula:
# of rows~= total # of rows * (1/NDV for n1) * (1/NDV for n2)
= 10000 * (1/100) * (1/100) =1 row.
where selectivity is (1/NDV) in simple case, without considering histograms.

Far from the truth! There are at least 100 rows returned for this SQL.

Extended stats

Oracle 11g introduces extended stats to relieve some pain. In 11g, an extended stats can be added between columns, enabling CBO to consider correlation between these column values.


SELECT dbms_stats.create_extended_stats(
ownname=>user, tabname => ‘T_VC’,
extension => ‘(n1, n2)’ ) AS n1_n2_correlation
FROM dual;
N1_n2_correlation
—————————————————
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS

Let’s collect stats again on this table and check the SQL plan.
exec dbms_stats.gather_Table_stats( user, ‘T_VC’, estimate_percent => null, method_opt => ‘for all columns size 254′);

explain plan for select count(*) from t_vc where n1=10 and n2=10;


--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time               |
--------------------------------------------------------------------------
| 0  | SELECT STATEMENT |        | 100 | 1200 | 9 (0)| 00:00:01          |
|* 1 | TABLE ACCESS FULL| T_VC   | 100 | 1200 | 9 (0)| 00:00:01
--------------------------------------------------------------------------

At last, optimizer truly understood the correlation in 11g.

Under the wrap

1. Adding an extended stats adds a new virtual column to the table. Here is the line from sqltrace.

alter table "CBQT"."T_VC" add (SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
       as (sys_op_combined_hash(n1, n2)) virtual BY USER for statistics);

Virtual column name is cryptic and seems to have been derived from table_name, column name combinations.

2. A new deterministic hash function sys_op_combined_hash called by optimizer to populate
this virtual column values. This function returns unique value for unique combination of values passed.

col h1 format 99999999999999999999999999
select sys_op_combined_hash (1,1) h1 from dual;
H1
—————————-
7026129190895635777

select sys_op_combined_hash (1,2) h1 from dual;
H1
—————————-
298332787864732998

Collecting histograms on all columns collects histograms on this virtual column also.

Trace lines

Using the above histogram, CBO is able to find that there is a strong correlation between these two columns.

This is visible in the 10053 output.


SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T_VC[T_VC]
  Column (#1):
    NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  Column (#2):
    NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  Column (#5):
    NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2     CorStregth: 100.00
  ColGroup Usage:: PredCnt: 2  Matches Full: #0  Partial:  Sel: 0.0100
  Table: T_VC  Alias: T_VC
    Card: Original: 10001.000000  Rounded: 100  Computed: 100.00  Non Adjusted: 100.00
  Access Path: TableScan
    Cost:  9.11  Resp: 9.11  Degree: 0
      Cost_io: 9.00  Cost_cpu: 2404620
      Resp_io: 9.00  Resp_cpu: 2404620
  Best:: AccessPath: TableScan
         Cost: 9.11  Degree: 1  Resp: 9.11  Card: 100.00  Bytes: 0

Notice the colgroup line and corStrength field above. It is set to 100. corStrength is calcualted
using histograms for virtual column and Final cardinality estimates are multipled by corStrength.

# of rows~= total # of rows * (1/NDV for n1) * (1/NDV for n2)*corStrength
= 10000 * (1/100) * (1/100)*100 =100 rows.

Cardinality estimates are matching with reality, at last.

In the next section, we will discuss this further ;-)

Well, okay, it is time to introduce myself. I specialize in Oracle performance tuning,
oracle internals and E-business suite. I have over 15 years of experience as an Oracle DBA, currently,
I am employed with pythian [ Thanks Paul]. I am regular presentor in major conferences such as
hotsos, ukoug, rmoug etc. I am also an Oaktable member.

Some of my papers can be found in my personal blog
my papers and presentations

2 Responses to “Correlation between column predicates”

  1. Kumar said

    Hi Riyaz:
    The presence of an index on n1, n2 is making a difference for cpu %cost column CREATE INDEX T_VC_IDX ON T_VC (N1, N2);. The cardinality estimate is still incorrect but which would be the weighing criteria in this case the %cpu cost, bytes or the cardinality.

    SQL-458> explain plan for select count(*) from t_vc where n1=10 and n2 = 10;

    Explained.

    SQL-458> select * from table(dbms_xplan.display);

    PLAN_TABLE_OUTPUT
    ——————————————————————————————————————————————————
    Plan hash value: 1953293106

    ——————————————————————————
    | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
    ——————————————————————————
    | 0 | SELECT STATEMENT | | 1 | 6 | 1 (0)| 00:00:01 |
    | 1 | SORT AGGREGATE | | 1 | 6 | | |
    |* 2 | INDEX RANGE SCAN| T_VC_IDX | 1 | 6 | 1 (0)| 00:00:01 |
    ——————————————————————————

    Predicate Information (identified by operation id):
    —————————————————

    2 – access(“N1″=10 AND “N2″=10)

    14 rows selected.

    SQL-458> explain plan for select count(*) from t_vc where n1=10;

    Explained.

    SQL-458> select * from table(dbms_xplan.display);

    PLAN_TABLE_OUTPUT
    ——————————————————————————————————————————————————
    Plan hash value: 1953293106

    ——————————————————————————
    | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
    ——————————————————————————
    | 0 | SELECT STATEMENT | | 1 | 3 | 2 (0)| 00:00:01 |
    | 1 | SORT AGGREGATE | | 1 | 3 | | |
    |* 2 | INDEX RANGE SCAN| T_VC_IDX | 100 | 300 | 2 (0)| 00:00:01 |
    ——————————————————————————

    Predicate Information (identified by operation id):
    —————————————————

    2 – access(“N1″=10)

    14 rows selected.

  2. […] (и к сожалению в версии 10.2  ещё недоступна процедура dbms_stats.create_extended_stats для сбора корелированной статистики для значений […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.

Join 219 other followers

%d bloggers like this: