Multi-column correlation
We have all encountered this situation many times: the cost-based optimizer assumes no correlation between two columns (until 11g), and this erroneously reduces the cardinality estimate of a row source. Incorrect cardinality estimates
are one of the many root causes of SQL performance issues.
Consider the following example. The code below creates a table with four columns and populates it.
The first two columns, n1 and n2, have identical values.
create table t_vc as
select mod(n, 100) n1, mod(n, 100) n2 ,
mod(n, 50) n3 , mod(n, 20) n4
from (select level n from dual connect by level <= 10001);
There is a strong correlation between n1 and n2 above: n1 is always equal to n2.
Let’s collect statistics with histograms on all columns.
begin
dbms_stats.gather_table_stats( user, 'T_VC', estimate_percent => null, method_opt => 'for all columns size 254');
end;
/
Let's explain the plan for a query with a single predicate. Exactly 100 rows have the value 10 in n1.
explain plan for select count(*) from t_vc where n1=10;
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T_VC |   100 |   300 |     9   (0)| 00:00:01 |
---------------------------------------------------------------------------
Yes, the optimizer is able to estimate the cardinality of row source T_VC at step 2 as 100.
Now, let's add another predicate, n2=10. Since there is a strong correlation between n1 and n2,
adding the predicate n2=10 should not alter the row source cardinality. Right?
explain plan for select count(*) from t_vc where n1=10 and n2=10;
select * from table(dbms_xplan.display);
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     6 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_VC |     1 |     6 |     9   (0)| 00:00:01 |
---------------------------------------------------------------------------
Nope. The optimizer's cardinality estimate is far off, 100 times lower than the correct cardinality.
Notice the Rows column at step 2 above: it is just 1. Essentially, the CBO estimate boils down to this
simplified formula:
# of rows ~= total # of rows * (1/NDV of n1) * (1/NDV of n2)
           = 10000 * (1/100) * (1/100) = 1 row
where the selectivity of an equality predicate is 1/NDV in the simple case, without considering histograms.
Far from the truth! This SQL returns at least 100 rows.
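The independence assumption behind this estimate can be sketched in a few lines of Python (a minimal illustration, not Oracle code; the row count and NDVs mirror the example above):

```python
def estimated_rows(total_rows, ndvs):
    """Cardinality estimate for equality predicates on columns with the
    given numbers of distinct values (NDVs), assuming independence:
    the selectivities 1/NDV are simply multiplied together."""
    selectivity = 1.0
    for ndv in ndvs:
        selectivity *= 1.0 / ndv  # selectivity of one equality predicate
    return total_rows * selectivity

# n1 = 10 alone: 10000 * 1/100 = 100 rows -- matches reality
print(estimated_rows(10000, [100]))

# n1 = 10 and n2 = 10: 10000 * 1/100 * 1/100 = 1 row -- 100x too low,
# because n1 and n2 are perfectly correlated, not independent
print(estimated_rows(10000, [100, 100]))
```

The multiplication of selectivities is exactly where the estimate goes wrong for correlated columns: the second predicate filters out nothing in reality, yet the formula charges it a full 1/NDV.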
Extended stats
Oracle 11g introduces extended statistics to relieve some of this pain. In 11g, extended statistics can be created on a group of columns, enabling the CBO to consider the correlation between those columns' values.
SELECT dbms_stats.create_extended_stats(
ownname=>user, tabname => 'T_VC',
extension => '(n1, n2)' ) AS n1_n2_correlation
FROM dual;
N1_N2_CORRELATION
------------------------------
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
Let’s collect stats again on this table and check the SQL plan.
exec dbms_stats.gather_table_stats( user, 'T_VC', estimate_percent => null, method_opt => 'for all columns size 254');
explain plan for select count(*) from t_vc where n1=10 and n2=10;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 |  1200 |     9   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T_VC |   100 |  1200 |     9   (0)| 00:00:01 |
--------------------------------------------------------------------------
At last, the optimizer truly understands the correlation in 11g.
Under the wraps
1. Creating extended statistics adds a new virtual column to the table. Here is the line from the SQL trace:
alter table "CBQT"."T_VC" add (SYS_STUBZH0IHA7K$KEBJVXO5LOHAS as (sys_op_combined_hash(n1, n2)) virtual BY USER for statistics);
The virtual column name is cryptic and seems to be derived from the combination of table name and column names.
2. A new deterministic hash function, sys_op_combined_hash, is called by the optimizer to populate
this virtual column's values. This function returns a unique value for each unique combination of input values.
col h1 format 99999999999999999999999999
select sys_op_combined_hash (1,1) h1 from dual;
H1
---------------------------
7026129190895635777
select sys_op_combined_hash (1,2) h1 from dual;
H1
---------------------------
298332787864732998
Collecting histograms on all columns gathers a histogram on this virtual column as well.
Trace lines
Using the histogram on the virtual column, the CBO is able to determine that there is a strong correlation between these two columns.
This is visible in the 10053 trace output.
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T_VC[T_VC]
  Column (#1): NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  Column (#2): NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  Column (#5): NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2    CorStregth: 100.00
  ColGroup Usage:: PredCnt: 2 Matches Full: #0 Partial: Sel: 0.0100
  Table: T_VC Alias: T_VC
    Card: Original: 10001.000000 Rounded: 100 Computed: 100.00 Non Adjusted: 100.00
  Access Path: TableScan
    Cost: 9.11 Resp: 9.11 Degree: 0
      Cost_io: 9.00 Cost_cpu: 2404620
      Resp_io: 9.00 Resp_cpu: 2404620
  Best:: AccessPath: TableScan
         Cost: 9.11 Degree: 1 Resp: 9.11 Card: 100.00 Bytes: 0
Notice the ColGroup line and the CorStregth field above (spelled that way in the trace); it is set to 100. CorStregth is calculated
using the histogram on the virtual column, and the final cardinality estimate is multiplied by CorStregth:
# of rows ~= total # of rows * (1/NDV of n1) * (1/NDV of n2) * CorStregth
           = 10000 * (1/100) * (1/100) * 100 = 100 rows
Cardinality estimates match reality, at last.
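As a sketch of the corrected arithmetic, the adjustment can be reproduced by folding CorStregth into the same independence formula. Note the reconstruction of CorStregth from the column-group NDV below is an assumption for illustration; the 10053 trace reports the value (100.00) directly:

```python
def estimated_rows_with_colgroup(total_rows, ndv_n1, ndv_n2, ndv_group):
    """Cardinality estimate with a column-group correction.
    ndv_group is the NDV of the combined (n1, n2) virtual column."""
    # Independence-based selectivity, as before
    independent_sel = (1.0 / ndv_n1) * (1.0 / ndv_n2)
    # Hypothetical reconstruction of CorStregth: the ratio between the
    # NDV the columns would have if independent and their actual joint NDV.
    # For this data: 100 * 100 / 100 = 100, matching the trace.
    cor_strength = (ndv_n1 * ndv_n2) / ndv_group
    return total_rows * independent_sel * cor_strength

# 10000 * (1/100) * (1/100) * 100 = 100 rows
print(estimated_rows_with_colgroup(10000, 100, 100, 100))
```

Because n1 always equals n2, the joint NDV (100) is far smaller than the 10,000 distinct pairs independence would predict, and the correction exactly cancels the over-filtering of the second predicate.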
In the next section, we will discuss this further 😉
Well, okay, it is time to introduce myself. I specialize in Oracle performance tuning,
Oracle internals, and E-Business Suite. I have over 15 years of experience as an Oracle DBA; currently
I am employed with Pythian [thanks, Paul]. I am a regular presenter at major conferences such as
Hotsos, UKOUG, RMOUG, etc. I am also an OakTable member.
Some of my papers can be found on my personal blog:
my papers and presentations