Oracle database internals by Riyaj

Discussions about Oracle performance tuning, RAC, Oracle internals & E-Business Suite.

RAC, parallel query and udpsnoop

Posted by Riyaj Shamsudeen on June 20, 2009

I presented on various performance myths in my ‘battle of the nodes’ presentation. One of the myths was about how spawning parallel query slaves across multiple RAC instances can cause a major bottleneck in the interconnect. In fact, that myth was a direct result of a lessons-learned presentation from a client engagement. The client was suffering from performance issues, with enormous global cache waits running into 30+ms average response time for global cache CR traffic and crippling application performance. Essentially, their data warehouse was running hundreds of parallel queries concurrently, with slaves spawning across three RAC instances.

Of course, I had to hide the client details, so I simplified the myth into a test case. Looks like either (a) my test case was bad, (b) I ran into some sort of bug in version 9.2.0.5, or (c) I made a mistake in my analysis somewhere. Most likely it is the last one :-(. Greg Rahn questioned that example, and this topic deserves more research to understand it a little bit further. At this point I don’t have a 9.2.0.5 database available and my database is 10.2.0.4, so we will test this in 10.2.0.4.

udpsnoop

UDP is one of the protocols used for cache fusion traffic in RAC, and it is the Oracle recommended protocol. For this article, we need to measure the size of the UDP traffic. Measuring global cache traffic using AWR reports was not precise, so I decided to use a DTrace toolkit script, udpsnoop.d, to measure the traffic between RAC nodes. There are two RAC nodes in this setup. You can read more about udpsnoop.d, and the tool can be downloaded as part of the DTrace toolkit. Output of this script is of the form:

PID        LADDR           LPORT           DR         RADDR           RPORT                 SIZE
---------- --------------- --------------- ---------- --------------- --------------- -----------
15393      1.1.59.192      38395           ->         2.1.59.192      40449                 8240
...

In the output above, PID 15393 sent a UDP packet of size 8240 bytes from IP address 192.59.1.1 to 192.59.1.2, with local port 38395 and remote port 40449 (note that the script appears to print the address octets in reverse order). As UDP traffic flies between the nodes, udpsnoop.d prints each packet to a file. So, I start collecting udpsnoop output just before our script runs and end that collection immediately after the script completes. Of course, we need to aggregate this data and play with it a little bit, so we will also create an external table on top of this output file.

REM This is to read the udpsnoop output file as an external table
drop table  external_udpsnoop;
 
create table external_udpsnoop
(
  c_uid varchar2(10),
  pid varchar2(10),
  laddr varchar2(15),
  lport varchar2(15),
  dr   varchar2(10), 
  raddr varchar2(15),
  rport varchar2(15),
  c_size   varchar2(10),
  cmd  varchar2(15)
)
organization external (
  type oracle_loader
  default directory UTL_FILE_DIR
  access parameters (
      records delimited by newline
      badfile APPS_DATA_FILE_DIR:'xtern_rpt.bad'
      logfile APPS_DATA_FILE_DIR:'xtern_rpt.log'
      discardfile APPS_DATA_FILE_DIR:'xtern_rpt.dsc'
      fields terminated by whitespace 
      missing field values are null
   )
  location ('udpsnoop_ra_join_2_8th_iter.lst')
 )
  reject limit 1000
/
REM Reject limit is set high since there are a few records with junk output, possibly due to unstable fbt probes in DTrace.
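
With the external table in place, the traffic can be aggregated with plain SQL. A minimal sketch (column names match the DDL above; the filter on c_size skips the junk records just mentioned):

REM total UDP bytes sent per process and remote port
select pid, laddr, raddr, rport,
       sum(to_number(c_size)) total_bytes,
       count(*) packets
from external_udpsnoop
where regexp_like(c_size, '^[0-9]+$')
group by pid, laddr, raddr, rport
order by total_bytes desc;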

Test case #1: Hash join – slaves from all instances

First, let’s test a hash join to show how UDP traffic flows between these ports. In the test case below, we join 100 million rows of a big table to another 100 million rows of the same table. rownum is used so that the script completes in a reasonable time; without it, this ran for a few hours before running into errors. The SQL selects a few non-indexed columns so that it must do full table scans. It also hints for 8 slaves from each of 2 instances, i.e. a requested DOP of 16.

Both instances can participate in this PQ operation since parallel_instance_group is set to ALL at the session level.

alter session set parallel_instance_group='ALL';
select /*+ parallel ( t1, 8,2)   parallel (t2, 8, 2)  */
min (t1.CUSTOMER_TRX_LINE_ID +t2.CUSTOMER_TRX_LINE_ID ) , max ( t1.SET_OF_BOOKS_ID+t2.set_of_books_id  ), avg(t1.SET_OF_BOOKS_ID +t2.set_of_books_id), 
	avg( t1.QUANTITY_ORDERED + t2.QUANTITY_ORDERED ), max(t1.ATTRIBUTE_CATEGORY  ), max(t2.attribute1) , max(t1.attribute2)
from 
  (select * from BIG_TABLE where rownum <=100000000)t1 , 
  (select * from BIG_TABLE where rownum <=100000000)t2
where t1.CUSTOMER_TRX_LINE_ID = t2.CUSTOMER_TRX_LINE_ID
;

PQ in operation

We will also use another script to make sure that the SQL is indeed getting 8 slaves in each instance. I don’t remember where I got this SQL to pull slave information (maybe from Doug Burns, thanks!), but I modified it a little bit for RAC.

   select
      s.inst_id,
      decode(px.qcinst_id,NULL,s.username,
            ' - '||lower(substr(s.program,length(s.program)-4,4) ) ) "Username",
      decode(px.qcinst_id,NULL, 'QC', '(Slave)') "QC/Slave" ,
      to_char( px.server_set) "Slave Set",
      to_char(s.sid) "SID",
      decode(px.qcinst_id, NULL ,to_char(s.sid) ,px.qcsid) "QC SID",
      px.req_degree "Requested DOP",
     px.degree "Actual DOP", p.spid
   from
     gv$px_session px,
     gv$session s, gv$process p
   where
     px.sid=s.sid (+) and
     px.serial#=s.serial# and
     px.inst_id = s.inst_id
     and p.inst_id = s.inst_id
     and p.addr=s.paddr
  order by 5 , 1 desc
SQL> /
   INST_ID Username     QC/Slave   Slave Set  SID    QC SID Requested DOP Actual DOP SPID
---------- ------------ ---------- ---------- ------ ------ ------------- ---------- ------------
         1 SYS          QC                    10931  10931                           7366
         1  - p000      (Slave)    1          10925  10931             16         16 24762
         1  - p001      (Slave)    1          10956  10931             16         16 24764
         1  - p002      (Slave)    1          10955  10931             16         16 24766
         1  - p003      (Slave)    1          10918  10931             16         16 24768
         1  - p004      (Slave)    1          10941  10931             16         16 24778
         1  - p005      (Slave)    1          10916  10931             16         16 24781
         1  - p006      (Slave)    1          10945  10931             16         16 24787
         1  - p007      (Slave)    1          10922  10931             16         16 24795
         2  - p000      (Slave)    1          10943  10931             16         16 16920
         2  - p001      (Slave)    1          10961  10931             16         16 16923
         2  - p002      (Slave)    1          10920  10931             16         16 16970
         2  - p003      (Slave)    1          10946  10931             16         16 16972
         2  - p004      (Slave)    1          10935  10931             16         16 16974
         2  - p005      (Slave)    1          10934  10931             16         16 16976
         2  - p006      (Slave)    1          10899  10931             16         16 16988
         2  - p007      (Slave)    1          10940  10931             16         16 16991


         1 SYS          QC                    10927  10927                           9476
         2  - pz99      (Slave)    1          10890  10927              2          2 17723
         1  - pz99      (Slave)    1          10912  10927              2          2 25875
20 rows selected.

From the output above, 8 slaves were allocated from instance 1 and 8 from instance 2, with the query coordinator (SPID 7366) running in instance 1. The sample above also captured my own session accessing the gv$ views (notice pz99: slaves used for gv$ access follow a different PQ slave naming convention from 10.2 onwards).

Results

We have established that slaves were allocated from multiple instances and that udpsnoop is capturing the UDP packet sizes between them. We also have the external table mapped to the udpsnoop output file, so we can query this data. The script completed in 1500 seconds. I mapped the output of the PX slaves query above to the UDP external table (a sketch of that mapping query appears after the observations below), and here is the table I put together showing the PQ slaves and their UDP traffic in bytes:

  INST Username     QC/Slave   Slave SID   QC SID Req. Actual   SPID  LADDR      RADDR      RPORT SIZE
                                 Set              DOP  DOP
------ ------------ ---------- ----- ----  ------ ---  ------- ------ ---------- ---------- ----- ---------
     1 SYS          QC               10931  10931               7366  
     1  - p000      (Slave)    1     10925  10931  16       16 24762
     1  - p001      (Slave)    1     10956  10931  16       16 24764
     1  - p002      (Slave)    1     10955  10931  16       16 24766
     1  - p003      (Slave)    1     10918  10931  16       16 24768
     1  - p004      (Slave)    1     10941  10931  16       16 24778
     1  - p005      (Slave)    1     10916  10931  16       16 24781
     1  - p006      (Slave)    1     10945  10931  16       16 24787
     1  - p007      (Slave)    1     10922  10931  16       16 24795
     2  - p000      (Slave)    1     10943  10931  16       16 16920  2.1.59.192 2.1.59.192 62783 127068484
     2  - p001      (Slave)    1     10961  10931  16       16 16923  2.1.59.192 2.1.59.192 62783 126904080
     2  - p002      (Slave)    1     10920  10931  16       16 16970  2.1.59.192 2.1.59.192 62783 127767353
     2  - p003      (Slave)    1     10946  10931  16       16 16972  2.1.59.192 2.1.59.192 62783 128154145
     2  - p004      (Slave)    1     10935  10931  16       16 16974  2.1.59.192 2.1.59.192 62783 128096875
     2  - p005      (Slave)    1     10934  10931  16       16 16976  2.1.59.192 2.1.59.192 62783 126057311
     2  - p006      (Slave)    1     10899  10931  16       16 16988  2.1.59.192 2.1.59.192 62783 128228830
     2  - p007      (Slave)    1     10940  10931  16       16 16991  2.1.59.192 2.1.59.192 62783 127471579

A few important points to make here:

  1. In this case, all slaves running in node 2 were talking to one UDP port in node 1 (port 62783). lsof shows that PID 7366 (the query coordinator) was listening on that UDP port. The point is that these slaves from node 2 were sending packets to the coordinator.
  2. Interestingly, there is no UDP traffic from instance 1 to instance 2. I think that is due to the nature of the aggregation in this SQL; presumably the slaves on node 1 deliver their results to the coordinator locally, since the coordinator also runs on node 1, so nothing needs to cross the interconnect in that direction.
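
For reference, the mapping in the table above can be reproduced by joining the PX slave listing to the external table. A sketch (it assumes udpsnoop’s PID column matches gv$process.spid on the same node, skips non-numeric junk rows, and is approximate since OS PIDs can collide across nodes):

select s.inst_id, p.spid, u.laddr, u.raddr, u.rport,
       sum(to_number(u.c_size)) udp_bytes
from gv$px_session px, gv$session s, gv$process p, external_udpsnoop u
where px.sid = s.sid
  and px.serial# = s.serial#
  and px.inst_id = s.inst_id
  and p.inst_id = s.inst_id
  and p.addr = s.paddr
  and u.pid = p.spid
  and regexp_like(u.c_size, '^[0-9]+$')
group by s.inst_id, p.spid, u.laddr, u.raddr, u.rport
order by s.inst_id, p.spid;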

A few minutes later..

Interestingly, a few minutes later, while I was watching the UDP traffic, a few other processes kicked in and started generating UDP traffic. I re-queried the database to see what these processes were. The query had allocated 16 more slaves, 8 more running in node 1 and 8 more in node 2 [processes p008-p015 below]. These slaves were talking to a different UDP port, 62989, on which coordinator process 7366 in node 1 was also listening.

  INST Username     QC/Slave   Slave SID   QC SID Req. Actual   SPID  LADDR      RADDR      RPORT SIZE
                                 Set              DOP  DOP
------ ------------ ---------- ----- ----  ------ ---  ------- ------ ---------- ---------- ----- ---------
     1 SYS          QC               10931  10931               7366  
     1  - p000      (Slave)    1     10925  10931  16       16 24762
     1  - p001      (Slave)    1     10956  10931  16       16 24764
     1  - p002      (Slave)    1     10955  10931  16       16 24766
     1  - p003      (Slave)    1     10918  10931  16       16 24768
     1  - p004      (Slave)    1     10941  10931  16       16 24778
     1  - p005      (Slave)    1     10916  10931  16       16 24781
     1  - p006      (Slave)    1     10945  10931  16       16 24787
     1  - p007      (Slave)    1     10922  10931  16       16 24795

     1  - p008      (Slave)    1     10958  10931  16       16 24798
     1  - p009      (Slave)    1     10938  10931  16       16 24818
     1  - p010      (Slave)    1     10965  10931  16       16 24836
     1  - p011      (Slave)    1     10953  10931  16       16 24838
     1  - p012      (Slave)    1     10946  10931  16       16 24841
     1  - p013      (Slave)    1     10929  10931  16       16 24843
     1  - p014      (Slave)    1     10919  10931  16       16 24853
     1  - p015      (Slave)    1     10942  10931  16       16 24855

     2  - p000      (Slave)    1     10943  10931  16       16 16920  2.1.59.192 2.1.59.192 62783 127068484
     2  - p001      (Slave)    1     10961  10931  16       16 16923  2.1.59.192 2.1.59.192 62783 126904080
     2  - p002      (Slave)    1     10920  10931  16       16 16970  2.1.59.192 2.1.59.192 62783 127767353
     2  - p003      (Slave)    1     10946  10931  16       16 16972  2.1.59.192 2.1.59.192 62783 128154145
     2  - p004      (Slave)    1     10935  10931  16       16 16974  2.1.59.192 2.1.59.192 62783 128096875
     2  - p005      (Slave)    1     10934  10931  16       16 16976  2.1.59.192 2.1.59.192 62783 126057311
     2  - p006      (Slave)    1     10899  10931  16       16 16988  2.1.59.192 2.1.59.192 62783 128228830
     2  - p007      (Slave)    1     10940  10931  16       16 16991  2.1.59.192 2.1.59.192 62783 127471579

     2  - p008      (Slave)    1     10911  10931  16       16 16993  2.1.59.192 2.1.59.192 62989 182053370
     2  - p009      (Slave)    1     10949  10931  16       16 16995  2.1.59.192 2.1.59.192 62989 182490908
     2  - p010      (Slave)    1     10951  10931  16       16 17000  2.1.59.192 2.1.59.192 62989 181899025
     2  - p011      (Slave)    1     10890  10931  16       16 17007  2.1.59.192 2.1.59.192 62989 181858294
     2  - p012      (Slave)    1     10972  10931  16       16 17009  2.1.59.192 2.1.59.192 62989 182104499
     2  - p013      (Slave)    1     10950  10931  16       16 17011  2.1.59.192 2.1.59.192 62989 182334705
     2  - p014      (Slave)    1     10902  10931  16       16 17013  2.1.59.192 2.1.59.192 62989 181611641
     2  - p015      (Slave)    1     10955  10931  16       16 17023  2.1.59.192 2.1.59.192 62989 181816693

In real life..

Summing this up, approximately 2.4GB of UDP traffic was generated by one parallel query (the SIZE column above adds up to roughly 2.4GB). Can you imagine what will happen if this inter-instance parallelism is allowed in data warehouse queries scanning many tables and partitions with many hash joins? Obviously, this can saturate the interconnect quickly, and performance will suffer. Our solution was to disallow parallel queries from spawning slaves across multiple instances: all slaves then run within an instance boundary, and the effect of this change was immediately visible in the client environment. Back to our test: parallel_execution_message_size was set to 8192. Increasing this parameter would decrease elapsed time a little bit, but the concern here is saturating the interconnect, not just the elapsed time of this one query.
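
The change itself is a one-parameter adjustment. A sketch of the approach (INST1 matches the instance group used in the tkprof comparison below; it assumes each instance lists its own group in the static instance_groups parameter, e.g. instance_groups='INST1','ALL' on node 1, and 'RAC1' is a hypothetical instance name):

REM restrict slave allocation to the local instance group ('RAC1' is hypothetical)
alter system set parallel_instance_group='INST1' scope=both sid='RAC1';
REM or just for the current session:
alter session set parallel_instance_group='INST1';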

Further, I ran this query a few times with parallel_instance_group set to one instance and then to all instances. Spawning across multiple instances in fact increased elapsed time too (roughly 1510 seconds versus 1332 seconds):

parallel_instance_group :ALL
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.20       0.26          0         76          0           0
Fetch        2   1481.76    1509.95     701158         76          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4   1481.96    1510.22     701158        152          0           1

parallel_instance_group :INST1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.20       0.23          0         76          0           0
Fetch        2   1321.05    1331.67     701344         76          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4   1321.25    1331.90     701344        152          0           1

What about the original example?

Of course, let’s talk about that original example too. In that example, there was just one table and the data was aggregated.

select /*+ parallel ( t1, 8,2)  */
min (t1.CUSTOMER_TRX_LINE_ID +t1.CUSTOMER_TRX_LINE_ID ) , max ( t1.SET_OF_BOOKS_ID+t1.set_of_books_id  ), avg(t1.SET_OF_BOOKS_ID +t1.set_of_books_id),
        avg( t1.QUANTITY_ORDERED + t1.QUANTITY_ORDERED ), max(t1.ATTRIBUTE_CATEGORY  ), max(t1.attribute1) , max(t1.attribute2)
from
  BIG_TABLE t1
;

Measuring the UDP traffic, it is visible that for this huge table, interconnect traffic is kept minimal. It looks like there are optimization techniques for this single-table aggregation query that keep cache fusion traffic to a minimal level: each slave on node 2 sent just 2,152 bytes (roughly 17KB in total, versus about 2.4GB for the hash join), presumably because each slave aggregates its rows locally and ships only a tiny partial result. This convinces me that the SQL in that myth was a bad example, but the myth itself is still a myth. I should have used the original SQL, joining multiple tables with hash joins, for my presentation, but as a consultant I have a responsibility to keep client information confidential. At the end of the day, they pay for my bread.

   INST_ID Username     QC/Slave   Slave Set  SID    QC SID Requested DOP Actual DOP SPID		  Size
---------- ------------ ---------- ---------- ------ ------ ------------- ---------- ------------ -------
         1 SYS          QC                    10933  10933                           3314
         1  - p000      (Slave)    1          10958  10933             16         16 24762
         1  - p001      (Slave)    1          10948  10933             16         16 24764
         1  - p002      (Slave)    1          10953  10933             16         16 24766
         1  - p003      (Slave)    1          10925  10933             16         16 24768
         1  - p004      (Slave)    1          10916  10933             16         16 24778
         1  - p005      (Slave)    1          10938  10933             16         16 24781
         1  - p006      (Slave)    1          10951  10933             16         16 24787
         1  - p007      (Slave)    1          10946  10933             16         16 24795

         2  - p000      (Slave)    1          10949  10933             16         16 16920            2152
         2  - p001      (Slave)    1          10937  10933             16         16 16923            2152
         2  - p002      (Slave)    1          10946  10933             16         16 16970            2152
         2  - p003      (Slave)    1          10956  10933             16         16 16972            2152
         2  - p004      (Slave)    1          10902  10933             16         16 16974            2152
         2  - p005      (Slave)    1          10981  10933             16         16 16976            2152
         2  - p006      (Slave)    1          10899  10933             16         16 16988            2152
         2  - p007      (Slave)    1          10927  10933             16         16 16991            2152


         1 SYS          QC                    10945  10945                           3527
         1  - pz99      (Slave)    1          10942  10945              2          2 25875
         2  - pz99      (Slave)    1          10962  10945              2          2 17723           72344

Summary

In summary, having too many parallel query slaves spawning across multiple instances can cripple the interconnect. There are optimization techniques that seem to help in the case of single-table aggregation, but that must be considered an exception. I have modified the presentation a little bit (links below), and will do a second, complete update of the presentation later:

Battle of the nodes RAC performance myths doc
Battle of the nodes RAC performance myths ppt

16 Responses to “RAC, parallel query and udpsnoop”

  1. Greg Rahn said

    Can you post the execution plan for your test case #1?

    • Here you go:

      dbms_xplan.output:
      --------------------------------------------------------------------------------------------------------------------------------------------
      | Id  | Operation                 | Name                      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
      --------------------------------------------------------------------------------------------------------------------------------------------
      |   0 | SELECT STATEMENT          |                           |     1 |   249 |       |  2846K  (4)| 01:59:01 |        |      |            |
      |   1 |  SORT AGGREGATE           |                           |     1 |   249 |       |            |          |        |      |            |
      |*  2 |   HASH JOIN               |                           |   100M|    23G|   762M|  2846K  (4)| 01:59:01 |        |      |            |
      |   3 |    VIEW                   |                           |   100M|    10G|       |  1214K  (5)| 00:50:46 |        |      |            |
      |*  4 |     COUNT STOPKEY         |                           |       |       |       |            |          |        |      |            |
      |   5 |      PX COORDINATOR       |                           |       |       |       |            |          |        |      |            |
      |   6 |       PX SEND QC (RANDOM) | :TQ10000                  |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | P->S | QC (RAND)  |
      |*  7 |        COUNT STOPKEY      |                           |       |       |       |            |          |  Q1,00 | PCWC |            |
      |   8 |         PX BLOCK ITERATOR |                           |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | PCWC |            |
      |   9 |          TABLE ACCESS FULL| BIG_TABLE_NAME_CHANGED_12 |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | PCWP |            |
      |  10 |    VIEW                   |                           |   100M|    12G|       |  1214K  (5)| 00:50:46 |        |      |            |
      |* 11 |     COUNT STOPKEY         |                           |       |       |       |            |          |        |      |            |
      |  12 |      PX COORDINATOR       |                           |       |       |       |            |          |        |      |            |
      |  13 |       PX SEND QC (RANDOM) | :TQ20000                  |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | P->S | QC (RAND)  |
      |* 14 |        COUNT STOPKEY      |                           |       |       |       |            |          |  Q2,00 | PCWC |            |
      |  15 |         PX BLOCK ITERATOR |                           |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | PCWC |            |
      |  16 |          TABLE ACCESS FULL| BIG_TABLE_NAME_CHANGED_12 |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | PCWP |            |
      --------------------------------------------------------------------------------------------------------------------------------------------
      Predicate Information (identified by operation id):
      ---------------------------------------------------
      
         2 - access("T1"."n1"="T2"."n1")
         4 - filter(ROWNUM<=100000000)
         7 - filter(ROWNUM<=100000000)
        11 - filter(ROWNUM<=100000000)
        14 - filter(ROWNUM<=100000000)
      

      tkprof output:

      Rows     Row Source Operation
      -------  ---------------------------------------------------
            1  SORT AGGREGATE (cr=152 pr=701158 pw=701127 time=1510221226 us)
      98976295   HASH JOIN  (cr=152 pr=701158 pw=701127 time=1244490336 us)
      100000000    VIEW  (cr=76 pr=0 pw=0 time=200279054 us)
      100000000     COUNT STOPKEY (cr=76 pr=0 pw=0 time=200279023 us)
      100000000      PX COORDINATOR  (cr=76 pr=0 pw=0 time=100270084 us)
            0       PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
            0        COUNT STOPKEY (cr=0 pr=0 pw=0 time=0 us)
            0         PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
            0          TABLE ACCESS FULL BIG_TABLE_NAME_CHANGED_12 (cr=0 pr=0 pw=0 time=0 us)
      100000000    VIEW  (cr=76 pr=0 pw=0 time=300298770 us)
      100000000     COUNT STOPKEY (cr=76 pr=0 pw=0 time=200298726 us)
      100000000      PX COORDINATOR  (cr=76 pr=0 pw=0 time=200279954 us)
            0       PX SEND QC (RANDOM) :TQ20000 (cr=0 pr=0 pw=0 time=0 us)
            0        COUNT STOPKEY (cr=0 pr=0 pw=0 time=0 us)
            0         PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
            0          TABLE ACCESS FULL BIG_TABLE_NAME_CHANGED_12 (cr=0 pr=0 pw=0 time=0 us)
      

      Of course, I tried different variations of this SQL with the pq_distribute hint, trying to reduce GC traffic. For example, the following increases GC traffic tremendously.

      
      explain plan for
      select /*+ parallel ( t1, 8,2)   parallel (t2, 8, 2) pq_distribute(t1,hash,hash) */
      min (t1.n1 + t2.n1 ) , max ( t1.n2+t2.n2  ), avg(t1.n3+t2.n3),
              avg( t1.q1 + t2.q1 ), max(t1.a1  ), max(t2.c1) , max(t1.c2)
      from
        BIG_TABLE_NAME_CHANGED_12 t1 ,
        BIG_TABLE_NAME_CHANGED_12 t2
      where t1.n1 = t2.n1
      ;
      
      --------------------------------------------------------------------------------------------------------------------------------------------
      | Id  | Operation                 | Name                      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
      --------------------------------------------------------------------------------------------------------------------------------------------
      |   0 | SELECT STATEMENT          |                           |     1 |    43 |       |  2787K  (2)| 01:56:31 |        |      |            |
      |   1 |  SORT AGGREGATE           |                           |     1 |    43 |       |            |          |        |      |            |
      |   2 |   PX COORDINATOR          |                           |       |       |       |            |          |        |      |            |
      |   3 |    PX SEND QC (RANDOM)    | :TQ10002                  |     1 |    43 |       |            |          |  Q1,02 | P->S | QC (RAND)  |
      |   4 |     SORT AGGREGATE        |                           |     1 |    43 |       |            |          |  Q1,02 | PCWP |            |
      |*  5 |      HASH JOIN            |                           |   416M|    16G|   719M|  2787K  (2)| 01:56:31 |  Q1,02 | PCWP |            |
      |   6 |       PX RECEIVE          |                           |   416M|  6749M|       |  1177K  (2)| 00:49:14 |  Q1,02 | PCWP |            |
      |   7 |        PX SEND HASH       | :TQ10000                  |   416M|  6749M|       |  1177K  (2)| 00:49:14 |  Q1,00 | P->P | HASH       |
      |   8 |         PX BLOCK ITERATOR |                           |   416M|  6749M|       |  1177K  (2)| 00:49:14 |  Q1,00 | PCWC |            |
      |   9 |          TABLE ACCESS FULL| BIG_TABLE_NAME_CHANGED_12 |   416M|  6749M|       |  1177K  (2)| 00:49:14 |  Q1,00 | PCWP |            |
      |  10 |       PX RECEIVE          |                           |   416M|    10G|       |  1178K  (2)| 00:49:15 |  Q1,02 | PCWP |            |
      |  11 |        PX SEND HASH       | :TQ10001                  |   416M|    10G|       |  1178K  (2)| 00:49:15 |  Q1,01 | P->P | HASH       |
      |  12 |         PX BLOCK ITERATOR |                           |   416M|    10G|       |  1178K  (2)| 00:49:15 |  Q1,01 | PCWC |            |
      |  13 |          TABLE ACCESS FULL| BIG_TABLE_NAME_CHANGED_12 |   416M|    10G|       |  1178K  (2)| 00:49:15 |  Q1,01 | PCWP |            |
      --------------------------------------------------------------------------------------------------------------------------------------------
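
      For completeness, this is the shape of one more variation (a hypothetical sketch; broadcasting one input to every consumer slave instead of hash-distributing both sides only helps when that input is small, which is certainly not the case in this self-join):

      explain plan for
      select /*+ parallel ( t1, 8,2)   parallel (t2, 8, 2) pq_distribute(t2, broadcast, none) */
             min (t1.n1 + t2.n1), max (t1.n2 + t2.n2)
      from
        BIG_TABLE_NAME_CHANGED_12 t1,
        BIG_TABLE_NAME_CHANGED_12 t2
      where t1.n1 = t2.n1
      ;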
      
  2. For sure I’m missing something.

    But you say that the query gets 8 slaves from each of 2 nodes.
    So we get 8*2 plus 2 QC, that is 18 processes, but the result from your query shows 20 rows with 9 slaves per node.

    What am I missing?

    Thank you.

  3. Greg Rahn said

    I’m slightly confused …

    The first plan uses a ROWNUM stopkey (plan lines 6 & 13) and that will significantly slow the execution, as the slaves doing the scanning have to send all their rows to the QC for the stopkey. I think this is a very synthetic example and is not representative of real-world hash joins.

    The second plan does not have the ROWNUM filter, and I would guess that even without the pq_distribute hint it would probably use a HASH distribution because of the HASH JOIN.

    The use of INSTANCES with the PARALLEL clause and hint has been deprecated. If you use it, it basically results in a DOP of DEGREE x INSTANCES, in your case DOP=8*2=16. Using INSTANCES does not force the use of remote instances either – that was only the case with MPP architectures.
    See the Parallel Hint documentation:
    http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/sql_elements006.htm#sthref663

    You also mention 2.6GB of interconnect traffic, but how does that compare to the table size?
    What is the interconnect configuration for this cluster?
    What is the transfer rate you are seeing across the interconnect? The elapsed time of the query is over 1000 seconds, but it would take less than 30 seconds to transfer 2.6GB over a single Gig-E link.

    It would also be nice to see both the single-instance and multi-instance plans when you provide the trace info. Oftentimes plans change between single and multi instance, as heuristics and optimizations kick in to localize the PQ slave work.

    In summary, the use of the interconnect is not a bad thing in cross-instance parallel execution, but if the interconnect is not sized appropriately for the workload, it can become a bottleneck, just like anything else. I think you are jumping to conclusions on inter-instance parallel query and “saturating interconnect traffic”. I don’t believe your observations cover enough examples to make such a generalization. In most scenarios, too much interconnect traffic is a symptom, not a root cause. Besides, the HP Oracle Database Machine uses 8 RAC nodes and we run queries at DOP 128 leveraging all the nodes w/o issue, and the Oracle TPC-H team has been doing it for years. If a DW frequently uses large-table-to-large-table joins, then hash partitioning or subpartitioning would yield added gains, as partition-wise joins will be used.

    Also, PQ message traffic is not reported in AWR until 11g; it is not part of GES/GCS traffic.

    • Greg
      I researched this a little bit further. To avoid the rownum predicate, I created a 5 million row table from that original table in a different schema. This test case is similar to test case #1 except that it does not use the rownum clause.

      
      -------------------------------------------------------------------------------------------------------------------
      | Id  | Operation                 | Name      | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
      --------------------------------------------------------------------------------------------------------------------
      |   0 | SELECT STATEMENT          |           |     1 |   249 | 55304   (2)| 00:02:19 |        |      |            |
      |   1 |  SORT AGGREGATE           |           |     1 |   249 |            |          |        |      |            |
      |   2 |   PX COORDINATOR          |           |       |       |            |          |        |      |            |
      |   3 |    PX SEND QC (RANDOM)    | :TQ10002  |     1 |   249 |            |          |  Q1,02 | P->S | QC (RAND)  |
      |   4 |     SORT AGGREGATE        |           |     1 |   249 |            |          |  Q1,02 | PCWP |            |
      |*  5 |      HASH JOIN            |           |  3760K|  2080M| 55304   (2)| 00:02:19 |  Q1,02 | PCWP |            |
      |   6 |       PX RECEIVE          |           |  3760K|   969M| 27607   (1)| 00:01:10 |  Q1,02 | PCWP |            |
      |   7 |        PX SEND HASH       | :TQ10000  |  3760K|   969M| 27607   (1)| 00:01:10 |  Q1,00 | P->P | HASH       |
      |   8 |         PX BLOCK ITERATOR |           |  3760K|   969M| 27607   (1)| 00:01:10 |  Q1,00 | PCWC |            |
      |   9 |          TABLE ACCESS FULL| BIG_TABLE |  3760K|   969M| 27607   (1)| 00:01:10 |  Q1,00 | PCWP |            |
      |  10 |       PX RECEIVE          |           |  3760K|  1111M| 27614   (1)| 00:01:10 |  Q1,02 | PCWP |            |
      |  11 |        PX SEND HASH       | :TQ10001  |  3760K|  1111M| 27614   (1)| 00:01:10 |  Q1,01 | P->P | HASH       |
      |  12 |         PX BLOCK ITERATOR |           |  3760K|  1111M| 27614   (1)| 00:01:10 |  Q1,01 | PCWC |            |
      |  13 |          TABLE ACCESS FULL| BIG_TABLE |  3760K|  1111M| 27614   (1)| 00:01:10 |  Q1,01 | PCWP |            |
      --------------------------------------------------------------------------------------------------------------------
      
      Predicate Information (identified by operation id):
      ---------------------------------------------------
         5 - access("T1"."n1"="T2"."n1")
      
      

      Test results are actually worse. In test case #1, slaves from instance 2 were sending packets to the coordinator running in node 1; traffic was mostly unidirectional from 2->1. But in the new test case [without the rownum clause], GC traffic increased and became bi-directional. Slaves in slave set #1 running in instance 1 were sending packets to slaves in slave set #2 in instance 2 and vice versa, exactly in line with my presentation. [Agreed that the test case in my presentation was bad.]

      And then, I submitted 2 scripts on each node and was able to generate approximately 30MB/sec of UDP traffic. Gig-E can do approximately 150-180MB/sec before running into any sort of bottleneck. So, my point here is that it is possible to consume all available interconnect bandwidth with parallel queries alone and cause performance issues for other queries. With a Solaris IPMP active-passive configuration, these are the numbers we get today.

      Again, I am not suggesting that PQ across instances should never be allowed. On the contrary, I am suggesting not to open the flood gates and try to improve performance by allowing every query to use parallel slaves across instances. I agree that SQLs need to be closely reviewed and that the workload must match the interconnect bandwidth and latency, but [you probably know this since you work for the Real-World Performance group] in the real world, this is not the case. I think I have said enough about this matter.

      As a matter of fact, partitioning is in the works. We have been actively working to partition a few tables, but the application needs to be tested thoroughly, and my hope is that we will have partitioning, and mostly partition-wise joins, within 6 months to a year.

      Of course, I knew that PQ slave UDP traffic was not captured in the AWR report. That’s precisely the reason the in-house DBAs called me: average GC response was bad while the AWR report was not showing much traffic, and the sysadmins were complaining that there was a lot of traffic on the interconnect judging by the network port traffic numbers. I didn’t want to explicitly make a statement about that since I can’t find any documentation that confirms it explicitly. Please point me to documentation if there is any.
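
      One database-side way to see that volume: the PX message counters in gv$sysstat do count these messages. A sketch (statistic names as of 10g; multiplying the remote message counts by parallel_execution_message_size gives a rough byte estimate):

      select inst_id, name, value
      from gv$sysstat
      where name like 'PX%messages%'
      order by inst_id, name;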

      About the pq_distribute hint: as I mentioned, I tested different options for pq_distribute (I just posted one example) to see if I could keep slave communication within an instance, without much improvement. But I agree that partition-wise joins are probably the way to go. I still need to confirm that p-wise joins are not going to cause any other issues, and I need to schedule another time with the client to do that [and it is not a critical issue from the client’s point of view after resetting parallel_instance_group to the local instance].
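
      For the record, the direction we are heading is roughly this (a sketch with a hypothetical table name; hash partitioning both sides of a join on the join key enables a full partition-wise join, which keeps each slave’s portion of the join local):

      REM big_table_part is a hypothetical name for illustration
      create table big_table_part
      partition by hash (customer_trx_line_id)
      partitions 16
      as select * from big_table;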

      Cheers
      Riyaj

  4. […] Riyaj Shamsudeen – RAC, parallel query and udpsnoop […]

  5. Greg Rahn said

    Given that this cluster configuration only has the capacity of one Gig-E path for the interconnect, I think you did the right thing in limiting the parallel execution to a single node. BTW, a Gig-E path can only do 1000Mbps or 128MB/s max, not 150-180MB/s as you stated. In all likelihood this is way too small compared to the number of CPUs in the hosts and thus an unbalanced system for cross-instance parallel execution. Exactly how many CPUs/cores are in the hosts?

    In most cases the pq_distribute hint cannot be used to “tune” such queries unless the query is doing a BROADCAST of a large number of rows and a HASH distribution would be better, but that is generally the result of poor cardinality estimates. Since there is no hash partitioning or subpartitioning, HASH LOCAL could not be used, nor could BROADCAST LOCAL. If there were hash partitioning or subpartitioning, then a partition-wise join would eliminate all the traffic for the self-join example. Thus this configuration is basically a worst-case scenario: no hash partitioning and a very small interconnect.

    • Thanks Greg.

      That measurement of a 150MBps max came from port-side network traffic measurements by the network admins, and of course, I realize the Gig-E limits. I am not sure whether it was accurately measured or projected. But essentially, when he sees that much traffic, we can almost guarantee that the whole cluster will freeze.

      Thanks for reading my blog and providing valuable contributions to this discussion. I am hopeful that readers will read this discussion, form a balanced view, and apply the knowledge gained to their own environments.

      Cheers
      Riyaj

  6. […] through it I saw one example that struck me as very odd (Myth #3) and I contacted him about it. After further review Riyaj commented that he had made a mistake in his analysis and offered up a new example. I thought […]

  7. Yasser said

    I have been visiting your blog very regularly… and to me it seems the best site for analyzing test case scenarios.

    Have you stopped publishing articles on this website? I didn’t find any updates in the last month.

    If yes, please do not stop publishing your test cases… they are really excellent…

    • Yasser
      Thanks for visiting and thanks for the kind words.
      Sorry, I have been quite busy lately with a few issues. Also, it is summer in the USA and I have been traveling a little bit too.

      Thanks

  8. Eric said

    For some reason, in my 4-node RAC 10gR2 cluster, PQ never spawns slaves across instances. The coordinator is on one node, and all the slaves are on another node (instead of on the other 3 nodes). I still have some unused interconnect bandwidth. How can I encourage Oracle to spawn slaves across instances?

    • Hi Eric
      My guess is that a 10gR2/11g optimization might be kicking in, but I am not so sure. The basic philosophy behind this optimization is to reduce interconnect traffic by keeping all PQ/PX slaves (both producers and consumers) in one node. The node is chosen based upon a load profile that is maintained.
      Still, the final result will be assembled by the coordinator, which means that there will be some interconnect traffic between the slaves and the coordinator process (how much depends upon the query execution plan, and it is probably minimal).
      So, the question is: how is this affecting query performance? If you run multiple queries concurrently, do you see all instances in use? If you have configured numerous parallel slaves in an instance, and that server has an ample amount of CPU and minimal load, then it is quite possible that the algorithm determines that node to be the optimal node for executing the query.
      Of course, you should enable PX tracing to see why PX is behaving this way. This generates a big trace file and usually provides the reasons for the PX behavior.
      alter session set "_px_trace"="high","all";
      BTW, make sure the parallel_instance_group and instance_groups parameters are properly set. An incorrect setting can also lead to issues such as this.
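
      A quick sanity check for those parameters (a sketch):

      select inst_id, name, value
      from gv$parameter
      where name in ('instance_groups', 'parallel_instance_group')
      order by inst_id, name;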

      Cheers
      Riyaj

