The Clustering Factor-Indexes

On 09/04/2022 by Robert Corvino

Next, let’s look at some of the information Oracle will use. We are specifically going to look at the CLUSTERING_FACTOR column found in the USER_INDEXES view. The Oracle Database Reference manual tells us this column has the following meaning:

Indicates the amount of order of the rows in the table based on the values of the index:
• If the value is near the number of blocks, then the table is very well ordered. In this case, the index entries in a single leaf block tend to point to rows in the same data blocks.
• If the value is near the number of rows, then the table is very randomly ordered. In this case, it is unlikely that index entries in the same leaf block point to rows in the same data blocks.
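The two cases in that definition can be illustrated with a toy model. The following is a minimal sketch in Python, not Oracle's implementation: it assumes a hypothetical table that stores a fixed 100 rows per block, walks the keys in index order, and counts how often the table block changes. The names `ROWS_PER_BLOCK`, `build_table`, and `factor_for` are made up for this illustration.

```python
import random

ROWS_PER_BLOCK = 100  # assumption: how many rows fit in one table block


def clustering_factor(blocks_in_index_order):
    """Count how often the table block changes while walking the index in key order."""
    factor = 0
    previous_block = None
    for block in blocks_in_index_order:
        if block != previous_block:
            factor += 1
            previous_block = block
    return factor


def build_table(keys):
    """Store keys in arrival order, ROWS_PER_BLOCK per block; return {key: block number}."""
    return {key: position // ROWS_PER_BLOCK for position, key in enumerate(keys)}


def factor_for(keys):
    """Clustering factor of an index on the key for a table loaded in the given order."""
    block_of = build_table(keys)
    return clustering_factor(block_of[key] for key in sorted(block_of))


num_rows = 10_000
ordered_keys = list(range(num_rows))
shuffled_keys = ordered_keys[:]
random.Random(42).shuffle(shuffled_keys)  # fixed seed so runs are repeatable

num_blocks = num_rows // ROWS_PER_BLOCK
colocated_cf = factor_for(ordered_keys)       # table loaded in key order
disorganized_cf = factor_for(shuffled_keys)   # table loaded in random order
print(num_blocks, colocated_cf, disorganized_cf)
```

In this model the table loaded in key order gets a factor equal to the number of blocks, while the randomly loaded table gets a factor close to the number of rows.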

We could also view the clustering factor as a number that represents the number of logical I/Os against the table that would be performed to read the entire table via the index. That is, the CLUSTERING_FACTOR is an indication of how ordered the table is with respect to the index itself, and when we look at these indexes, we find the following:

SQL> select a.index_name, b.num_rows, b.blocks, a.clustering_factor
       from user_indexes a, user_tables b
      where a.index_name in ('COLOCATED_PK', 'DISORGANIZED_PK')
        and a.table_name = b.table_name;

Note: I used an ASSM-managed tablespace for this section’s example, which explains why the clustering factor for the COLOCATED table is less than the number of blocks in the table. There are unformatted blocks in the COLOCATED table below the HWM that do not contain data, as well as blocks used by ASSM itself to manage space, and we will never read these blocks in an index range scan. Chapter 10 explains HWMs and ASSM in more detail.

So the database is saying, “If we were to read every row in COLOCATED via the index COLOCATED_PK from start to finish, we would perform 1,190 I/Os. However, if we did the same to DISORGANIZED, we would perform 99,929 I/Os against the table.” The reason for the large difference is that as Oracle range scans through the index structure, if it discovers the next row in the index is on the same database block as the prior row, it does not perform another I/O to get the table block from the buffer cache. It already has a handle to one and just uses it. However, if the next row is not on the same block, then it will release that block and perform another I/O into the buffer cache to retrieve the next block to be processed. Hence, the COLOCATED_PK index, as we range scan through it, will discover that the next row is almost always on the same block as the prior row. The DISORGANIZED_PK index will discover the opposite is true.

In fact, we can actually see this measurement is very accurate. If we hint to the optimizer to use an index full scan to read the entire table and just count the number of non-null Y values, we can see exactly how many I/Os it will take to read the entire table via the index:
select count(Y) from (select /*+ INDEX(COLOCATED COLOCATED_PK) */ * from colocated);
select count(Y) from (select /*+ INDEX(DISORGANIZED DISORGANIZED_PK) */ * from disorganized);

In both cases, the index needed to perform 209 logical I/Os (cr=209 in the Row Source Operation lines). If you subtract 209 from the total consistent reads and measure just the number of I/Os against the table, then you’ll find that they are identical to the clustering factor for each respective index. The COLOCATED_PK is a classic “the table is well ordered” example, whereas the DISORGANIZED_PK is a classic “the table is very randomly ordered” example. It is interesting to see how this affects the optimizer now.
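The subtraction described above can be written out as a short worked example. This is only a sketch built from the three figures quoted in the text (209 index I/Os and the two clustering factors). The totals it prints are derived by adding those figures together, not copied from any trace output, and the function names are made up for illustration.

```python
# Figures quoted in the text.
INDEX_LOGICAL_IOS = 209  # cr=209 for the index in both runs
CLUSTERING_FACTOR = {
    "COLOCATED_PK": 1_190,
    "DISORGANIZED_PK": 99_929,
}


def expected_total_reads(index_name):
    """Index reads plus table reads, assuming table reads equal the clustering factor."""
    return INDEX_LOGICAL_IOS + CLUSTERING_FACTOR[index_name]


def table_reads(total_consistent_reads):
    """Table-only logical I/Os: total consistent reads minus the index's share."""
    return total_consistent_reads - INDEX_LOGICAL_IOS


for name in CLUSTERING_FACTOR:
    total = expected_total_reads(name)
    print(f"{name}: derived total cr={total}, table part={table_reads(total)}")
```

If the totals in your own trace differ from the derived ones by more than a handful of reads, the statistics behind CLUSTERING_FACTOR are probably out of date relative to the table.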

If we attempt to retrieve 25,000 rows, Oracle will now choose a full table scan for both queries (retrieving 25 percent of the rows via an index is not the optimal plan, even for the very ordered table). However, if we drop down to 10 percent of the table data, the two tables get different plans (bear in mind that 10 percent is not a threshold value; it is just a number less than 25 percent that caused an index range scan to happen in this case):
SQL> set autotrace traceonly explain
SQL> select * from colocated where x between 20000 and 30000;
SQL> select * from disorganized where x between 20000 and 30000;

Here, we have the same table structures—the same indexes—but different clustering factors. The optimizer in this case chose an index access plan for the COLOCATED table and a full scan access plan for the DISORGANIZED table.
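To get a feel for why the clustering factor tips the decision, here is a rough sketch using a commonly cited approximation of index range scan I/O cost (branch levels, plus the fraction of leaf blocks visited, plus the fraction of the clustering factor). It is not the optimizer's real calculation, which accounts for much more. The clustering factors come from the text, while `BLEVEL`, `LEAF_BLOCKS`, `TABLE_BLOCKS`, and `MULTIBLOCK_READ` are assumed values chosen only for illustration.

```python
import math

# From the text: clustering factors of the two indexes.
CLUSTERING_FACTOR = {"COLOCATED": 1_190, "DISORGANIZED": 99_929}

# Assumptions for illustration only (not given in the text).
BLEVEL = 1             # assumed number of branch levels in the index
LEAF_BLOCKS = 208      # assumed; the text only says the index scan took 209 I/Os
TABLE_BLOCKS = 1_250   # assumed; the text only implies somewhat more than 1,190
MULTIBLOCK_READ = 8    # assumed number of blocks fetched per multiblock read


def index_range_scan_cost(selectivity, clustering_factor):
    """Simplified I/O cost: branch levels + leaf blocks visited + table block visits."""
    return (BLEVEL
            + math.ceil(LEAF_BLOCKS * selectivity)
            + math.ceil(clustering_factor * selectivity))


def full_scan_cost():
    """Simplified I/O cost: number of multiblock reads needed to read the whole table."""
    return math.ceil(TABLE_BLOCKS / MULTIBLOCK_READ)


def cheaper_plan(table, selectivity):
    """Pick whichever simplified cost is lower for the given table and selectivity."""
    index_cost = index_range_scan_cost(selectivity, CLUSTERING_FACTOR[table])
    return "INDEX RANGE SCAN" if index_cost < full_scan_cost() else "FULL TABLE SCAN"


for selectivity in (0.25, 0.10):
    for table in CLUSTERING_FACTOR:
        print(f"{table} at {selectivity:.0%}: {cheaper_plan(table, selectivity)}")
```

Under these assumed numbers the model reproduces the behavior described above: only the well-clustered table at 10 percent favors the index, because the table-visit term scales with the clustering factor.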

The key point of this discussion is that indexes are not always the appropriate access method. The optimizer may very well be correct in choosing not to use an index, as the preceding example demonstrates. Many factors influence the use of an index by the optimizer, including physical data layout. You might therefore be tempted to run out and rebuild all of your tables so that every index has a good clustering factor, but that would be a waste of time in most cases. A rebuild pays off mainly in cases where you do index range scans of a large percentage of a table. Additionally, you must keep in mind that, in general, a table will have only one index with a good clustering factor! The rows in a table may be sorted in only one way. In the example just shown, if I had another index on the column Y, it would be very poorly clustered in the COLOCATED table but very nicely clustered in the DISORGANIZED table. If having the data physically clustered is important to you, consider the use of an IOT, a B*Tree cluster, or a hash cluster over continuous table rebuilds.
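The "sorted in only one way" point can be shown with a small simulation. This is a sketch under assumptions, not the article's actual tables: it uses hypothetical stand-ins with 100 rows per block, where `colocated` is stored in X order with Y values unrelated to X, and `disorganized` holds the same rows stored in Y order.

```python
import random

ROWS_PER_BLOCK = 100  # assumption: rows per table block


def clustering_factor(rows, column):
    """Walk rows in order of `column` and count table block changes.

    `rows` is a list of (x, y) tuples in physical storage order.
    """
    key_and_block = [(row[column], position // ROWS_PER_BLOCK)
                     for position, row in enumerate(rows)]
    key_and_block.sort()  # index order: by key value, ties broken by block number
    factor, previous_block = 0, None
    for _, block in key_and_block:
        if block != previous_block:
            factor += 1
            previous_block = block
    return factor


rng = random.Random(7)  # fixed seed so runs are repeatable
num_rows = 10_000
y_values = list(range(num_rows))
rng.shuffle(y_values)

# Hypothetical stand-in for COLOCATED: stored in X order, Y unrelated to X.
colocated = [(x, y_values[x]) for x in range(num_rows)]
# Hypothetical stand-in for DISORGANIZED: the same rows stored in Y order.
disorganized = sorted(colocated, key=lambda row: row[1])

X, Y = 0, 1
for name, table in (("colocated", colocated), ("disorganized", disorganized)):
    print(name, clustering_factor(table, X), clustering_factor(table, Y))
```

Each stand-in table has one index with a factor equal to its block count and one with a factor close to its row count, and reordering the rows only swaps which index is which.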
