Wednesday, 22 December 2010

Data Distribution Logic on Teradata

Oracle lets the DBA decide on the data distribution. Teradata does this work by itself after the "distribution key" has been decided. Okay, there is no such term as a "distribution key" but that is what it practically is. Here's why.

The data distribution is done on the basis of primary index. What one can decide, as the DBA, is which column or set of columns will contribute to this primary index. More the possible number of values, more are the possible values of the hash function and better is the distribution of the data. In essence if employee Id (which might itself be arbitrary or a linearly increasing long number) is used as the primary index, then chances of a uniform data distribution are high. This is, of course, with the assumption that the hash function is efficient enough. Here is an example: 

Column Value x (assuming a single column)
Hex value of hashrow(x)
Hex value of the first 20 bits of hashrow(x)
1
0x79b69e37
0x79b69
2
0xf3713c6e
0xf3713
134
0xb6ded109
0xb6ded
999999
0x06bf5ee1
0x06bf5

Note that the hashrow() function is the same in all instances of Teradata, so "select hashrow(134)" would always give "0xb6ded109" in all Teradata implementations. So what is the third column here? The first 20 bits (or in some systems the first 16 bits) is the bucket for all the remaining 12 bits (or 16 bits), i.e. 0x79b69 is the bucket for columns whose hash function output is in the range 0x79b69000 to 0x79b69fff. Note that the output of the two functions are dependent only on the algorithm of the hashrow() function which is fixed across all Teradata systems.

So this table is true for all Teradata systems. The AMP assigned for this bucket will house all rows of the database for which the first 20 (or 16) bits match the bucket value, or all rows whose hashbucket(hashrow(row_distribution_key)) is 0x79b69. This AMP will be assigned to several other buckets. How are the AMPs assigned to buckets?

For each system there exists a hashmapbucket which maps AMPs against buckets (or sets of rows). The bucket to amp mapping might be different for different systems having different AMPs but will be same for all systems having the same number of AMPs. Some sources depict this mapping as a matrix. It does not matter so far this discussion is concerned. The only thing that interests us is for a given row (distribution key), there exists a unique hashrow(), a unique hashbucket(hashrow()) and a unique hashamp(hashbucket(hashrow())! Teradata guarantees this.

Bucketvalue
AMP
0x06bf5
0
0x79b69
3
0xb6ded
1
0xf3713
2

Simple? Read again (aloud).

No comments: