Here is an interesting question in statistics:
“There are 3 random variables X, Y and Z. The correlation between X and Y is 0.8 and the
correlation between X and Z is 0.8. What are the maximum and minimum correlations between Y and Z?”
1. Geometric illustration
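The value of corr(Y, Z) is the cosine of the angle between Y and Z when the centered variables are viewed as vectors. We already know corr(X, Y) and corr(X, Z), so the angle between X and Y and the angle between X and Z are both arccos(0.8), about 36.9 degrees. In this particular case the angle between Y and Z can be zero, which means Y and Z are perfectly correlated and the max value of corr(Y, Z) is 1. The min value of corr(Y, Z) comes from the biggest possible angle between Y and Z, which is twice arccos(0.8), about 73.7 degrees; its cosine is 0.28.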
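The geometric argument can be written out as a short calculation. This is only a sketch: the angle symbols below are notation introduced here, not from the original post, and the upper bound on the angle assumes the two known angles sum to at most 180 degrees, which holds in this case.

```latex
\begin{align*}
\theta_{XY} &= \theta_{XZ} = \arccos(0.8) \approx 36.87^\circ \\
|\theta_{XY} - \theta_{XZ}| &\le \theta_{YZ} \le \theta_{XY} + \theta_{XZ} \\
\max \operatorname{corr}(Y, Z) &= \cos(0) = 1 \\
\min \operatorname{corr}(Y, Z) &= \cos(2 \arccos 0.8) = 2(0.8)^2 - 1 = 0.28
\end{align*}
```

The last line uses the double-angle identity cos(2t) = 2cos(t)^2 - 1.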
2. Positive semi-definiteness property of the correlation matrix
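Because a correlation matrix is positive semi-definite, its determinant is greater than or equal to zero. Treating corr(Y, Z) as the unknown, this gives a quadratic inequality whose solution is the interval from 0.28 to 1. The code below finds the two endpoints numerically with the SOLVE function in PROC FCMP.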
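For reference, the quadratic can be derived by hand. In this sketch a and b stand for corr(X, Y) and corr(X, Z) and x stands for corr(Y, Z), matching the argument names used in the code; the matrix symbol R is notation assumed here.

```latex
\begin{align*}
R &= \begin{pmatrix} 1 & a & b \\ a & 1 & x \\ b & x & 1 \end{pmatrix} \\
\det R &= -x^2 + 2abx - a^2 - b^2 + 1 \ge 0 \\
\det R = 0 \;&\Longleftrightarrow\; x = ab \pm \sqrt{(1 - a^2)(1 - b^2)} \\
a = b = 0.8: \quad x &= 0.64 \pm 0.36, \quad\text{so}\quad 0.28 \le x \le 1
\end{align*}
```

Because the coefficient of x^2 is negative, the determinant is non-negative exactly between the two roots.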
proc fcmp outlib=work.funcs.test1;
   /* Determinant of the 3x3 correlation matrix, with x = corr(Y, Z) */
   function corrdet(x, a, b);
      return(-x**2 + 2*a*b*x - a**2 - b**2 + 1);
   endsub;
   /* Root of corrdet found by SOLVE, starting the search from ini */
   function solvecorr(ini, a, b);
      array solvopts[5] initial abconv relconv
                        maxiter solvstat (.5 .001 1.0e-6 100);
      initial = ini;
      x = solve('corrdet', solvopts, 0, ., a, b);
      return(x);
   endsub;
quit;
options cmplib = work.funcs;
data one;
* Max value;
upper = solvecorr(1, 0.8, 0.8);
upper_check = corrdet(upper,0.8,0.8);
* Min value;
lower = solvecorr(-1, 0.8, 0.8);
lower_check = corrdet(lower,0.8,0.8);
run;
Generalization
We can generalize the question to all possible values of corr(X, Y) and corr(X, Z). First we need to create two user-defined functions that return the maximum and the minimum values. Then we will be able to draw the max values and min values in the same plot. It is very interesting to see that the upper surface and lower surface meet only on the boundary, where corr(X, Y) or corr(X, Z) is 1 or -1; at the four corners they meet at (1, 1, 1), (-1, 1, -1), (1, -1, -1) and (-1, -1, 1).
Many other patterns can be read from this plot, for example that when corr(X, Y) = corr(X, Z) the max value of corr(Y, Z) is always equal to 1.
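The equal-correlation case can be checked from the closed-form roots of the determinant quadratic. As a sketch, write a for corr(X, Y) and b for corr(X, Z) (symbols assumed here to match the function arguments); the bounds on corr(Y, Z) are then ab plus or minus a square root term.

```latex
\begin{align*}
\text{bounds:}\quad & ab \pm \sqrt{(1 - a^2)(1 - b^2)} \\
a = b:\quad & \text{upper} = a^2 + (1 - a^2) = 1, \qquad \text{lower} = a^2 - (1 - a^2) = 2a^2 - 1 \\
|a| = 1:\quad & \text{upper} = \text{lower} = ab
\end{align*}
```

The last line shows why the two surfaces touch wherever one of the known correlations is 1 or -1: the square root term vanishes.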
proc fcmp outlib = work.funcs.test2;
   /* Larger root of the determinant quadratic: a*b + 0.5*sqrt(discriminant) */
   function upper(a, b);
      x = 4*(a**2)*(b**2) - 4*(a**2+b**2-1);
      if x ge 0 then y = -0.5*(-sqrt(x) - 2*a*b);
      else y = .;
      return(y);
   endsub;
   /* Smaller root of the determinant quadratic: a*b - 0.5*sqrt(discriminant) */
   function lower(a, b);
      x = 4*(a**2)*(b**2) - 4*(a**2+b**2-1);
      if x ge 0 then y = -0.5*(sqrt(x) - 2*a*b);
      else y = .;
      return(y);
   endsub;
quit;
data two;
do xy = -.99 to .99 by 0.01;
do xz = -.99 to .99 by 0.01;
upper = upper(xy, xz);
lower = lower(xy, xz);
output;
end;
end;
run;
proc template;
define statgraph surface001;
begingraph;
layout overlay3d / cube = false rotate = 150 tilt = 30
xaxisopts = (label="Correlation between X and Y")
yaxisopts = (label="Correlation between X and Z")
zaxisopts = (label="Boundaries of correlation between Y and Z") ;
surfaceplotparm x = xy y = xz z = upper;
surfaceplotparm x = xy y = xz z = lower;
endlayout;
endgraph;
end;
run;
proc sgrender data = two template = surface001;
run;
What about for variables that have correlation higher than 1? You cannot use ARCCOS on that...
Oh... I really don't know if there are two variables that have correlation higher than 1.