Generate pairs of uniformly distributed random numbers with a given correlation


I would like to generate pairs of random numbers with a given correlation. However, the usual approach of taking a linear combination of two normal variables does not work here, because a linear combination of uniform variables is no longer uniformly distributed. I need both variables to be uniform.

Any idea on how to generate pairs of uniform variables with a given correlation?

Onturenio
Closely related: stats.stackexchange.com/questions/30526. You also want to check the copula tag. A quick and dirty technique is to let X be uniform on [0,1] and set Y = X when X ≤ α and Y = 1 + α − X otherwise. The correlation is ρ = 2(α − 1)³ + 1, whence α = 1 − ((1 − ρ)/2)^(1/3) does the trick. But copulas will give you more control...
whuber
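As a sketch (my own code, not part of the comment), the construction above can be checked numerically in R:

```r
## Sketch of the "quick and dirty" construction described above.
set.seed(1)
rho   <- 0.7                               # target Pearson correlation
alpha <- 1 - ((1 - rho)/2)^(1/3)           # inverts rho = 2*(alpha - 1)^3 + 1
x <- runif(1e5)                            # X ~ U(0,1)
y <- ifelse(x <= alpha, x, 1 + alpha - x)  # Y is also U(0,1)
cor(x, y)                                  # should be close to 0.7
```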
Thanks for the comment, but yes, I think this method is really "dirty"
Onturenio
My hope was that in seeing this approach you would recognize that you can (and ought to) provide additional criteria concerning the properties of your pairs of random numbers. If this is "dirty," then precisely what is wrong with the solution? Tell us so that we can provide more appropriate answers for your situation.
whuber
This question was answered incidentally in the response to a closely related question: how to generate pairs of RVs with a linear regression relationship. Because the slope of the linear regression is related in a readily computed way to the correlation coefficient, and all possible slopes can be produced, it gives a way to produce exactly what you want. See stats.stackexchange.com/questions/257779/….
whuber
Please also see stats.stackexchange.com/questions/31771, which answers the generalization to three random uniforms.
whuber

Answers:


I'm not aware of a universal method to generate correlated random variables with any given marginal distributions. So, I'll propose an ad hoc method to generate pairs of uniformly distributed random variables with a given (Pearson) correlation. Without loss of generality, I assume that the desired marginal distribution is standard uniform (i.e., the support is [0,1]).

The proposed approach relies on the following:
a) For standard uniform random variables U1 and U2 with respective distribution functions F1 and F2, we have Fi(Ui)=Ui, for i=1,2. Thus, by definition Spearman's rho is

ρS(U1,U2)=corr(F1(U1),F2(U2))=corr(U1,U2).
So, Spearman's rho and Pearson's correlation coefficient are equal (sample versions might however differ).

b) If X1,X2 are random variables with continuous margins and Gaussian copula with (Pearson) correlation coefficient ρ, then Spearman's rho is

ρS(X1,X2) = (6/π) arcsin(ρ/2).
This makes it easy to generate random variables that have a desired value of Spearman's rho.
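As a quick numerical check of (b) (a sketch of my own, not part of the original answer), one can simulate from a Gaussian copula with Pearson parameter ρ and compare the sample Spearman's rho with the formula:

```r
## Numerical check of the arcsine relation between the Gaussian-copula
## Pearson parameter and Spearman's rho.
set.seed(42)
rho <- 0.5                                    # Pearson parameter of the copula
z <- matrix(rnorm(2e5), ncol = 2) %*% chol(toeplitz(c(1, rho)))
u <- pnorm(z)                                 # Gaussian-copula sample, uniform margins
cor(u, method = "spearman")[1, 2]             # sample Spearman's rho
6/pi * asin(rho/2)                            # theoretical value, about 0.483
```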

The approach is to generate data from the Gaussian copula with an appropriate correlation coefficient ρ such that the Spearman's rho corresponds to the desired correlation for the uniform random variables.

Simulation algorithm
Let r denote the desired level of correlation, and n the number of pairs to be generated. The algorithm is:

  1. Compute ρ=2sin(rπ/6).
  2. Generate a pair of random variables from the Gaussian copula with correlation coefficient ρ (e.g., with this approach).
  3. Repeat step 2 n times.

Example
The following code is an example implementation of this algorithm in R, with target correlation r = 0.6 and n = 500 pairs.

## Initialization and parameters 
set.seed(123)
r <- 0.6                            # Target (Spearman) correlation
n <- 500                            # Number of samples

## Functions
gen.gauss.cop <- function(r, n){
    rho <- 2 * sin(r * pi/6)        # Pearson correlation
    P <- toeplitz(c(1, rho))        # Correlation matrix
    d <- nrow(P)                    # Dimension
    ## Generate sample
    U <- pnorm(matrix(rnorm(n*d), ncol = d) %*% chol(P))
    return(U)
}

## Data generation and visualization
U <- gen.gauss.cop(r = r, n = n)
pairs(U, diag.panel = function(x){
          h <- hist(x, plot = FALSE)
          rect(head(h$breaks, -1), 0, tail(h$breaks, -1), h$counts/max(h$counts))})

In the figure below, the diagonal panels show histograms of U1 and U2, and the off-diagonal panels show scatter plots of U1 against U2.

By construction, the random variables have uniform margins and a correlation coefficient close to r. Due to sampling variability, however, the correlation coefficient of the simulated data is not exactly equal to r.

cor(U)[1, 2]
# [1] 0.5337697

Note that the gen.gauss.cop function should work with more than two variables simply by specifying a larger correlation matrix.
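As an illustration of this remark, here is a sketch of a three-variable version (the function name gen.gauss.cop.multi and the exchangeable correlation structure are my own choices, not part of the original answer):

```r
## Sketch: three uniforms with a common pairwise target correlation r.
gen.gauss.cop.multi <- function(r, n, d = 3) {
    rho <- 2 * sin(r * pi/6)              # Pearson parameter for each pair
    P <- matrix(rho, d, d); diag(P) <- 1  # exchangeable correlation matrix
    pnorm(matrix(rnorm(n * d), ncol = d) %*% chol(P))
}
U3 <- gen.gauss.cop.multi(r = 0.4, n = 10000)
round(cor(U3), 2)   # off-diagonal entries should be close to 0.4
```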

Simulation study
The following simulation study, repeated for target correlations r = −0.5, 0.1, and 0.6, suggests that the distribution of the sample correlation coefficient converges to the desired correlation as the sample size n increases.

## Simulation
set.seed(921)
r <- 0.6                                                # Target correlation
n <- c(10, 50, 100, 500, 1000, 5000); names(n) <- n     # Number of samples
S <- 1000                                               # Number of simulations

res <- sapply(n,
              function(n, r, S){
                   replicate(S, cor(gen.gauss.cop(r, n))[1, 2])
               }, 
               r = r, S = S)
boxplot(res, xlab = "Sample size", ylab = "Correlation")
abline(h = r, col = "red")


QuantIbex
The general method to generate correlated multivariate distributions with given marginal distributions is called a copula.
whuber
@whuber, using a copula lets you specify a dependence structure between random variables. The problem is that the (Pearson) correlation is affected both by the dependence structure and by the margins. Thus, each choice of margins will require a corresponding choice of copula parameters, not to mention that some levels of correlation simply cannot be achieved for given margins (e.g., see here). If you know a method that allows one to "control" the level of correlation for any choice of margins, I would be glad to hear about it.
QuantIbex
Thanks @QuantIbex. But I don't get why a) implies that "Spearman's rho and Pearson's correlation coefficient for random variables with standard uniform margins are approximately equal in large samples".
Onturenio
QuantIbex, all you need is to create a continuous path of copulas from the lower to the upper Frechet-Hoeffding bounds. For identical marginals, the correlation coefficient will be a continuous function from that path into the interval [−1, 1]. My "quick and dirty" example in a comment to the question is one such path, but obviously there are many others: copulas give you the fullest, most general way to create and describe such paths. What this shows is that the original question is (grossly) underdetermined: it ought to stipulate additional criteria for the solution.
whuber
@QuantIbex I took the liberty of adding a sentence pointing out that your gen.gauss.cop function will work for more than two variables with a (trivial) adjustment. If you don't like the addition or would prefer to phrase it differently, please revert or edit as needed.
Glen_b -Reinstate Monica

(This proves the claim in my other answer below, where u1 = I w1 + (1 − I) w2 and u2 = I w1 + (1 − I) w3, the w's are independent U(0,1), and I is Bernoulli(p).)

Intuitively, u1 is U(0,1) because u1 equals w1 [which is U(0,1)] if I = 1, and u1 equals w2 [which is U(0,1)] if I = 0, so u1 is U(0,1) in either case. The same holds for u2. As for the correlation:

E(u1 u2) = E{[I w1 + (1 − I) w2][I w1 + (1 − I) w3]}

Expanding this, note first that I(I − 1) = 0, I² = I, and (1 − I)² = 1 − I, because I is always either 0 or 1. Note also that I is independent of the w's, which are also independent of each other. So:

E(u1 u2) = E(I) E(w1²) + E(1 − I) E(w2) E(w3) = p E(w1²) + (1 − p)/4

From the fact that V(w1) = 1/12, we get E(w1²) = 1/3, so E(u1 u2) = p/12 + 1/4, that is: cov(u1, u2) = p/12. Since V(u1) = V(u2) = 1/12, we finally get cor(u1, u2) = p.

Neal Oden

Here is a simple method for positive correlation: let (u1, u2) = I(w1, w1) + (1 − I)(w2, w3), where w1, w2, and w3 are independent U(0,1) and I is Bernoulli(p). Then u1 and u2 will have U(0,1) distributions with correlation p. This extends immediately to k-tuples of uniforms with a compound-symmetric covariance matrix.

If you want pairs with negative correlation, use (u1, u2) = I(w1, 1 − w1) + (1 − I)(w2, w3), and the correlation will be −p.

Neal Oden
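A sketch of this construction in R (the variable names are mine):

```r
## Mixture construction: with probability p the pair is (w1, w1) (or (w1, 1 - w1)
## for the negative case), otherwise two independent uniforms.
set.seed(7)
p <- 0.3; n <- 1e5
I  <- rbinom(n, 1, p)                  # Bernoulli(p) mixing indicator
w1 <- runif(n); w2 <- runif(n); w3 <- runif(n)
u1 <- I * w1 + (1 - I) * w2
u2 <- I * w1 + (1 - I) * w3            # positive case
v2 <- I * (1 - w1) + (1 - I) * w3      # negative case
cor(u1, u2)                            # should be close to  p
cor(u1, v2)                            # should be close to -p
```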
Can you add a short proof of why this works?
Le Laconic
if you want to be computationally efficient, u1 = w1 also produces the same correlation (in both the positive and negative cases)
Anvit