go back
#######################################################################
####### Build a 3D, 2 var kernel density estimation with R from scratch
#######################################################################
1. Introduction goto
2. A simple histogram function goto
3. A simple kernel density estimation function goto
4. A multi-dimension kernel density estimation function goto
5. Using montecarlo to estimate the optimal bandwidth goto
1. Introduction gotop
Let's have the following variable
set.seed(1)
x=rnorm(30)
To represent the distribution of x, I can use the in-built function to obtain an histogram
and a kernel density estimation function.
par(mfrow=c(1,2))
hist(x)
plot(density(x),t="l")
I will try to build here these functions from scratch and add some features.
2. A simple histogram function gotop
I build a function with, in input, I enter a vector and in output I display an histogram of this
vector. In input, I can also choose the number of bars of the histogram.
f_histo=function(x,bar=7){
a=seq(min(x),max(x),l=bar+1)
o=sapply(1:bar,function(i){sum(if(i==bar){x>=a[i]}else{x>=a[i]&x<a[i+1]})})
plot(NA,xli=c(min(a),max(a)),yli=c(0,max(o)),t="n",yla="Frequency",
xla=substitute(x),m=paste("Histogram of",substitute(x)))
v=sapply(1:bar,function(i){rect(a[i],0,a[i+1],o[i])})
}
f_histo(x)
First I separate the values of x in a finite number of same size ranges. Let's say 7.
bar=7
a=seq(min(x),max(x),l=bar+1)
Then I count the number of x values in each range. The sapply function allows me to use the vector
a I have just created and to count the number of dots in each range. Thanks to this function I
don't need to use a loop or a recursive function.
o=sapply(1:bar,function(i){sum(if(i==bar){x>=a[i]}else{x>=a[i]&x<a[i+1]})})
Now I just have to build a plot with different bar. The width a each bar will be the ranges in the
vector a and the height of each bar will be the numbers in the vector o. First I display an empty
plot. Then I use the function rect to draw the different bars in this plot.
plot(NA,xli=c(min(a),max(a)),yli=c(0,max(o)),t="n",yla="Frequency",
xla=substitute(x),m=paste("Histogram of",substitute(x)))
sapply(1:bar,function(i){rect(a[i],0,a[i+1],o[i])})
3. A simple kernel density estimation function gotop
A kernel density estimation (KDE) is a kind of continuous version of the histogram. For a specific
value, for instance 0, the value of the KDE will increase when many values of x are near 0. To do
that, for each value of x, you apply the normal density function with mean 0 and a specific student
variation. And then you take the mean of all of these values and you obtain the KDE estimation for
0. For instance, if you take a student variation of 1, the KDE of 0 will be:
mean(dnorm(x,0,1))
Now we calculate the KDE estimation for 1 000 different values and we can draw a nice KDE graph:
f_density=function(x,bw=0.5){
a=seq(min(x)-3*bw,max(x)+3*bw,l=1000)
o=sapply(a,function(v){mean(dnorm(x,v,bw))})
plot(a,o,t="l",xla=NA,yla="Density",m=paste("Density of",substitute(x)),
s=paste("N =",length(x),", bandwidth =",bw))
}
f_density(x)
Another way to see KDE is that we add a white noise to our vector x. The vector x become smoother
and almost continous. For each of the 30 values of the vector x, I generate 10 000 random values
following a normal law with a student variation of 0.5 and a mean equals to the value of the x dot.
Then draw an histogram with 1 000 bars using our histogram function. We obtain again our KDE graph.
v=rnorm(length(x)*10000,x,0.5)
f_histo(v,bar = 1000)
4. A multi-dimension KDE function gotop
I improve a bit the function f_density. The input will be a dataframe with one or more column. The
output will be a dataframe with one column for every variables with a range of representative
values. There will be a last column with the KDE for these representative values. There will be
two others input variables: the bandwidth or the degree of smoothness of the KDE and the number of
reprensentative values to display in the output.
f_density2=function(x,bw=0.5,nsize=10000){
n=dim(x)[2]
for(i in 1:n){
o1=seq(min(x[,i])-3*bw,max(x[,i])+3*bw,l=floor(nsize^(1/n)))
if(i==1){o=as.data.frame(o1)}else{o=merge(o,o1)}
names(o)[i]=paste("x",i,sep="")
}
f=function(i){
if(i>=1){f(i-1)+sapply(1:dim(o)[1],function(v){mean(dnorm(x[,i],o[v,i],bw))})/n}
else{0}
}
o=cbind(o,d=f(n))
}
First, I build a dataframe "o" with a range of representative values of the different variables of
x. For every variable I take a range of representative values. The number of representative values,
floor(nsize^(1/n)), is calculated so the final number of representative values in the dataframe o
will be equal to nsize or a bit less.
o1=seq(min(x[,i])-3*bw,max(x[,i])+3*bw,l=floor(nsize^(1/n)))
Then I do a cartesian product of all these vectors to obtain the dataframe o with my range of
representative values. As I do cartesian product, the size of o can increase very quickly. It's why
I restrain it to nsize.
n=dim(x)[2]
for(i in 1:n){
o1=seq(min(x[,i])-3*bw,max(x[,i])+3*bw,l=floor(nsize^(1/n)))
if(i==1){o=as.data.frame(o1)}else{o=merge(o,o1)}
names(o)[i]=paste("x",i,sep="")
}
Then I calculate the KDE for every columns and I do the mean to have the final KDE. I use a
recursive function to do that. I add these column to the dataframe o which will be the output of my
function.
f=function(i){
if(i>=1){f(i-1)+sapply(1:dim(o)[1],function(v){mean(dnorm(x[,i],o[v,i],bw))})/n}
else{0}
}
o=cbind(o,d=f(n))
I can use this function on a data frame x which contains 2 variables x1 and x2. I use then plot3D to
represent the density function on a graph.
set.seed(1)
x1=rnorm(10)
x2=2*x1+rnorm(10)
x=cbind(x1,x2)
y=f_density2(x,bw = 1)
plot3D::points3D(x = y[,1],y = y[,2],z=y[,3],zlab="Density",xlab="x1",ylab="x2",
main="Histogram of x",sub="N = 10, bandwidth = 1")
y=f_density2(x,bw = 0.1)
plot3D::points3D(x = y[,1],y = y[,2],z=y[,3],zlab="Density",xlab="x1",ylab="x2",
main="Histogram of x",sub="N = 10, bandwidth = 0.1")
Game over! gotop