Feature significance is an extension of kernel density estimation
which is used to establish the statistical significance of features
(e.g. local modes). See Chaudhuri and Marron (1999) for 1-dimensional
data, Godtliebsen et al. (2002) for 2-dimensional data and Duong et
al. (2008) for 3- and 4-dimensional data. The feature
package contains a range of options to display and compute kernel
density estimates, significant gradient and significant curvature
regions. Significant gradient and/or curvature regions often correspond
to significant features. In this vignette we focus on 1-, 2- and
3-dimensional data.
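To make the idea of the gradient of a kernel density estimate concrete, here is a minimal base R sketch that evaluates a Gaussian kernel density estimate and its first derivative on a grid. It uses simulated data and hypothetical helper names (kde_gauss, kde_grad_gauss); it does not carry out any significance test and is not the feature package's implementation.

```r
# Illustrative only: Gaussian KDE and its gradient for 1-dimensional data.
# kde_gauss() and kde_grad_gauss() are hypothetical helpers, not part of feature.
kde_gauss <- function(x, grid, h) {
  # fhat(g) = mean over i of dnorm((g - x_i) / h) / h
  sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)
}
kde_grad_gauss <- function(x, grid, h) {
  # derivative of fhat with respect to the evaluation point g
  sapply(grid, function(g) mean(-(g - x) / h^3 * dnorm((g - x) / h)))
}

set.seed(1)  # simulated bimodal sample, not data from the package
x <- c(rnorm(100, mean = 0, sd = 0.3), rnorm(100, mean = 2, sd = 0.3))
grid <- seq(-1.5, 3.5, length.out = 401)
fhat <- kde_gauss(x, grid, h = 0.2)
fhat_grad <- kde_grad_gauss(x, grid, h = 0.2)

# Local modes sit where the gradient changes sign from positive to negative
mode_idx <- which(diff(sign(fhat_grad)) < 0)
grid[mode_idx]
```

Feature significance goes one step further than this sketch by assessing where the estimated gradient is significantly different from zero.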
The earthquake data set contains 510 observations, each consisting of
measurements of an earthquake beneath the Mt St Helens volcano. The
first variable is the longitude (in degrees, where a negative number
indicates west of the prime meridian), the second is the latitude (in
degrees, where a positive number indicates north of the Equator) and
the third is the depth (in km, where a negative number indicates below
the Earth’s surface).
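A quick way to check this description against the data is to load it and look at its dimensions and summary. This is only a sketch of an inspection step; it assumes the feature package is installed and makes no assumption about the column names.

```r
library(feature)
data(earthquake)

# The text above gives 510 observations on 3 variables
dim(earthquake)

# Ranges of longitude, latitude and depth (columns 1 to 3)
summary(earthquake)
```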
For the univariate example, we take log10(-depth) as our variable of
interest. The kernel density estimate with bandwidth 0.1 is the orange
curve. Superimposed in green are the sections of this density estimate
which have significant gradient (i.e. gradient significantly different
from zero). The rug plot shows the log10(-depth) measurements.
library(feature)
data(earthquake)
eq3 <- log10(-earthquake[,3])
eq3.fs <- featureSignif(eq3, bw=0.1)
plot(eq3.fs, xlab="log10(-depth)", addSignifGradRegion=TRUE, addData=TRUE)
Below this is the SiZer plot of Chaudhuri and Marron (1999). In the SiZer plot, blue indicates a significantly increasing density (positive gradient), red a significantly decreasing density (negative gradient), purple a non-significant gradient and grey regions where the data are too sparse for reliable estimation. The horizontal black line marks the bandwidth 0.1.
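The code for the SiZer plot is not shown above. A minimal sketch, assuming the SiZer() function exported by recent versions of feature and its default range of bandwidths, would be the following; the exact call and graphical settings behind the figure described here may differ.

```r
library(feature)
data(earthquake)
eq3 <- log10(-earthquake[, 3])

# SiZer map of gradient significance across a range of bandwidths
# (assumption: the default arguments of SiZer() are acceptable here)
SiZer(eq3)
```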
For bivariate data, we look at an Old Faithful geyser data set from
the MASS package. The horizontal axis is the waiting time (in minutes)
between two eruptions, and the vertical axis is the duration time (in
minutes) of an eruption. Below is a kernel density estimate with
bandwidth (4.5, 0.37), with the significant curvature regions
superimposed in blue.
library(MASS)
data(geyser)
geyser.fs <- featureSignif(geyser, bw=c(4.5, 0.37))
plot(geyser.fs, addSignifCurvRegion=TRUE)
A variation on plotting the significant regions is to plot the data points which fall inside these regions: significant curvature data points are in blue.
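A sketch of this variation, using the addData and addSignifCurvData arguments of the plot method in place of addSignifCurvRegion, is given below. Turning off the density estimate with addKDE=FALSE is a presentational choice made for this example.

```r
library(feature)
library(MASS)
data(geyser)
geyser.fs <- featureSignif(geyser, bw=c(4.5, 0.37))

# Plot the observations, highlighting those with significant curvature
plot(geyser.fs, addKDE=FALSE, addData=TRUE, addSignifCurvData=TRUE)
```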
For trivariate data, we return to the earthquake data set. Below are the significant curvature regions in blue with bandwidth (0.06, 0.06, 0.05).
data(earthquake)
earthquake[,3] <- -log10(-earthquake[,3])
earthquake.fs <- featureSignif(earthquake, scaleData=TRUE, bw=c(0.06, 0.06, 0.05))
plot(earthquake.fs, addKDE=FALSE, addSignifCurvRegion=TRUE)
The result of featureSignif is an object of class fs, which is a list with fields
names(earthquake.fs)
#> [1] "x" "names" "bw" "fhat"
#> [5] "grad" "curv" "gradData" "gradDataPoints"
#> [9] "curvData" "curvDataPoints"
where
x is the data;
names are the name labels used for plotting;
bw is the bandwidth;
fhat is the kernel density estimate;
grad is the logical matrix indicating significant gradient on a grid;
curv is the logical matrix indicating significant curvature on a grid;
gradData is the logical vector indicating significant gradient data points;
gradDataPoints are the significant gradient data points;
curvData is the logical vector indicating significant curvature data points;
curvDataPoints are the significant curvature data points.
The function featureSignifGUI provides interactive feature significance
via tcltk windows, but these are not integrated with rmarkdown. See
?featureSignifGUI.
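The fields of an fs object described above can be accessed like those of any list. The sketch below uses the bivariate geyser fit because it is quick to compute; the choice of fields to print is only illustrative.

```r
library(feature)
library(MASS)
data(geyser)
geyser.fs <- featureSignif(geyser, bw=c(4.5, 0.37))

# Bandwidth stored with the fit
geyser.fs$bw

# Number of observations flagged as significant curvature data points
sum(geyser.fs$curvData)

# First few of the flagged observations themselves
head(geyser.fs$curvDataPoints)
```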
Chaudhuri, P. and Marron, J. S. (1999). SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94, 807-823.
Duong, T., Cowling, A., Koch, I., and Wand, M. P. (2008). Feature significance for multivariate kernel density estimation. Computational Statistics and Data Analysis, 52, 4225-4242.
Godtliebsen, F., Marron, J. S., and Chaudhuri, P. (2002). Significance in scale space for bivariate density estimation. Journal of Computational and Graphical Statistics, 11, 1-21.