Given a sequence of independent identically distributed random variables X1, X2, ..., XN with common probability density function f(x), how can one estimate f(x)? (Old good KDE problem)
Let us consider the following simple question. You are on the road and you cannot go off side the road. Two bombs fall at points A and B in no particular order (for example you can see only two shell holes, but do not know which one was the first). Where is the safest place on the road if you expect the third bomb to fall? Obviously it is as far as possible from the both points. But what about the safest place between A and B? Is it the middle? Is it the place close to the point A or B?
If you think that both bombs fell from the same source and just randomly deviated from each other, then the middle point would be the most dangerous place, because it gives the highest probability that the most probable outcome is in the middle. However, if you think that those two bombs have independent sources, then the middle is the safest place because it is likely that the third bomb is sourced from either of those two. [It is the best guess because we do not have any information at this moment.] Now if you do not know if the sources are independent, somehow dependent or both are the same, then the answer is unclear.
Note that considering both sources sampled completely independent but with the bandwidth deduced from the distance between the points is the usual way for Kernel Density Estimation. The assumption that the sampled points are uncorrelated seems a bit weird, if not illogical at all. There is no relation between making distribution around points smooth and the spatial information. Doing this properly the estimator f(x) would be a sum of delta functions on points. But this is unacceptable, so the estimator is smoothed with Gaussians or some another kernels. Now if the smoothness (bandwidth) is big enough it is possible (depending on the kernel shape) that the middle is less safe than the end points. [For example see how the most dangerous place changes with the bandwidth parameter for the sum of 2 Gaussians.] In other words the probability in the middle adds equally from both sources and it can be greater than maximum probability from the source A plus probability from the source B at the point A.
If the common kernel is taken for both points, then the most probable place for the third point is the middle, which leaves end points safest - with simple shape of the kernel the middle is the most dangerous place and the further you are from the middle the safe you are; regardless whether you stand right inside the shell hole A or B, or next to it.
Personally I would hide in 1/4 of the distance between A and B from the point A or B, i.e. in the middle between the end point A or B and the middle point. What kernel does satisfy this? I do not know.
No comments:
Post a Comment