The best way to think of it is to picture what a pixel looks like when projected into texture space. If the surface is aligned with the camera near plane then the pixel will map to a square in texture space. It could be freely rotated however and look like a diamond. For point sampling the texel closest to this square sample is used. For bilinear its a weighted average of the 4 nearest. For the case of mip mapping a mip level is selected such that the square is about 1 texel unit area. Think about this square projected onto each mip level. As each mip level goes down each texel effectively doubles in each axis.
For anisotropic filtering then the pixel is elongated and forms a thin rectangle. There is no mip level you can use which either fully represents all of the rectangle. Instead the rectangle can be subdivided into smaller rectangles such that they are more square like. Each of this subdivisions have their own texel centre and can be each bilinear filtered and averaged for a result whcih better represents all the texels that touch the rectangle.
Hope this helps. I know pictures will help better explain. I'm sure there is so good visual explanations out there.