How hard this is depends on the environment you're coding in. For example, it's almost trivial in Unity, but more challenging if you're coding from scratch.
What you effectively want to do is cast a ray from the camera, through a pixel position, and get the first tile that it hits, correct? Each tile has up to three visible faces, and each face is a simple diamond. Determining if a point is in a diamond is pretty easy.
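To show what I mean by the diamond test, here is a minimal sketch for a top face, assuming an axis-aligned diamond given by its screen-space centre and half-width/half-height. The class name, method name and parameters are made up for illustration. The side faces are slanted rhombi, so they would need a sheared version of the same idea.

```java
// Hypothetical helper, not from any library.
final class Diamond {
    /**
     * Returns true if the point (px, py) lies inside or on the edge of the
     * axis-aligned diamond centred at (cx, cy) with the given half-width
     * and half-height. The normalised distances from the centre along each
     * axis must sum to at most 1.
     */
    static boolean contains(double px, double py,
                            double cx, double cy,
                            double halfW, double halfH) {
        double nx = Math.abs(px - cx) / halfW; // 0 at the centre, 1 at the left/right corner
        double ny = Math.abs(py - cy) / halfH; // 0 at the centre, 1 at the top/bottom corner
        return nx + ny <= 1.0;
    }
}
```

For a 64x32 tile centred on the origin, `Diamond.contains(10, 5, 0, 0, 32, 16)` is true and `Diamond.contains(20, 10, 0, 0, 32, 16)` is false.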
Determining if the top face is clicked on is particularly easy because you'll notice that the top faces of the grey cubes match perfectly with the grid of the top faces of the green cubes. If you number your grid with the x axis going up and to the right, z axis going up and to the left, and y going straight up, a hit on a top face diamond at (x, y, z) could also be a hit on a cube at (x-1, y+1, z-1), or (x-2, y+2, z-2), etc.
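A rough sketch of that candidate walk, in Java since that's what you're using. Everything here is an assumption on my part: the tile size (64x32), the projection (screen y grows downward, a cube is as tall on screen as a top-face diamond is high), the screen origin (ox, oy) where grid corner (0, 0) at elevation 0 is drawn, and all the names. It only considers top faces, so a cube whose top is hidden behind another cube's side face can still show up in the list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class TopFacePicker {
    static final double TILE_W = 64.0; // assumed width of a top-face diamond in pixels
    static final double TILE_H = 32.0; // assumed height of a top-face diamond in pixels

    /** Cell {x, y, z} on layer y whose top face contains the screen point. */
    static int[] cellAtLayer(double sx, double sy, double ox, double oy, int y) {
        double elevation = y + 1;                              // the top face sits one unit above the cube's base
        double dx = (sx - ox) / (TILE_W / 2);                  // equals x - z
        double dy = (oy - sy) / (TILE_H / 2) - 2 * elevation;  // equals x + z
        int x = (int) Math.floor((dx + dy) / 2);
        int z = (int) Math.floor((dy - dx) / 2);
        return new int[] {x, y, z};
    }

    /**
     * All cells whose top face projects onto the clicked diamond, for
     * layers 0..maxY, ordered nearest the camera first (highest layer first).
     */
    static List<int[]> candidates(double sx, double sy, double ox, double oy, int maxY) {
        int[] base = cellAtLayer(sx, sy, ox, oy, 0);
        List<int[]> result = new ArrayList<>();
        for (int k = maxY; k >= 0; k--) {
            // Each step up one layer moves one cell back along both x and z.
            result.add(new int[] {base[0] - k, k, base[2] - k});
        }
        return result;
    }
}
```

With the origin at (0, 0), a click at screen point (-32, -128) with `maxY = 2` gives the candidates (0, 2, 1), (1, 1, 2) and (2, 0, 3) in that order. You would then check each against your map and take the first one that actually contains a cube, and handle side faces separately.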
Beyond that, I don't have time to do the maths right now. Good luck!
EDIT: I didn't notice that you said Java. Not sure what the API is like for 3D.
Edited by jefferytitan, 03 May 2012 - 04:11 PM.