You could also just count the number of frames (samples) of the output stream, and count how many frames (samples) your local sound card has played, and keep the two in sync. You don't need to actually play the stream to do this.
If you don't have a sound card on the server, you can use a real-time clock in its place, to pace how fast you send the packets.
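Here's a rough sketch of that pacing idea in Python, assuming a 48 kHz stream and 960 frames per packet (both numbers are made up for illustration). The class name `PacedSender` and its methods are hypothetical; the only real API used is `time.monotonic`. The clock stands in for the sound card: it tells you how many frames "should" have played by now, and you send packets until the sent count catches up.

```python
import time


class PacedSender:
    """Tracks frames sent versus frames a clock says should have played."""

    def __init__(self, sample_rate=48000, frames_per_packet=960,
                 clock=time.monotonic):
        # sample_rate and frames_per_packet are illustrative assumptions.
        self.sample_rate = sample_rate
        self.frames_per_packet = frames_per_packet
        self.clock = clock
        self.start = clock()
        self.frames_sent = 0

    def packets_due(self):
        # Frames the (virtual) sound card would have consumed by now.
        frames_played = int((self.clock() - self.start) * self.sample_rate)
        behind = frames_played - self.frames_sent
        # Only whole packets can be sent; never return a negative count.
        return max(0, behind // self.frames_per_packet)

    def mark_sent(self, packets=1):
        # Call this after actually putting packets on the wire.
        self.frames_sent += packets * self.frames_per_packet
```

In a send loop you would call `packets_due()`, send that many packets, call `mark_sent()`, and sleep for a few milliseconds. Because the count is derived from total elapsed time rather than per-packet sleeps, timer inaccuracy does not accumulate.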
On the receiving side, you probably want to buffer a bit before you start playing back -- say, you require 4 packets to be available before you start playing back. This will protect against jitter.
If the buffer runs dry (0 packets available) you stop playback until you have 4 packets again. If the buffer overflows (say, 8 packets available) then you drop 4 packets to avoid using too much memory and piling up latency.
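The buffering rules above could be sketched like this. The class `JitterBuffer` and its method names are hypothetical, and dropping the oldest packets on overflow is my assumption (the rule only says to drop 4, not which 4).

```python
from collections import deque


class JitterBuffer:
    """Start at 4 packets, stop at 0, drop 4 when 8 pile up."""

    def __init__(self, start_threshold=4, overflow_threshold=8, drop_count=4):
        self.packets = deque()
        self.start_threshold = start_threshold
        self.overflow_threshold = overflow_threshold
        self.drop_count = drop_count
        self.playing = False

    def push(self, packet):
        """Called when a packet arrives from the network."""
        self.packets.append(packet)
        if len(self.packets) >= self.overflow_threshold:
            # Overflow: discard the oldest packets (assumed policy).
            for _ in range(self.drop_count):
                self.packets.popleft()
        if not self.playing and len(self.packets) >= self.start_threshold:
            # Enough is buffered to ride out some jitter; start playback.
            self.playing = True

    def pop(self):
        """Called by the audio callback; None means 'play silence'."""
        if not self.playing:
            return None
        if not self.packets:
            # Ran dry: stop until the buffer has refilled to the threshold.
            self.playing = False
            return None
        return self.packets.popleft()
```

The audio callback calls `pop()` once per packet period and plays silence whenever it gets `None`.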
And, yes, these skews will happen, because the sending and receiving computers do not run perfectly in sync. Each sound card or real-time clock will be sourced from a different electronic crystal. For good-quality crystals, a de-sync will happen perhaps once a day. For cheap consumer crystals, you could end up de-syncing a couple of times in a single song (yeah, that sucks.)
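To get a feel for how often that happens, you can do the arithmetic yourself. This is a back-of-the-envelope sketch: the function name is hypothetical, and the skew and buffer figures in the usage note are illustrative assumptions, not measurements of any particular hardware.

```python
def seconds_until_desync(skew_ppm, buffer_margin_seconds):
    """Time for two clocks that differ by skew_ppm parts per million
    to drift apart by buffer_margin_seconds of audio."""
    if skew_ppm <= 0:
        raise ValueError("skew_ppm must be positive")
    # 1 ppm of skew means the clocks drift 1 microsecond per second.
    drift_per_second = skew_ppm * 1e-6
    return buffer_margin_seconds / drift_per_second
```

For example, if the buffer gives you 0.08 seconds of slack (say 4 packets of 20 ms each), `seconds_until_desync(100, 0.08)` comes out to roughly 800 seconds, and a smaller skew or a bigger buffer stretches that proportionally.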
If you REALLY care, then you would slightly modulate the playback sample rate, so that it speeds up when the buffer is longer, and slows down when the buffer is shorter. That way, you don't have to play an audible crack in the stream, but instead just get a pretty much imperceptible "wow" in the playout. (For audio technology history, try googling "wow and flutter" as it pertains to tape recorders some time :-)
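A minimal sketch of that rate modulation, assuming a simple proportional controller. The function name, the gain, and the clamp are all made-up values for illustration; a real player would also need a resampler to actually play at the adjusted rate, which is not shown here.

```python
def playback_rate(buffered_packets, target_packets=4, nominal_rate=48000.0,
                  gain=0.0005, max_adjust=0.005):
    """Return the sample rate to play at, given the current buffer depth.

    More packets than the target: play slightly faster to drain the buffer.
    Fewer packets than the target: play slightly slower to let it refill.
    """
    error = buffered_packets - target_packets
    # Clamp the relative adjustment so the pitch change stays small
    # (0.5% here, an assumed limit).
    adjust = max(-max_adjust, min(max_adjust, gain * error))
    return nominal_rate * (1.0 + adjust)
```

You'd re-evaluate this every packet or so and feed the result to the resampler. Smoothing the buffer depth over a few seconds before feeding it in would keep network jitter from turning into audible pitch wobble.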