How Common are Data Errors in TCP?

Started by
9 comments, last by tufflax 12 years, 10 months ago
Hi!

How common are data corruption/errors in TCP? I mean errors that slip by the correction facilities.

I had my game server running for 30 minutes, and during that time 2 clients sent about 150 KB of data to the server. After about 30 minutes, my server crashed because it read an illegal value (my server does not yet deal well with illegal client data). The server ran on the same computer as the clients, and although it is possible that I have made a mistake, I find that a poor explanation, because for those 30 minutes basically the same messages were sent over and over. It has happened 3 times in the last 2 days. So it's kind of rare, too rare for me to think that it is a coding mistake, yet common enough to make me wonder what it can be.

I was under the impression that TCP is very reliable, especially when just sending to 127.0.0.1.

Do you have any ideas of what might be wrong?
You say you don't handle illegal client data. Do you know if anything else could be connecting to the server and sending data you're not expecting? What port are you using?
Think of it this way.

Unaugmented TCP is basically what is used to download .EXE files and such from the web.

When was the last time you downloaded something and discovered it was corrupted? When was the last time you found a spelling mistake on a web page that magically went away when you refreshed, because it was actually a TCP transmission error?


TCP doesn't corrupt your data.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]


You say you don't handle illegal client data. Do you know if anything else could be connecting to the server and sending data you're not expecting? What port are you using?


Good point! Maybe that's what's happening. I'll look into this. I was using port 2345 or 1300-something. Just typing 4-digit numbers... Although I'm behind a firewall... Where do the standard ports that might be open end?



Think of it this way.

Unaugmented TCP is basically what is used to download .EXE files and such from the web.

When was the last time you downloaded something and discovered it was corrupted? When was the last time you found a spelling mistake on a web page that magically went away when you refreshed, because it was actually a TCP transmission error?


TCP doesn't corrupt your data.


Yeah that's what I thought...


TCP data basically doesn't get corrupted. Each segment is protected by its own 16-bit checksum, and usually also by link-layer checks such as the Ethernet frame CRC (depending on specifics). Consider: when verifying md5sums of downloaded files, I've never had a download fail, even when the download is a 4 GB ISO file.

My guess is that your client or server code is wrong, or that someone else connected to your server and sent it garbage. For example, a memory corruption bug could cause random data to get corrupted.
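
For reference, the per-segment check in TCP is a 16-bit ones'-complement sum rather than a true CRC. Below is a minimal Java sketch of that sum, following RFC 1071. The class name `InternetChecksum` is made up for illustration, and the pseudo-header (addresses, protocol, length) that real TCP also covers is left out.

```java
// Sketch of the 16-bit ones'-complement Internet checksum (RFC 1071).
// Hypothetical helper for illustration; real TCP also sums a pseudo-header.
final class InternetChecksum {
    static int checksum(byte[] data) {
        long sum = 0;
        int i = 0;
        // Add the data as big-endian 16-bit words.
        for (; i + 1 < data.length; i += 2) {
            sum += ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
        }
        // An odd trailing byte is padded with a zero byte on the right.
        if (i < data.length) {
            sum += (data[i] & 0xFF) << 8;
        }
        // Fold the carries back into the low 16 bits.
        while ((sum >>> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >>> 16);
        }
        // The checksum is the ones' complement of the folded sum.
        return (int) (~sum & 0xFFFF);
    }

    public static void main(String[] args) {
        byte[] segment = "hello, server".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        int before = checksum(segment);
        segment[3] ^= 0x01; // flip a single bit
        int after = checksum(segment);
        System.out.println(before != after); // prints "true"
    }
}
```

Because the sum is commutative, swapping two 16-bit words leaves the result unchanged, which is one reason a rare error can in principle slip past this check.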
enum Bool { True, False, FileNotFound };

Hm, just tried again; I ran the server for about 20 minutes before the same problem occurred again. I was keeping track of all the connected clients, no new one ever connects. The message is from one of my own clients. But I'm using asserts in the client to make sure the client never sends illegal data. Moreover, I'm sending the exact same message about 30000 times or so before the problem occurs. I just connected the client and let it sit there sending and receiving updates. This is starting to annoy me. I find it totally believable that I have made a mistake, I just find it strange that it just happens all of a sudden after I have done the exact same thing ten thousand times before.

Any ideas?

Some code (in Clojure, but using Java socket channels; if you know Java and are curious, you can probably figure out what I'm doing). The error occurs in this first file when I'm reading from the channel, in this piece (I get something negative out of getShort on the buffer):

Edit: Darn the indentation is messed up...


(zero? (.remaining rbuf)) (let [len (.. rbuf (flip) (getShort))
                                test (if (neg? len)
                                       (println (.getInetAddress (.socket chan))))
                                new-netmap (assoc netmap :read-tmp (ByteBuffer/allocate len))]


which is part of this next file

Shared networking code:

;;;; Remember the following while working with this code:
;;;;
;;;; * You must use channels and buffers when using select, not
;;;;   the socket streams.
;;;;
;;;; * When using select, you have to call functions in this order:
;;;;     select
;;;;     selectedKeys
;;;;     do stuff with keys in the set
;;;;     remove keys from the key set
;;;;
;;;;   You cannot rely only on the flags of selection keys.
;;;;
;;;; * When writing to buffers, first write, then flip, then read, then clear.
;;;;
;;;; * Newly created buffers have (.remaining buffer) = their length,
;;;;   so if you want to create a buffer used for writing, it can appear to
;;;;   have data in it from the beginning. Use (doto new-buff (.flip)).


(ns game.networking
  (:import java.nio.ByteBuffer
           (java.io ObjectOutputStream ByteArrayOutputStream
                    ObjectInputStream ByteArrayInputStream
                    IOException)))



(defn- to-byte-array
  "Turns obj into a byte[]."
  [obj]
  (with-open [baos (ByteArrayOutputStream.)
              oos (ObjectOutputStream. baos)]
    (.writeObject oos obj)
    (.toByteArray baos)))

(defn- from-byte-array
  "Turns the byte[] ba into an object."
  [ba]
  (with-open [bais (ByteArrayInputStream. ba)
              ois (ObjectInputStream. bais)]
    (.readObject ois)))



(defn- channel-operation
  "Calls op (a read/write fn) on chan and returns
  true if it filled/emptied the buffer."
  [op chan buffer]
  (op chan buffer)
  (zero? (.remaining buffer)))


(defn- read-into-buffer
  "Reads from chan into buffer. If the buffer becomes full,
  returns true, else false."
  [chan buffer]
  (channel-operation (memfn read buf) chan buffer))


(defn- write-from-buffer
  "Writes from buffer into chan. If the whole buffer was written,
  returns true, else false."
  [chan buffer]
  (channel-operation (memfn write buf) chan buffer))



(defn read-from-channel
  "Reads data from a channel found in netmap.
  The netmap needs to contain chan, rbuf and read-tmp.

  If netmap has been changed as a result of reading, calls
  new-netmap-f with the new netmap as argument.

  If a message has been completely read, calls new-msg-f
  on the message.


  Operates in the following manner:

  1) Checks if there is an incomplete message in read-tmp and,
     if so, tries to complete it.

  2) If not, checks if the length of the next message has
     been read, and if so, creates a new ByteBuffer of the
     correct length and assocs it with read-tmp in netmap.

  3) Otherwise, try to read the length of the next
     message into rbuf."
  [{:keys [chan rbuf read-tmp] :as netmap} new-netmap-f new-msg-f]
  (cond
    ; incomplete msg: read more
    read-tmp (if (read-into-buffer chan read-tmp)
               (let [msg (from-byte-array (.array read-tmp))
                     new-netmap (assoc netmap :read-tmp nil)]
                 (new-netmap-f new-netmap)
                 (new-msg-f msg new-netmap)
                 (recur new-netmap new-netmap-f new-msg-f)))
    ; have read length: prepare buffer for msg
    (zero? (.remaining rbuf)) (let [len (.. rbuf (flip) (getShort))
                                    test (if (neg? len)
                                           (println (.getInetAddress (.socket chan))))
                                    new-netmap (assoc netmap :read-tmp (ByteBuffer/allocate len))]
                                (.clear rbuf)
                                (new-netmap-f new-netmap)
                                (recur new-netmap new-netmap-f new-msg-f))
    ; new msg: read 2 bytes from chan into rbuf (the length of the next message)
    ; (tries to fill rbuf, so if it can only read 1 byte,
    ; will come back to here later)
    :else (if (read-into-buffer chan rbuf) (recur netmap new-netmap-f new-msg-f))))




(defn write-to-channel
  "Writes data to a channel, found in netmap. The
  netmap needs to contain chan, wbuf, write-tmp and write-q.

  If netmap has been changed as a result of writing, calls
  new-netmap-f with the new netmap as argument."
  [{:keys [chan wbuf write-tmp write-q] :as netmap} new-netmap-f]
  (cond
    ; length not written: write it
    (pos? (.remaining wbuf)) (if (write-from-buffer chan wbuf)
                               (recur netmap new-netmap-f))
    ; msg not written: write it
    write-tmp (if (write-from-buffer chan write-tmp)
                (let [new-netmap (assoc netmap :write-tmp nil)]
                  (new-netmap-f new-netmap)
                  (recur new-netmap new-netmap-f)))
    ; no incomplete msg: see if there's one in write-q
    write-q (let [test (assert (= (class "a") (class (last write-q))))
                  msg (to-byte-array (last write-q))
                  len (alength msg)]
              (.clear wbuf)
              (assert (pos? (short len)))
              (.putShort wbuf (short len))
              (.flip wbuf)
              (let [new-netmap (assoc netmap
                                      :write-tmp (ByteBuffer/wrap msg)
                                      :write-q (butlast write-q))]
                (new-netmap-f new-netmap)
                (recur new-netmap new-netmap-f)))))
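
For readers more at home in Java, the three-step read logic described in the read-from-channel docstring can be sketched roughly as follows. This is not a translation of the posted code: `FrameDecoder`, `feed` and `encode` are hypothetical names, the payload is plain UTF-8 instead of a serialized object, and the length is read as an unsigned 16-bit value (an assumption, since the posted code uses a signed getShort). Here `header` plays the role of rbuf and `body` the role of read-tmp.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the posted code: decodes frames of the
// form [2-byte big-endian length][payload] from arbitrarily split chunks.
final class FrameDecoder {
    private final ByteBuffer header = ByteBuffer.allocate(2); // like rbuf
    private ByteBuffer body;                                  // like read-tmp; null until the length is known

    // Consumes one chunk and returns every message completed by it.
    List<byte[]> feed(ByteBuffer chunk) {
        List<byte[]> frames = new ArrayList<>();
        while (chunk.hasRemaining()) {
            if (body == null) {
                // The 2-byte length itself may arrive split across chunks.
                header.put(chunk.get());
                if (header.hasRemaining()) {
                    continue;
                }
                header.flip();
                // Mask to read the length as unsigned (0..65535).
                int length = header.getShort() & 0xFFFF;
                header.clear();
                body = ByteBuffer.allocate(length);
            } else {
                // Copy as much of the payload as this chunk provides.
                int n = Math.min(chunk.remaining(), body.remaining());
                ByteBuffer slice = chunk.duplicate();
                slice.limit(slice.position() + n);
                body.put(slice);
                chunk.position(chunk.position() + n);
            }
            if (!body.hasRemaining()) {
                frames.add(body.array());
                body = null;
            }
        }
        return frames;
    }

    // Builds one frame: length prefix followed by the UTF-8 payload.
    static byte[] encode(String msg) {
        byte[] payload = msg.getBytes(StandardCharsets.UTF_8);
        if (payload.length > 0xFFFF) {
            throw new IllegalArgumentException("message too long for a 2-byte length");
        }
        return ByteBuffer.allocate(2 + payload.length)
                .putShort((short) payload.length)
                .put(payload)
                .array();
    }

    public static void main(String[] args) {
        FrameDecoder decoder = new FrameDecoder();
        // Feed one frame a single byte at a time, the worst possible split.
        for (byte b : encode("hello")) {
            for (byte[] frame : decoder.feed(ByteBuffer.wrap(new byte[] {b}))) {
                System.out.println(new String(frame, StandardCharsets.UTF_8)); // prints "hello"
            }
        }
    }
}
```

The key property is that the decoder keeps its partial state between calls, so a message split at any byte boundary is still reassembled correctly.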


Module on the client side. Everything is sent with send-msg, and every msg is a String.

(ns game.networking.client
  (:import java.net.InetSocketAddress
           java.nio.channels.SocketChannel
           java.nio.ByteBuffer)
  (:require [game.networking :as net])
  (:use game.utility))


(def- netmap (atom nil))


(def- msgs (atom nil))

(defn get-msgs []
  (let [ms (reverse @msgs)]
    (reset! msgs nil)
    ms))



(defn send-msg [msg]
  (let [new-netmap (update-in @netmap [:write-q] conj msg)]
    (reset! netmap new-netmap)))


; make private later!
(defn update-netmap [nm]
  (reset! netmap nm))

; make private later!
(defn handle-msg [msg _]
  (swap! msgs conj (read-string msg)))


; make private later!
(defn create-netmap [ip port]
  {:chan (let [sc (doto (SocketChannel/open (InetSocketAddress. ip port))
                    (.configureBlocking false))]
           (.. sc (socket) (setTcpNoDelay true))
           sc)
   :write-q nil
   :read-tmp nil
   :write-tmp nil
   :rbuf (ByteBuffer/allocate 2)
   :wbuf (doto (ByteBuffer/allocate 2) (.flip))})


(defn connect-to-server [ip port]
  (let [nm (create-netmap ip port)]
    (update-netmap nm)))


(defn update-network []
  (net/write-to-channel @netmap update-netmap)
  (net/read-from-channel @netmap update-netmap handle-msg))



A networking module on the server side:
(ns game.networking.server
  (:import java.nio.ByteBuffer
           java.net.InetSocketAddress
           (java.nio.channels Selector ServerSocketChannel SocketChannel
                              SelectionKey)
           java.io.IOException)
  (:use game.utility
        clojure.contrib.repl-utils)
  (:require [game.networking :as net]))



; make private later!
(def- clients (atom {}))
; make private later!
(def- id-counter (atom 0))
; make private later!
(def- msgs (atom nil))

(def- connected-addresses (atom nil))


(def- disconnected-clients (atom nil))

; debug
(def- bugg (atom []))



(defn get-msgs []
  (let [ms (reverse @msgs)]
    (reset! msgs nil)
    ms))



(defn send-msg [ids msg]
  (dorun (map (fn [id]
                (let [new-clients (update-in @clients [id :write-q] conj msg)]
                  (reset! clients new-clients)))
              ids)))



(defn create-server
  "Creates a non-blocking TCP server that listens for incoming
  connections and, when a client connects, calls f on the new
  client's SelectionKey.

  The server uses select to check for incoming connections and
  to check for incoming data on established connections. Returns
  a function that should be called periodically to call select.
  The returned function returns the set of selected SelectionKeys."
  [port f]
  (let [selector (Selector/open)
        ssc (doto (ServerSocketChannel/open) (.configureBlocking false))
        ss (doto (.socket ssc) (.bind (InetSocketAddress. port)))
        selection-key (.register ssc selector SelectionKey/OP_ACCEPT)]
    (fn []
      (swap! bugg conj (count (.keys selector)))
      (.select selector)
      (let [keys (.selectedKeys selector)]
        ; One has to check if the selected keys contains a key, not just the keys flags
        (if (and (.contains keys selection-key) (.isAcceptable selection-key))
          (let [sc (doto (.accept ssc) (.configureBlocking false))
                client-key (doto (.register sc selector (bit-or SelectionKey/OP_READ
                                                                SelectionKey/OP_WRITE))
                             (.attach (swap! id-counter inc)))]
            (show (.socket sc))
            (flush)
            (swap! connected-addresses conj (.getInetAddress (.socket sc)))
            (.. sc (socket) (setTcpNoDelay true))
            (f client-key)
            (.remove keys selection-key)))
        keys))))


(defn- new-client
  "Takes a SelectionKey as input and returns a map
  of various player-related things."
  [selection-key]
  {:s-key selection-key
   :chan (.channel selection-key)
   :write-q nil
   :read-tmp nil
   :write-tmp nil
   :rbuf (ByteBuffer/allocate 2)
   :wbuf (doto (ByteBuffer/allocate 2) (.flip))
   :id (.attachment selection-key)})


(defn add-new-client
  "Adds a new client to players."
  [selection-key]
  (let [client (new-client selection-key)]
    (swap! clients assoc (.attachment selection-key) client)))



(defn- collect-new-msg [msg netmap]
  (swap! msgs conj (conj (read-string msg) (netmap :id))))


(defn- update-client-netmap [{id :id :as netmap}]
  (swap! clients assoc id netmap))


(defn- disconnect-client [{:keys [id chan]}]
  (swap! disconnected-clients conj id)
  (.close chan)
  (swap! clients dissoc id))


(defn get-disconnected-clients []
  (let [dcs @disconnected-clients]
    (reset! disconnected-clients nil)
    dcs))

(defn update-network
  "Calls the fn server, which performs a select operation on the server,
  and accepts new incoming connections. Also reads messages from the clients,
  and collects them, in the form of lists, in msgs."
  [server]
  (let [keys (server)]
    (doseq [key keys]
      (let [client (@clients (.attachment key))]
        (try
          (if (.isReadable key)
            (net/read-from-channel client update-client-netmap collect-new-msg))
          (if (and (client :write-q) (.isWritable key))
            (net/write-to-channel client update-client-netmap))
          (catch IOException e (disconnect-client client)))))
    (.clear keys)))
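
One thing that may be worth checking in the server's update-network above: `client` is looked up once and then passed to both read-from-channel and write-to-channel. If the read step stores a new netmap (for example one with a partially filled :read-tmp) and the write step then stores a netmap derived from the older `client`, the read progress would be overwritten, and the next read would treat two bytes from the middle of a message as a length. Whether this is what actually happens in these runs is not confirmed anywhere in the thread. The Java sketch below only models the pattern; `StaleSnapshotDemo`, `State` and the field names are invented for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified, hypothetical model of one client's entry in the `clients` atom.
final class StaleSnapshotDemo {
    // Immutable state, like a Clojure map: every "update" builds a new value.
    static final class State {
        final String readTmp; // stands in for the partially filled read buffer
        final int queued;     // stands in for the number of messages in write-q

        State(String readTmp, int queued) {
            this.readTmp = readTmp;
            this.queued = queued;
        }

        State withReadTmp(String r) { return new State(r, queued); }
        State withQueued(int q) { return new State(readTmp, q); }
    }

    // Mimics the pattern in question: one snapshot is used for both steps.
    static State updateWithOneSnapshot(AtomicReference<State> store) {
        State client = store.get();
        store.set(client.withReadTmp("partial message")); // read step stores its progress
        store.set(client.withQueued(client.queued - 1));  // write step starts from the OLD value
        return store.get();
    }

    // Variant that looks the state up again before the write step.
    static State updateWithFreshLookup(AtomicReference<State> store) {
        State client = store.get();
        store.set(client.withReadTmp("partial message"));
        State fresh = store.get();
        store.set(fresh.withQueued(fresh.queued - 1));
        return store.get();
    }

    public static void main(String[] args) {
        State a = updateWithOneSnapshot(new AtomicReference<>(new State(null, 1)));
        State b = updateWithFreshLookup(new AtomicReference<>(new State(null, 1)));
        System.out.println("one snapshot: readTmp = " + a.readTmp); // prints "one snapshot: readTmp = null"
        System.out.println("fresh lookup: readTmp = " + b.readTmp); // prints "fresh lookup: readTmp = partial message"
    }
}
```

If this is the cause, it would only show up when a message happens to be split across two reads while a write is pending for the same client, which could explain why it appears so rarely on 127.0.0.1.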


TCP doesn't corrupt your data.


I believe Google published an article on errors in their data centers. Data corruption is on the order of 1 per petabyte (maybe even messages) or so. It's absurdly low, but it's there. Only application-level checksums detected those.

It's also highly unlikely that it happened here; there are way too many other, more likely factors to consider.
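
To illustrate what an application-level checksum can look like, here is a minimal Java sketch that appends a CRC-32 to each message and verifies it on receipt. It uses `java.util.zip.CRC32` from the standard library; `CheckedMessage`, `seal` and `open` are hypothetical names, and nothing like this appears in the posted code.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical helper: payload followed by a 4-byte CRC-32 trailer.
final class CheckedMessage {
    // Appends the CRC-32 of the payload to the payload.
    static byte[] seal(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + 4)
                .put(payload)
                .putInt((int) crc.getValue())
                .array();
    }

    // Returns the payload, or throws if the trailer does not match it.
    static byte[] open(byte[] sealed) {
        if (sealed.length < 4) {
            throw new IllegalArgumentException("too short to contain a checksum");
        }
        ByteBuffer buf = ByteBuffer.wrap(sealed);
        byte[] payload = new byte[sealed.length - 4];
        buf.get(payload);
        int expected = buf.getInt();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if ((int) crc.getValue() != expected) {
            throw new IllegalStateException("checksum mismatch");
        }
        return payload;
    }

    public static void main(String[] args) {
        byte[] sealed = seal(new byte[] {10, 20, 30});
        System.out.println(open(sealed).length); // prints "3"
    }
}
```

A mismatch here would point at corruption somewhere between `seal` and `open`, whereas a garbage value with a matching checksum would point back at the sending code.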



However, nio used to have some obscure bugs and unexpected or undefined behavior, some of it even conflicting with the official documentation. My bet in this case would be that the networking errors are caused by those. They were never fixed, even in the standard library, and most ended up as WONTFIX. They also depend on individual platform stacks, so they are not universally reproducible. The problems are caused by inconsistent implementations of the Berkeley socket API and incorrect interpretation of something or another, I forget.


Ideally, you would go rummaging through Sun's bug tracker for nio-related issues and try to ensure they don't occur in the Clojure bindings. But that's a rather tedious task. The Mina project (IIRC) built a framework on top of nio which might contain some more insight into that.

I also seem to remember there being a race condition or synchronization bug with the nio selector.

Besides that, there could be a lot wrong in the Clojure bindings. The selectors work fine most of the time, but I do recall there being some obscure edge cases that simply aren't handled. I think even most Java examples are broken in this way.

I'm sending the exact same message about 30000 times or so

(zero? (.remaining rbuf)) (let [len (.. rbuf (flip) (getShort))



Something to consider: The maximum number in a signed short is 32767. If you use a message sequence number or similar, and store it as a short, you would get an error after that value.
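
A small Java illustration of that limit, and of how a value above 32767 turns negative when it is stored in or read back from a signed short. `ShortWrap` and `toUnsigned` are illustrative names only.

```java
// Shows how 16-bit signed values wrap, and how masking recovers the
// unsigned interpretation of the same two bytes.
final class ShortWrap {
    static int toUnsigned(short s) {
        return s & 0xFFFF;
    }

    public static void main(String[] args) {
        short counter = Short.MAX_VALUE;         // 32767
        counter++;                               // wraps around
        System.out.println(counter);             // prints "-32768"
        System.out.println(toUnsigned(counter)); // prints "32768"
        System.out.println((short) 40000);       // prints "-25536"
    }
}
```

So a negative result from getShort is what one would see if the two bytes encode anything from 32768 to 65535, whether that is a counter, an oversized length, or bytes that are not a length at all.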

Sounds like you have a reproducible case, though. That's good! Print out the number of messages you have received every 100 messages or so, and check what the value is after the crash. Then set a breakpoint after that number of messages and re-run the case, so you can debug when the crash happens. Or, if you have access to VMware Workstation, try using Replay Debugging.

You can also log all the data to a big file after receiving it, for later analysis. You may be able to then pipe that file back into the server to repeat the behavior that clients already had, to reproduce the crash faster so you can debug it.
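
A rough Java sketch of that kind of capture log, assuming the server can hand each freshly read buffer to a recorder before parsing it. `CaptureLog`, `record` and `replay` are hypothetical names; the file is a plain concatenation of the received bytes in arrival order.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: appends every received chunk to a file for later replay.
final class CaptureLog implements AutoCloseable {
    private final FileChannel out;

    CaptureLog(Path file) throws IOException {
        out = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Call with a buffer that is ready for reading (already flipped).
    // A duplicate is written so the caller's position and limit stay untouched.
    void record(ByteBuffer received) throws IOException {
        ByteBuffer copy = received.duplicate();
        while (copy.hasRemaining()) {
            out.write(copy);
        }
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    // Reads the whole capture back, e.g. to feed it into the parser again.
    static byte[] replay(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("capture", ".bin");
        try (CaptureLog log = new CaptureLog(file)) {
            log.record(ByteBuffer.wrap(new byte[] {0, 2, 104, 105}));
        }
        System.out.println(replay(file).length); // prints "4"
        Files.delete(file);
    }
}
```

With one such file per connection, the bytes just before the bad length can be inspected directly, which should show whether the sender produced them or the receiver lost its place in the stream.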

I'm not that familiar with Clojure (been 20 years since I did Scheme :-) so I didn't take the time to read through all your code, sorry. Maybe someone else on the board?

Clojure does not use bindings. They are the exact same Java classes.

But OK, that sounds bad...

