I noticed that I'm sending a lot of data to my GPU, and I was wondering whether sending a lot of data (into a vertex buffer) could be very slow. For each vertex, I send 48 bytes (position, normal, texture and color). I also want to send one more byte to the GPU, but because HLSL doesn't support bytes, I'm forced to send a short (2 bytes). I found a way to store my normal, my color and my new byte in one float (they all share the same variable) so that it only uses 4 bytes instead of 18. Is it worth doing binary operations on both the CPU and the GPU to compress/decompress my normal, my color and my new byte into one float, or is it better to send more data? Which choice is best for my game's performance?
In general, with GPUs you want to prefer math over memory access. Dedicated GPUs usually have far more ALU throughput than memory bandwidth to feed it. But of course in reality it depends on the hardware and the workload, so you should be careful not to jump to any conclusions without profiling.
However, my bigger concern would be precision. How are you going to store a normal, a color, and another byte in just 4 bytes? Typically you'll want at least 16 bits per component for normals.
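To put rough numbers on the precision concern, here is a minimal C++ sketch of signed-normalized quantization of one normal component. The helper names (max_code, quantize_snorm, dequantize_snorm, max_component_error) are hypothetical and the bit widths are only examples; this is not the packing scheme from the question, which isn't described in enough detail to reproduce.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers for illustration; not taken from the thread.

// Largest positive code for a signed normalized value stored in `bits` bits.
inline float max_code(int bits) {
    assert(bits >= 2 && bits <= 16);
    return static_cast<float>((1 << (bits - 1)) - 1);
}

// Map v in [-1, 1] to an integer code in [-max_code, max_code].
// Values outside the range are clamped first.
inline std::int32_t quantize_snorm(float v, int bits) {
    const float clamped = std::fmax(-1.0f, std::fmin(1.0f, v));
    return static_cast<std::int32_t>(std::lround(clamped * max_code(bits)));
}

// Inverse mapping: integer code back to a float in [-1, 1].
inline float dequantize_snorm(std::int32_t code, int bits) {
    return static_cast<float>(code) / max_code(bits);
}

// Worst-case round-trip error per component: half of one quantization step.
inline float max_component_error(int bits) {
    return 0.5f / max_code(bits);
}
```

With this scheme, 8 bits per component gives a worst-case error of 0.5/127 (about 0.004), while 16 bits gives 0.5/32767 (about 0.000015). If a normal, a color and an extra byte all share 32 bits, each normal component gets well under 8 bits, so it's worth checking whether the resulting lighting error is visible before committing to the format.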
In general, the memory transactions will happen in the background. They're also very fast, don't pollute the CPU cache, don't cause cache-line contention, and can happen while your 3D card is busy doing something else, just as the CPU can be.
The CPU is pretty fast at running tight loops, but there will still be tons of branches, cache misses, hyperbus communications and other assorted friction. In addition, it can't parallelise the work. Mapping more of the GPU memory, doing simple writes into it, and then leaving both devices to get on with other work while other (specialised and faster) parts of the hardware deal with moving the memory around is definitely the way to go on desktop systems.
If some of the data is constant or not updated often, consider using two buffers, one for data that changes frequently and one for data that changes rarely, and use two accesses in the shaders to combine the data. This will reduce the amount of memory that needs to be moved between the devices across the memory buses. The extra access is usually not much of a problem in the shaders, because the shader cores will each have memory controllers (often one for each buffer), and you're still processing the vertices in linear order, so you'll still get good cache read-ahead on them.
An example would be a skinned model. The texture coordinates and colour at each vertex will typically not change; it's just the vertex position that gets updated, so by using two buffers, the memory transmission size can easily be reduced by 2/3.
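The 2/3 figure can be checked with a small C++ sketch. The vertex layout here is an assumption made for illustration (float3 position, float2 texture coordinates, float4 colour, 36 bytes in total); the struct and function names are hypothetical, and a real format may differ.

```cpp
#include <cstddef>

// Assumed layout for illustration only; not taken from the thread.
struct DynamicVertex {   // rewritten every frame by skinning
    float position[3];   // 12 bytes
};

struct StaticVertex {    // uploaded once, then left alone
    float uv[2];         // 8 bytes
    float colour[4];     // 16 bytes
};

static_assert(sizeof(DynamicVertex) == 12, "unexpected padding");
static_assert(sizeof(StaticVertex) == 24, "unexpected padding");

// Bytes sent per frame if everything lives in one interleaved buffer.
constexpr std::size_t bytes_uploaded_interleaved(std::size_t vertex_count) {
    return vertex_count * (sizeof(DynamicVertex) + sizeof(StaticVertex));
}

// Bytes sent per frame if only the dynamic buffer is rewritten.
constexpr std::size_t bytes_uploaded_split(std::size_t vertex_count) {
    return vertex_count * sizeof(DynamicVertex);
}
```

For a 10,000-vertex model under this assumed layout, the interleaved buffer moves 360,000 bytes per frame and the split version moves 120,000, which is the 2/3 reduction. If the normal also changes every frame and has to go in the dynamic buffer, the saving shrinks accordingly.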