Firstly, remove the branch from the for loop. Iterate over all 4 weights regardless of whether they are 0 or not; a weight of 0 simply contributes nothing to the sum. Negative weights should be rejected on the CPU side before the data ever reaches the shader.
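As a rough sketch of that CPU-side cleanup (C++ here; SanitizeWeights is a hypothetical helper name, and clamping then renormalizing is just one possible policy, you may prefer to reject the mesh outright at export time):

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical helper, not from any particular engine.
// Clamps negative weights to 0 and renormalizes so the 4 weights sum to 1,
// which lets the shader loop over all 4 weights without any branch.
std::array<float, 4> SanitizeWeights(std::array<float, 4> weights) {
    float sum = 0.0f;
    for (float& w : weights) {
        w = std::max(w, 0.0f);  // negative influences are not allowed
        sum += w;
    }
    if (sum <= 0.0f) {
        // Assumption: with no usable influence, bind fully to the first slot.
        return std::array<float, 4>{1.0f, 0.0f, 0.0f, 0.0f};
    }
    for (float& w : weights) {
        w /= sum;
    }
    return weights;
}
```

Run this once when the mesh is loaded or exported, not per frame. Unused slots end up with weight 0 and any valid index, so the shader can process them unconditionally.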
Secondly, you only need to upload as many bones as are referenced by the part of the model you are rendering. For instance, a mech-machine will likely be broken into 1 mesh for each leg, 1 or 2 or so for the body, some for the weapons, etc.
You aren’t rendering the entire model all in one pass, but in multiple passes, each of which renders a smaller part of the model. If you are rendering the front-left leg, there is no reason to send bone information for the back-right leg. Reducing the number of bones you send cuts bandwidth substantially and will likely be one of the largest performance gains you see.
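One way to get there, sketched in C++ under my own assumptions (BonePalette and BuildPalette are hypothetical names, and the bone indices are assumed to be stored as a flat array of skeleton-wide indices per sub-mesh), is to build a compact palette for each sub-mesh and remap its vertex indices into that palette:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical structure: the result of compacting one sub-mesh's bone usage.
struct BonePalette {
    std::vector<std::uint32_t> globalBones;   // palette slot -> skeleton bone index
    std::vector<std::uint32_t> localIndices;  // per-vertex indices remapped to slots
};

// Walks the sub-mesh's bone indices, assigns each distinct bone the next
// free palette slot, and rewrites every index to point at its slot.
BonePalette BuildPalette(const std::vector<std::uint32_t>& vertexBoneIndices) {
    BonePalette out;
    std::unordered_map<std::uint32_t, std::uint32_t> slotOf;
    out.localIndices.reserve(vertexBoneIndices.size());
    for (std::uint32_t bone : vertexBoneIndices) {
        auto it = slotOf.find(bone);
        if (it == slotOf.end()) {
            const auto slot = static_cast<std::uint32_t>(out.globalBones.size());
            it = slotOf.emplace(bone, slot).first;
            out.globalBones.push_back(bone);
        }
        out.localIndices.push_back(it->second);
    }
    return out;
}
```

At draw time you would upload only the matrices listed in globalBones, in that order, and the sub-mesh's vertex buffer would carry localIndices instead of the skeleton-wide indices. Do the remapping offline or at load time.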
The rest of my suggestions may perform exactly the same or may be faster, so you will have to test. The shader compiler will likely be smart enough not to perform array look-ups every time, but you can be sure by storing Input.Weights[ i ] to a temporary and using that instead of repeated array access. The same goes for Input.Indices[ i ] and possibly even BoneMatrices[ Input.Indices[ i ] ].
Try various combinations of storing these to temporaries, benchmark, and repeat.
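To show the shape of the loop I mean, here is a CPU-side mirror of it in C++ rather than actual shader code. Input.Weights, Input.Indices and BoneMatrices follow the names above; Vec3, Mat3x4, VertexInput, Transform and SkinPosition are made up for the illustration, and the 3x4 row-major layout is an assumption:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Assumed layout: row-major 3x4 affine matrix, translation in column 3.
struct Mat3x4 { float m[3][4]; };

struct VertexInput {
    Vec3 Position;
    std::array<float, 4> Weights;
    std::array<std::uint32_t, 4> Indices;
};

Vec3 Transform(const Mat3x4& a, const Vec3& p) {
    return Vec3{
        a.m[0][0] * p.x + a.m[0][1] * p.y + a.m[0][2] * p.z + a.m[0][3],
        a.m[1][0] * p.x + a.m[1][1] * p.y + a.m[1][2] * p.z + a.m[1][3],
        a.m[2][0] * p.x + a.m[2][1] * p.y + a.m[2][2] * p.z + a.m[2][3]};
}

Vec3 SkinPosition(const VertexInput& Input,
                  const std::vector<Mat3x4>& BoneMatrices) {
    Vec3 result{0.0f, 0.0f, 0.0f};
    for (int i = 0; i < 4; ++i) {  // always 4 iterations, no test for zero
        // Each look-up is done once and held in a temporary.
        const float weight = Input.Weights[i];
        const std::uint32_t index = Input.Indices[i];
        const Mat3x4& bone = BoneMatrices[index];
        const Vec3 p = Transform(bone, Input.Position);
        result.x += p.x * weight;
        result.y += p.y * weight;
        result.z += p.z * weight;
    }
    return result;
}
```

In the real shader the same three temporaries apply; whether keeping all of them, some of them, or none is fastest depends on the compiler and hardware, which is why you benchmark each variation.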
L. Spiro