Do you happen to know a good book about it? I ordered "The Audio Programming Book", and I already know a fair bit of DSP and audio stuff thanks to my degree. I think what I'll mostly be lacking is the math and some of the programming.
Again this is not for a full career or for a degree, it's just to learn on the side.
I have little idea about books here.
I mostly just learned stuff by writing code, using a fair bit of guesstimation and trial-and-error, and in some cases trying to scavenge information off the internet.
but, yeah, as noted by another poster, relevant information is hard to find...
even then, one may still end up with issues, which need to be solved in slightly inelegant ways.
one recent example: while recording in-game video, I noticed that the audio and video were out of sync.
the video was the frames as seen by the renderer, and the audio was whatever was coming out of the in-game mixer at that particular moment. however, the audio was slightly ahead of the video. the solution was basically to insert an audio delay into the video recording, then tune the delay values until they matched up. why? because it apparently takes a little bit of time between when audio is mixed in-game and when it comes out of the speakers.
but, yeah, the major thing about audio, I suspect, is mostly knowing basic programming stuff, and being generally familiar with working with arrays.
for example, your audio data will typically be in the form of arrays of "samples" at a particular "sample rate".
if you want to produce output samples, typically this will consist of a loop which calculates each sample and puts it into the output array.
typically, the input audio is also in the form of arrays of samples, so the position of the current sample being mixed may be used to calculate the position of the input samples you want to mix, ...
however, often the desired input position doesn't land exactly on a sample, so then we interpolate. for example, a common strategy is linear interpolation, or "lerp" (invoking math here):
lerp(a, b, t)=(1-t)*a+t*b
where a and b are the adjacent input samples, and t is the position between a and b.
another option is using a spline, for example, one possible spline function:
splerp(a, b, c, d, t)=lerp(lerp(b, 2*b-a, t), lerp(2*c-d, c, t), t)
where a,b,c,d are the adjacent input samples, and the desired value is between b and c.
2*b-a and 2*c-d are what are known as linear extrapolation, where c'=2*b-a (c' being the value of c as predicted by extrapolating from a and b).
we effectively then form a pair of predictions, and then interpolate between these predictions to get an answer.
the idea here is basically that with a series of points, you might have a curve which passes through these points, and it may make sense to be able to answer a question "given these points, where will the value be, approximately?".
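the lerp and splerp formulas above translate pretty much directly into code:

```c
/* direct translation of lerp(a, b, t) = (1 - t)*a + t*b */
static float lerp(float a, float b, float t)
{
    return (1 - t) * a + t * b;
}

/* extrapolate forward from a,b and backward from d,c, then blend
   the two predictions; desired value lies between b and c */
static float splerp(float a, float b, float c, float d, float t)
{
    return lerp(lerp(b, 2 * b - a, t), lerp(2 * c - d, c, t), t);
}
```

note that at t=0 splerp gives back b, and at t=1 it gives back c, so the curve does pass through the known points.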
this works pretty well if the input and output sample rates are "similar", but if there are considerable sample rate differences (such as in a MIDI synth), then the audio quality may suffer (due to "aliasing" or similar).
one strategy that exists is to start with audio at a higher sample-rate (say, 48kHz or 44.1kHz), and then recursively downsample it by factors of 1/2 (say, by averaging pairs of samples), for example, we create versions of the audio at various sample rates:
44.1kHz, 22.05kHz, 11.025kHz, 5.5125kHz, ...
then you can calculate approximately which sample rate you need, fetch interpolated samples from the two adjacent rates in the chain, and then interpolate between those two results to get the desired sample. (if familiar with the idea of mipmapping in graphics, this is very similar...).
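the downsample-by-halving step itself is pretty simple; a sketch (averaging pairs of samples into a newly allocated buffer; names made up):

```c
#include <stdlib.h>
#include <stddef.h>

/* halve the sample rate by averaging pairs of samples;
   returns a newly allocated buffer of count/2 samples
   (caller frees; no NULL check here, this is just a sketch) */
float *downsampleHalf(const float *in, size_t count, size_t *outCount)
{
    size_t n = count / 2;
    float *out = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; i++)
        out[i] = 0.5f * (in[2 * i] + in[2 * i + 1]);
    *outCount = n;
    return out;
}
```

applying this repeatedly to its own output builds the chain of rates listed above.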
most of this can be wrapped up in a function though, as normally you don't want to deal with all this stuff every time you want a sample.
example:
float patchSamplerInterpolate(patchSampler patch, double sampleBase, double targetRate);
where patchSampler here may represent a given piece of audio (a patch or waveform or whatever term is used for this).
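as a very rough sketch of what the inside of such a function might look like (the struct fields and names here are invented for illustration, and this simplified version only does plain linear interpolation against a time in seconds, no mip-chain):

```c
#include <stddef.h>

typedef struct {
    const float *samples; /* raw sample data */
    long count;           /* number of samples */
    double rate;          /* native sample rate, in Hz */
} patchSampler;

/* fetch a linearly interpolated sample at a time position in seconds;
   out-of-range positions are treated as silence here */
float patchSamplerLerp(const patchSampler *patch, double timeSec)
{
    double pos = timeSec * patch->rate; /* position in samples */
    long i = (long)pos;
    double t = pos - (double)i; /* fractional part, 0..1 */
    if (pos < 0 || i + 1 >= patch->count)
        return 0.0f;
    float a = patch->samples[i], b = patch->samples[i + 1];
    return (float)((1.0 - t) * a + t * b);
}
```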
other various thoughts:
consider picking some unit other than samples as your unit of time measure; for example, it may make sense to do a lot of calculations in terms of seconds (when dealing with lots of audio at different sample rates, or with a lot of scaling, seconds may work better as the basic time unit);
use double (and not float) for audio sample positions and time-based calculations (when it comes to sub-sample accuracy over time-frames of many minutes or more, float doesn't really hold up well);
it may be useful to consider what happens when time values are before the start or after the end of a given patch/waveform: for example, does it loop, or is it followed by silence? so, for example, the interpolation function might take a flag indicating whether the sound is discrete (non-looping) or continuous (looping), and then generate sane values for out-of-range sample positions.
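a sketch of how such out-of-range positions might be handled (the looping flag idea above; names invented here):

```c
/* map an arbitrary (possibly out-of-range) sample position to a value,
   either wrapping around (looping) or returning silence (non-looping) */
float sampleAt(const float *buf, long count, long pos, int looping)
{
    if (looping)
    {
        pos %= count;
        if (pos < 0)       /* C's % can go negative, so fix up the wrap */
            pos += count;
        return buf[pos];
    }
    if (pos < 0 || pos >= count)
        return 0.0f; /* silence outside the sound */
    return buf[pos];
}
```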
potentially, various effects may be implemented in terms of functions, either using raw sample arrays, or by building on top of abstracted interpolation functions (there are tradeoffs here, raw arrays can be faster, but tend to be a little more hairy/nasty).
another tradeoff is whether to store audio data as 16-bit shorts or similar, or as floats.
in my case, I tend to use 16-bit PCM (or sometimes compressed representations) for storing raw audio data (sound-effects, ...), but floating-point arrays for intermediate audio data (stuff currently being mixed, ...).
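converting between the two is mostly a scale-and-clamp; a sketch (dividing by 32768 is one common convention, though others exist):

```c
/* 16-bit PCM -> normalized float, roughly -1.0..1.0 */
float pcm16ToFloat(short s)
{
    return s / 32768.0f;
}

/* normalized float -> 16-bit PCM, clamping to avoid overflow/wraparound */
short floatToPcm16(float f)
{
    float v = f * 32768.0f;
    if (v > 32767.0f)
        v = 32767.0f;
    if (v < -32768.0f)
        v = -32768.0f;
    return (short)v;
}
```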
but, yeah, otherwise dunno...