
Member Since 25 Mar 2007

Posts I've Made

In Topic: Matrix 16 byte alignment

21 October 2016 - 06:23 PM

Your problem was resolved by making your code run slower (potentially a _lot_ slower, depending on the particular CPU and how badly it deals with unaligned SSE loads/stores).


Not in my experience. _mm_loadu_ps() was only a few percent slower (a cycle at most) than _mm_load_ps() when I benchmarked on an Intel i7, and that extra cost isn't even measurable when the address happens to be aligned. Use aligned loads whenever you can guarantee alignment, but it's more of a micro-optimization. You'll save far more time by thinking carefully about how to lay out your data for better cache utilization, so that you don't pay tens of cycles on each memory access.
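For concreteness, here's a minimal sketch contrasting the two intrinsics (x86/x86-64 with SSE only). When the address happens to be 16-byte aligned, both loads return the same lanes; the only difference is that _mm_load_ps() faults on an unaligned address while _mm_loadu_ps() accepts any address:

```cpp
#include <immintrin.h> // SSE intrinsics (x86/x86-64 only)

// Compare an aligned and an unaligned load of the same 16-byte-aligned
// address; both should produce identical lanes.
inline bool loadsMatch() {
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    __m128 va = _mm_load_ps(a);   // requires a 16-byte-aligned address
    __m128 vu = _mm_loadu_ps(a);  // works for any address
    alignas(16) float ra[4], ru[4];
    _mm_store_ps(ra, va);
    _mm_store_ps(ru, vu);
    for (int i = 0; i < 4; ++i)
        if (ra[i] != ru[i]) return false;
    return true;
}
```

On recent Intel cores the unaligned variant is essentially free when the data is in fact aligned, which is why defaulting to _mm_loadu_ps() is a reasonable trade-off when you can't prove alignment.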

In Topic: Audio System

09 October 2016 - 12:32 PM

You're on the right track. In my audio system I have a big update( const Scene& scene, float dt ) method that copies the current state of all listeners/sources/objects into internal data structures. I'm also simulating sound propagation effects using path tracing, so that computation has to run as a task on a separate thread. Once the sound propagation impulse responses are computed, I update the audio rendering thread with the new data by atomically swapping the IRs in a triple-buffered setup.

You can use a similar strategy: copy your parameters into one end of a triple buffer, then use an atomic operation to rotate through the buffers. One set of parameters is the rendering thread's current interpolation state, another is the target interpolation state, and the third is where the main thread writes the next set of parameters. The key is to rotate through the buffers only once the rendering thread has finished the previous interpolation (this requires another atomic variable to signal completion). If the main thread updates the parameters more often than that, the update is simply ignored.

You only need to update audio information at 10-15 frames per second anyway; anything faster is perceptually overkill.
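The rotation described above can be sketched like this in C++ (the Params struct and its fields are hypothetical placeholders, and the memory orderings are kept conservative; a production version would want more careful analysis of the writer/reader swap ordering):

```cpp
#include <array>
#include <atomic>

struct Params { float gain = 0.0f; float pitch = 1.0f; }; // placeholder fields

class TripleBuffer {
public:
    // Main thread: publish a new parameter set. If the audio thread hasn't
    // consumed the previous one yet, it is simply overwritten (ignored).
    void write(const Params& p) {
        slots[writeIdx] = p;
        // Swap the write slot with the shared "pending" slot, then signal.
        writeIdx = pendingIdx.exchange(writeIdx, std::memory_order_acq_rel);
        hasFresh.store(true, std::memory_order_release);
    }

    // Audio thread: call only after the previous interpolation finished.
    // Returns true and fills `out` when a newer parameter set is available.
    bool read(Params& out) {
        if (!hasFresh.exchange(false, std::memory_order_acq_rel))
            return false; // nothing new since last read
        readIdx = pendingIdx.exchange(readIdx, std::memory_order_acq_rel);
        out = slots[readIdx];
        return true;
    }

private:
    std::array<Params, 3> slots{};
    int writeIdx = 0, readIdx = 2;    // each thread owns its own slot
    std::atomic<int> pendingIdx{1};   // the slot currently in transit
    std::atomic<bool> hasFresh{false};
};
```

The `hasFresh` flag plays the role of the extra atomic mentioned above: the audio thread only rotates when it has finished interpolating, and redundant main-thread writes just replace the pending slot.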

In Topic: Stretch audio without distortion, nor pitch-shift

13 September 2016 - 12:54 PM

There are two main ways to accomplish time stretching, and both will cause some artifacts:

  • Time-domain interpolation - stretch the audio samples by the stretch factor by interpolating between nearby samples (with sinc, cubic, or linear interpolation). This causes pitch shifting because the waveform itself is stretched out or compressed.
  • Frequency-domain stretching - the input audio is windowed into power-of-two-sized chunks and converted to the frequency domain with an FFT. You can then repeat each chunk twice to stretch by a factor of 2, or do something more sophisticated to handle non-integer stretch factors. This method doesn't pitch-shift, but it smears transients in the input audio (with a big FFT size) or loses low-frequency resolution (with a small FFT size). It also introduces latency if you're processing streaming audio in real time.
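The first method is only a few lines. A minimal sketch of linear-interpolation time stretching (assumes a non-empty mono buffer; a real resampler would use sinc or cubic interpolation for better quality):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Time-domain stretch by linear interpolation. stretch > 1 lengthens the
// audio, and the pitch drops by the same factor (an octave at stretch = 2).
std::vector<float> stretchLinear(const std::vector<float>& in, double stretch) {
    std::size_t outLen = static_cast<std::size_t>(in.size() * stretch);
    std::vector<float> out(outLen);
    for (std::size_t i = 0; i < outLen; ++i) {
        double srcPos = i / stretch;                       // position in input
        std::size_t i0 = static_cast<std::size_t>(srcPos);
        std::size_t i1 = std::min(i0 + 1, in.size() - 1);  // clamp at the end
        double frac = srcPos - static_cast<double>(i0);
        out[i] = static_cast<float>((1.0 - frac) * in[i0] + frac * in[i1]);
    }
    return out;
}
```

Running this on a short ramp makes the pitch-shift side effect obvious: every input period now spans twice as many output samples, so all frequencies are halved.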

Generally, some artifacts will always be introduced due to the limitations of math/signal processing/causality. The challenge in designing these sorts of DSP algorithms is finding the sweet spot in terms of quality/performance for your intended application. There's no silver bullet as far as I am aware.


If you take a recording at 96 kHz and slow it down by 2x, ultrasonic frequencies in the 22-44 kHz range are shifted down to the 11-22 kHz range and become audible (though probably quiet, since most microphones and audio gear aren't designed to record ultrasound). Normal frequencies are shifted an octave lower.

In Topic: How to compute % of target visibility/cover?

30 August 2016 - 10:32 AM

You listed my suggestion. Shoot many different rays at your target and see how many hit. One thing I'd add is some randomization: don't shoot the same rays every frame, but randomly pick new ones. You can then smooth out the jitter in the %-hit signal with a low-pass filter.

Exactly this. Trace rays uniformly distributed within the cone that contains the target's bounding sphere; you can find code to generate those rays online. To get the true % visible, test each ray against the object while ignoring occluders, then test the same ray against the occluders (but only if the first test hit the target object). This handles the case where the object isn't close to spherical, so some rays will miss the target regardless of any occluders.


You can trace many fewer rays (like 10x fewer) if you smooth the resulting output over many frames using a technique like exponential smoothing. You can probably get away with only 10-20 rays per frame and still get decent results. You might also scale the number of rays with the size of the cone: wider cones need more rays. The ray count should be proportional to the solid angle covered by the cone, to keep the density of rays constant.
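The two building blocks above can be sketched as follows. The cone sampler draws directions uniformly over the solid angle around +Z (rotate into the actual cone frame in real use), and the smoothing function is plain exponential smoothing of the per-frame hit fraction; all names here are illustrative:

```cpp
#include <cmath>
#include <random>

struct Vec3 { float x, y, z; };

// Uniform random direction inside a cone around +Z with the given half-angle.
// Uniform in solid angle: cos(theta) is drawn uniformly from [cos(half), 1].
Vec3 sampleCone(std::mt19937& rng, float halfAngle) {
    const float kPi = 3.14159265358979f;
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    float cosMax = std::cos(halfAngle);
    float cosTheta = 1.0f - uni(rng) * (1.0f - cosMax);
    float sinTheta = std::sqrt(1.0f - cosTheta * cosTheta);
    float phi = 2.0f * kPi * uni(rng);
    return { sinTheta * std::cos(phi), sinTheta * std::sin(phi), cosTheta };
}

// Exponential smoothing of the noisy per-frame visibility estimate.
// Smaller alpha -> smoother but slower to react.
float smoothVisibility(float prev, float current, float alpha = 0.1f) {
    return prev + alpha * (current - prev);
}
```

Each frame you'd trace 10-20 sampled rays, compute hits/total as `current`, and feed it through smoothVisibility to get a stable visibility percentage.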

In Topic: Threadpool with abortable jobs and then-function

26 August 2016 - 12:45 PM

If you have such large tasks that the time to execute them is noticeable to the user, then you should probably just break those tasks into smaller units (e.g. <100ms to execute), so that if you do need to abort/restart a computation you can just remove the unexecuted task segments from the queue. You'll also get better parallelism by making your tasks smaller (but not too small), since the threads in the pool can be kept busy continuously.
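One cheap way to drop unexecuted segments is to tag each queued segment with a generation counter and bump the counter to abort. This is a minimal single-threaded sketch (a real pool would protect the queue with a mutex or use a lock-free queue; names are illustrative):

```cpp
#include <atomic>
#include <deque>
#include <functional>

// Each queued segment remembers the generation it was enqueued under.
// Bumping the generation "aborts" all still-pending segments: workers
// simply skip segments whose generation is stale.
struct JobQueue {
    struct Segment { unsigned gen; std::function<void()> work; };

    std::atomic<unsigned> generation{0};
    std::deque<Segment> queue; // NOTE: not thread-safe in this sketch

    void push(std::function<void()> work) {
        queue.push_back({generation.load(), std::move(work)});
    }

    void abortPending() { generation.fetch_add(1); }

    // Worker loop body: pop one segment and run it unless it was aborted.
    bool runNext() {
        if (queue.empty()) return false;
        Segment s = std::move(queue.front());
        queue.pop_front();
        if (s.gen == generation.load())
            s.work();
        return true;
    }
};
```

Because each segment is short (say, under 100 ms), an "abort" takes effect quickly without any polling inside the job body or killing threads.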


Aborting jobs will either be intrusive (the job has to continually poll to see if it should stop) or destructive (killing the thread). Neither is a great option.


As for a responsive user interface: that's most easily achieved by running the UI on its own thread at a high update rate (e.g. 60 Hz), so it never has to wait on the completion of any complex task. The UI thread responds to user events and quickly pushes jobs onto the thread pool as needed. When a job finishes, it calls a completion handler that locks a mutex and updates the UI state based on the new computation, so the waiting is minimal.
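The handoff at the end can be as simple as a mutex-guarded state struct. A minimal sketch (the struct and function names are hypothetical; the point is that both sides hold the lock only long enough to copy):

```cpp
#include <mutex>
#include <string>

// State shared between worker completion handlers and the 60 Hz UI thread.
struct UiState {
    std::mutex m;
    std::string status;
};

// Called from a pool thread when a job finishes: publish the result.
void onJobDone(UiState& ui, const std::string& result) {
    std::lock_guard<std::mutex> lock(ui.m);
    ui.status = result;
}

// Called once per UI frame: take a cheap snapshot and render from the copy,
// so the mutex is never held while drawing.
std::string snapshotStatus(UiState& ui) {
    std::lock_guard<std::mutex> lock(ui.m);
    return ui.status;
}
```

Since both critical sections are just a string copy, the UI thread is never blocked for a perceptible amount of time even when many jobs complete at once.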