Community Reputation

120 Neutral

About stefanbanev

  1.   C/C++ is actually a perfect tool for learning assembly. You can write assembly inside C++ source with inline assembly, or write an assembly function in a separate *.asm file and link it into the C++ project (you need to understand the parameter-passing convention well). The problems where writing in assembly makes practical sense are bit-juggling algorithms where every CPU cycle is critical for maximum performance; once you deal with bytes/words etc., algorithms implemented in C are often more efficient, and the more complex the algorithm, the more advantage C code may have; modern C compilers do a marvelous job of optimizing... BUT to write performance-optimal C algorithms you must have a good idea of how the code is translated to assembly, so having the assembly listing of your C++ code is paramount during such performance-critical development. You gain far more performance by improving/inventing algorithms; that is the main way to make things run faster. Once an algorithm is perfected, you may consider implementing/optimizing some tiny parts of it in assembly... Another area where assembly makes sense is writing performance-optimized standard calls; for example, setjmp/longjmp used to return to the root level from a deep recursion can be several times faster than the standard equivalents (as long as the compromises/limitations of such a fast implementation are well understood). Anyway, writing and understanding assembly is really fun and very useful (actually, it's a critical skill) for getting maximum performance; but writing a whole project in assembly makes no practical sense...
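As an illustrative sketch only (not code from the post), the setjmp/longjmp trick mentioned above might look like this: a deep recursive search bails straight back to the root frame the moment it finds a hit, instead of unwinding level by level. The names `search` and `find_first` are my own, and the usual C++ caveat applies: longjmp must not skip frames holding objects with nontrivial destructors.

```cpp
#include <csetjmp>

// Saved context of the root caller, and the result carried across the jump.
static std::jmp_buf g_root;
static int g_found = -1;

// Recursively scan data[lo..hi); on a hit, jump back to the root in one hop.
// Safe here only because the skipped frames hold plain ints (no destructors).
static void search(const int* data, int lo, int hi, int target) {
    if (lo >= hi) return;
    int mid = (lo + hi) / 2;
    if (data[mid] == target) {
        g_found = mid;
        std::longjmp(g_root, 1);      // abandon the whole recursion at once
    }
    search(data, lo, mid, target);
    search(data, mid + 1, hi, target);
}

static int find_first(const int* data, int n, int target) {
    g_found = -1;
    if (setjmp(g_root) == 0)          // 0: normal path; 1: arrived via longjmp
        search(data, 0, n, target);
    return g_found;
}
```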
  2.   If the job relates to 3D graphics and/or simulation, take it as long as the compensation meets your needs; do not expect to do core development, it is 99.98% likely you will deal with an already existing engine. Meanwhile, if you have ambitions to develop your own engine so that you are later in a position to hire people to build an API around it, then continue your private development, make sure it is explicitly excluded from your NDA with your current employer, and be very careful to keep it clean (your own computers/compilers, etc.). The area where you apply your efforts should be away from the pack; apparently, rasterization/GPGPU is the most crowded and least lucrative area to work in, unless you are fine with remaining an employee for the rest of your life. In computer graphics, one area I would consider interesting is adaptive ray tracing on MIMD; it could be classical multi-core CPU hardware, or maybe the Xeon Phi SIMD/MIMD hybrid. Have fun...
  3. Artificial volumetric data usually contain far more homogeneous regions than data from a CT/MRI scanner, so adaptive algorithms can provide quite a dramatic performance boost; in fact, the typical speedup for CT/MRI data is around 20..30 times at similar quality. Such a result for adaptive volumetric ray casting has been demonstrated only on multi-core CPU hardware, so a similar development for the GPU would be quite an impressive breakthrough in the area.   Btw, "marching cubes" has nothing to do with volume rendering; it is just mesh generation from volumetric data. In fact, marching cubes is a way to avoid volume rendering and retreat to rendering polygons, which is what the GPU does best. It is correct that multi-shelled mesh generation for different iso-surface levels may allow approximating the rendering integral to produce a true volume rendering output, but it is vastly more computationally expensive to sustain the same level of integral precision that way than to do it directly, especially if interactive modification of the transfer function is a requirement. A sampling rate of ~16 samples per cell for interpolation-classification VR is the density needed to ensure high-quality rendering for medical applications; so many levels of meshes apparently make this technique impractical for general VR applications, BUT it may be useful for visualizing small volumetric objects in games (besides, high precision is not a requirement for a game).
  4. kd-tree in volume rendering

    [quote name='mahendragr' timestamp='1312466563' post='4844486'] Hi all, I have a weird situation with kd-trees. I have implemented a kd-tree for volume rendering, which is also used for load balancing in a cluster-like environment. For example, if there are two machines, each machine gets a part of the tree starting from the root and renders its portion. However, when I detect a load imbalance, I move the subdivision plane in the kd-tree so that the slower machine gets a smaller number of leaves (which contain the data). The problem now is, when I move the subdivision plane, I have two new bboxes, which are in turn divided again (which I don't want to happen, as I save the old leaves based on their bbox and reuse them to prevent data reloading). My question is, how should I stop my tree from being subdivided again into chunks smaller than the leaf size? [/quote] Well, there is nothing weird here; you see exactly what you should see: by "moving the subdivision plane" at some level you consequently ruin/invalidate the corresponding sub-trees. It is possible to try to "re-utilize" some sub-trees, but that requires trimming/expansion of all affected subdivisions, which clearly causes a domino effect, and in the end you just rebuild the affected sub-trees. No offence, but such a "balancing" approach is silly (from my point of view): as soon as each cluster node can satisfy a single client with interactive high-fidelity volume rendering, it makes no sense to distribute the workload of a single client across the cluster farm; in that case the load should be balanced among clients instead. A dual X5650 node is fully capable of interactive high-fidelity volume rendering of a 2K cube and may scale up logarithmically, so a modern 4-socket Xeon (40 cores) may render a 4K cube interactively at supersampling x16 with a 1.5...2 MP viewport. --sb
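A sketch of the alternative hinted at in the reply (my own construction, not the poster's code): rather than moving the geometric split plane and invalidating subtrees, redistribute whole leaves between the two machines according to their measured speeds. `leaf_cost[i]` is the last frame's render time for leaf i, ordered along the split axis; `speedA`/`speedB` are relative throughputs.

```cpp
#include <vector>
#include <numeric>
#include <cstddef>

// Return the leaf index at which to cut the ordered leaf list so that
// machine A receives work proportional to its relative speed. Leaves are
// never split or rebuilt, so the cached leaf data stays valid.
int balance_split(const std::vector<double>& leaf_cost,
                  double speedA, double speedB) {
    double total = std::accumulate(leaf_cost.begin(), leaf_cost.end(), 0.0);
    double targetA = total * speedA / (speedA + speedB);  // A's fair share
    double acc = 0.0;
    for (std::size_t i = 0; i < leaf_cost.size(); ++i) {
        acc += leaf_cost[i];
        if (acc >= targetA) return static_cast<int>(i) + 1; // leaves [0..i] -> A
    }
    return static_cast<int>(leaf_cost.size());
}
```

With equal speeds the cut lands in the middle of the measured cost, and a faster machine A simply absorbs more leaves; the tree itself is never re-subdivided.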
  5. [quote name='smasherprog' timestamp='1311312624' post='4838784'] Not sure who is who here? stefanbanev == vereor? At any rate, the OP suggests a depth of 27, which would mean the number of nodes in the octree would be in excess of 8^27, which I won't bother to write out. Vereor, I think you might want to lay out what you want and see which techniques line up and their implications. For example, if you want to display a large area from a zoomed-out view, you couldn't hold everything in RAM because the costs would far exceed your available memory. As a person zooms out, objects from the scene should be removed, and as a person zooms in, the opposite should happen. So your scene would need to be dynamic, loading objects as they come into view and then discarding them after some viewing threshold. Octrees and other tree-type structures are used a lot in many video games, but they generally only complicate things, not make them easier. The point of a spatial tree such as an octree is simple, right? It divides the visible area into neat chunks; it sure sounds good to me --I like organization. But what's the point? What I mean is, what does this give you? How does this make your job easier? I have seen many people want to use a spatial tree just because they thought it would benefit them to divide their world up --and everyone else does it, right? I thought the same and wrote an octree that has extremely fast find/insert/remove, but I stopped writing the code after seeing it just complicated everything. Could you use a quadtree? Or make up a different way to divide your viewing area? If you allow zooming in and out, perhaps there can be fixed zoom distances. For example, if a person hits zoom out, it will zoom out to the solar-system view; if he or she hits zoom in again, it would zoom in close to an orbital view of a planet. If a person zooms out of the solar system, it would show the galaxy.
In this manner you could manage the number of objects in the scene and provide yourself some consistency --and an easier job programming. [/quote] > At any rate, the OP suggests a depth of 27, which would mean the number of nodes in the octree would be in excess of 8^27, which I won't bother to write out. For such a sparsely populated volume as space, where 99.99% of the volume is empty, the number of leaves is very low. However, 27 levels are not enough to represent the solar system: if the lowest level represents 0.1 km, then a volume of 44 light-seconds can be indexed (8 light-minutes is the distance from Earth to the Sun); besides, different data structures for different levels are advisable (see my post above). Stefan
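The "very low leaf count for sparse data" point can be made concrete with a minimal pointer-based sparse octree (an illustrative sketch; all names are my own). Only the branches that actually contain a point get allocated, so even at deep levels the node count stays proportional to the populated chains, nowhere near 8^depth.

```cpp
#include <array>
#include <memory>

// One node of a sparse octree over the cube [0,size)^3, size a power of two.
// Empty octants remain null and cost nothing.
struct Node {
    std::array<std::unique_ptr<Node>, 8> child;
};

static int g_nodes = 0;   // count of allocated (non-root) nodes, for the demo

// Descend toward the unit cell containing (x,y,z), allocating only the
// octant chain that the point actually touches.
void insert(Node* n, int x, int y, int z, int size) {
    if (size == 1) return;                         // reached leaf resolution
    int h = size / 2;
    int k = (x >= h) | ((y >= h) << 1) | ((z >= h) << 2); // octant index 0..7
    if (!n->child[k]) { n->child[k] = std::make_unique<Node>(); ++g_nodes; }
    insert(n->child[k].get(), x % h, y % h, z % h, h);
}
```

Inserting one point into a 1024^3 domain allocates just 10 nodes (one per level), and a second point far away adds at most one more chain; the 8^depth figure applies only to a fully populated tree.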
  6. [quote name='Hodgman' timestamp='1311329587' post='4838833'] [quote name='Vereor' timestamp='1310030946' post='4832183']Traversing the octree to the player's depth, and collecting a list of potentially visible octree nodes along the way, doesn't satisfy my above requirement, as very far away objects (planets) will not necessarily be in that traversal path[/quote]I don't quite understand this. Usually when you traverse an oct-tree, you visit all 8 children of a node, provided that they are visible, and repeat. So starting at the root, visiting its 8 children, and their 8 (times 8) children, and so on, performing a visibility test for each node, how do some 'visible' nodes [i]not[/i] end up being visited? Can you explain this part of the problem? [/quote] >Usually when you traverse an oct-tree, you visit all 8 children of a node Well, that is not necessarily so for ray-octree traversal, since a ray may cross only some subset of the 8 sub-volumes, so there is no need to test all 8 of them. If you just scan the octree (to find min/max, to update the octree, etc.) then it does make sense to look through all 8 sub-volumes. Stefan
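The "a ray crosses only a subset of the 8 children" point can be sketched as follows (an illustrative example, not from the thread): a standard slab test against each child box of the unit cube, returning a bitmask of the children the ray actually enters; only those get descended into.

```cpp
#include <algorithm>

// Slab test: does the ray o + t*d (t >= 0) hit the box [lo,hi] per axis?
bool hits(const double o[3], const double d[3],
          const double lo[3], const double hi[3]) {
    double tmin = 0.0, tmax = 1e30;
    for (int a = 0; a < 3; ++a) {
        if (d[a] == 0.0) {
            if (o[a] < lo[a] || o[a] > hi[a]) return false;  // parallel miss
            continue;
        }
        double t0 = (lo[a] - o[a]) / d[a], t1 = (hi[a] - o[a]) / d[a];
        if (t0 > t1) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
    }
    return tmin <= tmax;
}

// Bitmask of the 8 children of the unit cube [0,1]^3 (split at 0.5) that the
// ray crosses; bit k set means child k must be visited. Bit a of k flags the
// high half along axis a.
int crossed_children(const double o[3], const double d[3]) {
    int mask = 0;
    for (int k = 0; k < 8; ++k) {
        double lo[3], hi[3];
        for (int a = 0; a < 3; ++a) {
            bool high = (k >> a) & 1;
            lo[a] = high ? 0.5 : 0.0;
            hi[a] = high ? 1.0 : 0.5;
        }
        if (hits(o, d, lo, hi)) mask |= 1 << k;
    }
    return mask;
}
```

An axis-aligned ray threading through one corner row touches only 2 of the 8 octants, so the traversal skips the other 6 subtrees entirely.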
  7. [quote name='smasherprog' timestamp='1311281566' post='4838645'] Have you thought about the memory requirements? Or, perhaps, the millions of calculations needed to maintain an octree of this magnitude? There isn't a reason to keep something this big in memory at all times. Normally, you would keep in memory only what is needed and dynamically load new data as needed. [/quote] >Have you thought about the memory requirements? Funny... Does my post really indicate that I have never thought about memory requirements? >Or, perhaps the millions of calculations needed to maintain an octree of this magnitude? An octree is just an index structure referencing the actual data (polygons, voxels, etc.); the actual data should be stored in conventional flat/array form and takes the majority of the memory; the octree is a minor part, unless a developer with special abilities does the job. Once the data is modified, the octree update can be very fast as long as the data holds references back to the corresponding octree leaves; in this case the time complexity of an octree update is T(n,N) ~= n*log2(N), where "N" is the size of all the data and "n" is the size of the modified part - quite a minor overhead. The implied logic is too complex to be implemented effectively solely on a GPU. In fact, once the data size gets above some threshold, a multi-core CPU beats the GPU badly. Stefan
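The n*log2(N) update claim can be illustrated with a sketch (my own, hypothetical structure): each data element remembers its leaf, and changing n elements re-propagates a per-node summary (here, the subtree maximum) only up the n affected leaf-to-root paths instead of rebuilding the tree.

```cpp
#include <vector>
#include <algorithm>

// A generic tree node carrying a subtree summary (max value). A real octree
// would have exactly 8 kids; the count is left open to keep the demo short.
struct UNode {
    UNode* parent = nullptr;
    std::vector<UNode*> kids;
    int maxval = 0;
};

// Re-propagate the summary along one leaf-to-root path: O(log N) per
// modified element, stopping early as soon as nothing changes.
void update_leaf(UNode* leaf, int newval) {
    leaf->maxval = newval;
    for (UNode* n = leaf->parent; n; n = n->parent) {
        int m = 0;
        for (UNode* k : n->kids) m = std::max(m, k->maxval);
        if (m == n->maxval) break;   // summary unchanged above this point
        n->maxval = m;
    }
}
```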
  8. >I think you'd need three octrees. No offence, but that sounds quite silly to me - the whole point of using an octree is to provide a hierarchical structure where each node points to a child octree as deep as necessary; once a node reaches the maximum level of detail or empty space, it stops (becomes a leaf). As a rule, different levels of the octree may need different structures to be memory efficient; usually the structure representing the leaf-level nodes differs from the nodes above, but generally such "structure tuning" may be applied at any octree level. Also, it often makes sense to have a pointer-based octree down to some level and a pointer-free (flat) octree representation below it; the pointer-based octree addresses children explicitly via pointers, so empty sub-volumes do not get child sub-octrees, while the flat octree requires some extra math to compute the position of a child node, and that works only if the whole sub-volume is represented (there are no dead ends caused by empty sub-volumes). Stefan
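The "extra math" for the flat, pointer-free representation is just index arithmetic. As a sketch of the standard linear-octree numbering (root at slot 0, a node's eight children stored contiguously), assuming the dense layout described above:

```cpp
// Dense linear-octree addressing: when a whole subtree is stored in one
// array with no holes, a child's slot is pure arithmetic - no pointers.
inline long child_index(long node, int octant) { return 8 * node + 1 + octant; }
inline long parent_index(long node)            { return (node - 1) / 8; }
```

This is exactly why the flat form needs the sub-volume to be complete: the formula assumes every octant slot exists, so an empty child cannot simply be skipped the way a null pointer can.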
  9. >1024x1024x64. Ideally it would be twice those dimensions. >My search leads me to acceleration structures and other advanced concepts For such "thin" data, acceleration structures are overkill; brute-force GPU texture-mapping VR would be a better choice (it does not scale well to thicker data, though). There are several "free" GPU-based engines suitable for such data... If you go to fat volumes, the GPU is obsolete; a multi-core CPU is the way to go...
  10. Any fast voxel rendering algorithms?

    [quote name='spinningcube' timestamp='1300623229' post='4788201'] [quote name='stefanbanev' timestamp='1300560186' post='4788007'] Currently, the best multi-core CPU VR ray-caster outperforms the best GPU by a factor of 10 (in a similar high-end price range) >I have searched the internet but I didn't find any good articles about rendering voxels that don't involve octrees or cuda. There are none, just CUDA propaganda to make you invest in GPUs; it works great for brute-force SIMD-friendly algorithms but apparently fails once the code path depends on the data in the array, which is mandatory to get logarithmic time complexity. [/quote] This. CUDA is great for brute-force algos, but as soon as you branch, need more flexibility, memory, etc... Forget it. Pablo [/quote] Branching does not impact performance in a significant way as long as ALL threads follow the SAME code path - meaning the same single instruction for all threads; the CUDA API allows divergence to happen, so if a warp of 32 threads has even a single thread going down a different path, the speed drops 2x, and so on. Apparently such a limitation makes the GPU impractical for flexible general-purpose ray tracing, since ray coherency is the exception rather than the rule. The CUDA API would fit a massive MIMD architecture fine, but the current GPU is really good only for SIMD-friendly algos. It should be noted that the memory/PCI-E limitations are relatively minor, not principal GPU handicaps, and they are steadily improving - THE major handicap of the GPU is its SIMD architecture. Stefan
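The divergence cost described above can be captured by a toy model (my own illustration, not CUDA code): a warp executes, in lockstep, every code path taken by at least one of its lanes, so the time is proportional to the number of distinct paths, not the average per lane.

```cpp
#include <set>
#include <vector>

// Toy SIMT cost model: given the path id each of the 32 lanes takes through
// a branch, the warp's cost (in serialized passes) is the number of distinct
// paths - one divergent lane doubles the cost for the whole warp.
int warp_cost(const std::vector<int>& lane_path) {
    return static_cast<int>(
        std::set<int>(lane_path.begin(), lane_path.end()).size());
}
```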
  11. Any fast voxel rendering algorithms?

    Currently, the best multi-core CPU VR ray-caster outperforms the best GPU by a factor of 10 (in a similar high-end price range). >Aliasing artifacts Keep adaptive sampling along the ray; assess the error at each step based on the current ray energy, opacity/RGB contribution, etc.; logarithmic time complexity is obtainable for the low-opacity uniform-fog case and/or the hit-and-stop high-opacity case. >Lots of empty voxels are processed even if the ray doesn't hit anything Use an octree or other subdivision techniques. >I have searched the internet but I didn't find any good articles about rendering voxels that don't involve octrees or cuda. There are none, just CUDA propaganda to make you invest in GPUs; it works great for brute-force SIMD-friendly algorithms but apparently fails once the code path depends on the data in the array, which is mandatory to get logarithmic time complexity. >So, do you know of any good voxel rendering algorithms? This question is similar to "So, do you know of any good chess algorithms?" Any good one is very complex, and you will not find a complete description, just some general ideas and concepts which are banal anyway. If your goal is to have fun writing your own engine, then indeed you will have plenty of fun, but do not expect to produce a competitive engine unless you are going to invest years solely in this project... Stefanbanev
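The "hit and stop" high-opacity case above is just front-to-back compositing with early ray termination. A minimal sketch (my own, with hypothetical names): once the accumulated opacity is close enough to 1, the remaining samples cannot change the pixel, so the ray stops.

```cpp
#include <vector>

// Composite one ray front to back. sample_opacity[i] is the opacity of the
// i-th sample along the ray; returns how many samples were evaluated before
// the ray saturated (or ran out of samples). out_alpha receives the result.
int composite_ray(const std::vector<double>& sample_opacity,
                  double& out_alpha, double eps = 1e-3) {
    out_alpha = 0.0;
    int evaluated = 0;
    for (double a : sample_opacity) {
        ++evaluated;
        out_alpha += (1.0 - out_alpha) * a;  // standard "over" operator
        if (1.0 - out_alpha < eps) break;    // ray saturated: stop early
    }
    return evaluated;
}
```

For a dense region the ray terminates after a handful of samples regardless of how deep the volume is, which is where the sub-linear behavior in the high-opacity case comes from; transparent rays still pay for every sample unless empty-space skipping (the octree) removes them first.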
  12. multithreading software renderer

    Quote: Original post by Noggs A modern GPU is more like 1000 awesome workers than one awesome worker. Plus dedicated hardware to do various bits of the rendering; a general-purpose CPU cannot get anywhere close to this performance. >A modern GPU is more like 1000 awesome workers than one awesome worker. For what kind of work? The GPU is only good for rasterizing graphics, and there x1000 is an appropriate ratio. Once algorithms are not SIMD-friendly, the modern GPU is totally inferior to the modern multi-core CPU. If adaptive algorithms use the data itself to choose the optimal strategy (code path), the modern multi-core CPU badly beats ANY GPU machine by several orders of magnitude (in the same price range) - adaptive volumetric ray casting is one such example. Even for SIMD-friendly algorithms the CPU is very often a better choice; you may try the best FFT implementations from the two camps to see it. Please show me any commercial application where the FFT is computed on the GPU (besides examples set up by GPU promoters such as NVIDIA/ATI) - and that is a domain very favorable to SIMD; outside this very tight domain the GPU is totally bogus.
  13. Render to 3d texture with glsl?

    Quote: Original post by ruysch > I would not advise you to get into the "volume rendering" development mess I wouldn't be discouraged; once you get your mind wrapped around the problem, it's all just a matter of understanding the algorithm you wish to implement and writing the code. Well, that is apparently true for known/public algorithms. In the case of volume rendering, the practically relevant known/public algorithms have cubic time complexity; even Fourier volume rendering has N^2*log2(N) time complexity - it is not by chance that it remains in the PhD domain. The GPU is good for brute-force VR, therefore its scalability suffers; smart adaptive algorithms are not well suited for the GPU due to its SIMD limitations, and that is one of the reasons adaptive VR algorithms remain in uncharted public territory - besides, such algorithms are really difficult to implement. There are several proprietary CPU-based VR renderers with logarithmic time complexity, so they take over from GPU VR above some size threshold, and this threshold is rapidly going down with the multi-core AMD/Intel war. To keep up, the GPU would have to sustain cubic growth in transistor count, or eventually become an efficient MIMD machine like the i7/Opteron. Stefan
  14. Render to 3d texture with glsl?

    "The more I think about it, though, the more unsure am I about whether this is the right approach. My data is largely sparse and I only need rendering of the frontmost isosurface." Well, your doubts are valid: if you need volume rendering just to visualize the result of your modeling and it is not the prime focus of your research, I would not advise you to get into the "volume rendering" development mess. It is very complex and time-consuming to master a decent VR engine, unless a crappy & slow one is fine with you. There is no decent GPU VR (by my standard): once the data size and the size of the projection plane get bigger and/or the interactive rendering quality is set higher, the interactive speed rapidly deteriorates, way below an interactive rate. Bottom line, GPU VR is not scalable at all, mostly because the GPU's SIMD architecture is incapable of adapting/changing the code path to take advantage of the local properties of the data; once each ray has a unique code path driven by the local properties of the data along the ray, the GPU is totally inferior to a modern multi-core CPU. Therefore, a modern GPU may provide decent VR performance only for relatively small volumetric data. Probably the best GPU-based volume rendering I am aware of is Voreen. There are new multi-core CPU-based VR engines with excellent scalability; the major performance differentiation between CPU-based engines is the threshold at which they take over from the GPU. Currently this threshold, for the best from the two camps, is around 512x512x512 (16-bit data), a 700x700 viewport, and a frame rate of ~8 FPS for IC with an interactive sampling density along the ray of 8+ samples per cell; the rendering hardware is a dual E5620 with 4GB vs. a desktop with dual-SLI GeForce GTX 480 (Fermi) (this estimation is very conservative and cautious; probably a 256x256x256 cube is more accurate).
The best CPU-based VR has logarithmic scalability, while the GPU suffers from the cubic dependency of brute-force texture mapping (even though it is greatly sped up by hardware circuits, that cannot change its cubic nature); admittedly, it can be significantly improved via adaptive sampling of small texture bricks, but the complexity of such a development mess is very high, so in practice it has so far remained in the domain of research, not product development (I would love to be proved wrong). Stefan [Edited by - stefanbanev on July 3, 2010 9:37:58 PM]
  15. >The problem is that there is a discontinuity between the sphere and >its complement in the volume cube. And it is using linear >interpolation to make it continuous, which is not right for >the setup I want, because I do want a discontinuity. I guess >the problem is with my transfer function. Well, apparently it is a matter of semantics; you use the word "problem" for the arrangement you set up, and within the frame of your arrangement you are getting a consistent result. If you are reluctant to reconsider the setup, then the only option to visualize a smooth iso-surface for your data layout is to go to impractically high orders of interpolation. Other options arise only if you change the problem definition; for example, do not show an iso-surface at all, but rather show a "cloudy" transition from transparent to opaque, which is a composition of the contributions of multiple iso-surfaces. Still, in the case of linear interpolation you have only one single layer of cells with values between 0 and 128, and this layer is the only "blanket" you may use as a "cloudy" transition to reduce the "cubical" appearance. In this case, set the opacity ramp/slope from 0 to 128. To ensure an equal contribution from each iso-surface, the ramp shape should not be linear but logarithmic (the base depends on the size of the opacity quantum). With this arrangement the "issue" will be less pronounced, but still quite blocky... Another way is to redefine the data itself (see my previous post). >Can you explain what you mean by supersampling here? Is this referring >to CPU ray tracing where you cast multiple rays per pixel? No, "supersampling" for volumetric ray casting refers to the sampling density along each ray - the number of samples per cell along the ray. This measure increases the accuracy of the rendering integral. Increasing the number of rays per pixel of the projection plane apparently cannot affect/improve the accuracy of the rendering integral...
;o) An excessive number of rays per pixel just helps smooth the edges in the 2D image (known as anti-aliasing, which is a lame term anyway). >to CPU ray tracing where you You could say GPU here as well; it makes no difference from the math point of view, but it does make a huge practical difference in quality and performance. --sb
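One possible reading of the non-linear opacity ramp suggested in the reply (a sketch under my own assumptions, not the poster's formula): choose the per-value opacities across the 0..128 transition band so that every intermediate iso-level attenuates the ray by the same factor, i.e. the remaining transmittance falls off geometrically rather than linearly.

```cpp
#include <cmath>
#include <vector>

// Build per-level opacities for a transition band of `levels` values such
// that each level attenuates the ray equally and the whole band accumulates
// `target_alpha` total opacity: solve (1-a)^levels = 1 - target_alpha.
std::vector<double> log_ramp(int levels, double target_alpha) {
    double a = 1.0 - std::pow(1.0 - target_alpha, 1.0 / levels);
    return std::vector<double>(levels, a);
}
```

Compositing such a ramp front to back gives each of the 128 "shells" the same relative contribution, which is the equal-contribution property the post asks for; a linear ramp would instead let the front shells dominate.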