Optimizing the remesh modifier

The past couple days I’ve been looking into the performance of the remesh modifier. Although I’m committed to getting dynamic-topology sculpting working, the remesh modifier is also a useful approach to clay-like sculpting. Ideally it should support quite high density levels at fast speeds. Below I’ll get very detailed about what I’ve found, but spoiler alert: it turns out there’s quite a lot we can do to make remesh faster.

Methods

To guide optimization it’s important to measure. I usually start with Sysprof. This is a handy tool with a simple GUI that does system-wide profiling via a kernel module. On Ubuntu it’s very easy to get: sudo apt-get install sysprof, then sudo sysprof. Sysprof is enough to get a high-level view of which functions are taking the most time, although the data gets a bit confused once compiler optimization is enabled.

Using Sysprof’s output as a guide, I added some timing printouts to various functions. This code takes a simple form:

 /* omp_get_wtime() is declared in omp.h */
 double start, end;
 start = omp_get_wtime();
 blah();
 end = omp_get_wtime();
 printf("blah took %f seconds\n", end - start);

Notes:

  • I only profiled code called from MOD_remesh.c, so this ignores the draw code. Once the remesh depth is high enough to output millions of polygons, a non-trivial amount of time might be used to build VBOs (if enabled) and draw to the screen. That’s outside the scope of optimizing remesh though.
  • I bumped up the remesh modifier’s maximum octree depth from ten to twelve. Longer runs give more stable timing output. That said, timing info below like “1.23456” should really be read as more like “1.2”; it’s not much more precise than that.
  • So far I’ve tested only on my desktop system, which has an Intel quad core i7 (2.93 GHz) and 12 GB RAM.

Initial Timing Results

Note that all of these have the remesh modifier in Smooth mode:

  • Cube test (default cube, 6 faces, with remesh modifier)
    • Depth 10, Remove Disconnected Pieces disabled:
      • MOD_remesh.c applyModifier time (excluding edge and normal calculation): 14.392200 seconds
      • CDDM_calc_edges time: 4.568653 seconds
      • CDDM_calc_normals time: 0.753950 seconds
    • Depth 11, Remove Disconnected Pieces disabled:
      • applyModifier time: 58.752615 seconds
      • CDDM_calc_edges time: 19.040228 seconds
      • CDDM_calc_normals time: 2.963763 seconds
    • Depth 11, Remove Disconnected Pieces enabled:
      • applyModifier time: 78.423068 seconds
      • CDDM_calc_edges time: 19.214748 seconds
      • CDDM_calc_normals time: 2.983682 seconds
  • Suzanne test (default Suzanne mesh, 500 faces, with remesh modifier)
    • Depth 10, Remove Disconnected Pieces disabled:
      • MOD_remesh.c applyModifier time (excluding edge and normal calculation): 8.305103 seconds
      • CDDM_calc_edges time: 1.934146 seconds
      • CDDM_calc_normals time: 0.402963 seconds
    • Depth 11, Remove Disconnected Pieces disabled:
      • applyModifier time: 32.458871 seconds
      • CDDM_calc_edges time: 8.153740 seconds
      • CDDM_calc_normals time: 1.590266 seconds
    • Depth 11, Remove Disconnected Pieces enabled:
      • applyModifier time: 42.088119 seconds
      • CDDM_calc_edges time: 8.145864 seconds
      • CDDM_calc_normals time: 1.589513 seconds
  • Subdivided Suzanne test (default Suzanne subdivided twice, 8000 faces, with remesh modifier):
    • Depth 10, Remove Disconnected Pieces disabled:
      • MOD_remesh.c applyModifier time (excluding edge and normal calculation): 8.963344 seconds
      • CDDM_calc_edges time: 1.907123 seconds
      • CDDM_calc_normals time: 0.406367 seconds
    • Depth 11, Remove Disconnected Pieces disabled:
      • applyModifier time: 34.063501 seconds
      • CDDM_calc_edges time: 8.144841 seconds
      • CDDM_calc_normals time: 1.629981 seconds
    • Depth 11, Remove Disconnected Pieces enabled:
      • applyModifier time: 44.334208 seconds
      • CDDM_calc_edges time: 8.291207 seconds
      • CDDM_calc_normals time: 1.654197 seconds

Right away we learn some interesting things. First: if you don’t need it, turn Remove Disconnected Pieces off! It’s adding 10-20 seconds at level 11.

Second, we see that the number of input faces doesn’t seem to affect the time very much. In fact, the simple cube is much slower. That’s just because the cube fits the octree’s shape perfectly, though; the same cube rotated (in edit mode) gives timing similar to Suzanne’s. So for the remainder I’ll focus on just one test case, the regular 500-face Suzanne.

Third, we can see the importance of CDDM_calc_edges and (relative) unimportance of CDDM_calc_normals. Normals calculation time is low enough that I’m ignoring it for now, but CDDM_calc_edges looks like an opportunity for optimization.

Unnecessary Calculations

One of the functions that shows up at the top of Sysprof’s output is Octree::computeMinimizer. This function solves a linear system using the Eigen3 matrix library. It’s called for every vertex that remesh outputs, and its purpose is to calculate where the vertex should go in order to preserve sharp edges. But note that we’re testing remesh in Smooth mode, so why do we need this? It turns out we really don’t: if the remesh mode is smooth, we can simply calculate the mass point (the average of the edge intersection points); a rough sketch of this change is shown after the timings. With this simple change in place, let’s see what happens to the timing:

  • Suzanne test
    • Depth 10, Remove Disconnected Pieces disabled:
      • applyModifier time: 4.371019 seconds
    • Depth 11, Remove Disconnected Pieces disabled:
      • applyModifier time: 17.721956 seconds

At level 11, we’ve gone from 32 seconds to 17 seconds. That’s a rare speed-up there, almost half the time removed.
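
For illustration, here’s a minimal sketch of what the smooth-mode shortcut amounts to. The function and parameter names are made up for this example; the real dual-contouring code keeps its edge-intersection (Hermite) data inside the octree leaves:

 /* Sketch only: in smooth mode, skip the QEF solve and place the output
  * vertex at the mass point, the plain average of the edge intersection
  * points recorded for this leaf cell. */
 static void mass_point(float out[3], const float (*intersections)[3], int num)
 {
     int i, j;

     out[0] = out[1] = out[2] = 0.0f;

     for (i = 0; i < num; i++)
         for (j = 0; j < 3; j++)
             out[j] += intersections[i][j];

     if (num > 0)
         for (j = 0; j < 3; j++)
             out[j] /= (float)num;
 }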

There’s another function call that I think may be unnecessary, but I’ve only verified this experimentally with a few meshes, so it still needs further checking. This is the trace() function, which is currently called twice. Removing the second call seems to have no ill effects; I suspect the second call just verifies that the first call worked correctly. trace() has something to do with patching holes, but beyond that I don’t know much about it.

Here’s the new timing:

  • Suzanne test
    • Depth 10, Remove Disconnected Pieces disabled:
      • applyModifier time: 3.852286 seconds
    • Depth 11, Remove Disconnected Pieces disabled:
      • applyModifier time: 15.192465 seconds

Edge Data Structure

Now that the performance has changed quite drastically, another trip to Sysprof shows that edge calculation is one of the longest-running functions.

CDDM_calc_edges uses an edge hash to identify unique pairs of vertices. This should indeed be quite efficient for meshes with many isolated vertices or vertices of very high degree (many connected edges). However, I question whether those cases are actually very common, especially for meshes with many vertices. Certainly in the case of remesh, all vertices have a low degree, typically four, give or take an edge.

With this assumption in mind, we can rework CDDM_calc_edges to use a standard adjacency list, with an array sized to match the number of vertices. Each element of the array is itself an array, in this case storing the indices of adjacent vertices (as well as an index identifier for the edge); a sketch of the idea follows the timings below. The modified function gives these timings:

  • Suzanne test
    • Depth 10, Remove Disconnected Pieces disabled:
      • CDDM_calc_edges time: 0.806422 seconds
    • Depth 11, Remove Disconnected Pieces disabled:
      • CDDM_calc_edges time: 2.038443 seconds

Again, quite a speed-up. At depth 11, it takes less than a quarter of the old time to calculate edges. As noted before though, there are cases where this optimization could make things slower. That should be investigated further, but if necessary there could be more than one version of the function, to optimize various cases.
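
To make the idea concrete, here’s a rough sketch of the adjacency-list approach; the struct and function names are invented for this example rather than taken from the actual CDDM code (error handling omitted):

 #include <stdlib.h>

 /* One small growable array per vertex, storing the neighboring vertex
  * indices seen so far (the real version also stores an edge index). */
 typedef struct {
     int *verts;
     int count, capacity;
 } VertAdj;

 /* Record the edge (v1, v2); return 1 if it's new, 0 if already present. */
 static int adj_add_edge(VertAdj *adj, int v1, int v2)
 {
     /* Store each edge under its lower vertex index only. */
     VertAdj *a = &adj[(v1 < v2) ? v1 : v2];
     const int other = (v1 < v2) ? v2 : v1;
     int i;

     for (i = 0; i < a->count; i++)
         if (a->verts[i] == other)
             return 0;

     if (a->count == a->capacity) {
         a->capacity = a->capacity ? a->capacity * 2 : 4;
         a->verts = realloc(a->verts, sizeof(int) * a->capacity);
     }
     a->verts[a->count++] = other;
     return 1;
 }

Since remesh output vertices have a degree of about four, the duplicate check is a scan over just a handful of entries, which is why this can beat the hash in practice.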

OpenMP

Now we get to the fun concurrency stuff. OpenMP is a way to quickly parallelize code (if your compiler supports it). Typically it’s as simple as adding a pragma just before a loop to indicate that it should be run in parallel. Of course, as with all concurrency, you have to be careful about mutable shared data. Ideally you either don’t share data, or the shared data is read-only.
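
As a minimal, made-up example (not code from the remesh branch), a per-vertex loop where each iteration writes only to its own output slot parallelizes with a single pragma:

 /* Each iteration touches only dist[i], so there is no shared mutable state.
    Compile with OpenMP enabled (e.g. -fopenmp with GCC). */
 void squared_lengths(const float (*co)[3], float *dist, int totvert)
 {
     int i;

 #pragma omp parallel for
     for (i = 0; i < totvert; i++) {
         dist[i] = co[i][0] * co[i][0] +
                   co[i][1] * co[i][1] +
                   co[i][2] * co[i][2];
     }
 }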

It turns out that the octree used in the dual contouring code is pretty well set up for multithreading. The only links between nodes go from parent to child, so we’re pretty free to play with independent chunks of it. In general the approach I am trying is to run calculations on each octant of the octree’s root node independently. Of course, this limits the scalability to eight threads, but for typical CPUs that’s not much of a limitation.

So far I have added OpenMP parallel regions in four places in the octree code. The most complicated is the initial octree build, where each triangle of the input mesh is added to the octree. For that case I create eight fake roots that cover the eight octants of the real octree root. These fake roots are built independently, then attached to the real root at the end. This optimization is only used when the octree is at least three levels deep, to ensure that the fake roots can safely be made internal nodes.
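
Here’s a hypothetical sketch of the octant-level split; OctreeNode, process_octant, and process_root are stand-ins, not the real dual-contouring types. The eight subtrees under the root don’t reference each other, so each can be handled on its own thread:

 typedef struct OctreeNode {
     struct OctreeNode *children[8];
     /* ... per-node data ... */
 } OctreeNode;

 void process_octant(OctreeNode *octant); /* independent per-subtree work */

 void process_root(OctreeNode *root)
 {
     int i;

 #pragma omp parallel for schedule(dynamic)
     for (i = 0; i < 8; i++) {
         if (root->children[i])
             process_octant(root->children[i]);
     }
 }

The fake-root trick for the build step follows the same pattern: each fake root covers one octant, has its share of the input triangles inserted on its own thread, and is attached to the real root once all eight are done.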

And final results:

  • Suzanne test
    • Depth 10, Remove Disconnected Pieces disabled:
      • applyModifier time: 2.180901 seconds
    • Depth 11, Remove Disconnected Pieces disabled:
      • applyModifier time: 8.518875 seconds

Again, a healthy speed-up: depth 11 has gone from 15 seconds to 9 seconds. The combined time of remeshing and edge calculation now takes about a quarter of the time that trunk’s remesh does.

What’s Next

There are still more things I’d like to try for better performance: additional OpenMP tweaks, maybe SSE calculations. Preemptive note: the first person to say I should “add CUDA” or “put more OpenCL in it” gets a rotten tomato thrown at them :)

The code is currently available from the remesh-perf branch of my Blender repository. Feel free to give it a try and see if the results are similar on your system. Some form of these optimizations will of course go into trunk, but not until after 2.64 is released (whenever that may be).

Edit: I forgot to mention, everyone should check out Roberto Roch’s cool sculpting screencasts. Seeing his use of remesh for sculpting was the incentive for working on remesh more.

8 thoughts on “Optimizing the remesh modifier”

  1. If you don’t know much about trace() I guess you should not remove it at this stage. Remesh is already producing non-manifold meshes here and there :/

    Maybe run a few tests with meshes that have overlapping faces?

  2. Glad you follow Roberto Roch’s sculpt work – I was actually going to tweet to you that you should look at them, but I saw you already were following him. I’ve learned a lot from his screencasts, but his thinking outside the box in using booleans with remesh was my latest aha!

    Thanks for the hard number crunching Nick – impressive stuff.

  3. I remember once spending several hours speeding up some bit-munging code on MATLAB under Linux, only to discover that my changes made no difference under Microsoft Windows.

    1. Just to be clear, I didn’t implement dual contouring, I only modified it to incorporate it into Blender.

      I am aware of the many, many papers on other methods of remeshing. My opinion is that it would be great to have some alternate methods in Blender. As you note, some of them can provide better topology. You might not realize though: most of those methods are vastly slower — so you’d want them as a complement to dual contouring, not a replacement.

      Also, most of these techniques rely on being able to (numerically) solve equations that are well outside my knowledge of math :)
