1. Introduction
In this work, we propose Dynamic Plane Convolutional Occupancy Networks, an implicit representation that enables accurate scene-level reconstruction from 3D point clouds. Instead of learning features on three pre-defined canonical planes as in [28], we use a fully-connected network to learn dynamic planes, onto which we project the encoded per-point features. We systematically investigate the use of up to 7 learned planes and show in our experiments that performance improves progressively as the number of learned planes increases.
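- Using a deep neural network during feature learning to predict the planes best suited to the 3D surface reconstruction task from unoriented point clouds
- Superiority over existing state-of-the-art methods on object-level and scene-level 3D surface reconstruction tasks
- Various observations on the distribution of the dynamic planes in the experiments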
In addition, we use the positional encoding proposed in [24], which maps the low-dimensional 3D point coordinates to a higher-dimensional representation using periodic functions at various frequencies. We also formulate a similarity loss that encourages the dynamic planes to orient in diverse directions.
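As a rough illustration of the positional encoding idea, the following NumPy sketch maps each coordinate to sines and cosines at several frequencies. The function name `positional_encoding`, the number of frequencies, and the frequency schedule 2^l · π are assumptions made for this example; the text only states that periodic functions at various frequencies are used.

```python
import numpy as np

def positional_encoding(points, num_freqs=10):
    """Map (N, 3) coordinates to (N, 3 * 2 * num_freqs) periodic features.

    Assumption: frequencies follow 2**l * pi for l = 0 .. num_freqs - 1.
    """
    points = np.asarray(points, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # (L,)
    angles = points[:, :, None] * freqs[None, None, :]   # (N, 3, L)
    # For each coordinate: L sine features followed by L cosine features.
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (N, 3, 2L)
    return enc.reshape(points.shape[0], -1)
```

With `num_freqs=10`, every 3D point becomes a 60-dimensional vector that can be fed to the point encoder in place of the raw coordinates.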
3. Method
Our goal is to reconstruct 3D scenes with fine details from noisy point clouds. To this end, we first encode the input point clouds into 2D feature planes, whose parameters are predicted by a fully-connected network. These feature planes are then processed using convolutional networks and decoded into occupancy probabilities via another shallow fully-connected network. Fig. 1 illustrates the overall workflow of the proposed method.
3.1. Encoder

- Explanation of Fig. 2: a ResNet PointNet is used to extract per-point features. A plane predictor network consisting of a simple PointNet is added on top to predict the dynamic plane parameters.
Point cloud encoding:
Given a noisy point cloud, we first form a feature embedding for every point with a ResNet PointNet [23], in which we perform local pooling according to the predefined plane resolution [28]. We use a rather simple network here as a proof of concept, but more advanced feature extractors, e.g., PointNet++ [31] or Tangent Convolution [33], can also be used.
Dynamic plane prediction:
Having the point-wise features, we can then construct the planar features. Mathematically, a plane is defined by a normal vector n = (a, b, c) and a point (x, y, z) that the plane passes through: ax + by + cz = d. Peng et al. [28] simply project features onto canonical planes, i.e., 3 planes aligned with the axes of the coordinate frame, x = 0, y = 0, z = 0. Unlike [28], we introduce another shallow fully-connected network to regress the plane parameters (a, b, c).
We perform max pooling globally on all points in the point cloud because a global context is needed to search for the proposal of the best possible planes. Since different input point clouds might be predicted with different planes, we call this process dynamic plane prediction. Note that we directly set the intercept of the plane d to 0 because the shifts along the normal direction do not change the feature projection process.
After predicting the plane parameters, we pass them through a single FC layer to obtain a feature for every dynamic plane. This feature is expanded to match the number of input points and summed with the output of the last layer of the ResNet PointNet, separately for each dynamic plane; we therefore call the result the plane-specific feature. Our main intention is to allow backpropagation into the plane predictor network, but we also find empirically that this per-plane summation improves the reconstruction quality. One possible reason is that it allows the network to place varying emphasis on the feature dimensions of the last ResNet PointNet layer for each individual dynamic plane.
Given the predicted plane parameters, we project the summed features onto the dynamic planes, each discretized into an H × W grid, and apply max pooling to the features that fall into the same grid cell.
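The per-cell max pooling can be sketched in NumPy as a scatter-max. This is only an illustration: the helper name `pool_to_plane` is hypothetical, the 2D plane coordinates are assumed to be already rescaled to [0, 1), and filling empty cells with zeros is an assumption rather than something the text specifies.

```python
import numpy as np

def pool_to_plane(uv, feats, H=64, W=64):
    """Max-pool per-point features into an (H, W, C) feature plane.

    uv:    (N, 2) plane coordinates, assumed to be rescaled to [0, 1).
    feats: (N, C) plane-specific point features.
    """
    uv = np.asarray(uv, dtype=np.float64)
    feats = np.asarray(feats, dtype=np.float64)
    # Convert continuous coordinates to integer cell indices.
    rows = np.clip((uv[:, 1] * H).astype(int), 0, H - 1)
    cols = np.clip((uv[:, 0] * W).astype(int), 0, W - 1)
    grid = np.full((H, W, feats.shape[1]), -np.inf)
    # Unbuffered scatter-max: points in the same cell keep the channel-wise maximum.
    np.maximum.at(grid, (rows, cols), feats)
    # Assumption: cells that received no point get a zero feature.
    grid[np.isneginf(grid)] = 0.0
    return grid
```

The resulting grid has the layout expected by a 2D convolutional network once the channel axis is moved to the front.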
Planar projection:
To project the encoded features onto the dynamic planes, whose normals can point in any direction, we sequentially apply a basis change, an orthographic projection, and a normalization that keeps the projected points inside the H × W grid. Let i, j, k denote the three basis vectors of the canonical axes, where k is the normal of the ground plane, and let n be the learned plane normal. The operations are detailed as follows and illustrated in Fig. 3.

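To perform the basis change, we normalize n into a unit vector n̂ and obtain the rotation matrix R that aligns k with n̂. Let v = k × n̂; the rotation matrix R is defined as:

$$R = I + [v]_\times + [v]_\times^2 \, \frac{1}{1 + k \cdot \hat{n}}$$

where [v]× is the skew-symmetric matrix of v = (v_1, v_2, v_3):

$$[v]_\times = \begin{bmatrix} 0 & -v_3 & v_2 \\ v_3 & 0 & -v_1 \\ -v_2 & v_1 & 0 \end{bmatrix}$$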
With the rotation matrix R, we rotate the axes i and j to obtain î = Ri and ĵ = Rj. Now the vectors î, ĵ, and n̂ are mutually orthogonal and serve as the basis of the predicted plane coordinate system.
Next, we convert point coordinates from the world coordinate system to the plane coordinate system and project the features orthographically onto the predicted plane (the "new ground plane"). However, as our dynamic plane can be oriented in any direction, the orthographic projection of a point far from the centroid of the 3D space might fall outside the H × W grid.
To ensure that all possible points lie inside the grid after orthographic projection, we divide the projected coordinates by a normalization constant c ≥ 1. To find c, we first move î and ĵ into the positive octant by taking their element-wise absolute values, |î| and |ĵ|. Next, we obtain the orthogonal projections of the vector 1 = [1, 1, 1] onto |î| and |ĵ|:

$$\mathbf{1}_{\hat{i}} = \frac{\mathbf{1} \cdot |\hat{i}|}{\| \hat{i} \|^2} \, |\hat{i}|, \qquad \mathbf{1}_{\hat{j}} = \frac{\mathbf{1} \cdot |\hat{j}|}{\| \hat{j} \|^2} \, |\hat{j}|$$

Subsequently, we set c to the larger of the lengths of these two projected vectors, c = max(‖1_î‖, ‖1_ĵ‖). The point coordinates in the plane coordinate system are divided by c so that all points lie inside the dynamic plane, where the point features are stored.
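The three steps (basis change, orthographic projection, normalization) can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: the rotation uses the standard closed form R = I + [v]× + [v]×² / (1 + k · n̂) for aligning k with n̂, the function names are hypothetical, and the handling of the degenerate case n̂ = −k is an assumption that the text does not discuss.

```python
import numpy as np

def plane_basis(n):
    """Return (i_hat, j_hat, n_hat): an orthonormal basis for the plane with normal n."""
    k = np.array([0.0, 0.0, 1.0])
    n_hat = np.asarray(n, dtype=np.float64)
    n_hat = n_hat / np.linalg.norm(n_hat)
    v = np.cross(k, n_hat)
    cos = float(np.dot(k, n_hat))
    if np.isclose(cos, -1.0):
        # Assumption: for n_hat = -k the closed form is singular,
        # so use a 180-degree rotation about the x axis instead.
        R = np.diag([1.0, -1.0, -1.0])
    else:
        v_x = np.array([[0.0, -v[2], v[1]],
                        [v[2], 0.0, -v[0]],
                        [-v[1], v[0], 0.0]])
        R = np.eye(3) + v_x + v_x @ v_x / (1.0 + cos)
    i_hat = R @ np.array([1.0, 0.0, 0.0])
    j_hat = R @ np.array([0.0, 1.0, 0.0])
    return i_hat, j_hat, n_hat

def normalization_constant(i_hat, j_hat):
    """c = max length of the projections of [1, 1, 1] onto |i_hat| and |j_hat|."""
    one = np.ones(3)
    lengths = []
    for axis in (np.abs(i_hat), np.abs(j_hat)):
        proj = (one @ axis) / (axis @ axis) * axis
        lengths.append(np.linalg.norm(proj))
    return max(lengths)

def project_to_plane(points, n):
    """Orthographically project (N, 3) points onto the plane with normal n."""
    points = np.asarray(points, dtype=np.float64)
    i_hat, j_hat, _ = plane_basis(n)
    c = normalization_constant(i_hat, j_hat)
    # Dropping the n_hat component is the orthographic projection.
    uv = np.stack([points @ i_hat, points @ j_hat], axis=1)
    return uv / c
```

Because the component along n̂ is discarded, shifting a point along the normal leaves its projected coordinates unchanged, which is why the intercept d can be fixed to 0. For points inside a cube centered at the origin, dividing by c keeps the projected coordinates within the same half-width as the cube.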
3.2. Decoder
The goal of the decoder is to obtain the occupancy prediction of any point p ∈ ℝ³ given the aggregated planar features. Similar to how we project features in the encoder, we project p onto all dynamic planes. Next, we query the feature at the projected location through bilinear interpolation of the planar features stored at the four neighboring grid cells.
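The bilinear feature query can be illustrated with a short NumPy sketch. The function name `bilinear_query` is hypothetical, and the coordinate convention (coordinates rescaled to [0, 1], with 0 and 1 mapped to the centers of the first and last cells) is an assumption made for this example.

```python
import numpy as np

def bilinear_query(grid, uv):
    """Bilinearly interpolate an (H, W, C) feature plane at (N, 2) locations.

    Assumption: uv lies in [0, 1]; u indexes columns, v indexes rows,
    and 0 / 1 map to the first / last cell center. Requires H, W >= 2.
    """
    H, W, _ = grid.shape
    uv = np.asarray(uv, dtype=np.float64)
    x = np.clip(uv[:, 0], 0.0, 1.0) * (W - 1)
    y = np.clip(uv[:, 1], 0.0, 1.0) * (H - 1)
    # Top-left neighbor, clamped so that the bottom-right neighbor stays in range.
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    x1, y1 = x0 + 1, y0 + 1
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    # Interpolate along columns first, then along rows.
    top = grid[y0, x0] * (1.0 - wx) + grid[y0, x1] * wx
    bottom = grid[y1, x0] * (1.0 - wx) + grid[y1, x1] * wx
    return top * (1.0 - wy) + bottom * wy
```

In a full decoder this query would be repeated for every dynamic plane, and the per-plane results combined into a single feature vector for p.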
Occupancy Prediction:
Given an input point cloud x, we predict the occupancy of p based on the feature vector at point p, denoted as ψ(p, x):

$$f_\theta(p, \psi(p, x)) \rightarrow [0, 1]$$
3.3. Training and Inference
3.4. Positional Encoding
4. Experiments
4.1. Object-Level Reconstruction
We follow [28] and uniformly sample 2048 points. We use a plane resolution of 64² and a U-Net with a depth of 4. We run experiments with different combinations of canonical and dynamic planes. The results are summarized in Table 1.
Observation on plane distribution:
Here, we discuss our observations on the distribution of the predicted dynamic planes. In the case of 3 dynamic planes, our network predicts three canonical planes for all objects. In the case of 5 and 7 dynamic planes, there are combinations of flipping sets of normals (e.g. one normal pointing upward and the other downward). Such flipping sets of normals are equivalent to applying a horizontal flip on the projected encoded features.
Similarity loss:
To test whether having diverse plane normals that are neither aligned nor flipping to each other can have a significant impact on the model performance, we try another variant where we restrict the learned plane normals to be diverging by adding a pairwise cosine similarity loss among plane normals.
- In the paper, C is set to 10 × M. With 3 planes, the learned plane set is almost identical to the canonical axes.
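A pairwise cosine similarity penalty over the plane normals could look like the following NumPy sketch. The exact form of the loss is not given in these notes, so both the use of the absolute cosine (chosen so that flipped normals are penalized as much as aligned ones) and the averaging over pairs are assumptions, as is the name `similarity_loss`.

```python
import numpy as np

def similarity_loss(normals):
    """Mean absolute pairwise cosine similarity between plane normals.

    normals: (M, 3) array of (not necessarily unit-length) plane normals.
    Assumption: |cos| is penalized so that both aligned and flipped pairs cost 1.
    """
    n = np.asarray(normals, dtype=np.float64)
    n = n / np.linalg.norm(n, axis=1, keepdims=True)
    cos = n @ n.T                                # (M, M) cosine similarities
    upper = np.triu_indices(len(n), k=1)         # each unordered pair once
    return float(np.abs(cos[upper]).mean())
```

Under this form, mutually orthogonal normals such as the three canonical axes give a loss of 0, while a pair of aligned or flipped normals gives 1.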
Adding more planes, e.g., 5 or 7 dynamic planes, yields predicted planes whose normals diverge from the canonical axes without any flipped or redundant sets; only slight variations between objects are observed.
- With the similarity loss, generalization improves when many dynamic planes are used. The figure above shows the results when the objects are rotated.
4.2. Scene-Level Reconstruction
For the scene-level experiment, we uniformly sample 10,000 points from the ground truth meshes as input and apply Gaussian noise with a standard deviation of 0.05. During training, we query the occupancy probability of 2048 points.
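The input preparation described above can be sketched as area-weighted sampling on a triangle mesh followed by additive Gaussian noise. This is an illustrative sketch only: the function name `sample_noisy_points` and the `vertices` / `faces` array layout are assumptions, not the pipeline actually used in the experiments.

```python
import numpy as np

def sample_noisy_points(vertices, faces, num_points=10000, noise_std=0.05, seed=0):
    """Uniformly sample points on a triangle mesh and add Gaussian noise.

    vertices: (V, 3) float array; faces: (F, 3) integer array of vertex indices.
    """
    rng = np.random.default_rng(seed)
    tri = np.asarray(vertices, dtype=np.float64)[np.asarray(faces)]  # (F, 3, 3)
    # Triangle areas, used as sampling probabilities for uniform surface coverage.
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    face_idx = rng.choice(len(tri), size=num_points, p=areas / areas.sum())
    # Uniform barycentric coordinates via the square-root trick.
    r1 = np.sqrt(rng.random(num_points))
    r2 = rng.random(num_points)
    w = np.stack([1.0 - r1, r1 * (1.0 - r2), r1 * r2], axis=1)       # (N, 3)
    clean = (tri[face_idx] * w[:, :, None]).sum(axis=1)              # (N, 3)
    return clean + rng.normal(0.0, noise_std, size=clean.shape)
```

Setting `noise_std=0.0` returns the clean surface samples, which is convenient for checking the sampler in isolation.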
5. Conclusion
In this work, we introduced Dynamic Plane Convolutional Occupancy Networks, a novel implicit representation method for 3D reconstruction from point clouds. We proposed to learn dynamic planes to form informative features. We observe that the 3 canonical planes are always predicted and that the symmetry of objects is implicitly encoded. We also find that enforcing a similarity loss on the predicted plane normals considerably improves the performance on unseen object poses. In future work, we plan to assess the theoretical support for the dynamic plane prediction.
