Received 13 March 2023, accepted 28 April 2023, date of publication 11 May 2023, date of current version 17 May 2023.
Digital Object Identifier 10.1109/ACCESS.2023.3275442

DC-DFFN: Densely Connected Deep Feature Fusion Network With Sign Agnostic Learning for Implicit Shape Representation

ABOL BASHER AND JANI BOUTELLIER
School of Technology and Innovation, University of Vaasa, 65200 Vaasa, Finland
Corresponding authors: Abol Basher (abol.basher@uwasa.fi) and Jani Boutellier (jani.boutellier@uwasa.fi)

This work was supported in part by the Academy of Finland Projects Compact and Efficient Deep Neural Networks for Ubiquitous Computer Vision (CoEfNet) and Robust and Efficient Perception for Autonomous Things (REPEAT).

ABSTRACT Reconstructing 3D surfaces from raw point cloud data is still a challenging and complex problem in computer vision and graphics. Recently emerged neural implicit representations model 3D surfaces implicitly at arbitrary resolution and with diverse topologies. In this domain, most studies have so far used single latent code-based variational auto-encoder (VAE) or auto-decoder (AD) architectures, or architectures similar to UNets. Because the existing approaches use deep architectures, gradients and/or input information can vanish while passing through the layers, which can cause suboptimal learning at training time and consequently low performance at test time. As a countermeasure, skip connections and feature fusion have been used in related application fields of convolutional neural networks. In this study, we embrace this idea and propose a novel densely connected deep feature fusion network architecture, DC-DFFN, for implicit shape representation. The experimental results show that DC-DFFN outperforms baseline approaches in terms of visual reconstruction quality and quantitatively on several measures. In addition, the proposed approach converges faster during training than the baseline approaches.
The DC-DFFN architecture has been implemented in PyTorch and is available as open source.

INDEX TERMS Convolutional neural network, implicit representation, dense feature fusion, zero-label set, surface reconstruction, ShapeNet, D-Faust.

The associate editor coordinating the review of this manuscript and approving it for publication was K. C. Santosh.

I. INTRODUCTION
Recent advances in learning-based data-driven approaches [1], [2], [3], [4], [5], [6], [7], [8] for reconstructing surfaces from raw un-oriented point clouds and triangle soups show great potential for several practical application fields, for example AR/VR technology, 3D printing, computer-aided design, and robotics. Recently emerged neural function-based implicit representations [1], [2], [3], [5], [8], [9], [10] can reconstruct a surface with infinite resolution and arbitrary topology, in contrast to classical 3D representations such as voxels, octrees, point clouds, and meshes, which have various ingrained issues. For example, voxel-based representations have problems related to resolution (the memory requirement increases cubically with resolution), whereas point cloud-based representations lack connectivity among the points, and meshes can have self-intersection issues [9], [11] and are restricted to a fixed topology.

The recently emerged implicit representations of 3D visual data to a great extent solve the problems related to classical representations, but pose new challenges related to the complexity and computation time of the involved neural networks. Moreover, most of the implicit representation-based surface reconstruction works [3], [4], [5], [6], [7], [8], [9] focus only on reconstruction quality, the use of novel activation functions, and optimization methods, hardly paying attention to network size [12] or training and inference time.
In this study, we propose a densely connected feature fusion-based encoder-decoder neural architecture to ensure maximum input information flow through the network for better learning and, consequently, more expressive reconstruction quality.

FIGURE 1. Illustration of the proposed architecture's improved capability of capturing small detail, compared to the state-of-the-art work SALD [1]. Zooming in reveals that SALD causes wing and wheel parts to become disconnected, and renders extra holes in the airplane's hull.

VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

In deep neural architectures, the gradients or input information can vanish or wash out while passing through the layers, which has already been addressed in several works on 2D image domain applications [13], [14], [15], such as object recognition, object detection, and localization. So far, to the best knowledge of the authors, no previous study has used dense features to simultaneously learn the surface and address the vanishing gradient problem in the network architectures used for neural implicit representations. Besides mitigating the vanishing gradient problem, fusing features densely allows the model to learn low-level (blob, edge) and high-level (object) features simultaneously, as they are extracted from different layers of the network. To this end, we propose a dense feature fusion-based encoder-decoder network architecture to achieve high fidelity surface reconstruction.
The main contributions of this study are:
• A novel deep feature fusion-based variational auto-encoder architecture¹ for implicit surface reconstruction, which
  • ensures better learning and shorter convergence time due to improved information flow through the network,
  • provides better robustness and generalization to unseen shapes, and
  • reconstructs high fidelity continuous surfaces and obtains state-of-the-art quantitative results.

¹ Source code available: https://github.com/basher8488881/DC-DFFN

The proposed study is a natural extension of previous approaches [1], [8], transformed into a densely connected feature fusion-based network architecture. The rest of this paper is organized as follows: recently proposed related studies on explicit and implicit image representations are discussed in Section II; the proposed densely connected deep feature fusion-based encoder and decoder architectures are illustrated in Section III; qualitative and quantitative comparisons with the baseline approaches are shown in Section IV; and Section VI concludes the paper.

II. RELATED WORK
3D surface reconstruction approaches can be categorised based on their inherent ways of representing visual data: (a) explicit or classical representations, such as voxels, point clouds, and meshes, and (b) implicit representations. In this section, we review traditional analytic priors-based reconstruction methods as well as classical and implicit representation-based approaches. Additionally, we review a few feature fusion-based studies from the 2D image domain to illustrate commonly used strategies for constructing efficient network architectures, which are the backbone of this study.

A. TRADITIONAL RECONSTRUCTION APPROACHES
There are a number of existing methods that are based on analytic priors for surface reconstruction, for example Screened Poisson Surface Reconstruction (SPSR) [16], Moving Least Squares (MLS) [17], the Ball Pivoting Algorithm (BPA) [18], and Radial Basis Functions (RBF) [19]. SPSR was developed on top of the previously proposed Poisson Surface Reconstruction (PSR) algorithm [20], which works based on global surface smoothness priors, and addresses the limitations associated with PSR, for example its tendency to over-smooth the data. This approach casts the surface reconstruction task as a spatial Poisson problem and performs reconstruction in the frequency domain [21]. However, SPSR requires oriented normals of the input points. Similar to SPSR, RBF also works based on global surface smoothness priors and performs the reconstruction using radially symmetric basis functions. MLS is a mesh-independent approach to surface interpolation. It treats surfaces as in differential geometry, with a local mapping function and a local reference system for each point of the surface, and uses the moving least squares concept. BPA, on the other hand, reconstructs the surface by computing triangles that interpolate a given point cloud. In BPA, starting from a triangle formed by three points of the point cloud, a sphere with a predefined radius is rotated around the edges until it touches another point.

B. CLASSICAL REPRESENTATIONS OF 3D VISUAL DATA
Voxel-based representations are one of the most popular and earliest representations for learning-based 3D reconstruction of shapes and scenes [22], [23], [24].
In this representation, the 3D space is discretized into a regular grid, making it an intuitive extension for learning-based algorithms that have been developed for the 2D image domain, such as deep (convolutional) neural networks. In its simplest form, voxels can be used to learn a dense occupancy grid (where each voxel is occupied or unoccupied), and this information can be used to render a mesh surface [2], [5], [25]. However, due to cubically increasing memory requirements and the lacking fidelity of rendered shapes, the usage of occupancy grids is limited to specific use cases [26], [27], [28].

Point clouds are another classical 3D data representation that expresses the 3D visual information by sparse data points and provides several advantages over voxels, for example the capability of better representing large spaces with fine details. Point clouds also serve as one possible output representation of implicit surface modeling [3]. Drawbacks of point clouds include the lack of connectivity information and the high memory footprint of large surfaces that need to be represented densely.

In contrast, the mesh representation bears more information than point cloud-based representations by expressing connectivity among 3D points. The vertices and faces of a mesh can be directly regressed using a neural network [29], [30]. Meshes have a wide range of applications, for example in classification and segmentation [31], [32], [33]. More recently, mesh-based representations have also been used as the output representation for implicit 3D surface reconstruction [1], [5], [8], [9], [10], [12].

C. IMPLICIT REPRESENTATIONS
Recently emerged implicit representations express the 3D surface $S$ implicitly using (zero) level sets (Equation 1),

$$S = \{x \in \mathbb{R}^3 \mid f(x; w) = 0\} \tag{1}$$

of a neural function $f : \mathbb{R}^3 \to \mathbb{R}$, where $x \in \mathbb{R}^3$ is the input data (sampled from a point cloud or triangle soup, $X \subset \mathbb{R}^3$), and $w$ are the neural network weights that approximate the surface to $X$.
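As a minimal illustration of Equation 1, the sketch below replaces the learned network $f(x; w)$ with the analytic signed distance function of a unit sphere. This stand-in, together with the names `sphere_sdf` and `on_zero_level_set`, is an assumption made for illustration and is not part of DC-DFFN; it only shows how membership in the zero level set $S$ is decided.

```python
import math

def sphere_sdf(x, radius=1.0):
    """Signed distance to an origin-centred sphere: negative inside, positive outside.

    Hypothetical stand-in for the learned function f(x; w).
    """
    return math.sqrt(sum(c * c for c in x)) - radius

def on_zero_level_set(f, x, tol=1e-6):
    """A point belongs to S when f evaluates to (approximately) zero."""
    return abs(f(x)) <= tol

points = [(1.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 2.0)]
labels = [on_zero_level_set(sphere_sdf, p) for p in points]
print(labels)  # only the first point lies on the surface
```

With a trained network, exact zeros are rarely hit at sampled points, so a meshing algorithm such as Marching Cubes is used to locate the sign changes of $f$ on a grid instead of testing individual points.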
There are mainly two types of supervised approaches [1] commonly used to train the neural network to become an implicit function representation: (I) regression of known or pre-computed occupancy values $f(p, z) : \mathbb{R}^3 \times Z \to [0, 1]$ using an occupancy function [5], [9], signed distances $f(p, z) : \mathbb{R}^3 \times Z \to \mathbb{R}$ using a signed distance function [1], [8], [10], or unsigned distances $f(p, z) : \mathbb{R}^3 \times Z \to \mathbb{R}_0^+$ using an unsigned distance function [3], [34]; and (II) regression of raw 3D data using sign agnostic losses [1], [8] by relating points on the level sets to the neural network model parameters [35], or supervision with partial differential equations approximating the signed distance functions [36].

In this study, we adopted the second approach, training the proposed network using a sign agnostic loss function [1], [8]. Our proposed network outperforms the recently proposed state-of-the-art method SALD [1] and shows that dense connections in the network provide improved information flow through the layers and faster convergence, consequently generating high fidelity shapes that preserve small details. In addition to proposing dense connectivity and feature fusion, our architecture employs 1D convolutional layers with 1 × 1 kernels that generalize better on complex shapes than the dense layers used by previous works [1], [8], [36].

D. FEATURE FUSION NETWORK ARCHITECTURES
Feature fusion and skip connections are used to enhance the performance of (convolutional) neural network models by mitigating the vanishing gradient problem in deep networks [13], [14], [15], [37], [38], [39]. In this approach, features from previous layer(s) are fused in the next layer(s) either by summation or by concatenation. For example, in the ResNet architecture [14] the previous layer's features are simply added to the next layer's output. In contrast, in the DenseNet architecture [13], all features of the preceding layers are concatenated in the next layer's output.
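The difference between the two fusion styles can be sketched in a few lines of plain Python. Feature vectors are modelled here as lists of floats, and the helper names `fuse_by_sum` and `fuse_by_concat` are hypothetical; real networks apply the same operations to tensors along the channel dimension.

```python
def fuse_by_sum(previous, current):
    """ResNet-style fusion: element-wise addition, channel count unchanged."""
    if len(previous) != len(current):
        raise ValueError("summation requires equal channel counts")
    return [p + c for p, c in zip(previous, current)]

def fuse_by_concat(*features):
    """DenseNet-style fusion: stack the channels of all given feature vectors."""
    fused = []
    for feature in features:
        fused.extend(feature)
    return fused

f1 = [1.0, 2.0]   # output of layer 1 (toy values)
f2 = [0.5, -1.0]  # output of layer 2
f3 = [3.0, 0.0]   # output of layer 3

summed = fuse_by_sum(f1, f2)         # still 2 channels
dense = fuse_by_concat(f1, f2, f3)   # 6 channels: grows with every fused layer
```

Summation keeps the channel count fixed, whereas concatenation grows it with every fused layer, so densely connected layers need correspondingly wider inputs.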
Finally, attention-based feature fusion [40] fuses point-wise features and local features to compensate for the loss caused by order-invariant max-pooling on point clouds, and to improve the 3D semantic segmentation accuracy of point clouds.

In the context of learning from point clouds, feature fusion and skip connections have been used in a few studies [41], [42], [43] to improve learning of shapes/scenes. However, the way features are fused in previous studies is significantly different from the proposed architecture: previous studies perform feature fusion similar to UNets [44], whereas the proposed variational auto-encoder architecture fuses features in a fashion similar to DenseNet [13].

III. PROPOSED ARCHITECTURE
In the following, we present our densely connected deep feature fusion network architecture, DC-DFFN, for implicit 3D representation. DC-DFFN is directly trainable on raw input data, for example raw (un-oriented) point clouds or triangle soups. Our proposed variational auto-encoder consists of an encoder and a decoder, which are constructed from novel mPEFE (encoder) and mNSDA (decoder) convolutional modules. The proposed feature fusion concept is applied within the decoder and within the encoder, but not between them. We describe the proposed network in the following sections.

FIGURE 2. The proposed DC-DFFN encoder and decoder architectures consist of 3 mPEFE modules and 5 mNSDA modules, respectively. In the network figures, DsL{1-3}, B, F{1-3}, M{1-3}, FFL{1-5} (decoder) and N stand for the DeepSet layer constructed with densely connected features, the batch size, the features after the 1D convolutional operation, the extracted features after the max pooling operation, the densely connected deep feature fusion layer (decoder), and the number of input points to the network, respectively. In addition, Concat([F1-5], axis=1) states that the features (F1, F2, F3, F4, F5) are concatenated along the channel dimension. The encoder receives raw input data points (N, 3)/(N, 6) (the latter if surface normals are available) and unsigned distances as ground truth, and outputs a latent vector, $\mu$, and a diagonal covariance matrix, $\Sigma = \operatorname{diag} \exp \eta$, which are later used to form a probability measure $\mathcal{N}(\mu, \Sigma)$ to construct the latent code that represents the input object shape. The decoder uses the encoded latent vector as its input and predicts the signed distances, which are later used to mesh the shape using, e.g., the Marching Cubes algorithm.

A. ENCODER
The DC-DFFN encoder essentially consists of three multi-layered permutation equivariant feature extraction (mPEFE) modules. The mPEFE module includes two layer types: (a) a modified PointNet [45] layer (Conv1D-MaxPool) and (b) a DeepSet [46] layer. The PointNet layer extracts permutation invariant global features, using the max-pooling layer as a symmetric function on the high dimensional feature embedding learned from unstructured raw point cloud data. Our PointNet layer implementation uses 1D convolutional layers, whereas the original PointNet [45] layers rely on 2D convolution. The DeepSet layer amalgamates the global features with the high dimensional embedding of local features extracted by the convolutional layers. A similar implementation pattern can also be achieved by fully connected layers (used by SAL [8] and SALD [1]) instead of the 1D convolutional layers (see Appendix A for more information). Through the dense interconnectivity between mPEFE modules, the proposed architecture aggregates multi-layered local features with order invariant global features within the in-built DeepSet layer.
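The interplay of per-point feature extraction, max-pooling, and fusion can be sketched without any deep learning library. In the sketch below, a fixed linear map stands in for a learned 1D convolution with a 1 × 1 kernel, and the fusion is assumed to be a concatenation of each point's local feature with the pooled global feature. The function names (`pointwise_features`, `global_max_pool`, `deepset_fuse`) are hypothetical and not taken from the DC-DFFN source.

```python
def pointwise_features(point):
    """Stand-in for a learned 1x1 convolution: the same map is applied to every point."""
    x, y, z = point
    return [x + y, y - z, 2.0 * z]

def global_max_pool(features):
    """Channel-wise maximum over all points: a symmetric, order-invariant function."""
    return [max(channel) for channel in zip(*features)]

def deepset_fuse(points):
    """Concatenate each point's local feature with the shared global feature."""
    local = [pointwise_features(p) for p in points]
    global_feature = global_max_pool(local)
    return [feature + global_feature for feature in local]

points = [(0.0, 1.0, 2.0), (3.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
fused = deepset_fuse(points)

# Reordering the input points reorders the output rows in the same way
# (permutation equivariance), while the pooled global part is unchanged.
fused_reversed = deepset_fuse(points[::-1])
```

The sketch uses local features from a single layer only; in DC-DFFN the local features entering the fusion come from several densely connected layers.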
In contrast, the architectures [1] and [8] concatenate only single-layer local features with order invariant global features. Within the mPEFE layers, the 1D convolutional layers are followed by a pooling layer, a feature fusion layer (modified DeepSet layer), and a ReLU activation function [47]. The last convolutional layer, outside the mPEFE modules, is followed by a pooling layer and a ReLU [47] activation function.

Two fully connected layers are used at the end of the network to formulate the probability measure $\mathcal{N}(\mu, \Sigma)$, where $\mu$ is the latent vector and $\eta$ is used to compute the diagonal covariance matrix, $\Sigma = \operatorname{diag} \exp \eta$. Therefore, the encoder $(\mu, \eta) = g(X, w_1)$ takes $X \subset \mathbb{R}^3$ as input data and outputs the two 256-dimensional vectors $\mu \in \mathbb{R}^{256}$ and $\eta \in \mathbb{R}^{256}$.

B. DECODER
Our decoder consists of five symmetric multi-layered neural signed distance approximation (mNSDA) modules. The mNSDA module has two main components: (a) signed distance extraction layers (Conv1D-Softplus with $\beta = 100$), and (b) a signed distance blending, i.e., feature blending, layer. The $l$-th module of the decoder receives input data from all $l - 1$ preceding modules, fused by a concatenation operation along the channel dimension. The mNSDA module has some resemblance to the DeepSDF [10] decoder; however, DeepSDF uses a fully connected layer instead of our convolutional layer, and a ReLU activation function instead of our Softplus. The mNSDA module promotes maximal information flow through the network layers and also prevents the vanishing gradient problem, mentioned in the DeepSDF [10] work, from reducing performance. The high-level architectural structure of our decoder is also similar to that of DeepSDF, except for the dense connectivity introduced in our architecture.
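For reference, Softplus with parameter $\beta$ is commonly defined as $\mathrm{softplus}_\beta(x) = \frac{1}{\beta}\log(1 + e^{\beta x})$, which is also the definition documented for PyTorch's `torch.nn.Softplus`. The sketch below is a numerically stable plain-Python version. The function name `softplus` and the rewriting via `log1p` are illustrative choices, not taken from the DC-DFFN source. It shows that with $\beta = 100$ the function stays within $\log(2)/\beta$ of ReLU while remaining smooth.

```python
import math

def softplus(x, beta=100.0):
    """Smooth approximation of ReLU: (1 / beta) * log(1 + exp(beta * x)).

    Rewritten as max(x, 0) + log1p(exp(-|beta * x|)) / beta so that
    exp() never overflows for large positive inputs.
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(beta * x))) / beta

def relu(x):
    return max(x, 0.0)

# The gap to ReLU is largest at x = 0, where it equals log(2) / beta.
samples = [-1.0, -0.01, 0.0, 0.01, 1.0]
gaps = [softplus(x) - relu(x) for x in samples]
```

A large $\beta$ therefore keeps the behaviour close to ReLU, while the function remains differentiable everywhere, which matters when the loss involves derivatives of the network output.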
Within the decoder there are a total of seven 1D convolutional layers (five of which are inside the mNSDA modules). The input size of the first layer is $(256 + 3/6, 512, N)$, that of the following layers is $([d_{out}^{(0)}, \ldots, d_{out}^{(l-2)}], 512, N)$, and that of the last layer is $([d_{out}^{(0)}, d_{out}^{(1)}, \ldots, d_{out}^{(l-1)}], 1, N)$. Here, $d_{out}^{(l)}$ is the output of the $l$-th layer of the decoder, and $[\cdot, \cdot]$ stands for the concatenation operation, which concatenates the features of the preceding layers of the decoder to form the next layer's input. $N$ stands for the number of input points, which is in this case $128^2$. The decoder's input is $[x, z]$, where $x \in \mathbb{R}^3$ and $z \in \mathbb{R}^{256}$ is the latent vector.

C. DATA PREPARATION
The unsigned distances of the given raw input data $X$ are pre-computed for 500k sample points using the CGAL library [48] to speed up the training. Moreover, the SALD loss is computed over the data points and their corresponding unsigned distance derivatives sampled from distributions $D$ and $D'$. Following [1], we set $D = D_1 \cup D_2$, where $D_1$ is obtained by uniformly sampling points $\{y\}$ from $X$ and placing two isotropic Gaussians, $\mathcal{N}(y, \sigma_1^2 I)$ and $\mathcal{N}(y, \sigma_2^2 I)$, at each $y$. Here, $\sigma_1$ depends on each sampled point $y$ and is the distance of the 50th closest point to $y$, whereas $\sigma_2$ is set to a fixed value of 0.3. $D_2$, on the other hand, is estimated by projecting $D_1$ to the surface $S$.

D. TRAINING AND INFERENCE
We used the SALD loss proposed in [1] with the Adam optimizer [49] to train our proposed DC-DFFN architecture. The SALD loss requires incorporating gradients in a differentiable manner, which is done with forward-mode automatic differentiation [50] by constructing network layers similar to those in [36]. For the D-Faust dataset, a fixed learning rate of 0.0005 and 500 training epochs were used for all models. For the ShapeNet dataset, on the other hand, the initial learning rate was set to 0.0005 and all models were trained for 1500 epochs. Moreover, a scheduler was set to decrease the learning rate by a factor of 0.5 after every 1000 epochs for the ShapeNet dataset. For both datasets, the training was performed on dual 24 GB GeForce RTX 3090 GPUs in the Ubuntu (20.04) Linux environment.

FIGURE 3. Training loss curves of IGR, SALD, and the proposed DC-DFFN model. It can be clearly seen that the proposed architecture learns faster and reaches a lower loss than the baseline architectures. The SAL loss curve is not provided because the numerical scale of the SAL loss differs significantly from the other three methods.

In the inference phase, the implicit representation of test samples was meshed using the Marching Cubes algorithm [25]. For quantitative comparisons, Chamfer distances and intersection-over-union (IOU) were computed between the reconstructed surface and the ground truth (for both datasets) and the input raw scans (for the D-Faust dataset).

IV. EXPERIMENTAL RESULTS
The proposed DC-DFFN architecture is evaluated on two challenging benchmark datasets, and compared to three recently proposed state-of-the-art approaches [1], [8], [36].

A. DATASETS
1) D-FAUST [51]
The D-Faust dataset contains 41k raw 3D scans (triangle soups) of 10 human subjects, including 5 female and 5 male subjects, in multiple poses. The scans have various defects, such as noise, holes, missing body parts, and occasional artifacts caused by reflections. In training and testing, we include only 1 out of 5 samples from the total 41k scans due to the dense temporal sampling of the dataset.
We establish three types of experiments on the D-Faust dataset [51] following the experimental setup used in [8]: (a) shape space learning, where 10 human subjects in various poses (129 different actions) are used for training and testing; (b) generalization to unseen human shapes, where 8 human subjects (4 females and 4 males) are used to train the model, and 2 human subjects (1 male and 1 female) are used to test the performance of the trained model; and (c) generalization to unseen poses, where two randomly selected human poses (from 10 human subjects) were used to test the model, and the rest of the data (10 humans) were used to train the model.

FIGURE 4. Visual and quantitative results of a single test sample for shape space learning. From the quantitative and qualitative results, it is clearly visible that the proposed DC-DFFN generates high fidelity reconstruction and achieves better IOU and CD scores than the baseline architectures.

TABLE 1. Quantitative results on shape space learning for the D-Faust dataset. The Chamfer distances are presented in percentiles (5th, 50th, and 95th) and mean scores, Chamfer distances multiplied by $10^3$. ↓: lower value is better; ↑: higher value is better.

We consider the same train and test split as [8]. However, we have removed those poses from the test shapes that contain scanning artifacts such as the floor or side walls. The cleaned test split shapes have been used during inference for all methods. Therefore, for each experimental setting the number of test shapes before and after the removal is: (a) (2044 → 2003), (b) (1920 → 1869), and (c) (652 → 651). The results with the original test splits (with artifacts) can be found in Appendix B.

2) ShapeNet [52]
The ShapeNet dataset contains non-manifold meshes with inconsistent orientation.
We consider four different object classes in our experiments: (1) Car (3533), (2) Sofa (3173), (3) Guitar (797), and (4) Airplane (4045). These classes are used to evaluate the performance of the proposed DC-DFFN architecture. For ShapeNet, train and test split files (75/25) were created locally.

B. METRICS
For performance evaluation, we consider Chamfer distance (CD) and volumetric intersection over union (IOU).

Volumetric IOU is the quotient of the volume of the intersection of the generated and the ground truth meshes and the volume of their union. As the baseline implicit reconstruction methods and our proposed DC-DFFN architecture produce only the mesh file, we create voxelized volumes of the test-time ground truth meshes and of the generated meshes. In order to obtain unbiased estimates of the union and intersection volumes, we randomly sample 100k data points from the ground truth and generated meshes, and determine whether the points are occupied or not occupied.

FIGURE 5. Visual comparison of a single test sample for unseen human shape learning. Quantitative scores are for the reconstructed mesh against the ground truth. ↓: lower value is better; ↑: higher value is better.

TABLE 2. Generalization performance on unseen human shape reconstruction for the D-Faust dataset. The Chamfer distances are presented in percentiles (5th, 50th, and 95th) and mean scores, Chamfer distances multiplied by $10^3$. ↓: lower value is better; ↑: higher value is better.

Chamfer distance is computed as the mean distance of points from the generated mesh to the ground truth mesh and in the opposite direction as well. Additionally, we compute the Chamfer distances between the generated mesh and the input scan.
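Both metrics can be sketched on small point sets in plain Python. The sketch makes several assumptions that the text does not fix: distances are unsquared Euclidean distances, the two directional means are averaged into one symmetric score, and IOU is estimated from occupancy flags evaluated at the same sample points for both shapes. All function names are hypothetical.

```python
import math

def mean_nn_distance(source, target):
    """One-directional term: mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

def chamfer_distance(a, b):
    """Symmetric score, here assumed to be the average of both directional means."""
    return 0.5 * (mean_nn_distance(a, b) + mean_nn_distance(b, a))

def sampled_iou(occupied_a, occupied_b):
    """IOU estimate from per-sample occupancy flags of two shapes."""
    pairs = list(zip(occupied_a, occupied_b))
    intersection = sum(1 for a, b in pairs if a and b)
    union = sum(1 for a, b in pairs if a or b)
    return intersection / union if union else 1.0

generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

one_way = mean_nn_distance(generated, reference)  # generated -> reference
other_way = mean_nn_distance(reference, generated)  # reference -> generated
cd = chamfer_distance(generated, reference)
iou = sampled_iou([True, True, False, False], [True, False, True, False])
```

Keeping the two directional terms separate, as `mean_nn_distance` does, is what allows them to be reported individually.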
Similar to the evaluation approach taken in [5], we define completeness as the mean Chamfer distance computed in the direction from the registration (Rg, ground truth) or the raw input scan (Sc) to the generated mesh (Gn), i.e., Rg→Gn and Sc→Gn, whereas the opposite direction (Gn→Rg and Gn→Sc) is defined as accuracy.

C. BASELINES
We compare the proposed DC-DFFN architecture to several related generative approaches that are capable of learning the shape space directly from raw 3D data.

SAL [8]: SAL is a generative implicit 3D reconstruction approach that learns the shape space from raw unsigned geometric data in a sign agnostic manner. We compare the proposed work against SAL using the D-Faust dataset, as SAL has inherent difficulties [8] in reconstructing thin shapes that are common in the ShapeNet dataset.

SALD [1]: SALD is a state-of-the-art approach for reconstructing 3D surfaces, which uses a sign agnostic regression loss function with derivatives, and learns the shape space directly from raw unsigned geometric data. The proposed approach is compared against SALD in all of our experiments.

IGR [36]: As shown in [36], a simple loss function can possess the implicit geometric regularization (IGR) property, which allows generating smooth and high fidelity surfaces directly from raw (un-oriented point cloud or triangle soup) input data, by directing the neural network to vanish on the input data and ensuring a unit norm gradient. Previously, IGR has achieved state-of-the-art quantitative results and high fidelity reconstruction [36]. The proposed approach is compared against IGR using the D-Faust dataset.

FIGURE 6. Visual comparison of baseline methods and the proposed DC-DFFN on unseen pose learning with IOU and Chamfer distance scores. ↓: lower value is better; ↑: higher value is better.

TABLE 3. Comparison on unseen pose learning for D-Faust test samples. The Chamfer distances are presented in percentiles (5th, 50th, and 95th) and mean scores, Chamfer distances multiplied by $10^3$. ↓: lower value is better; ↑: higher value is better.

D. SHAPE SPACE LEARNING
In this experiment, a randomly selected 70% of the data from the 10 D-Faust human subjects is used to train the proposed and the baseline architectures with 500 epochs. The remaining data, 30% of the samples, are used to test the trained models. Additionally, $128^2$ points randomly drawn from the pre-computed 500k sample points are used to train and test the proposed architecture. For the baseline architectures, the number of randomly drawn points is as follows: SAL: $128^2$, SALD: $92^2$, and IGR: $128^2$, as given in the respective original implementations. The final shape reconstruction has been generated with a resolution of $100^3$ for all architectures.

The quantitative and qualitative results of this experiment are shown in Table 1 and Fig. 4, respectively. From the quantitative results, it can be clearly seen that the proposed architecture achieves superior results compared to the baseline methods, with one exception: SAL outperforms the proposed approach in the 95th percentile of the completeness (Sc→Gn) measure. Moreover, the proposed architecture generates surface reconstructions with superior accuracy in small details compared to the baselines, which can be seen in the IOU results and in Fig. 1.

FIGURE 7. Visual quality comparison of a single test sample from each ShapeNet [52] test class. ↓: lower value is better; ↑: higher value is better.

E. GENERALIZATION TO UNSEEN HUMANS
In this experiment on the D-Faust dataset, we test the proposed architecture's generalization capability on previously unknown subjects. The training samples have been drawn from 8 randomly selected human subjects (4 female and 4 male subjects), whereas the remaining 2 human subjects (1 male and 1 female subject) were used to test the architectures. The same numbers of randomly drawn data points as before (SAL: $128^2$, SALD: $92^2$, IGR: $128^2$, and DC-DFFN: $128^2$) were used to train and test the models.

The evaluation results in Table 2 show that the proposed architecture outperforms the baseline architectures in most cases in each metric. As exceptions, Table 2 shows that IGR outperforms the proposed architecture in two Chamfer distance cases (5th and 50th percentiles, Rg→Gn and Sc→Gn, respectively). However, in the case of mean Chamfer distance and IOU, DC-DFFN outperforms IGR by a large margin. Additionally, the proposed architecture can preserve structural detail better than the baseline architectures. Compared to the shape space learning experiment (Section IV-D), however, all architectures provide overall worse results, as expected.

F. GENERALIZATION TO UNSEEN POSES
In this experiment on the D-Faust dataset, two poses have been randomly selected from the 10 human subjects for testing, and the rest of the data is used to train the models. Similar to the previous settings, the numbers of data points used are SAL: $128^2$, SALD: $92^2$, IGR: $128^2$, and DC-DFFN: $128^2$, drawn from the pre-computed sample data points during training and testing of the models.

The results are shown in Table 3 and Fig. 6: on average, DC-DFFN achieves better or comparable results in all metrics. As an exception, SAL outperforms the proposed architecture in terms of Chamfer distance (95th percentile) and completeness (Rg→Gn and Sc→Gn) scores. However, in terms of volumetric IOU, DC-DFFN outperforms all other approaches without exceptions.

G. GENERALIZATION TO OBJECT SHAPES
Beyond learning human shapes, experiments on learning object shapes were also performed to evaluate the proposed architecture, using the ShapeNet dataset, which comprises non-manifold/non-oriented meshes that depict various objects. Compared to human shapes, objects exhibit sharp corners, holes, and thin structures. For the baseline architectures, we used the train and test settings described in the original works, except for the data split files, for which we used 75% of the samples for training and 25% of the samples for testing. The network architectures were trained with 1500 epochs on each class (Car, Guitar, Airplane, and Sofa). The IGR method was not included in this evaluation, as IGR expects consistently oriented normals, which were unavailable for the ShapeNet dataset.

TABLE 4. ShapeNet quantitative results. The Chamfer distances are presented in percentiles (5th, 50th, and 95th) and mean scores, Chamfer distances multiplied by $10^3$. ↓: lower value is better; ↑: higher value is better.

TABLE 5. Comparison against inferior DC-DFFN variants on unseen human reconstruction. The Chamfer distances are presented in percentiles (5th, 50th, and 95th) and mean scores, where the Chamfer distance numbers are multiplied by $10^3$. All variants were trained with 500 epochs. ↓: lower value is better; ↑: higher value is better.

The results are shown in Table 4 and Fig. 7. DC-DFFN significantly outperforms the state-of-the-art SALD architecture, except for the Guitar class, where the measured Chamfer distances favor SALD in the 5th and 95th percentile cases. The qualitative results of Fig. 7 show that DC-DFFN can capture thin structures (airplane wings) and sharp corners (sofa armrests) much better than SALD.

V. DISCUSSION
In this study, we proposed the feature fusion-based variational auto-encoder network DC-DFFN.
The novel characteristics of the architecture design improve training speed (see Fig. 3), improve performance at inference time (see Tables 1, 2, 3, and 4), and provide better generalization in the 3D shape space compared to the reviewed baseline approaches.

FIGURE 8. Qualitative results for the alternative variants of the feature fusion network and the proposed DC-DFFN, using the experimental setup described in Section IV-E on the D-Faust dataset. The computed Chamfer distance is reported for each model; ↓ means lower is better.

TABLE 6. Results on D-Faust shape space learning including samples with artifacts. The Chamfer distances are presented as percentiles (5th, 50th, and 95th) and mean scores, multiplied by 10³. ↓: lower value is better; ↑: higher value is better.

TABLE 7. Generalization to D-Faust unseen humans including samples with artifacts. The Chamfer distances are presented as percentiles (5th, 50th, and 95th) and mean scores, multiplied by 10³. ↓: lower value is better; ↑: higher value is better.

In the test data split of D-Faust, we decided to remove the samples that include scanning artifacts (see Section IV-A, Appendix B, and Fig. 9) to improve the interpretability of the results. This decision can be questioned, however, since keeping the scanning artifacts would also evaluate the architectures' capability of rejecting outliers. For this reason, Table 6 and Table 7 in Appendix B show the D-Faust results with artifacts included, and confirm that the proposed DC-DFFN architecture in general outperforms the baseline architectures in the presence of artifacts as well. In individual cases (see Fig. 9), the baseline SALD approach is sometimes better at removing outliers.
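The Chamfer distance summaries and the volumetric IOU referred to in the tables can be illustrated with a small, dependency-free Python sketch. It is not the authors' evaluation code: all function names are hypothetical, the one-sided Chamfer distance is assumed to be the mean nearest-neighbour Euclidean distance from a source point set to a target point set, the percentiles are assumed to be taken over per-shape distances, and IOU is assumed to be computed on sets of occupied voxel indices.

```python
import math


def one_sided_chamfer(src, dst):
    """Mean distance from each point in src to its nearest neighbour in dst.

    Assumption: non-squared Euclidean distance, brute-force search.
    """
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(per_shape_distances, scale=1e3):
    """Summarize per-shape distances as 5th/50th/95th percentiles and mean.

    The default scale of 10^3 mirrors the scaling stated in the table captions.
    """
    scaled = [d * scale for d in per_shape_distances]
    return {
        "p5": percentile(scaled, 5),
        "p50": percentile(scaled, 50),
        "p95": percentile(scaled, 95),
        "mean": sum(scaled) / len(scaled),
    }


def voxel_iou(occ_a, occ_b):
    """Intersection over union of two collections of occupied voxel indices."""
    a, b = set(occ_a), set(occ_b)
    union = a | b
    # Two empty volumes are treated as identical (an assumption of this sketch).
    return len(a & b) / len(union) if union else 1.0
```

For a hypothetical list `pairs` of (scan, reconstruction) point sets, `summarize([one_sided_chamfer(s, r) for s, r in pairs])` would yield the values of one table row. The brute-force nearest-neighbour search is quadratic in the number of points, so a spatial index such as a k-d tree would be preferable for point sets of realistic size.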
In all experimental setups for both the D-Faust and ShapeNet datasets, the proposed DC-DFFN significantly outperformed the baseline approaches in almost all cases. In a few cases, the baseline approaches provided better Chamfer distance results by a small margin. Considering volumetric IOU, in contrast, DC-DFFN outperformed the baseline approaches in all experimental setups on both datasets. Although DC-DFFN performs better than the baseline approaches, the proposed architecture still struggles to some extent with reconstructing thin structures.

FIGURE 9. Visual reconstruction result comparison for a sample with scanning artifacts. ↓: lower value is better.

FIGURE 10. Training loss curves of the SALD and DC-DFFN models on the ShapeNet dataset (class: lamp). On this dataset, we compared our approach only with SALD. The learning curves show that DC-DFFN learns faster than the baseline SALD model; however, neither architecture has yet reached a plateau at 1500 epochs. The required number of training epochs varies between models depending on the complexity of the data. However, DC-DFFN converges faster than all the baseline approaches presented in this study.

VI. CONCLUSION
In this paper, we have proposed the densely connected deep feature fusion network architecture DC-DFFN for neural implicit shape learning and reconstruction from raw input data. In the experimental section, the proposed work is shown to learn faster, generalize better, and outperform the baseline works quantitatively by a clear margin in almost all experiments. Additionally, the visual results show that the proposed architecture captures small details better than the previous works.
As the broader impact of our work, we see that DC-DFFN has the potential to serve as the prevailing neural architecture for upcoming studies on shape learning from raw 3D data.

APPENDIX A
ARCHITECTURE ALTERNATIVES
In the process of designing DC-DFFN, two alternative implementation variants were developed alongside it, and the proposed architecture was identified as the best one. The alternative variants were: (I) a densely connected feature fusion network with multi-layered latent codes, DC-DFFN-MLLCs, and (II) dense-layer feature fusion with a dense neural network, DC-DFFN-Lin. Results for these architecture variants are shown for one experimental setting in Table 5.

A. DC-DFFN-MLLCs
In the DC-DFFN-MLLCs architecture variant, one latent code is extracted after every Conv1D-MaxPool-DeepSet-ReLU block, and the codes are finally concatenated along the channel dimension. In this case, the final latent code shape is (B, 1024, N), where N = 128² and B is the batch size. The assumption was that multiple multi-layered latent codes would contain more information than a single latent code and thus provide better reconstruction quality. In practice, however, this architecture variant performed worse than DC-DFFN.

B. DC-DFFN-Lin
In this variant, the 1D convolutional layers (kernel: 1 × 1) of the encoder and decoder were replaced by fully connected layers, keeping the rest of the network settings similar to DC-DFFN-MLLCs. DC-DFFN-Lin performed comparatively worse than the multiple latent codes-based architecture, DC-DFFN-MLLCs, which is constructed with 1D convolutional layers. DC-DFFN-MLLCs, in turn, performed significantly worse than the proposed DC-DFFN.

APPENDIX B
RECONSTRUCTION OF SAMPLES WITH ARTIFACTS
In the main paper experiments, the D-Faust test samples that contain scanning artifacts were removed from the experiments shown in Section IV-D and Section IV-E. Here, in Table 6, Table 7 and Fig.
9 we present, as additional results, the D-Faust results without any removed samples, making the test sets identical to the ones used in [1] and [8].

FIGURE 11. Additional qualitative results for the ShapeNet dataset (class: lamp). The computed Chamfer distance is reported for both models; ↓ means lower is better.

APPENDIX C
ADDITIONAL RESULTS
Additional qualitative results from the ShapeNet dataset (class: lamp) are shown in Fig. 11. Moreover, the training loss curves on the lamp class are shown in Fig. 10 for SALD and the proposed DC-DFFN model. The learning curves show that the proposed architecture learns faster than the baseline SALD architecture.

REFERENCES
[1] M. Atzmon and Y. Lipman, “SALD: Sign agnostic learning with derivatives,” in Proc. 9th Int. Conf. Learn. Represent. (ICLR). Austria: OpenReview.net, May 2021. [Online]. Available: https://openreview.net/forum?id=7EDgLu9reQD
[2] J. Chibane and G. Pons-Moll, “Implicit feature networks for texture completion from partial 3D data,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 717–725.
[3] J. Chibane, M. A. Mir, and G. Pons-Moll, “Neural unsigned distance fields for implicit function learning,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, Jan. 2020, pp. 21638–21652. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/f69e505b08403ad2298b9f262659929a-Paper.pdf
[4] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger, “Occupancy flow: 4D reconstruction by learning particle dynamics,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5379–5389.
[5] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy networks: Learning 3D reconstruction in function space,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4455–4465.
[6] V.
Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein, “Implicit neural representations with periodic activation functions,” 2020, arXiv:2006.09661.
[7] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing scenes as neural radiance fields for view synthesis,” 2020, arXiv:2003.08934.
[8] M. Atzmon and Y. Lipman, “SAL: Sign agnostic learning of shapes from raw data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 2565–2574.
[9] S. Peng, M. Niemeyer, L. Mescheder, M. Pollefeys, and A. Geiger, “Convolutional occupancy networks,” in Proc. 16th Eur. Conf. Comput. Vis. (ECCV). Glasgow, U.K.: Springer, Aug. 2020, pp. 523–540.
[10] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “DeepSDF: Learning continuous signed distance functions for shape representation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 165–174.
[11] Y. Li and J. Barbič, “Immersion of self-intersecting solids and surfaces,” ACM Trans. Graph., vol. 37, no. 4, pp. 1–14, Aug. 2018.
[12] A. Basher, M. Sarmad, and J. Boutellier, “LightSAL: Lightweight sign agnostic learning for implicit surface representation,” 2021, arXiv:2103.14273.
[13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[15] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst. (NIPS), Montreal, QC, Canada, vol. 2. Cambridge, MA, USA: MIT Press, 2015, pp. 2377–2385.
[16] M. Kazhdan and H. Hoppe, “Screened Poisson surface reconstruction,” ACM Trans. Graph., vol. 32, no. 3, pp.
1–13, Jun. 2013.
[17] D. Levin, “Mesh-independent surface interpolation,” in Geometric Modeling for Scientific Visualization. Berlin, Germany: Springer, 2004, pp. 37–49.
[18] F. Bernardini, J. Mittleman, H. Rushmeier, C. Silva, and G. Taubin, “The ball-pivoting algorithm for surface reconstruction,” IEEE Trans. Vis. Comput. Graphics, vol. 5, no. 4, pp. 349–359, Oct. 1999.
[19] J. C. Carr, R. K. Beatson, J. B. Cherrie, T. J. Mitchell, W. R. Fright, B. C. McCallum, and T. R. Evans, “Reconstruction and representation of 3D objects with radial basis functions,” in Proc. 28th Annu. Conf. Comput. Graph. Interact. Techn., Aug. 2001, pp. 67–76.
[20] M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson surface reconstruction,” Jun. 2006, pp. 61–70, doi: 10.2312/SGP/SGP06/061-070.
[21] W. Zhao, J. Lei, Y. Wen, J. Zhang, and K. Jia, “Sign-agnostic implicit learning of surface self-similarities for shape modeling and reconstruction from raw point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 10256–10265.
[22] A. Gilbert, M. Volino, J. Collomosse, and A. Hilton, “Volumetric performance capture from minimal camera viewpoints,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 566–581.
[23] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid, “BodyNet: Volumetric inference of 3D human body shapes,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 20–36.
[24] Z. Zheng, T. Yu, Y. Wei, Q. Dai, and Y. Liu, “DeepHuman: 3D human reconstruction from a single image,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7739–7749.
[25] W. E. Lorensen and H. E. Cline, “Marching cubes: A high resolution 3D surface construction algorithm,” ACM SIGGRAPH Comput. Graph., vol. 21, no. 4, pp. 163–169, Aug. 1987.
[26] Y. Liao, S.
Donné, and A. Geiger, “Deep marching cubes: Learning explicit surface representations,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2916–2925.
[27] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1912–1920.
[28] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, “Multi-view supervision for single-view reconstruction via differentiable ray consistency,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2626–2634.
[29] G. Gkioxari, J. Johnson, and J. Malik, “Mesh R-CNN,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9785–9795.
[30] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “AtlasNet: A papier-mâché approach to learning 3D surface generation,” 2018, arXiv:1802.05384.
[31] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond Euclidean data,” IEEE Signal Process. Mag., vol. 34, no. 4, pp. 18–42, Jul. 2017.
[32] P. Wang, Y. Gan, P. Shui, F. Yu, Y. Zhang, S. Chen, and Z. Sun, “3D shape segmentation via shape fully convolutional networks,” Comput. Graph., vol. 70, pp. 128–139, Feb. 2018.
[33] K. Guo, D. Zou, and X. Chen, “3D mesh labeling via deep convolutional neural networks,” ACM Trans. Graph., vol. 35, no. 1, pp. 1–12, Dec. 2015.
[34] R. Venkatesh, S. Sharma, A. Ghosh, L. Jeni, and M. Singh, “DUDE: Deep unsigned distance embeddings for hi-fidelity representation of complex 3D surfaces,” 2020, arXiv:2011.02570.
[35] M. Atzmon, N. Haim, L. Yariv, O. Israelov, H. Maron, and Y. Lipman, “Controlling neural level sets,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 2032–2041.
[36] A. Gropp, L. Yariv, N. Haim, M. Atzmon, and Y. Lipman, “Implicit geometric regularization for learning shapes,” in Proc. Mach. Learn. Syst., 2020, pp. 3569–3579.
[37] A.
Pandey and D. Wang, “Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 6629–6633.
[38] H. Zhang and V. M. Patel, “Densely connected pyramid dehazing network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3194–3203.
[39] Y. Li, G. Han, and X. Liu, “DCNet: Densely connected deep convolutional encoder–decoder network for nasopharyngeal carcinoma segmentation,” Sensors, vol. 21, no. 23, p. 7877, Nov. 2021.
[40] H. Zhou, Z. Fang, Y. Gao, B. Huang, C. Zhong, and R. Shang, “Feature fusion network based on attention mechanism for 3D semantic segmentation of point clouds,” Pattern Recognit. Lett., vol. 133, pp. 327–333, May 2020.
[41] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5099–5108.
[42] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, “Point transformer,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 16259–16268.
[43] X. Wu, Y. Lao, L. Jiang, X. Liu, and H. Zhao, “Point transformer v2: Grouped vector attention and partition-based pooling,” 2022, arXiv:2210.05666.
[44] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. 18th Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI). Munich, Germany: Springer, Oct. 2015, pp. 234–241.
[45] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 652–660.
[46] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola, “Deep sets,” 2017, arXiv:1703.06114.
[47] V. Nair and G. E.
Hinton, “Rectified linear units improve restricted Boltzmann machines,” Jun. 2010, pp. 807–814. [Online]. Available: https://icml.cc/Conferences/2010/papers/432.pdf
[48] A. Fabri and S. Pion, “CGAL: The computational geometry algorithms library,” in Proc. 17th ACM SIGSPATIAL Int. Conf. Adv. Geograph. Inf. Syst., 2009, pp. 538–539.
[49] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980.
[50] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, “Automatic differentiation in machine learning: A survey,” J. Mach. Learn. Res., vol. 18, pp. 1–43, Apr. 2018.
[51] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black, “Dynamic FAUST: Registering human bodies in motion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5573–5582.
[52] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An information-rich 3D model repository,” 2015, arXiv:1512.03012.

ABOL BASHER received the B.Sc. degree in electrical and electronics engineering from the Mymensingh Engineering College, University of Dhaka, Bangladesh, in 2015, and the M.E. degree in computer engineering from Chosun University, Gwangju, South Korea, in 2020. He is currently pursuing the Ph.D. degree in computer science with the University of Vaasa, Finland. He is a Project Researcher with the Digital Economy Research Platform, University of Vaasa. His research interests include 3D data representation, computer vision, machine learning, and medical image processing.

JANI BOUTELLIER received the Ph.D. degree in 2009. He is currently an Associate Professor with the University of Vaasa, Finland. He is leading projects that concentrate on efficient neural networks and 3D computer vision. He has coauthored more than 70 peer-reviewed articles. His research interests include parallel computing, model-based design, and signal processing.
He is a member of the IEEE Signal Processing Society ASPS Technical Committee. He is an Associate Editor of the Journal of Signal Processing Systems (Springer).