Don Joseph Rajindu Goonewardena 

Leveraging RGB‑D Data and SAM Segmentation for 
Object Segmentation in Industrial Bin Picking 

 
Master’s thesis 

 
Vaasa 2025 

School of Technology and Innovations 
Sustainable and Autonomous Systems
Master of Science in Technology 



 
UNIVERSITY OF VAASA 
School of Technology and Innovations 

Author: Don Joseph Rajindu Goonewardena 
Title of the thesis: Leveraging RGB‑D Data and SAM Segmentation for Object Segmentation in Industrial Bin Picking
Degree: Master of Sciences 
Degree Programme: Computing Science 
Supervisor: Jani Boutellier  
Year: 2025 Pages: 79 

 
ABSTRACT: 
 
Robotic bin picking requires a robotic arm to identify and extract objects from a container for placement on a production line. The task can be broken down into two main aspects: object selection, and path planning and picking of the object. This thesis focuses on the former, that is, selecting an object that can be reliably identified and picked in the least amount of time. The challenges include cluttered bins, occlusions, and an ever-changing selection of objects that the segmentation model must identify with minimal training time for novel objects. Traditional segmentation models struggle in these environments, especially when they have not been trained on the specific objects being picked. This thesis investigates how the Segment Anything Model (SAM), a modern zero-shot segmentation model, can be adapted to industrial bin picking in the simplest effective way. The goal is to refine SAM’s mask selection process so that a suitable candidate object can be selected from a bin with the least amount of training or tuning of the segmentation pipeline. A modular scoring pipeline was designed and integrated with SAM. The pipeline was tested on both a synthetic dataset and real RGB-D images captured by a depth camera. Initial tests showed that simply selecting the mask with the least depth performed well but struggled to select masks that covered the whole object and objects that were away from the border of the image. Including all terms of the scoring pipeline markedly improved the quality of the selected object by avoiding issues such as partial masks, selection of objects too close to the bin walls, and selection of the bin wall as an object. The findings highlight both the strengths and weaknesses of SAM in an industrial context. More importantly, they show that a carefully designed scoring system can offset those weaknesses and serve as an adaptable and effective segmentation solution for robotic bin picking, offering a practical solution without retraining or complex hardware.
 
 
KEYWORDS: RGB-D segmentation; Segment Anything Model (SAM); bin picking; depth sensing; industrial robotics; mask scoring.

 

 
Contents 

 
1 Introduction 9 

2 Background and Literature Review 12 

2.1 Evolution of Automation in Manufacturing 12 

2.2 Role of Robotics in Modern Industry 12 

2.3 Bin Picking in Industrial Contexts 13 

2.3.1 Definition and Scope of Bin Picking 13 

2.4 Depth Data in Machine Vision 14 

2.4.1 Types of Depth Sensors 14 

2.4.2 Advantages of Depth Information in Object Recognition 15 

2.5 Segmentation in Industrial Robotics 15 

2.5.1 Principles of Semantic Segmentation 15 

2.5.2 Segmentation in Cluttered Scenes 17 

2.6 Segment Anything Model (SAM) 18 

2.6.1 Overview of SAM Architecture 18 

2.6.2 Integration of SAM with Depth Information 19 

2.6.3 Comparative Analysis of Segmentation Approaches 20 

2.6.4 Depth Data Utilization in Bin Picking 20 

2.6.5 Fusion of RGB and Depth Data 21 

2.6.6 Object Detection and Pose Estimation 22 

2.6.7 Handling Occlusions in Cluttered Bins 23 

2.6.8 Integration with Depth-based Bin Picking 24 

2.6.9 Workflow for Object Segmentation and Grasp Planning 26 

2.7 Classic computer vision for segmentation and proposal 27 

2.7.1 Bottom-Up Segmentation Approaches 27 

2.7.2 Region Proposal Generation: Selective Search 27 

2.7.3 Sliding Window Detection 28 

2.7.4 Voxel-Space Processing 28 

2.8 Supervised Learning Methods 29 



 
2.8.1 Overview 29 

2.8.2 Model Training and Validation 31 

2.9 Reinforcement Learning 31 

3 Methodology 33 

3.1 System Setup 33 

3.2 Image acquisition and loading of images 35 

3.2.1 RealSense camera image acquisition and dataset creation. 35 

3.2.2 Siléane Loader 36 

3.3 Segmentation Pipeline 37 

3.3.1 Evaluation Protocol 42 

4 Experiments 44 

4.1 Siléane Dataset Tests 44 

4.1.1 Results with standalone SAM. 44 

4.1.2 Results with SAM plus the custom scoring function and top most mask. 45 

4.2 Ablation Study 49 

4.2.1 Results of the performance considering only the depth results. 49 

4.2.2 Results of the performance considering the depth and depth variance results 51

4.2.3 Results of the performance considering the depth, depth variance and area results. 52

4.2.4 Results of the performance considering the depth, depth variance, area and shape results. 54

4.2.5 Results of the performance considering the depth, depth variance, area, shape and border closeness results. 54

4.2.6 Final scoring system and an Intuitive look in terms of bin picking. 57 

4.3 Real-World Demo 60 

5 Discussion 63 

5.1 Summary of findings. 63 

5.2 Interpretation of the Results 64 



 
6 Conclusion 67 

7 Future work 68 

Depth completion and sensor robustness 68 

Adaptive weighing of scoring terms 68 

Object separation factor 68 

Efficiency and real time deployment 68 

References 70 

 

 
Figures 
 
 
Figure 1 Typical industrial bin picking scenario (Imagen 2., 2025) 9

Figure 2 Real dataset capture setup including a depth and RGB Intel RealSense D435i camera and a laptop with an Intel i5-7300HQ at 2.5 GHz, 16 GB of RAM and an NVIDIA 1050Ti GPU. 34

Figure 3 RealSense pipeline setup for RGB and depth streams which are combined using the object rs.align. 35

Figure 4 Loading of the Siléane dataset with the created dictionary. 36

Figure 5 Depth variance normalization. 38

Figure 6 Penalty score against the deviation from an ideal range between 5 and 2%. 39

Figure 7 Shape/compactness score using the Polsby-Popper compactness score. 40

Figure 8 Border penalty calculation. 41

Figure 9 Structure of the scoring equation. 42

Figure 10 Top mask vs the best mask, where the green brick is the ground truth mask and the bright red highlight is the generated mask that was selected after SAM’s mask generation. 47

Figure 11 Filtering of the generated masks using SAM’s confidence and stability scores. 48

Figure 12 Ablation configurations for the Siléane bricks dataset. 49

Figure 13 Depth only best mask vs topmost mask. 50

Figure 14 Depth + variance best masks selection ranked for one image. 52

Figure 15 Depth + variance + area best masks selection ranked for one image. 53

Figure 16 Segmentation without (left) and with (right) border penalty, giving the mask which is away from the border for easier grasping by a robotic arm. 55

Figure 17 Final weights for the scoring pipeline 57

Figure 18 Brick 32/36 best vs topmost mask selection comparison. 57

Figure 19 Brick 40 and 42 best mask segmentation. 58

Figure 20 Brick 43 and 52 best vs topmost mask, showing that the best mask consistently tries to strike a balance between lowest depth and masks away from the border, while the topmost mask tends to select any mask with the least depth. 59

Figure 21 Best and topmost mask overlays; the object masks in blue are the segmentation masks selected by the scoring system and the masks in green are the top masks. 60

Figure 22 Improper segmentation for the top mask (in green) 61

Figure 23 Missing depth value filling solution using the Navier-Stokes algorithm provided by the OpenCV library. 62

Figure 24 Grayscale depth images of images 18 and 19 of the real-world dataset collected using the RealSense RGB-D camera, where darker colors are closer to the camera, except pure black areas, which indicate 0 mm depth or no valid depth reading. 62

 

 
Tables 
 
Table 1 SAM segmentation across the Siléane bricks dataset 44

Table 2 Segmentation across image brick_0_008 using SAM for all objects in the image. 45

Table 3 Initial scoring pipeline results with SamPredictor across 30 images of the Siléane bricks dataset. 46

Table 4 Subset of the topmost mask’s performances for 6 separate images in the dataset. 46

Table 5 Performance of the masks after the use of SAM’s confidence scores and NMS on the 30 Siléane bricks dataset images. 48

Table 6 Best mask performance with depth alone. 49

Table 7 Best mask performance with depth and depth variance. 51

Table 8 Best mask performance with depth, variance and area. 52

Table 9 Best mask performance with depth, variance, area and shape. 54

Table 10 Performance with the full best mask pipeline 54

Table 11 Segmentation performance across all ablation settings in the best mask scoring pipeline and topmost masks 56

 

 
1 Introduction 

 
Figure 1 Typical industrial bin picking scenario (Imagen 2., 2025) 

 
In industrial bin picking one rarely gets an organized bin to pick from. The parts overlap and reflect light, which makes it difficult for texture-based segmentation methods alone to identify the most suitable object to pick. The dimensions and shape of the objects to be picked can also change from time to time, leaving minimal time to retrain on the new objects introduced to the production line. The Segment Anything Model (SAM) is almost ideal for the latter challenge, as it produces masks on novel objects without retraining, using prompts such as points, boxes or masks (Kirillov et al., 2023), or even without any prompts. By combining an image encoder, a prompt encoder and a mask decoder, SAM produces accurate masks with low latency even when the given prompts are not ideal (Zhao et al., 2023).

Even with its advantages as a zero-shot segmentation model, SAM can still struggle with accurate segmentation of occluded objects, metallic or glossy parts and textureless objects. Selecting the best object to pick from the bin requires several aspects of the object to be considered. Depth data can be brought in to give SAM a different perspective, creating an almost 3D view of the space that uncovers some occluded objects (Yu et al., 2024). However, depth alone is not fully reliable, because it cannot determine what constitutes an object: smaller masks can be generated that represent only a portion of an object’s surface, and depth cannot mitigate this issue. While boundaries in depth often align with physical object edges and help distinguish objects that look similar in the RGB domain, depth alone cannot determine whether a mask fully segments the object or is only partial because of poor lighting or the object’s surface. Numerous studies have shown that depth is an important part of segmentation in bin picking, but there is a lack of studies in which the evaluation framework combines object geometry and depth with zero-shot segmentation, without retraining and by calibrating only the geometric and depth thresholds. This combination works well on bin picking datasets, and it is also practical: prebuilt sensors that combine depth and RGB data are now much more accessible and remove the need for complex calibration methods, multiple cameras and additional hardware.

The contribution of this thesis is a scoring-based mask selection framework that integrates multiple thresholds into a unified evaluation pipeline. It includes depth-aware scoring, in which higher depth values and depth variance are penalized and masks within operational ranges are preferred, and area normalization, in which the mask size penalties are tied to the ground truth masks to avoid masks that are too large or too small. Shape compactness, calculated through the Polsby-Popper compactness score, ensures that irregular masks are also penalized, and border penalties prevent objects that are too close to the image border from being picked in the initial runs of the bin picking. The study also includes detailed ablation experiments, CSV logging and overlay visualization so that the methods developed can be inspected transparently.
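As an illustration of the shape term mentioned above, the following is a minimal sketch of the Polsby-Popper compactness score, 4πA/P², computed on a small binary mask. The function name and the pixel-edge perimeter estimate are assumptions made for this example and are not taken from the implementation described later in the thesis.

```python
import math
from typing import Sequence


def polsby_popper(mask: Sequence[Sequence[int]]) -> float:
    """Polsby-Popper compactness 4*pi*A / P**2 of a binary mask.

    Area is the number of foreground pixels; perimeter is approximated by the
    number of pixel edges that face background or the image boundary.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    area = 0
    perimeter = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            area += 1
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                outside = ny < 0 or ny >= height or nx < 0 or nx >= width
                if outside or not mask[ny][nx]:
                    perimeter += 1
    return 4.0 * math.pi * area / perimeter ** 2 if perimeter else 0.0


# A compact 3x3 blob and a thin 1x9 strip with the same area (illustrative data).
square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
strip = [[1] * 9]

print(round(polsby_popper(square), 3))  # 0.785
print(round(polsby_popper(strip), 3))   # 0.283
```

Both masks cover nine pixels, but the thin strip scores far lower, which is the behaviour a compactness penalty relies on to down-rank irregular or fragmentary masks.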

The scope of the thesis consists of designing and implementing a multi-term scoring function for evaluating masks produced by zero-shot segmentation pipelines such as SAM on RGB-D datasets, with the aim of improving segmentation accuracy for bin picking using depth data while keeping the pipeline and hardware as accessible as possible; conducting ablation studies to quantify the impact of each component of the scoring system; and benchmarking against the ground truth segmentation on the Siléane bricks dataset with metrics such as IoU, precision, recall and F1. A qualitative analysis of real-world data captured with commercially available and accessible hardware is also included. Finally, visual overlays and CSV summaries of the studies conducted are attached. While the scope is limited to static dataset evaluation, it lays the foundation for developing robust models for real-time segmentation in robotic bin picking pipelines. To summarize, this thesis proposes an adjustable scoring-based selection process that improves mask selection for SAM-based segmentation in industrial bin picking without the need for complex hardware setups or retraining.
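The benchmarking metrics named above can be sketched as follows for a pair of binary masks. This is an illustrative helper written for this section under the usual pixel-wise definitions; the function name and the toy masks are assumptions, not the evaluation code used in the experiments.

```python
from typing import Dict, Sequence


def mask_metrics(pred: Sequence[Sequence[int]],
                 truth: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Pixel-wise IoU, precision, recall and F1 between two binary masks of equal size."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # predicted and actually part of the object
            elif p and not t:
                fp += 1          # predicted but not part of the object
            elif t and not p:
                fn += 1          # part of the object but missed
    union = tp + fp + fn
    iou = tp / union if union else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


# Toy example: the predicted mask is shifted one column to the left of the truth.
predicted = [[1, 1, 0],
             [1, 1, 0]]
ground_truth = [[0, 1, 1],
                [0, 1, 1]]

# Here TP = FP = FN = 2, so IoU = 2/6 and precision = recall = F1 = 0.5.
print(mask_metrics(predicted, ground_truth))
```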

To fully understand the challenges addressed in this thesis, it is important to grasp the whole process of industrial bin picking: perception, which is the area this thesis is based on, and planning and manipulation, which depend on the robotic systems used to carry out the picking process. Therefore, the following literature review gives an overview of industrial bin picking as a whole and then narrows down to the perception and segmentation methods most relevant to this study.

 

 
2 Background and Literature Review  

 
2.1 Evolution of Automation in Manufacturing 

 
Industrial automation has shifted from rigid, specialized setups to flexible, software-driven systems. Tools like ROS-Industrial make it easier to connect hardware from different vendors (Janis Arents et al., 2018), allowing robot models and controllers to be swapped with minimal effort. This flexibility, combined with GPU-powered deep learning, has enabled real-time perception to be part of the control loop (Janis Arents et al., 2018; Ribeiro et al., 2021).

Today, multi-modal sensing is common, and RGB-D pipelines with learned segmentation are standard for tasks like random bin picking (Franaszek et al., 2024). However, more sensors do not always help, since larger setups can be harder to deploy (Xie et al., 2023a). New sensing methods, like vision-based tactile devices, can support RGB-D when visual cues are not enough to judge grasp stability. At the same time, manipulation strategies are advancing. Robots no longer rely only on open-loop picks; reinforcement learning and hybrid methods can plan actions like pre-grasp pushes to improve success (Zeng et al., 2018). Overall, perception and control strategies have evolved together, making flexible, depth-aware segmentation built on SAM a strong fit for these systems.

 
2.2 Role of Robotics in Modern Industry 

 
In modern industrial robotic manipulation, robots need to cope with many variables such as unknown object types, deformable materials and unpredictable poses. Systems therefore fare much better if they infer from collected RGB-D data rather than relying on fixed CAD models. This is important in production environments where the product changes rapidly and the models used can quickly become outdated (Franaszek et al., 2024; Mahler et al., 2017; Zeng et al., 2018). In these instances, SAM’s zero-shot instance segmentation by prompting is particularly attractive.

Locating an object within a group of objects depends on RGB-D data and geometric information. Methods such as point pair features (PPF), for example, have combined RGB, geometric and depth information to obtain better results (Liu et al., 2021; Zhuang et al., 2021). Combining RGB and depth data allows these industrial systems to tackle common issues such as shiny surfaces and surfaces with low texture. These setups help not only in bin picking but in many industrial processes involving computer vision (Chang et al., 2023; Franaszek et al., 2024; Nishi et al., 2023; Zou et al., 2020).

 
2.3  Bin Picking in Industrial Contexts 

 
2.3.1 Definition and Scope of Bin Picking 

 
Bin picking is the process of using automated robotics to recognize and localize objects, even under occlusion, and to detect their pose despite variations in material and surface characteristics. Deep learning-based approaches have shown strong performance in these scenarios (Le & Lin, 2019). The challenge is not only visually detecting objects but also accounting for the surrounding environment and how much space is available around the object (Schwarz et al., 2018). Therefore, gripper design, reachability and other aspects of the overall selection pipeline need to be considered from the beginning.

There are usually two main methods of object selection in bin picking: 3D model matching, and segmentation and object selection through RGB-D data (Grard, n.d.; Cordeiro et al., 2023). Most pipelines start by capturing RGB-D data, mapping pixels to real-world coordinates, segmenting objects, and planning grasps that avoid collisions and respect robot limits (Le & Lin, 2019; Iriondo et al., 2021). Industrially, the goal is to minimize the time between the start of perception and the pick as much as possible.
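The step of mapping pixels to real-world coordinates can be sketched with the standard pinhole camera model. The snippet below is a minimal illustration that ignores lens distortion; the function name and the intrinsic values are hypothetical and would in practice come from the sensor's calibration.

```python
from typing import Tuple


def deproject_pixel(u: float, v: float, depth_mm: float,
                    fx: float, fy: float, cx: float, cy: float
                    ) -> Tuple[float, float, float]:
    """Map pixel (u, v) with a depth reading to a 3D point in the camera frame.

    Uses the pinhole model without distortion: X = (u - cx) * Z / fx and
    Y = (v - cy) * Z / fy, with Z equal to the measured depth.
    """
    x = (u - cx) * depth_mm / fx
    y = (v - cy) * depth_mm / fy
    return (x, y, depth_mm)


# Hypothetical intrinsics for a 640x480 depth stream (illustrative values only).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# A pixel 60 columns right of the principal point, measured at 500 mm.
point = deproject_pixel(u=380.0, v=240.0, depth_mm=500.0,
                        fx=FX, fy=FY, cx=CX, cy=CY)
print(point)  # (50.0, 0.0, 500.0)
```

The resulting point is expressed in the camera frame; an extrinsic transform is still needed to express it in the robot's coordinate frame.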

 

 
2.4 Depth Data in Machine Vision 

 
2.4.1 Types of Depth Sensors 

 
There are various types of depth sensors. Time-of-Flight (ToF) sensors use pulsed signals or continuous wave modulation and measure the time taken for the waves to travel to the target and back to determine distance (Zollhöfer, 2019; Shao et al., 2023). Pulsed ToF sensors require mechanical scanning, such as moving mirrors, to collect data point by point, while matrix-based ToF captures the depth of all points in a single exposure, which is fast and compact and therefore well suited to industrial bin picking. Continuous Wave ToF measures phase shifts in the emitted modulated light and can perform reliably at short to medium ranges, even under changing factory lighting (O’sullivan & Le Dortz, 2021; Zollhöfer, 2019).

Structured light sensors project known patterns and calculate depth based on how those patterns deform when hitting surfaces (Shaikh & Chai, 2021). Devices like Intel’s RealSense D-series can achieve millimeter-level accuracy at close range (Intel Corporation, 2019). However, structured light can struggle in multi-robot environments due to pattern interference, and its performance drops at longer distances because of weaker illumination (Zollhöfer, 2019). LiDAR excels at scanning large areas but is less suited to tight bin-picking spaces, although newer solid-state versions are starting to change this by offering compact and cost-effective solutions for industrial robotics (Shaikh & Chai, 2021). Passive stereo systems, such as ZED, work well on textured surfaces but can be thrown off by smooth, reflective materials. Combining stereo with active sensing or fusion techniques helps overcome these limitations (Keselman et al., 2017).

Each sensor type has its own noise characteristics. Structured light may lose detail on fine features and suffer from crosstalk, while ToF can be affected by multi-path reflections and ambient light (Zollhöfer, 2019). Encouragingly, newer sensors offer better factory calibration and more user-friendly SDKs, which help reduce setup and integration time. Even budget-friendly Continuous Wave ToF sensors can be effective in close-range, controlled environments, though they are less suitable for long-range tasks like forestry (McGlade et al., 2022). Pipelines developed for such sensors therefore need to take the nuances of each sensor type into consideration.
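The two ToF measurement principles described in this section reduce to short textbook formulas, sketched below. For pulsed ToF the distance is half the round-trip path of the light, and for Continuous Wave ToF it follows from the measured phase shift and the modulation frequency. The function names and the example values are illustrative assumptions, not the parameters of any specific sensor.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def pulsed_tof_distance_m(round_trip_time_s: float) -> float:
    """Distance from the round-trip travel time of a light pulse: d = c * t / 2."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


def cw_tof_distance_m(phase_shift_rad: float, modulation_freq_hz: float) -> float:
    """Distance from the phase shift of modulated light: d = c * phi / (4 * pi * f)."""
    return SPEED_OF_LIGHT_M_PER_S * phase_shift_rad / (4.0 * math.pi * modulation_freq_hz)


def cw_unambiguous_range_m(modulation_freq_hz: float) -> float:
    """Largest distance before the phase wraps past 2*pi: c / (2 * f)."""
    return SPEED_OF_LIGHT_M_PER_S / (2.0 * modulation_freq_hz)


# A 4 ns round trip corresponds to about 0.60 m.
print(pulsed_tof_distance_m(4e-9))

# At 100 MHz modulation the phase wraps at about 1.50 m.
print(cw_unambiguous_range_m(100e6))
```

The unambiguous range shows why Continuous Wave ToF suits short to medium working distances: beyond it, two different distances produce the same phase reading unless several modulation frequencies are combined.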

 
2.4.2 Advantages of Depth Information in Object Recognition 

 
Depth’s main advantage is that it captures geometry directly. Instead of depending on color or texture, it separates objects by their physical boundaries, which is vital when most of the detected items look similar (Schwarz et al., 2018). It is also less prone to errors caused by lighting changes (Le & Lin, 2019). Using depth reduces the risk of masks drifting onto other, physically separate surfaces (Cordeiro et al., 2023).

Handling occlusion is another area where depth data is useful. When part of an object is occluded, depth information can help piece the visible segments together into a recognizable shape by connecting the detected segmentations into a probable depth-consistent surface (Schwarz et al., 2018). When combined with SAM, these geometric cues can guide masks to follow actual edges more closely. Although flat or thin objects may not show much depth variation, depth can still capture small tilts and bends in the object and help plan more reliable grasp approaches (Le & Lin, 2019). Together, these properties make the segmentation pipeline less prone to error, less likely to be fooled by glossy finishes or uniform textures, and more likely to produce masks that align with the detected object.

 
2.5  Segmentation in Industrial Robotics 

 
2.5.1 Principles of Semantic Segmentation 

 
Semantic segmentation assigns every pixel in an image a label identifying the segment to which it belongs. The challenge is balancing sharp detected boundaries against segments that make sense in the real world. Traditional encoder–decoder CNNs compress and then rebuild features, but they often struggle with thin shapes or tiny parts (Shvets et al., 2024). Vision transformers improved on this by using attention to relate parts of an object that do not appear connected but are. This is especially useful in cluttered industrial scenes, where one part of an object can appear in one region of the image, be cut off by another object, and reappear at another end.

SAM works by taking in prompts instead of fixed class labels. A strong encoder first creates general features from the image, while a prompt encoder turns user inputs (like points or boxes) into signals. A decoder then combines the general features and the prompt signals to produce the final mask (Kirillov et al., 2023). Because it relies on prompts, SAM can handle new object categories it has not seen before (Kirillov et al., 2023). Training on large, diverse datasets also helps it avoid overfitting, which is a common problem when models are trained only on a small set of industrial data (Kirillov et al., 2023).

In multi-task setups, for example joint semantic and instance segmentation, the losses need to be balanced so that training does not bias towards either the semantic or the instance segmentation task. Proper weighting lets the two tasks balance each other and avoids meaningless outputs (Kirillov et al., 2023). How the input data is fused also plays a major role in increasing the accuracy and practicality of segmentation. Combining RGB and depth at early or middle layers counteracts the weaknesses unique to each modality, producing masks that reflect both the appearance and the geometry of the objects (Cordeiro et al., 2023; Zhuang et al., 2021). Training on artificial data that is also augmented in unlikely ways makes the model more robust to unexpected real-world scenarios by focusing it on the essential structures of objects (Eghbal-zadeh et al., 2024; Hernández-García & König, 2020; Rebuffi et al., 2021). Post-processing, although not as important as the model design, still matters, especially for cleaning up jagged or noisy boundaries and for small objects, which can be merged into the background if not handled.

 

 
2.5.2  Segmentation in Cluttered Scenes 

 
Clutter makes segmentation harder by adding occlusions, unclear edges, sensor noise, and objects that look alike. Clustering based on pixel features tends to fail in cluttered environments, especially those with identical objects. When objects are occluded, overlapping objects may be merged into a single segment because their pixels have similar features. Pipelines that rely less on appearance features and more on geometric cues and boundary-aware models usually do better in these scenarios (Xu et al., 2022). While merging RGB and depth data is helpful here, it is crucial to align the two data streams accurately so that borders do not shift onto other objects (Liu et al., 2021).

Another way to reduce occlusions, or to expose more of the hidden surfaces in an occluded environment, is to slightly move the pile so that a second scan of the scene contains more information about the occluded areas, resulting in better segmentation (Zeng et al., 2018). Starting with segmentation before checking gripper constraints also helps and is generally recommended, since it avoids false positives from background geometry (Cordeiro et al., 2023; Zeng et al., 2018). Another obvious strategy is to use multiple cameras at different angles to capture more information about hidden areas of the scene (Zeng et al., 2018).

In bin picking scenarios the way training data is labeled matters too. Marking graspable regions instead of whole objects directs the model towards useful pixels, reducing errors in cluttered scenes (Nishi et al., 2023). Speed is one of the most important criteria when choosing a model for an industrial bin picking process; segmentation is often the slowest step and can eat into the cycle time (Le & Lin, 2019). Efficient prompts, parallel GPU use, and depth-based filtering help keep processing fast, so that the robots spend more time manipulating than computing.

 

 
2.6 Segment Anything Model (SAM) 

 
2.6.1 Overview of SAM Architecture 

 
SAM’s design is clean and practical. A transformer-based encoder first creates a detailed feature map of the image. A prompt encoder then turns inputs like points, boxes, or masks into guidance, and a lightweight decoder combines the prompt guidance and the feature map to generate the final mask in a matter of a few milliseconds once the features are ready (Kirillov et al., 2023; Shvets et al., 2024). SAM reduces computationally heavy work such as feature creation by reusing the same image features across many prompts, which is especially helpful when a bin is full of similar objects. SAM also gives each mask a confidence score in the form of a predicted IoU, which downstream systems can use to filter results (Kirillov et al., 2023).
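A downstream filter on SAM's confidence outputs can be sketched as below. The keys `predicted_iou` and `stability_score` mirror the fields of the mask records returned by `SamAutomaticMaskGenerator.generate` in the `segment_anything` package; the `id` field, the sample values and the function name are hypothetical, and the thresholds are illustrative rather than the ones tuned in this thesis.

```python
from typing import Any, Dict, List


def filter_masks(mask_records: List[Dict[str, Any]],
                 min_predicted_iou: float = 0.88,
                 min_stability: float = 0.95) -> List[Dict[str, Any]]:
    """Keep masks whose confidence scores pass both thresholds, best first."""
    kept = [
        record for record in mask_records
        if record["predicted_iou"] >= min_predicted_iou
        and record["stability_score"] >= min_stability
    ]
    return sorted(kept, key=lambda record: record["predicted_iou"], reverse=True)


# Hypothetical records; real ones would also carry a "segmentation" array.
candidates = [
    {"id": "a", "predicted_iou": 0.97, "stability_score": 0.98, "area": 5200},
    {"id": "b", "predicted_iou": 0.91, "stability_score": 0.90, "area": 800},
    {"id": "c", "predicted_iou": 0.80, "stability_score": 0.99, "area": 4100},
    {"id": "d", "predicted_iou": 0.93, "stability_score": 0.96, "area": 4700},
]

print([record["id"] for record in filter_masks(candidates)])  # ['a', 'd']
```

Filtering of this kind only removes low-confidence proposals; it says nothing about which remaining mask is the best object to pick, which is the gap a separate scoring stage has to fill.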

The training regime uses a mix of dice and focal losses, which ensures that the predicted masks overlap well with the targets while also emphasizing hard-to-classify pixels. It also includes simulated multi-round prompting, which improves the model’s responsiveness to iterative changes between perception cycles, where the scene usually changes from one cycle to the next (Kirillov et al., 2023). Its SA-1B pretraining dataset is massive, with millions of images and over a billion masks, supporting cross-domain generalization (Kirillov et al., 2023). While SAM trades specialization for variety, it can be paired with domain-specific modules to specialize in certain tasks. For applications focused on low latency there is also a faster alternative called FastSAM, which uses a two-step process: detect all objects, then use the prompts to filter the ones that are needed (Zhao et al., 2023). In practice, SAM performs best when combined well with other models, for example where SAM produces the general masks and the other models refine them further.
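The two loss terms mentioned above can be illustrated on a flat list of per-pixel probabilities. This is a simplified sketch: the focal loss omits the optional class-balancing factor, and the default 20:1 weighting of focal to dice loss is an assumption that should be checked against Kirillov et al. (2023). It is not SAM's training code.

```python
import math
from typing import Sequence


def dice_loss(probs: Sequence[float], targets: Sequence[int],
              eps: float = 1e-6) -> float:
    """1 - Dice coefficient between predicted probabilities and binary targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def focal_loss(probs: Sequence[float], targets: Sequence[int],
               gamma: float = 2.0, eps: float = 1e-12) -> float:
    """Mean focal loss: cross-entropy scaled by (1 - p_t)**gamma so easy pixels count less."""
    losses = []
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p  # probability assigned to the true class
        losses.append(-((1.0 - p_t) ** gamma) * math.log(max(p_t, eps)))
    return sum(losses) / len(losses)


def combined_loss(probs: Sequence[float], targets: Sequence[int],
                  focal_weight: float = 20.0, dice_weight: float = 1.0) -> float:
    """Weighted sum of the two terms; the default weights are assumptions."""
    return (focal_weight * focal_loss(probs, targets)
            + dice_weight * dice_loss(probs, targets))


targets = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.1, 0.2]   # mostly correct prediction
uncertain = [0.6, 0.4, 0.5, 0.6]   # hard, ambiguous pixels

print(combined_loss(confident, targets) < combined_loss(uncertain, targets))  # True
```

The dice term rewards overall overlap, while the focal term concentrates the gradient on pixels the model still gets wrong, which matches the role the text assigns to each.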

 

 
2.6.2 Integration of SAM with Depth Information 

 
Depth can be added to a SAM pipeline in several ways. At a basic level, depth-based prompts can be used to nudge SAM towards areas that correspond to object surfaces and not just textures in the RGB image. Such prompts can be sparse points picked from a 3D point cluster in a depth map, or dense masks, that is, regions created by thresholding depth discontinuities (Danielczuk et al., 2019). A deeper approach is to fuse RGB and depth inside the encoder. In early fusion, RGB and depth are combined before being processed, so the model can learn very fine alignment between color edges and depth edges, thereby reinforcing boundaries. In late fusion, each modality is encoded separately and then combined, so each has its own encoder that deals with noise specific to that modality, for example glare in RGB and gaps in the depth map (Cordeiro et al., 2023; Zhuang et al., 2021).
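One simple way to derive sparse depth-based point prompts, as described above, is to pick a few well-separated pixels with the smallest valid depth. The sketch below is an illustrative assumption about how such prompts could be generated, not the method of any cited work; the function name, the separation rule and the toy depth map are hypothetical. Points are returned in (x, y) order, which is the order SAM's point prompts use.

```python
from typing import List, Sequence, Tuple


def nearest_depth_points(depth_mm: Sequence[Sequence[int]], k: int = 3,
                         min_separation_px: int = 2) -> List[Tuple[int, int]]:
    """Return up to k (x, y) pixels with the smallest valid depth, spaced apart.

    A depth of 0 is treated as "no reading" and skipped.
    """
    valid = [
        (depth, x, y)
        for y, row in enumerate(depth_mm)
        for x, depth in enumerate(row)
        if depth > 0
    ]
    valid.sort()  # nearest surface first
    chosen: List[Tuple[int, int]] = []
    for _, x, y in valid:
        if len(chosen) >= k:
            break
        # Accept a pixel only if it is far enough from every point chosen so far.
        far_enough = all(
            abs(x - cx) >= min_separation_px or abs(y - cy) >= min_separation_px
            for cx, cy in chosen
        )
        if far_enough:
            chosen.append((x, y))
    return chosen


# Toy depth map in millimetres: two raised objects on a 900 mm bin floor.
depth = [
    [0,   900, 900, 900, 900],
    [900, 610, 600, 900, 900],
    [900, 620, 615, 900, 700],
    [900, 900, 900, 900, 705],
]

print(nearest_depth_points(depth, k=2))  # [(2, 1), (4, 2)]
```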

More advanced methods bring depth into the attention layers, so that similar objects can be distinguished by color, texture and depth. This helps the model separate objects that look alike in color but lie at different depths (Yi et al., 2019). Training with a mix of synthetic datasets such as WISDOM (Danielczuk et al., 2019) and real-world scenes further improves stability: the synthetic data exposes the model to almost unlimited variation, while the real data teaches it to handle the imperfections of real-world sensors (Danielczuk et al., 2019). Whether through prompts, fusion, or attention, the aim is the same: to make sure masks follow the true 3D shape of objects, not just their RGB appearance.

 

 
2.6.3 Comparative Analysis of Segmentation Approaches 

 
Mask R-CNN (He et al., 2017) remains a classic baseline for instance segmentation, particularly when large, labeled datasets are available for the target classes (Xu et al., 2020). However, pixel-level annotation can be costly, and the model is not robust enough to segment objects it was not trained on. SAM, by contrast, segments from prompts without training for every novel object, making it more adaptable to industrial bin picking, especially with smart automated prompting. RGB-only prompts can blend overlapping parts in clutter, so aligning SAM with depth data, whether through prompt generation or fused encoders, improves object segmentation (Shvets et al., 2024; Zhuang et al., 2021).

Semantic PPF (Point Pair Features) pipelines are geometric methods that use pairs of 3D points to predict object pose. When combined with semantic labels, these pipelines can estimate the pose of objects that are partially visible or occluded (Drost et al., 2010; Liu et al., 2018; Zhao et al., 2024). Similar to Mask R-CNN, this pipeline excels when the object shape is known but struggles when new objects with unknown geometry are introduced. Meanwhile, sim-to-real strategies can customize models to specific parts using synthetic pretraining plus self-supervised adaptation on real data (J. Chen et al., 2025). In general, in simple clusters of similar objects, class-specific models will outperform general-purpose models in accuracy.

 
2.6.4  Depth Data Utilization in Bin Picking 

 
The quality of the depth data starts with the placement of the sensor. The camera needs
to be positioned so that it avoids steep angles, which distort structured light patterns
and weaken Time of Flight signals (RF Wireless World (no individual author listed), n.d.;
Zanuttigh et al., 2016). Placing sensors on opposite corners to obtain multiple viewing
angles reduces occlusion but also requires more complex synchronization and registration
of the cameras or sensors used. Hardware triggers and accurate timestamps are some of the
solutions for keeping the sensors synchronized (Liu et al., 2021; McGlade et al., 2022).

Factories tend to be very dynamic environments, and there is a high chance that cameras
will be moved if they are not secured properly. Therefore, mounts must be rigid and fixed
to stable surfaces to avoid drift. If portable or handheld sensors are used, they will
need frequent recalibration (McGlade et al., 2022; Zanuttigh et al., 2016). Calibration
has two parts. Intrinsic calibration corrects for lens distortion and focal length so that
the geometry of the external scene is represented accurately. Extrinsic calibration
defines the position and orientation of the sensors relative to each other and to the
robot’s coordinate frame (Khoshelham & Elberink, 2012; Z. Zhang, 2000). Proper calibration
is important because even small extrinsic errors can shift targets by millimeters, which
can hinder the picking process.

Finally, the optimal working range of the sensor type needs to be taken into account. For
example, structured light works best at close distances, while ToF reaches farther but is
less accurate at short distances. In bin picking, short-range accuracy is key, which makes
structured light the better choice in this case (Schellhase, 2023; Siltala & Latokartano,
2023; Thanh Ly et al., 2022). In practice, solid calibration usually matters more than
clever algorithms.

 
2.6.5 Fusion of RGB and Depth Data 

 
Normally the RGB pixel values and the depth maps come from different sensors or different
parts of the same sensor, which usually differ in resolution and field of view. Therefore,
for good fusion the two need to be normalized and registered so that each depth value is
matched with its corresponding pixel (Donné & Geiger, n.d.; Wang et al., n.d.). Early
fusion feeds raw RGB-depth edges directly into the feature extractor, such as a CNN or
transformer encoder, while mid or late fusion allows each data stream to be cleaned of its
specific noise first, for example by smoothing depth with bilateral filtering, which keeps
edges sharp (Li et al., 2024; Tomasi & Manduchi, 1998; Zhou, 2024).



 
While RGB and depth sensors housed in a single unit each have their own intrinsics, their
relative position is fixed and factory calibrated, which makes the overall calibration
process easier (G. Chen et al., 2018; Villena-Martínez et al., 2017; C. Zhang et al.,
2019). If multiple separately housed sensors are used, such as an RGB sensor, a lidar
sensor and a stereo camera, the extrinsics need to be calibrated, which is complicated.
These setups also need to be recalibrated over time because the devices shift gradually.
Once aligned, fused data improves segmentation by combining shading from the RGB data with
geometry from the depth, giving models such as SAM better prompts and cleaner masks.

 
2.6.6 Object Detection and Pose Estimation 

 
Depth maps can be turned into point clouds to be used in pose estimation. After the point
cloud has been segmented, it is aligned with the CAD model of the object using algorithms
such as ICP (iterative closest point). ICP adjusts the position and orientation of the CAD
model until it matches the observed points as closely as possible. This results in a 6
Degrees of Freedom (DoF) pose. The segmentation needs to be accurate here as well; if not,
ICP might try to align the CAD model to points that lie outside the object. Some objects
are symmetric or occluded, and in such situations geometry alone cannot give the
orientation of the object. Adding semantic cues such as “top” through labels or context
then helps resolve the alignment (Rusinkiewicz & Levoy, 2001; Wang et al., 2019; Xiang et
al., 2018).
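The ICP alignment loop can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not a production implementation: it uses brute-force nearest neighbours, applies no outlier rejection, and the function names `best_rigid_transform` and `icp` are illustrative choices rather than part of any pipeline used in this thesis.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (SVD-based)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def icp(model, scene, iterations=20):
    """Align model points (N, 3) to scene points (M, 3); returns the accumulated R, t."""
    R_total, t_total = np.eye(3), np.zeros(3)
    current = model.copy()
    for _ in range(iterations):
        # Brute-force search for the nearest scene point of every model point
        d2 = ((current[:, None, :] - scene[None, :, :]) ** 2).sum(axis=2)
        matches = scene[d2.argmin(axis=1)]
        R, t = best_rigid_transform(current, matches)
        current = current @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

Because each iteration only matches nearest points, the result depends on a reasonable initial pose and on a clean segment, which is why inaccurate masks can pull the alignment towards points outside the object.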

Fusion again makes this easier. Although adding another set of data such as RGB can
increase the search time in the point cloud, color bounding boxes can narrow the search
area, while depth denoising and surface normal estimation give better features for
registration. If CAD models aren’t available, the approximate geometry of the object or
prior patterns the robot has learnt about grasping objects can be used to plan a stable
grasp. For example, the covariance matrix of the segmented point cloud can be calculated,
which gives the main direction of the spread of the object’s points. This generally aligns
with the natural shape of the object and can be used to infer an approach strategy for the
gripper, especially for parallel jaw grippers and suction grippers (Mahler et al., 2017;
Miller & Allen, 2004; Schwarz et al., 2018). Training depth-only pose predictors on
synthetic data also works well in controlled scenarios.
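The covariance-based estimate of an object's main direction can be illustrated as follows. This is a small sketch assuming the segmented points are already available as an (N, 3) NumPy array in metres; the function name `principal_axes` and the synthetic point cloud are assumptions made for illustration only.

```python
import numpy as np


def principal_axes(points):
    """Return eigenvalues and unit axes of an (N, 3) point cloud, largest spread first."""
    centred = points - points.mean(axis=0)
    cov = np.cov(centred, rowvar=False)             # 3x3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]        # columns are the axes


# Synthetic elongated part: long in x, thin in z (made-up dimensions)
rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3)) * np.array([0.05, 0.01, 0.002])
spread, axes = principal_axes(cloud)
main_axis = axes[:, 0]                              # direction of largest spread
```

For a parallel jaw gripper, one plausible use is to close the jaws across the second axis while approaching along the third, although the best choice depends on the gripper and the part.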

 
2.6.7 Handling Occlusions in Cluttered Bins 

 
In industrial bin picking occlusions are unavoidable. To tackle this issue, depth
discontinuities are used to find object boundaries from sudden jumps in distance, even if
only part of the object is visible in the RGB image (Siltala & Latokartano, 2023; Wang et
al., 2019; Yang et al., 2007). By creating a point cloud from the depth map and grouping
even small fragments of the relevant object using Euclidean distance and surface normals,
clusters that belong to the same object can be identified (Liu et al., 2021; Rabbani et
al., 2006). Once clustered, these fragments can be matched to CAD models or used to
reconstruct the object from different points of view (Aldoma Buchaca et al., n.d.; Drost
et al., 2010; Kazhdan et al., 2006).
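Detecting boundaries from sudden jumps in distance can be sketched with simple neighbour differences. The example below assumes a depth map in metres stored as a NumPy array; the 1 cm threshold and the name `depth_discontinuities` are illustrative assumptions, and a real sensor would also need invalid pixels and noise handled first.

```python
import numpy as np


def depth_discontinuities(depth, jump=0.01):
    """Mark pixels whose depth differs from the left or upper neighbour by more than jump."""
    dz_x = np.abs(np.diff(depth, axis=1))   # horizontal neighbour differences
    dz_y = np.abs(np.diff(depth, axis=0))   # vertical neighbour differences
    edges = np.zeros(depth.shape, dtype=bool)
    edges[:, 1:] |= dz_x > jump
    edges[1:, :] |= dz_y > jump
    return edges


# Flat bin floor at 0.50 m with a 3x3 object lying 5 cm closer to the camera
depth = np.full((8, 8), 0.50)
depth[2:5, 2:5] = 0.45
edges = depth_discontinuities(depth)
```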

SAM relies on prompts to guide its segmentation, and if these prompts are generated from
depth data it can better differentiate the foreground from the background while generating
masks that stay true to the geometric shape of the object (Donne & Geiger, 2019; Kirillov
et al., 2023; Li et al., 2024; Wang et al., 2019). When dealing with flat or symmetric
objects, semantic PPF can help by aligning the visible fragments obtained from SAM’s masks
to a full pose using geometry and semantic labels (Aldoma Buchaca et al., n.d.; Drost et
al., 2010; Zhuang et al., 2021). Two physical changes can also be used to tackle
occlusions: a multi-view setup, which provides better viewing angles but needs accurate
extrinsics, and, as a last resort, shifting the pile slightly to expose more of the items.

 

 
2.6.8 Integration with Depth-based Bin Picking 

 
Hardware Components 

 
Hardware for a bin picking scenario needs to be selected to suit the required performance
of the system. Several technologies can be used to obtain a view of the objects. Modulated
ToF (Time of Flight) sensors such as the Kinect v2 work by emitting modulated infrared
light and measuring the time it takes to return to the sensor. They work at short to
medium ranges and deliver decent frame rates, which suits bin picking, but they can
struggle with reflective or transparent surfaces (Berger et al., 2013; Khoshelham &
Elberink, 2012; Lachat et al., 2015). Passive stereo units like the ZED can also be
considered, where two RGB cameras are used to infer depth by comparing the differences in
pixels between the two views. They work well in outdoor scenarios because they do not need
to project light, but they also struggle with shiny, reflective or textureless surfaces.
For close-range scenarios that need fine details captured, structured light sensors such
as the Intel RealSense D series can be used. Such a sensor emits a known pattern of
infrared light and measures how it deforms. Problems can arise when multiple structured
light sensors are used, because their patterns interfere with each other, and the sensors
are also sensitive to strong ambient light (Intel Corporation, 2019; Zollhöfer, 2019).

When considering the manipulation hardware, a 6 DoF robot arm has been standard in the
industry due to the flexibility it provides in reaching into cluttered bins and gripping
precisely. The end effector needs to be selected depending on the shape and texture of the
object. Vacuum grippers are great for flat, smooth surfaces, parallel-jaw grippers are good
for rigid parts with graspable edges, and custom grippers are good for unusual shapes or
delicate items. The robot can be controlled through ROS (Robot Operating System) drivers,
which handle communication and control, and MoveIt!, which handles motion planning,
collision handling and trajectory generation. Perception processing usually relies on the
GPU, while control is handled by a real-time controller, which handles tasks such as
motion planning. The networking speed between the processes is critical for staying
synchronized with low latency (Coleman & Robotics, n.d.; Create Realistic Robotics
Simulations with ROS 2 MoveIt and NVIDIA Isaac Sim | NVIDIA Technical Blog, n.d.; Janis
Arents et al., 2018). Specialized embedded boards like Jetson can run transformer models
if they are optimized carefully to ensure the lowest latency.

Lighting is also critical, since both RGB and depth sensors are sensitive to it. Diffuse,
synchronized light spreads evenly and avoids harsh shadows and bright spots, which
stabilizes RGB and depth data (Cippitelli et al., 2015; Zollhöfer, 2019). The mounts used
for the lights should be rigid to avoid drift but flexible enough to adapt to new bin
layouts. Finally, communication protocols such as fieldbuses or low-latency Ethernet keep
sensing and actuation tightly synchronized to deliver stable and reliable input for
segmentation and depth estimation.

 
Software Stack and Data Flow 

 
The software stack begins with acquisition: the ROS drivers or vendor SDKs stream RGB-D
frames with timestamps and IDs (Janis Arents et al., 2018), while multi-camera setups add
synchronization and triggers so that the captured depth and RGB frames align in time
(Cippitelli et al., 2015; Lachat et al., 2015; Siltala & Latokartano, 2023). Next comes
preprocessing, which can include undistortion, which corrects lens distortion; depth
correction, where depth values are mapped to the proper pixels; and noise removal, where
random errors are removed from the depth map. Extrinsic transforms are then stored in tf
trees in ROS, which allows the system to convert between different spaces: from pixel
coordinates to 3D camera coordinates to robot coordinates (Quigley et al., n.d.; Shaikh &
Chai, 2021).
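The chain from pixel coordinates to camera coordinates to robot coordinates can be made concrete with a pinhole camera model and a homogeneous transform. The sketch below ignores lens distortion, and the intrinsics (`fx`, `fy`, `cx`, `cy`) and the camera-to-robot transform `T_robot_cam` are made-up values for illustration, not calibration results from this work.

```python
import numpy as np


def deproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Convert a pixel (u, v) with depth in metres to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])


def camera_to_robot(point_cam, T_robot_cam):
    """Apply a 4x4 homogeneous extrinsic transform to a 3D camera-frame point."""
    return (T_robot_cam @ np.append(point_cam, 1.0))[:3]


# Hypothetical intrinsics for a 640x480 image and a pure-translation extrinsic
fx = fy = 600.0
cx, cy = 320.0, 240.0
T_robot_cam = np.eye(4)
T_robot_cam[:3, 3] = [0.1, 0.0, 1.0]

p_cam = deproject_pixel(320, 240, 0.5, fx, fy, cx, cy)   # pixel at the principal point
p_robot = camera_to_robot(p_cam, T_robot_cam)
```

In a ROS system the same extrinsic transform would normally be looked up from the tf tree instead of being hard-coded.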

Depth input can be given to SAM through early fusion, where a shared encoder processes
both inputs from the beginning, or through late fusion, where separate encoders are used
and their outputs are merged before decoding (Boulahia et al., 2021; Kirillov et al.,
2023; Zollhöfer et al., 2018). SAM needs prompts to guide segmentation. These prompts are
generated from depth edges and surface clusters. Once SAM outputs segment masks, the masks
are filtered so that only those that align with geometric shapes are kept. Next, the 3D
segments are passed to pose estimators or grasp planners. To keep latency low, batching
and reuse of the SAM embeddings across multiple frames are used.

 
2.6.9 Workflow for Object Segmentation and Grasp Planning 

 
A usual workflow for object segmentation starts with synchronized RGB-D frames and proper
calibration. Next, prompts are generated from the fused point cloud using two common
methods, depth thresholding and surface normal clustering, which produce regions that are
then projected back onto the RGB frame. These prompts are used to guide SAM (Rabbani et
al., n.d.; Yang et al., 2007; Zollhöfer et al., 2018). Afterwards, SAM encodes the image
once and applies the prompts to generate segmentation masks. The masks can be
cross-checked against the main depth distribution of the corresponding surface to remove
invalid masks (Kirillov et al., 2023; Rabbani et al., n.d.; Zollhöfer et al., 2018). The
confidence scores obtained for the masks are combined with a geometry check when
available, and the most reliable masks are passed on for pose estimation or grasp
planning.
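Depth thresholding as a prompt source can be sketched as follows. The example assumes a depth map in metres where 0 marks invalid pixels, keeps everything within a band of the nearest valid depth, and returns the region centroid as an (x, y) point prompt. The 2 cm band and the name `depth_threshold_prompt` are illustrative assumptions; for non-convex regions the centroid may fall outside the region, so a real pipeline would need a more careful point choice.

```python
import numpy as np


def depth_threshold_prompt(depth, band=0.02):
    """Return the top-layer region and its centroid (x, y) as a candidate point prompt."""
    valid = depth > 0                         # 0 is treated as "no depth reading"
    nearest = depth[valid].min()
    region = valid & (depth <= nearest + band)
    ys, xs = np.nonzero(region)
    point = (int(round(xs.mean())), int(round(ys.mean())))
    return region, point


# Bin floor at 0.60 m, one object at 0.40 m, and one invalid pixel
depth = np.full((6, 6), 0.60)
depth[1:4, 2:5] = 0.40
depth[0, 0] = 0.0
region, point = depth_threshold_prompt(depth)
```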

The filtered masks are then projected into 3D space. If CAD models are available, ICP can
be used to align the masked segments to the model, but with occlusions a method such as
semantic PPF would be used to assist with the alignment (Drost et al., 2010; Zhuang et
al., 2021). When CAD models are not available, geometric aspects such as principal axes
and curvature, or machine learning models trained to predict where a gripper can hold an
object, can be used (Kumra et al., 2020; Shao et al., 2023). To improve the accuracy of
the whole process, multi-view merging of segments can also be used to avoid some of the
guesswork with occluded objects (Aldoma Buchaca et al., n.d.; Drost et al., 2010).

Finally, when it comes to grasp planning, planners such as MoveIt! select candidates using
kinematics and collision checks. However, when visibility is low, the grasp planner will
plan a nudge of the scene to change the occlusions and then run segmentation again (Dogar
& Srinivasa, 2010; Zeng et al., 2020). If the grasp fails, force, torque or vision
feedback is used to rerun the segmentation around the remaining objects. To optimize the
time per pick, segmentation and grasp planning are run in parallel. This makes the whole
picking process smoother and minimizes delays.

 
2.7 Classic computer vision for segmentation and proposal 

 
2.7.1 Bottom-Up Segmentation Approaches 

 
Early segmentation and object detection started by searching for lines or contours in
images to identify the boundaries of the objects. If many smaller regions were found, they
were considered for grouping into a bigger region when the color and texture of the areas
were similar (Nielsen & Nock, 2013). Regions were then given an objectness score to decide
whether they belonged to the foreground or the background. The main idea was to generate
as many candidate masks as possible so as not to miss any subjects in the image.
Afterwards, classifiers filtered out the lower-probability masks (Girshick et al., 2013;
Uijlings et al., 2013).

 
2.7.2 Region Proposal Generation: Selective Search  

 
In industrial bin picking, however, classical methods provided mixed results. They were
effective when lighting conditions were good and objects had clean surfaces with distinct
edges and boundaries, but in bin picking scenarios that was rarely the case. The bins are
messy and cluttered, and with low contrast the performance of these systems dropped
dramatically. A common scenario is identical objects packed tightly together, where edge
detection systems would fail to identify them separately most of the time. The classical
methods did not understand the scene and would even treat shadows or surface marks as
object boundaries (Danielczuk et al., 2019; Shi & Malik, 2000).

 

 
Methods such as selective search improved the traditional segmentation by including 

multi scale segmentation. Objects of many sizes were analyzed on various cues such as 

color, texture and shape. This helps in identifying parts of many sizes in a bin. The issue 

that arose with this method is that it produced too many overlapping boxes requiring 

extra filtering afterwards. 

 
2.7.3 Sliding Window Detection 

 
In these methods the image was scanned patch by patch for features like HOG (edge
patterns) or SIFT (key points). Each scanned patch of the image was described by these
features, and a classifier such as an SVM then decided whether the patch contained an
object. This method was able to detect objects even with weak or noisy edges but was
computationally heavy and struggled to detect unusual shapes compared to modern deep
learning methods (Dalal & Triggs, 2005; Girshick et al., 2013; Lowe, 2004).
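The scanning step itself is simple to sketch; the cost comes from running a feature extractor and classifier on every window. The generator below only enumerates window coordinates, with the window size and stride chosen arbitrarily for illustration. The feature extraction and SVM stages are deliberately omitted.

```python
def sliding_windows(width, height, window=64, stride=32):
    """Yield (x0, y0, x1, y1) boxes covering an image of the given size."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)


# A 128x128 image scanned with 64-pixel windows and 50% overlap
boxes = list(sliding_windows(128, 128))
```

Even this small image produces nine windows at a single scale, which illustrates why exhaustive scanning over several scales quickly becomes computationally heavy.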

 
2.7.4 Voxel-Space Processing  

 
The classic methods also extended to 3D recognition. RGB-D data was converted to a point
cloud, which was divided into voxels (3D pixels), and box detectors were run through the
resulting grid. This method was effective when the objects being detected had uniform
textures or colors, as in bins filled with identical parts. Although voxel-space
segmentation is simpler and less effective than today’s state of the art, it worked
reasonably well (Maturana & Scherer, 2015).
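Turning a point cloud into voxels amounts to quantising each coordinate to a grid index. The sketch below assumes points in metres as an (N, 3) NumPy array and a 1 cm voxel size; both the voxel size and the function name `voxelize` are illustrative assumptions.

```python
import numpy as np


def voxelize(points, voxel_size=0.01):
    """Return the occupied voxel indices and the number of points in each voxel."""
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    occupied, counts = np.unique(idx, axis=0, return_counts=True)
    return occupied, counts


points = np.array([
    [0.000, 0.000, 0.000],
    [0.004, 0.002, 0.001],   # falls in the same voxel as the first point
    [0.015, 0.000, 0.000],
    [0.015, 0.012, 0.000],
])
occupied, counts = voxelize(points)
```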

While effective at the time, these methods faced some challenges in an industrial setting.
Since most of them relied on handcrafted feature mapping, they had to be retuned every
time the product or the parts changed (Zeng et al., 2018).



 
Another major challenge was occlusion. Hierarchical segmentation failed in this respect,
either merging two objects into one or splitting one object into many segments (Le & Lin,
2019).

Despite their flaws, these methods provided the foundation for modern approaches. The
modularity of classic methods, from contour detection through region merging to proposal
scoring, continues in modern perception pipelines powered by deep learning. The visual
cues from classical methods are also useful when combined with depth or pose estimation,
and they serve as baselines or sanity checks in domain adaptation (Danielczuk et al.,
2019).

In the next section we will investigate the current deep learning methods used in bin 

picking segmentation scenarios. 

 
2.8 Supervised Learning Methods 

 
2.8.1 Overview  

 
Deep learning pipelines have changed how robotic systems approach cluttered bin picking
scenarios, offering a level of adaptability that older hand-crafted feature matching
pipelines could not. Using deep learning models such as convolutional neural networks
(CNNs), robots can interpret images semantically and create object masks. A well-known
example is Mask R-CNN, which integrates detection and segmentation into the same pipeline,
giving it the ability to segment single objects in a cluttered scene (He et al., 2018).
Where classical methods used only low-level features to detect objects, deep learning
methods capture low-level features such as edges and textures in the early layers of the
network and high-level features such as shapes in deeper layers, at various scales
(Geirhos et al., 2018; Jogin et al., 2018). This allows these methods to infer objects in
situations of high occlusion or low lighting.



 
In bin picking scenarios CNN-based segmentation alone is not enough. The robot needs
further information to plan how to pick the object, so the pose is necessary. To obtain
it, some pipelines feed the segmented masks to pose estimation modules. Point Pair Feature
matching is one such method; however, these methods can fail if the segmented object is
occluded, which can lead to more grasping errors (Aldoma Buchaca et al., n.d.; Danielczuk
et al., 2019; Drost et al., 2010; D. Liu et al., 2021; Zhuang et al., 2021). To deal with
these issues, researchers have come up with joint methods that combine segmentation and
pose estimation, such as PPR-Net, which estimates pose directly from a point cloud while
remaining aware of the segmented regions, averaging the poses over these regions to get
stable results. These methods have shown improvements of 15-41% over previous methods
(Dong et al., 2019). From an engineering point of view, it is inefficient to focus only on
seeing and identifying the objects; the model also needs an idea of whether the detected
objects can be grasped. By embedding affordance logic into deep segmentation models, the
graspability of the objects can be determined. This is what the GQ-CNN model does, using
depth data to predict the success of a parallel jaw gripper at candidate points (Mahler et
al., 2017; Xie et al., 2023b), thereby reducing the number of failed attempts on points
that are visually clear but low in graspability.

Deep learning models also fail in another situation: when a network trained on one
specific type of object is used on a different object, its accuracy can drop significantly
(Danielczuk et al., 2019). To solve this, foundation models such as Segment Anything have
been used. These models are designed to detect general object boundaries of various kinds,
and filters can be implemented depending on the industry. While adaptability is important,
speed is also important, which is why lightweight models such as FastSAM use optimized CNN
backbones instead of computationally heavy encoders, giving them speed in detection as
well as adaptability (Zhao et al., 2023).

 

 
2.8.2 Model Training and Validation 

 
As discussed previously with regard to segmentation strategies, in the training stage the
RGB and depth images need to be aligned perfectly for the model to learn anything relevant
from the data (Cordeiro et al., 2023; Zhuang et al., 2021). Because industry-specific
datasets are small and specific, training on them alone can lead to overfitting.
Therefore, pretraining models on large generic datasets helps the model identify a broad
variety of features, after which it can be fine-tuned with the industry dataset (Kirillov
et al., 2023). The loss function is also designed so that it considers Dice loss and focal
loss together. In addition, the batch composition, such as the amount of clutter, needs to
be balanced to prevent bias.
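A combined Dice and focal loss for a single binary mask can be sketched in NumPy as follows. This is an illustration of the idea rather than a training-ready implementation: a real training loop would use a differentiable framework such as PyTorch, and the weights, `gamma` and `alpha` values below are placeholder assumptions, not values used in this thesis.

```python
import numpy as np


def dice_loss(prob, target, eps=1e-6):
    """1 - Dice coefficient between predicted probabilities and a binary target mask."""
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)


def focal_loss(prob, target, gamma=2.0, alpha=0.25, eps=1e-6):
    """Mean binary focal loss; down-weights pixels that are already well classified."""
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return float((-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)).mean())


def combined_loss(prob, target, dice_weight=1.0, focal_weight=1.0):
    """Weighted sum of the two losses (weights are hypothetical)."""
    return dice_weight * dice_loss(prob, target) + focal_weight * focal_loss(prob, target)


target = np.array([[1.0, 1.0], [0.0, 0.0]])
loss_perfect = combined_loss(target.copy(), target)
loss_inverted = combined_loss(1.0 - target, target)
```

Dice loss rewards overall overlap, which helps with small objects, while focal loss concentrates on hard pixels such as boundaries, so the two terms complement each other.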

When it comes to validating segmentation models for bin picking, the model should be
validated not only on clean benchmark data but also on real-world outcomes such as pose
accuracy and grasp success. It also needs to be tested across various sensors and lighting
conditions to validate robustness (Cordeiro et al., 2023; Zhuang et al., 2021). Finally,
calibration needs to be considered, since calibration errors can show up as model errors
if not dealt with properly.

 
2.9 Reinforcement Learning  

 
Reinforcement learning differs from supervised learning in that the model learns not from
labels but by trial and error. In bin picking a robot cannot always succeed with a
one-shot grasp; with reinforcement learning it can learn extra strategies like pushing or
reorienting objects to grasp the target object (Laskey et al., n.d.; Zeng et al., 2020).
If the robot only sees RGB it can miss important 3D information. By adding depth data, it
can update its policy accordingly (Cordeiro et al., 2023; Zhuang et al., 2021; Zollhöfer
et al., 2018). Reinforcement learning relies heavily on the reward, and if the only reward
were the success of picking up the object, learning would be too slow. Therefore, the
reward also needs to include reducing occlusions and aligning the gripper with the best
grasp angles to speed up learning (Kumra et al., 2020). The use of optimum force and
torque can also be part of the reward.
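A shaped reward of this kind might look like the sketch below. The terms follow the description above (grasp success, reduced occlusion, gripper alignment), but the function name, the occlusion measure as a fraction between 0 and 1, and all weights are hypothetical choices made for illustration.

```python
def shaped_reward(grasp_success, occlusion_before, occlusion_after,
                  angle_error_rad, w_success=1.0, w_occlusion=0.3, w_angle=0.1):
    """Combine a sparse success reward with denser shaping terms."""
    reward = w_success * (1.0 if grasp_success else 0.0)
    reward += w_occlusion * (occlusion_before - occlusion_after)  # bonus for exposing the target
    reward -= w_angle * abs(angle_error_rad)                      # penalty for poor alignment
    return reward


# A push that fails to grasp but reduces occlusion from 60% to 40% still earns a small reward
push_reward = shaped_reward(False, 0.6, 0.4, 0.0)
grasp_reward = shaped_reward(True, 0.5, 0.5, 0.0)
```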

The various RL algorithms balance stability and efficiency in different proportions. For
example, DQN variants work for discrete actions (e.g., suction, pinch, side grasp). DQN
works well for small action sets and is efficient but struggles with noise and very large
action sets, so Double DQN is used to reduce bias (Iriondo et al., 2021; Joshi et al.,
2020). Continuous-action methods like PPO and SAC generate smooth motions, making them
better for fine manipulation, but they also need more samples to learn from than
value-based methods like DQN.
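The way Double DQN reduces bias can be shown with its target computation: the online network selects the next action, while the target network evaluates it, instead of taking the maximum of a single network's noisy estimates. The sketch below uses plain Python lists for the Q-values of the three example actions; the numbers are made up.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Bootstrap target: action chosen by the online net, valued by the target net."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Q-values for (suction, pinch, side grasp) in the next state
q_online_next = [0.2, 0.9, 0.1]
q_target_next = [0.5, 0.3, 0.8]
target = double_dqn_target(1.0, False, q_online_next, q_target_next, gamma=0.9)
```

Here the target uses the target network's value for "pinch" (0.3) rather than its own maximum (0.8), which is the mechanism that counteracts overestimation.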

Having established a broader understanding of the requirements for accuracy and effi-

ciency in bin picking, and how these factors can optimize industrial environments, the 

next step is to outline the methodology. Specifically, we focus on designing an accessible, 

simple, and modular segmentation pipeline built on SAM, aimed at addressing the limi-

tations identified in prior work and enabling practical deployment in bin picking scenar-

ios. 

 

 
3 Methodology 

 
3.1 System Setup 

 
The experimental setup used for these tests, shown in Figure 2, is simple and easy to
implement. On the hardware side, an Intel RealSense D435i depth and RGB camera was used to
capture RGB-D data. It provides a resolution of up to 1280x720 pixels at up to 30 Hz, with
an operational range from 0.3 m to 3 m. For processing, a laptop with an Intel i5-7300HQ
at 2.5 GHz, 16 GB of RAM and an Nvidia 1050Ti GPU that supports CUDA was used.

On the software side, the IDE used was PyCharm, and the Segment Anything Model (SAM), a
segmentation model implemented in PyTorch, was used with its ViT-B encoder. The other
libraries used were OpenCV for image processing and NumPy for mathematical calculations. A
custom scoring system was created to determine the best mask, taking into account factors
such as median depth, mask area and border penalties.



 
Figure 2. Real dataset capture setup including a depth and RGB intel RealSense D435i camera 
and a laptop with an intel i5-7300HQ at 2.5 GHz, 16 GB of RAM and a Nvidia 1050Ti 
GPU. 

  

 
3.2 Image acquisition and loading of images 

 
As discussed later in this study, a subset of the Siléane dataset was used to run a
quantitative analysis of the pipeline, and a RealSense RGB-D camera was used to collect a
set of data for a qualitative analysis. Consequently, two different methods were needed to
collect and load the image data.

 
3.2.1 RealSense camera image acquisition and dataset creation. 

 
import pyrealsense2 as rs

# Configure colour and depth streams at the same resolution and frame rate
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

# Align depth frames to the viewpoint of the colour sensor
align = rs.align(rs.stream.color)

for _ in range(30):  # allow auto-exposure to settle
    frames = pipeline.wait_for_frames()

frames = pipeline.wait_for_frames()
aligned_frames = align.process(frames)
color_frame = aligned_frames.get_color_frame()
depth_frame = aligned_frames.get_depth_frame()

Figure 3 RealSense pipeline setup for RGB and depth streams, which are aligned using the
rs.align object.

 
A pipeline is created to start streaming from the RealSense camera. It creates two
streams: a standard RGB stream and a stream of 16-bit unsigned integers representing
depth. Both streams are synchronized in time but, importantly, are captured from different
perspectives, since there is an offset between the two sensors in the camera. The depth
frame therefore needs to be aligned to the color frame with the rs.align object. A lower
resolution was selected here to reduce the processing time of the encoding process. Next,
the image and depth data are collected one frame at a time, as the position of the objects
needs to be changed for each image. A warm-up period is given to the camera so that the
auto exposure settles. Finally, both frames are converted to NumPy arrays, and the depth
array is converted to mm before both are saved as PNG files.

This cycle was repeated for all images collected. 
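The depth conversion step can be sketched independently of the camera. The function below assumes the raw z16 frame has already been turned into a NumPy array (in pyrealsense2 this is typically done with `np.asanyarray(depth_frame.get_data())`) and that the sensor's depth scale in metres per unit is known (available from the depth sensor's `get_depth_scale()`). The function name and the sample values are illustrative assumptions.

```python
import numpy as np


def depth_to_millimetres(depth_raw, depth_scale):
    """Convert raw depth units to whole millimetres stored as 16-bit integers."""
    depth_m = depth_raw.astype(np.float64) * depth_scale   # raw units -> metres
    return np.round(depth_m * 1000.0).astype(np.uint16)    # metres -> mm


# Made-up raw frame; 0 marks pixels without a depth reading
depth_raw = np.array([[0, 500, 1234]], dtype=np.uint16)
depth_mm = depth_to_millimetres(depth_raw, depth_scale=0.001)
```

Keeping the result as 16-bit unsigned integers means it can be written to a 16-bit PNG without losing precision, which an 8-bit image would not allow.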

The images collected here were then manually annotated to mark some probable 

ground truths. This was done to see how the selection pipeline behaved compared to an 

intuitive selection. 

 
3.2.2 Siléane Loader 

 
The subset used from the Siléane data was the Siléane bricks dataset. The
SiléaneBricksDataset loader is a custom dataset class that loads the BGR, depth and
segmentation files from a specific local folder. It maps each frame ID to its
corresponding image, depth and segmentation files according to a fixed naming scheme.
Example: "frame_001" → "rgb/frame_001.png". Next, the camera intrinsics are loaded from
the text files provided with the Siléane dataset. They are needed to align the depth and
RGB data in 3D space. The BGR data is then converted to RGB, and the raw depth values are
normalized to meters. Finally, a dictionary is created with the keys below, which can be
loaded into the segmentation pipeline with SAM and the scoring system.

 
# Excerpt from the dataset loader
self.intrinsics = self._load_intrinsics(os.path.join(root, "camera_params.txt"))
depth_raw = cv2.imread(self.depth_map[frame_id], cv2.IMREAD_UNCHANGED)
depth_meters = self._normalize_depth(depth_raw)

# Dictionary created for each frame
{
    "rgb": rgb,
    "depth": depth_meters,
    "segmentation": seg_mask,
    "intrinsics": self.intrinsics,
    "frame_id": frame_id,
}

Figure 4 Loading of the Siléane dataset with the created dictionary. 
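The mapping from frame IDs to files can be illustrated with a small helper. Only the "rgb/frame_001.png" pattern is taken from the description above; the "depth" and "segmentation" folder names, the shared ".png" extension and the function name `build_frame_maps` are assumptions made for this sketch and may differ from the actual loader.

```python
from pathlib import Path


def build_frame_maps(root, frame_ids):
    """Map each modality to a {frame_id: file path} dictionary."""
    root = Path(root)
    folders = ("rgb", "depth", "segmentation")   # folder names other than "rgb" are assumed
    return {
        name: {fid: root / name / f"{fid}.png" for fid in frame_ids}
        for name in folders
    }


maps = build_frame_maps("data", ["frame_001", "frame_002"])
rgb_path = maps["rgb"]["frame_001"]
```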

 

 
3.3 Segmentation Pipeline 

 
Step 1: Depth-based candidate box extraction 

In the initial iteration of the scoring pipeline, SamPredictor was used as an option in
the segmentation process to reduce the computational load by only segmenting relevant
parts of the image, but it was found to be less accurate in obtaining proper segmentation
masks. Issuing bounding boxes also seemed irrelevant in most cases, as the whole box area
needs to be scanned for objects. The computation time saved by using SamPredictor was
therefore not worth the loss in mask accuracy compared with the
SamAutomaticMaskGenerator.

 
Step 2: SAM automatic segmentation 

Therefore, the whole image is fed into SAM and its SamAutomaticMaskGenerator utility is
used to generate masks for the whole image. The image is first embedded by the image
encoder, a vision transformer, and the lightweight mask decoder then produces the output
masks (Kirillov et al., 2023).

 
Step 3: Scoring function 

All the segments obtained from the model are then run through a scoring (penalty)
function. The purpose of this function is to select the objects that are most suitable for
bin picking depending on the object’s depth, depth variance, area, shape, proximity to the
image border and SAM confidence level. The weights of these elements are left to the user
to be tuned intuitively, although in practice the only aspect that should need tuning is
the area-dependent weights. Masks are first filtered by the median depth of all the pixels
in the mask, to ensure that the detected objects are in a proper working range and that
outliers caused by random depth errors do not affect the selection of suitable masks.
Next, the standard deviation of the depth values in each mask is used to select masks that
have consistent depth values. When taking images of objects in a box, it was noticed that
the walls of the box tend to come up as good segments due to their large area and
closeness to the camera. It was therefore necessary to normalize the area of the masks to
the total image area and penalize masks that were too large or too small (covering >5% or
<2% of the image). Finally, it was also necessary to consider that in bin picking, objects
at the edge of the image are slightly harder to pick than the ones in the middle, so a
border penalty was also created. All of the factors that are used to compute a mask score
are shown below, including snippets from the code that computes those features.

 
1. Median Depth 

$\tilde{d} = \operatorname{median}(D_{\text{mask}})$                            (1)

 
Definition: The median depth of all valid pixels inside the mask. 

The median depth is used directly as the depth score of each mask. No further equations are needed, which keeps the calculation simple and applicable to all scenarios. Because the depth score is the median depth of a mask, the score of a specific object can differ between hardware setups, but this does not affect which object is selected, since the score is only compared against the other objects captured with the same hardware setup.
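As a minimal sketch of equation (1), assuming the depth map is a NumPy array in metres in which 0 marks a missing reading, the median depth of a mask could be computed as follows. The function name `median_depth` and the toy values are illustrative assumptions and are not taken from the thesis code.

```python
import numpy as np


def median_depth(depth_m: np.ndarray, mask: np.ndarray) -> float:
    """Median of the valid (non-zero) depth readings inside a mask.

    Returns NaN when the mask contains no valid reading.
    """
    vals = depth_m[mask.astype(bool)]
    vals = vals[vals > 0]  # assumption: 0 means "no depth reading"
    return float(np.median(vals)) if vals.size else float("nan")


# Toy 3x3 depth map in metres with one missing pixel and one outlier.
depth = np.array([[0.50, 0.52, 0.00],
                  [0.51, 0.50, 3.00],
                  [0.90, 0.90, 0.90]])
mask = np.zeros_like(depth, dtype=bool)
mask[:2, :] = True  # the top two rows belong to the mask

d_med = median_depth(depth, mask)
d_mean = float(depth[mask][depth[mask] > 0].mean())
```

In this toy example the median is 0.51 m, whereas the mean is pulled to about 1.0 m by the single 3.0 m outlier, which illustrates why the median is the more stable choice for a depth score.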

 
2. Depth Variance 

$\sigma_d = \operatorname{std}(D_{\text{mask}})$                                  (2)

 
Definition: The standard deviation of depth values inside the mask. 

The standard deviation of the depth matters for making sure that a mask covers one single object. If the depths vary too much, the mask may be noisy or may cover multiple objects.

dv = np.std(vals)                 # depth standard deviation of the mask, in metres
var_score = min(dv / 0.02, 1.0)   # normalize by 2 cm, cap at 1.0

Figure 5 Depth variance normalization. 

 

 
3. Area Normalization 

$A_{\text{norm}} = \dfrac{|\text{mask}|}{H \times W}$                                   (3)

 
Definition: Ratio of mask area to total image area. 

To avoid masks that are too large or too small to be viable in practice, any mask covering more than 5% or less than 2% of the image receives a penalty that grows with its distance from this range.

 
area_pixels = mask.sum()
area_norm = area_pixels / (H * W)  # fraction of image covered

if area_norm < 0.02:    # too small
    area_score = (0.02 - area_norm) / 0.02  # penalty grows as it gets smaller
elif area_norm > 0.05:  # too large
    area_score = (area_norm - 0.05) / 0.05  # penalty grows as it gets larger
else:
    area_score = 0.0    # ideal range → no penalty

Figure 6 Penalty score as a function of the deviation from the ideal range of 2% to 5%.

 
The area score for the real-world dataset, however, needs a more nuanced approach than a fixed boundary score. Here the score is given by how far the segmented mask area deviates from the average ground truth area. This improves the selection among the generated masks without retraining on specific data. The user can switch between the two area scores as preferred.
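The reference-area variant described above could be sketched as follows. This is only one possible reading: the thesis does not give the exact formula, so the relative-deviation normalization and the names `area_score_relative` and `mean_gt_area_norm` are assumptions made for illustration.

```python
def area_score_relative(area_norm: float, mean_gt_area_norm: float) -> float:
    """Penalty that grows with the relative deviation from a reference area.

    Both arguments are fractions of the image area. The normalization by the
    reference area is an assumption, not the thesis implementation.
    """
    if mean_gt_area_norm <= 0:
        raise ValueError("reference area must be positive")
    return abs(area_norm - mean_gt_area_norm) / mean_gt_area_norm


# Hypothetical reference: ground truth objects cover 3% of the image on average.
reference = 0.03
score_match = area_score_relative(0.03, reference)    # same size as the reference
score_double = area_score_relative(0.06, reference)   # twice the reference area
score_half = area_score_relative(0.015, reference)    # half the reference area
```

With this form, a mask of exactly the reference size receives no penalty, and the penalty grows linearly in both directions, so no hand-picked lower and upper thresholds are needed.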

 
4. Shape Compactness 

$C = 1 - \dfrac{4\pi \cdot A}{P^2 + \epsilon}$                                   (4)

• Where 𝐴 = mask area, 𝑃 = contour perimeter.

 

 
A compactness score is used to assess the regularity of each mask, so that irregular shapes are penalized accordingly. This is done with the Polsby-Popper compactness score (Belotti et al., n.d.), the implementation of which can be seen in figure 7.

 
contours, _ = cv2.findContours(mask.astype(np.uint8),
                               cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)

if contours:
    # Use total perimeter across all contours
    perimeter = sum(cv2.arcLength(c, True) for c in contours)

    # Polsby-Popper compactness; area is already mask.sum()
    compactness = (4 * np.pi * area_pixels) / (perimeter ** 2 + 1e-6)

    # Map compactness into a [0, 1] penalty, following equation (4)
    # Perfect circle → compactness ≈ 1 → shape_score ≈ 0
    # Jagged/elongated → compactness ≪ 1 → shape_score closer to 1
    shape_score = 1.0 - min(compactness, 1.0)
else:
    shape_score = 1.0

Figure 7 Shape/compactness score using the Polsby-Popper compactness score. 

 
Here the boundary of the mask is extracted and measured to obtain the perimeter, which is used to calculate the compactness. An ideally compact mask (a circle) has a compactness of 1, which gives a shape penalty of 0.
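To give a feel for the values that equation (4) produces, the following sketch evaluates it for three analytic shapes. It uses only closed-form areas and perimeters rather than contours extracted from an image, and the clamp to [0, 1] as well as the function names are assumptions made for this illustration.

```python
import math


def polsby_popper(area: float, perimeter: float, eps: float = 1e-6) -> float:
    """Polsby-Popper compactness 4*pi*A / (P^2 + eps); equals 1 for a circle."""
    return 4.0 * math.pi * area / (perimeter ** 2 + eps)


def shape_penalty(area: float, perimeter: float) -> float:
    """Equation (4): C = 1 - compactness, clamped so it never goes below 0."""
    return 1.0 - min(polsby_popper(area, perimeter), 1.0)


# Circle of radius 10, square of side 20 and a thin 10 x 100 strip.
circle = shape_penalty(math.pi * 10 ** 2, 2 * math.pi * 10)
square = shape_penalty(20 * 20, 4 * 20)
strip = shape_penalty(10 * 100, 2 * (10 + 100))
```

The circle receives a penalty of essentially 0, the square about 0.21 (its compactness is π/4) and the thin strip about 0.74, so elongated masks are penalized much more strongly than compact ones.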

 
5. Border Penalty 

$P_{\text{border}} = \dfrac{1}{\min(\Delta x, \Delta y) + 1}$                            (5)

Where Δ𝑥 and Δ𝑦 are the distances from the mask to the nearest image border.

Proximity to the image border is penalized because masks found at the border tend to be incomplete, as they are cut off by the camera frame, and may also be too close to the bin walls, which hinders the picking process. As SAM tends to prefer flat objects with a large surface area in segmentation, a border penalty also reduces the possibility of a wall being detected as an object to be picked.

 
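ys, xs = np.nonzero(mask)  # row (y) and column (x) indices of the mask pixels
edge_dist = min(xs.min(), W - xs.max(), ys.min(), H - ys.max())
border_penalty = 1.0 / (edge_dist + 1)

# xs.min()     → distance from the left edge (leftmost pixel of the mask)
# W - xs.max() → distance from the right edge
# ys.min()     → distance from the top edge (topmost pixel of the mask)
# H - ys.max() → distance from the bottom edge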

Figure 8 Border penalty calculation. 

 
As seen in figure 8, the code first finds the smallest distance between the mask and any of the four image borders: the left edge (xs.min()), the right edge (W - xs.max()), the top edge (ys.min()) and the bottom edge (H - ys.max()). That minimum value (edge_dist) represents how close the mask is to the nearest border. The border_penalty is then computed as 1.0 / (edge_dist + 1), so the closer the mask is to an edge, the larger the penalty becomes. The more pixels there are between the mask and every image border, the lower the penalty score.
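The behaviour of the border penalty can be checked on synthetic masks. The sketch below wraps the calculation from figure 8 in a function; the function name, the handling of empty masks and the 100 x 100 test images are assumptions made for illustration.

```python
import numpy as np


def border_penalty(mask: np.ndarray) -> float:
    """1 / (distance to the nearest image border + 1) for a boolean mask."""
    H, W = mask.shape
    ys, xs = np.nonzero(mask)  # row and column indices of the mask pixels
    if xs.size == 0:
        return 1.0  # assumption: an empty mask receives the maximum penalty
    edge_dist = min(xs.min(), W - xs.max(), ys.min(), H - ys.max())
    return float(1.0 / (edge_dist + 1))


# A 20 x 20 mask in the middle of a 100 x 100 image.
center_mask = np.zeros((100, 100), dtype=bool)
center_mask[40:60, 40:60] = True

# The same mask shifted so that it touches the left image border.
edge_mask = np.zeros((100, 100), dtype=bool)
edge_mask[40:60, 0:20] = True

p_center = border_penalty(center_mask)
p_edge = border_penalty(edge_mask)
```

The centered mask is 40 pixels away from the nearest border and receives a penalty of 1/41, roughly 0.024, while the mask touching the border receives the maximum penalty of 1.0. Because the penalty falls off quickly with distance, it mainly separates masks that touch or nearly touch the border from all others.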

 
Step 4: Best mask and top mask selection 

Finally, after all the masks have been run through the scoring function, the mask with the lowest score is selected as the best mask. This mask is compared with the top mask, which is the mask with the lowest median depth, to evaluate the performance of the best mask. In the evaluation phase, where the selected masks need to be compared with ground truths, the object that has the highest IoU with a given mask is selected as the ground truth object on which the rest of the evaluation metrics are computed.

For the best mask selection, the penalties were added up, each with its own weight per category. SAM's own predicted IoU and stability scores were also considered in this score. The higher the total score of a segment, the lower its rank in terms of being picked.

 
score = (
    weights["depth"] * depth_score +
    weights["var"] * var_score +
    weights["area"] * area_score +
    weights["shape"] * shape_score +
    weights["border"] * border_penalty +
    weights["sam_iou"] * sam_iou_score +
    weights["sam_stability"] * sam_stability
)

Figure 9 Structure of the scoring equation. 
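The selection step can be illustrated with a small, self-contained sketch. The weights are those of the full configuration in figure 12, but the three candidate masks and their feature values are invented for illustration, and the SAM terms are assumed to be already expressed as penalties (lower is better), which the thesis does not spell out.

```python
# Weights of the full ablation configuration (figure 12).
WEIGHTS = {"depth": 30.0, "var": 0.5, "area": 5.0, "shape": 1.5,
           "border": 6.0, "sam_iou": 1.0, "sam_stability": 1.0}

# Hypothetical penalty values for three candidate masks (lower is better).
candidates = [
    {"depth": 0.40, "var": 1.0, "area": 2.0, "shape": 0.60,   # bin wall
     "border": 1.00, "sam_iou": 0.05, "sam_stability": 0.03},
    {"depth": 0.55, "var": 0.2, "area": 0.0, "shape": 0.25,   # central brick
     "border": 0.02, "sam_iou": 0.04, "sam_stability": 0.02},
    {"depth": 0.50, "var": 0.1, "area": 0.8, "shape": 0.30,   # brick fragment
     "border": 0.05, "sam_iou": 0.10, "sam_stability": 0.08},
]


def total_score(features: dict, weights: dict) -> float:
    """Weighted sum of the penalties, as in figure 9."""
    return sum(weights[k] * features[k] for k in weights)


def select_best(masks: list, weights: dict) -> int:
    """Index of the candidate with the lowest total score."""
    return min(range(len(masks)), key=lambda i: total_score(masks[i], weights))


best_idx = select_best(candidates, WEIGHTS)

# With every weight except depth set to zero, the choice reduces to the
# candidate with the lowest depth, i.e. the topmost mask.
depth_only = {k: (30.0 if k == "depth" else 0.0) for k in WEIGHTS}
top_idx = select_best(candidates, depth_only)
```

In this constructed example the full score selects the central brick (index 1), whereas the depth-only configuration selects the bin wall (index 0) because it is closest to the camera.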

 
3.3.1 Evaluation Protocol 

 
• Metrics: IoU, precision, recall and F1. 

The performance of the model was evaluated quantitatively using the standard segmentation metrics listed above.

Intersection over Union measures the overlap between the predicted masks and the 

ground truth masks provided in the Siléane dataset itself. 

Precision indicates how clean the predicted mask is: ideally it includes only the object that needs to be detected and none of the background.

$\text{Precision} = \dfrac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

Recall shows how much of the intended object was successfully identified. The more of the object that is included in the segment, the higher the recall.

$\text{Recall} = \dfrac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

The F1 score takes both precision and recall into consideration and balances the two. A high F1 score requires both high recall and high precision.

$F_1 = 2 \cdot \dfrac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

In addition to the quantitative validation on the Siléane dataset, a qualitative analysis compares the generated masks for both the dataset and a small sample of manually annotated real-world images, so that the performance of the model can be analyzed in an artificial environment and in the real world.
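The four metrics above can be computed pixel-wise from a predicted and a ground truth binary mask. The following is a minimal sketch, not the evaluation code of the thesis. The function name, the convention of returning 0 for empty denominators and the toy masks are assumptions.

```python
import numpy as np


def mask_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel-wise IoU, precision, recall and F1 between two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = int(np.logical_and(pred, gt).sum())   # predicted and in ground truth
    fp = int(np.logical_and(pred, ~gt).sum())  # predicted but not in ground truth
    fn = int(np.logical_and(~pred, gt).sum())  # ground truth pixels that were missed

    def safe_div(a, b):
        return a / b if b else 0.0  # assumption: empty denominators score 0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "iou": safe_div(tp, tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


gt = np.zeros((10, 10), dtype=bool)
gt[2:6, 2:6] = True    # 16-pixel ground truth object
pred = np.zeros((10, 10), dtype=bool)
pred[2:6, 2:4] = True  # 8-pixel mask covering half of the object

m = mask_metrics(pred, gt)
```

Here the predicted mask lies entirely inside the object but covers only half of it, giving a precision of 1.0, a recall and IoU of 0.5 and an F1 of about 0.67. A partial mask of this kind therefore shows up as high precision combined with low recall.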

 

 
4  Experiments 

 
4.1 Siléane Dataset Tests 

 
The evaluation was carried out using the Siléane bricks dataset, which contains both syn-

thetic and real bin-picking scenes designed to replicate cluttered industrial environments. 

The segmentation pipeline was executed across 30 frames with various object orientations, numbers of objects and positions. This provided sufficient data to benchmark the performance of the depth-aware scoring factors and to compare different configurations while keeping the computational load manageable.

 
4.1.1 Results with standalone SAM. 

 
The performance of SAM on its own is shown next. Here SAM was used as a standalone segmentation model on the Bricks subset of simulated RGB and depth data, and was run on the first 10 images. SAM generated extremely good results on the objects using only the simulated RGB data.

 
Table 1 SAM segmentation across the Siléane bricks dataset 

 IoU Precision Recall F1 

Average 0.94 0.965 0.961 0.963 

 

 
Table 2 Segmentation across image brick_0_008 using SAM for all objects in the image. 

frame_id     object_id  IoU       Precision  Recall    F1
brick_0_008  8192       0.975612  0.9992301  0.976346  0.987655
brick_0_008  16384      0.955426  0.986987   0.967615  0.977205
brick_0_008  24576      0.955789  0.9752025  0.979597  0.977395
brick_0_008  32768      0.970503  0.9781013  0.992059  0.98503
brick_0_008  40959      0.969382  1          0.969382  0.984453
brick_0_008  49151      0.800253  0.8043202  0.993721  0.889045
brick_0_008  57343      0.955275  1          0.955275  0.977126
brick_0_008  65535      0.971714  0.9948893  0.976589  0.985654

     
As observed in table 2, the results are extremely accurate and precise, which made it clear that SAM segments these scenes very accurately. These results led to the assumption that SAM, combined with depth data, would also be able to deliver accurate results in a bin picking scenario. It should be noted, however, that to arrive at these almost perfectly fitting masks, SAM generated 300 to 400 masks per scene, while there were fewer than 10 bricks to detect in any given scene. This high number of masks may also be due to the simulated nature of the dataset fed into SAM.

 
4.1.2 Results with SAM plus the custom scoring function and top most mask.  

 
The initial iteration of the scoring function used SamPredictor over the whole image to obtain the segment masks that were then filtered by the scoring pipeline. With depth and the custom scoring function applied, the following results were obtained across 30 images of the Siléane bricks dataset.

 

 
Table 3 Initial scoring pipeline results with SamPredictor across 30 images of the Siléane bricks dataset.

Average       IoU      Precision  Recall   F1
Best mask     0.22396  0.22583    0.90604  0.35505
Topmost mask  0.56760  0.8969     0.61979  0.66475

 
SAM initially performs poorly in providing masks from which the objects most suitable for bin picking can be selected. Two aspects stand out in the metrics of the masks selected by the scoring function: the low average precision and the high average recall. The topmost masks do much better overall, combining moderate average recall with high precision, which leads to a notably higher average F1 score. Table 4 below lists the scores for 6 separate images in the dataset.

 
Table 4 Subset of the topmost mask’s performances for 6 separate images in the dataset. 

TopMask IoU  TopMask Precision  TopMask Recall  TopMask F1
0.768019727  0.794388856        0.958570076     0.868790902
0.845803133  0.981549315        0.859467807     0.916460827
0.025351282  0.977272727        0.025366237     0.04944897
0.666335291  0.990063113        0.670821581     0.799761362
0.553863999  0.640509725        0.803702924     0.712886069
0.060431347  0.371879106        0.067300832     0.11397503

 
It is also noteworthy that the top mask was susceptible to irregularities in its precision and recall scores, with recall dropping to 0.03 and 0.07 in two instances. Figure 10 shows a case in which neither method accurately selects the topmost or best mask to pick. The green brick is the ground truth mask, and the bright red highlight is the generated mask that was selected after SAM's mask generation.

 

 
 Figure 10 Top mask vs the best mask, where the green brick is the ground truth mask and the bright red highlight is the generated mask that was selected after SAM's mask generation.

 
The two masks above were among the worst selections obtained from the top mask and the best mask (the mask obtained through the scoring pipeline). These values are far from the standards necessary for use in an industrial environment, so it is worth breaking down what causes such results. SAM was able to obtain extremely accurate segmentations of all the bricks in the Siléane dataset without using depth data to select the best object for bin picking. To do so, it created a substantial number of potential masks and used its own scoring system to select the masks that best fit all the objects in each image, which yielded masks that performed extremely well on all scoring metrics. However, when the masks were selected with our scoring system or by lowest overall depth in order to choose the best brick to pick, the selected masks did not fit the brick as well as hoped. The most likely culprit is the overabundance of masks that SamPredictor creates initially, which makes selecting the best object complicated and time-consuming. The mask selection process is refined next. The refinement was to use SAM's own confidence scores as a filter after mask generation with SamAutomaticMaskGenerator, which performs non-maximal suppression (Kirillov et al., 2023).

 

 
masks = mask_generator.generate(rgb)
filtered_masks = [
    m for m in masks
    if m.get("predicted_iou", 0) > 0.8 and m.get("stability_score", 0) > 0.9
]
if not filtered_masks:
    continue  # inside the per-image loop: skip images with no confident masks

Figure 11 Filtering of the generated masks using SAM's confidence and stability scores.

 
With these confidence scores, the masks fed into the scoring system are of higher quality and have a better chance of aligning with the objects in the image. The evaluation metrics reflected this change after the initial filtering. It should also be noted that SAM's generate() function includes non-maximal suppression (NMS) to remove almost identical masks (Kirillov et al., 2023).

 
Table 5 Performance of the masks after the use of SAM's confidence scores and NMS on the 30 Siléane bricks dataset images.

Average       IoU       Precision  Recall    F1
Best mask     0.783426  0.880086   0.876465  0.870563
Topmost mask  0.505799  0.947529   0.537752  0.580217

 
Comparing table 5 with the performance of the scoring system before SAM's confidence levels were used as an initial filter (table 3), the best mask improves by more than 100% on every metric except recall, which decreases slightly. While the performance of the topmost mask has decreased slightly, the best masks now perform better and more consistently than the topmost mask did before the SAM filtering process.
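The size of the improvement for the best mask can be verified directly from the averages reported in tables 3 and 5. The short calculation below is only an arithmetic check. The variable names are illustrative.

```python
# Best mask averages before (table 3) and after (table 5) the SAM confidence filter.
before = {"IoU": 0.22396, "Precision": 0.22583, "Recall": 0.90604, "F1": 0.35505}
after = {"IoU": 0.783426, "Precision": 0.880086, "Recall": 0.876465, "F1": 0.870563}

# Relative change in percent for each metric.
rel_change = {k: 100.0 * (after[k] - before[k]) / before[k] for k in before}

for name, value in rel_change.items():
    print(f"{name}: {value:+.1f} %")
```

This gives increases of roughly 250% for IoU, 290% for precision and 145% for F1, while recall falls by about 3%.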

 
Next the improved filtering system is run through an ablation study to find out what 

aspects of the scoring system are responsible for most of the performance gains. 

 

 
4.2 Ablation Study 

 
The ablation study starts with only the depth values in the custom scoring system and adds each aspect of the scoring system one at a time, monitoring how the performance changes with each addition. The weights used for each run of the ablation test are given in figure 12 below.

 
ABLATION_CONFIGS = {
    "depth_only": {"depth": 30.0, "var": 0.0, "area": 0.0, "shape": 0.0,
                   "border": 0.0, "sam_iou": 0.0, "sam_stability": 0.0},
    "depth_var": {"depth": 30.0, "var": 0.5, "area": 0.0, "shape": 0.0,
                  "border": 0.0, "sam_iou": 1.0, "sam_stability": 1.0},
    "depth_var_area": {"depth": 30.0, "var": 0.5, "area": 5.0, "shape": 0.0,
                       "border": 0.0, "sam_iou": 1.0, "sam_stability": 1.0},
    "depth_var_area_shape": {"depth": 30.0, "var": 0.5, "area": 5.0, "shape": 1.5,
                             "border": 0.0, "sam_iou": 1.0, "sam_stability": 1.0},
    "depth_var_area_shape_border": {"depth": 30.0, "var": 0.5, "area": 5.0, "shape": 1.5,
                                    "border": 6.0, "sam_iou": 1.0, "sam_stability": 1.0},
}

Figure 12 Ablation configurations for the Siléane bricks dataset. 

 
4.2.1 Results of the performance considering only the depth results. 

 
Table 6 Best mask performance with depth alone. 

 IoU Precision Recall F1 

AVERAGE 0.505799 0.947529 0.537752 0.580217 

 
The results show the performance of the scoring system over 30 scenes in the Siléane bricks dataset. Clearly, depth alone does not handle the mask selection well. The performance is identical to that of the top mask selection, which is expected: applying only a depth penalty and selecting the mask with the lowest score is the same as selecting the mask with the least depth. It can again be observed that while depth is an important aspect of the mask selection process, it cannot be the sole criterion. As the first two images in figure 13 show, an object at the border of the image was selected as the object to be picked. This improves in the following steps.

 
  Figure 13 Depth only best mask vs Topmost mask.  

 

 
4.2.2 Results of the performance considering the depth and depth variance results 

 
Table 7 Best mask performance with depth and depth variance. 

 IoU Precision Recall F1 

AVERAGE 0.471308 0.944517 0.505817 0.552507 

 
With the variance in depth taken into account, the pipeline tends to select smaller masks, although it still focuses on the masks with the least depth. It also does not focus on the topmost object as much because of the added variance term. This factor can influence how the scoring system needs to be tuned for a real-world dataset: while some datasets may benefit from a penalty on depth variance, the bricks, due to their specific shape, may do better without much influence from it. Figure 14 below shows a few of the top candidate masks for one image in the dataset. It should be noted that masks of some cutouts (circles) of the bricks were selected in the later images of figure 14, because the area of the mask is not yet considered.

 

 
Figure 14 Depth + variance best masks selection ranked for one image.  

 
4.2.3 Results of the performance considering the depth, depth variance and area results.

 
Table 8 Best mask performance with depth, variance and area. 

 IoU Precision Recall F1 

AVERAGE 0.748533 0.832725 0.883677 0.847803 

 

 
  Figure 15 Depth + variance + area best masks selection ranked for one image.  

 
Although area was given less weight than depth, its effect can be observed in the ranking: the rank 1 mask has the highest area score (the worst of the 4), but its depth is much lower, making it the most suitable at this stage. It is also worth noting that the introduction of the area penalty led to the selection of masks that closely match the objects they belong to.

 

 
4.2.4 Results of the performance considering the depth, depth variance, area and shape results.

 
Table 9 Best mask performance with depth, variance, area and shape. 

 IoU Precision Recall F1 

AVERAGE 0.762948 0.845712 0.888367 0.857336 

          
While no major improvements were seen, IoU, precision and F1 all improved by approximately 0.01. The shape penalty also favors more regular, well-rounded shapes in mask selection. It is therefore reasonable to assume that these improvements come from more refined masks that stay within the borders of the corresponding ground truth masks, while in terms of bin picking no major changes were seen in which object was selected.

 
4.2.5 Results of the performance considering the depth, depth variance, area, shape and border closeness results.

 
Table 10 Performance with the full best mask pipeline 

 IoU Precision Recall F1 

AVERAGE 0.783426 0.880086 0.876465 0.870563 

 
Figure 16 shows the effect that adding the border penalty to the scoring function has on the segment selection for the same image.



 
 Figure 16 Segmentation without (left) and with (right) border penalty, giving the mask which is 

away from the border for easier grasping from a robotic arm.  

 
With the addition of the border penalty, IoU, precision and F1 again increased, by roughly 0.01 to 0.03, while recall decreased by about 0.01. This is possibly due to cleaner masks being available away from the border. More importantly, it gives the selected object a better chance of being picked, because a robotic arm is easier to maneuver away from the borders of a bin and the entire object is in full view in the frame. The takeaway is that the weight of the border penalty needs to be high enough to ensure that the objects favored by the rest of the elements in the scoring system do not lie at the edge of the image.

 

 
Finally let’s consider all the results of the ablation study overall.  

 
Table 11 Segmentation performance across all ablation settings in the best mask scoring pipeline and topmost masks

Scoring parameters           IoU       Precision  Recall    F1
depth                        0.505799  0.947529   0.537752  0.580217
depth/var                    0.471308  0.944517   0.505817  0.552507
depth/var/area               0.748533  0.832725   0.883677  0.847803
depth/var/area/shape         0.762948  0.845712   0.888367  0.857336
depth/var/area/shape/border  0.783426  0.880086   0.876465  0.870563
Topmost mask                 0.505799  0.947529   0.537752  0.580217

 
All the results of the ablation study shown in table 11 were obtained after prefiltering with the SAM confidence scores. Starting with only the depth, the results were identical to selecting the mask with the least median depth, as expected. When the depth variance penalty was introduced, the segmentation performance dropped slightly. It must be considered, however, that in a compact environment SAM is more likely to combine regions at different depth levels into one mask, so a depth variance filter is still useful in general, even though for this dataset its weight may need to be lowered or set to zero. Adding the area penalty on top of the variance penalty brought the performance of the scoring pipeline to respectable levels. Since the area filter is a manually set range, an option was also added to score the area against the average area of the ground truth masks of the object to be picked: the further a mask's area is from the average ground truth area of that dataset, the higher the penalty. The shape score increases the performance of the scoring system by helping it select smoother masks. With the border score the performance again did not change massively. However, although the segmentation was already at a good level, selecting masks from the central areas of the bin gives the robotic arm more room to maneuver when picking, so the weight of the border score was increased for the final scoring pipeline. The depth penalty weight was also increased. The following section investigates how the object selection has improved with regard to bin picking.

 
4.2.6 Final scoring system and an Intuitive look in terms of bin picking. 

 
{"depth": 30.0, "var": 0.0, "area": 5.0, "shape": 1.5, "border": 10.0, "sam_iou": 0.5, "sam_stability": 0.5}

Figure 17 Final weights for the scoring pipeline 

 
Figure 18 Brick 32/36 best vs topmost mask selection comparison. 

 
The changes made to the scoring system (figure 17) have moved the selected segmentation masks for bin picking further away from the borders of the image, as seen in figure 18. The biggest issue with the topmost mask selection is that the selected masks frequently do not cover the whole object and instead lock onto a minor part of it. This mostly does not occur with the best mask strategy because of the area restrictions. The issue that the best mask selection pipeline faces at times is the selection of masks that span two or more objects, as seen in figure 19. This needs to be looked into further in the future.

 
Figure 19 Brick 40 and 42 best mask segmentation. 

 
However, the updated scoring also moves the selected segment towards the center of the image in most situations, which is important. This can be seen more clearly in figure 20 below when compared with the topmost mask.

 

 
Figure 20 Brick 43 and 52 best vs topmost mask, showing that the best mask consistently tries to strike a balance between lowest depth and distance from the border, while the topmost mask tends to select any mask with the least depth.

 
Another aspect that is visible here is that even though the pipeline is not perfect in iden-

tifying objects it does perform well in most instances, as seen in the best depth images. 

All the original depth images of both the Siléane and the captured datasets will be up-

loaded into an online database to be accessed, and links are available in. 

 
In the following section the study focuses on real world datasets. 

 

 
4.3 Real-World Demo 

 
For a qualitative analysis of how SAM performs on real-world data obtained with an off-the-shelf RealSense RGB-D camera, 20 images were taken of reflective metal objects that were identical in shape and size, in mixed orientations. The best object for a robot to pick (judged intuitively) was selected manually and marked. The masks selected as the topmost mask and as the best mask were then overlaid, the overlap was compared visually and the result was documented. This provides a clearer understanding of how SAM behaves in bin picking scenarios and environments that are far from ideal. It should be noted that these images were taken under room lighting conditions and are less controlled and calibrated than a factory environment. The masks in blue are the segmentation masks selected by the scoring system and the masks in green are the top masks. The dot pattern visible in a pink hue comes from the IR dots of the depth measurement, which are visible to the camera alone.

 
Figure 21 Best and topmost mask overlays, the object masks in blue would be the selected seg-
mentation mask by the scoring system and masks in green are the top masks.  

      
From the images in figure 21, the topmost mask can at times overlap fully with the best mask, but it cannot be relied upon to cover the full object. Another positive observation for the best mask is that on none of the 20 images did the segmentation combine two or more objects. As was evident in the Siléane bricks dataset, the topmost mask is at a disadvantage when it comes to selecting the correct object mask, especially when there are distinct shapes on the object itself, as seen below in figure 22.

 
Figure 22 Improper segmentation for the top mask (in green) 

 
It should also be noted that during data collection there were multiple masks for which no depth data was available, which might be due to the material of the objects or to device inaccuracy. To make sure that most masks have a median depth value, the Navier-Stokes inpainting algorithm provided by the OpenCV library was used to fill in the missing depth information (Bertalmio et al., n.d.).

 

 
vals = depth_meters[mask]
vals = vals[vals > 0]

if len(vals) == 0:
    # Fallback: inpaint the missing depth, then re-read the mask pixels
    depth_filled = cv2.inpaint(
        (depth_meters * 1000).astype(np.uint16),  # convert to mm for inpainting
        (depth_meters == 0).astype(np.uint8),     # mask of invalid pixels
        3,                                        # inpainting radius in pixels
        cv2.INPAINT_NS                            # Navier-Stokes method
    ).astype(np.float32) / 1000.0                 # back to meters

    vals = depth_filled[mask]
    vals = vals[vals > 0]

Figure 23 Missing depth value filling using the Navier-Stokes algorithm provided by the OpenCV library.

 
Figure 24 shows the grayscale depth images. The areas in black are those with missing depth values that need to be filled in.

Figure 24 Grayscale depth images of images 18 and 19 of the real-world dataset collected using the RealSense RGB-D camera. Darker shades are closer to the camera, except for pure black areas, which indicate 0 mm depth or no valid depth reading.

 

 
5 Discussion 

 
5.1 Summary of findings. 

 
SAM by itself does an excellent job of creating segmentation masks, as seen previously for the bricks subset of the Siléane dataset. To determine which attributes bring out the best segmentation mask for a bin picking scenario, a scoring pipeline was designed and run across all the masks that SAM created, and its performance was compared against a much simpler lowest median depth filter (the topmost mask). The results favored the simpler method of filtering until the introduction of the shape penalty. While the performance of the lowest median depth filter was not acceptable, it was almost twice as good as the scoring pipeline in quantitative terms before the introduction of the shape penalty. This was mostly due to the overabundance of masks created by the SAM model, which required a more precise and more complicated scoring pipeline. Therefore, the same filters were run on the masks after they had been filtered by SAM's own confidence and stability scores, together with the added shape penalty. This gave much improved quantitative results in IoU, precision and F1, with recall remaining high. While the scores showed that the scoring pipeline was doing better than the lowest median depth filter, comparing the overlay images showed that the topmost mask, despite its lower scores, tended to select more logical candidates for bin picking. Closer analysis also showed that although the topmost mask generally gave outputs that were more suitable for bin picking, it also gave a few masks at the edge of the image frame and a few that covered only part of the selected object. To counteract this, the depth, area and border penalties were increased in the scoring pipeline.

This became clearer when the pipeline was tested on real-world data collected with our RealSense RGB-D camera. While the scoring system generally provides logical mask selections for bin picking, the topmost mask tends to select any mask, which can lead to a partial mask of the topmost object.

 
5.2 Interpretation of the Results 

 
The objective of this study is to explore ways of using a zero-shot segmentation method such
as SAM and to find an uncomplicated path to using those segmentations in a bin picking use
case. The starting point was to study how depth alone can be used for this purpose. While
testing the in-house RGB-D camera to detect the topmost point, a few outliers appeared, which
led to the use of median depth in selecting the topmost mask. This increased the stability of
the depth scores, because the median depth was calculated for each mask instead of choosing
the mask containing the single smallest depth measurement. Using depth alone had another
major flaw: when SAM analyzes an image, it outputs multiple candidate segmentation masks,
including masks that cover only parts of full objects. In this situation there is a high
likelihood that the smallest depth is found on one of these incomplete masks, which happened
often, as observed in Figure 12. This issue was not fully resolved even when SAM's own
confidence scores were used to filter for the best masks. Further tests made it clear that
more factors needed to be integrated into the selection of an object that is to be picked by
a robotic arm in the most efficient way possible.
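The difference between a single minimum depth reading and a per-mask median can be illustrated with a short sketch. It assumes that each candidate is a dictionary with the keys `segmentation`, `predicted_iou` and `stability_score`, as produced by SAM's automatic mask generator, and that a depth value of 0 marks a missing reading. The function names, the threshold parameters and the toy depth map are hypothetical and are not the thesis implementation.

```python
import numpy as np

def median_depth(mask, depth):
    """Median of the valid depth readings under a boolean mask."""
    values = depth[mask]
    values = values[values > 0]  # assumption: 0 means no depth reading
    return float(np.median(values)) if values.size else float("inf")

def select_topmost(masks, depth, min_pred_iou=0.0, min_stability=0.0):
    """Index of the candidate with the smallest median depth.

    Candidates below the (hypothetical) confidence or stability
    thresholds are skipped. Returns None if nothing qualifies.
    """
    best_idx, best_depth = None, float("inf")
    for i, m in enumerate(masks):
        if m["predicted_iou"] < min_pred_iou or m["stability_score"] < min_stability:
            continue
        d = median_depth(m["segmentation"], depth)
        if d < best_depth:
            best_idx, best_depth = i, d
    return best_idx

# Toy scene (hypothetical): object A is nearer (0.6 m) than object B (0.9 m),
# but B contains one spurious 0.1 m reading.
depth = np.empty((4, 4))
depth[:, :2] = 0.6
depth[:, 2:] = 0.9
depth[0, 3] = 0.1  # outlier

mask_a = np.zeros((4, 4), dtype=bool)
mask_a[:, :2] = True
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[:, 2:] = True
masks = [
    {"segmentation": mask_a, "predicted_iou": 0.95, "stability_score": 0.97},
    {"segmentation": mask_b, "predicted_iou": 0.93, "stability_score": 0.96},
]

by_min = int(np.argmin([depth[m["segmentation"]].min() for m in masks]))
by_median = select_topmost(masks, depth)
```

With the single-minimum rule the outlier makes object B appear topmost, whereas the median rule selects object A. The sketch does not address the second flaw described above: a partial mask lying on the nearest surface would still win on median depth, which is why further factors were needed.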

The first issue other than depth that arose during the testing phase was SAM's selection of
large flat areas, where walls and floor areas were treated as objects. To counter this, a
fixed area penalty was introduced for masks exceeding a set percentage of the total image
area. Keeping this threshold around 15% usually prevented an overly large mask from being
selected. This strategy was effective in the tests on the Siléane bricks dataset, as the
objects appeared to have a consistent average area and the masks were clean owing to the
synthetic nature of the RGB data. The masks created for the real-world datasets differed
significantly, even with the use of SAM's confidence scores. In addition, in the initial
ablation test, adding the area penalty reduced the segmentation performance of the masks.
This was because the score had been designed so that the smaller the mask, the smaller the
penalty it received. The design was changed so that a mask receives a penalty only when it
exceeds the set limit. Since the scoring system needs to be used with various object sizes in
different poses, the average area of the ground truth of the real-world target object was selected as the

benchmark of the area penalty. Therefore, this can be used as simple pretraining process 

in the s