1. Introduction

… of bimanual collaboration. Consequently, the policy must account for geometry, contact-rich assembly processes, and bimanual coordination.

We propose our BiAssemble framework for this challenging task. For geometric awareness, we utilize point-level affordance, which is trained to focus on local geometry. This approach has demonstrated strong geometric generalization in diverse tasks (Wu et al., 2022; 2023b), including short-term bimanual manipulation (Zhao et al., 2022), such as pushing a box or lifting a basket. To enhance the affordance model with an understanding of subsequent long-horizon bimanual assembly actions, we draw inspiration from how humans intuitively assemble fragments: after picking up two fragments, we align them at the seam, deliberately leaving a gap (since directly placing them in the target pose often causes geometric collisions), with the part poses at this stage denoted as alignment poses. We then gradually move the fragments toward each other to fit them together precisely. The alignment poses of the two fragments can be obtained by disassembling the assembled parts in opposite directions (see the sketch at the end of this section). With this information, it becomes straightforward to extend the geometry-aware affordance to further be aware of whether the controller can move the fragments into the alignment poses without collisions.

We develop a simulation environment in which robots can be controlled to assemble broken parts. This environment bridges the gap between vision-based pose prediction for broken parts and real-world robotic geometric assembly. Moreover, since broken parts exhibit varied geometries (e.g., the same bowl falling from different heights breaks into different groups of fragments), it is challenging to fairly assess policy performance in real-world settings. To address this, we further introduce a real-world benchmark featuring globally available objects with reproducible broken parts, along with their corresponding 3D meshes, which can be integrated into the simulation environment. This benchmark enables consistent and fair evaluation of robotic geometric assembly policies. Extensive experiments on diverse categories demonstrate the superiority of our method both quantitatively and qualitatively.
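To make the alignment-pose construction above concrete, the following minimal Python sketch derives alignment poses from the ground-truth assembled poses of two fragments: each fragment is pulled back from its assembled pose along a shared collision-free disassembly axis, in opposite directions, leaving a small gap at the seam. This is an illustrative sketch under stated assumptions, not the paper's implementation; all names (alignment_poses, disassembly_axis, gap) are hypothetical, and how the disassembly axis is found (e.g., by trial separation in simulation) is left abstract.

```python
import numpy as np

def alignment_poses(T_a, T_b, disassembly_axis, gap=0.02):
    """Derive alignment poses from assembled poses (illustrative sketch).

    T_a, T_b          : 4x4 homogeneous assembled poses of the two fragments.
    disassembly_axis  : 3-vector along which the parts separate without
                        colliding (assumed known, e.g., found in simulation).
    gap               : distance (meters) each fragment backs off at the seam.
    """
    axis = np.asarray(disassembly_axis, dtype=float)
    axis /= np.linalg.norm(axis)

    def shifted(T, direction):
        # Translate the assembled pose in the world frame along `direction`.
        offset = np.eye(4)
        offset[:3, 3] = gap * direction
        return offset @ T

    # Fragment A backs off along +axis, fragment B along -axis, so the final
    # "fit" motion is a pure translation of the fragments toward each other.
    return shifted(T_a, axis), shifted(T_b, -axis)

# Toy usage: two fragments assembled 10 cm apart along x.
T_a = np.eye(4)
T_b = np.eye(4); T_b[0, 3] = 0.10
Ta_align, Tb_align = alignment_poses(T_a, T_b, disassembly_axis=[1, 0, 0])
```

Under this construction, checking whether the controller can reach the alignment poses without collisions reduces to validating two straight-line translations, which is the feasibility signal the affordance model is extended to capture.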
2. Related Work

2.1. 3D Shape Assembly

Shape assembly is a well-established problem in visual manipulation, with many studies focusing on constructing a complete shape from given parts, typically involving pose prediction of each part for accurate placement (Zhan et al., 2020; Wang et al., 2022a). Further work (Heo et al., 2023; Tian et al., 2022; Jones et al., 2021; Willis et al., 2022; Tie et al., 2025) studied assembly with robotic execution, requiring robots to carry out each step, with benchmarks spanning applications from home furniture assembly (Lee et al., 2021) to factory-based nut-and-bolt interactions (Narang et al., 2022). We categorize these tasks into two types: furniture assembly and geometric assembly. This paper focuses on geometric assembly, where pieces are irregular and lack semantic definitions, such as the fragments of a broken bowl, making categorization difficult. This contrasts with furniture assembly, which involves parts such as nuts, bolts, and screws, each with specific functions and clear categorization.

Previous work on geometric assembly (Sellán et al., 2022; Chen et al., 2022; Lu et al., 2024c) includes Wu et al. (2023c), which learns SE(3)-equivariant part representations by capturing part correlations for assembly, and Lee et al. (2024), which introduces a low-complexity, high-order feature transform layer that refines feature-pair matching. However, these methods primarily focus on synthesizing parts into a cohesive object based on pose considerations alone, without incorporating robotic execution; this is impractical in real-world scenarios, where collisions may occur if the assembly process ignores actions. To tackle this challenge, we introduce a robotic bimanual geometric assembly framework that leverages two robots to collaboratively assemble pieces, enhancing stability in real-world execution.

2.2. Bimanual Manipulation

Bimanual manipulation (Chen et al., 2023; Grannen et al., 2023; Mu et al., 2021; Chitnis et al., 2020; Lee et al., 2015; Xie et al., 2020; Ren et al., 2024b; Liu et al., 2024; 2022; Li et al., 2023; Mu et al., 2024) offers several advantages, particularly in tasks requiring stable control or a wide action space. Current research primarily focuses on planning and collaboration. ACT (Fu et al., 2024; Zhao et al., 2023) introduces a transformer-based encoder-decoder architecture that leverages semantic knowledge from image inputs to predict bimanual actions. PerAct2 (Grotz et al., 2024) learns features at both the voxel and language levels, utilizing shared and private transformer blocks to coordinate two robotic arms based on semantic instructions. However, in tasks rich in geometric complexity, where objects have limited semantic …