Manual2Skill++

TL;DR: Manual2Skill++ extracts connector-aware graph representations from assembly manuals to enable diverse robot assembly tasks.

Abstract

Assembly success hinges on precise operations with connectors such as mortise-tenon joints, dowels, and screws. However, most robotic assembly research has focused on high-level planning or pose estimation, treating connectors as an afterthought. Manual2Skill++ systematically extracts connector-aware representations directly from instruction manuals by transforming manual images into a hierarchical graph that explicitly models connector types, quantities, and spatial relationships. We then apply a training-free, constraint-based part alignment module to achieve millimeter-level accuracy, followed by search-based motion planning to bring each connector to its target pose and insert it. We validate our approach on a dataset of 21 diverse assembly tasks spanning IKEA furniture, Lego models, and mechanical assemblies, and demonstrate four representative tasks in Isaac Gym simulation and physical setups, achieving consistent millimeter-level placement accuracy.
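To make the hierarchical graph idea concrete, here is a minimal Python sketch of a connector-aware assembly tree. It is an illustration only: the class names (`Connection`, `AssemblyNode`), the helper functions, and the toy stool are all assumptions and do not reflect the actual Manual2Skill++ data format.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Connection:
    """Hypothetical record of how two children of a node are joined."""
    part_a: str
    part_b: str
    connector_type: str  # e.g. "dowel", "screw", "mortise-tenon"
    quantity: int


@dataclass
class AssemblyNode:
    """A (sub)assembly; leaf parts are represented as plain strings."""
    name: str
    children: List[Union["AssemblyNode", str]]
    connections: List[Connection] = field(default_factory=list)


def assembly_steps(node):
    """Post-order traversal: subassemblies are built before their parent."""
    steps = []
    for child in node.children:
        if isinstance(child, AssemblyNode):
            steps.extend(assembly_steps(child))
    steps.append((node.name, node.connections))
    return steps


def connector_totals(node):
    """Sum connector quantities per type over the whole tree."""
    totals = {}
    for _, connections in assembly_steps(node):
        for c in connections:
            totals[c.connector_type] = totals.get(c.connector_type, 0) + c.quantity
    return totals


# Toy example (made up for illustration): a stool built from a frame and a seat.
frame = AssemblyNode(
    "frame",
    ["leg_left", "leg_right", "rail"],
    [
        Connection("leg_left", "rail", "dowel", 2),
        Connection("leg_right", "rail", "dowel", 2),
    ],
)
stool = AssemblyNode("stool", [frame, "seat"], [Connection("frame", "seat", "screw", 4)])

step_names = [name for name, _ in assembly_steps(stool)]
totals = connector_totals(stool)
```

In this sketch, each internal node stores the connections that join its own children, so a post-order traversal yields both an assembly order and the connectors needed at each step.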

Overview

Figure 1: (1) A VLM processes manual images to extract part and connector relations, generating a connection-enriched hierarchical assembly representation. (2) The extracted connection constraints guide a geometric optimization process to compute precise target poses for parts and connectors. (3) The system executes the assembly by planning and performing robotic actions based on the connection-enriched hierarchical graph and aligned poses.
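As a rough illustration of how connection constraints can determine a part pose in stage (2), the sketch below solves a simplified 2D version of the problem: given connector locations on a part (for example, dowel holes) and the locations they must match on the mating part, it computes the least-squares rigid transform in closed form. This is a stand-in under stated assumptions, not the optimization used by Manual2Skill++; the function names and the sample coordinates are hypothetical.

```python
import math


def align_2d(source, target):
    """Least-squares rotation + translation mapping source points onto target points."""
    n = len(source)
    sx = sum(p[0] for p in source) / n
    sy = sum(p[1] for p in source) / n
    tx = sum(p[0] for p in target) / n
    ty = sum(p[1] for p in target) / n

    # Accumulate dot and cross terms of the centered point pairs.
    dot = 0.0
    cross = 0.0
    for (ax, ay), (bx, by) in zip(source, target):
        ax, ay, bx, by = ax - sx, ay - sy, bx - tx, by - ty
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    # The optimal angle maximizes cos(theta) * dot + sin(theta) * cross.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)

    # Translation that moves the rotated source centroid onto the target centroid.
    dx = tx - (c * sx - s * sy)
    dy = ty - (s * sx + c * sy)
    return theta, (dx, dy)


def apply_transform(theta, translation, point):
    """Rotate a 2D point by theta, then translate it."""
    c, s = math.cos(theta), math.sin(theta)
    return (
        c * point[0] - s * point[1] + translation[0],
        s * point[0] + c * point[1] + translation[1],
    )


# Hypothetical hole positions in meters: on the part, and where they must end up.
part_holes = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.05)]
target_holes = [(0.5, 0.2), (0.5, 0.3), (0.45, 0.3)]

theta, translation = align_2d(part_holes, target_holes)
residuals = [
    math.dist(apply_transform(theta, translation, p), q)
    for p, q in zip(part_holes, target_holes)
]
```

For this toy input the target holes are an exact rotated and shifted copy of the part holes, so the residuals are numerically zero. With noisy correspondences the same formula returns the best rigid fit in the least-squares sense; a full 3D system would need a 3D analogue plus any additional constraints, such as insertion axes.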

Dataset & Benchmark

Figure 2: The dataset comprises 21 representative assembly tasks, each with high-fidelity 3D part models, manual pages for every step, and fully annotated connection information. From this collection, we selected four tasks and implemented them in Isaac Lab, designing connection mechanics to enable long-horizon assembly with explicit connector operations that directly correspond to real-world procedures.

Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models

Gallery of Benchmark Tasks