Paper link: https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_CPF_Learning_a_Contact_Potential_Field_To_Model_the_Hand-Object_ICCV_2021_paper.pdf

1. Introduction

To model the contact, we propose an explicit representation named Contact Potential Field (CPF, §4). It is built upon the idea that the contact between a hand and an object mesh under grasp configuration is multi-point contact, which involves multiple hand-object vertex pair affinities. These affinities are regarded as the contact semantics, which depict the pairing of the hand-object vertices that come into contact with each other during the interaction. When noisy predicted hand and object are disjointed from each other, we shall apply an attraction to pull these vertex pairs close;

To model contact, the paper proposes an explicit representation called the Contact Potential Field (CPF). It builds on the idea that, under a grasp configuration, the contact between a hand and an object mesh is a multi-point contact involving multiple hand-object vertex pair affinities. These affinities are treated as the contact semantics: they describe which hand and object vertices pair up and touch each other during the interaction. When the noisy predicted hand and object are separated from each other, an attraction is applied to pull these vertex pairs together.

While the hand and object are intersected, we shall have a repulsion to push them away. Contacts of those affinitive vertex pairs are the result of equilibrium between the attraction and repulsion. In this paper, we treat each contacting HO vertex pair as a spring-mass system.

When the hand and object intersect, a repulsion pushes them apart. The contact of these affinitive vertex pairs is the equilibrium between attraction and repulsion. The paper treats each contacting HO vertex pair as a spring-mass system.

First, the two end-points of spring is a counterpart of the two HO vertices in affinity. Second, the spring’s elastic property is another counterpart of the intensity of the vertex pair affinity. In this way, we can model the HO interaction with a potential field, as we call it CPF, which is determined by minimal elastic energy at the grasp position. Therefore, estimating the HO pose under contact is equivalent to minimizing the elastic energy inside CPF.

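First, the two end-points of the spring correspond to the two HO vertices in an affinity. Second, the spring's elastic property corresponds to the intensity of the vertex pair affinity. In this way the HO interaction can be modeled with a potential field, called CPF, which is determined by the minimal elastic energy at the grasp position. Estimating the HO pose under contact is therefore equivalent to minimizing the elastic energy inside the CPF.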

First, compared with contact heuristic with proximity metrics [1, 52] or distance field [26, 6], CPF is able to assign per-vertex contact semantics (contact points on different hand part) to object mesh. Second, by minimizing the elastic energy, CPF can uniformly avoid interpenetration and control the disjointedness. Based on CPF, we also propose a novel learning-fitting hybrid framework namely for Modeling the Interaction of Hand and Object, as we call it MIHO (§5).

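First, compared with contact heuristics based on proximity metrics or distance fields, CPF can assign per-vertex contact semantics (contact points on different hand parts) to the object mesh. Second, by minimizing the elastic energy, CPF can avoid interpenetration and control disjointedness in a uniform way. Based on CPF, the authors also propose a new learning-fitting hybrid framework for modeling the interaction of hand and object, called MIHO.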

Another problem with the existing methods is the representation of the hand model. Most researches adopted a skinning model, MANO [47], to represent hand. MANO is considered to be flexible and deformable with its pose and shape parameters. However, fitting on these high DoFs parameters is prone to anatomical abnormality. To make the best of both worlds, we propose a novel anatomically constrained hand model namely A-MANO (§3). It inherits the formulation of the skinning model and constrains the hand joints’ rotation within a proposed twist-splay-bend frame (Fig. 2).

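Another problem with existing methods is the representation of the hand model. Most work adopts the skinning model MANO to represent the hand. MANO is considered flexible and deformable through its pose and shape parameters, but fitting these high-DoF parameters easily produces anatomical abnormalities. To get the best of both worlds (robotics and CV/CG applications), the authors propose A-MANO, a new anatomically constrained hand model. It inherits the formulation of the skinning model and constrains the rotation of the hand joints within the proposed twist-splay-bend frame.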

For evaluation, we report our scores on FHB [18] and HO3D [22, 21] dataset in terms of reconstruction and physical quality metrics. Note that, the ground truth of FHB is noisy and suffers from severe interpenetration [26]. Since our method can avoid the penetration in the first place, our results are more visually and physically plausible. Therefore, we argue that, in this dataset, a higher reconstruction score does not necessarily benchmark the performance of the method. While on HO3D, we achieve state-of-the-art performance on both reconstruction and physical metrics. The contributions of this paper are as follows.

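For evaluation, scores are reported on the FHB and HO3D datasets in terms of reconstruction and physical quality metrics. The GT of FHB is noisy and suffers from severe interpenetration. Because the paper's method avoids penetration in the first place, its results are more plausible both visually and physically. The authors therefore argue that, on this dataset, a higher reconstruction score does not necessarily benchmark the performance of a method. On HO3D, the method achieves SOTA performance on both reconstruction and physical metrics.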

Contribution

We highlight contact in the hand-object interaction modeling task by proposing an explicit representation named CPF.

The paper highlights contact in the hand-object interaction modeling task by proposing an explicit representation called CPF.

We introduce A-MANO, a novel anatomical-constrained hand model that helps to mitigate pose’s abnormality during optimization.

It introduces A-MANO, a new anatomically constrained hand model that helps mitigate pose abnormality during optimization.

We present a novel framework, MIHO, for modeling hand-object interaction. It can achieve state-of-the-art performance on several benchmarks.

The paper presents MIHO, a new framework for modeling hand-object interaction, which achieves SOTA on several benchmarks.

3. Anatomically Constrained A-MANO

The proposed A-MANO inherits MANO, a parametric skinning hand model driven by the pose parameters $\theta$ and the shape parameters $\beta$. Here $\theta \in \mathbb{R}^{15\times3}$ holds the 15 joint rotations along the hand kinematic tree, and $\beta \in \mathbb{R}^{10}$ holds the PCA components of the hand shape. The main differences between A-MANO and MANO are:

1) constraints on the rotation axis and angle of each joint within the twist-splay-bend frame

2) an anchor representation on the subdivision of the hand regions

The Twist-splay-bend Frame.

Fitting on 15 joint rotations of MANO requires high DoFs regression which may cause abnormal hand posture as shown in Fig. 7. Since the human hand can be modeled in a kinematic tree, and the majority of the joints only have one DoF about the bend axis, we can impose constraints over the rotation about the unwanted axes. Therefore the proposed twist-splay-bend Cartesian coordinate frame can be assigned to each joint along the kinematic tree. The frame’s x, y, z axes are coaxial to the 3 revolute directions: twist, splay, and bend direction on the basis of hand anatomy (Fig. 2 right). Then we can impose axial constraints in the twist and splay axes, and impose angular constraints w.r.t the bend angle.

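Fitting the 15 joint rotations of MANO requires high-DoF regression, which can cause abnormal hand postures (Fig. 7). Since the human hand can be modeled as a kinematic tree and most joints have only one DoF, about the bend axis, rotation about the unwanted axes can be constrained. The proposed twist-splay-bend Cartesian coordinate frame is therefore assigned to each joint along the kinematic tree (each joint gets its own twist, splay, and bend directions). The frame's x, y, z axes are coaxial with the twist, splay, and bend directions. Axial constraints are then imposed on the twist and splay axes, and an angular constraint is imposed on the bend angle.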

Anchors.

Since the hand mesh of different subjects are almost identical in the subdivision of hand region (e.g. phalanges), we can interpolate several representative points (later we call it anchors) on hand mesh to largely reduce the number of HO vertex pairs. Instead of attaching springs from object mesh to all the affinitive vertices on hand mesh, we only attach them on the several hand subregion centers, as we call it anchors (Fig. 2 left). According to the statistics [24, 7] on the contact frequency of different hand parts, we first divide the full hand palm into 17 subregions: 3 for each phalange of 5 fingers, 1 for metacarpals, and another for carpals. Then, we interpolate up to 4 anchors for each subregion. We ignore all the vertices on the back side of the hand.

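Since the subdivision of hand regions is almost identical across the hand meshes of different subjects, several representative points (anchors) can be interpolated on the hand mesh to greatly reduce the number of HO vertex pairs. Instead of attaching springs from the object mesh to every affinitive vertex on the hand mesh, springs are attached only to a few hand subregion centers, the anchors (Fig. 2 left). Following statistics on the contact frequency of different hand parts, the full palm side of the hand is first divided into 17 subregions: 3 for the phalanges of each of the 5 fingers, 1 for the metacarpals (the palm bones), and 1 for the carpals (the wrist). Up to 4 anchors are then interpolated for each subregion. All vertices on the back of the hand are ignored.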

4. Contact Potential Field

Contact as Spring-Mass System

A single contact is modeled as a spring-mass system which consists of a spring and two mass points on each side (hand and object). When the spring is at its rest position, it does not store energy, whilst it is stretched or compressed, according to Hooke’s Law, it will store the elastic potential energy with the form: $\frac{1}{2}k|\Delta l|^2$ , where k is the spring elasticity, and $|\Delta l|$ is a certain “distance” metric w.r.t. the spring’s rest position. In CPF, we define two types of spring: attractive spring and repulsive spring. The goal of attractive spring is to pull the hand vertex $v^h$ toward the object vertex $v^o$ based on a given HO vertex pair affinity. And the goal of repulsive spring is to push the $v^h$ away from $v^o$ along the $v^o$ ’s normal if the $v^h$ is in the vicinity of $v^o$ . Apart from these definitions, we should also point out that the attractive spring is bound with a certain pair of HO vertex affinity, while the repulsive spring only takes effect in the neighborhood of HO vertex pairs at some point.

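A single contact is modeled as a spring-mass system consisting of a spring and two mass points, one on each side (hand and object). At its rest position the spring stores no energy; when stretched or compressed it stores the elastic potential energy $\frac{1}{2}k|\Delta l|^2$ , where k is the spring elasticity and $|\Delta l|$ is a "distance" metric measured relative to the spring's rest position. CPF defines two types of spring: attractive and repulsive. The attractive spring pulls the hand vertex $v^h$ toward the object vertex $v^o$ according to a given HO vertex pair affinity. The repulsive spring pushes $v^h$ away from $v^o$ along the normal of $v^o$ when $v^h$ is in the vicinity of $v^o$ . Beyond these definitions, note that an attractive spring is bound to a specific HO vertex pair affinity, whereas a repulsive spring only takes effect in the neighborhood of an HO vertex pair.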

Attractive Spring.

We define the rest length of attractive spring as 0 in which the hand vertex and object vertex are in perfect contact, and the distance metric $|\Delta l|$ as Euclidean distance. Given a HO affinity that includes a vertex pair: $v_i^h$ and $v_j^o$ , the $|\Delta l_{ij}^{atr}|$ is equal to $||v_i^h - v_j^o||_2$ . The potential energy of the current attractive spring is given by:

The paper defines the rest length of an attractive spring as 0, the state in which the hand vertex and object vertex are in perfect contact, and takes the distance metric to be the Euclidean distance. Given an HO affinity containing the vertex pair ( $v_i^h$ , $v_j^o$ ), $|\Delta l_{ij}^{atr}|$ = $||v_i^h - v_j^o||_2$ . The potential energy of this attractive spring is then $\frac{1}{2}$ × (spring elasticity) × (spring length)$^2$.
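Written out, substituting the Euclidean distance into the Hooke form $\frac{1}{2}k|\Delta l|^2$ from the previous subsection gives the expression below. The symbol $E_{ij}^{atr}$ is notation assumed for these notes, since the displayed equation is not reproduced here.

$$E_{ij}^{atr} = \frac{1}{2} k_{ij}^{atr} |\Delta l_{ij}^{atr}|^2 = \frac{1}{2} k_{ij}^{atr} ||v_i^h - v_j^o||_2^2$$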

Repulsive Spring.

We hope that the repulsion energy is high when $v_i^h$ is penetrating or in the vicinity of $v_j^o$ , but gradually decays as the $v_i^h$ moves away from the object, and finally becomes negligible at certain distance. Given a proximate HO vertex pair: $v_i^h$ and $v_j^o$ , we define a repulsive spring to model this behavior. Supposing that the repulsive spring has the rest position at +∞ away along the object normal $n_j^o$ . We adopt a heuristic distance metric $|\Delta l| = e^{-|\Delta l_{ij}^{rpl}|}-e^{-\infty} = e^{-|\Delta l_{ij}^{rpl}|}$ , where $|\Delta l_{ij}^{rpl}| = (v_i^h - v_j^o)\cdot n_j^o$ is the projection of the $(v_i^h - v_j^o)$ on the object normal $n_j^o$ . Thus, the potential energy of the current repulsive spring is

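The authors want the repulsion energy to be high when the hand vertex penetrates or is near the object vertex, to decay gradually as the hand vertex moves away from the object, and finally to become negligible at a certain distance. Given a proximate HO vertex pair, a repulsive spring is defined to model this behavior. The repulsive spring is assumed to have its rest position at +∞ along the object normal. A heuristic distance metric $|\Delta l| = e^{-|\Delta l_{ij}^{rpl}|}$ is adopted, where $|\Delta l_{ij}^{rpl}|$ is the projection of $(v_i^h - v_j^o)$ onto the object normal $n_j^o$ (so the metric shrinks as the hand moves away and grows as it gets closer). The potential energy then follows by substituting this metric into the Hooke form $\frac{1}{2}k|\Delta l|^2$ .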
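The repulsive spring described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the projection is treated as a signed value (as its definition via the dot product suggests), the default `k_rpl` is the scale factor quoted later in these notes, and the length unit is left to the caller because the text does not state it here.

```python
import math


def repulsive_energy(v_h, v_o, n_o, k_rpl=1e-3):
    """Elastic energy of one repulsive spring (illustrative sketch).

    v_h: hand vertex, v_o: object vertex, n_o: unit normal at v_o.
    k_rpl: scale factor of the repulsive spring (assumed default).
    """
    # Signed projection of (v_h - v_o) onto the object normal.
    projection = sum((h - o) * n for h, o, n in zip(v_h, v_o, n_o))
    # Heuristic "distance" from the rest position at +infinity.
    delta_l = math.exp(-projection)
    # Hooke form: 1/2 * k * |delta_l|^2.
    return 0.5 * k_rpl * delta_l ** 2
```

With the normal pointing out of the object, a negative projection (penetration) yields a larger energy than a point on the surface, and the energy decays toward zero as the hand vertex moves outward.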

Grasping inside Contact Potential Field.

By collecting all the attractive and repulsive springs, to form a natural grasp is equivalent to minimize the elastic energy:

As discussed in §3, the hand vertices can be simplified to subregion anchors, which will largely relax the difficulty of learning and fitting inside the CPF. Thus, for attractive spring, we replace the $\Delta l_{ij}$ in Eq. 1, to $\Delta l_{ij}' = a_i - v_j^o$ , where $a_i$ is the closest anchor to $v_i^h$ . Besides, we would like to have the repulsion force be only applied to those HO affinity pairs that are of vertices in vicinity. Thus we set zero the repulsion energy when the vertex distance $||v_j^o - v_i^h||_2$ is greater than a threshold $t_{rpl}$ = 20 mm.

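Collecting all attractive and repulsive springs, forming a natural grasp is equivalent to minimizing the elastic energy. As mentioned in Section 3, the hand vertices can be simplified to subregion anchors, which greatly eases learning and fitting inside the CPF. For the attractive spring, $\Delta l_{ij}$ is therefore replaced with $\Delta l_{ij}' = a_i - v_j^o$ , where $a_i$ is the anchor closest to $v_i^h$ (that is, the distance to the hand vertex is replaced by the distance to the anchor). In addition, the paper wants repulsion to apply only to HO affinity pairs whose vertices are close together, so the repulsion energy is set to 0 when the distance between the vertices exceeds 20 mm.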
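Putting the pieces together, the total elastic energy can be sketched as a sum over springs. The sketch below is an assumption-laden illustration: the dictionary layout, the function name, and the choice to evaluate both spring types on the same list of pairs are hypothetical, coordinates are taken to be in millimetres so that the 20 mm threshold applies directly, and the repulsive term reuses the signed-projection reading of the metric.

```python
import math

T_RPL = 20.0  # mm, repulsion cut-off distance from the text
K_RPL = 1e-3  # repulsive scale factor quoted in these notes


def elastic_energy(pairs):
    """Sum attractive and repulsive energies over HO pairs (sketch).

    Each pair is a dict with keys (hypothetical layout):
      'anchor': closest anchor a_i to the hand vertex,
      'v_h': hand vertex, 'v_o': object vertex,
      'n_o': unit object normal, 'k_atr': spring elasticity.
    """
    total = 0.0
    for p in pairs:
        # Attractive term: the anchor replaces the hand vertex.
        d = [a - o for a, o in zip(p["anchor"], p["v_o"])]
        total += 0.5 * p["k_atr"] * sum(c * c for c in d)
        # Repulsive term: only for vertices within T_RPL of each other.
        r = [h - o for h, o in zip(p["v_h"], p["v_o"])]
        if math.sqrt(sum(c * c for c in r)) <= T_RPL:
            proj = sum(c * n for c, n in zip(r, p["n_o"]))
            total += 0.5 * K_RPL * math.exp(-proj) ** 2
    return total
```

A fitting stage would minimize this quantity with respect to the hand and object pose; here the function only evaluates the energy for fixed geometry.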

Annotation of the Attractive Springs ( $k^{atr})$ .

To integrate the CPF into learning framework, we only consider the $k_{ij}^{atr}$ as the prediction of neural network. To enable this, network shall have the abilities of 1) pairing the hand anchors and object vertices into HO affinity pair, e.g. $(a_i, v_j^o)$ ; and 2) regressing the intensity of those affinity pairs, e.g. $k_{ij}^{atr}$ . These require annotation of the attractive springs $k_{ij}^{atr}$ .

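To integrate CPF into a learning framework, only $k_{ij}^{atr}$ is treated as the prediction of the neural network. To make this possible, the network must be able to 1) pair hand anchors and object vertices into HO affinity pairs, and 2) regress the intensity of those affinity pairs ( $k_{ij}^{atr}$ ). This requires annotations of $k_{ij}^{atr}$ .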

Given the ground-truth (gt.) HO pose and their mesh model, we automatically annotate each $k_{ij}^{atr}$ based on a heuristic of the $(a_i, v_j^o)$ pair distance. Since each $a_i$ may be included in several affinity pairs, we hope the attraction energy stored in each spring at gt. HO pose is well balanced. Thus we assign the gt. $k_{ij}^{atr}$ a value that is inverse-proportional to the gt. $|\Delta \hat l_{ij}^{atr}|$ . In order to train the network, we also bound the magnitude of $k_{ij}^{atr}$ by 0 and 1. Here we only provide a glimpse of the annotation heuristic of $k_{ij}^{atr}$ :

Given the GT HO pose and mesh models, each $k_{ij}^{atr}$ is annotated automatically based on the $(a_i, v_j^o)$ pair distance. Since each $a_i$ can belong to several affinity pairs, the authors want the attraction energy stored in each spring at the GT pose to be well balanced. The gt $k_{ij}^{atr}$ is therefore assigned a value inversely proportional to $|\Delta \hat l_{ij}^{atr}|$ , and its magnitude is bounded between 0 and 1 so that the network can be trained. In the annotation formula, s is set empirically to 20, and an HO affinity is ignored if its gt $|\Delta \hat l_{ij}^{atr}|$ is 20 mm or more. The scale factor of the repulsive spring, $k_{ij}^{rpl}$ , is set to $1 \times 10^{-3}$ .
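To make the annotation step concrete, here is one hypothetical mapping that satisfies the properties stated above: it decreases with the gt distance, stays within [0, 1], and drops affinities at 20 mm or beyond. The cosine ramp is an assumption chosen for illustration only; the paper's own annotation formula is not reproduced in these notes, and the function name is made up.

```python
import math


def annotate_k_atr(gt_distance_mm, s=20.0):
    """Hypothetical annotation of an attractive spring's elasticity.

    gt_distance_mm: ground-truth anchor-to-object-vertex distance in mm.
    s: distance (mm) at which the affinity is dropped (s = 20 in the text).
    """
    if gt_distance_mm >= s:
        return 0.0  # affinity ignored beyond the cut-off
    # Assumed shape: a cosine ramp from 1 (touching) down to 0 (at s).
    return 0.5 * math.cos(math.pi * gt_distance_mm / s) + 0.5
```

Under this assumed shape, a pair in perfect contact gets elasticity 1 and the value falls smoothly to 0 at the cut-off, which keeps targets in the range a sigmoid output can reach.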

5. Hybrid Framework - MIHO

Built around the proposed CPF, the MIHO model handles hand-object interaction in three stages: HoNet, PiCR, and GeO. As shown in Fig. 3, given an RGB image, HoNet first predicts coarse poses of the hand mesh and object mesh. PiCR then learns to construct the CPF and collects the elastic energy ( $E_{elast}$ ). Finally, GeO minimizes the elastic energy in the CPF and produces the refined HO meshes $\ast\mathcal{V}^o$ , $\ast\mathcal{V}^h$ .

5.1. Hand-object Pose Estimation Network, HoNet

The HoNet first predicts coarse poses of HO meshes by the baseline model MeshRegNet as in [23]. The outcomes from the baseline comprise in total 37 coefficients: object 6D pose $\mathbf{P}_o$ ∈ $\mathcal{se}(3) (\mathbb{R}^6)$ , hand wrist 6D pose $\mathbf{P}_w$ ∈ $\mathcal{se}(3)$ , PCA components of MANO pose $\theta_{pca}$ ∈ $\mathbb{R}^{15}$ and shape β ∈ $\mathbb{R}^{10}$ . With these coefficients, HoNet could place the HO meshes into camera space.

HoNet first predicts coarse poses of the HO meshes with the baseline model MeshRegNet. The baseline outputs 37 coefficients in total: object 6D pose (6), hand wrist 6D pose (6), PCA components of the MANO pose (15), and shape (10). With these coefficients HoNet can place the HO meshes in camera space.

5.2. Pixel-wise Contact Recovery Module, PiCR

With the coarse meshes of hand and object in HoNet, PiCR learns to recover the CPF by firstly pairing the hand anchors and object vertices into HO affinity pairs and then regressing the spring elasticities that describe the affinities. To achieve this, PiCR yields three cascaded outcomes: 1) Vertex Contact (VC) decides which vertices on object are in contact with hand; 2) Contact Region (CR) decides the subregion that is most likely to contact with those vertices in VC; 3) Anchor Elasticity (AE) represents the elasticities of the attractive springs. With VC, CR, and AE, we can then recover the CPF as illustrated in Fig. 4.

Using the coarse hand and object meshes from HoNet, PiCR learns to recover the CPF by first pairing hand anchors and object vertices into HO affinity pairs and then regressing the spring elasticities that describe those affinities. To do so, PiCR produces three outcomes. 1) Vertex Contact (VC) decides which object vertices are in contact with the hand. 2) Contact Region (CR) decides the subregion most likely to touch the vertices in VC. 3) Anchor Elasticity (AE) gives the elasticity of the attractive springs. With VC, CR, and AE the CPF can be recovered as in Fig. 4.

Vertex Contact.(VC)

PiCR’s first outcome $VC \in \mathbb{R}^{N_o}$ stands for the contact probability of object vertices. More specifically, $VC[j]$ is a probability that implies the j-th object vertex $v_j^o$ is in contact with hand. The loss function of VC is defined as a binary focal loss [33]:

where $f_j = p_j$ if the gt. $\hat v_j^o$ belongs to any HO affinity, otherwise $f_j = (1-p_j)$ , and the $p_j$ is the predicted probability at $VC[j]$ . $\mathbf{1}^{img}_j$ denotes whether the vertex $\hat v_j^o$ is projected inside the image. $\alpha_j$ is inverse class frequency and γ is empirically set to 2.

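PiCR's first outcome, VC, is the contact probability of the object vertices. Specifically, VC[j] is the probability that the j-th object vertex $v_j^o$ is in contact with the hand. The loss is a binary focal loss in which $f_j = p_j$ if the gt vertex $\hat v_j^o$ belongs to any HO affinity and $f_j = (1-p_j)$ otherwise, where $p_j$ is the predicted probability at $VC[j]$ . $\mathbf{1}^{img}_j$ indicates whether the vertex $\hat v_j^o$ projects inside the image. $\alpha_j$ is the inverse class frequency, and γ is set to 2.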
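The definitions above plug into the standard binary focal loss form, $-\alpha (1-f)^{\gamma}\log f$. The sketch below assumes that form, sums over vertices rather than averaging, and uses hypothetical argument names; the exact reduction and weighting used in the paper are not confirmed by these notes.

```python
import math


def vc_focal_loss(probs, gt_contact, in_image, alphas, gamma=2.0, eps=1e-12):
    """Binary focal loss over object vertices (illustrative sketch).

    probs: predicted contact probabilities p_j (VC[j]).
    gt_contact: True if the gt vertex belongs to any HO affinity.
    in_image: True if the vertex projects inside the image (indicator).
    alphas: per-vertex class weights (inverse class frequency).
    """
    loss = 0.0
    for p, contact, visible, alpha in zip(probs, gt_contact, in_image, alphas):
        if not visible:
            continue  # vertices outside the image contribute nothing
        f = p if contact else 1.0 - p
        # Focal term down-weights vertices that are already well classified.
        loss += -alpha * (1.0 - f) ** gamma * math.log(max(f, eps))
    return loss
```

Confident correct predictions contribute almost nothing, so the loss concentrates on the comparatively few contact vertices that are hard to classify.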

Contact Region.

PiCR’s second outcome $CR \in \mathbb{R}^{N_o\times17}$ stands for the subregion probabilities of object vertices. More specifically, for the j-th query, $CR[j]$ contains 17 probabilities that indicates $v_j^o$’s affinity toward 17 hand subregions. The loss function $\mathcal{L}_{CR}$ is defined as a multi-class focal loss.

where the $m_j = \Sigma(p_j*t_j)$ in which $p_j = CR[j] \in \mathbb{R}^{17}$ is the predicted per-subregion probabilities through softmax, and $t_j \in \mathbb{R}^{17}$ is the gt. subregion affinity of $\hat v_j^o$ as a one-hot vector. $\mathbf{1}_j^{VC}$ denotes that the gt. VC of $\hat v_j^o$ is positive.

CR is the subregion probability of the object vertices. Specifically, CR[j] holds the affinity of $v_j^o$ toward the 17 hand subregions (that is, 17 probability values). In the loss, $t_j$ is the gt subregion affinity as a one-hot vector, $p_j$ is the predicted probability for each subregion, and $m_j$ is the inner product of the two. The indicator function $\mathbf{1}_j^{VC}$ selects the vertices whose gt VC is positive (that is, the object vertex is in contact with the hand).
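Under the standard multi-class focal loss form, the definitions above would combine as follows. This is a reconstruction from those definitions rather than a copy of the paper's equation: the summation over all $N_o$ vertices, the absence of a class weight, and reusing the exponent $\gamma$ from the VC loss are assumptions.

$$\mathcal{L}_{CR} = -\sum_{j=1}^{N_o} \mathbf{1}_j^{VC} (1 - m_j)^{\gamma} \log(m_j), \qquad m_j = \Sigma(p_j * t_j)$$

Because $t_j$ is one-hot, $m_j$ is simply the probability that the network assigns to the correct subregion.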

Anchor Elasticity.

PiCR’s third outcome $AE \in \mathbb{R}^{N_o}$ stands for the predicted elasticity of attractive springs $k^{atr}$ . More specifically, $AE[j]$ is the elasticity $k_{ij}^{atr}$ of an attractive spring that connects $v_j^o$ to its affinitive anchor $a_i$ in the predicted subregion: $argmax(CR[j])$ . The loss function $\mathcal{L}_{AE}$ is defined as a binary cross-entropy (BCE):

where the $\hat{k}^{atr}_{ij}$ is the gt. elasticity described in §4.

PiCR's third outcome, $AE \in \mathbb{R}^{N_o}$ , is the predicted elasticity $k^{atr}$ of the attractive springs. Specifically, AE[j] is the elasticity $k^{atr}_{ij}$ of the attractive spring that connects $v_j^o$ to its affinitive anchor $a_i$ in the predicted subregion $argmax(CR[j])$ . The loss is defined as a binary cross-entropy.
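Since the gt elasticity is bounded between 0 and 1, binary cross-entropy can be applied with a soft target. The sketch below shows that computation; averaging over the supplied vertices is an assumption, as is the function name, and any masking to contact vertices would have to happen before calling it.

```python
import math


def ae_bce_loss(pred_k, gt_k, eps=1e-12):
    """Binary cross-entropy between predicted and gt elasticities (sketch).

    pred_k: predicted elasticities in (0, 1), e.g. sigmoid outputs.
    gt_k: annotated elasticities in [0, 1], used as soft targets.
    """
    total = 0.0
    for k, k_hat in zip(pred_k, gt_k):
        k = min(max(k, eps), 1.0 - eps)  # avoid log(0)
        total += -(k_hat * math.log(k) + (1.0 - k_hat) * math.log(1.0 - k))
    return total / len(pred_k)
```

With soft targets the loss does not reach zero, but for a fixed target it is smallest when the prediction equals that target.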

Using the predicted VC, CR, and AE together with the coarse meshes $\mathcal{V}^o, \mathcal{V}^h$ from HoNet, PiCR finally recovers the CPF and collects the elastic energy $E_{elast}$ (described in Algorithm 1 of the paper). Empirically, the probability threshold of VC is 0.8 and the distance threshold is $t_{rpl} = 20 mm$ .
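The recovery step can be sketched as a filter over object vertices. This is a simplified illustration rather than the paper's Algorithm 1: the function name and the tuple layout are hypothetical, and the choice of a specific anchor inside the predicted subregion is left out.

```python
def recover_attractive_springs(vc, cr, ae, vc_threshold=0.8):
    """Turn VC, CR, AE predictions into attractive springs (sketch).

    vc: contact probability per object vertex.
    cr: subregion probabilities per object vertex (17 entries in the text).
    ae: predicted elasticity k_atr per object vertex.
    Returns (object_vertex_index, subregion_index, elasticity) tuples.
    """
    springs = []
    for j, contact_prob in enumerate(vc):
        if contact_prob < vc_threshold:
            continue  # vertex is not considered to be in contact
        # argmax(CR[j]): the subregion this vertex is most drawn to.
        region = max(range(len(cr[j])), key=lambda r: cr[j][r])
        springs.append((j, region, ae[j]))
    return springs


# Toy usage with 3 subregions instead of 17, for brevity.
springs = recover_attractive_springs(
    vc=[0.9, 0.5],
    cr=[[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]],
    ae=[0.6, 0.3],
)
```

Each returned tuple corresponds to one attractive spring whose energy would be added to $E_{elast}$ once an anchor in the subregion is chosen.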

PiCR's Framework.

The proposed PiCR consists of a backbone b that extracts features from image, an encoder p that converts image features to object vertex features, and 3 heads $h_{vc}, h_{cr}$ and $h_{ae}$ which sequentially convert those features into VC, CR, and AE. As illustrated in Fig. 3, the process of feature extraction in PiCR can be expressed as:

where b(·) is the hourglass networks [37], π(·) is the perspective camera projection, and f(·) stands for aligning $\mathcal{V}^o$’s 2D projection π( $\mathcal{V}^o$ ) with the image features $b(\mathcal{I})$ through bilinear sampling. Inspired from Eq.(1) in [48], we also append the object’s root-relative z value $z(\mathcal{V}^o)$ at the end of f(·) to form the pixel-wise features $\mathcal{F}'$ . Next, a PointNet [40] encoder p(·) is adopted to convert $\mathcal{F}'$ to its point-wise features $\mathcal{F}$ .

PiCR consists of a backbone b that extracts features from the image, an encoder p that converts image features into object vertex features, and three heads $h_{vc}, h_{cr}$ , $h_{ae}$ that sequentially convert those features into VC, CR, and AE. As illustrated in Fig. 3, feature extraction in PiCR is expressed as follows.

$\mathcal{V}^o$ denotes the object vertices, b is the hourglass network, $\pi (\cdot)$ is the camera projection, and $f (\cdot)$ aligns the 2D projection with the image features through bilinear sampling. The object's root-relative z value is appended to form the pixel-wise features. A PointNet encoder p then converts these into point-wise features.
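The projection-and-sampling step can be illustrated on a tiny single-channel feature map. The sketch below is only a toy stand-in for $f(b(\mathcal{I}), \pi(\mathcal{V}^o))$ with the z value appended: the pinhole intrinsics, the function names, and the reading of "root-relative z" as the vertex depth minus the depth of the object root are assumptions, and the hourglass backbone and PointNet encoder are not modeled.

```python
def project(vertex, fx, fy, cx, cy):
    """Perspective (pinhole) projection of a camera-space vertex."""
    x, y, z = vertex
    return fx * x / z + cx, fy * y / z + cy


def bilinear_sample(fmap, u, v):
    """Bilinearly sample fmap[row][col] at column u, row v."""
    h, w = len(fmap), len(fmap[0])
    u = min(max(u, 0.0), w - 1.0)  # clamp to the map borders
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = u - x0, v - y0
    top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
    bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def pixel_wise_feature(vertex, root_z, fmap, intrinsics):
    """Sampled image feature with the root-relative z appended."""
    u, v = project(vertex, *intrinsics)
    return [bilinear_sample(fmap, u, v), vertex[2] - root_z]


fmap = [[0.0, 1.0], [2.0, 3.0]]
feature = pixel_wise_feature((0.0, 0.0, 2.0), 1.5, fmap, (1.0, 1.0, 0.5, 0.5))
```

In the toy call the vertex projects to the centre of the 2×2 map, so the sampled value is the average of the four entries.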

5.3. Grasping Energy Optimizer, GeO

The fitting part: Grasping Energy Optimizer (GeO) aims to refine the HO pose w.r.t. the recovered CPF. For the object part, we adjust its 6D pose $P_o \in \mathcal{se}(3)$ . For the hand part, we jointly adjust the A-MANO’s 15 joint rotations $\{R_j \in so(3) | j \leq 15\}$ and a wrist pose $P_w \in se(3)$ . In order to mitigate the abnormal hand posture during optimization, we also define an anatomical cost function $\mathcal{L}_{anat}$ that penalizes the unwanted axial components and angular values of the 15 rotations in the proposed twist-splay-bend coordinate frame. First, for the joints along hand kinematic tree, we penalize the component of rotation axis $a^{rot}$ on twist direction: $n^{twist}$ , since any component that causes the finger twisting along its pointing direction is prohibited. Second, for the joints that do not belong to 5 knuckles, we also penalize the component of $a^{rot}$ on splay direction: $n^{splay}$ . Last, we penalize the rotation angle $\phi^{bend}$ that revolves about the bend axis if it is greater than π/2. The total anatomical cost can be written as:

We also penalize the offset of the refined hand-object vertices $\ast\mathcal{V}^o, \ast\mathcal{V}^h$ from their initial estimation $\mathcal{V}^o, \mathcal{V}^h$ in form of $l2$ distance: $\mathcal{L}_{offset}$ .

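The fitting part, the Grasping Energy Optimizer (GeO), aims to refine the HO pose with respect to the recovered CPF. For the object, its 6D pose is adjusted; for the hand, the 15 joint rotations of A-MANO ( $\{R_j \in so(3) | j \leq 15\}$ ) and the wrist pose $P_w \in se(3)$ are adjusted jointly. To mitigate abnormal hand poses during optimization, an anatomical cost function $\mathcal{L}_{anat}$ is defined that penalizes the unwanted axial components and angular values of the 15 rotations in the twist-splay-bend coordinate frame. First, for the joints along the hand kinematic tree, the component of the rotation axis $a^{rot}$ along the twist direction ( $n^{twist}$ ) is penalized, because twisting a finger about its pointing direction is prohibited. Second, for joints that are not among the 5 knuckles, the component of $a^{rot}$ along the splay direction is also penalized (only the knuckles are allowed to splay). Last, the rotation angle about the bend axis is penalized if it is greater than π/2. The offset of the refined HO vertices from their initial estimation is also penalized with an l2 distance, so that they do not drift far from the initial estimate.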
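The three penalties can be sketched as follows. The squared form of each penalty, the dictionary layout, and the function name are assumptions made for illustration; the paper's exact expression for $\mathcal{L}_{anat}$ is not reproduced in these notes.

```python
import math


def anatomical_cost(joints):
    """Sum of twist, splay, and bend penalties over joints (sketch).

    Each joint is a dict with keys (hypothetical layout):
      'axis': unit rotation axis a_rot, 'angle': rotation angle in radians,
      'twist', 'splay': unit directions of the joint's frame,
      'is_knuckle': True for the 5 knuckle joints.
    """
    cost = 0.0
    for j in joints:
        axis = j["axis"]
        # Any rotation component about the twist direction is penalized.
        twist = sum(a * n for a, n in zip(axis, j["twist"]))
        cost += twist ** 2
        # Splay is only tolerated at the knuckles.
        if not j["is_knuckle"]:
            splay = sum(a * n for a, n in zip(axis, j["splay"]))
            cost += splay ** 2
        # Bending beyond pi/2 is penalized.
        cost += max(j["angle"] - math.pi / 2, 0.0) ** 2
    return cost
```

A joint that rotates purely about its bend axis by less than π/2 incurs zero cost, which is the behaviour the constraints are meant to favour.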

6. Experiments and Results

6.1. Datasets

First-person Hand Action Benchmark, FHB.

FHB is a first-person RGBD video dataset of hand in manipulation with objects. The ground-truth of hand poses was captured via magnetic sensors. In our experiments, we use a subset of FHB that contains 4 objects with a scanned model and pose annotation.

FHB is a first-person RGBD video dataset of hands manipulating objects. The GT hand poses were captured with magnetic sensors. The experiments use a subset of FHB containing 4 objects that have a scanned model and pose annotation.

HO3D.

HO3D is another dataset that contains precise hand-object pose during the interaction.

HO3D is another dataset containing precise hand-object poses during interaction. (Which parts were used and how is omitted here.)

6.2. Metrics

Modeling the HO interaction requires not only a proper pose of both hand and object but also a natural grasp configuration. Here, we report 5 metrics in total that cover both reconstruction and grasp quality.

Modeling HO interaction requires not only proper poses for both hand and object but also a natural grasp configuration. Five metrics are used, covering both reconstruction and grasp quality.

MPVPE. (lower is better)

We compute the mean per vertex position error for both hand and object in camera space to assess the quality of pose estimation.

To assess the quality of pose estimation, the mean per-vertex position error is computed for both hand and object in camera space.
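As a concrete reading of this metric, the sketch below averages the Euclidean distance between corresponding predicted and ground-truth vertices. The function name is hypothetical, and it assumes both meshes share the same vertex ordering and are expressed in camera space.

```python
import math


def mpvpe(pred_vertices, gt_vertices):
    """Mean per-vertex position error between two vertex lists."""
    if len(pred_vertices) != len(gt_vertices):
        raise ValueError("vertex lists must have the same length")
    errors = [math.dist(p, g) for p, g in zip(pred_vertices, gt_vertices)]
    return sum(errors) / len(errors)


# Two vertices that are off by 3 and 4 units respectively.
error = mpvpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 4, 0)])
```

The same function applies to the hand mesh and the object mesh separately, giving one score for each.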

Penetration Depth (PD). (lower is better)

To measure how deep that the hand is penetrating the object’s surface, we calculate the penetration depth that is the maximum distance of all the penetrated hand vertices to their closest object surface.

A metric for how deeply the hand penetrates the object's surface.
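Given a signed distance from each hand vertex to the object surface, the metric reduces to a maximum over the penetrating vertices. The sketch below assumes such distances are already available (negative meaning inside the object); computing them from meshes is outside its scope, and the function name is hypothetical.

```python
def penetration_depth(signed_distances):
    """Maximum depth of hand vertices lying inside the object (sketch).

    signed_distances: distance of each hand vertex to its closest object
    surface point, negative when the vertex is inside the object.
    """
    depths = [-d for d in signed_distances if d < 0.0]
    # No penetrating vertex means zero penetration depth.
    return max(depths) if depths else 0.0
```

Returning 0 when no vertex penetrates keeps the score comparable across samples with and without intersection.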

Solid Intersection Volume (SIV). (lower is better)

To measure how much space intersection that occurs during estimation, we voxelize the object mesh into $80^3$ voxels, and calculate the sum of the voxel volume inside the hand surface.

To measure how much spatial intersection occurs during estimation, the sum of the voxel volumes inside the hand is computed.
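The voxel-counting idea can be sketched with two point-in-shape predicates. This is an illustration under stated assumptions: `inside_object` and `inside_hand` are hypothetical callables standing in for mesh containment tests, the grid covers a caller-supplied bounding box, and a voxel is counted when its centre lies inside both shapes.

```python
def solid_intersection_volume(inside_object, inside_hand,
                              bounds_min, bounds_max, n=80):
    """Approximate the hand-object intersection volume on an n^3 grid."""
    steps = [(hi - lo) / n for lo, hi in zip(bounds_min, bounds_max)]
    voxel_volume = steps[0] * steps[1] * steps[2]
    count = 0
    for i in range(n):
        x = bounds_min[0] + (i + 0.5) * steps[0]
        for j in range(n):
            y = bounds_min[1] + (j + 0.5) * steps[1]
            for k in range(n):
                z = bounds_min[2] + (k + 0.5) * steps[2]
                point = (x, y, z)
                # Count voxels of the object that also lie in the hand.
                if inside_object(point) and inside_hand(point):
                    count += 1
    return count * voxel_volume


# Toy check: a unit cube "object" and a half-space "hand" (x < 0.5).
volume = solid_intersection_volume(
    lambda p: True, lambda p: p[0] < 0.5,
    (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), n=10,
)
```

In the toy check half of the cube lies in the half-space, so the estimate is close to 0.5; with real meshes the predicates would be replaced by proper inside tests.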

Disjointedness Distance (DD). (lower is better)

We also encourage stable HO contact, which can be depicted as attracting fingertips onto the object surface. Therefore, we define the disjointedness metrics as the average distance of hand vertices in 5 fingertips region to their closest object surface.

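Stable HO contact is encouraged, in the sense of drawing the fingertips onto the object surface, so this metric is defined as the average distance from the hand vertices in the 5 fingertip regions to the object surface.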

Simulation Displacement (SD). (lower is better)

We further evaluate the grasp stability in a modern physics simulator [11]. We measure the average displacement of object’s center over a fixed time period by holding the hand steadily and applying gravity to the object [24].

Grasp stability is additionally evaluated in a physics simulator.

6.3. Comparison with State-of-the-Arts

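Two settings are used in the experiments. One dagger mark denotes hand-alone (the object is fixed at the initial prediction from HoNet, and only the hand pose is optimized with GeO); two dagger marks denote hand-object (hand and object poses are optimized jointly with GeO). The authors also find that a low vertex error does not necessarily guarantee high reconstruction quality (in fact, the GT and [23] show quite poor error values). Across the datasets overall, the proposed method (Ours) performs well.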

In the qualitative figure, the results marked with an arrow show penetration, and the others fail to make proper contact.

6.4. Ablation Study

In this experiment, we further evaluate the effectiveness of the proposed CPF and A-MANO. In the main text, we include three of the most representative studies.

The ablation study further evaluates the effectiveness of CPF and A-MANO.

Comparison with simple Distance-based Contact Heuristics.

Shows that using MIHO (the hybrid model comprising HoNet, PiCR, and GeO) improves the results.

Effectiveness of Repulsive Springs.

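This compares against the case without repulsive springs (that is, using only the attraction energy). Even without repulsive springs, the attractive springs can be seen to push the hand away to some extent when it is close to the object surface.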

Effectiveness of the Anatomical Constraints.

Shows that without the anatomical constraints (the constraints on rotation about the twist-splay-bend axes), the hand takes on abnormal shapes.

7. Conclusion

In this work, we propose a novel contact representation named CPF and a learning-fitting hybrid framework MIHO to help modeling hand and object interaction. Comprehensive evaluations show that our methods, while being able to recover precise hand-object pose, can also effectively 1) avoid interpenetration and control disjointedness, and 2) prevent abnormality in hand pose. Later, we also plan to develop for an object-agnostic representation of CPF, for the interaction in general cases.

The paper proposes CPF, a new contact representation, and MIHO, a learning-fitting hybrid framework, to help model hand-object interaction. Comprehensive evaluations show that the method recovers precise hand-object poses while also 1) avoiding interpenetration and controlling disjointedness, and 2) preventing abnormal hand poses. As future work, the authors plan to develop an object-agnostic representation of CPF for interaction in general cases.

I wrote this as an undergraduate, so it may contain errors or lack explanation.

Code

MANO ⇒ smpl_data["hands_components"] → (45, 45) matrix

MANO vertices → 778 / faces → 1538


CPF: Learning a Contact Potential Field to Model the Hand-Object Interaction, ICCV’21
