这是用户在 2024-11-8 13:52 为 https://ieeexplore.ieee.org/document/9904480 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?
通过对比降维进行交互式视觉聚类分析 |IEEE期刊和杂志 |IEEE Xplore --- Interactive Visual Cluster Analysis by Contrastive Dimensionality Reduction | IEEE Journals & Magazine | IEEE Xplore


通过对比降维进行交互式视觉聚类分析


出版商: IEEE

Abstract:

We propose a contrastive dimensionality reduction approach (CDR) for interactive visual cluster analysis. Although dimensionality reduction of high-dimensional data is wi...View more

 抽象:


我们提出了一种用于交互式视觉聚类分析的对比降维方法 (CDR)。尽管高维数据的降维与散点图一起广泛用于视觉聚类分析,但有效的视觉聚类分析存在一些限制。首先,在保持邻域结构时,嵌入呈现清晰的视觉聚类分离并非易事。其次,由于聚类分析是一项主观任务,因此需要用户引导。然而,在降维中实现交互也并非易事。为了解决这些问题,我们在降维中引入了对比学习,以实现高质量的嵌入。然后,我们重新定义损失函数到负对的梯度,以增强嵌入结果的视觉聚类分离。基于对比学习方案,我们采用基于链接的交互来指导嵌入。之后,我们实现了一个原型可视化界面,该界面集成了建议的算法和一组可视化效果。定量实验表明,CDR 在保持正确的邻域结构和改善视觉聚类分离方面优于现有技术。消融实验证明了梯度重新定义的有效性。用户研究验证了 CDR 在集群识别任务中优于 t-SNE 和 UMAP。我们还展示了真实数据集上的两个用例,以展示基于链接的交互的有效性。

发表于: IEEE Transactions on Visualization and Computer Graphics卷数: 29期号: 1, 2023 年 1 月

页码: 734 - 744

发布日期: 2022 年 9 月 27 日

 ISSN 信息:


PubMed 编号: 36166528

出版商: IEEE

 资助机构:


Fig. 1. - The embedding results by dimensionality reduction techniques. Top: the embedding results of the indian food dataset [4] by (a) ISOMAP, (b) t-SNE, (c) UMAP, and (d) CDR (the contrastive dimensionality reduction), respectively. The data points are color-encoded by class labels. Bottom: the interactive analysis of the animals dataset [26] by CDR. (e) The initial embedding result. (f) A must link is added to merge the butterfly clusters. (g) An additional cannot link (orange), and an additional must link (red) are added to separate the spider and the butterfly clusters. (h) The updated embedding result after the cannot link and must link interactions.
 图 1.


嵌入结果通过降维技术进行。顶部:分别通过 (a) ISOMAP、 (b) t-SNE、 (c) UMAP 和 (d) CDR (对比降维) 的印度食品数据集 [4] 的嵌入结果。数据点由类标签进行颜色编码。下图:通过 CDR 对动物数据集 [26] 的交互式分析。(e) 初始嵌入结果。(f) 添加 Must 链接以合并蝶形群集。(g) 添加额外的 cannot link (橙色) 和 additional must link (红色) 以分隔spider和butterfly集群。(h) 在 cannot linkmust link 交互之后更新的嵌入结果。

 第 1 部分

 介绍


由于可视化聚类分析是一项固有的人机协同任务,缺乏通用的地面实况,因此它通常采用降维 (DR) 技术来支持聚类模式 [9] 的可视化和探索 [87][84] 许多 DR 技术,例如 t-SNE 和 UMAP,采用“接近≈相似”的比喻,其中一对点之间的相似性由其 2D 嵌入之间的距离保持 [82] 。因此,在嵌入空间中聚集在一起的点可以直观地形成隐式集群。在本文中,我们提出了一种用于可视化集群分析的交互式 DR 技术。


尽管现有的 DR 技术广泛用于视觉聚类分析,但它们的结果往往不令人满意。 Fig. 1 显示了三种广泛使用的 DR 方法在 Indian Food 数据集 [4] 上的嵌入结果。我们可以看到,在使用 ISOMAP [73] 时,数据点严重混乱,使得识别集群变得极其困难。对于 t-SNE [76] 和 UMAP [55] ,它们被认为是最先进的 DR 方法 [28][62] 它们 [78] 的结果在集群之间具有更好的分离。但是,在没有类标签颜色编码的情况下识别所有集群仍然会令人困惑,集群任务中通常不提供这些标签。此外,目前尚不清楚用户如何引导这些嵌入。从这些限制中,我们确定了 DR 技术支持交互式可视化集群分析的三个主要要求。


首先是正确构建邻域结构,并忠实地将它们保留在嵌入空间中。这确保了集群模式的可信表示,这是可视化集群分析的先决条件。最先进的 t-SNE 和 UMAP 技术显式或隐式使用 k 最近邻 (kNN) 来表示邻域结构。但是,由于欧几里得相似性度量的限制,kNN 不可避免地存在错误,因为数据不相似性可能本质上是非欧几里得 [57] 的。在欧几里得相似性下,不同流形中的数据点可能被视为彼此的 k 最近邻。此外,在 embedding 空间中保留正确的邻居也并非易事。缺失邻居和假邻居经常出现在现有的 DR 技术 [52] 中。 [57] 提出了几种参数化技术,如参数化 UMAP [67] 、 参数化 t-SNE [75] 和 深度递归嵌入 [92] 。然而,它们在邻域保护方面的表现比最先进的非参数技术差。提出高质量的参数化 DR 技术仍然是一个悬而未决的问题。


二是在低维空间实现清晰的视觉聚类分离,可以增强用户的信心,加快识别聚类的过程。在这个问题上,t-SNE 和 UMAP 都不受欢迎。如图所示 Fig. 1 ,两种方法都获得了相对较低的轮廓系数 (SC) 分数 (t-SNE: 0.466;UMAP: 0.610),并在集群之间存在视觉混淆。现有工作致力于视觉区分标记类 [80] 的线性 DR 技术。但是,目前尚不清楚如何改善没有类标签的数据集的视觉聚类分离。此外,尽管 DR 技术旨在保留邻域结构,但如果两个簇在高维空间中没有很好地分离,那么在嵌入空间中分离簇会破坏结构。分离集群和保留邻域并非易事。通常,这两者之间存在权衡。


第三个是支持人机回环分析的有效交互。用户希望利用他们的领域知识来引导低维嵌入,例如,两个点应该在同一个集群中或属于不同的集群。尽管已经开发了 [31] 几种交互式嵌入技术, [45] [47] [49] 但关于引导 DR 进行视觉聚类分析的研究工作很少。Perez 等人 [60] 允许用户操作单个参数来生成分离的集群。但是,参数上的交互并不直观,需要模型知识。直观的交互式非线性 DR 技术仍然是一个悬而未决的问题。


为了满足这些要求,我们提出了一种用于交互式视觉聚类分析的对比降维技术 (CDR)。对比学习是一种自我监督的表示学习方法,在区分相似/不同的数据点 [23] 方面取得了巨大成功。 [37] 因此,我们首先将对比学习引入降维,以更好地保留邻域结构并呈现可信的集群模式。因此,我们仔细设计了正/负对,并在降维的背景下调整对比损失。其次,为了便于视觉聚类分析,我们进一步改进了所提出的 CDR 方法,在保持邻域结构的同时实现更好的视觉聚类分离。为了实现这一点,我们将损失函数的梯度重新定义为负对,以便模型不仅可以区分数据点之间的相似性/不相似性,还可以区分集群之间的相似性/不相似性。第三,我们在 CDR 中引入了基于链接的约束。这使得用户可以通过 Must LinkCannot Link 操作进行交互,支持用户利用领域知识引导嵌入。支持迭代交互和优化分析循环,以进一步减轻交互负担。通过提出的交互式对比降维方法,我们还开发了一个原型界面,该界面集成了所提出的方法和一组可视化效果,用于交互式视觉聚类分析。通过对真实数据集的定量比较,我们表明所提出的 CDR 在呈现可信和呈现分离良好的集群模式方面优于 t-SNE 和 UMAP。 消融实验证明了所提方法的有效性。我们还通过一项用户研究和两个用例展示了 CDR 在交互式视觉集群分析中的有效性。拟议 CDR 的代码可在 https://github.com/DRLib/CDR 上获得。


总而言之,我们的主要贡献是:


  • 我们提出了一种对比降维 (CDR) 方法,将对比学习引入降维并改进对比损失,以实现更好的邻域保持和视觉集群分离。


  • 我们将基于链接的交互引入 CDR 中,以捕获用户意图以进行交互式可视化集群分析。


  • 广泛的定量比较、一项用户研究和两个案例研究证明了与流行的 DR 方法(例如 t-SNE 和 UMAP)相比,所提出的 CDR 的有效性。

 第 2 部分

 相关工作


2.1 可视化集群分析


可视化聚类分析利用可视化技术来理解和探索数据特征、聚类模型和聚类模式。数据特征(例如维度的统计分布和聚合统计信息)描述了维度中的特征分布如何影响聚类结果,从而能够了解聚类与相关维度之间的关系。它们可以通过统计图表(如 字形 [64] 、 直方图 [18] 和 平行坐标 [2][42] )进行可视化。除了直接可视化统计特征之外,在并行坐标 [13] 或相似性矩阵中对它们 [53] [63] 进行重新排序也是一种有效的聚类方式 [91] ,因为具有相似特征的实例可以通过各种重新排序策略紧密放置。将可视化与聚类分析模型相结合,使用户能够理解、引导和调整聚类分析模型。例如,可以使用热图 [69] 或树 [86] 来探索分层聚类结构,以指导聚类结果的优化。此外,分类标签等可视化编码模型有助于检查群集 [29] 中实例属性的相关性。最近,人们努力采用 DR 技术来探索嵌入空间中 [82]集群模式。特别是,DR 技术可以与聚类算法结合使用,无论是在聚类 [56] 之前 [88] 还是在聚类 [21][30] [48] 、 之后,以在 2D 嵌入空间中显式形成集群。 采用“接近≈相似”隐喻的 DR 技术可以直接可视化,保留隐式聚类 [19][66] 与高维空间中的原始聚类结构建立最直观的关系。由于变换可能会导致变形,从而使聚类实例的相对位置不可靠 [39] ,因此挑战在于尽可能多地保留邻域结构,同时获得高质量、分离良好的聚类。我们的方法利用对比学习和交互式聚类来应对这一挑战。


2.2 交互式聚类


交互式集群利用最终用户的领域知识来指导自动集群流程。提供了许多不同的交互式操作来支持转向过程,例如调整集群实例的 [11] 特征或权重、 [58] 参数化底层模型 [3][6][14] [22] 、 或直接操作聚类结果(例如,通过拆分或合并集群)、 [10] [20] .但是,这些技术要么需要对底层模型有深入的了解,要么依赖于从外部源注入特定于领域的知识。为了减轻认知成本,已经初步努力进行微观层面的交互,这些交互侧重于监督环境中样本之间基于关系的变量。它们可以在视觉上进行操作,并且独立于特定域和底层模型。例如,用户可以调整特定集群实例 [15] 之间的嵌入距离 , [25] 指定它们的关系 [85] ,或者直接建立成对实例的 必须链接不能链接 约束,以优化各种降维模型 [50] 。聚类算法的基于链接的交互激发了我们在交互式可视化聚类分析中应用转向嵌入的灵感。


2.3 交互式降维


一些降维技术支持交互式 DR 流程,其中用户在半自动设置中以交互方式控制 DR 模型。大多数现有技术允许用户调整底层参数化,无论是针对特征权重 [19][40] [81] 还是直接在不同的 DR 模型 [24] 之间切换。 [48] 为了减轻参数化底层模型所需的认知成本,提出了观察级交互,以直接与嵌入中的数据点交互。例如,Brown 等人和 [15] Endert 等人中描述的技术 [31] 允许用户直接调整数据点的位置,从这些数据点中,加权距离函数等隐藏参数会相应地更新。此外,可以将单个点分配为控制点,其位置调整将动态修改嵌入的 [41] 全局布局 , [54] 。interAxis [45] 和 AxiSketcher [49] 等技术允许用户直接操作数据点,例如,将它们画成线,以指定嵌入动态更新的尺寸关系。但是,希望利用链接约束对嵌入的相似性矩阵提供显式影响,而不是使用上述交互 [17] 。与我们的方法类似,基于链接的交互已被引入嵌入技术,如多维缩放 [47] 和内核 PCA [16] ,其中嵌入的距离矩阵可以根据基于链接的约束进行细化。遗憾的是,它们没有针对视觉群集分离进行优化,导致性能 [5] 相对较差。 在利用交互式 DR 来促进聚类分析方面,最初所做的努力很少。例如,Perez 等人提出的技术 [60] 允许用户操纵单个参数来生成分离良好的集群,同时在高维空间中保留原始结构。与这些方法相比,所提出的方法可以直接对单个数据点进行约束。

 第 3 部分

 方法


在本节中,我们将介绍我们的交互式可视化集群分析方法。首先,我们将对比学习引入降维,以更好地保留邻域结构。其次,我们将损失函数的梯度重新定义为负对,以改善视觉聚类分离。第三,我们将基于链接的交互(包括必须链接不能链接)引入 CDR,使用户能够对嵌入进行定向。我们还实现了一个原型界面,该界面集成了建议的技术和一组可视化效果,用于交互式可视化集群分析。


3.1 基于对比学习的嵌入


对比学习是一种自我监督的表示学习方法,在计算机视觉任务 [23] 的实例判别方面取得了巨大成功。 [37] 通常,它将相似的实例拉在一起,并将不同的实例分开以学习有效的 embedding [74] 。相似和不同关系分别由正对和负对表示。嵌入网络使用对比损失函数进行训练,该函数测量点对的损失。为了利用对比学习的能力,我们在视觉聚类分析的降维中引入了它。在我们的方法中,嵌入函数是 f():XZ ,其中 XRn×d 表示 n 高维数据点,维度 d.ZRn×2 是嵌入数据点。

 正对


正对指定点之间的相似关系。在计算机视觉任务中,正对通常由数据增强技术 [23][37][74] (如旋转图像)指定,以提供实例判别能力。与计算机视觉任务不同,我们寻找集群级别的判别,而不是实例判别。因此,我们需要将属于同一集群的对指定为正对。我们注意到,k 最近邻 (kNN) [61] 被广泛认为属于相同的类/集群,例如,基于 kNN 的分类算法 [89][90] 因此,对于每个 xiX ,我们对其 kNN 中的一个点 xj 进行采样以构造一个正对 P(xi,xj) 。抽样概率是根据 UMAP [55] 定义的。具体来说,点 xi 和点 xj 是正对的概率计算为

pij=(pj|i+pi|j)pj|ipi|j(1)
View SourceRight-click on figure for MathML and additional features. 其中 pj|i 是条件概率,定义为
pj|i=exp(dist(xi,xj)ρiσi)(2)
View SourceRight-click on figure for MathML and additional features. 其中 dist(xi,xj) 表示两点之间的欧几里得距离。这是 ρi 到其最近邻域的距离 xi 。这是 σi 归一化项,计算方法是该点 xi 与其 k 个最近邻配对的概率之和。

 负对


如果只有正对,模型将折叠以将所有点嵌入到同一位置 [35] 。因此,我们需要指定负对来分离不同的点。遵循 对比学习 [23] 中使用的常用策略 ,对于每个点 xi ,其消极对应物被视为所有其他点,不包括积极对应物。由于不同邻居比相似邻居更有可能被采样为负对应对象,因此该策略可确保嵌入中不同邻居的平衡位置,从而缓解模型折叠问题。具体来说,在每次迭代中,训练数据被随机分组为大小相等的批次。在我们的实现中,我们将批次数设置为 10。每个批次都包含 B 点及其正对的对应项。因此,每个批次最终有 2B 点。对于批次中的每个点,其他 2(B - 1) 个点(不包括自身及其正对对应项)被指定为负对 N(xi,xj) 的对应项。


对比损失函数


我们采用 NT-Xent (归一化温度标度交叉熵损失) [72] 作为对比损失函数,并对其进行调整以进行降维。在训练批次中,point zi 的 NT-Xent 定义为

LNT(zi)=logexp(qij/τ)2Bk=11l[ki]exp(qik/τ)(3)
View SourceRight-click on figure for MathML and additional features. 其中 1l[ki]{0,1} 是一个指示函数,计算结果为 1 if kiτ 表示温度参数。 qij 表示 和 之间的 zi zj 相似性,它们分别是 xixj 的嵌入。理想情况下,我们将 zizj [55]
S(zi,zj)={1exp((zizj2ξ))if zizj2ξotherwise(4)
View SourceRight-click on figure for MathML and additional features. 之间的相似性定义为 其中 ξ 是两点之间的最小嵌入距离,其默认值为 0.1。然而,这个定义是不可区分的。相反,我们使用具有一个自由度的 Student t 分布来定义 qij ,以近似 S(zi,zj)
qij=11+a(zizj22)b(5)
View SourceRight-click on figure for MathML and additional features. 其中 ab 由非线性最小二乘法根据 的 S 曲线拟合 来选择 。


与交叉熵损失相比,NT-Xent 增加了一个温度参数 τ 。如上所述,高维空间中一个点的 kNN 可能包含不同集群中的点。简单地在低维空间中保留所有 k 最近邻会将这些误差保留在嵌入结果中。因此,我们希望加强区分负对的能力,尤其是相似度高的负对。考虑到 NT-Xent 相对于负对 N(zi,zj) 相似性的 qij 梯度,

LNT(zi)qij=1τexp(qij/τ)2Bk=11l[ki]exp(qik/τ)(6)
View SourceRight-click on figure for MathML and additional features.

 -
 -
 -
Fig. 2. - The gradients with respect to the negative similarity $q_{ij}$. (a) Gradients of NT-Xent loss under different $\tau$ values.(b) Gradients with respect to the negative similarity after gradient redefinition. On the basis of $\tau=0.15$ and the skewed distribution of SN(−40, 0.11, 0.13).
 图 2.


相对于负相似度 qij 的梯度。(a) 不同 τ 值下 NT-Xent 损失的梯度。(b) 梯度重新定义后相对于负相似性的梯度。根据 τ=0.15 和 SN(−40, 0.11, 0.13) 的偏态分布。


Fig. 2 (a) 显示对应于 的不同值的梯度 τ 。当等于 1 时 τ ,NT-Xent 退化为交叉熵损失。对于梯度到具有高相似度的负对(曲线的右侧部分),当 the τ 变低时,其值也会增加。但是,较高的值 (a τ ) 也会将梯度降低为具有中等相似度的负对。因此,我们选择 τ=0.15 in default。


3.2 通过梯度重新定义改进视觉聚类分离


保留邻域结构并不能确保聚类之间有足够的分离以进行可视化聚类分析。两个被区分的集群在嵌入空间中可能仍然彼此靠近,需要努力在视觉上识别它们。因此,我们需要改进 embedding 结果的视觉聚类分离。


如 所示 Fig. 2(a) ,如果负对的相似度较低,则梯度也会较低。因此,这些负对对嵌入优化的贡献很小。鉴于负对的点通常属于不同的集群,我们可以通过增加这些负对的梯度来增加它们对嵌入的贡献。增加的贡献将导致两个集群之间的距离变大。因此,我们将梯度重新定义为负对,以增强视觉聚类分离。


然而,挑战在于找到负对的适当部分来重新定义。简单地将梯度增加到所有负对会打破正负对之间的平衡。具有高相似性的负对也会在训练中引入误差,因为它们有可能属于同一集群。因此,我们决定将梯度增加到负对,这些负对很有可能被放置在两个不同的集群中。我们从 UCI Repository [7] 中选择 25 个真实世界的数据集,训练 25 个相应的对比学习模型,并对负对进行计数以获得它们的统计数据。在这里,我们使用类标签作为基本实况。因为嵌入已经纠正了部分假邻居,所以我们计算嵌入空间中的相似性 qij ,而不是高维空间中的概率 pijFig. 3 显示大多数负对的相似度低于 0.1。在这种情况下,放置在两个集群中的对数是属于同一集群的对数的 10 倍。


根据统计数据,我们只想将梯度增加到相似度较低的负对,同时保持损失可微分。NT-Xent 表明,相对于负相似性的指数梯度在分离负对时效果很好。因此,我们想在与单独聚类的相似性的低频带的梯度中添加一个相似但较小的分布。但是,直接添加剪切曲线将导致渐变不连续。因此,我们选择正偏态分布,它类似于左侧的指数分布,并继续在右侧为零。具体来说,偏态分布 SN(η,μ,σ) 定义为

sn(qij)=2σΘ(qijμσ)Φ(ηqijμσ)(7)
View SourceRight-click on figure for MathML and additional features. 其中 μ 表示位置, η 是偏度, σ 是严格的正刻度。 Θ() 是标准正态概率密度函数,是 Φ() 标准正态累积分布函数。在我们的实现中,我们设置为 μ 0.11、 σ 0.13 和 η -40。


基于正偏态分布,我们将负对 N(zi,zj) 的梯度重新定义为

L¯(zi)qij:=1τexp(qij/τ)2Bk=11l[kiexp(qik/τ)+αisn(qij)2Bk=11l[ki]sn(qik)(8)
View SourceRight-click on figure for MathML and additional features. 其中 αi 是平衡两个项的尺度的 weight 参数,
αi=5exp(qmaxij/τ)2Bk=11l[ki]exp(qik/τ)2Bk=11l[ki]sn(qik)sn(qmaxij)(9)
View SourceRight-click on figure for MathML and additional features. 其中 qmaxij 是正偏态分布下具有最大梯度的负对。最后,重新定义的梯度曲线如 Fig. 2(b) 所示。

3.3 Link-Based Steering of Embedding

We introduce link-based interactions into the proposed CDR to steer the embedding results. Must link and cannot link constraints are widely used in interactive clustering algorithms because it is convenient to make use of the user's instance-level knowledge [79], [85]. Must link refers to specifying two points to be clustered into the same cluster. On the contrary, cannot link specifies two points that cannot be grouped in a single cluster in the clustering.

Given a pre-trained embedding, the steering process starts by specifying link-based constraints from users. Then, the probability pij between the points of the specified links is modified in the high-dimensional space. Specifically, we set the probability of must link pairs as 1 and that of cannot link pairs as 0. Next, the must link pairs, if specified, are added to the positive pairs. As the number of negative pairs is usually large, adding a few negative pairs would have little impact on the model. Therefore, we add the cannot link pairs into the positive pairs rather than negative pairs. During the training, we set the similarity of a cannot link pair (zi,zj) in the low-dimensional space as 1qij to push them apart. After the embedding is updated, users can specify more link constraints to refine the model iteratively.

3.4 Network Structure and Training Process

Network Structure

The input of our model are high-dimensional data points. For image data, we emply SimCLR [23] to transfer it into a 512-dimensional vector. Similar to SimCLR [23], our training model has an encoder that learns the representation of data and a projection head that embeds the representation into a 2-dimensional embedding space (Fig. 4). The encoder is a Dense Neural Network that consists of four densely-connected layers. The four layers contain 128, 256, 256, and 512 units, respectively. We add a batch normalization layer after each layer to avoid the gradient vanishing. A ReLU activation function is applied to each layer. The projection head contains a densely-connected layer with 512 units, a ReLU activation function, and a densely-connected layer with 2 units.

Fig. 3. - The distribution of negative pairs in similarity. The stacked bars represent the ratio of two kinds of bars.
Fig. 3.

The distribution of negative pairs in similarity. The stacked bars represent the ratio of two kinds of bars.

Fig. 4. - The network structure of CDR.
Fig. 4.

The network structure of CDR.

Training Process

We employ the Adam optimizer for network training, with an initial learning rate of 0.001. The learning rate is successively decayed by 10% when the number of epochs reaches 0.8E and 0.9E. E denotes the total number of epochs, which is set as 1000 in our implementation. The batch size is set as B=n/10, where n is the size of the dataset. Because the trained model is used to embed the training dataset only, all data points are used for training. We adopt a multi-stage training strategy: 1) we use the NT-Xent to warm up the model for 150 epochs; 2) we train the model with gradient redefinition for 700 epochs; and 3) we use the NT-Xent again to obtain a stable embedding with 150 epochs. When users specify link-based constraints, we sample 30% data points for fine-tuning the model. We use fewer epochs than the initial training to update the embeddings. The default is 30 epochs. The first 85% epochs use the proposed gradient redefinition, and the last 15% epochs use the adapted NT-Xent.

3.5 The Prototype Interface

To demonstrate the effectiveness of the proposed interactive visual cluster analysis, we develop a visual interface that integrates CDR with a set of interactive visualizations (Fig. 5). The interface provides a control panel (Fig. 5(a)) for steering parameters. Embedding results are represented in a scatterplot view (Fig. 5(b)). If the dataset contains images, thumbnails of randomly sampled images can be displayed simultaneously in the picture view. In addition, a parallel coordinates plot is utilized to explore the high-dimensional features of data points (Fig. 5(c)). If the dimensionality of a dataset is too high, users can apply Principal Component Analysis [1] to reduce the dimensionality so that only the major information is represented. The link board records the link-based constraints specified by users (Fig. 5(d)). The context of each constrained link is displayed in a thumbnail, where the link is highlighted in a heatmap of the scatterplot view. The four views are coordinated to support the CDR while allowing users to interactively examine the dimensional features and steering the embedding results through link-based interactions.

Fig. 5. - Interactive visual cluster analysis with CDR.
Fig. 5.

Interactive visual cluster analysis with CDR.

SECTION 4

Evaluation and Results

4.1 Quantitative Evaluation

In this section, we use quantitative experiments to evaluate CDR without user steering. The experimental environment uses a desktop PC with Intel Core i9-7900X (3.30 GHz), NVIDIA GeForce GTX 1080 Ti, 128 GB memory, and Windows 10 installed.

Datasets

We use 10 real-world datasets including Animals [26], Cifar10 [46], Indian Food [4], Isolet [33], MNIST [51], Stanford Dogs [44], Texture [7], USPS [38], Weathers [77] and WiFi [12]. The types of data contain image data, tabular data, and text data. All of them have at least two clusters and have class labels as the ground truth.

DR Techniques

For comparison, we select five non-parametric DR techniques, i.e., ISOMAP [73], Landmark ISOMAP (LISOMAP) [70], t-SNE [76], AtSNE [34], and UMAP [55]; and four parametric DR techniqcues, i.e., Parametric t-SNE (PtSNE) [75], Parametric UMAP (PUMAP) [67], Deep Recursive Embedding (DRE) [92], and the Dimensionality Reduction by Learning an Invariant Mapping (Dr-LIM) [36]. Although there are many contrastive learning applications in computer vision, to the best of our knowledge, DrLIM is the only contrastive learning technique for general DR tasks. We employ the implementation of ISOMAP, t-SNE, and UMAP in the sklearn library. For other techniques, we use the code published by their authors. Because we cannot find the code of LISOMAP, we implement it according to the paper. In the ablation experiments, we test the contrastive learning framework with Cross Entropy (CE), NT-Xent (NX-CDR), and gradient redefinition (CDR). All results are without user steering.

Parameters

We set the perplexity of t-SNE, AtSNE and PtSNE as 30, and the size of kNN neighbor in UMAP, PUMAP, ISOMAP and LISOMAP as k=15, following the settings in previous studies [55], [67], [76], [84]. The other parameters of these techniques are set as their default values. In consistency with UMAP, we set k=15 in CE, NX-CDR, and CDR. UMAP, t-SNE and AtSNE are optimized with 1000 iterations to ensure convergence on all tested datasets. For all the parametric techniques, we use the same network architecture as described in Section 3.4 and also trained them for 1000 epochs.

4.1.1 Measures

From literature review [32], [68], we employ six measures to evaluate the DR techniques. Specifically, we select Trustworthiness & Continuity [43] for preserving neighborhood structures, kNN classifier accuracy [27] & Neighbor Hit [59] for correcting false neighbors, and Distance Consistency [71] & Silhouette Coefficient [65] for visual cluster separation. For the first four measures that need to specify the kNN, we set k=15, which is consistent with the tested techniques.

Trustworthiness measures how much the kNN neighborhood of a point in the embedding space reflects the true neighborhood in the high-dimensional space.

Continuity measures how much the kNN neighborhood of a point in the high-dimensional space is preserved in the embedding space.

k-NN classifier accuracy (kNN-CA) measures the accuracy of the kNN-based classification in the embedding space. For each point, we predict its class label by majority voting of its kNN in the embedding space. A high kNN-CA close to 1 indicates a high consistency between points and their kNN in terms of classification.

Neighbor Hit (NH) measures the proportion of points in the kNN that belong to the same cluster as their center point in the embedding space. A high NH that is close to 1 indicates good quality. On the contrary, a low NH that is close to 0 indicates that many false neighbors are preserved in the embedding space.

Distance Consistency (DSC) is the proportion of points which have the same class as their nearest class centroids. DSC is the best separation measure according to Sedlmair and Aupetit [68].

Silhouette Coefficient (SC) measures the difference of between-class and within-class average distances normalized by the maximum of them. Compared to DSC which measures the point-class relationship, SC evaluates separation from a class-class perspective. It is also widely used in measuring visual class separation.

Except for SC, which is ranged from −1 to 1, the other five measures are normalized to [0, 1]. Higher value refers to higher quality for all six measures. It is worth noting that we use class labels as the ground truth clusters in kNN-CA, NH, DSC, and SC. Although this is a common practice in evaluating DR techniques [60], [92], the possible class-cluster mismatching [8] should be aware. In our experiments, we choose datasets with well-known cluster structures to address this issue.

Table 1. Comparisons of trustworthiness
Table 1.- Comparisons of trustworthiness
Table 2. Comparisons of continuity
Table 2.- Comparisons of continuity
Table 3. Comparisons of KNN-CA
Table 3.- Comparisons of KNN-CA
Table 4. Comparisons of neighbor hit
Table 4.- Comparisons of neighbor hit
Table 5. Comparisons of DSC
Table 5.- Comparisons of DSC
Table 6. Comparisons of SC
Table 6.- Comparisons of SC
Table 7. Comparisons of running time(s)
Table 7.- Comparisons of running time(s)

4.1.2 Comparison Results

Table 1 to Table 6 present the quantitative comparison of well-known existing techniques with CDR on 10 datasets with respect to each of the six measures. From Table 1 and Table 2, it is observed that CDR achieves the best average Trustworthiness than other techniques, but its average Continuity is slightly lower than t-SNE, UMAP, PtSNE, and PUMAP. The main reason is that the calculation of Continuity includes false neighbors. Removing false neighbors in the embedding space, although improves the embedding quality, results in a decrease of Continuity. This is verified by the results on correcting false neighbors. According to Table 3 and Table 4, CDR achieves the best performance in most datasets in terms of kNN-CA and Neighbor Hit, demonstrating its improved neighborhood accuracy and ability to correct false neighbors. In terms of visual cluster separation, Table 5 and Table 6 show that CDR outperforms all other techniques in DSC and SC. Fig. 6 shows the embedding results of the selected techniques on three typical datasets. CDR presents a much clearer visual cluster separation in the embedding results. In conclusion, CDR achieves comparable quality to t-SNE and UMAP in terms of preserving neighborhood structures. Considering the false neighbors, CDR outperforms all other techniques in terms of preserving correct neighborhood structures. CDR also achieves the best performance in terms of visual cluster separation.

4.1.3 Results of Ablation Experiments

We compare the performance of CE, NX-CDR, and CDR to evaluate the effectiveness of the adapted NT-Xent and the proposed gradient redefinition. In terms of preserving neighborhood structures, NX-CDR performs better than CE. CDR achieves higher average Trustworthiness but lower average Continuity than NX-CDR (see Table 1 and Table 2). In terms of correcting false neighbors, NX-CDR also performs better than CE. CDR performs better than NX-CDR (see Table 3 and Table 4). Again, the results suggest that Continuity is distorted by false neighbors. In terms of visual cluster separation, CDR outperforms CE and NX-CDR in DSC and SC for all datasets (see Table 5 and Table 6). In conclusion, the adapted NT-Xent has positive effects on preserving correct neighborhood structures. The proposed gradient redefinition is effective in improving visual cluster separation.

4.1.4 Runtime Performance

Table 7 shows the running time of the evaluated techniques. Among them AtSNE and UMAP are the fastest with the cost of slightly decreased quality than t-SNE. CDR runs faster than original t-SNE, but much slower than other non-parametric techniques. Among parametric techniques, CDR runs faster than PtSNE and PUMAP, which are the parametric version of t-SNE and UMAP. In our experiments, we use the same backbone network in DRE and CDR. Therefore, they have similar time performances. DrLIM achieves better time performance than CDR because it has a simpler loss function. Generally, parametric techniques run slower than non-parametric ones due to their training process. However, it is worthwhile to develop parametric techniques not only because of the better embedding qualities, but also parametric techniques offer several opportunities to DR applications. We will discuss these opportunities in Section 5.

4.2 User Study

We have conducted a user study to evaluate the ability of CDR without user steering to present well-separated clusters in terms of visual perception. In particular, we focus on the cluster identification task to test whether projected clusters are well separated and easily recognized by users. We consider two state-of-the-art DR techniques, t-SNE and UMAP, as baseline techniques and compare them with CDR. We put forward two hypotheses to verify:

  • H1: CDR performs better in identifying clusters;

  • H2: CDR consumes less time for identifying clusters;

Tasks

We designed a controlled experiment for evaluating the performance of cluster identification. They were tested using the three DR techniques on 20 datasets. 10 of them are the same as described Section 4.1. The remainings are described in the appendix. The controlled experiment was based on the cluster identification task (E1) described in Xia et al. [84]. We followed the same task design and asked each participant to complete 3(techniques)×20(datasets)=60 trials for identifying clusters.

Participants, Apparatus, and Procedure

We recruited 20 participants (16 males and 4 females) for the user study, aged 23 to 27. None of them reported color blindness or color weakness. All the participants are graduate students with research experience in data visualization. To remove any bias from the analysis processes, we adopted the same testing application described in Xia et al. [84] and integrated CDR and the baseline techniques. All the participants took the experiments one by one on standard desktop computers in our research lab. Each computer is equipped with a standard keyboard, a mouse, and a DELL 27-inch U2720QM monitor. The computer has a 2,560×1,440 screen resolution and chrome browser. The two experiments were conducted based on the same procedure described in Xia et al. [84]. In the controlled experiment, instead of randomly selecting datasets and DR techniques for each trial, we adopted a Latin square design where the datasets were sorted based on a Latin square upon which one of the three DR techniques was randomly selected for a trial. Compared with Xia et al. [84], this design minimizes the potential learning effects resulting from visualizing the datasets in certain orders. To evaluate the results of the two experiments, we adopted similar analysis conditions, statistical measures, and analysis approaches described in Xia et al. [84]. Here, the Shapiro-Wilk test and Paired Wilcoxon test were added to validate the normality in precision/recall and completion time, respectively.

Results

The results of precision and recall values for the controlled experiment are displayed in Fig. 7(a) and Fig. 7(b). Among the three techniques, CDR has the best precision (80.68%) and recall (69.29%). There is a relatively small difference between t-SNE and UMAP on precision (65.64% and 66.25%, respectively) and recall (55.49% and 55.50%, respectively). The Friedman tests show statistical significance of precision (X2(2)=14.8,p<0.05) and recall (X2(2)=17.5,p<0.05). Therefore, hypothesis H1 is confirmed. In terms of the average completion time (Fig. 7(c)), both CDR and UMAP show better performance (14.481s and 14.603s, respectively) than t-SNE (16.592s). The Friedman tests show statistical significance of average completion time (X2(2)=10.8,p<0.05). However, although CDR has a slightly shorter completion time than UMAP, there is no statistical significance between them. As a consequence, the hypothesis H2 was partially confirmed.

Fig. 6. - Comparison among different techniques on datasets isolet, MNIST and texture.
Fig. 6.

Comparison among different techniques on datasets isolet, MNIST and texture.

We conducted a post-interview to subjectively compare CDR with t-SNE and UMAP by reviewing their results side by side for each dataset. All participants agree that CDR performs better than t-SNE and UMAP in terms of presenting visually separated cluster patterns. They point out that the clearer visual separation makes them more confident in identifying clusters. However, they also feel somewhat uncertain about the results, as the ground truth is unknown. Some participants point out that providing the quality measures, such as Trustworthiness and Continuity, can strengthen their confidence.

Fig. 7. - The results of the user study. (a)-(c) the statistics of precision, recall, and completion time for the objective task.
Fig. 7.

The results of the user study. (a)-(c) the statistics of precision, recall, and completion time for the objective task.

4.3 Case Studies

In this section, we demonstrate how users can use the visual interface and the link-based interactions to steer the embeddings for visual cluster analysis. Using the visual interface, we carried out analyses on two datasets: an image dataset Animals [26] and a tabular dataset WiFi [12].

Animals

The Animals dataset contains 10000 animal images. Each image is encoded into a 512-dimensional feature vector as mentioned in Section 3.4. Using the default parameters, the data were embedded into the scatterplot view, as shown in Fig. 1(e). 11 clusters were observed. However, there was a much smaller cluster at the top left corner. Hoovering over points in this cluster and its neighboring two clusters, we found both the small cluster and the cluster at the top left corner contained butterfly images. Switched to the image mode, we found that the larger cluster contained images where butterflies dominate the foreground, while the smaller cluster contained images where the backgrounds dominate. That is why they formed two clusters. However, considering the interest is in animals, we would like to merge the two clusters into one. Therefore, we added a must link between the two clusters. As merging two clusters is essential to align the two clusters' centers, we chose the two ends of the must link to be the butterfly images near the two cluster centers. After re-embedding, the two clusters were successfully merged (Fig. 1(g)). Some spider images were observed at the boundary of the merged cluster (e.g., Fig. 1(g) A). We thus further improved the results by adding more links. We first added a cannot link between one of the spider images at the boundary and a nearby butterfly image to push away the mis-clustered spider images. We then added a must link between one of the mis-clustered spider images and a spider image near the center of the spider cluster to attract them together. Fig. 1(h) shows the results. With only the three links added, our method successfully merged the two butterfly clusters and separated the spider images. This demonstrated the effectiveness of adding links to improve clustering results. However, added links are expected to assist the similarity measure to improve the clustering results, rather than being the dominant factor to determine the clustering results. We thus expect some resistance of our method to “bad” links added. To verify this, we added a must link between two significantly different clusters: chicken and dog. As shown at the top of Fig. 8(b), the overall clustering pattern remained unchanged. It was not until three must links were added (Fig. 8(b)) then the chicken and the dog clusters were merged (Fig. 8(c)), in comparison to the only one must link needed to merge the two butterfly clusters. This demonstrated some degree of resistance of our method to “bad” links. Meanwhile, the Trustworthiness and Continuity measures were also significantly decreased (from 0.910 to 0.757 and 0.910 to 0.853. respectively).

Fig. 8. - The case of the animals dataset. (a) The species of animals in each cluster. (b) Three must links are imposed between chicken and dog cluster in three interactions. (c) Updated embedding result.
Fig. 8.

The case of the animals dataset. (a) The species of animals in each cluster. (b) Three must links are imposed between chicken and dog cluster in three interactions. (c) Updated embedding result.

Fig. 9. - The case of the WiFi dataset. (a)-(c) the initial embedding, embedding after one cannot link, and embedding after one must link, respectively. (d)-(f) the parallel coordinates plots of C1, C2, and C3, respectively.
Fig. 9.

The case of the WiFi dataset. (a)-(c) the initial embedding, embedding after one cannot link, and embedding after one must link, respectively. (d)-(f) the parallel coordinates plots of C1, C2, and C3, respectively.

WiFi

We then moved on to analyze the clustering results on the WiFi dataset. This dataset contains 2000 vectors recording the signal strengths at 2000 locations in four rooms (500 locations in each room). There are 7 routers in the space, from which the received signal strengths constitute the 7 dimensions of each vector. A good clustering of these vectors is expected to correspond to the four different rooms. Fig. 9(a) shows the embedding results using the default parameters. Five clusters were observed rather than four as expected. The problem seemed to lie in the three clusters (C1, C2, C3) that were closely positioned at the bottom right. We would like to find the reasons and thus had a closer examination of their parallel coordinates. It is observed that the data points in C1 and C2 did exhibit similar patterns in parallel coordinates (Fig. 9(d), (e)). The distributions along the dimensions had large overlaps, especially along A0, A1, A2, A3, and A4. This means a large number of data points (locations) in C1 and C2 had similar distances to these routers. We considered that is why C1 and C2 were closely positioned in the embedding space. To refine the CDR to better distinguish locations in the two clusters, we added a cannot link. The re-embedding results showed the successful separation of the two clusters (Fig. 9(b)). We then examined C3. C3 had much smaller number of data points, which indicated it was separated from its main cluster. Comparing the parallel coordinates between C3 (Fig. 9(f)) and C1/C2, it can be observed that C3 had very different distributions along A0 and A3 compared with C1 and C2. However, along the other 5 dimensions, the distributions of C3 were fully enclosed by the distributions of C2. Together with the closest positions of C3 and C2 in the embedding space, these were strong indications that the locations in C3 should be in the same room as those in C2. Their differences along A0 and A3 explained their separation, but these were more likely due to obstacles in the room. Based on the reasoning, we would like to refine the CDR to merge C2 and C3. We thus added a must link between C2 and C3. After re-embedding, the results had four well -separated clusters as expected (Fig. 9(c)), demonstrating the improved dimensionality reduction for visual cluster analysis.

Lesson Learned

Our case studies show that the link-based interaction is very efficient. To merge similar clusters or separate dissimilar clusters, only one or two links are needed. Regarding “bad” links, such as must links between dissimilar clusters, our technique shows some degree of resistance. Three or more links are required for counterfactual interactions. Users should be aware of the power of the interactions when exploring cluster patterns. A suggestion is using the embedding quality measures as the indication of whether the interaction is correct.

SECTION 5

Discussions

Opportunities of Parametric Techniques

Although parametric techniques are generally slower than non-parametric ones, they provide an explicit representation of the learned manifold mapping, which can be reused in new datasets and shared among multiple users. This brings several application opportunities for parametric techniques, such as fast embedding of new out-of-sample data [67], [75], federated learning-based joint projection [83], and reusing edited embedding on similar datasets. These applications are non-trivial for non-parametric techniques.

Comparison Between Parametric and Non-Parametric Dr Techniques

On the one hand, the parametric DR techniques run much slower than non-parametric ones, such as AtSNE and UMAP. The main reason is that the training process for parametric techniques is time-consuming. On the other hand, the proposed CDR achieves better embedding quality than all evaluated techniques in terms of both accuracy and visual cluster separation. Critical issues of non-parametric techniques, including the false neighbors and missing neighbors. With well-designed contrastive settings, we alleviated this issue and thus improved the embedding quality. We also would like to point out that only a parametric framework cannot address the issues of false neighbors and missing neighbors. For example, PtSNE [75] and PUMAP [67] performs worse than their non-parametric counterparts, t-SNE, and UMAP, respectively. The proposed CDR is the first parametric DR technique that achieves comparable embedding quality with state-of-the-art non-parametric techniques.

SECTION 6

Conclusion

In this paper, we propose an interactive visual cluster analysis approach. We introduce contrastive learning into dimensionality reduction, propose a gradient redefinition technique, and enable link-based interactions on the embedding. The experiments and user study show that the proposed CDR outperforms existing DR techniques in preserving neighborhood structures, correcting false neighbors, and visually separating clusters. Case studies demonstrate the effectiveness of the link-based interactions.

ACKNOWLEDGMENTS

The authors would like to thank the helpful comments from the anonymous reviewers. This work is supported by the National Natural Science Foundation of China (No. 61872389, 62132017, 61872388), and the High Performance Computing Center of Central South University.

 引用

References is not available for this document.