HVT: Collection of functions used to build hierarchical topology preserving maps

Zubin Dowlaty, Shubhra Prakash, Sangeet Moy Das, Shantanu Vaidya, Praditi Shah, Srinivasan Sudarsanam, Somya Shambhawi, Pon Anureka Seenivasan, Vishwavani, Bidesh Ghosh, Alimpan Dey

2024-09-10

1. Abstract

The HVT package is a collection of R functions for building topology preserving maps for rich multivariate data analysis; Figure 1 shows an example, a 3D torus map generated with the package. The package is geared towards big data, that is, data sets with a large number of rows. The functions for the typical workflow are organized as follows:

  1. Data Compression: Vector quantization (VQ) or hierarchical vector quantization (HVQ), using means or medians. This step compresses the rows (long data frame) using a compression objective.

  2. Data Projection: Dimension projection of the compressed cells to 1D, 2D, or an interactive surface plot using Sammon’s non-linear mapping algorithm. This step creates the topology preserving map coordinates (also called mathematical embeddings) in the desired output dimension.

  3. Tessellation: Create the cells required for object visualization using the Voronoi tessellation method. The package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map, which is useful for semi-supervised tasks.

  4. Scoring: Score new or test data sets and record their cell assignments using the map objects from the above steps, in a sequence of maps if required.

  5. Temporal Analysis and Visualization: A collection of new functions that extends the HVT package to time series data by analyzing underlying patterns, calculating transition probabilities, and visualizing the flow of data over time.
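
The first two workflow steps can be sketched with general-purpose R functions. This is a minimal illustration, not the HVT package API: `kmeans()` stands in for the vector quantization step, `MASS::sammon()` performs the Sammon projection, and the synthetic `dataset` and the choice of 30 cells are assumptions made for the example.

```r
# Illustrative stand-ins only; none of these calls are HVT package functions.
library(MASS)  # provides sammon()

set.seed(42)
# Synthetic "long" data frame: 1000 rows, 3 numeric columns (assumed data).
dataset <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000))

# Step 1 (compression): quantize the rows into 30 cells using k-means.
n_cells <- 30
vq <- kmeans(dataset, centers = n_cells, nstart = 5, iter.max = 50)
centroids <- vq$centers          # one row per cell
cell_of_row <- vq$cluster        # cell assignment of every input row

# Step 2 (projection): embed the cell centroids in 2D with Sammon's mapping.
proj <- sammon(dist(centroids), k = 2, trace = FALSE)
coords_2d <- proj$points         # n_cells x 2 matrix of map coordinates
```

The 2D coordinates in `coords_2d` are the kind of input a Voronoi tessellation step would then turn into cells.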

The HVT package allows the creation of visually striking tessellations that showcase topology preserving maps. Below is an image depicting a tessellation of a torus.

Figure 1: Heatmap Visualization of a Torus with 900 Cells

2. Vignettes

Following are the links to the vignettes for the HVT package:

  1. HVT Vignette: Contains descriptions of the functions used for vector quantization and construction of Hierarchical Voronoi Tessellations for data analysis.

  2. HVT Model Diagnostics Vignette: Contains descriptions of functions used to perform model diagnostics and validation for the HVT model.

  3. HVT Scoring Cells with Layers using scoreLayeredHVT: Contains descriptions of the functions used for scoring cells with layers based on a sequence of maps using scoreLayeredHVT.

  4. Temporal Analysis and Visualization: Leveraging Time Series Capabilities in HVT: Contains descriptions of the functions used for analyzing time series data and its flow maps.

  5. Visualizing LLM Embeddings using HVT (Hierarchical Voronoi Tessellation): Contains the implementation and analysis of hierarchical clustering using the clustHVT function to evaluate and visualize token embeddings generated by OpenAI.

  6. Implementation of t-SNE and UMAP in trainHVT function: Contains enhancements to the trainHVT function with advanced dimensionality reduction techniques such as t-SNE and UMAP, and includes a table of evaluation metrics to improve analysis, visualization, and interpretability.

3. Version History

HVT (v24.9.1) | What’s New?

4th September, 2024

In this version of the HVT package, the following new features and vignettes have been introduced:

Features

  1. Implementation of t-SNE and UMAP in trainHVT: This update incorporates dimensionality reduction methods like t-SNE and UMAP in the trainHVT function, complementing the existing Sammon’s projection. It also enables the visualization of these techniques across all hierarchical levels within the HVT framework.

  2. Implementation of dimensionality reduction evaluation metrics: This update adds dimensionality reduction evaluation metrics to the output list of the trainHVT function. The metrics are organized into two levels: Level 1 (L1) and Level 2 (L2). The L1 metrics, mentioned below, address the key areas of dimensionality reduction so that the evaluation is comprehensive.

  3. Introduction of clustHVT function: This update introduces a new function, clustHVT, designed for hierarchical clustering analysis. The function clusters cells only when the hierarchy level is set to 1, determining the optimal number of clusters by evaluating various indices. Based on user input, it performs hierarchical clustering using AGNES with the default ward.D2 method. The output includes a dendrogram and an interactive 2D clustered HVT map that reveals cell context on hover. The function is not applicable when the hierarchy level is greater than 1.
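
The clustering idea behind clustHVT can be illustrated with base R. The sketch below is not the clustHVT implementation: it swaps in `stats::hclust()` with `method = "ward.D2"` in place of AGNES, uses random numbers as a hypothetical stand-in for level-1 cell centroids, and fixes the number of clusters by hand instead of evaluating indices.

```r
set.seed(7)
# Hypothetical stand-in for level-1 cell centroids: 40 cells, 4 features.
centroids <- matrix(rnorm(40 * 4), nrow = 40,
                    dimnames = list(paste0("cell_", 1:40), NULL))

# Agglomerative hierarchical clustering with the ward.D2 linkage.
tree <- hclust(dist(centroids), method = "ward.D2")

# Cut the dendrogram into a user-chosen number of clusters (assumed k = 3).
k <- 3
cluster_id <- cutree(tree, k = k)

# Number of cells that fall into each cluster.
cluster_sizes <- table(cluster_id)
# plot(tree)  # uncomment to draw the dendrogram
```

Each entry of `cluster_id` labels one cell, which is the information a clustered map would use to color cells.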

Vignettes

  1. Implementation of t-SNE and UMAP in trainHVT function: This vignette showcases the integration of t-SNE and UMAP in the trainHVT function, offering a comprehensive guide on how to apply and visualize these dimensionality reduction techniques. It also covers the dimensionality reduction evaluation metrics and provides insights into their interpretation.

  2. Visualizing LLM Embeddings using HVT (Hierarchical Voronoi Tessellation): This vignette outlines the process of analyzing OpenAI-generated token embeddings using the HVT package, covering data compression, visualization, and hierarchical clustering, as well as a comparison of domain name assignments for clusters. It examines HVT’s effectiveness in preserving contextual relationships between embeddings. Additionally, it provides a brief overview of the newly added clustHVT function and its parameters.

HVT (v24.5.2)

2nd May, 2024

In this version of the HVT package, the following new features have been introduced:

  1. Updated Nomenclature: To make the function names more consistent and intuitive, we have renamed the functions throughout the package. Given below are a few instances.

  2. Restructured Functions: The functions have been rearranged and grouped into new sections, which are highlighted on the index page of the package’s PDF documentation. Given below are a few instances.

  3. Enhancements: The pre-existing functions, hvtHmap and exploded_hmap, have been combined and incorporated into the plotHVT function. Additionally, plotHVT can now perform 1D plotting.

  4. Temporal Analysis

Below are the new functions and their brief descriptions:

HVT (v23.11.02)

17th November, 2023

This version of the HVT package adds the scoreLayeredHVT function, which scores cells with layers based on a sequence of maps. The steps to create the successive set of maps are given below.

  1. Map A - The output of the trainHVT function trained on the parent data.

  2. Map B - The output of the trainHVT function trained on the ‘data with novelty’ created by the removeNovelty function.

  3. Map C - The output of the trainHVT function trained on the ‘data without novelty’ created by the removeNovelty function.

The scoreLayeredHVT function uses these three maps to score the test datapoints.

The diagram below illustrates these steps.

Figure 2: Data Segregation for scoring based on a sequence of maps using scoreLayeredHVT()
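
One plausible way to score against a sequence of maps can be sketched in base R. The routing rule here is an assumption, not a description of scoreLayeredHVT: a test point is first assigned to its nearest cell in Map A, then scored against Map B if that cell was flagged as a novelty cell and against Map C otherwise. The helpers `nearest_cell()` and `score_layered()`, the random centroid matrices, and the chosen novelty cells are all hypothetical.

```r
# Index of the nearest centroid (squared Euclidean distance) for each row.
nearest_cell <- function(points, centroids) {
  apply(points, 1, function(p) which.min(colSums((t(centroids) - p)^2)))
}

set.seed(1)
map_a <- matrix(rnorm(10 * 2), ncol = 2)            # centroids from parent data
novelty_cells <- c(2, 5)                            # Map A cells flagged as novelty
map_b <- matrix(rnorm(3 * 2, mean = 4), ncol = 2)   # from 'data with novelty'
map_c <- matrix(rnorm(8 * 2), ncol = 2)             # from 'data without novelty'

# Assumed routing: Map A first, then Map B or Map C depending on the cell hit.
score_layered <- function(test, map_a, map_b, map_c, novelty_cells) {
  cell_a <- nearest_cell(test, map_a)
  is_novel <- cell_a %in% novelty_cells
  second_cell <- integer(nrow(test))
  if (any(is_novel)) {
    second_cell[is_novel] <- nearest_cell(test[is_novel, , drop = FALSE], map_b)
  }
  if (any(!is_novel)) {
    second_cell[!is_novel] <- nearest_cell(test[!is_novel, , drop = FALSE], map_c)
  }
  data.frame(cell_a = cell_a,
             second_map = ifelse(is_novel, "B", "C"),
             second_cell = second_cell)
}

test_points <- matrix(rnorm(5 * 2), ncol = 2)
scores <- score_layered(test_points, map_a, map_b, map_c, novelty_cells)
```

Each row of `scores` records the Map A cell, the second map the point was routed to, and the cell it landed in there.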

HVT (v22.12.06)

6th December, 2022

This version of the HVT package offers features for both training an HVT model and eliminating outlier cells from the trained model.

  1. Training or Compression: The initial step is to train an HVT model on the parent data using the trainHVT function, specifying the desired compression percentage and quantization error.

  2. Remove novelty cells: After training, outlier cells can be identified manually from the 2D HVT plot. These outlier cells can then be passed to the removeNovelty function, which produces two datasets in its output: one containing ‘data with novelty’ and the other containing ‘data without novelty’.
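
The two steps above can be mimicked with base R to show what each one produces. This is a sketch under stated assumptions and does not call trainHVT or removeNovelty: `kmeans()` stands in for training, the per-cell quantization error is taken to be the mean distance of a cell’s rows to its centroid, the 0.8 threshold is arbitrary, and the novelty cells are picked programmatically where a user would pick them from the plot.

```r
set.seed(3)
# Synthetic parent data: 200 regular rows plus 5 far-away rows (assumed outliers).
parent <- rbind(matrix(rnorm(200 * 2), ncol = 2),
                matrix(rnorm(5 * 2, mean = 8), ncol = 2))

# Stand-in for training: quantize the parent data into 12 cells.
vq <- kmeans(parent, centers = 12, nstart = 5, iter.max = 50)

# Assumed definition: per-cell error = mean distance of rows to their centroid.
dist_to_centroid <- sqrt(rowSums((parent - vq$centers[vq$cluster, ])^2))
cell_error <- tapply(dist_to_centroid, vq$cluster, mean)

# Share of cells whose error is within an (arbitrary) threshold.
quant_err_threshold <- 0.8
pct_cells_within <- 100 * mean(cell_error <= quant_err_threshold)

# Hypothetical manual choice: the cells that contain the far-away rows.
novelty_cells <- unique(vq$cluster[201:205])

# Split the parent data by whether each row belongs to a novelty cell.
is_novel <- vq$cluster %in% novelty_cells
data_with_novelty    <- parent[is_novel, , drop = FALSE]
data_without_novelty <- parent[!is_novel, , drop = FALSE]
```

The two resulting matrices correspond to the inputs for Map B and Map C in the layered scoring workflow.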

4. Installation of HVT (v24.9.1)

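# Requires the devtools package; install it first with install.packages("devtools") if needed
library(devtools)
# Install HVT from the Mu-Sigma GitHub repository
devtools::install_github(repo = "Mu-Sigma/HVT")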