This tutorial shows how to make a UMAP plot in R. A UMAP plot is a scatter plot of a low-dimensional embedding of high-dimensional data, which makes it a convenient way to look for groups and possible outliers among samples.
UMAP, or "Uniform Manifold Approximation and Projection," is a non-linear dimensionality reduction method, similar to tSNE, that is often used to visualize large datasets. In this tutorial, we'll learn how to run UMAP in R and how to use ggplot2 to create a UMAP plot.
Data and Packages Loading
We will use the Palmer Penguins dataset to make a UMAP plot in R, running UMAP with the R package umap. Let us load the packages we need and set a simple black-and-white theme for ggplot2 with the theme_set() function.
library(tidyverse)
library(palmerpenguins)
# install.packages("umap")
library(umap)
theme_set(theme_bw(18))
We will run UMAP on the numerical columns of the Palmer Penguins dataset and treat the non-numerical columns as metadata (as we did for the tSNE analysis in R). Let's start by removing rows with missing data and adding a unique row ID.
penguins <- penguins %>%
  drop_na() %>%
  select(-year) %>%
  mutate(ID = row_number())
penguins %>% head()
## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## 1 Adelie  Torge…           39.1          18.7              181        3750 male
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 5 Adelie  Torge…           39.3          20.6              190        3650 male
## 6 Adelie  Torge…           38.9          17.8              181        3625 fema…
## # … with 1 more variable: ID
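Before dropping rows, it can help to see where the missing values are and how many rows are lost. The sketch below is one way to check this. It reads the untouched data as palmerpenguins::penguins so that it does not depend on the overwritten penguins object above; the names raw, na_counts and cleaned are only illustrative.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Start from the package's original data, not the modified copy
raw <- palmerpenguins::penguins

# Number of missing values in each column
na_counts <- colSums(is.na(raw))
print(na_counts)

# Compare row counts before and after removing incomplete rows
cleaned <- raw %>% drop_na()
cat("rows before:", nrow(raw), " rows after:", nrow(cleaned), "\n")
```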
Let's create a data frame that holds the categorical variables together with the unique row ID.
penguins_meta <- penguins %>% select(ID, species, island, sex)
Using the umap package to perform UMAP
Let's use select() with where(is.numeric) to pick the numerical columns, standardize them with the scale() function, and then run umap() on the result. The ID column is moved to the row names first so that it is not treated as a feature.
set.seed(142)
umap_fit <- penguins %>%
  select(where(is.numeric)) %>%
  column_to_rownames("ID") %>%
  scale() %>%
  umap()
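To see why the scale() step matters, here is a small base R sketch on a made-up matrix whose two columns are on very different scales. The values and the name toy are invented for illustration. After scaling, each column has mean 0 and standard deviation 1, so a column measured in grams no longer dominates the distances that UMAP works with.

```r
# Toy matrix: values are made up, only the difference in scale matters
toy <- cbind(
  bill_length_mm = c(39.1, 46.5, 50.0, 36.7),
  body_mass_g    = c(3750, 5200, 3800, 3450)
)

# Center each column to mean 0 and rescale it to standard deviation 1
scaled <- scale(toy)

# Column means are (numerically) zero and column sds are one
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```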
The umap result object is a list. Its layout element is a matrix whose two columns are the UMAP components we are interested in. We extract them into a data frame and then join them with the metadata.
umap_df <- umap_fit$layout %>%
  as.data.frame() %>%
  rename(UMAP1 = "V1", UMAP2 = "V2") %>%
  mutate(ID = row_number()) %>%
  inner_join(penguins_meta, by = "ID")
umap_df %>% head()
##        UMAP1     UMAP2 ID species    island    sex
## 1  -7.949633 -1.387130  1  Adelie Torgersen   male
## 2  -6.850185 -1.685802  2  Adelie Torgersen female
## 3  -6.753245 -2.485241  3  Adelie Torgersen female
## 4  -9.327034 -1.900235  4  Adelie Torgersen female
## 5 -10.353931 -1.381105  5  Adelie Torgersen   male
## 6  -7.273715 -1.689724  6  Adelie Torgersen female
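The call above runs umap() with the package defaults. If the embedding looks too fragmented or too compressed, the settings can be changed by copying umap.defaults and passing the copy through the config argument. The sketch below is only an illustration: it uses the built-in iris data to stay self-contained, and the values chosen for n_neighbors and min_dist are arbitrary, not recommendations. Note that n_neighbors has to be smaller than the number of rows in the data.

```r
library(umap)

# Copy the package defaults and change a few settings (arbitrary values)
custom_config <- umap.defaults
custom_config$n_neighbors <- 30    # package default is 15
custom_config$min_dist <- 0.3      # package default is 0.1
custom_config$random_state <- 142  # fixed seed for a repeatable layout

# iris is used here only to keep the sketch self-contained
iris_scaled <- scale(iris[, 1:4])
umap_custom <- umap(iris_scaled, config = custom_config)

# The layout is a matrix with one row per sample and two columns
print(class(umap_custom$layout))
print(dim(umap_custom$layout))
```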
Scatter plot between two UMAP components (UMAP plot).
A UMAP plot is a scatter plot of the two UMAP components, with points styled by the variables of interest. In this example, color is mapped to the species variable and shape to the sex variable.
umap_df %>%
  ggplot(aes(x = UMAP1, y = UMAP2, color = species, shape = sex)) +
  geom_point() +
  labs(x = "UMAP1", y = "UMAP2", subtitle = "UMAP plot")
ggsave("UMAP_plot_example1.png")
Our UMAP plot looks like this. Note that UMAP is an unsupervised technique, and it has nicely separated three groups corresponding to the species variable in the data.
UMAP plot in R: Example 1
Example 2 of the UMAP plot in R
In the second example we use the same UMAP components, but this time we facet by the island variable to better see the relationship between species and island.
umap_df %>%
  ggplot(aes(x = UMAP1, y = UMAP2, color = species)) +
  geom_point(size = 3, alpha = 0.5) +
  facet_wrap(~island) +
  labs(x = "UMAP1", y = "UMAP2", subtitle = "UMAP plot") +
  theme(legend.position = "bottom")
ggsave("UMAP_plot_example2.png")
Example 2 of the UMAP plot in R
UMAP plot to identify potential sample mix-ups or outliers
One of the main benefits of unsupervised dimensionality reduction approaches such as UMAP or tSNE is that they can reveal patterns in the data and prompt us to re-examine our dataset annotations. For example, in this Palmer Penguins data a few Chinstrap penguin samples (in green) sit among the Adelie samples (in red). This might indicate sample annotation errors or outliers.
library(ggforce)
umap_df %>%
  ggplot(aes(x = UMAP1, y = UMAP2, color = species, shape = sex)) +
  geom_point() +
  labs(x = "UMAP1", y = "UMAP2", subtitle = "UMAP plot") +
  geom_circle(aes(x0 = -6, y0 = -1.8, r = 0.65),
              color = "green",
              inherit.aes = FALSE)
ggsave("umap_plot_to_identify_outlier_samples.png")
UMAP Plot to Identify Potential sample mix-ups
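Circling suspicious points by hand requires reading coordinates off the plot. As a rough alternative, the sketch below flags points whose nearest neighbors in the embedding mostly carry a different label. The helper flag_label_mismatch() is hypothetical (it is not part of umap or ggforce), the choice of k is arbitrary, and the toy coordinates are made up. Treat it as a starting point for inspection, not as a test for annotation errors.

```r
# Hypothetical helper: flag points whose k nearest neighbours in the
# embedding mostly have a different label than the point itself.
flag_label_mismatch <- function(coords, labels, k = 5) {
  d <- as.matrix(dist(coords))   # pairwise Euclidean distances
  diag(d) <- Inf                 # a point is not its own neighbour
  labels <- as.character(labels)
  vapply(seq_len(nrow(d)), function(i) {
    nn <- order(d[i, ])[seq_len(k)]       # indices of the k nearest points
    mean(labels[nn] != labels[i]) > 0.5   # majority has another label?
  }, logical(1))
}

# Toy embedding with made-up coordinates: two tight groups, with one
# "B" point (row 5) sitting inside the "A" group.
toy <- data.frame(
  UMAP1   = c(0, 0.1, 0.2, 0.1, 0.15, 5, 5.1, 5.2, 5.1),
  UMAP2   = c(0, 0.1, 0,   0.2, 0.1,  5, 5.1, 5,   5.2),
  species = c("A", "A", "A", "A", "B", "B", "B", "B", "B")
)
toy$flagged <- flag_label_mismatch(toy[, c("UMAP1", "UMAP2")],
                                   toy$species, k = 3)
print(toy[toy$flagged, ])
```

With the data from this tutorial, the same helper could be called as flag_label_mismatch(umap_df[, c("UMAP1", "UMAP2")], umap_df$species), and the flagged rows looked up by their ID.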