R PCA – Need Help Interpreting Featureplot Legend


Previously, I asked this question on StackOverflow and was directed here: https://stackoverflow.com/questions/71797502/need-help-interpreting-featureplot-legend

I am using the FeaturePlot command, which is part of Seurat, to compare the distribution of two genes. This is using non-scaled data. Can anyone help me interpret the legend? I do not know where the axis values come from.

Example:

I am not sure what else I can include to make this easier. If anyone needs additional details please let me know.

Thanks!

Update 04/11 – more details

This is RNA seq data. I am exploring which genes are present in this cell population.

Seurat has a feature that lets you cluster similar populations of cells based on their features (i.e., gene expression). The clustering here isn't great because I am working with a very small number of cells, but we have 3 populations.

To get to this UMAP, I normalized and scaled my data using a command in the Seurat package called SCTransform.

So, going back to my earlier question: I have used this UMAP and overlaid the gene expression (here just SHH and SANT1) to better assess the relationship between the clustering and the genes being expressed in those clusters. Experimentally, I know some of these genes should be expressed most abundantly in a particular type of cell, and I am checking whether Seurat picks that up.

Going back to my original example: in the fourth plot, I do not know why the x- and y-axes both go up to 10. In fact, they always go up to 10 regardless of which genes I am looking at. It seems to me that they are measuring color, or rather the intensity of the gene expression in a particular cell, and that this scale happens to be fixed.

I hope this helps! If anything else is needed just ask.

Updated 04/15/22

Big thanks to dipetkov as their idea led me to the answer!

My original question was interpreting this kind of plot:

Seurat assigns values to cells based on their gene expression. When looking at a single gene, these values can be 0, 4, and 9. When you're looking at a plot that features two genes overlapping, the values can also include 40, 44, 49, 90, 94, and 99.
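As a rough illustration of how integer levels like 0, 4, and 9 could arise, here is a minimal base-R sketch. It rests on an assumption that this post does not confirm: that each gene's expression is first cut into three equal-width bins, and that the bin index is then rescaled to the 0-9 range and rounded. The vector `expr` is made-up data, not taken from my cells.

```r
# Made-up expression values for one gene across six cells (illustration only)
expr <- c(0, 0.5, 1.2, 2.8, 3.0, 0)

# Assumption: expression is cut into 3 equal-width bins, indexed 1, 2, 3
bins <- as.numeric(cut(expr, breaks = 3))

# Assumption: bin indices are min-max rescaled to 0-9 and rounded.
# The middle bin gives 4.5, which R rounds to the even digit 4.
level <- round(9 * (bins - min(bins)) / (max(bins) - min(bins)))

level
#> [1] 0 0 4 9 9 0
```

Under these assumptions a gene can only ever land on one of three levels, which would be consistent with the 0, 4, and 9 seen above.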

This is what things look like in R when analyzing a single gene's (CHIR) expression:

When we analyze two genes (CHIR & IWP2) the numbers look like this:

These values map directly to the colors on the plot. The values 40, 44, 49, 90, 94, and 99 are just the expression levels of the two genes in that particular cell, written side by side. A cell with a CHIR and IWP2 expression of 9 would be 99 (yellow), whereas a cell with a CHIR expression of 9 and an IWP2 expression of 0 would be 9 (09). This is because the first number represents red and the second number represents green. The opposite of the previous example, high IWP2 and no CHIR, would be 90.
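The side-by-side arithmetic can be sketched in a few lines of base R. This is only an illustration that reproduces the examples in the paragraph above; `blend_code` is a hypothetical helper I made up for this purpose, not a Seurat function, and the level vectors are invented.

```r
# Hypothetical helper: combine two per-gene levels (each 0-9) into one code.
# The second gene goes in the tens digit, the first gene in the ones digit,
# which matches the examples above (CHIR = 9, IWP2 = 0 gives 9).
blend_code <- function(level_a, level_b) {
  level_a + 10 * level_b
}

# Invented levels for five cells
chir <- c(9, 9, 0, 4, 0)
iwp2 <- c(9, 0, 9, 4, 0)

code <- blend_code(chir, iwp2)
code
#> [1] 99  9 90 44  0

# Going the other way: split a code back into its two digits
decoded <- data.frame(
  code = code,
  chir = code %% 10,   # ones digit
  iwp2 = code %/% 10   # tens digit
)
decoded
```

Reading a code is then just a matter of taking the tens digit for one gene and the ones digit for the other.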

Lastly, there do not seem to be any intermediate colors other than yellow. A cell expressing a low amount of both CHIR and IWP2 (44) would be colored light green, as green takes priority over red here.

TL;DR: The x- and y-axes are misleading. Seurat assigns values to cells based on their gene expression. These values can only be 0, 4, 9, 40, 44, 49, 90, 94, and 99. Values above 9 (excluding 90) are two gene colors combined.

Best Answer

Seurat is a very specialized R package, so it's probably best to create an issue on GitHub to ask this question.

In the meantime, I'll show you how to figure out what data is shown in the plot. I don't know anything about cell biology, so it will be up to you to figure out what the data means.

library("Seurat")
library("tidyverse")

# I use a small dataset that comes with the Seurat package.
data("pbmc_small")

# Run UMAP on the first 5 PCs
pbmc_small <- RunUMAP(
  object = pbmc_small,
  dims = 1:5
)

# Generate patchwork of 4 ggplots
p <- FeaturePlot(
  object = pbmc_small,
  features = c("PPBP", "IGLL5"),
  reduction = "umap",
  blend = TRUE
)

# We can extract and look at the plots one by one
p1 <- p[[1]]
p2 <- p[[2]]
p3 <- p[[3]]
# The last plot is the color legend and is not interesting
p4 <- p[[4]]

p1

# The plot has the two UMAP dimensions, UMAP_1 and UMAP_2,
# on the x and y axis and colors the points according to PPBP.
head(p1$data)
#>                  UMAP_1    UMAP_2 ident PPBP
#> ATGCCAGAACGACT 4.692863  1.759652     0    0
#> CATGGCCTGTGCAT 5.494108  1.453728     0    0
#> GAACCTGATGAACC 2.188469 -5.069190     0    4
#> TGACTGGATTCTCA 4.183846  3.815155     0    0
#> AGTCAGACTGCACA 4.731087  1.388607     0    0
#> TCTGATACACGTGT 4.880636  1.954429     0    0

# What does `PPBP` mean? I have no idea but I'd guess
# it's a scale for gene expression:
# * No expression (PPBP = 0) in 67 cells.
# * Medium expression (PPBP = 4) in 3 cells.
# * High expression (PPBP = 9) in 10 cells.
p1$data %>%
  count(PPBP)
#>   PPBP  n
#> 1    0 67
#> 2    4  3
#> 3    9 10

Created on 2022-04-14 by the reprex package (v2.0.1)
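The same approach can be carried over to the remaining panels. The sketch below assumes that the third and fourth plots expose their data through `$data` in the same way as the first one, and that the plotted feature sits in the last column, as it does for `p1`. I have not listed any output here, since the exact column names and values may differ between Seurat versions.

```r
library("Seurat")

# Same setup as above: small example dataset, UMAP on the first 5 PCs
data("pbmc_small")
pbmc_small <- RunUMAP(object = pbmc_small, dims = 1:5)

p <- FeaturePlot(
  object = pbmc_small,
  features = c("PPBP", "IGLL5"),
  reduction = "umap",
  blend = TRUE
)

# Third panel: the blended plot. Assumption: the last column holds the
# combined value for the two genes, analogous to `PPBP` in p1$data.
blend_data <- p[[3]]$data
names(blend_data)
table(blend_data[[ncol(blend_data)]])

# Fourth panel: the color legend. Looking at its data shows what the
# x and y axes of the legend are actually built from.
legend_data <- p[[4]]$data
str(legend_data)
```

Tabulating the last column of the third panel shows which combined values actually occur in your cells, and `str(legend_data)` shows the grid that the legend axes are drawn from.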
