Plotting the branches of a trimmed phylogeny on the original tree topology

R
methods
phylogeny
Author

Iain R. Moodie

Published

October 15, 2024

In a current research project, I have a single large phylogeny and tip data for a large number of traits. However, not all tips have data available for all traits. I want to visualise this patchy data coverage by overlaying the sub phylogeny of tips with data for a given trait, over the full phylogeny. This would also be useful to visualise which tips haven been dropped from a phylogeny. However, as far as I can tell, such a method is not directly implemented in any of the big phylo R packages. Here is my version:

First we simualte a tree using phytools::pbtree(). To simulate the patchy trait coverage, we draw (in this example) 4 random samples of the tree’s tips. These are the tips with data for each of the 4 simulated traits.

Code
set.seed(666)

full_tree <- phytools::pbtree(n = 200)

sample_tip_labels <- function(tree, num_samples = 4) {
  tip_labels <- tree$tip.label
  sample_sizes <- length(full_tree$tip.label) * c(0.05, 0.1, 0.3, 0.5)
  samples <- lapply(sample_sizes, function(size) sample(tip_labels, size))
  return(samples)
}

tips_with_data <- sample_tip_labels(full_tree)

Here’s the phylogeny in full:

Code
ggtree::ggtree(full_tree, layout = "circular") +
  ggtree::geom_tiplab(size = 3)

And here is the subtree for each of the traits, produced by dropping tips/branches for which we lack data for a given trait:

Code
lapply(
  tips_with_data,
  function(tips) ggtree::ggtree(
    ape::keep.tip(
      full_tree, tips
    ),
    layout = "circular"
  ) +
  ggtree::geom_tiplab(size = 3)
) |>
  patchwork::wrap_plots()

I find it quite hard to tell which tips or clades have been dropped by looking at these trees, even with the tip labels. So instead, let’s plot these trees over the original tree, retaining the original topology. Here’s my attempt at that:

Code
plot_subtree <- function(tree, tips, layout = "circular") {

  get_substree_nodes <- function(tree, tips) {
    mrca <- phangorn::mrca.phylo(tree, tips)
    nodes <-
      unique(
        unlist(
          lapply(
            seq_along(tips), 
            FUN = function(i){phangorn::Ancestors(tree, tips[i], "all")}
          )
        )
      )
    return(nodes[nodes>mrca])
  }

  nodes_to_colour <- get_substree_nodes(tree, tips)

  tree_dat <-
    tree |> 
    tidytree::as_tibble() |> 
    dplyr::mutate(
      colour_nodes = dplyr::case_when(
        node %in% nodes_to_colour ~ TRUE,
        label %in% tips ~ TRUE,
        TRUE ~ FALSE
      )
    ) |>
    tidytree::as.treedata()

  return(
    ggtree::ggtree(tree_dat, ggplot2::aes(colour = colour_nodes), layout = layout) +
      ggtree::geom_tippoint(ggplot2::aes(colour = colour_nodes, shape = colour_nodes)) +
      ggplot2::scale_colour_manual(values = c("grey85", "black")) +
      ggplot2::scale_shape_manual(values = c(NA, 16)) +
      ggplot2::theme(legend.position = "none")
  )
}

plots_circular <-
  lapply(
    tips_with_data,
    function(tips) plot_subtree(full_tree, tips)
  )

patchwork::wrap_plots(plots_circular)

I think the output makes it easy to quickly visually assess what clades have data for a given trait, and which traits are particularly data poor/rich. I understand that this plot shows the same data as a phylogeny and heatmap style plot as shown in the treedata book or the phytools blog, but for my particular purposes this style of plot has additional value, due to the large number of traits.

That’s all!