Introduction to Using ECHO

The echo.find package provides a function (echo_find()) designed to find rhythms from expression data using extended harmonic oscillators. To read more about our inital work on this project and cite us, see ECHO: an application for detection and analysis of oscillators identifies metabolic regulation on genome-wide circadian output by H. De los Santos, et al. (2020) Further, for users who prefer an interface more than coding, as well as built-in visualizations, our GitHub repository can be found here. There, you can find a shiny application for finding rhythms and automatically visualizing results, with features such as Venn diagrams, heat maps, gene expression plots (with or without replicates visualized), and parameter density graphs. A FAQ for possible user errors can also be found there.

Also, it should be noted that starting points have been improved since echo.find v4.0. For more information on these new starting points/algorithm improvements and comparisons with the Bioinformatics version, please see ECHO_V4_Update.pdf on the GitHub.

In this vignette, we’ll walk through an example of how to use echo.find, and how to choose from the several different built-in methods of preprocessing.

Loading and Examining Data

We’ll start by loading our library, which contains the echo_find() function. It also has an example dataframe, expressions, which we’ll be using throughout this vignette. Here we’ll look at the first few rows and columns of our dataset.

library(echo.find)
head(expressions[,1:5])
##   Gene.Name      CT2.1      CT2.2      CT2.3      CT4.1
## 1  Sample 1  1.6331179  1.4976053  1.5138102  1.3095535
## 2  Sample 2 -0.6303192 -0.6027464 -0.5105009 -0.5062033
## 3  Sample 3         NA -0.7802214 -0.7767950  0.2847617
## 4  Sample 4  0.4659923  0.4940659         NA  0.1018655
## 5  Sample 5  0.7026372  0.6405812  1.0235155  1.7199453
## 6  Sample 6  0.9261508  0.8858768  0.8035570         NA

Note the data format: its first column first column has gene labels/names, and all other columns have numerical expression data. This expression data is ordered by time point then by replicate, and has evenly spaced time points. Any missing data has cells left blank. In order to use the echo_find() function, data must be in this format. Now, let’s look at one the data expressions, Sample 2. Here we plot each of the replicates in a different color, then plot the difference between them in gray.

library(ggplot2)

tp <- seq(2,48,by=2) # our time points
num_reps <- 3 # number of replicates

samp <- 2 # sample we want to look at
ex.df <- expressions[samp,-1] # expression data for the first sample

# our visualization data frame       
ribbon.df <- data.frame(matrix(ncol = 3+num_reps, nrow = length(tp)))
# assigning column names
colnames(ribbon.df) <- c("Times","Min","Max", 
                         paste(rep("Rep",num_reps),c(1:num_reps), sep=".")) 
ribbon.df$Times <- tp
# getting min values of replicates
ribbon.df$Min <- sapply(seq(1,ncol(ex.df), by = num_reps),
                        function(x) min(unlist(ex.df[,c(x:(num_reps-1+x))]), na.rm = TRUE))
# getting max values of replicates
ribbon.df$Max <- sapply(seq(1,ncol(ex.df), by = num_reps),
                        function(x) max(unlist(ex.df[,c(x:(num_reps-1+x))]), na.rm = TRUE))
# assign each of the replicates to the visualization data frame
for (i in 1:num_reps){ 
  ribbon.df[,3+i] <- t(ex.df[,seq(i,ncol(ex.df),by=num_reps)])
}

# color names
color_bar <- c("Rep.1"="red","Rep.2"="blue","Rep.3"="green")

# visualize, with shading for each row
p <- ggplot(data = ribbon.df,aes(x=Times))+ # declare the dataframe and main variables
  geom_ribbon(aes(x=Times, ymax=Max, ymin=Min, colour="Original"),
              fill = "gray", alpha = 0.5)+ # create shading
  ggtitle(expressions[samp,1])+ # gene name is title
  scale_color_manual("",values=color_bar)+
  scale_fill_manual("",values=color_bar)+
  theme(plot.title = element_text(hjust = .5),
        legend.position = "bottom",legend.direction = "horizontal")+
  labs(x="Hours", y="Expression") #Label for axes
# add specific replicate lines 
for (i in 1:num_reps){
  p <- p +
    geom_line(data = ribbon.df,
              aes_string(x="Times",y=paste("Rep",i,sep = ".")),
              colour=color_bar[i])
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
suppressWarnings(p) # to ignore warnings for missing values
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.

It very clearly has an oscillitory pattern with a small amount of damping, making echo_find() the perfect function for our dataset.

Running echo_find()

So we begin by assigning our parameters and running the echo_find() function. In this first run, we look for rhythms between 20 and 26 hours, with no preprocessing, assigning these results to a new dataframe.

# echo_find() parameters

genes <- expressions
begin <- 2 # first time point
end <- 48 # last time point
resol <- 2 # time point resolution
num_reps <- 3 # number of replicates
low <- 20 # low period seeking
high <- 26 # high period seeking
run_all_per <- FALSE # we are not looking for all periods
paired <- FALSE # these replicates are unrelated, that is, a replicate being 
  # called "replicate 1" at time point 2 means nothing
rem_unexpr <- FALSE # do not remove unexpressed genes
# we do not assign rem_unexpr_amt, since we're not removing unexpressed genes
is_normal <- FALSE # do not normalize
is_de_linear_trend <- FALSE # do not remove linear trends
is_smooth <- FALSE # do not smooth the data

results <- echo_find(genes = genes, begin = begin, end = end, resol = resol, 
  num_reps = num_reps, low = low, high = high, run_all_per = run_all_per,
  paired = paired, rem_unexpr = rem_unexpr, is_normal = is_normal,
  is_de_linear_trend = is_de_linear_trend, is_smooth = is_smooth)

head(results[,1:16])
##   Gene Name Convergence Iterations Amplitude.Change.Coefficient
## 1  Sample 1           0          0                  -0.05319493
## 2  Sample 2           0          0                   0.01320716
## 3  Sample 3           0          0                  -0.07006987
## 4  Sample 4           0          0                  -0.10549713
## 5  Sample 5           0          0                  -0.04007378
## 6  Sample 6           0          0                   0.11028233
##   Oscillation Type Initial.Amplitude Radian.Frequency   Period Phase Shift
## 1           Forced         1.2113578        0.3141593 20.00000  -0.4740208
## 2         Harmonic         0.7143799        0.2749550 22.85169   2.4479292
## 3           Forced         1.6853526        0.2811149 22.35095   3.7599389
## 4           Forced         0.7119285        0.2416610 26.00000   6.7165400
## 5           Forced         1.4596162        0.3141593 20.00000  -1.6465263
## 6           Damped         0.8471170        0.2416610 26.00000  10.8530577
##   Hours Shifted Equilibrium Value Slope       Tau      P-Value BH Adj P-Value
## 1      1.508855        0.32913626    NA 0.8894112 4.750294e-25   1.425088e-24
## 2     13.948668        0.12870158    NA 0.9011140 1.138728e-25   4.554914e-25
## 3      8.975854       -0.09077893    NA 0.9781573 5.940921e-30   7.129105e-29
## 4     24.206766       -0.08338100    NA 0.7870119 5.818796e-20   8.728194e-20
## 5      5.241056        0.44750943    NA 0.8989084 6.154595e-26   3.692757e-25
## 6      7.089738        0.43873442    NA 0.7967642 2.025281e-20   3.471911e-20
##   BY Adj P-Value
## 1   4.422349e-24
## 2   1.413486e-24
## 3   2.212311e-28
## 4   2.708542e-19
## 5   1.145940e-24
## 6   1.077407e-19

Now we can see that the results data frame has information about the parameters, including forcing coefficient values (whether the oscillation is damped, driven, harmonic, etc.) and p-values. Let’s look at how the fit and parameters turned out for our initial sample. Here we add the fitted values to our plot in black and print the parameters to the console.

# assign the fit to the visualization data frame
ribbon.df$Fit <- t(results[samp,(17+(length(tp)*num_reps)):ncol(results)])

# visualize, with shading for each row
# add Fit line
p <- p +
  geom_line(data = ribbon.df,
            aes_string(x="Times",y="Fit"),
            colour="black")

suppressWarnings(p) # to ignore warnings for missing values
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.

# print sample's parameters
cat(paste0("Gene Name: ",results$`Gene Name`[samp],"\n",
           "Convergence:", results$Convergence[samp],"\n",
           "Iterations:",results$Iterations[samp],"\n",
           "Forcing Coefficient:", results$Forcing.Coefficient[samp],"\n",
           "Oscillation Type:",results$`Oscillation Type`[samp],"\n",
           "Amplitude", results$Amplitude[samp],"\n",
           "Radian.Frequency:",results$Radian.Frequency[samp],"\n",
           "Period:",results$Period[samp],"\n",
           "Phase Shift:",results$`Phase Shift`[samp],"\n",
           "Hours Shifted:",results$`Hours Shifted`[samp],"\n",
           "P-Value:",results$`P-Value`[samp],"\n",
           "BH Adj P-Value:",results$`BH Adj P-Value`[samp],"\n",
           "BY Adj P-Value:",results$`BY Adj P-Value`[samp],"\n"))
## Gene Name: Sample 2
## Convergence:0
## Iterations:0
## Forcing Coefficient:
## Oscillation Type:Harmonic
## Amplitude0.0132071614058592
## Radian.Frequency:0.274955003256275
## Period:22.851685667721
## Phase Shift:2.44792917755378
## Hours Shifted:13.9486682700991
## P-Value:1.13872848886594e-25
## BH Adj P-Value:4.55491395546377e-25
## BY Adj P-Value:1.4134857624926e-24

This fit matches pretty closely to the trend, which is emphasized by the very low adjusted p-value. As we predicted, the oscillation is also damped, which is shown by the positive forcing coefficient and the designation of the oscillation type.

Now let’s see how preprocessing affects the results. Here we search for all possible periods, using the default values for low and high, as well as allowing for all our preprocessing options: removing unexpressed genes, normalizing, removing linear trends, and smoothing.

run_all_per <- TRUE # looking for all possible periods
rem_unexpr <- TRUE # remove unexpressed genes
rem_unexpr_amt <- 70 # percentage of unexpressed genes
is_normal <- TRUE # normalize
is_de_linear_trend <- TRUE # remove linear trends
is_smooth <- TRUE # smooth the data

# we're using the default values of low and high, since we're looking for all periods
results <- echo_find(genes = genes, begin = begin, end = end, resol = resol, 
  num_reps = num_reps, run_all_per = run_all_per, paired = paired, 
  rem_unexpr = rem_unexpr, rem_unexpr_amt = rem_unexpr_amt, is_normal = is_normal,
  is_de_linear_trend = is_de_linear_trend, is_smooth = is_smooth)

head(results[,1:16])
##   Gene Name Convergence Iterations Amplitude.Change.Coefficient
## 1  Sample 1           0          0                  -0.05153645
## 2  Sample 2           0          0                   0.01158224
## 3  Sample 3           0          0                  -0.07418176
## 4  Sample 4           0          0                  -0.07811707
## 5  Sample 5           0          0                  -0.02671833
## 6  Sample 6           0          0                   0.03607892
##   Oscillation Type Initial.Amplitude Radian.Frequency   Period Phase Shift
## 1           Forced         0.6164627        0.3243511 19.37155  -0.6585024
## 2         Harmonic         1.5772249        0.2731878 22.99951   2.5290535
## 3           Forced         0.4876710        0.2788346 22.53374   3.5663709
## 4           Forced         0.4262501        0.2191192 28.67474   7.1947247
## 5         Harmonic         0.9433863        0.3249428 19.33628   4.2981209
## 6           Damped         1.6033815        0.1655005 37.96476   5.9626842
##   Hours Shifted Equilibrium Value        Slope       Tau      P-Value
## 1      2.030215       0.002159044  0.003476106 0.9381728 1.095403e-27
## 2     13.741944       0.095058426  0.005083800 0.9371975 1.240717e-27
## 3      9.743463      -0.217673167 -0.024666260 0.7694577 3.766646e-19
## 4     24.514726       0.110275406 -0.007324020 0.9157424 1.861411e-26
## 5      6.108965      -0.267900933  0.010171317 0.8799940 6.327600e-25
## 6      1.936557      -0.040058348 -0.035910190 0.7245971 3.695378e-17
##   BH Adj P-Value BY Adj P-Value
## 1   4.962868e-27   1.540083e-26
## 2   4.962868e-27   1.540083e-26
## 3   5.022195e-19   1.558493e-18
## 4   5.045212e-26   1.565635e-25
## 5   1.084731e-24   3.366150e-24
## 6   4.031322e-17   1.251004e-16

Since we’ve now searched for all possible periods, periods can now fall outside our predetermined range of 20 to 26 that we set in our first run. Let’s see how this affected the fit and parameters of the sample we looked at.

rep_genes <- results[samp,17:(16+(length(tp)*num_reps))]

 # getting min values of replicates
ribbon.df$Min <- sapply(seq(1,ncol(rep_genes), by = num_reps), 
                        function(x) min(unlist(rep_genes[,c(x:(num_reps-1+x))]),
                                        na.rm = TRUE))
 # getting max values of replicates
ribbon.df$Max <- sapply(seq(1,ncol(rep_genes), by = num_reps),
                        function(x) max(unlist(rep_genes[,c(x:(num_reps-1+x))]),
                                        na.rm = TRUE))
for (i in 1:num_reps){ # assign each of the replicates
  ribbon.df[,3+i] <- t(rep_genes[,seq(i,ncol(rep_genes),by=num_reps)])
}
# assign the fit to the visualization data frame
ribbon.df$Fit <- t(results[samp,(17+(length(tp)*num_reps)):ncol(results)])

# visualize, with shading for each row
p <- ggplot(data = ribbon.df,aes(x=Times))+ # declare the dataframe and main variables
  geom_ribbon(aes(x=Times, ymax=Max, ymin=Min, colour="Original"),
              fill = "gray", alpha = 0.5)+ # create shading
  ggtitle(expressions[samp,1])+ # gene name is title
  scale_color_manual("",values=color_bar)+
  scale_fill_manual("",values=color_bar)+
  theme(plot.title = element_text(hjust = .5),
        legend.position = "bottom",legend.direction = "horizontal")+
  labs(x="Hours", y="Expression") #Label for axes
# add specific replicate lines 
for (i in 1:num_reps){
  p <- p +
    geom_line(data = ribbon.df,
              aes_string(x="Times",y=paste("Rep",i,sep = ".")),
              colour=color_bar[i])
}

# add Fit line
p <- p +
  geom_line(data = ribbon.df,
            aes_string(x="Times",y="Fit"),
            colour="black")

suppressWarnings(p) # to ignore warnings for missing values
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.

# print sample's parameters
cat(paste0("Gene Name: ",results$`Gene Name`[samp],"\n",
           "Convergence:", results$Convergence[samp],"\n",
           "Iterations:",results$Iterations[samp],"\n",
           "Forcing Coefficient:", results$Forcing.Coefficient[samp],"\n",
           "Oscillation Type:",results$`Oscillation Type`[samp],"\n",
           "Amplitude", results$Amplitude[samp],"\n",
           "Radian.Frequency:",results$Radian.Frequency[samp],"\n",
           "Period:",results$Period[samp],"\n",
           "Phase Shift:",results$`Phase Shift`[samp],"\n",
           "Hours Shifted:",results$`Hours Shifted`[samp],"\n",
           "P-Value:",results$`P-Value`[samp],"\n",
           "BH Adj P-Value:",results$`BH Adj P-Value`[samp],"\n",
           "BY Adj P-Value:",results$`BY Adj P-Value`[samp],"\n"))
## Gene Name: Sample 2
## Convergence:0
## Iterations:0
## Forcing Coefficient:
## Oscillation Type:Harmonic
## Amplitude0.011582239502997
## Radian.Frequency:0.273187835833973
## Period:22.9995061383265
## Phase Shift:2.52905346225012
## Hours Shifted:13.7419436464624
## P-Value:1.24071709814817e-27
## BH Adj P-Value:4.9628683925927e-27
## BY Adj P-Value:1.54008261904479e-26

We can see that the fit hasn’t changed very much, but that the smoothing has gotten the replicates much closer to each other. This smoothing has also reduced the amount of damping in the system: the forcing coefficient has decreased by about .002.

Other selections are also available. Confidence interval calculations with either bootstrapping or jackknifing with a set random seed for reproducibility are also available through the run_conf, which_conf, and seed options respectively. However, it is important to note that this will increase the time length of runs. Further, adjustments for the harmonic and overexpressed cutoffs can be adjusted through the harm_cut and over_cut parameters, respectively, though the defaults are recommended.

Now that you understand the basics of using the echo_find() function, feel free to experiment and play around with this vignette and the example data. Good luck with your rhythm searches!