In this vignette a PARAFAC model is created for the
Fujita2023 data. This is done by first processing the count
data. Subsequently, the appropriate number of components is determined.
Then the PARAFAC model is created and visualized.
The data cube in Fujita2023$data contains unprocessed
counts. The function processDataCube() processes these counts. The
steps include feature selection based on sparsity and a centered
log-ratio (CLR) transformation using the compositions::clr() function
with a pseudo-count of one (applied to all features, prior to the
selection based on sparsity).
The outcome of processing is a new version of the dataset called
processedFujita. Please refer to the documentation of
processDataCube() for more information.
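
To make the transformation step concrete, the following is a minimal base-R sketch of a centered log-ratio transformation with a pseudo-count of one. The helper name clrPseudo and the toy counts are hypothetical and only serve as illustration; processDataCube() relies on compositions::clr(), whose handling of special cases may differ from this hand-written version.

```r
# Hypothetical helper (not part of any package): CLR with a pseudo-count
clrPseudo = function(counts, pseudo = 1) {
  logCounts = log(counts + pseudo) # the pseudo-count avoids log(0)
  logCounts - mean(logCounts)      # equivalent to dividing by the geometric mean
}

# Toy count vector for a single sample (made-up numbers)
sampleCounts = c(0, 3, 10, 120)
clrValues = clrPseudo(sampleCounts)

# The CLR-transformed values of a sample sum to (numerically) zero
sum(clrValues)
```

Because each sample is centered on its own mean log-count, the transformed values express abundances relative to the rest of that sample rather than as absolute counts.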
A critical aspect of PARAFAC modelling is to determine the correct
number of components. We have developed the functions
assessModelQuality() and
assessModelStability() for this purpose. First, we will
assess the model quality and specify the minimum and maximum number of
components to investigate and the number of randomly initialized models
to attempt for each number of components.
Note: this vignette reflects a minimum working example for analyzing
this dataset due to computational limitations in automatic vignette
rendering. Hence, we only look at 1-3 components with 3 random
initializations each. These settings are not ideal for real datasets.
Please refer to the documentation of assessModelQuality()
for more information.
# Setup
minNumComponents = 1
maxNumComponents = 3
numRepetitions = 3 # number of randomly initialized models
numFolds = 4 # number of jack-knifed models
ctol = 1e-5 # convergence tolerance
maxit = 200 # maximum number of iterations
numCores = 1 # number of CPU cores to use
# Plot settings
colourCols = c("", "Genus", "")
legendTitles = c("", "Genus", "")
xLabels = c("Replicate", "Feature index", "Time point")
legendColNums = c(0,5,0)
arrangeModes = c(FALSE, TRUE, FALSE)
continuousModes = c(FALSE,FALSE,TRUE)
# Assess the metrics to determine the correct number of components
qualityAssessment = assessModelQuality(processedFujita$data, minNumComponents, maxNumComponents, numRepetitions,
                                       ctol=ctol, maxit=maxit, numCores=numCores)
The overview plot showcases the number of iterations, the sum-of-squared error, the CORCONDIA and the variance explained for 1-3 components.
The overview plots show that we can reach ~40% explained variation if we
take 3 components. The CORCONDIA for those models is ~98, which is well
above the minimum requirement of 60. Based on this overview, either 2 or
3 components seem fine.
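
As a rough illustration of what "explained variation" measures, the sketch below computes it for a toy rank-one three-way array. It assumes the common definition of 100 * (1 - SSE / SST); the helper varExplained and all numbers are hypothetical, and assessModelQuality() may compute its metric differently in detail.

```r
# Assumed definition: percentage of the total sum of squares captured by the model
varExplained = function(X, Xhat) {
  100 * (1 - sum((X - Xhat)^2) / sum(X^2))
}

set.seed(1)
# Toy loading vectors for a 4 x 5 x 3 array (made-up values)
aVec = rnorm(4)
bVec = rnorm(5)
cVec = rnorm(3)

# Rank-one model reconstruction: outer product of the three loading vectors
Xhat = outer(outer(aVec, bVec), cVec)
# "Observed" data: the reconstruction plus a small amount of noise
X = Xhat + array(rnorm(60, sd = 0.1), dim = dim(Xhat))

varExplained(X, Xhat)
```

A model that reproduces the data exactly would score 100 under this definition, and the score drops as the residual sum of squares grows.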
Next, we investigate the stability of the models when jack-knifing
out samples using assessModelStability(). This will give us
more information to choose between 2 or 3 components.
stabilityAssessment = assessModelStability(processedFujita, minNumComponents=1, maxNumComponents=3, numFolds=numFolds, considerGroups=FALSE,
                                           groupVariable="", colourCols, legendTitles, xLabels, legendColNums, arrangeModes,
                                           ctol=ctol, maxit=maxit, numCores=numCores)
stabilityAssessment$modelPlots[[1]]
The three-component model is stable and can be safely chosen as the
final model.
Since a three-component model is the most appropriate for the
Fujita2023 dataset, we can now select one of the random
initializations from the assessModelQuality() output as the
final model. The selected model corresponds to the one that explained
the largest amount of variation.
numComponents = 3
# which.max() returns a single index, even if several initializations tie
modelChoice = which.max(qualityAssessment$metrics$varExp[,numComponents])
finalModel = qualityAssessment$models[[numComponents]][[modelChoice]]
Finally, we visualize the model using
plotPARAFACmodel().
plotPARAFACmodel(finalModel$Fac, processedFujita, 3, colourCols, legendTitles, xLabels, legendColNums, arrangeModes,
  continuousModes = c(FALSE,FALSE,TRUE),
  overallTitle = "Fujita PARAFAC model")
You will observe that the loadings for some modes in some components
are all negative. This is due to sign flipping: when the loadings of two
modes of a component are both negated, the signs cancel out and the
component describes the same thing as it would with two positive
loadings. The flipLoadings() function automatically
corrects for this and also sorts the components by how much
variation they describe.
finalModel = flipLoadings(finalModel, processedFujita$data)
plotPARAFACmodel(finalModel$Fac, processedFujita, 3, colourCols, legendTitles, xLabels, legendColNums, arrangeModes,
  continuousModes = c(FALSE,FALSE,TRUE),
  overallTitle = "Fujita PARAFAC model")
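
The sign ambiguity described above can be checked with a small base-R sketch. It builds the contribution of a single component as the outer product of three loading vectors and shows that negating two modes at once leaves the result unchanged. The loading values are made up, and this illustrates only the principle, not how flipLoadings() is implemented.

```r
# Toy loading vectors for one component (made-up numbers)
subjectLoadings = c(0.5, -0.2, 0.8)
featureLoadings = c(1.0, 0.3)
timeLoadings = c(0.1, 0.4, 0.9, 0.6)

# Contribution of one component: outer product of its three loading vectors
original = outer(outer(subjectLoadings, featureLoadings), timeLoadings)

# Negating two modes at the same time leaves the reconstructed array unchanged
flipped = outer(outer(-subjectLoadings, -featureLoadings), timeLoadings)

all.equal(original, flipped)
```

Since both sign choices reconstruct the same data, the fitting algorithm has no reason to prefer one over the other, which is why a consistent sign convention is applied afterwards.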