scRNA-seq analysis with Cellenics®

Preface

Having in-house bioinformaticians who can process and analyse your single cell RNA sequencing data is a luxury that many research groups want but don't have. The alternative is either to queue for services provided by bioinformaticians embedded in other groups or to outsource the tasks to specialised companies. The former often becomes a bottleneck in your research pipeline, and the latter is only viable with sufficient funds. You might feel like you are running out of options, but don't worry: here comes Cellenics®. It is an open-source analysis tool by Biomage and is free for academic users. Let's explore this cloud-based analysis tool together.

As a bioinformatician myself, I enjoy building my own analysis pipelines for single cell datasets. However, I don't think that programming skills should stand between researchers and their datasets. Our ultimate goal is to understand the scientific implications of the data and use that knowledge to push our research further and benefit more people. To this end, we need a tool that is easy to use and quick to generate results (plots, graphs, and differentially expressed gene lists), and Cellenics® is such a tool. Let me show you what Cellenics® can do for you and your datasets.

My big question is: what are the transcriptomic differences between the developing and adult mouse brains at the single cell level? (We could ask a more complex question here, but the datasets for that normally require some wrangling, which would defeat our main purpose.)

Step 1: Set up your project on Cellenics®.

When you open Cellenics®, you will be taken to the Data Management panel. Click the Create New Project button to start your very first project. Give it a self-explanatory name and a clear description.

Step 2: Add your data.

The datasets used here were downloaded from the 10X Single Cell Gene Expression Datasets: 1k Brain Nuclei from an E18 Mouse and 2k Brain Nuclei from an Adult Mouse (>8 weeks). We need the Gene/cell matrix (raw) files. In the main panel of project details, click the Add samples button to upload your data. In the new window, choose the correct Technology (10X Chromium) and drag and drop the folders containing the 3 count matrix files into the indicated box. (These are the output folders you normally get from a cellranger run, each containing 3 files: barcodes.tsv, genes.tsv, and matrix.mtx.) The list of files to be uploaded will be shown under the box. The Upload button is now active, so hit it.
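Before uploading, it can save a failed attempt to confirm that each sample folder really holds the three files listed above. Here is a minimal sketch in Python, assuming one folder per sample. The function name `missing_10x_files` is hypothetical and not part of Cellenics®, and accepting gzipped copies is my own assumption.

```python
from pathlib import Path

# File names as listed above for a cellranger output folder.
EXPECTED_FILES = ("barcodes.tsv", "genes.tsv", "matrix.mtx")


def missing_10x_files(sample_dir):
    """Return the expected file names that are absent from sample_dir.

    A gzipped copy (e.g. matrix.mtx.gz) counts as present; treating
    compressed files as acceptable is an assumption of this sketch.
    """
    folder = Path(sample_dir)
    missing = []
    for name in EXPECTED_FILES:
        plain = folder / name
        gzipped = folder / (name + ".gz")
        if not plain.exists() and not gzipped.exists():
            missing.append(name)
    return missing
```

Calling `missing_10x_files("path/to/sample_folder")` returns an empty list when the folder is complete, so any non-empty result tells you which file to look for before you drag the folder into the upload box.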

Step 3: Add metadata.

If you have multiple datasets from different groups, e.g. diseased vs control, developmental vs adult, male vs female, human vs mouse, provide clear labels here. In our case, click the Add metadata button and name the new metadata track age. Once the age column appears, you can edit its values: adult for the adult data and developmental for the E18 data. Click Go to Data Processing once you finish adding the metadata. The Data Processing panel now opens and shows the launching progress. It may take a while if you have a large number of datasets, and you can opt for an email alert when the launch is completed.

Step 4: Process data.

The process status is shown on the top-right of the panel, with forward and backward arrows for each step: 1. Classifier filter; 2. Cell size distribution filter; 3. Mitochondrial content filter; 4. Number of genes vs UMIs filter; 5. Doublet filter; 6. Data integration; 7. Configure embedding. (The orange highlight indicates the current step.)

Firstly, we filter out empty droplets that contain ambient RNAs instead of cells, using a false discovery rate of 0.01. This step is disabled if one of the datasets appears to be pre-filtered. In our case it runs, and the knee plots show each droplet's UMI count against its rank. The droplets to keep are highlighted in green (Figure 1 and Figure 2). You might have noticed that on the top-right of the graph, you can choose to save it as either an SVG or a PNG file. On the right side, you can change the filtering FDR threshold and style your plots.
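If you are curious how the points of a knee plot come about, here is a minimal sketch in Python. It only ranks droplets by their UMI count; it does not implement the classifier or the FDR calculation, and the toy counts are made up for illustration.

```python
def knee_plot_points(umi_counts):
    """Return (rank, umi_count) pairs, sorted from most to fewest UMIs.

    Rank 1 is the droplet with the most UMIs. Plotting these pairs on
    log-log axes gives the knee-shaped curve.
    """
    ordered = sorted(umi_counts, reverse=True)
    return [(rank, count) for rank, count in enumerate(ordered, start=1)]


# Made-up UMI totals per droplet, for illustration only.
toy_counts = [12, 5400, 3, 4800, 7, 5100, 1]
points = knee_plot_points(toy_counts)
```

In this toy example the three droplets with thousands of UMIs take the top ranks, and the sharp drop after them is the "knee" that separates likely cells from likely empty droplets.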

Figure 1: Classifier filtering for adult mouse brain single nuclei RNA-seq data

Figure 2: Classifier filtering for developmental mouse brain single nuclei RNA-seq data

Hit the forward arrow to go to the next filter, the Cell size distribution filter. As we can see, both datasets now contain only high-quality cells (Figure 3 and Figure 4).

Figure 3: Cell size distribution filtering for adult mouse brain single nuclei RNA-seq data

Figure 4: Cell size distribution filtering for developmental mouse brain single nuclei RNA-seq data

Hit the forward arrow to go to the next filter, the Mitochondrial content filter. As we can see, both datasets have some cells with a high percentage of reads from mitochondrial genes, which indicates poor-quality or dead cells (Figure 5 and Figure 6).
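The idea behind this filter can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are hypothetical, the 0.1 cut-off is a placeholder rather than the Cellenics® setting, and mitochondrial genes are recognised by the "mt-" prefix that mouse gene symbols use.

```python
def mito_fraction(gene_counts, prefix="mt-"):
    """Fraction of one cell's counts that come from mitochondrial genes.

    gene_counts maps gene symbol -> count for a single cell. Mouse
    mitochondrial gene symbols start with "mt-" (e.g. mt-Co1); the
    comparison is case-insensitive so human "MT-" symbols also match.
    """
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0
    mito = sum(
        count
        for gene, count in gene_counts.items()
        if gene.lower().startswith(prefix.lower())
    )
    return mito / total


def keep_cell(gene_counts, max_fraction=0.1):
    """Keep a cell if its mitochondrial fraction is at or below the cut-off.

    The 0.1 default is a placeholder for illustration only.
    """
    return mito_fraction(gene_counts) <= max_fraction
```

A cell in which 30 of 100 counts come from mt-Co1 would have a fraction of 0.3 and be dropped under this placeholder cut-off, while a cell with 2 of 100 would be kept.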

Figure 5: Mitochondrial content filtering for adult mouse brain single nuclei RNA-seq data

Figure 6: Mitochondrial content filtering for developmental mouse brain single nuclei RNA-seq data

Hit the forward arrow to go to the next filter, the Number of genes vs UMIs filter. As we can see, both datasets have a few outliers (Figure 7 and Figure 8).
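One common way to find such outliers is to fit a straight line to the log-transformed values and flag cells that sit far from it. The sketch below does that with ordinary least squares in plain Python. It is my own simplified illustration, not necessarily the method Cellenics® uses, and both the function name and the 3-standard-deviation default are assumptions.

```python
import math


def flag_gene_umi_outliers(n_umis, n_genes, n_sd=3.0):
    """Flag cells whose gene count departs from a log-log linear trend.

    Fits log10(genes) = a + b * log10(UMIs) by ordinary least squares and
    returns one boolean per cell: True if the residual is more than n_sd
    residual standard deviations away from the fitted line.
    Assumes at least 3 cells and strictly positive counts.
    """
    x = [math.log10(u) for u in n_umis]
    y = [math.log10(g) for g in n_genes]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    if sd == 0:
        return [False] * n
    return [abs(r) > n_sd * sd for r in residuals]
```

Cells flagged as True are the ones that would sit visibly off the main diagonal cloud in a genes vs UMIs scatter plot.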

Figure 7: Number of genes vs UMIs filtering for adult mouse brain single nuclei RNA-seq data

Figure 8: Number of genes vs UMIs filtering for developmental mouse brain single nuclei RNA-seq data

Hit the forward arrow to go to the next filter, the Doublet filter. As we can see, both datasets have a few cells with high probabilities of being doublets (Figure 9 and Figure 10).

Figure 9: Doublet filtering for adult mouse brain single nuclei RNA-seq data

Figure 10: Doublet filtering for developmental mouse brain single nuclei RNA-seq data

Hit the forward arrow to go to the next step, Data integration. Since we only have two small datasets, it takes no time to finish. On the top half of the panel, you can see the clustering map. On the bottom half, you can see the settings used for data integration (Harmony with LogNormalize) and dimensionality reduction (Number of Principal Components and % variation explained). You will see three graphs available for you: the clustering map, the frequency plot, and the elbow plot (Figure 11).
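The "% variation explained" figure and the elbow plot both come from the variance captured by each principal component. Here is a minimal sketch of that arithmetic in Python. The function names are hypothetical, and the percentages are relative to the components you pass in, which is an assumption of this sketch.

```python
def variance_explained(pc_variances):
    """Percentage of the summed variance explained by each principal component."""
    total = sum(pc_variances)
    return [100.0 * v / total for v in pc_variances]


def n_pcs_for(pc_variances, target_percent):
    """Smallest number of leading PCs whose cumulative share reaches target_percent."""
    cumulative = 0.0
    for i, pct in enumerate(variance_explained(pc_variances), start=1):
        cumulative += pct
        if cumulative >= target_percent:
            return i
    return len(pc_variances)
```

Plotting the output of `variance_explained` against the component number gives the elbow plot; the point where the curve flattens is where adding more components buys you little.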

Figure 11: Data integration. Left top, elbow plot showing the optimal number of principal components; left bottom, stacked bar plot showing the proportion of cells contributed from either the adult or the developmental mouse brain in each cluster; right, clustering map showing all the cells after integration using Harmony.

Hit the forward arrow to go to the next step, Configure embedding. On the top half of the panel, you can see the clustering map. On the bottom half, you can see the settings used for embedding (UMAP with minimum distance 0.3) and clustering (Louvain with resolution 0.8). Four graphs are available, coloured by Louvain cluster, sample (same as Figure 11 right), mitochondrial read fraction, and cell doublet score (Figures 12 - 15).

Figure 12: UMAP coloured by clusters.

Figure 13: UMAP coloured by samples.

Figure 14: UMAP coloured by mitochondrial read fraction.

Figure 15: UMAP coloured by cell doublet score.

Hit the blue tick button to finish the QC process and apply any filters that we've not rerun. Then, we can start exploring the data.

Step 5: Explore data.

Once the data processing is completed, you will be led to the Data Exploration panel and see four blocks: UMAP (showing the Louvain clusters in different colours), Cell sets and Metadata (showing check boxes for defining, annotating and visualising specific populations), Heatmap (showing the top marker genes for each cluster), and Genes (showing a list of genes ranked by their dispersion scores and providing options for differentially expressed gene analysis).

UMAP: You can hover over the UMAP plot to get the cell ID and cluster number of each cell (Figure 16). You can then use the rectangle or lasso tool to select a subgroup of cells and define a custom cell set of interest (in our case 'my_cluster', Figure 17).

Cell sets and Metadata: Apart from the original Louvain clusters, you can see a new check box under Custom cell sets, 'my_cluster' (Figure 18). You can edit the name and colour of each cluster. For more complicated subsetting, you can toggle the square boxes on and off, which activates a selection of logic actions: combining, intersecting, and complementing. You can also use metadata to colour and subset your data (Figure 19).
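The three logic actions map directly onto set operations. Here is a minimal Python sketch with made-up cell IDs. Treating "complementing" as "all cells outside the selected sets" is my reading of the feature, not something confirmed here.

```python
# Hypothetical cell IDs, for illustration only.
cluster_2 = {"cell_01", "cell_02", "cell_03", "cell_04"}
my_cluster = {"cell_03", "cell_04", "cell_05"}
all_cells = cluster_2 | my_cluster | {"cell_06"}

combined = cluster_2 | my_cluster      # cells in either set (union)
intersected = cluster_2 & my_cluster   # cells in both sets (intersection)
complement = all_cells - combined      # cells in neither selected set
```

Thinking of cell sets this way makes it easier to plan a subsetting strategy before you start ticking boxes.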

Figure 16: UMAP showing Louvain clusters

Figure 17: UMAP coloured by custom cell set

Figure 18: Check boxes for each individual cluster

Figure 19: Check boxes for each metadata group

Heatmap: It shows the cluster markers. You can add metadata tracks other than the Louvain cluster (Figure 20). The heatmap can also be grouped by other metadata (Figure 21).

Genes: You have two tabs, Gene list and Differential expression. Genes are ranked by their dispersion scores, and you can visualise the expression of a gene on the UMAP by toggling the eye icon (Figures 22 - 23).

Figure 20: Adding multiple metadata tracks to your heatmap

Figure 21: Heatmap grouped by age/samples instead of the Louvain cluster.

Figure 22: Gene list with Hist1h2ap turned on

Figure 23: UMAP coloured with Hist1h2ap expression

Differential expression is where most users' interest lies. Going back to our question: what are the transcriptomic differences between the developing and adult mouse brains at the single cell level? We can use this tool to start answering it. Ideally, we would have multiple samples in each group; in our case, however, we only have n = 1 for each group. We use 'Compare a selected cell set between samples/groups' since we want to compare the adult and developmental samples (Figure 24). The system kindly warns that the sample number is low, that only logFC values (but not p values) will be calculated, and that the results should be used with caution (Figure 25).
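For readers who have not met logFC before, here is a minimal Python sketch of a log2 fold change between two groups of cells for one gene. The pseudocount of 1 and the function name are my own assumptions, and the exact formula Cellenics® uses may differ.

```python
import math


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 of (mean of group_a + pseudocount) / (mean of group_b + pseudocount).

    Positive values mean higher average expression in group_a. The
    pseudocount avoids division by zero when a gene is absent in a group.
    """
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))
```

With a mean of 7 in the developmental cells and 1 in the adult cells, the ratio is 8 / 2 = 4, so the logFC is 2. A fold change on its own says nothing about statistical significance, which is why the warning about the low sample number matters.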

Figure 24: Settings for differentially expressed gene analysis

Figure 25: Warning message for the low sample number

Hit Compute. You will see a list of genes ranked by logFC, starting with those upregulated in the developmental brain, including Cd24a, Igfbpl1, Ube2c, Eomes, and Cdk1 (Figure 26). To get the downregulated genes, simply use the small up arrow to re-sort from the smallest values to the largest (Figure 27). You can also use advanced filters to focus on the top DEGs with high absolute logFC (Figure 28). With these genes, we can use the Export as CSV and Pathway analysis options.

Figure 26: DEGs upregulated in the developmental brain

Figure 27: DEGs downregulated in the developmental brain

Figure 28: Settings of Advanced filters for thresholding

Let's use the filter shown in Figure 28 and hit Pathway analysis. A new pop-up window offers two external services, pantherdb and enrichr (Figure 29). We use enrichr with the top 20 genes as an example here. Hit Launch and you will be taken to the results at Enrichr (Figure 30). To look at the top 20 upregulated genes in the developmental brain, we can just re-sort the list and rerun the pathway analysis (Figure 31). The pathways associated with the upregulated genes in the developing brain are largely related to cell cycle and mitosis, which makes biological sense. Hit Export as CSV and you will see a pop-up window saying that this feature is not yet available here and that you need to go to the volcano plot to get the list of DEGs (Figure 32).

Figure 29: Settings for pathway analysis

Figure 30: Enrichr outputs using the top 20 downregulated genes as input.

Figure 31: Enrichr outputs using the top 20 upregulated genes as input.

Figure 32: Information for exporting CSV

Hit the volcano plot link and a new window will open. It has a main panel for visualising plots (empty when opened for the first time) and a side Controls panel for settings. Repeat what you've done for the differential expression analysis and export the genes. Even though the p-values are not calculated and no data points are shown on the plot, you can still download the CSV file by hitting the Export as CSV button.
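Once you have the CSV, picking out the top genes offline is straightforward. Here is a minimal Python sketch that works on the text of such a file. The column names `gene_names` and `logFC` are assumptions, so check the header of your own export and adjust them; the function name is hypothetical too.

```python
import csv
import io


def top_genes(csv_text, n=20, gene_col="gene_names", lfc_col="logFC",
              ascending=False):
    """Return the n gene names with the largest (or smallest) logFC values.

    gene_col and lfc_col are assumed column names, not confirmed ones.
    Set ascending=True to get the most downregulated genes instead.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[lfc_col]), reverse=not ascending)
    return [row[gene_col] for row in rows[:n]]
```

To use it on a downloaded file, read the file into a string first, for example with `open(path, newline="").read()`, and pass that string as `csv_text`.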

Step 6: Visualise data.

Hit the Plots and Tables tab. There are three horizontal panels: Cell sets & metadata (including Categorical Embedding, Frequency Plot, and Trajectory Analysis), Gene expression (including Continuous Embedding, Marker Heatmap, Custom Heatmap, Violin Plot, Dot Plot, and Normalised Expression Matrix), and Differential expression. I will go through a few examples here and you can explore the rest yourself.

  • Frequency Plot (Figure 33) - a good tool to visualise differences in sample composition, that is, the contribution of each cluster to each sample. You can see that the adult sample mainly contains cells from clusters 0 and 1, while cells from clusters 2, 3, 4, and 5 are found exclusively in the developmental sample. It is likely that cluster 0 and 1 cells are more mature and are therefore found mostly in the adult tissue.

  • Dot plot - a great tool to visualise gene expression at different metadata levels. Let's add the genes in Figure 26, Cd24a, Igfbpl1, Ube2c, Eomes, Cdk1, Pbk, and Tubb2b, and plot their expression in each cluster. (I changed the Colour Schemes to Spectral for better visualisation.) You can see these genes have higher expression levels (average expression value and the percentage of cells expressing the specific gene) in clusters 2, 3, 4, and 5 (Figure 34), the same clusters that are specific to the developmental sample in our frequency plot (Figure 33). To make this more obvious, hit Select data and group cells by age (Figure 35).

  • Trajectory Analysis - I would be confident using this analysis in a study with multiple time points, like our current case, because you need to decide on the starting point. We can start from the developmental cells and end in the adult cells. I clicked on a white dot in the middle of the developmental cells to mark the start and hit Calculate (Figure 36). (I changed the Colour Schemes to Inferno this time.) Now you can see the gradual change from developmental cells to adult cells and determine the relative maturity levels of the different Louvain clusters.
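The numbers behind the frequency plot and the dot plot are simple summaries, and the following Python sketch shows how they could be computed from per-cell values. Both function names and the input layouts are hypothetical, and counting a cell as "expressing" when its value is above zero is an assumption of this sketch.

```python
from collections import Counter


def cluster_proportions(cells):
    """Per-sample fraction of cells in each cluster.

    cells is a list of (sample, cluster) pairs, one pair per cell.
    Returns {sample: {cluster: fraction}}.
    """
    per_sample = {}
    for sample, cluster in cells:
        per_sample.setdefault(sample, Counter())[cluster] += 1
    return {
        sample: {cl: n / sum(counts.values()) for cl, n in counts.items()}
        for sample, counts in per_sample.items()
    }


def dot_plot_stats(expression_values):
    """Mean expression and percentage of expressing cells for one gene in one group.

    Assumes a non-empty list; a cell counts as expressing if its value is > 0.
    """
    n = len(expression_values)
    mean_expr = sum(expression_values) / n
    pct_expressing = 100.0 * sum(1 for v in expression_values if v > 0) / n
    return mean_expr, pct_expressing
```

In a dot plot, the first value returned by `dot_plot_stats` would set the colour of a dot and the second its size, one dot per gene and group.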

Figure 33: Sample cluster composition

Figure 34: Dot plot of top DEGs upregulated in the developmental brain (grouped by Louvain cluster)

Figure 35: Dot plot of top DEGs upregulated in the developmental brain (grouped by age)

Figure 36: Pseudotime analysis (the red dot is the root for calculation)

Step 7: Create your data story.

Steps 1 - 6 are just the beginning. You can carry out complex analyses at different levels of granularity and ultimately tell a good story based on your research and data analysis. With the flexibility and functionality provided by Cellenics®, you are ready to tell that story to your colleagues, your students, the reviewers of your papers and grants, and everyone out there who is interested in your research.

So what are you waiting for?