Technology & AI

How to Build a Single Pipeline for RNA-seq Analysis with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery

In this tutorial, we develop a workflow for single-cell RNA-seq analysis using Scanpy on the PBMC-3k benchmark dataset. We first load the dataset, check its structure, and use quality control checks to evaluate gene counts, total counts, mitochondrial content, and ribosomal gene signals. We then filtered out low-quality cells and genes, detected duplicates with Scrublet, normalized the data, applied log transformation, and identified differentially expressed genes for downstream analysis. Also, we derive cell cycle stages, reverse unwanted technical variation, scale the data, and reduce dimensionality using PCA, UMAP, and t-SNE. We also group the cells with the Leiden algorithm, identify the marker genes, characterize the cell population using canonical PBMC markers, evaluate the trajectory structure with PAGA and diffusion pseudotime, calculate the interferon response score, and finally save the fully analyzed object of AnnData for future use.

Copy the CodeCopiedUse a different browser
!pip install -q scanpy leidenalg python-igraph scrublet


import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")


sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor="white", figsize=(5, 5))
sc.logging.print_header()


adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
print(adata)


adata.var["mt"]   = adata.var_names.str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
sc.pp.calculate_qc_metrics(
   adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)


sc.pl.violin(
   adata,
   ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
   jitter=0.4, multi_panel=True,
)
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")

We include the necessary libraries for single-cell analysis and import Scanpy, NumPy, Pandas, Matplotlib, and alert controls. We load the PBMC-3k benchmark dataset, annotate the gene names, and examine the AnnData object structure. We then calculated quality control metrics for mitochondrial and ribosomal genes and visualized the quality patterns of the counts using violin and scatter plots.

Copy the CodeCopiedUse a different browser
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.n_genes_by_counts < 2500, :].copy()
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()


sc.pp.scrublet(adata)
print("Predicted doublets:", int(adata.obs["predicted_doublet"].sum()))
adata = adata[~adata.obs["predicted_doublet"], :].copy()


adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)


sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata.raw = adata
adata = adata[:, adata.var.highly_variable].copy()

We filter out low-quality cells and rarely detected genes to improve the reliability of the dataset. We use Scrublet through Scanpy to identify predicted doubles and remove them before deeper analysis. We then retain the raw counts, normalize expression values, apply a log transformation, select the most variable genes, and retain only the most informative features.

Copy the CodeCopiedUse a different browser
s_genes = ["MCM5","PCNA","TYMS","FEN1","MCM2","MCM4","RRM1","UNG","GINS2",
          "MCM6","CDCA7","DTL","PRIM1","UHRF1","HELLS","RFC2","NASP",
          "RAD51AP1","GMNN","WDR76","SLBP","CCNE2","UBR7","POLD3","MSH2",
          "ATAD2","RAD51","RRM2","CDC45","CDC6","EXO1","TIPIN","DSCC1",
          "BLM","CASP8AP2","USP1","CLSPN","POLA1","CHAF1B","E2F8"]
g2m_genes = ["HMGB2","CDK1","NUSAP1","UBE2C","BIRC5","TPX2","TOP2A","NDC80",
            "CKS2","NUF2","CKS1B","MKI67","TMPO","CENPF","TACC3","SMC4",
            "CCNB2","CKAP2L","CKAP2","AURKB","BUB1","KIF11","ANP32E",
            "TUBB4B","GTSE1","KIF20B","HJURP","CDCA3","CDC20","TTK",
            "CDC25C","KIF2C","RANGAP1","NCAPD2","DLGAP5","CDCA2","CDCA8",
            "ECT2","KIF23","HMMR","AURKA","PSRC1","ANLN","LBR","CKAP5",
            "CENPE","NEK2","G2E3","CBX5","CENPA"]
s_genes   = [g for g in s_genes   if g in adata.var_names]
g2m_genes = [g for g in g2m_genes if g in adata.var_names]
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)


sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)


sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)


sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.tsne(adata, n_pcs=40)

We define S and G2/M phase marker elements and retain only those present in the dataset. We scored each cell by cell cycle stage, regressed unwanted variance on total mitochondrial counts and percentages, and scaled the downstream modeling data. We then use PCA, examine the explained variance, construct a spatial graph, and generate the UMAP and t-SNE embeddings.

Copy the CodeCopiedUse a different browser
sc.tl.leiden(adata, resolution=0.5, flavor="igraph", n_iterations=2, directed=False)
sc.pl.umap(adata, color="leiden", legend_loc="on data", title="Leiden clusters")
sc.pl.tsne(adata, color="leiden", legend_loc="on data", title="t-SNE clusters")


sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)


result   = adata.uns["rank_genes_groups"]
groups   = result["names"].dtype.names
top_df   = pd.DataFrame({g: result["names"][g][:10] for g in groups})
print("nTop 10 markers per cluster:n", top_df)


marker_genes = {
   "B-cell":           ["CD79A", "MS4A1"],
   "CD8 T-cell":       ["CD8A", "CD8B"],
   "CD4 T-cell":       ["IL7R", "CD4"],
   "NK":               ["GNLY", "NKG7"],
   "CD14 Monocyte":    ["CD14", "LYZ"],
   "FCGR3A Monocyte":  ["FCGR3A", "MS4A7"],
   "Dendritic":        ["FCER1A", "CST3"],
   "Megakaryocyte":    ["PPBP"],
}
sc.pl.dotplot(adata, marker_genes, groupby="leiden", standard_scale="var")
sc.pl.stacked_violin(adata, marker_genes, groupby="leiden", swap_axes=True)

We apply Leiden clustering to group cells based on the local graph and visualize the clusters in UMAP and t-SNE plots. We performed differential expression analysis using the Wilcoxon test to identify the top marker elements for each cluster. We then used PBMC canonical marker elements to support cell type annotation with dots and stacked violin plots.

Copy the CodeCopiedUse a different browser
sc.tl.paga(adata, groups="leiden")
sc.pl.paga(adata, color="leiden", threshold=0.1)


sc.tl.umap(adata, init_pos="paga")
sc.pl.umap(adata, color="leiden", legend_loc="on data")


sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep="X_diffmap")
adata.uns["iroot"] = np.flatnonzero(adata.obs["leiden"] == adata.obs["leiden"].cat.categories[0])[0]
sc.tl.dpt(adata)
sc.pl.umap(adata, color=["leiden", "dpt_pseudotime"], legend_loc="on data")


ifn_genes = ["ISG15", "IFI6", "IFIT1", "IFIT3", "MX1", "OAS1", "STAT1", "IRF7"]
ifn_genes = [g for g in ifn_genes if g in adata.raw.var_names]
sc.tl.score_genes(adata, gene_list=ifn_genes, score_name="IFN_score")
sc.pl.umap(adata, color="IFN_score", cmap="viridis")


adata.write("pbmc3k_analyzed.h5ad")
print("n Analysis complete — saved to pbmc3k_analyzed.h5ad")
print(adata)

We use PAGA to model the connections between Leiden clusters and also run UMAP using the PAGA graph to obtain a clear trajectory structure. We calculate diffusion and diffusion pseudotime maps to examine possible progression patterns across cell conditions. We also calculate the interferon gene set score, visualize it in UMAP, and save the final analyzed object as a .h5ad file.

In conclusion, we built an end-to-end Scanpy pipeline for single-cell RNA-seq analysis, converting raw PBMC data into interpretable biological information. We cleaned and preprocessed the dataset, removed noisy cells and duplicates, selected informative genes, and generated meaningful embeddings to visualize cellular architecture. We then used Leiden clustering and differential expression analysis to find marker genes and link the clusters to known immune cell types. By adding PAGA, diffusion pseudotime, and custom gene set estimation, we extended the workflow beyond basic clustering and demonstrated how Scanpy supports deep biological interpretation. In the end, we had a .h5ad database containing processed data, annotations, scores, clusters, and visual analysis results, ready for cross-checking or reporting.


Check it out Full Codes with Notebook here. Also, feel free to follow us Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.

Need to work with us on developing your GitHub Repo OR Hug Face Page OR Product Release OR Webinar etc.?Connect with us

The post How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery appeared first on MarkTechPost.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button