Fix duplicates issue when querying expression data by ShixiangWang · Pull Request #631 · BioinformaticsFMRP/TCGAbiolinks

ShixiangWang · 2024-06-19T07:25:02Z

I was trying to obtain STAR gene expression data from multiple datasets including TCGA, TARGET, and CPTAC. Some errors happened when preparing some projects, and they are mainly related to duplicated records, especially in the CPTAC-3. I tested the code and made a workaround. This PR also amended some data checks to make sure the data preparation process went smoothly.

o Preparing output
-------------------
Downloading data for project CPTAC-3
Of the 2340 files for download 2340 already exist.
All samples have been already downloaded
Removing duplicated cases (with older updated time)
 => 41 records removed

Code to query the data.

library(TCGAbiolinks)
#devtools::load_all("/Volumes/EPan/Repos/TCGAbiolinks/")
#install.packages("/Volumes/EPan/Repos/TCGAbiolinks/", repos = NULL, type = "source")

prjs = TCGAbiolinks::getGDCprojects()
tcga_prjs = prjs$project_id[startsWith(prjs$project_id, "TCGA")]
targ_prjs = prjs$project_id[startsWith(prjs$project_id, "TARGET")]
cpta_prjs = prjs$project_id[startsWith(prjs$project_id, "CPTAC")]

# mRNA pipeline: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

dir.create("data")
dir.create("data/count")
dir.create("data/tpm")

# GDC data download and prepare ------------------------------------------------

for (i in c(tcga_prjs, targ_prjs, cpta_prjs)) {
  message("Handling ", i)
  # if (file.exists(glue::glue("data/tpm/{i}.exp.tpm.rds"))) {
  #   message("handled already, skipping...")
  #   next()
  # }
  query.exp.hg38 <- GDCquery(
    project = i,
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )

  GDCdownload(query.exp.hg38)
  #debug(GDCprepare)
  #debug(readTranscriptomeProfiling)
  #debug(colDataPrepare)
  expdat <- GDCprepare(
    query = query.exp.hg38
  )
  gc()

  message("saving...")
  saveRDS(SummarizedExperiment::assays(expdat)[["unstranded"]], file = glue::glue("data/count/{i}.exp.count.rds"))
  saveRDS(SummarizedExperiment::assays(expdat)[["tpm_unstrand"]], file = glue::glue("data/tpm/{i}.exp.tpm.rds"))
  message("done")
  gc()
}

mtran-code · 2026-01-14T19:09:19Z

Thanks, this PR solved an issue I had running the following script (CPTAC-3):

library(TCGAbiolinks)

project_id <- "CPTAC-3"

data_directory <- "data"
results_directory <- "results"
files_per_chunk <- 20

query <- GDCquery(
  project = project_id,
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts",
  access = "open",
  data.format = "TSV",
  experimental.strategy = "RNA-Seq"
)

GDCdownload(
  query = query,
  method = "api",
  directory = data_directory,
  files.per.chunk = files_per_chunk
)

se <- GDCprepare(
  query = query,
  save = FALSE,
  directory = data_directory,
  summarizedExperiment = TRUE,
  remove.files.prepared = FALSE
)

saveRDS(
  object = se,
  file = file.path(results_directory, paste0(project_id, ".rds")),
  compress = TRUE
)

@tiagochst
Issue was with GDCprepare erroring due to duplicate row names when adding clinical data.

ShixiangWang and others added 5 commits April 11, 2022 16:27

use data.table to speed up

c0b084d

convert dt to tibble, make consistent with previous version

22ef748

Update prepare.R

465a171

Update prepare.R

8a19a37

Merge branch 'master' of https://github.com/ShixiangWang/TCGAbiolinks

eb072cb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix duplicates issue when querying expression data#631

Fix duplicates issue when querying expression data#631
ShixiangWang wants to merge 5 commits intoBioinformaticsFMRP:masterfrom
ShixiangWang:master

ShixiangWang commented Jun 19, 2024

Uh oh!

mtran-code commented Jan 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ShixiangWang commented Jun 19, 2024

Uh oh!

mtran-code commented Jan 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants