Skip to content

Fix duplicates issue when querying expression data#631

Open
ShixiangWang wants to merge 5 commits intoBioinformaticsFMRP:masterfrom
ShixiangWang:master
Open

Fix duplicates issue when querying expression data#631
ShixiangWang wants to merge 5 commits intoBioinformaticsFMRP:masterfrom
ShixiangWang:master

Conversation

@ShixiangWang
Copy link
Copy Markdown

I was trying to obtain STAR gene expression data from multiple datasets including TCGA, TARGET, and CPTAC. Some errors happened when preparing some projects, and they are mainly related to duplicated records, especially in the CPTAC-3. I tested the code and made a workaround. This PR also amended some data checks to make sure the data preparation process went smoothly.

o Preparing output
-------------------
Downloading data for project CPTAC-3
Of the 2340 files for download 2340 already exist.
All samples have been already downloaded
Removing duplicated cases (with older updated time)
 => 41 records removed

Code to query the data.

library(TCGAbiolinks)
#devtools::load_all("/Volumes/EPan/Repos/TCGAbiolinks/")
#install.packages("/Volumes/EPan/Repos/TCGAbiolinks/", repos = NULL, type = "source")

prjs = TCGAbiolinks::getGDCprojects()
tcga_prjs = prjs$project_id[startsWith(prjs$project_id, "TCGA")]
targ_prjs = prjs$project_id[startsWith(prjs$project_id, "TARGET")]
cpta_prjs = prjs$project_id[startsWith(prjs$project_id, "CPTAC")]

# mRNA pipeline: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

dir.create("data")
dir.create("data/count")
dir.create("data/tpm")

# GDC data download and prepare ------------------------------------------------

for (i in c(tcga_prjs, targ_prjs, cpta_prjs)) {
  message("Handling ", i)
  # if (file.exists(glue::glue("data/tpm/{i}.exp.tpm.rds"))) {
  #   message("handled already, skipping...")
  #   next()
  # }
  query.exp.hg38 <- GDCquery(
    project = i,
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )

  GDCdownload(query.exp.hg38)
  #debug(GDCprepare)
  #debug(readTranscriptomeProfiling)
  #debug(colDataPrepare)
  expdat <- GDCprepare(
    query = query.exp.hg38
  )
  gc()

  message("saving...")
  saveRDS(SummarizedExperiment::assays(expdat)[["unstranded"]], file = glue::glue("data/count/{i}.exp.count.rds"))
  saveRDS(SummarizedExperiment::assays(expdat)[["tpm_unstrand"]], file = glue::glue("data/tpm/{i}.exp.tpm.rds"))
  message("done")
  gc()
}

@mtran-code
Copy link
Copy Markdown

Thanks, this PR solved an issue I had running the following script (CPTAC-3):

library(TCGAbiolinks)

project_id <- "CPTAC-3"

data_directory <- "data"
results_directory <- "results"
files_per_chunk <- 20

query <- GDCquery(
  project = project_id,
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts",
  access = "open",
  data.format = "TSV",
  experimental.strategy = "RNA-Seq"
)

GDCdownload(
  query = query,
  method = "api",
  directory = data_directory,
  files.per.chunk = files_per_chunk
)

se <- GDCprepare(
  query = query,
  save = FALSE,
  directory = data_directory,
  summarizedExperiment = TRUE,
  remove.files.prepared = FALSE
)

saveRDS(
  object = se,
  file = file.path(results_directory, paste0(project_id, ".rds")),
  compress = TRUE
)

@tiagochst
Issue was with GDCprepare erroring due to duplicate row names when adding clinical data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants