Using the GenomicDataCommons package

Last updated: 2023-11-07

This reproducible R Markdown analysis was created with workflowr (version 1.7.1).


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/gdc.Rmd) and HTML (docs/gdc.html) files.

File	Version	Author	Date	Message
Rmd	4434f01	Dave Tang	2023-11-07	Filter for open access
html	fb01562	Dave Tang	2023-11-06	Build site.
Rmd	c759821	Dave Tang	2023-11-06	Additional clinical data
html	59cfc19	Dave Tang	2023-11-06	Build site.
Rmd	a9ee937	Dave Tang	2023-11-06	Additional cancers
html	87ee57e	Dave Tang	2023-11-06	Build site.
Rmd	f22f94c	Dave Tang	2023-11-06	Link to cases
html	131a349	Dave Tang	2023-11-01	Build site.
Rmd	705fefa	Dave Tang	2023-11-01	Treatments
html	2f8ef49	Dave Tang	2023-11-01	Build site.
Rmd	75030f1	Dave Tang	2023-11-01	Treatment type
html	8fca622	Dave Tang	2023-11-01	Build site.
Rmd	3fc037e	Dave Tang	2023-11-01	Using the GenomicDataCommons package

Introduction

About the GDC:

The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardisation of genomic and clinical data from cancer research programs. The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonised using a common set of bioinformatics pipelines, so that the data can be directly compared. As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonises these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.

The GenomicDataCommons Bioconductor package provides basic infrastructure for querying, accessing, and mining genomic datasets available from the GDC.

See The GDC API page.

Installation

Install the GenomicDataCommons package using BiocManager.

if (! "GenomicDataCommons" %in% installed.packages()[, 1]){
  BiocManager::install("GenomicDataCommons")
}
library(GenomicDataCommons)
packageVersion("GenomicDataCommons")

[1] '1.26.0'

Getting started

Check status to see if we can query the GDC.

GenomicDataCommons::status()

$commit
[1] "023da73eee3c17608db1a9903c82852428327b88"

$data_release
[1] "Data Release 38.0 - August 31, 2023"

$status
[1] "OK"

$tag
[1] "5.0.6"

$version
[1] 1

stopifnot(GenomicDataCommons::status()$status=="OK")

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds open gene expression files quantified as raw counts using STAR from TCGA ovarian cancer patients.

ge_manifest <- files() %>%
  filter(cases.project.project_id == 'TCGA-OV') %>%
  filter(type == 'gene_expression') %>%
  filter(access == 'open') %>%
  filter(analysis.workflow_type == 'STAR - Counts') %>%
  manifest()

DT::datatable(ge_manifest)

The gdcdata function is used to download GDC files.

fnames <- lapply(ge_manifest$id[1:3], gdcdata)
fnames

[[1]]
                                                                                                          96aca0af-a776-460d-95ff-87e364e4ac99 
"~/.cache/GenomicDataCommons/96aca0af-a776-460d-95ff-87e364e4ac99/21ff9928-00f0-4b96-8d70-35e9bfad5d40.rna_seq.augmented_star_gene_counts.tsv" 

[[2]]
                                                                                                          b668c86b-fa56-4d39-9529-5b47081a3faa 
"~/.cache/GenomicDataCommons/b668c86b-fa56-4d39-9529-5b47081a3faa/41bdbd88-b4b2-4884-8a44-b34656ae4156.rna_seq.augmented_star_gene_counts.tsv" 

[[3]]
                                                                                                          60678f17-e3d7-40cd-99ff-73706497968a 
"~/.cache/GenomicDataCommons/60678f17-e3d7-40cd-99ff-73706497968a/03c8e4fe-1e07-4ea3-a154-c17c2e8af508.rna_seq.augmented_star_gene_counts.tsv"

Files are downloaded and stored in the directory specified by gdc_cache().

gdc_cache()

[1] "~/.cache/GenomicDataCommons"
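
Once a file is in the cache it can be read like any other tab-delimited file. The sketch below is an illustration rather than part of the original analysis: it assumes the augmented STAR counts files start with a `#` comment line, followed by a header row, a few `N_*` summary rows, and then one row per gene. The toy file, its values, and the helper name `read_star_counts` are made up for the example; check a real file before relying on this layout.

```r
# Toy stand-in for a downloaded *.rna_seq.augmented_star_gene_counts.tsv file.
# The layout and all values are assumptions made for illustration.
toy_lines <- c(
  "# gene-model: GENCODE v36",
  paste(c("gene_id", "gene_name", "gene_type", "unstranded", "stranded_first",
          "stranded_second", "tpm_unstranded", "fpkm_unstranded",
          "fpkm_uq_unstranded"), collapse = "\t"),
  "N_unmapped\t\t\t100\t100\t100\t\t\t",
  "N_multimapping\t\t\t50\t50\t50\t\t\t",
  "ENSG00000000003.15\tTSPAN6\tprotein_coding\t10\t5\t5\t1.5\t0.5\t0.6",
  "ENSG00000000005.6\tTNMD\tprotein_coding\t0\t0\t0\t0\t0\t0"
)
toy_file <- tempfile(fileext = ".tsv")
writeLines(toy_lines, toy_file)

# Hypothetical helper: read one counts file and drop the N_* summary rows.
read_star_counts <- function(path) {
  x <- read.delim(path, comment.char = "#", stringsAsFactors = FALSE)
  x[!grepl("^N_", x$gene_id), , drop = FALSE]
}

star_counts <- read_star_counts(toy_file)
star_counts[, c("gene_id", "gene_name", "unstranded")]
```

With real data, the same helper could be applied to each path in `fnames` using `lapply`, provided the files follow the assumed layout.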

Tally the total number of STAR gene count files that are open for download.

open_star_manifest <- files() %>%
    filter(analysis.workflow_type == 'STAR - Counts') %>%
    filter(access == 'open') %>%
    manifest()

dim(open_star_manifest)

[1] 23111    16

Metadata queries

Queries in the GenomicDataCommons package follow the four metadata endpoints available at the GDC; there are four convenience functions that each create GDCQuery objects:

projects()
cases()
files()
annotations()

Each of the four endpoints (projects, cases, files, and annotations) has its own set of associated fields. The default fields for each endpoint are listed below.

endpoints <- c("projects", "cases", "files", "annotations")
sapply(endpoints, default_fields)

$projects
 [1] "dbgap_accession_number" "disease_type"           "intended_release_date" 
 [4] "name"                   "primary_site"           "project_autocomplete"  
 [7] "project_id"             "releasable"             "released"              
[10] "state"                 

$cases
 [1] "aliquot_ids"              "analyte_ids"             
 [3] "case_autocomplete"        "case_id"                 
 [5] "consent_type"             "created_datetime"        
 [7] "days_to_consent"          "days_to_lost_to_followup"
 [9] "diagnosis_ids"            "disease_type"            
[11] "index_date"               "lost_to_followup"        
[13] "portion_ids"              "primary_site"            
[15] "sample_ids"               "slide_ids"               
[17] "state"                    "submitter_aliquot_ids"   
[19] "submitter_analyte_ids"    "submitter_diagnosis_ids" 
[21] "submitter_id"             "submitter_portion_ids"   
[23] "submitter_sample_ids"     "submitter_slide_ids"     
[25] "updated_datetime"        

$files
 [1] "access"                         "acl"                           
 [3] "average_base_quality"           "average_insert_size"           
 [5] "average_read_length"            "channel"                       
 [7] "chip_id"                        "chip_position"                 
 [9] "contamination"                  "contamination_error"           
[11] "created_datetime"               "data_category"                 
[13] "data_format"                    "data_type"                     
[15] "error_type"                     "experimental_strategy"         
[17] "file_autocomplete"              "file_id"                       
[19] "file_name"                      "file_size"                     
[21] "imaging_date"                   "magnification"                 
[23] "md5sum"                         "mean_coverage"                 
[25] "msi_score"                      "msi_status"                    
[27] "pairs_on_diff_chr"              "plate_name"                    
[29] "plate_well"                     "platform"                      
[31] "proc_internal"                  "proportion_base_mismatch"      
[33] "proportion_coverage_10x"        "proportion_coverage_10X"       
[35] "proportion_coverage_30x"        "proportion_coverage_30X"       
[37] "proportion_reads_duplicated"    "proportion_reads_mapped"       
[39] "proportion_targets_no_coverage" "read_pair_number"              
[41] "revision"                       "stain_type"                    
[43] "state"                          "state_comment"                 
[45] "submitter_id"                   "tags"                          
[47] "total_reads"                    "tumor_ploidy"                  
[49] "tumor_purity"                   "type"                          
[51] "updated_datetime"               "wgs_coverage"                  

$annotations
 [1] "annotation_autocomplete" "annotation_id"          
 [3] "case_id"                 "case_submitter_id"      
 [5] "category"                "classification"         
 [7] "created_datetime"        "entity_id"              
 [9] "entity_submitter_id"     "entity_type"            
[11] "legacy_created_datetime" "legacy_updated_datetime"
[13] "notes"                   "state"                  
[15] "status"                  "submitter_id"           
[17] "updated_datetime"

List all available fields for each endpoint and count them.

all_fields <- sapply(endpoints, available_fields)
names(all_fields) <- endpoints

sapply(all_fields, length)

   projects       cases       files annotations 
         22        1001        1022          30

These fields can be used for filtering purposes.

head(all_fields$files)

[1] "access"                      "acl"                        
[3] "analysis.analysis_id"        "analysis.analysis_type"     
[5] "analysis.created_datetime"   "analysis.input_files.access"

Use the facet function to aggregate on values used for a particular field.

files() %>% facet("access") %>% aggregations()

$access
  doc_count        key
1    678416 controlled
2    325331       open
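
Each aggregation is shown above as a table with a `doc_count` and a `key` column, so ordinary data frame operations apply. As a small sketch, the data frame below is typed in by hand from the output above (it is not fetched from the GDC) and is used to work out the share of files in each access category.

```r
# Counts copied from the aggregation output above; the data frame mimics
# the doc_count/key layout and is built by hand, not returned by the API.
access_counts <- data.frame(
  doc_count = c(678416, 325331),
  key = c("controlled", "open")
)

# Proportion of files in each access category.
access_counts$proportion <- access_counts$doc_count / sum(access_counts$doc_count)
access_counts
```

On these numbers roughly a third of the files are open access.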

Use grep to search for fields of interest, for example “project”.

grep("project", all_fields$files, ignore.case = TRUE, value = TRUE)

 [1] "cases.project.dbgap_accession_number"        
 [2] "cases.project.disease_type"                  
 [3] "cases.project.intended_release_date"         
 [4] "cases.project.name"                          
 [5] "cases.project.primary_site"                  
 [6] "cases.project.program.dbgap_accession_number"
 [7] "cases.project.program.name"                  
 [8] "cases.project.program.program_id"            
 [9] "cases.project.project_id"                    
[10] "cases.project.releasable"                    
[11] "cases.project.released"                      
[12] "cases.project.state"                         
[13] "cases.tissue_source_site.project"

Look for “days_to_collection” among the case fields by searching for “collection”.

grep("collection", all_fields$cases, ignore.case = TRUE, value = TRUE)

[1] "samples.days_to_collection"     "samples.tissue_collection_type"

Look for “workflow_type”.

grep("workflow_type", all_fields$cases, ignore.case = TRUE, value = TRUE)

[1] "files.analysis.metadata.read_groups.read_group_qcs.workflow_type"
[2] "files.analysis.workflow_type"                                    
[3] "files.downstream_analyses.workflow_type"

Look for “treatment”.

grep("treatment", all_fields$cases, ignore.case = TRUE, value = TRUE)

 [1] "diagnoses.prior_treatment"                         
 [2] "diagnoses.treatments.chemo_concurrent_to_radiation"
 [3] "diagnoses.treatments.created_datetime"             
 [4] "diagnoses.treatments.days_to_treatment_end"        
 [5] "diagnoses.treatments.days_to_treatment_start"      
 [6] "diagnoses.treatments.initial_disease_status"       
 [7] "diagnoses.treatments.number_of_cycles"             
 [8] "diagnoses.treatments.reason_treatment_ended"       
 [9] "diagnoses.treatments.regimen_or_line_of_therapy"   
[10] "diagnoses.treatments.route_of_administration"      
[11] "diagnoses.treatments.state"                        
[12] "diagnoses.treatments.submitter_id"                 
[13] "diagnoses.treatments.therapeutic_agents"           
[14] "diagnoses.treatments.treatment_anatomic_site"      
[15] "diagnoses.treatments.treatment_arm"                
[16] "diagnoses.treatments.treatment_dose"               
[17] "diagnoses.treatments.treatment_dose_units"         
[18] "diagnoses.treatments.treatment_effect"             
[19] "diagnoses.treatments.treatment_effect_indicator"   
[20] "diagnoses.treatments.treatment_frequency"          
[21] "diagnoses.treatments.treatment_id"                 
[22] "diagnoses.treatments.treatment_intent_type"        
[23] "diagnoses.treatments.treatment_or_therapy"         
[24] "diagnoses.treatments.treatment_outcome"            
[25] "diagnoses.treatments.treatment_type"               
[26] "diagnoses.treatments.updated_datetime"             
[27] "follow_ups.diabetes_treatment_type"                
[28] "follow_ups.haart_treatment_indicator"              
[29] "follow_ups.immunosuppressive_treatment_type"       
[30] "follow_ups.reflux_treatment_type"                  
[31] "follow_ups.risk_factor_treatment"

Note that the components of each field name above are separated by periods (.), which reflect the hierarchical structure of the data model. Summarise the top-level fields by using sub to strip everything after the first period.

unique(sub("^(\\w+)\\..*", "\\1", all_fields$cases))

 [1] "aliquot_ids"              "analyte_ids"             
 [3] "annotations"              "case_autocomplete"       
 [5] "case_id"                  "consent_type"            
 [7] "created_datetime"         "days_to_consent"         
 [9] "days_to_lost_to_followup" "demographic"             
[11] "diagnoses"                "diagnosis_ids"           
[13] "disease_type"             "exposures"               
[15] "family_histories"         "files"                   
[17] "follow_ups"               "index_date"              
[19] "lost_to_followup"         "portion_ids"             
[21] "primary_site"             "project"                 
[23] "sample_ids"               "samples"                 
[25] "slide_ids"                "state"                   
[27] "submitter_aliquot_ids"    "submitter_analyte_ids"   
[29] "submitter_diagnosis_ids"  "submitter_id"            
[31] "submitter_portion_ids"    "submitter_sample_ids"    
[33] "submitter_slide_ids"      "summary"                 
[35] "tissue_source_site"       "updated_datetime"
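
To make the regular expression concrete, here is a minimal offline sketch on a handful of field names copied from the grep output above. The vector `toy_fields` is hand-picked for illustration; names without a period are left unchanged by `sub`, which is why flat fields such as `case_id` also appear in the summary.

```r
# A few field names copied from the output above.
toy_fields <- c(
  "case_id",
  "diagnoses.prior_treatment",
  "diagnoses.treatments.treatment_type",
  "follow_ups.risk_factor_treatment",
  "samples.days_to_collection",
  "samples.tissue_collection_type"
)

# Keep only the part before the first period.
top_level <- sub("^(\\w+)\\..*", "\\1", toy_fields)

# Number of fields under each top-level entry.
table(top_level)

# Nesting depth: number of period-separated components in each name.
depth <- lengths(strsplit(toy_fields, ".", fixed = TRUE))
setNames(depth, toy_fields)
```

Counting fields per top-level entry in this way is a quick check on which parts of the hierarchy carry the most detail.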

Aggregations are computed on one field at a time: faceting on several fields returns a separate tally for each field, not a cross-tabulation.

files() %>% facet(c("type", "data_format")) %>% aggregations()

$data_format
   doc_count               key
1     188265               tsv
2     184432               vcf
3     163225               maf
4     149745               bam
5     123119               txt
6      52733             bedpe
7      32898               svs
8      32708              idat
9      24236               cel
10     24002           bcr xml
11     11324               pdf
12     10755       bcr ssf xml
13      2884 bcr auxiliary xml
14      1051       bcr omf xml
15       805          cdc json
16       602        bcr biotab
17       568       bcr pps xml
18       215         jpeg 2000
19        74               mex
20        70              xlsx
21        36              hdf5

$type
   doc_count                           key
1     197177    annotated_somatic_mutation
2     149745                 aligned_reads
3      98319          structural_variation
4      94773       simple_somatic_mutation
5      71861           copy_number_segment
6      69806          copy_number_estimate
7      46580               gene_expression
8      34661   aggregated_somatic_mutation
9      34408              mirna_expression
10     33113                   slide_image
11     32708      masked_methylation_array
12     26978        biospecimen_supplement
13     24236    submitted_genotyping_array
14     23135     simple_germline_variation
15     16657       masked_somatic_mutation
16     16354        methylation_beta_value
17     13898           clinical_supplement
18     11324              pathology_report
19      7906            protein_expression
20       108 secondary_expression_analysis
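
The difference between per-field tallies and a cross-tabulation can be sketched offline with a toy table of file records. The records below are made up for illustration; only the `doc_count`/`key` column naming follows the output above.

```r
# Made-up file records; the values are illustrative only.
toy_files <- data.frame(
  type = c("gene_expression", "gene_expression",
           "aligned_reads", "aligned_reads", "aligned_reads"),
  data_format = c("tsv", "tsv", "bam", "bam", "bam")
)

# One independent tally per field, analogous to faceting on two fields.
per_field <- lapply(toy_files, function(x) {
  as.data.frame(table(key = x), responseName = "doc_count")
})
per_field

# A cross-tabulation needs both fields at once, which a facet does not provide.
cross_tab <- table(toy_files$type, toy_files$data_format)
cross_tab
```

With the GDC, a comparable breakdown can be obtained by filtering on one field and faceting on the other, as is done for BAM and VCF files in the Files section.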

Aggregate on a sub-field.

cases() %>%
  filter(files.access == 'open') %>%
  facet("diagnoses.treatments.treatment_type") %>%
  aggregations()

$diagnoses.treatments.treatment_type
   doc_count                                          key
1      12170                       radiation therapy, nos
2      11994                  pharmaceutical therapy, nos
3        465                                 chemotherapy
4        520        stem cell transplantation, autologous
5        296                                 surgery, nos
6        171                   targeted molecular therapy
7        168           immunotherapy (including vaccines)
8         96                     radiation, external beam
9         53                      brachytherapy, low dose
10        38                              hormone therapy
11        33                     brachytherapy, high dose
12        14        stem cell transplantation, allogeneic
13         9                   radiation, 2d conventional
14         7                      radiation, 3d conformal
15         6  radiation, intensity-modulated radiotherapy
16         5      radiation, stereotactic/gamma knife/srs
17         3                    stereotactic radiosurgery
18         1                     ablation, radiofrequency
19         1                      external beam radiation
20         1 peptide receptor radionuclide therapy (prrt)
21         1                       radiation, proton beam
22     30737                                     _missing

Facet on analysis.workflow_type for open-access files.

files() %>%
  filter(access == 'open') %>%
  facet("analysis.workflow_type") %>%
  aggregations()

$analysis.workflow_type
   doc_count                                                  key
1      49062                   SeSAMe Methylation Beta Estimation
2      45258                                              DNAcopy
3      34408                                BCGSC miRNA Profiling
4      23164                                               ASCAT2
5      23111                                        STAR - Counts
6      21264                                               ASCAT3
7      16522 Aliquot Ensemble Somatic Variant Merging and Masking
8      10677                                    ABSOLUTE LiftOver
9       8776                                             AscatNGS
10       108                                Seurat - 10x Chromium
11        38                          CellRanger - 10x Raw Counts
12        36                     CellRanger - 10x Filtered Counts
13     92907                                             _missing

Facet on experimental_strategy for open-access files.

files() %>%
  filter(access == 'open') %>%
  facet("experimental_strategy") %>%
  aggregations()

$experimental_strategy
   doc_count                         key
1     100363            Genotyping Array
2      49062           Methylation Array
3      34408                   miRNA-Seq
4      23111                     RNA-Seq
5      21348                Tissue Slide
6      16075                         WXS
7      11765            Diagnostic Slide
8       8776                         WGS
9       7906 Reverse Phase Protein Array
10       447         Targeted Sequencing
11       182                   scRNA-Seq
12     51888                    _missing

Files

All BAM files are under controlled access.

files() %>%
  filter(data_format == 'bam') %>%
  facet("access") %>%
  aggregations()

$access
  doc_count        key
1    149745 controlled

All VCF files are also under controlled access.

files() %>%
  filter(data_format == 'vcf') %>%
  facet("access") %>%
  aggregations()

$access
  doc_count        key
1    184432 controlled

Mutation Annotation Format (MAF) files are openly available. These are tab-delimited text files with aggregated mutation information from VCF files.

files() %>%
  filter(access == 'open') %>%
  filter(experimental_strategy == 'WXS') %>%
  facet("data_format") %>%
  aggregations()

$data_format
  doc_count key
1     16075 maf

Project

Project fields.

all_fields$projects

 [1] "dbgap_accession_number"                               
 [2] "disease_type"                                         
 [3] "intended_release_date"                                
 [4] "name"                                                 
 [5] "primary_site"                                         
 [6] "program.dbgap_accession_number"                       
 [7] "program.name"                                         
 [8] "program.program_id"                                   
 [9] "project_autocomplete"                                 
[10] "project_id"                                           
[11] "releasable"                                           
[12] "released"                                             
[13] "state"                                                
[14] "summary.case_count"                                   
[15] "summary.data_categories.case_count"                   
[16] "summary.data_categories.data_category"                
[17] "summary.data_categories.file_count"                   
[18] "summary.experimental_strategies.case_count"           
[19] "summary.experimental_strategies.experimental_strategy"
[20] "summary.experimental_strategies.file_count"           
[21] "summary.file_count"                                   
[22] "summary.file_size"

Use projects() with results_all() to fetch information on every project, and ids() to list the project identifiers.

projects() %>% results_all() -> project_info

sort(ids(project_info))

 [1] "APOLLO-LUAD"               "BEATAML1.0-COHORT"        
 [3] "BEATAML1.0-CRENOLANIB"     "CDDP_EAGLE-1"             
 [5] "CGCI-BLGSP"                "CGCI-HTMCP-CC"            
 [7] "CGCI-HTMCP-DLBCL"          "CGCI-HTMCP-LC"            
 [9] "CMI-ASC"                   "CMI-MBC"                  
[11] "CMI-MPC"                   "CPTAC-2"                  
[13] "CPTAC-3"                   "CTSP-DLBCL1"              
[15] "EXCEPTIONAL_RESPONDERS-ER" "FM-AD"                    
[17] "GENIE-DFCI"                "GENIE-GRCC"               
[19] "GENIE-JHU"                 "GENIE-MDA"                
[21] "GENIE-MSK"                 "GENIE-NKI"                
[23] "GENIE-UHN"                 "GENIE-VICC"               
[25] "HCMI-CMDC"                 "MATCH-B"                  
[27] "MATCH-N"                   "MATCH-Q"                  
[29] "MATCH-Y"                   "MATCH-Z1D"                
[31] "MMRF-COMMPASS"             "MP2PRT-ALL"               
[33] "MP2PRT-WT"                 "NCICCR-DLBCL"             
[35] "OHSU-CNL"                  "ORGANOID-PANCREATIC"      
[37] "REBC-THYR"                 "TARGET-ALL-P1"            
[39] "TARGET-ALL-P2"             "TARGET-ALL-P3"            
[41] "TARGET-AML"                "TARGET-CCSK"              
[43] "TARGET-NBL"                "TARGET-OS"                
[45] "TARGET-RT"                 "TARGET-WT"                
[47] "TCGA-ACC"                  "TCGA-BLCA"                
[49] "TCGA-BRCA"                 "TCGA-CESC"                
[51] "TCGA-CHOL"                 "TCGA-COAD"                
[53] "TCGA-DLBC"                 "TCGA-ESCA"                
[55] "TCGA-GBM"                  "TCGA-HNSC"                
[57] "TCGA-KICH"                 "TCGA-KIRC"                
[59] "TCGA-KIRP"                 "TCGA-LAML"                
[61] "TCGA-LGG"                  "TCGA-LIHC"                
[63] "TCGA-LUAD"                 "TCGA-LUSC"                
[65] "TCGA-MESO"                 "TCGA-OV"                  
[67] "TCGA-PAAD"                 "TCGA-PCPG"                
[69] "TCGA-PRAD"                 "TCGA-READ"                
[71] "TCGA-SARC"                 "TCGA-SKCM"                
[73] "TCGA-STAD"                 "TCGA-TGCT"                
[75] "TCGA-THCA"                 "TCGA-THYM"                
[77] "TCGA-UCEC"                 "TCGA-UCS"                 
[79] "TCGA-UVM"                  "TRIO-CRU"                 
[81] "VAREPOP-APOLLO"            "WCDT-MCRPC"

The results() method fetches the actual records; the size argument limits how many are returned.

projects() %>% results(size = 10) -> my_proj

str(my_proj, max.level = 1)

List of 9
 $ id                    : chr [1:10] "CGCI-HTMCP-CC" "TARGET-AML" "GENIE-JHU" "GENIE-MSK" ...
 $ primary_site          :List of 10
 $ dbgap_accession_number: chr [1:10] "phs000528" "phs000465" NA NA ...
 $ project_id            : chr [1:10] "CGCI-HTMCP-CC" "TARGET-AML" "GENIE-JHU" "GENIE-MSK" ...
 $ disease_type          :List of 10
 $ name                  : chr [1:10] "HIV+ Tumor Molecular Characterization Project - Cervical Cancer" "Acute Myeloid Leukemia" "AACR Project GENIE - Contributed by Johns Hopkins Sidney Kimmel Comprehensive Cancer Center" "AACR Project GENIE - Contributed by Memorial Sloan Kettering Cancer Center" ...
 $ releasable            : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ state                 : chr [1:10] "open" "open" "open" "open" ...
 $ released              : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
 - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"

my_proj$project_id

 [1] "CGCI-HTMCP-CC" "TARGET-AML"    "GENIE-JHU"     "GENIE-MSK"    
 [5] "GENIE-VICC"    "GENIE-MDA"     "TCGA-MESO"     "TARGET-ALL-P3"
 [9] "TCGA-UVM"      "TCGA-KICH"
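
The str output above shows that a result is a list mixing plain vectors with list elements such as primary_site. One way to get a flat table from such a structure is sketched below on a toy list. The helper `flatten_results` is hypothetical, the primary_site values are made up, and it is an assumption that the real result object behaves like this plain list.

```r
# Toy list mimicking the shape shown by str(my_proj): atomic vectors plus
# a list element. The primary_site values are made up for illustration.
toy_proj <- list(
  project_id = c("TCGA-MESO", "TCGA-UVM"),
  released = c(TRUE, TRUE),
  primary_site = list(c("Pleura", "Bronchus and lung"), "Eye and adnexa")
)

# Hypothetical helper: collapse each list element to one string per record
# so that every column has one value per row.
flatten_results <- function(res, sep = "; ") {
  cols <- lapply(res, function(x) {
    if (is.list(x)) vapply(x, paste, character(1), collapse = sep) else x
  })
  as.data.frame(cols, stringsAsFactors = FALSE)
}

proj_df <- flatten_results(toy_proj)
proj_df
```

Collapsing to a delimited string loses the list structure, so this is best suited to display or export rather than further computation.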

Clinical data

The gdc_clinical function:

The NCI GDC has a complex data model that allows various studies to supply numerous clinical and demographic data elements. However, across all projects that enter the GDC, there are similarities. This function returns four data.frames associated with case_ids from the GDC.

Accessing clinical data.

case_ids <- cases() %>% results(size=10) %>% ids()
clindat <- gdc_clinical(case_ids)
names(clindat)

[1] "demographic" "diagnoses"   "exposures"   "main"

Demographic.

idx <- apply(clindat$demographic, 2, function(x) all(is.na(x)))
DT::datatable(clindat$demographic[, !idx])

Diagnoses data.

idx <- apply(clindat$diagnoses, 2, function(x) all(is.na(x)))
DT::datatable(clindat$diagnoses[, !idx])

Exposures data.

idx <- apply(clindat$exposures, 2, function(x) all(is.na(x)))
DT::datatable(clindat$exposures[, !idx])

Main data.

idx <- apply(clindat$main, 2, function(x) all(is.na(x)))
DT::datatable(clindat$main[, !idx])
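
The same column filter is repeated for each of the four tables, so it could be wrapped in a small function. The sketch below uses a made-up data frame, and the helper name `drop_all_na_cols` is hypothetical. It iterates over columns with `vapply` instead of `apply`, which avoids coercing the data frame to a matrix, and uses `drop = FALSE` so that the result stays a data frame even when a single column survives.

```r
# Made-up clinical table with one column that is entirely missing.
toy_clin <- data.frame(
  case_id = c("a", "b", "c"),
  gender = c("female", NA, "male"),
  age_is_obfuscated = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only columns with at least one non-missing value.
drop_all_na_cols <- function(df) {
  keep <- vapply(df, function(x) !all(is.na(x)), logical(1))
  df[, keep, drop = FALSE]
}

drop_all_na_cols(toy_clin)
```

Applied to the real data, this would look like `lapply(clindat, drop_all_na_cols)`, assuming each element of `clindat` is a data frame as described above.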

Cases

Find all files related to a specific case, or sample donor.

case1 <- cases() %>% results(size=1)
str(case1, max.level = 1)

List of 25
 $ id                      : chr "935ca1d3-2445-4f59-95a6-19f3311c1900"
 $ lost_to_followup        : chr "No"
 $ slide_ids               :List of 1
 $ submitter_slide_ids     :List of 1
 $ days_to_lost_to_followup: logi NA
 $ disease_type            : chr "Squamous Cell Neoplasms"
 $ analyte_ids             :List of 1
 $ submitter_id            : chr "HTMCP-03-06-02345"
 $ submitter_analyte_ids   :List of 1
 $ days_to_consent         : logi NA
 $ aliquot_ids             :List of 1
 $ submitter_aliquot_ids   :List of 1
 $ created_datetime        : chr "2019-11-21T18:06:42.617487-06:00"
 $ diagnosis_ids           :List of 1
 $ sample_ids              :List of 1
 $ consent_type            : logi NA
 $ submitter_sample_ids    :List of 1
 $ primary_site            : chr "Cervix uteri"
 $ submitter_diagnosis_ids :List of 1
 $ updated_datetime        : chr "2020-04-28T11:49:05.699379-05:00"
 $ case_id                 : chr "935ca1d3-2445-4f59-95a6-19f3311c1900"
 $ index_date              : chr "Diagnosis"
 $ state                   : chr "released"
 $ portion_ids             :List of 1
 $ submitter_portion_ids   :List of 1
 - attr(*, "row.names")= int 1
 - attr(*, "class")= chr [1:3] "GDCcasesResults" "GDCResults" "list"

Sample IDs.

case1$sample_ids

$`935ca1d3-2445-4f59-95a6-19f3311c1900`
[1] "f7706af8-c4e6-4e94-95f1-b6b4901dfe28"
[2] "bb3365f7-7bf9-46c6-ac60-4b7e77268ed8"
[3] "a35a4c87-86f9-4400-b43a-2b0999c69c19"

All case fields.

case_fields <- available_fields("cases")

Search case_fields with grep.

grep("sample_ids", case_fields, value = TRUE)

[1] "sample_ids"           "submitter_sample_ids"

grep("sample_type", case_fields, value = TRUE)

[1] "samples.sample_type"    "samples.sample_type_id"

grep("workflow_type", case_fields, value = TRUE)

[1] "files.analysis.metadata.read_groups.read_group_qcs.workflow_type"
[2] "files.analysis.workflow_type"                                    
[3] "files.downstream_analyses.workflow_type"

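Get case data. count() returns the number of matching cases, which is then passed to results() as size so that every matching case is retrieved.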

n_star_cases <- cases() %>%
  filter(files.analysis.workflow_type == 'STAR - Counts') %>%
  filter(files.access == 'open') %>%
  count()

star_cases <- cases() %>%
  filter(files.analysis.workflow_type == 'STAR - Counts') %>%
  filter(files.access == 'open') %>%
  results(size = n_star_cases)

sapply(star_cases, length)

                      id         lost_to_followup                slide_ids 
                   19101                    19101                    19101 
     submitter_slide_ids days_to_lost_to_followup             disease_type 
                   19101                    19101                    19101 
             analyte_ids             submitter_id    submitter_analyte_ids 
                   19101                    19101                    19101 
         days_to_consent              aliquot_ids    submitter_aliquot_ids 
                   19101                    19101                    19101 
        created_datetime            diagnosis_ids               sample_ids 
                   19101                    19101                    19101 
            consent_type     submitter_sample_ids             primary_site 
                   19101                    19101                    19101 
 submitter_diagnosis_ids         updated_datetime                  case_id 
                   19101                    19101                    19101 
              index_date                    state              portion_ids 
                   19101                    19101                    19101 
   submitter_portion_ids 
                   19101

case_id is the same as id.

table(star_cases$case_id == star_cases$id)


 TRUE 
19101

One case ID to multiple sample IDs.

head(star_cases$sample_ids, 3)

$`9453db51-fff8-4a78-a29c-bb9151e9bd2a`
[1] "6662a85c-37b7-48b1-a8c6-f00171bb8226"
[2] "9bab246d-4a0d-4f28-ba1f-56b19a6f93bb"
[3] "6b8ea6bb-d10b-474a-9b4b-f406285dfb2f"

$`9485e946-f569-46fb-b77e-e5af68f7961a`
[1] "e3f781a2-f087-4abb-8f36-af799e837557"
[2] "cc8c2432-4107-4b5a-9452-3c536dac8baf"
[3] "330292a0-80dd-4fc4-a64c-4fce119dcbb6"

$`981300da-9136-402a-88df-2c76b1e3ad87`
[1] "42c67b29-94a1-4520-9122-b2daa02a03ad"
[2] "9d351761-59cb-40f7-aee2-ce2c6365acc2"
[3] "9276070c-cab5-4ba3-978d-2d18976a8758"

Sample IDs to case IDs.

sample_id_len <- sapply(star_cases$sample_ids, length)
my_ids <- rep(names(sample_id_len), sample_id_len)
sample_id_lookup <- data.frame(
  sample_ids = unlist(star_cases$sample_ids),
  case_id = my_ids,
  row.names = NULL
)

head(sample_id_lookup)

                            sample_ids                              case_id
1 6662a85c-37b7-48b1-a8c6-f00171bb8226 9453db51-fff8-4a78-a29c-bb9151e9bd2a
2 9bab246d-4a0d-4f28-ba1f-56b19a6f93bb 9453db51-fff8-4a78-a29c-bb9151e9bd2a
3 6b8ea6bb-d10b-474a-9b4b-f406285dfb2f 9453db51-fff8-4a78-a29c-bb9151e9bd2a
4 e3f781a2-f087-4abb-8f36-af799e837557 9485e946-f569-46fb-b77e-e5af68f7961a
5 cc8c2432-4107-4b5a-9452-3c536dac8baf 9485e946-f569-46fb-b77e-e5af68f7961a
6 330292a0-80dd-4fc4-a64c-4fce119dcbb6 9485e946-f569-46fb-b77e-e5af68f7961a
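The same kind of lookup table can also be built with base R's stack(), which turns a named list of vectors into a two-column data frame. This is only a sketch on toy data: the case and sample names below are placeholders, not real GDC identifiers, and toy_sample_ids stands in for star_cases$sample_ids.

```r
# Toy stand-in for star_cases$sample_ids: a list named by case ID,
# where each element holds that case's sample IDs (placeholder values).
toy_sample_ids <- list(
  "case-a" = c("sample-1", "sample-2", "sample-3"),
  "case-b" = c("sample-4", "sample-5")
)

# stack() returns columns "values" (the elements) and "ind" (the list names).
toy_lookup <- stack(toy_sample_ids)
names(toy_lookup) <- c("sample_ids", "case_id")

# "ind" is a factor; convert it to character for easier joining later.
toy_lookup$case_id <- as.character(toy_lookup$case_id)

# Find the case that a given sample belongs to.
found_case <- toy_lookup$case_id[match("sample-4", toy_lookup$sample_ids)]
print(toy_lookup)
print(found_case)
```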

TCGA

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions.

TCGA nomenclature

Data from TCGA (gene expression, copy number variation, clinical information, etc.) are available via the Genomic Data Commons (GDC). Primary sequence data (stored in BAM files) are under controlled access; access must be requested via dbGaP, and the request should be made by the PI.

Study Abbreviation	Study Name
LAML	Acute Myeloid Leukemia
ACC	Adrenocortical carcinoma
BLCA	Bladder Urothelial Carcinoma
LGG	Brain Lower Grade Glioma
BRCA	Breast invasive carcinoma
CESC	Cervical squamous cell carcinoma and endocervical adenocarcinoma
CHOL	Cholangiocarcinoma
LCML	Chronic Myelogenous Leukemia
COAD	Colon adenocarcinoma
CNTL	Controls
ESCA	Esophageal carcinoma
FPPP	FFPE Pilot Phase II
GBM	Glioblastoma multiforme
HNSC	Head and Neck squamous cell carcinoma
KICH	Kidney Chromophobe
KIRC	Kidney renal clear cell carcinoma
KIRP	Kidney renal papillary cell carcinoma
LIHC	Liver hepatocellular carcinoma
LUAD	Lung adenocarcinoma
LUSC	Lung squamous cell carcinoma
DLBC	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
MESO	Mesothelioma
MISC	Miscellaneous
OV	Ovarian serous cystadenocarcinoma
PAAD	Pancreatic adenocarcinoma
PCPG	Pheochromocytoma and Paraganglioma
PRAD	Prostate adenocarcinoma
READ	Rectum adenocarcinoma
SARC	Sarcoma
SKCM	Skin Cutaneous Melanoma
STAD	Stomach adenocarcinoma
TGCT	Testicular Germ Cell Tumors
THYM	Thymoma
THCA	Thyroid carcinoma
UCS	Uterine Carcinosarcoma
UCEC	Uterine Corpus Endometrial Carcinoma
UVM	Uveal Melanoma

Table source: https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html

A TCGA barcode is composed of a collection of identifiers. Each specifically identifies a TCGA data element. Refer to the following figure for an illustration of how metadata identifiers comprise a barcode. An aliquot barcode contains the highest number of identifiers. For example:

Aliquot barcode: TCGA-G4-6317-02A-11D-2064-05
Participant: TCGA-G4-6317
Sample: TCGA-G4-6317-02
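As a rough illustration of how the shorter identifiers are nested inside the aliquot barcode, the sketch below splits the example barcode on hyphens and reassembles the participant and sample parts. The function name parse_barcode is hypothetical, and the sketch assumes the layout implied by the example above: the first three fields form the participant, and the first two characters of the fourth field complete the sample.

```r
# Hypothetical helper: derive participant and sample identifiers
# from a TCGA aliquot barcode, assuming the layout in the example above.
parse_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  list(
    # First three hyphen-separated fields.
    participant = paste(parts[1:3], collapse = "-"),
    # Participant plus the first two characters of the fourth field.
    sample = paste(c(parts[1:3], substr(parts[4], 1, 2)), collapse = "-")
  )
}

parsed <- parse_barcode("TCGA-G4-6317-02A-11D-2064-05")
print(parsed$participant)
print(parsed$sample)
```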

Fetch projects.

projects() %>% results(size=100) -> my_projects
str(my_projects, max.level = 1)

List of 9
 $ id                    : chr [1:82] "CGCI-HTMCP-CC" "TARGET-AML" "GENIE-JHU" "GENIE-MSK" ...
 $ primary_site          :List of 82
 $ dbgap_accession_number: chr [1:82] "phs000528" "phs000465" NA NA ...
 $ project_id            : chr [1:82] "CGCI-HTMCP-CC" "TARGET-AML" "GENIE-JHU" "GENIE-MSK" ...
 $ disease_type          :List of 82
 $ name                  : chr [1:82] "HIV+ Tumor Molecular Characterization Project - Cervical Cancer" "Acute Myeloid Leukemia" "AACR Project GENIE - Contributed by Johns Hopkins Sidney Kimmel Comprehensive Cancer Center" "AACR Project GENIE - Contributed by Memorial Sloan Kettering Cancer Center" ...
 $ releasable            : logi [1:82] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ state                 : chr [1:82] "open" "open" "open" "open" ...
 $ released              : logi [1:82] TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, "row.names")= int [1:82] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"

Project IDs.

my_projects$id

 [1] "CGCI-HTMCP-CC"             "TARGET-AML"               
 [3] "GENIE-JHU"                 "GENIE-MSK"                
 [5] "GENIE-VICC"                "GENIE-MDA"                
 [7] "TCGA-MESO"                 "TARGET-ALL-P3"            
 [9] "TCGA-UVM"                  "TCGA-KICH"                
[11] "TARGET-WT"                 "TARGET-OS"                
[13] "TCGA-DLBC"                 "GENIE-UHN"                
[15] "APOLLO-LUAD"               "CDDP_EAGLE-1"             
[17] "EXCEPTIONAL_RESPONDERS-ER" "MP2PRT-WT"                
[19] "CGCI-HTMCP-DLBCL"          "CMI-MPC"                  
[21] "WCDT-MCRPC"                "TCGA-CHOL"                
[23] "TCGA-UCS"                  "TCGA-PCPG"                
[25] "CPTAC-2"                   "TCGA-CESC"                
[27] "TCGA-LIHC"                 "TCGA-ACC"                 
[29] "CMI-MBC"                   "TCGA-BRCA"                
[31] "CPTAC-3"                   "TCGA-COAD"                
[33] "TCGA-GBM"                  "TCGA-TGCT"                
[35] "NCICCR-DLBCL"              "TCGA-LGG"                 
[37] "FM-AD"                     "GENIE-GRCC"               
[39] "CTSP-DLBCL1"               "TARGET-CCSK"              
[41] "GENIE-NKI"                 "TARGET-ALL-P1"            
[43] "MATCH-N"                   "TRIO-CRU"                 
[45] "CMI-ASC"                   "TARGET-RT"                
[47] "ORGANOID-PANCREATIC"       "MATCH-Z1D"                
[49] "MATCH-B"                   "VAREPOP-APOLLO"           
[51] "MATCH-Q"                   "BEATAML1.0-CRENOLANIB"    
[53] "MATCH-Y"                   "OHSU-CNL"                 
[55] "CGCI-HTMCP-LC"             "TARGET-NBL"               
[57] "TCGA-SARC"                 "TCGA-PAAD"                
[59] "TCGA-LUAD"                 "TCGA-PRAD"                
[61] "MP2PRT-ALL"                "TCGA-LUSC"                
[63] "TCGA-LAML"                 "TCGA-SKCM"                
[65] "HCMI-CMDC"                 "BEATAML1.0-COHORT"        
[67] "TCGA-BLCA"                 "TCGA-READ"                
[69] "TCGA-UCEC"                 "TCGA-THCA"                
[71] "TCGA-OV"                   "TCGA-KIRC"                
[73] "MMRF-COMMPASS"             "GENIE-DFCI"               
[75] "TCGA-HNSC"                 "TCGA-ESCA"                
[77] "CGCI-BLGSP"                "TARGET-ALL-P2"            
[79] "TCGA-STAD"                 "REBC-THYR"                
[81] "TCGA-KIRP"                 "TCGA-THYM"

Ovarian serous cystadenocarcinoma

Available (i.e. open) STAR metadata.

get_star_metadata <- function(proj){
  files() %>%
    filter(cases.project.project_id == proj) %>% 
    filter(analysis.workflow_type == 'STAR - Counts') %>%
    filter(access == 'open') %>%
    GenomicDataCommons::select(
      c(
        default_fields('files'),
        "cases.case_id",
        "cases.samples.sample_type",
        "cases.samples.sample_id"
      )
    ) %>%
    results_all()
}

ov_star <- get_star_metadata("TCGA-OV")

str(ov_star, max.level = 1)

List of 17
 $ id                   : chr [1:429] "96aca0af-a776-460d-95ff-87e364e4ac99" "b668c86b-fa56-4d39-9529-5b47081a3faa" "60678f17-e3d7-40cd-99ff-73706497968a" "38fb3b15-f838-4d5b-a830-6051067d8e2e" ...
 $ data_format          : chr [1:429] "TSV" "TSV" "TSV" "TSV" ...
 $ cases                :List of 429
 $ access               : chr [1:429] "open" "open" "open" "open" ...
 $ file_name            : chr [1:429] "21ff9928-00f0-4b96-8d70-35e9bfad5d40.rna_seq.augmented_star_gene_counts.tsv" "41bdbd88-b4b2-4884-8a44-b34656ae4156.rna_seq.augmented_star_gene_counts.tsv" "03c8e4fe-1e07-4ea3-a154-c17c2e8af508.rna_seq.augmented_star_gene_counts.tsv" "7d14ddef-7a1b-4515-9536-3fc4a9b85702.rna_seq.augmented_star_gene_counts.tsv" ...
 $ submitter_id         : chr [1:429] "41b13518-8f35-4369-8ff2-2b694d3e0091" "a542eb04-0978-42b1-b5e6-8473ddf04526" "57c246c4-1c0c-470e-8914-696bb7815c02" "d13d4d38-b242-4308-90d7-43aa485abcb5" ...
 $ data_category        : chr [1:429] "Transcriptome Profiling" "Transcriptome Profiling" "Transcriptome Profiling" "Transcriptome Profiling" ...
 $ acl                  :List of 429
 $ type                 : chr [1:429] "gene_expression" "gene_expression" "gene_expression" "gene_expression" ...
 $ file_size            : int [1:429] 4240026 4259621 4244112 4241087 4251142 4252582 4257244 4272506 4236251 4250530 ...
 $ created_datetime     : chr [1:429] "2021-12-13T20:45:56.142462-06:00" "2021-12-13T20:47:21.099504-06:00" "2021-12-13T20:44:24.979694-06:00" "2021-12-13T20:49:39.972683-06:00" ...
 $ md5sum               : chr [1:429] "c8b0b56114b382ae7855c47092aaf391" "9fe002d9d9512b99ad44edfa0c0bcd37" "625d9b63d9a37c3a80afd29db8ea6641" "f0c4926d57469765026470b21876a8bd" ...
 $ updated_datetime     : chr [1:429] "2022-01-19T14:47:35.686434-06:00" "2022-01-19T14:47:42.525493-06:00" "2022-01-19T14:47:22.611372-06:00" "2022-01-19T14:47:15.461468-06:00" ...
 $ file_id              : chr [1:429] "96aca0af-a776-460d-95ff-87e364e4ac99" "b668c86b-fa56-4d39-9529-5b47081a3faa" "60678f17-e3d7-40cd-99ff-73706497968a" "38fb3b15-f838-4d5b-a830-6051067d8e2e" ...
 $ data_type            : chr [1:429] "Gene Expression Quantification" "Gene Expression Quantification" "Gene Expression Quantification" "Gene Expression Quantification" ...
 $ state                : chr [1:429] "released" "released" "released" "released" ...
 $ experimental_strategy: chr [1:429] "RNA-Seq" "RNA-Seq" "RNA-Seq" "RNA-Seq" ...
 - attr(*, "row.names")= int [1:429] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "class")= chr [1:3] "GDCfilesResults" "GDCResults" "list"

Examine the case record attached to a single file (the cases list is named by file ID).

str(ov_star$cases$`96aca0af-a776-460d-95ff-87e364e4ac99`)

'data.frame':   1 obs. of  2 variables:
 $ case_id: chr "9446e349-71e6-455a-aa8f-53ec96597146"
 $ samples:List of 1
  ..$ :'data.frame':    1 obs. of  2 variables:
  .. ..$ sample_id  : chr "1d568bd2-d658-40fa-a341-daa4d2a5bb22"
  .. ..$ sample_type: chr "Primary Tumor"

File IDs are unique.

length(unique(ov_star$id)) == length(ov_star$id)

[1] TRUE

Each file's case record contains the associated samples.

ov_star$cases$`96aca0af-a776-460d-95ff-87e364e4ac99`

                               case_id      samples
1 9446e349-71e6-455a-aa8f-53ec96597146 1d568bd2....

Build data frame.

sapply(ov_star$cases, function(x) x$samples) |>
  do.call(rbind.data.frame, args = _) -> ov_star_cases

dim(ov_star_cases)

[1] 429   2
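The flattening step drops the file ID and case ID columns. The following sketch shows one way to keep them, using toy data that mimics the nested structure printed by str() above (a one-row data frame per file with a samples list column). The helper make_case and all IDs are made up for illustration, and the third file is given a hypothetical second sample to show that the approach does not assume one sample per file.

```r
# Hypothetical helper that builds one nested record shaped like an element
# of ov_star$cases: a one-row data frame with a list column "samples".
make_case <- function(case_id, sample_id, sample_type) {
  x <- data.frame(case_id = case_id)
  x$samples <- list(data.frame(sample_id = sample_id, sample_type = sample_type))
  x
}

# Toy stand-in for ov_star$cases, named by (placeholder) file IDs.
toy_cases <- list(
  "file-1" = make_case("case-a", "sample-1", "Primary Tumor"),
  "file-2" = make_case("case-b", "sample-2", "Recurrent Tumor"),
  "file-3" = make_case("case-c", c("sample-3", "sample-4"),
                       c("Primary Tumor", "Solid Tissue Normal"))
)

# One row per sample, keeping the file ID and case ID alongside it.
flat <- do.call(rbind, lapply(names(toy_cases), function(fid) {
  x <- toy_cases[[fid]]
  data.frame(file_id = fid, case_id = x$case_id, x$samples[[1]], row.names = NULL)
}))

print(flat)
```

Keeping file_id in the table makes it possible to join sample types back to the downloaded count files later.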

Sample types.

table(ov_star_cases$sample_type)


  Primary Tumor Recurrent Tumor 
            421               8

Get additional case data for OV.

get_case_metadata <- function(proj){
  treatment_fields <- grep("treatment", available_fields("cases"), ignore.case = TRUE, value = TRUE)
  sample_fields <-  grep("samples.sample_", available_fields("cases"), ignore.case = TRUE, value = TRUE)
  
  cases() %>%
    filter(project.project_id == proj) %>% 
    GenomicDataCommons::select(
      c(
        default_fields('cases'),
        sample_fields,
        treatment_fields
      )
    ) %>%
    results_all()
}

ov_cases <- get_case_metadata("TCGA-OV")

str(ov_cases, max.level = 1)

List of 22
 $ id                     : chr [1:608] "cce34351-1700-405b-818f-a598f63a33e8" "cd49126a-ec15-43fa-9e43-3f7460d43f2b" "cd6e5d3d-1c86-40dd-9cb3-b2e2075dec56" "cddbac56-2861-46a5-98a3-df32ab69d5da" ...
 $ slide_ids              :List of 608
 $ submitter_slide_ids    :List of 608
 $ disease_type           : chr [1:608] "Cystic, Mucinous and Serous Neoplasms" "Cystic, Mucinous and Serous Neoplasms" "Cystic, Mucinous and Serous Neoplasms" "Cystic, Mucinous and Serous Neoplasms" ...
 $ analyte_ids            :List of 608
 $ submitter_id           : chr [1:608] "TCGA-31-1955" "TCGA-13-1504" "TCGA-24-1469" "TCGA-04-1353" ...
 $ submitter_analyte_ids  :List of 608
 $ aliquot_ids            :List of 608
 $ submitter_aliquot_ids  :List of 608
 $ diagnoses              :List of 608
 $ created_datetime       : logi [1:608] NA NA NA NA NA NA ...
 $ diagnosis_ids          :List of 608
 $ samples                :List of 608
 $ sample_ids             :List of 608
 $ submitter_sample_ids   :List of 608
 $ primary_site           : chr [1:608] "Ovary" "Ovary" "Ovary" "Ovary" ...
 $ submitter_diagnosis_ids:List of 608
 $ updated_datetime       : chr [1:608] "2019-08-16T15:20:09.988356-05:00" "2019-08-06T14:40:41.923992-05:00" "2019-08-06T14:41:05.270815-05:00" "2019-08-06T14:40:06.221317-05:00" ...
 $ case_id                : chr [1:608] "cce34351-1700-405b-818f-a598f63a33e8" "cd49126a-ec15-43fa-9e43-3f7460d43f2b" "cd6e5d3d-1c86-40dd-9cb3-b2e2075dec56" "cddbac56-2861-46a5-98a3-df32ab69d5da" ...
 $ state                  : chr [1:608] "released" "released" "released" "released" ...
 $ portion_ids            :List of 608
 $ submitter_portion_ids  :List of 608
 - attr(*, "row.names")= int [1:608] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "class")= chr [1:3] "GDCcasesResults" "GDCResults" "list"

Treatment type.

cases() %>%
  filter(project.project_id == 'TCGA-OV') %>% 
  filter(files.access == 'open') %>%
  facet("diagnoses.treatments.treatment_type") %>%
  aggregations()

$diagnoses.treatments.treatment_type
  doc_count                         key
1       587 pharmaceutical therapy, nos
2       587      radiation therapy, nos
3        21                    _missing

Check out the treatments.

str(ov_cases$diagnoses$`cce34351-1700-405b-818f-a598f63a33e8`$treatments)

List of 1
 $ :'data.frame':   2 obs. of  16 variables:
  ..$ treatment_intent_type     : logi [1:2] NA NA
  ..$ updated_datetime          : chr [1:2] "2019-07-31T16:17:41.335989-05:00" "2019-07-31T16:17:41.335989-05:00"
  ..$ treatment_id              : chr [1:2] "93662700-6cf0-567b-af2e-8289a49e319a" "dafb6206-fade-54bb-a976-97758882f343"
  ..$ submitter_id              : chr [1:2] "TCGA-31-1955_treatment" "TCGA-31-1955_treatment_1"
  ..$ treatment_type            : chr [1:2] "Radiation Therapy, NOS" "Pharmaceutical Therapy, NOS"
  ..$ state                     : chr [1:2] "released" "released"
  ..$ therapeutic_agents        : logi [1:2] NA NA
  ..$ treatment_or_therapy      : chr [1:2] "yes" "yes"
  ..$ created_datetime          : chr [1:2] NA "2019-04-28T09:28:03.174985-05:00"
  ..$ days_to_treatment_end     : logi [1:2] NA NA
  ..$ days_to_treatment_start   : logi [1:2] NA NA
  ..$ regimen_or_line_of_therapy: logi [1:2] NA NA
  ..$ treatment_effect          : logi [1:2] NA NA
  ..$ initial_disease_status    : logi [1:2] NA NA
  ..$ treatment_anatomic_site   : logi [1:2] NA NA
  ..$ treatment_outcome         : logi [1:2] NA NA

The therapeutic_agents field is NA for both treatments of this case, so there is no information on specific agents such as carboplatin or paclitaxel.
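A check like this could be scripted rather than read off the str() output. The sketch below uses a toy treatments data frame with a few of the columns shown above; the helper names has_agent_info and mentions_agent are hypothetical, and the second toy table with a filled-in agent is invented purely to show the positive case.

```r
# Toy treatments table with the same shape as the record above
# (only three of the sixteen columns are reproduced here).
toy_treatments <- data.frame(
  treatment_type = c("Radiation Therapy, NOS", "Pharmaceutical Therapy, NOS"),
  therapeutic_agents = c(NA, NA),
  treatment_or_therapy = c("yes", "yes")
)

# Hypothetical helper: TRUE if any treatment row names an agent.
has_agent_info <- function(treatments) {
  any(!is.na(treatments$therapeutic_agents))
}

# Hypothetical helper: TRUE if a given agent is mentioned (case-insensitive).
# grepl() returns FALSE for NA entries, so missing values never match.
mentions_agent <- function(treatments, agent) {
  any(grepl(agent, treatments$therapeutic_agents, ignore.case = TRUE))
}

# Invented counter-example in which an agent is recorded.
toy_with_agent <- toy_treatments
toy_with_agent$therapeutic_agents <- c(NA, "Carboplatin")

print(has_agent_info(toy_treatments))
print(mentions_agent(toy_with_agent, "carboplatin"))
```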

Pancreatic adenocarcinoma

Metadata for pancreatic adenocarcinoma.

paad_star <- get_star_metadata("TCGA-PAAD")

sapply(paad_star$cases, function(x) x$samples) |>
  do.call(rbind.data.frame, args = _) -> paad_star_cases

dim(paad_star_cases)

[1] 183   2

Sample types.

table(paad_star_cases$sample_type)


         Metastatic       Primary Tumor Solid Tissue Normal 
                  1                 178                   4

Treatment type.

cases() %>%
  filter(project.project_id == 'TCGA-PAAD') %>% 
  filter(files.access == 'open') %>%
  facet("diagnoses.treatments.treatment_type") %>%
  aggregations()

$diagnoses.treatments.treatment_type
  doc_count                         key
1       185 pharmaceutical therapy, nos
2       185      radiation therapy, nos

Esophageal carcinoma

Metadata for esophageal carcinoma.

esca_star <- get_star_metadata("TCGA-ESCA")

sapply(esca_star$cases, function(x) x$samples) |>
  do.call(rbind.data.frame, args = _) -> esca_star_cases

dim(esca_star_cases)

[1] 198   2

Sample types.

table(esca_star_cases$sample_type)


         Metastatic       Primary Tumor Solid Tissue Normal 
                  1                 184                  13

Treatment type.

cases() %>%
  filter(project.project_id == 'TCGA-ESCA') %>% 
  filter(files.access == 'open') %>%
  facet("diagnoses.treatments.treatment_type") %>%
  aggregations()

$diagnoses.treatments.treatment_type
  doc_count                         key
1       185 pharmaceutical therapy, nos
2       185      radiation therapy, nos

Head and Neck squamous cell carcinoma

Metadata for head and neck squamous cell carcinoma.

hnsc_star <- get_star_metadata("TCGA-HNSC")

sapply(hnsc_star$cases, function(x) x$samples) |>
  do.call(rbind.data.frame, args = _) -> hnsc_star_cases

dim(hnsc_star_cases)

[1] 566   2

Sample types.

table(hnsc_star_cases$sample_type)


         Metastatic       Primary Tumor Solid Tissue Normal 
                  2                 520                  44

Treatment type.

cases() %>%
  filter(project.project_id == 'TCGA-HNSC') %>% 
  filter(files.access == 'open') %>%
  facet("diagnoses.treatments.treatment_type") %>%
  aggregations()

$diagnoses.treatments.treatment_type
  doc_count                         key
1       528 pharmaceutical therapy, nos
2       528      radiation therapy, nos

Kidney renal clear cell carcinoma

Metadata for kidney renal clear cell carcinoma.

kirc_star <- get_star_metadata("TCGA-KIRC")

sapply(kirc_star$cases, function(x) x$samples) |>
  do.call(rbind.data.frame, args = _) -> kirc_star_cases

dim(kirc_star_cases)

[1] 614   2

Sample types.

table(kirc_star_cases$sample_type)


Additional - New Primary            Primary Tumor      Solid Tissue Normal 
                       1                      541                       72
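The per-project sample type counts are easier to compare side by side. As a sketch, the counts printed on this page for the five projects are typed in by hand below and cross-tabulated with xtabs(); the row totals should then agree with the row counts reported by dim() for each project. In a live analysis the table would be built from the *_star_cases data frames rather than typed in.

```r
# Sample type counts copied by hand from the table() outputs on this page.
counts <- data.frame(
  project = c(
    "TCGA-OV", "TCGA-OV",
    "TCGA-PAAD", "TCGA-PAAD", "TCGA-PAAD",
    "TCGA-ESCA", "TCGA-ESCA", "TCGA-ESCA",
    "TCGA-HNSC", "TCGA-HNSC", "TCGA-HNSC",
    "TCGA-KIRC", "TCGA-KIRC", "TCGA-KIRC"
  ),
  sample_type = c(
    "Primary Tumor", "Recurrent Tumor",
    "Metastatic", "Primary Tumor", "Solid Tissue Normal",
    "Metastatic", "Primary Tumor", "Solid Tissue Normal",
    "Metastatic", "Primary Tumor", "Solid Tissue Normal",
    "Additional - New Primary", "Primary Tumor", "Solid Tissue Normal"
  ),
  n = c(421, 8, 1, 178, 4, 1, 184, 13, 2, 520, 44, 1, 541, 72)
)

# Project-by-sample-type matrix; absent combinations are filled with 0.
tab <- xtabs(n ~ project + sample_type, data = counts)
totals <- rowSums(tab)

print(tab)
print(totals)
```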

Treatment type.

cases() %>%
  filter(project.project_id == 'TCGA-KIRC') %>% 
  filter(files.access == 'open') %>%
  facet("diagnoses.treatments.treatment_type") %>%
  aggregations()

$diagnoses.treatments.treatment_type
  doc_count                         key
1       537 pharmaceutical therapy, nos
2       537      radiation therapy, nos

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicDataCommons_1.26.0 magrittr_2.0.3           
 [3] lubridate_1.9.3           forcats_1.0.0            
 [5] stringr_1.5.0             dplyr_1.1.3              
 [7] purrr_1.0.2               readr_2.1.4              
 [9] tidyr_1.3.0               tibble_3.2.1             
[11] ggplot2_3.4.4             tidyverse_2.0.0          
[13] workflowr_1.7.1          

loaded via a namespace (and not attached):
 [1] gtable_0.3.4            xfun_0.40               bslib_0.5.1            
 [4] htmlwidgets_1.6.2       processx_3.8.2          callr_3.7.3            
 [7] tzdb_0.4.0              crosstalk_1.2.0         vctrs_0.6.4            
[10] tools_4.3.2             ps_1.7.5                bitops_1.0-7           
[13] generics_0.1.3          curl_5.1.0              stats4_4.3.2           
[16] fansi_1.0.5             pkgconfig_2.0.3         S4Vectors_0.40.1       
[19] lifecycle_1.0.3         GenomeInfoDbData_1.2.11 compiler_4.3.2         
[22] git2r_0.32.0            munsell_0.5.0           getPass_0.2-2          
[25] httpuv_1.6.12           GenomeInfoDb_1.38.0     htmltools_0.5.6.1      
[28] sass_0.4.7              RCurl_1.98-1.12         yaml_2.3.7             
[31] crayon_1.5.2            later_1.3.1             pillar_1.9.0           
[34] jquerylib_0.1.4         whisker_0.4.1           ellipsis_0.3.2         
[37] DT_0.30                 cachem_1.0.8            tidyselect_1.2.0       
[40] digest_0.6.33           stringi_1.7.12          rprojroot_2.0.3        
[43] fastmap_1.1.1           grid_4.3.2              colorspace_2.1-0       
[46] cli_3.6.1               utf8_1.2.4              withr_2.5.2            
[49] rappdirs_0.3.3          scales_1.2.1            promises_1.2.1         
[52] timechange_0.2.0        XVector_0.42.0          rmarkdown_2.25         
[55] httr_1.4.7              hms_1.1.3               evaluate_0.22          
[58] knitr_1.45              GenomicRanges_1.54.1    IRanges_2.36.0         
[61] rlang_1.1.1             Rcpp_1.0.11             glue_1.6.2             
[64] xml2_1.3.5              BiocGenerics_0.48.0     rstudioapi_0.15.0      
[67] jsonlite_1.8.7          R6_2.5.1                zlibbioc_1.48.0        
[70] fs_1.6.3