Last updated: 2023-07-21

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version e5a16af. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.3.0/

Untracked files:
    Untracked:  analysis/cell_ranger.Rmd
    Untracked:  analysis/tss_xgboost.Rmd
    Untracked:  code/multiz100way/
    Untracked:  data/HG00702_SH089_CHSTrio.chr1.vcf.gz
    Untracked:  data/HG00702_SH089_CHSTrio.chr1.vcf.gz.tbi
    Untracked:  data/ncrna_NONCODE[v3.0].fasta.tar.gz
    Untracked:  data/ncrna_noncode_v3.fa
    Untracked:  data/netmhciipan.out.gz
    Untracked:  data/test
    Untracked:  export/davetang039sblog.WordPress.2023-06-30.xml
    Untracked:  export/output/
    Untracked:  women.json

Unstaged changes:
    Modified:   analysis/graph.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/hdf5.Rmd) and HTML (docs/hdf5.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd e5a16af Dave Tang 2023-07-21 HDF5

Introduction

From Wikipedia:

Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. Originally developed at the U.S. National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.

In keeping with this goal, the HDF libraries and associated tools are available under a liberal, BSD-like license for general use. HDF is supported by many commercial and non-commercial software platforms and programming languages. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView).

The current version, HDF5, differs significantly in design and API from the major legacy version HDF4.

Data

Download ARCHS4 HDF5 data and load using the hdf5r package.

human <- H5File$new(filename = "/data/archs4/human_gene_v2.2.h5", mode = "r")
human
Class: H5File
Filename: /data/archs4/human_gene_v2.2.h5
Access type: H5F_ACC_RDONLY
Listing:
 name  obj_type dataset.dims dataset.type_class
 data H5I_GROUP         <NA>               <NA>
 meta H5I_GROUP         <NA>               <NA>

Different groups.

human[["data"]]
Class: H5Group
Filename: /data/archs4/human_gene_v2.2.h5
Group: /data
Listing:
       name    obj_type   dataset.dims dataset.type_class
 expression H5I_DATASET 722425 x 67186        H5T_INTEGER
human[["meta"]]
Class: H5Group
Filename: /data/archs4/human_gene_v2.2.h5
Group: /meta
Listing:
    name  obj_type dataset.dims dataset.type_class
   genes H5I_GROUP         <NA>               <NA>
    info H5I_GROUP         <NA>               <NA>
 samples H5I_GROUP         <NA>               <NA>

Getting information about objects

Use the names() function to get all names of objects.

names(human)
[1] "data" "meta"

The ls() function provides more information.

human$ls()
  name     link.type  obj_type num_attrs group.nlinks group.mounted
1 data H5L_TYPE_HARD H5I_GROUP         0            1             0
2 meta H5L_TYPE_HARD H5I_GROUP         0            3             0
  dataset.rank dataset.dims dataset.maxdims dataset.type_class
1           NA         <NA>            <NA>               <NA>
2           NA         <NA>            <NA>               <NA>
  dataset.space_class committed_type
1                <NA>           <NA>
2                <NA>           <NA>

List functions.

hdf5r::list.attributes(human)
character(0)
hdf5r::list.datasets(human)
 [1] "data/expression"                    "meta/genes/biotype"                
 [3] "meta/genes/ensembl_gene_id"         "meta/genes/symbol"                 
 [5] "meta/info/author"                   "meta/info/contact"                 
 [7] "meta/info/creation-date"            "meta/info/laboratory"              
 [9] "meta/info/version"                  "meta/samples/channel_count"        
[11] "meta/samples/characteristics_ch1"   "meta/samples/contact_address"      
[13] "meta/samples/contact_city"          "meta/samples/contact_country"      
[15] "meta/samples/contact_institute"     "meta/samples/contact_name"         
[17] "meta/samples/contact_zip"           "meta/samples/data_processing"      
[19] "meta/samples/extract_protocol_ch1"  "meta/samples/geo_accession"        
[21] "meta/samples/instrument_model"      "meta/samples/last_update_date"     
[23] "meta/samples/library_selection"     "meta/samples/library_source"       
[25] "meta/samples/library_strategy"      "meta/samples/molecule_ch1"         
[27] "meta/samples/organism_ch1"          "meta/samples/platform_id"          
[29] "meta/samples/readsaligned"          "meta/samples/relation"             
[31] "meta/samples/sample"                "meta/samples/series_id"            
[33] "meta/samples/singlecellprobability" "meta/samples/source_name_ch1"      
[35] "meta/samples/status"                "meta/samples/submission_date"      
[37] "meta/samples/taxid_ch1"             "meta/samples/title"                
[39] "meta/samples/type"                 
hdf5r::list.groups(human)
[1] "data"         "meta"         "meta/genes"   "meta/info"    "meta/samples"
hdf5r::list.objects(human)
 [1] "data"                               "data/expression"                   
 [3] "meta"                               "meta/genes"                        
 [5] "meta/genes/biotype"                 "meta/genes/ensembl_gene_id"        
 [7] "meta/genes/symbol"                  "meta/info"                         
 [9] "meta/info/author"                   "meta/info/contact"                 
[11] "meta/info/creation-date"            "meta/info/laboratory"              
[13] "meta/info/version"                  "meta/samples"                      
[15] "meta/samples/channel_count"         "meta/samples/characteristics_ch1"  
[17] "meta/samples/contact_address"       "meta/samples/contact_city"         
[19] "meta/samples/contact_country"       "meta/samples/contact_institute"    
[21] "meta/samples/contact_name"          "meta/samples/contact_zip"          
[23] "meta/samples/data_processing"       "meta/samples/extract_protocol_ch1" 
[25] "meta/samples/geo_accession"         "meta/samples/instrument_model"     
[27] "meta/samples/last_update_date"      "meta/samples/library_selection"    
[29] "meta/samples/library_source"        "meta/samples/library_strategy"     
[31] "meta/samples/molecule_ch1"          "meta/samples/organism_ch1"         
[33] "meta/samples/platform_id"           "meta/samples/readsaligned"         
[35] "meta/samples/relation"              "meta/samples/sample"               
[37] "meta/samples/series_id"             "meta/samples/singlecellprobability"
[39] "meta/samples/source_name_ch1"       "meta/samples/status"               
[41] "meta/samples/submission_date"       "meta/samples/taxid_ch1"            
[43] "meta/samples/title"                 "meta/samples/type"                 

Get data

Function to get data.

wrap <- function(x){
  x <- sub("^", "[['", x)
  sub("$", "']]", x)
}

get_data <- function(obj, ds, ...){
  s <- unlist(strsplit(x = ds, split = "/"))
  txt <- paste0(obj, paste(sapply(s, wrap), collapse = "", sep = ""), ...)
  eval(parse(text = txt))
}

Check out some of the data.

get_data('human', 'meta/samples/title', '[1:5]')
[1] "OCI-LY1_48hrs_mRNAseq_3x_siNT_R1"   "OCI-LY1_48hrs_mRNAseq_3x_siNT_R2"  
[3] "OCI-LY1_48hrs_mRNAseq_3x_siNT_R3"   "OCI-LY1_48hrs_mRNAseq_3x_siBCL6_R1"
[5] "OCI-LY1_48hrs_mRNAseq_3x_siBCL6_R2"
get_data('human', 'meta/samples/sample', '[1:5]')
[1] "GSM1000981" "GSM1000982" "GSM1000983" "GSM1000984" "GSM1000985"
get_data('human', 'meta/samples/geo_accession', '[1:5]')
[1] "GSM1000981" "GSM1000982" "GSM1000983" "GSM1000984" "GSM1000985"
get_data('human', 'meta/samples/series_id', '[1:5]')
[1] "GSE29282" "GSE29282" "GSE29282" "GSE29282" "GSE29282"
get_data('human', 'meta/samples/characteristics_ch1', '[1:5]')
[1] "treatment: siNT,cell line: OCI-LY1"  
[2] "treatment: siNT,cell line: OCI-LY1"  
[3] "treatment: siNT,cell line: OCI-LY1"  
[4] "treatment: siBCL6,cell line: OCI-LY1"
[5] "treatment: siBCL6,cell line: OCI-LY1"
get_data('human', 'meta/samples/extract_protocol_ch1', '[1]')
[1] "ChIP-seq libraries were prepared using the Illumina ChIP-seq Library preparation Kit following the manufactureâ\u0080\u0099s instructions with minor modifications. Briefly 10ng of purified ChIP DNA was end repaired by conversion of overhangs to phosphorylated blunt ends. Aâ\u0080\u0099 bases were added to the 3â\u0080\u0099ends of the DNA fragments and Illumina adapters (1:30 dilution) were ligated to the ends of ChIP fragments. After adapter ligation DNA was separated by electrophoresis and size selected by isolating a gel band of 250±25bp. This fragment range corresponds to a ChIP fragment range of about 158±25bp. Size selected fragments were PCR amplified for 15cycles using Illumina genomic DNA primers 1.1 and 1.2 with the following program (30s at 98oC, 15cycles of 10 at 98oC, 30s at 65oC,30s at 72oC and 5min extension at 75oC). Libraries were quantified and validated using Agilent Technologies 2100 Bioanalyser for size, concentration and purity. Q-PCR was repeated to confirm retention of relative enrichment.                               RNAseq: Three ug of total RNA was isolated from OCI-Ly1 cells transfected using Nucleofector 96-well Shuttle system (Lonza) with siBCL6 (HSS100968) or siNT (46-2001) (Stealth RNAi, Invitrogen) at 24hrs and 48hrs after nucleofection. RNAeasy Plus Kit (Qiagen) that included a gDNA elimination step was used for RNA isolation. RNA concentration and purity were determined using Nanodrop (Thermo Scientific) and integrity was verified using Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Libraries were generated using mRNA-seq sample prep kit by Illumina. Briefly, mRNA was selected by two rounds of purification using magnetic polydT beads and then fragmented. First strand synthesis was performed using random oligos and SupersciptIII (Invitrogen). After second strand synthesis a 200bp paired-end library was prepared following the Illumina paired-end library preparation protocol."
get_data('human', 'meta/samples/source_name_ch1', '[1:5]')
[1] "Human DLBCL cel line" "Human DLBCL cel line" "Human DLBCL cel line"
[4] "Human DLBCL cel line" "Human DLBCL cel line"

Store gene symbols

Gene symbols.

human[['meta']][['genes']][['symbol']]
Class: H5D
Dataset: /meta/genes/symbol
Filename: /data/archs4/human_gene_v2.2.h5
Access type: H5F_ACC_RDONLY
Datatype: H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
Space: Type=Simple     Dims=67186     Maxdims=67186
Chunk: 2100

Store gene symbols.

my_idx <- 1:human[['meta']][['genes']][['symbol']]$maxdims
my_gene_sym <- human[['meta']][['genes']][['symbol']][my_idx]
head(my_gene_sym)
[1] "TSPAN6"   "TNMD"     "DPM1"     "SCYL3"    "C1orf112" "FGR"     

Source

Check out some sources.

unique(get_data('human', 'meta/samples/source_name_ch1', '[1:100]'))
 [1] "Human DLBCL cel line"                          
 [2] "placenta"                                      
 [3] "HeLa"                                          
 [4] "RNA-seq H1-hESC-48hr-Activin"                  
 [5] "Primary Human Umbilical Vein Endothelial Cells"
 [6] "Myeloid Leukemia"                              
 [7] "HT29 treated with 0 μM of 5-Aza"              
 [8] "HT29 treated with 5 μM of 5-Aza"              
 [9] "HT29 treated with 10 μM of 5-Aza"             
[10] "corneal endothelial cells"                     
[11] "fibroblast"                                    
[12] "iPS"                                           
[13] "Immortalized primary fibroblasts"              
[14] "NTera2/D1 cells"                               
[15] "gastric adenocarcinoma cells"                  
[16] "CNE1-MOCK_RNA-seq"                             
[17] "CNE1-BART1_RNA-seq"                            
[18] "CNE1-BART3_RNA-seq"                            
[19] "CNE1-BART7_RNA-seq"                            
[20] "iPSC"                                          
[21] "NPC"                                           
[22] "cells in culture"                              
[23] "mammary tumor xenograft"                       
[24] "Bone Marrow"                                   
[25] "Foreskin fibroblasts"                          
[26] "Human Fib X Human Fib Homokaryons"             
[27] "Fertile mature sperm"                          
[28] "bone marrow mononuclear cells"                 
[29] "HEK293"                                        

Store all sources.

my_idx <- 1:human[['meta']][['samples']][['source_name_ch1']]$maxdims
my_source <- human[['meta']][['samples']][['source_name_ch1']][my_idx]
head(my_source)
[1] "Human DLBCL cel line" "Human DLBCL cel line" "Human DLBCL cel line"
[4] "Human DLBCL cel line" "Human DLBCL cel line" "Human DLBCL cel line"

Search.

length(my_source[grepl("immune", my_source, ignore.case = TRUE)])
[1] 115
length(my_source[grepl("dendritic", my_source, ignore.case = TRUE)])
[1] 1631

Expression

Genes are columns and rows are samples.

exp_data <- human[['data']][['expression']]
exp_data
Class: H5D
Dataset: /data/expression
Filename: /data/archs4/human_gene_v2.2.h5
Access type: H5F_ACC_RDONLY
Datatype: H5T_STD_U32LE
Space: Type=Simple     Dims=722425 x 67186     Maxdims=Inf x 67186
Chunk: 1 x 2000

Sum across first ten samples.

apply(exp_data[1:10, ], 1, sum)
 [1] 102963402  85980162 109810687 105304785  97274296  83557921    468037
 [8]    519638    599381    621721

TP53 position.

tp53_idx <- match("TP53", my_gene_sym)
tp53_idx
[1] 8139

Clean up

Close.

human$close_all()