How to use KeyGenes

1. Provided “fixed” training sets

KeyGenes has been developed and tested using NGS-derived human fetal transcriptional datasets as both training and test sets (Roost et al., 2015). The datasets used were expanded with available datasets on human adult tissues/organs (Cnop et al., 2014; Fagerberg et al., 2014; Illumina Body Map 2.0).

NGS-derived data for more human adult organs/tissues are available (Fagerberg et al., 2014; Illumina Bodymap 2.0; Epigenomic Roadmap; ENCODE) and could be incorporated, because the datasets used as training sets are flexible and expandable.
We provide seven basic “fixed” training sets, which are intended to give users a head start in assigning both identity and developmental stage to their differentiated cells (test set). How to create and use a “flexible” training set is described in section 4.4.

1.1. Basic training sets

For a first and general assessment and before using a “flexible” training set, the following “fixed” training sets are recommended to determine identity:

1. Fetal
The training set “fetal” contains transcriptional signatures of 21x fetal organs/tissues and the maternal endometrium.

2. Fetal wo
The training set “fetal wo” contains transcriptional signatures of 17x fetal organs, excluding the extraembryonic organs/tissues and the maternal endometrium. Male and female gonads are taken together as gonad.

1.2. Staging training sets

For a basic assignment of a developmental stage to differentiated derivatives of pluripotent stem cells (PSCs), we provide the “fetal wo” training set separated into first and second trimester, as well as an adult training set. By comparing the outcomes obtained with the three training sets, you can determine whether your sample is closer to a particular organ in the 1st trimester, 2nd trimester or adult:

3. Fetal wo 1T
The training set “fetal wo 1T” contains transcriptional signatures of 13x 1st trimester fetal organs, excluding the extraembryonic organs/tissues and the maternal endometrium.

4. Fetal wo 2T
The training set “fetal wo 2T” contains transcriptional signatures of 16x 2nd trimester fetal organs, excluding the extraembryonic organs/tissues and the maternal endometrium. Male and female gonads are taken together as gonad.

5. Adult
The training set “adult” contains transcriptional signatures of 11x adult organs, excluding the extraembryonic organs/tissues and the cervix (Fagerberg et al., 2014; Illumina Bodymap 2.0). Ovaries and testes are taken together as gonads. For the heart samples, no distinction is made between atria and ventricles.
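
One simple way to read the three outcomes side by side is sketched below. The identity scores are invented for illustration, and treating the highest score as the closest stage is an assumption about how the outcomes are compared, not a rule stated by KeyGenes.

```r
# Hypothetical identity scores for one sample and one organ (here: heart),
# read off the outputs obtained with each of the three staging training sets.
heart_scores <- c("fetal wo 1T" = 0.12, "fetal wo 2T" = 0.71, "adult" = 0.09)

# Assumption: the training set giving the highest score is the closest match.
closest_stage <- names(which.max(heart_scores))
print(closest_stage)
```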

1.3. Other training sets

6. Fetal wo islets
This training set includes the “fetal wo” training set expanded with five adult islet of Langerhans samples (Cnop et al., 2014). KeyGenes will split up the identity scores for pancreas into pancreas and adult islet. This training set has been tested on PSC-derived endocrine cells (Roost et al., 2015).

7. Fetal wo islets 2
This training set includes the “fetal wo” training set expanded with five adult islet of Langerhans samples (Cnop et al., 2014). KeyGenes will split up the identity scores for pancreas into 1T, 2T and adult islet. This training set has been tested on PSC-derived endocrine cells (Roost et al., 2015).

2. Web App

2.1. Web App using the provided “fixed” training sets

The web app allows a prediction of NGS-derived data based on one of the “fixed” training sets described above in section 1.

To run the KeyGenes Web App, follow the steps below:

1. Fill in your email address in the first field, so you will get the results in your mailbox.

2. Choose the “fixed” training set of interest from the dropdown menu.

3. Upload the test set containing the raw reads.
The provided training sets are based on Ensembl Gene IDs. Therefore, the test set must be in a specific tab-separated text file with genes (Ensembl Gene IDs, no duplicates) as rows and the queried samples as columns.
The queried samples should be labelled as follows: sample_additional information (e.g. stomach_adult patient 1). To get a meaningful prediction of the queried samples, the tissue/cell type of interest must be represented in the training set.
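
A minimal sketch of how a file in this format could be written and checked in R. The counts, the sample names and the file name are placeholders, and the header layout (an empty cell above the gene ID column) is an assumption rather than a documented requirement of the Web App.

```r
# Hypothetical raw read counts: rows are Ensembl Gene IDs, columns are samples.
counts <- matrix(
  c(120L, 0L, 35L,
    98L, 4L, 41L),
  nrow = 3,
  dimnames = list(
    c("ENSG00000141510", "ENSG00000012048", "ENSG00000139618"),
    c("stomach_adult patient 1", "stomach_adult patient 2")
  )
)

# The format requires unique gene IDs.
stopifnot(!anyDuplicated(rownames(counts)))

# Write a tab-separated text file; col.names = NA adds an empty header cell
# above the gene ID column so the header lines up with the data rows.
test_file <- file.path(tempdir(), "YourTestSet.txt")
write.table(counts, file = test_file, sep = "\t", quote = FALSE, col.names = NA)

# Read the file back to confirm that IDs and sample labels survive unchanged.
check <- read.delim(test_file, row.names = 1, check.names = FALSE)
```

Setting check.names = FALSE keeps the spaces in the sample labels when the file is read back.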

4. Press the submit button and check your mailbox for the results.

2.2. Web App using “flexible” training sets

We are currently working on the Web App to allow the use of flexible training sets.

3. The R-package

KeyGenes is made available as an R-package.

3.1 Installation

You can install the R-package by running the following command in R:

devtools::install_github("DavyCats/KeyGenes", ref = "master")

3.2 Usage

A vignette explaining the usage of the R-package can be found here: https://davycats.github.io/KeyGenes/v0.1.html

The documentation for the original (and obsolete) R scripts is given in section 4 below.

4. The original R Scripts

The R scripts allow a prediction of NGS- or microarray-derived data based on one of the provided training sets described in section 1 or based on a training set defined by the user.

4.1. Requirements

The KeyGenes scripts are written in R, so R (http://www.r-project.org/) is required. To run KeyGenes properly, the following packages are needed:

• limma
• ggplot2
• gplots
• glmnet (Version 1.9-8)

Please note that this particular version of glmnet (Version 1.9-8) must be used (Friedman et al., 2010). Compressed files of the four packages can be downloaded from this website.
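
Because the version matters, it may help to check it before running the scripts. The sketch below is one way to do that; the helper name glmnet_version_ok and the archive file name in the final comment are illustrative assumptions.

```r
# Version required by the original scripts, as stated above.
required_version <- package_version("1.9-8")

# TRUE if the given version matches the required glmnet version.
glmnet_version_ok <- function(installed) {
  package_version(as.character(installed)) == required_version
}

# Warn if glmnet is missing or is a different version.
if (!requireNamespace("glmnet", quietly = TRUE)) {
  warning("glmnet is not installed")
} else if (!glmnet_version_ok(packageVersion("glmnet"))) {
  warning("glmnet ", as.character(packageVersion("glmnet")),
          " found, but version 1.9-8 is required")
}

# A downloaded source archive could then be installed with, for example
# (hypothetical file name):
# install.packages("glmnet_1.9-8.tar.gz", repos = NULL, type = "source")
```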

The provided training sets are based on Ensembl Gene IDs. Therefore, test sets must be in a specific tab-separated text file with genes (Ensembl Gene IDs, no duplicates) as rows and the queried samples as columns.
The queried samples should be labelled as follows: sample_additional information (e.g. stomach_adult patient 1). To get a meaningful prediction of the queried samples, the organ/tissue/cell type of interest must be represented in the training set.

4.2. Using the provided “fixed” training sets for NGS-derived data (Script 2)

KeyGenes is designed to be used on human next-generation sequencing (NGS) data (Script 2, 2_KeyGenes_NGS_1.0). The provided human training sets with their associated files (foldid, top500) described above can be downloaded from this website.
To run the KeyGenes script, follow the steps below:

1. Load the packages (see section “4.1. Requirements”).

## Load Packages ##

library(limma)
library(gplots)
library(ggplot2)
library(glmnet) # Version: 1.9-8

2. Set the working directory to where the files are located.

## Working Directory ##

working_dir <- "YourWorkingDirectory"

3. Specify your training set. Each training set has an associated foldid and top500 text file.

## Training Set ##

training <- "training_fetal.txt"
foldid <- "foldid_fetal.txt"
top500 <- "top500_fetal.txt"

4. Specify the test set containing raw reads (see section “4.1. Requirements” for information on the format of the test set).

## Test Set ##

test <- "YourTestSet.txt"

5. Label the output files as you wish (see file “What is KeyGenes” for information on the output).

## Output ##

heatmap <- "KeyGenes_Heatmap.pdf"
matrix <- "KeyGenes_Matrix.pdf"
prediction <- "KeyGenes_Prediction.pdf"
classifier <- "KeyGenes_Classifier.pdf"

6. Once all the arguments are loaded (steps 1-5), run the rest of the script in one go.
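
Before running the rest of the script, it can be useful to confirm that every input file named in steps 3 and 4 is present in the working directory. The helper missing_inputs below is a hypothetical addition, not part of the KeyGenes scripts, and the temporary directory only stands in for a real working directory.

```r
# Return the names of input files that are missing from a directory.
missing_inputs <- function(dir, files) {
  files[!file.exists(file.path(dir, files))]
}

# Demonstration: a temporary directory in which only two of the files exist.
working_dir <- file.path(tempdir(), "keygenes_demo")
dir.create(working_dir, showWarnings = FALSE)
file.create(file.path(working_dir, c("training_fetal.txt", "foldid_fetal.txt")))

inputs <- c(training = "training_fetal.txt",
            foldid   = "foldid_fetal.txt",
            top500   = "top500_fetal.txt",
            test     = "YourTestSet.txt")

absent <- missing_inputs(working_dir, inputs)
if (length(absent) > 0) {
  message("Missing input files: ", paste(absent, collapse = ", "))
}
```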

4.3. Using the provided “fixed” training sets for microarray-derived data (Script 3)

KeyGenes can be used to predict microarray-derived data of human tissue samples using the fetal training set (Roost et al., 2015; Script 3, 3_KeyGenes_MA_1.0). However, only data derived from Affymetrix and Illumina platforms have been tested. Furthermore, the other provided training sets have not been tested on microarray-derived data of differentiated derivatives of human pluripotent stem cells.
For microarray data of differentiated derivatives of pluripotent stem cells, it is also recommended to use http://cellnet.hms.harvard.edu/.
The only difference from Script 2 is that in the section “Training Set”, an additional argument (housekeeper) must be loaded. This argument remains the same if the training set is changed. The file “housekeepers_microarray.txt” can be downloaded from this website.

1. Load the packages (see section “4.1. Requirements”).

## Load Packages ##

library(limma)
library(gplots)
library(ggplot2)
library(glmnet) # Version: 1.9-8

2. Set the working directory to where the files are located.

## Working Directory ##

working_dir <- "YourWorkingDirectory"

3. Specify your training set. Each training set has an associated foldid and top500 text file. The file “housekeepers_microarray.txt” can be downloaded as described above.

## Training Set ##

training <- "training_fetal.txt"
foldid <- "foldid_fetal.txt"
top500 <- "top500_fetal.txt"
housekeeper <- "housekeepers_microarray.txt"

4. Specify the test set containing raw reads (see section “4.1. Requirements” for information on the format of the test set).

## Test Set ##

test <- "YourTestSet.txt"

5. Label the output files as you wish (see file “What is KeyGenes” for information on the output).

## Output ##

heatmap <- "KeyGenes_Heatmap.pdf"
matrix <- "KeyGenes_Matrix.pdf"
prediction <- "KeyGenes_Prediction.pdf"
classifier <- "KeyGenes_Classifier.pdf"

6. Once all the arguments are loaded (steps 1-5), run the rest of the script in one go.

4.4. How to use a “flexible” training set (Scripts 1, 2 and 3)

To get the most meaningful prediction for a particular experiment, KeyGenes allows you to use other training sets or to modify existing ones. To do so, the foldid and top500 text files of the new training set must be generated with Script 1 (1_KeyGenes_top500_1.0).
There are four important things to consider:
• There need to be at least two samples per “classification”.
• The training set should be based on Ensembl Gene IDs. Therefore, the training set must be in a tab-separated text file with genes (Ensembl Gene IDs, no duplicates) as rows and the queried samples as columns.
• The samples should be labelled as follows: sample_additional information (e.g. stomach_adult patient 1). Samples of the same classification must have the same “sample” name (e.g. stomach_1 and stomach_2).
• If the training set is assembled of data from different sources, the selection of the top500 is crucial.
Theoretically, it is possible to use other gene IDs than Ensembl, and even other species, for training sets. However, this will only work for NGS-derived data (Script 2, 2_KeyGenes_NGS_1.0).
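
The first three considerations can be checked mechanically before running Script 1. The sketch below assumes that the classification is the part of the sample label before the first underscore; the helper names sample_class and check_training_set are hypothetical and not part of the KeyGenes scripts.

```r
# Classification = the part of a sample label before the first underscore,
# e.g. "stomach_adult patient 1" -> "stomach" (assumed labelling rule).
sample_class <- function(labels) sub("_.*$", "", labels)

# Collect problems with a candidate training set (genes as rows, samples as columns).
check_training_set <- function(expr) {
  problems <- character(0)
  if (anyDuplicated(rownames(expr)) > 0) {
    problems <- c(problems, "duplicate gene IDs")
  }
  n_per_class <- table(sample_class(colnames(expr)))
  too_few <- names(n_per_class)[n_per_class < 2]
  if (length(too_few) > 0) {
    problems <- c(problems,
                  paste("fewer than two samples for:", paste(too_few, collapse = ", ")))
  }
  problems
}

# Hypothetical example: two stomach samples but only one liver sample.
expr <- matrix(1:9, nrow = 3,
               dimnames = list(c("ENSG00000141510", "ENSG00000012048", "ENSG00000139618"),
                               c("stomach_1", "stomach_2", "liver_1")))
print(check_training_set(expr))
```

An empty result means that none of the checked problems were found; it does not assess the selection of the top500, which still needs attention when data from different sources are combined.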

The two text files generated with Script 1 (1_KeyGenes_top500_1.0) are then loaded into Script 2 (2_KeyGenes_NGS_1.0) or Script 3 (3_KeyGenes_MA_1.0), together with the corresponding training set.

1. Load the packages (see section “4.1. Requirements”).

## Load Packages ##

library(limma)
library(glmnet) # Version: 1.9-8

2. Set the working directory to where the files are located.

## Working Directory ##

working_dir <- "YourWorkingDirectory"

3. Specify your own training set.

## Dataset ##

dataset <- "YourDataset.txt"

4. Label the output.

## Output ##

top500 <- "top500.txt"
foldid <- "foldid.txt"

5. Once all the arguments are loaded (steps 1-4), run the rest of the script in one go.

References:

Cnop, M., Abdulkarim, B., Bottu, G., Cunha, D.A., Igoillo-Esteve, M., Masini, M., Turatsinze, J.V., Griebel, T., Villate, O., Santin, I., et al. (2014). RNA sequencing identifies dysregulation of the human pancreatic islet transcriptome by the saturated fatty acid palmitate. Diabetes. 63(6): 1978-1993.
Fagerberg, L., Hallstrom, B.M., Oksvold, P., Kampf, C., Djureinovic, D., Odeberg, J., Habuka, M., Tahmasebpoor, S., Danielsson, A., Edlund, K., et al. (2014). Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Molecular & cellular proteomics: MCP. 13(2): 397-406.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software. 33(1): 1-22.
Roost, M.S., van Iperen, L., Ariyurek, Y., Buermans, H.P., Arindrarto, W., Devalla, H.D., Passier, R., Mummery, C.L., Carlotti, F., de Koning, E.J.P., van Zwet, E.W., Goeman, J.J., and Chuva de Sousa Lopes, S.M. (2015). KeyGenes, a tool to probe tissue differentiation using a human fetal transcriptional atlas. Stem Cell Reports. 4(6): 1112-1124.