Bioinformatics Scientist Interview Questions

236 bioinformatics scientist interview questions shared by candidates

Given a string over the four-letter alphabet (ACGT), how would you count all the 6-mers that occur within the string? What is a heap? What are your hobbies? What would be the size of a feature vector used as input to a 3D facial recognition algorithm?

Bioinformatics Scientist

Interviewed at Human Longevity

2.8
Jun 26, 2015
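
One possible answer to the 6-mer question above is a sliding window over the string combined with a hash-based counter. The sketch below is only an illustration, not the interviewer's expected solution: the function name `count_kmers` is hypothetical, and it assumes that "count" may mean either the number of distinct 6-mers or the number of occurrences of each one (both are available from the same result). With four letters there are at most 4^6 = 4096 distinct 6-mers, so the counter stays small regardless of input length.

```python
from collections import Counter


def count_kmers(sequence: str, k: int = 6) -> Counter:
    """Count every overlapping k-mer in `sequence` that contains only A, C, G, T.

    Hypothetical helper written for illustration only.
    """
    sequence = sequence.upper()
    counts = Counter()
    # A string of length n has n - k + 1 overlapping windows of width k.
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Skip windows containing anything other than the four bases (e.g. N).
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


counts = count_kmers("ACGTACGTAC")
print(len(counts))            # number of distinct 6-mers observed
print(counts["ACGTAC"])       # occurrences of one specific 6-mer
print(sum(counts.values()))   # total number of valid 6-mer windows
```

The loop runs in O(n) window steps for a string of length n. For very long sequences, a common refinement (not shown) is to encode each base in 2 bits and update an integer key incrementally instead of slicing a new substring at every position.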

A strange disease has spread across the land. Many people seem to be affected in a way that is not yet understood: when they are in daylight, odd-looking marks that resemble burning tissue appear on their skin. A drug company trying to understand this disease's mechanism of action sent data over to us. They took normal and lesion skin biopsies from healthy and diseased individuals, respectively, and performed whole-genome RNA-seq profiling in order to identify and understand the disease at the gene expression level.

Analysis workflow

1. Load the data into R and make sure the count and annotation data are consistent with each other.
2. Filter the count data for lowly expressed genes, using the strategy of your choice. For example: only keep genes with a CPM >= 1 in at least 75% of samples, in at least one of the groups.
3. Assign the library-size normalized log-CPM data to an object of a suitable data structure/class. Save it as a binary file (.rda or .rds).
4. Generate basic plots of your choice to investigate its main properties and comment on them (library sizes, expression distribution densities per sample, PCA colored per group, etc.).
5. Based on the previous plots, look for the presence of outlier/mislabeled samples in this dataset. Try to identify them and remove them from the downstream analysis.
6. Run a differential expression analysis to find genes whose expression is different in lesion vs. normal samples. This can be done, according to your preference, either on the count data or on the normalized log-CPM data, using an appropriate statistical method.
7. Generate a volcano plot (x-axis is the effect size and y-axis is the p-value) for this analysis. The 100 most significant genes should be colored.
8. Rewrite step 6 by wrapping it up into a single function that you implement and document:
   - arguments: the expression data, the sample annotations, and the name of the group variable
   - return value: a data.frame of statistics of differential expression

(bonus) Write a function that identifies the outlier(s) based on the expression data and group variable only.

Pointers

- Installing Bioconductor
- For a quick introduction to RNA-seq data in limma: limma user guide - Section 15
- Differential expression analysis:
  - with limma: limma user guide - Section 16
  - with DESeq2
- ExpressionSet class: Video introduction, Class description

Bioinformatics Scientist

Interviewed at CytoReason

4.2
Mar 31, 2022
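
The example filtering rule in the workflow above (keep genes with a CPM >= 1 in at least 75% of samples, in at least one of the groups) can be made concrete with a small sketch. The assignment itself asks for R, where edgeR's `cpm()` would typically handle the normalization; the plain-Python version below is only a language-neutral illustration of the logic. The function names, the dictionary layout, and the toy counts are all assumptions made for this example and are not part of the original task or its data.

```python
def cpm_matrix(counts, n_samples):
    """Convert raw counts to counts per million using each sample's library size.

    `counts` maps gene name -> list of raw counts, one per sample (hypothetical layout).
    """
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        for gene, row in counts.items()
    }


def filter_low_expression(counts, groups, min_cpm=1.0, min_fraction=0.75):
    """Keep a gene if, in at least one group, enough samples reach `min_cpm`."""
    cpms = cpm_matrix(counts, len(groups))
    kept = {}
    for gene, values in cpms.items():
        for group in set(groups):
            idx = [j for j, g in enumerate(groups) if g == group]
            n_pass = sum(values[j] >= min_cpm for j in idx)
            # The gene passes as soon as one group satisfies the rule.
            if n_pass >= min_fraction * len(idx):
                kept[gene] = counts[gene]
                break
    return kept


# Toy data (invented): 4 normal and 4 lesion samples, each with a library
# size of exactly one million reads so that CPM equals the raw count.
groups = ["normal"] * 4 + ["lesion"] * 4
gene_b = [0, 0, 0, 0, 5, 5, 5, 0]   # expressed in 3 of 4 lesion samples
gene_c = [0, 2, 0, 0, 0, 0, 2, 0]   # expressed in 1 of 4 samples per group
gene_a = [1_000_000 - b - c for b, c in zip(gene_b, gene_c)]  # filler gene
counts = {"geneA": gene_a, "geneB": gene_b, "geneC": gene_c}

kept = filter_low_expression(counts, groups)
print(sorted(kept))  # geneC is dropped: it never reaches 75% in either group
```

Applying the threshold per group rather than across all samples is what lets a gene that is switched on only in lesion samples survive the filter, which matters here because such genes are exactly what the downstream differential expression step is looking for.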


