Given a string over the four-letter alphabet (ACGT), how would you count all possible 6-mers that occur within the string? What is a heap? What are your hobbies? What would be the size of a feature vector used as an input for a 3D facial recognition algorithm?
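One possible way to approach the 6-mer question is a sliding window with a hash-based counter. The sketch below is illustrative only: the function name `count_kmers` is made up here, and it assumes that "count" means the number of occurrences of each overlapping 6-mer and that windows containing characters other than A, C, G, T should be skipped.

```python
from collections import Counter


def count_kmers(sequence: str, k: int = 6) -> Counter:
    """Count every overlapping window of length k in a DNA string."""
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    # A string of length n has n - k + 1 overlapping windows of length k.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: skip windows with ambiguous bases such as N.
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts


if __name__ == "__main__":
    counts = count_kmers("ACGTACGTAC")
    print(len(counts), "distinct 6-mers observed out of", 4 ** 6, "possible")
    for kmer, n in counts.most_common():
        print(kmer, n)
```

This runs in time proportional to the sequence length times k. Since there are only 4^6 = 4096 possible 6-mers, an array indexed by a 2-bit encoding of each base would also work and avoids hashing.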
Bioinformatics Scientist Interview Questions
236 bioinformatics scientist interview questions shared by candidates
How would you find the homologous sequences to a given sequence?
Describe your previous experience, etc.
A strange disease has spread across the land. Many people seem to be affected in a way that is not yet understood: when they are in daylight, odd-looking marks that resemble burned tissue appear on their skin. A drug company trying to understand this disease's mechanism of action sent data over to us. They took normal and lesion skin biopsies from healthy and diseased individuals respectively, and performed whole-genome RNA-seq profiling in order to identify and understand the disease at the gene expression level.

Analysis workflow
1. Load the data into R and make sure the count and annotation data are consistent with each other.
2. Filter the count data for lowly expressed genes, using the strategy of your choice. For example: only keep genes with a CPM >= 1 in at least 75% of samples, in at least one of the groups.
3. Assign the library-size normalized log-CPM data to an object of a suitable data structure/class. Save it as a binary file (.rda or .rds).
4. Generate basic plots of your choice to investigate its main properties and comment (library sizes, expression distribution densities per sample, PCA colored per group, etc.).
5. Based on the previous plots, look for the presence of outlier/mislabeled samples in this dataset. Try to identify and remove them from the downstream analysis.
6. Run a differential expression analysis to find genes whose expression is different in lesion vs. normal samples. This can be done according to your preference either on the count data or the normalized log-CPM data, using an appropriate statistical method.
7. Generate a volcano plot (x-axis is the effect size and y-axis is the p-value) for this analysis. The 100 most significant genes should be colored.
8. Re-write step 6 by wrapping it up into a single function that you implement and document:
   arguments: the expression data, the sample annotations and the name of the group variable
   return value: a data.frame of statistics of differential expression
9. (bonus) Write a function that identifies the outlier(s) based on the expression data and group variable only.

Pointers
- Installing Bioconductor
- For a quick introduction to RNA-seq data in limma: limma user guide, Section 15
- Differential expression analysis: with limma (limma user guide, Section 16); with DESeq2
- ExpressionSet class: video introduction; class description
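The example filtering rule in the workflow above (keep genes with CPM >= 1 in at least 75% of samples, in at least one of the groups) can be sketched as follows. The exercise itself asks for R; this plain-Python version only illustrates the logic. The function names `cpm` and `keep_expressed`, the list-of-rows matrix layout, and the toy counts are all assumptions made for illustration, and the sketch assumes every sample has a non-zero library size.

```python
def cpm(counts):
    """Convert a genes x samples count matrix (list of rows) to counts per million."""
    n_samples = len(counts[0])
    # Library size = total count per sample (column sum).
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[row[j] * 1e6 / lib_sizes[j] for j in range(n_samples)]
            for row in counts]


def keep_expressed(counts, groups, min_cpm=1.0, min_fraction=0.75):
    """Return one boolean per gene: True if CPM >= min_cpm in at least
    min_fraction of the samples of at least one group."""
    cpm_values = cpm(counts)
    # Map each group label to the column indices of its samples.
    by_group = {}
    for j, label in enumerate(groups):
        by_group.setdefault(label, []).append(j)
    keep = []
    for row in cpm_values:
        keep.append(any(
            sum(row[j] >= min_cpm for j in idx) >= min_fraction * len(idx)
            for idx in by_group.values()
        ))
    return keep


if __name__ == "__main__":
    # Toy data (hypothetical): 4 genes x 4 samples.
    toy_counts = [
        [1000000, 1000000, 1000000, 1000000],  # expressed everywhere
        [0, 0, 5, 7],                          # expressed in lesion only
        [0, 3, 0, 0],                          # expressed in a single sample
        [0, 0, 0, 0],                          # not expressed
    ]
    toy_groups = ["normal", "normal", "lesion", "lesion"]
    print(keep_expressed(toy_counts, toy_groups))
```

Evaluating the threshold per group rather than across all samples keeps genes that are switched on in only one condition, which are often the ones a lesion vs. normal comparison is looking for.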
How would you analyze human RNA-seq data?
Where do you see yourself in the next five years?
Questions about SQL and molecular biology.
General biology and drug discovery questions (depending on experience). HackerRank-style coding questions.
I did not expect to be asked about sorting algorithms.
All sorts of questions about binary search trees, dynamic programming, machine learning, and probabilistic modelling approaches (explain MCMC, EM, etc.).