Knowledge Interview Questions
What is Data Science?
discovering hidden patterns in raw data
software engineers use programming tools to build platforms for users
data scientists use the same tools to analyze datasets and draw conclusions from them
Supervised vs. Unsupervised Learning
Supervised:
input data is labeled
uses training dataset
used for prediction
enables classification and regression
Algorithms:
decision tree
logistic regression
support vector machine
Unsupervised:
input data is unlabeled
uses input data set
used for analysis
groups together items that look like they belong together
enables clustering, density estimation, and dimensionality reduction
Algorithms:
k-means clustering
hierarchical clustering
apriori algorithm
Logistic Regression
predicts binary outcome (2 outcomes) from linear combination of predictor variables
the sigmoid squashes the linear combination into an S-shaped curve:
it starts near 0 and ends near 1, so the output can be read as a probability
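A minimal scikit-learn sketch, assuming two invented predictor columns (age and a cholesterol-like value) and a binary target:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: two predictor variables, binary target (illustrative values only)
X = np.array([[25, 180], [47, 220], [52, 240], [33, 190], [60, 260], [41, 210]])
y = np.array([0, 1, 1, 0, 1, 0])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# predict_proba applies the sigmoid to the linear combination of predictors,
# squashing it into the (0, 1) range
print(model.predict_proba([[50, 230]]))  # [[P(y=0), P(y=1)]]
print(model.predict([[50, 230]]))        # predicted class label: 0 or 1
```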
Recommender Systems
Information filtering system that predicts preferences
Collaborative Filtering
recommend tracks played by other users with similar interests
Content-based Filtering
uses properties of song to recommend music with similar properties
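A rough sketch of the content-based idea, assuming invented per-song feature vectors and cosine similarity as the similarity measure:

```python
import numpy as np

# Hypothetical audio features per song: [tempo, energy, acousticness]
songs = {
    "song_a": np.array([0.8, 0.9, 0.1]),
    "song_b": np.array([0.7, 0.8, 0.2]),
    "song_c": np.array([0.2, 0.1, 0.9]),
}

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Recommend the song most similar to the one the user just played
played = songs["song_a"]
scores = {name: cosine_similarity(played, vec)
          for name, vec in songs.items() if name != "song_a"}
print(max(scores, key=scores.get))  # song_b (closest in feature space)
```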
Descriptive Statistical Analysis Techniques
Univariate
describe data
you can evaluate mean, mode, range
Bivariate - e.g. scatterplot
shows the relationship between two variables
correlation can be positive or negative
Multivariate
describes relationships among three or more variables
Normal Distribution
Bell shaped curve
symmetrical
no skew to the left or right
Linear Regression
X - predictor variable (independent)
Y - criterion variable (dependent)
Finding RMSE and MSE (measures of accuracy for linear regression)
RMSE (Root Mean Square Error)
RMSE = sqrt((1/N) * sum of (predicted - actual)^2)
MSE (Mean Square Error / Average Square)
MSE = (1/N) * sum of (predicted - actual)^2, where N = total number of observations
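A quick sketch of both formulas with NumPy, using invented predicted/actual values:

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.5, 10.0])       # illustrative values
predicted = np.array([2.8, 5.4, 7.0, 10.6])

mse = np.mean((predicted - actual) ** 2)       # (1/N) * sum of squared errors
rmse = np.sqrt(mse)                            # root of the MSE
print(mse, rmse)
```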
Interpolation
estimating a value that lies between two known values in a list of values
Extrapolation
approximating value by extending known data
Decision Tree
Take entire dataset as input
Calculate entropy of target variables and predictor attributes
the more different the objects in the dataset, the more chaotic it is - higher entropy
Calculate information gain of all attributes
information gain measures how much a split on an attribute reduces that entropy
Choose attribute with highest info gain as root node
whichever split lowers the chaos the most becomes the root node
Repeat until each decision node is finalized
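A minimal sketch of the entropy and information-gain computation, using an invented binary target and one hypothetical candidate split:

```python
import numpy as np

def entropy(labels):
    """Entropy of a list of binary class labels."""
    labels = np.asarray(labels)
    probs = np.bincount(labels) / len(labels)
    probs = probs[probs > 0]                      # avoid log2(0)
    return -np.sum(probs * np.log2(probs))

def information_gain(labels, groups):
    """Entropy reduction from splitting `labels` into `groups`."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Toy target variable and one candidate split (invented values)
target = [1, 1, 0, 0, 1, 0]
split = ([1, 1, 1], [0, 0, 0])                    # a perfect split: gain = 1.0
print(information_gain(target, split))
```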
Random Forest Model
Randomly select k features from m features (k < m)
an ensemble of many small decision trees whose predictions are combined (e.g. by majority vote)
Overfitting
when model is overtrained, memorizes training set
detectable when training accuracy is high but test accuracy is low
Avoid Overfitting
Reduce variables - reduce noise
Use cross-validation techniques
e.g. k-fold cross-validation
Use regularization techniques such as LASSO to penalize certain model parameters if they're likely to cause overfitting
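A sketch of both ideas together, assuming synthetic data from scikit-learn's make_regression and an arbitrary LASSO penalty strength:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic regression data for illustration
X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)

# LASSO adds an L1 penalty that can shrink unhelpful coefficients to zero
lasso = Lasso(alpha=1.0)

# 5-fold cross-validation: each fold serves once as a held-out test set
scores = cross_val_score(lasso, X, y, cv=5)
print(scores.mean())   # a big gap vs. training score would suggest overfitting
```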
Feature Selection
Filter Methods
Linear Discriminant Analysis
ANOVA
Chi-Square
Wrapper Methods
Forward Selection
Backward selection
Recursive feature elimination
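A minimal sketch of recursive feature elimination with scikit-learn, on synthetic data and with an arbitrary choice of 3 features to keep:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Recursively drop the weakest feature until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
```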
Dealing with missing data values:
If data set is huge
remove rows with missing values
If not
Substitute missing values with mean of dataset (using pandas)
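A minimal pandas sketch of both options, using an invented height column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [170.0, np.nan, 165.0, 180.0]})  # toy column

# Small dataset: substitute missing values with the column mean
df["height"] = df["height"].fillna(df["height"].mean())

# Huge dataset: dropping the rows instead is usually safe
# df = df.dropna()
print(df)
```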
Dimensionality Reduction
Convert set of data with vast dimensions into less dimensions
reduces storage space
reduces computation time
removes redundant features (no point storing the same length in both metres and inches)
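A minimal sketch using PCA (one common dimensionality-reduction technique), on random data with an arbitrary choice of 2 components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # 100 samples, 10 dimensions

# Project onto the 2 directions that preserve the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance kept by each component
```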
Maintaining Deployed Model
Monitor
needed to determine performance accuracy of models
Evaluate
evaluation metrics of current model calculated to determine if new algorithm is needed
Compare
new models compared against each other, see which performs the best
Rebuild
retrain the model on current data and redeploy it
K-Means
K = # of different groups selected
Algorithm (clusters the data into k groups):
1. Select k points at random as cluster centers
2. Assign each object to its closest center according to Euclidean distance
3. Calculate the mean of all objects in each cluster and make it the new center
4. Repeat steps 2 and 3 until the same points are assigned to each cluster (see the sketch below)
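A from-scratch sketch of these steps with NumPy, on toy two-cluster data (a production version would also guard against empty clusters):

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k points at random as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each object to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster's objects
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the assignments (and hence centers) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centers = k_means(X, k=2)
print(centers)
```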
Selecting k for k-means
Elbow Method:
Graph the cost function (within-cluster sum of squares) against k
The elbow is where the curve bends and the cost stops dropping sharply; that is your k value (you want to minimize the cost function without adding unnecessary clusters)
The curve flattens after that k, so there is no point in taking larger values
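A minimal sketch of the elbow method using scikit-learn's KMeans, whose inertia_ attribute is the within-cluster sum of squares (toy three-cluster data):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(30, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

# Plot (or print) the cost for each k; the "elbow" is where the drop flattens
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # within-cluster sum of squared distances
```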
P-Value
Null hypothesis
No variation between variables
for a single variable, the null hypothesis is that its mean equals some specified value
p < 0.05
strong evidence against the null hypothesis, so reject it
p > 0.05
weak evidence against the null hypothesis, so fail to reject it
your (alternative) hypothesis is probably wrong
since the null hypothesis seems correct, there is no variation between the variables
p = 0.05
marginal (could go either way); see the t-test sketch below
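A minimal sketch of reading a p-value from a one-sample t-test, using invented measurements and a null mean of 5.0:

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.3, 5.2, 4.8, 5.0])   # illustrative measurements

# Null hypothesis: the population mean equals 5.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

if p_value < 0.05:
    print(f"p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: fail to reject the null hypothesis")
```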
Outliers
Drop:
Drop only if it is garbage value
if you are measuring height, and the height is a string, you can remove this
If they have extreme values, they can be removed
If you can't drop:
try a different model (if the data looks like a curve rather than a line, don't use a linear model)
try normalizing data (extreme data points pulled to similar range)
use algorithms less affected by outliers (eg. random forest)
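A minimal sketch of the common 1.5 x IQR rule of thumb for flagging extreme values (invented heights):

```python
import pandas as pd

heights = pd.Series([160, 165, 170, 172, 168, 250])   # 250 is an extreme value

# Flag values outside 1.5 * IQR of the quartiles, then drop them
q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
mask = heights.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(heights[mask])    # outlier dropped
```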
Stationary Time Series Data
when the mean and variance of the series are constant over time
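A rough sanity check, assuming a white-noise series (stationary by construction): compare the mean and variance of two halves; a formal alternative would be an augmented Dickey-Fuller test.

```python
import numpy as np

series = np.random.randn(200)          # white noise: stationary by construction

first, second = series[:100], series[100:]
# If the series is stationary, both halves should have similar mean and variance
print(first.mean(), second.mean())
print(first.var(), second.var())
```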
Confusion Matrix:
describes performance of classification model
actual vs. predicted
Calculating Accuracy Using the Confusion Matrix
Accuracy = (True Positive + True Negative) / Total Observations
Actual P and Predicted P = True Positive
Actual N and Predicted N = True Negative
Actual P and Predicted N = False Negative
Actual N and Predicted P = False Positive
Precision and Recall Rate
Precision = (True Positive) / (True Positive + False Positive)
Recall Rate = (True Positive) / (True Positive + False Negative)
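A quick sketch computing all three metrics from invented confusion-matrix counts:

```python
# Confusion matrix counts (invented numbers): rows = actual, columns = predicted
tp, fn = 40, 10      # actual positive: predicted positive / negative
fp, tn = 5, 45       # actual negative: predicted positive / negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)   # 0.85, 0.888..., 0.8
```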
Machine Learning Algorithms Used for Imputing Missing Values of both Categorical and Continuous Variables
K-NN
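A minimal sketch with scikit-learn's KNNImputer (continuous values; categorical columns would need encoding first):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Each missing entry is filled in from the k nearest rows (by observed columns)
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```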
Calculate Entropy
p = # of target examples (usually class 1)
n = # of non-target examples (usually class 0)
Entropy = -(p/(p+n)) * log2(p/(p+n)) - (n/(p+n)) * log2(n/(p+n))
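Plugging toy counts into the formula (9 positive and 5 negative examples, a classic textbook split):

```python
import math

p, n = 9, 5          # e.g. 9 positive and 5 negative examples (toy counts)
total = p + n
entropy = (-(p / total) * math.log2(p / total)
           - (n / total) * math.log2(n / total))
print(entropy)       # approximately 0.940
```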
Choose algorithms based on case
Question: Probability of death from heart disease based on 3 risk factors: age, gender, blood cholesterol
Answer: Logistic Regression
Question: After studying the behaviour of a population, you have identified 4 specific individual types who are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm?
Answer:
K-means clustering (grouping people together - 4 is the k value)
Choose analysis method based on case
Question: Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine whether offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use?
Answer: One-Way ANOVA
Association Rules Algorithm
e.g. Apriori: finds frequent itemsets in transaction data and derives if-then association rules from them (measured by support and confidence)