**Data Analytics Interview Questions**

**1. What are the responsibilities of a data analyst?**

- Provide support for all data analysis and coordinate with customers and staff.
- Analyze results, interpret data using statistical techniques, and provide ongoing reports.
- Identify new processes or areas with opportunities for improvement.
- Resolve business-related issues for clients and perform audits on data.
- Filter and “clean” data and review computer reports.
- Secure the database by developing an access system that determines each user's level of access.

### 2. What are the various steps in an analytics project?

The main steps in an analytics project are:

- Problem definition
- Data exploration
- Data preparation
- Modelling
- Validation of data
- Implementation and tracking

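### 3. What is logistic regression?

Logistic regression is a statistical method for examining a data set in which one or more independent variables determine an outcome. The outcome is categorical, typically binary (for example yes/no), and the model estimates the probability of each outcome.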
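As a minimal sketch of the idea, the snippet below turns a weighted sum of independent variables into a probability with the logistic (sigmoid) function. The weights and bias are hypothetical values picked by hand for illustration, not fitted to any data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features: list[float], weights: list[float], bias: float) -> float:
    """Estimated probability of the positive outcome for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients (assumed, not learned from data).
weights = [0.8, -0.5]
bias = 0.0

p = predict_proba([1.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # classify with a 0.5 threshold
```

In practice the coefficients would be estimated from training data, for example by maximum likelihood.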

### 4. What are common problems faced by data analysts?

Some common problems faced by data analysts are:

- Duplicate entries
- Common misspellings
- Missing values
- Illegal values
- Varying value representations
- Identifying overlapping data

#### 5. What are KPIs and design of experiments?

KPI: key performance indicator. A KPI is used to measure how successful an organisation is at a particular activity in which it is engaged.

**Design of experiments**

This is the initial process used to split your data, sample it, and set it up for statistical analysis.

**6. What data validation methods do data analysts use?**

**Data validation methods**

- Data Verification
- Data screening

#### 7. What is the hierarchical clustering algorithm?

**Hierarchical clustering algorithm**

It combines or divides existing groups, creating a hierarchical structure that shows the order in which the groups are merged or divided.
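The merging variant can be sketched in a few lines. This is an illustrative agglomerative pass with single linkage on one-dimensional points; the function name and the choice of single linkage are assumptions made for the example, and real projects would normally use a library implementation.

```python
def single_linkage(points: list[float]) -> list[tuple[list[float], list[float]]]:
    """Agglomerative clustering on 1-D points; returns the merges in order."""
    clusters = [[p] for p in points]  # start with one cluster per point
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest gap between any two members.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = single_linkage([1.0, 2.0, 10.0, 11.5])
for left, right in merges:
    print(left, "+", right)
```

The recorded merge order is the hierarchy: the closest points join first, and the last merge joins the two most distant groups.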

#### 8. What tools are used in Big Data?

**Big Data tools include:**

- Flume
- Pig
- Hadoop
- Mahout
- Sqoop
- Hive
- PolyBase

#### 9. What is logistic regression?

Logistic regression is a statistical method for examining a data set in which one or more independent variables determine an outcome. The outcome is categorical, typically binary.

#### 10. What is required to become a data analyst?

- Strong knowledge of statistical packages for analysing large datasets (for example SAS, Excel, SPSS).
- Analytical and mathematical skills.
- Working experience with various programming languages.
- Technical knowledge of database types, and familiarity with data warehousing and data manipulation.
- Strong ability to collect, organize, analyse, and disseminate big data with accuracy.

#### 11. What is data cleansing?

Data cleansing is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated.

#### 12. What is a hash table?

A hash table is a data structure used to implement an associative array, a structure that maps keys to values. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
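A small sketch of that mechanism follows, assuming separate chaining (each bucket is a list of key/value pairs) to handle collisions. The class and method names are made up for the example; Python's built-in `dict` is the practical choice.

```python
class HashTable:
    """Toy hash table using separate chaining; illustrative only."""

    def __init__(self, n_buckets: int = 8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key) -> int:
        # The hash function output is reduced to a bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


table = HashTable()
table.put("rows", 1200)
table.put("columns", 14)
table.put("rows", 1250)  # replaces the earlier value
```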

### 13. What is an outlier?

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

There are two types of outlier:

- Univariate
- Multivariate
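For the univariate case, one common convention (an assumption here, not something the answer above prescribes) is to flag values more than 1.5 times the interquartile range outside the quartiles. A minimal sketch:

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


sample = [10, 12, 11, 13, 12, 11, 95]  # made-up data with one extreme value
outliers = iqr_outliers(sample)
```

The multiplier `k` is a tunable threshold; what counts as "abnormal" depends on the data and the analysis.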

### 14. What are some best practices for data cleaning?

- Develop a data quality plan.
- Standardize contact data at the point of entry.
- Check the accuracy of the dataset.
- Identify duplicates.
- Create a set of utility functions, tools, or scripts to handle common cleansing tasks. These might include remapping values based on a CSV file or SQL database, regex search-and-replace, or blanking out all values that don’t match a regex.
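The utility functions mentioned in the last point could look something like the sketch below. The function names, the sample mapping, and the ZIP-code pattern are all hypothetical and only illustrate the three kinds of helper.

```python
import csv
import io
import re


def load_mapping(csv_text: str) -> dict[str, str]:
    """Build an old-value -> new-value mapping from two-column CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def remap(value: str, mapping: dict[str, str]) -> str:
    """Replace a value if it appears in the mapping, else keep it."""
    return mapping.get(value, value)


def normalize_whitespace(value: str) -> str:
    """Regex search-and-replace: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", value).strip()


def blank_unless_match(value: str, pattern: str) -> str:
    """Blank out any value that does not fully match the regex."""
    return value if re.fullmatch(pattern, value) else ""


# Hypothetical mapping; in practice this would be read from a file or database.
mapping = load_mapping("NY,New York\nN.Y.,New York\n")
city = remap("N.Y.", mapping)
name = normalize_whitespace("  Ada   Lovelace ")
zip_ok = blank_unless_match("10001", r"\d{5}")
zip_bad = blank_unless_match("1000A", r"\d{5}")
```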

### 15. What missing-data patterns are generally observed?

Missing-data patterns that are generally observed:

- Missing completely at random
- Missing at random
- Missing that depends on the missing value itself
- Missing that depends on an unobserved input variable

### 16. What are your best traits that are suitable for this position?

**Suitable skills:**

- Excellent analytical skills
- Outstanding writing ability
- Attention to detail
- Expert communication skills
- Advanced knowledge of current programs such as Microsoft Excel
- Problem-solving ability

### 17. What is power analysis?

It is an important aspect of experimental design. Power analysis allows us to determine the sample size required to detect an effect of a given size with a given degree of confidence.
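As a rough illustration, the sketch below estimates the sample size per group for a two-sided comparison of two means, using the standard normal approximation. The function name and the default significance level (0.05) and power (0.80) are assumptions chosen for the example; dedicated statistical packages give more exact answers.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group to detect a standardized effect (Cohen's d)."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


n = sample_size_per_group(0.5)  # a "medium" standardized effect
print(n)
```

Smaller effects need larger samples: halving the effect size roughly quadruples the required sample.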

### 18. What is Gradient Descent?

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function, it takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.
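A minimal sketch on a one-dimensional function makes the update rule concrete. The function f(x) = (x - 3)^2, the learning rate, and the number of steps are arbitrary choices for illustration.

```python
def gradient_descent(grad, start: float, learning_rate: float = 0.1, steps: int = 100) -> float:
    """Repeatedly step against the gradient from a starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)  # step proportional to the negative gradient
    return x


# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

If the learning rate is too large the iterates can overshoot and diverge; if it is too small, convergence is slow.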

### 19. What are categorical variables?

A categorical variable is one that has two or more categories with no intrinsic ordering. It can take one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.
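Because the categories have no order, they are often turned into indicator columns before modelling. The sketch below shows one-hot encoding in plain Python; the `one_hot` helper and the colour data are made up for the example.

```python
def one_hot(values: list[str]) -> tuple[list[str], list[list[int]]]:
    """Encode each value as a 0/1 indicator vector, one slot per category."""
    categories = sorted(set(values))  # fixed, limited set of possible values
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


categories, encoded = one_hot(["red", "green", "red", "blue"])
```

The alphabetical ordering of `categories` is only a bookkeeping convention; it does not imply that one category ranks above another.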

### 20. What are various steps involved in an analytics project?

- Understand the business problem.
- Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.
- Run the model, analyse the result, and tweak the approach. This step is iterated until the best possible outcome is achieved.
- Validate the model using a new data set.

#### 21. Which imputation method is more favorable?

Imputation is the process of replacing missing data with substituted values. Substituting for a whole data point is called “unit imputation”; substituting for a component of a data point is called “item imputation”. Single imputation is widely used, but it does not reflect the uncertainty created by data missing at random. So multiple imputation is more favorable than single imputation when data are missing at random.
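The weakness of single imputation can be seen in a small sketch. Here missing items (represented as `None`, an assumption of this example) are filled with the observed mean, and the variance of the completed data shrinks, which is one way the uncertainty gets understated.

```python
import statistics


def mean_impute(values: list) -> list[float]:
    """Single imputation: replace every missing item with the observed mean."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


raw = [2.0, 4.0, None, 6.0, 8.0, None]  # made-up data with two missing items
completed = mean_impute(raw)

observed_var = statistics.variance([v for v in raw if v is not None])
completed_var = statistics.variance(completed)
print(observed_var, completed_var)
```

Multiple imputation addresses this by creating several completed data sets with varied plausible values and pooling the results; that procedure is not shown here.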

#### 22. What are the criteria for a good data model?

- Data can be easily consumed.
- A good model should scale to large changes in data.
- It should provide predictable performance.
- A good model can adapt to changes in requirements, but not at the expense.

#### 23. What is the curse of dimensionality?

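The curse of dimensionality refers to the problems that arise when analysing data with many variables: as the number of dimensions grows, the data become sparse and distances between points become less informative. The usual remedy is dimensionality reduction, the process of reducing the number of random variables under consideration by obtaining a set of principal variables. This can be divided into feature selection and feature extraction.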
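As one very simple illustration of feature selection, the sketch below drops columns whose variance does not exceed a threshold, since a column that never changes carries no information. The function name, threshold, and data are hypothetical; real projects use more sophisticated criteria.

```python
import statistics


def drop_low_variance(rows: list[list[float]], threshold: float = 0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    columns = list(zip(*rows))  # transpose rows into columns
    keep = [i for i, col in enumerate(columns) if statistics.pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return keep, reduced


rows = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.4],
]
kept, reduced = drop_low_variance(rows)  # the constant middle column is removed
```

Feature extraction, by contrast, builds new variables from combinations of the originals rather than picking a subset.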

#### 24. What are some of the best tools for data analysis?

- KNIME
- Solver
- io
- Wolfram Alpha
- Tableau Public
- RapidMiner
- OpenRefine
- Google Search Operators
- Google Fusion Tables
- NodeXL
- MATLAB
- Scilab
- GNU Octave

#### 25. Which framework did Apache develop for processing large data sets for applications in a distributed computing environment?

**Hadoop** is the framework developed by Apache. **MapReduce** is the programming model it implements for processing large data sets across a cluster.
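The MapReduce idea can be illustrated with a word count in plain Python. This sketch runs in a single process and does not use the Hadoop API; the function names are chosen for the example only.

```python
from collections import defaultdict


def map_phase(document: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data tools", "big data analysis"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

In a distributed setting the map and reduce calls run on different machines, and the framework handles the shuffle, scheduling, and fault tolerance.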