Big Data Interview Questions
Big Data refers to data sets that are so large or complex that traditional data processing software is inadequate to deal with them. Its challenges include data creation, capture, storage, search, sharing, transfer, analysis, visualization, querying, and information privacy. Here I have listed the top 25 Big Data interview questions.
1. What is Big Data?
- It describes large volumes of data, both structured and unstructured.
- The term Big Data often refers simply to the use of predictive analytics, user behavior analytics, and other advanced data analytics methods.
- It is about extracting value from data, and seldom about a particular size of data set.
- The challenges include capture, storage, search, sharing, transfer, analysis, and creation.
2. Which Hadoop tools are essential for working effectively with Big Data?
The effective tools are as follows:
- GIS tools
3. What are the key steps in Big Data solutions?
The key steps in Big Data solutions are ingesting data, storing data (data modelling), and processing data (data wrangling, data transformations, and querying data).
Data is ingested from sources such as:
- RDBMS: Relational Database Management Systems like Oracle, MySQL, etc.
- ERP: Enterprise Resource Planning systems like SAP.
- CRM: Customer Relationship Management systems like Siebel, Salesforce, etc.
- Social media feeds and log files.
- Flat files, docs, and images.
Storing data involves:
- Data storage formats
- Data modelling
- Metadata management
4. What is the data analysis process?
The analysis process has five steps:
Step 1: Define Your Questions
Step 2: Set Clear Measurement Priorities
Step 3: Collect Data
Step 4: Analyse Data
Step 5: Interpret Results
5. What is Big Data Analysis?
- It is defined as the process of mining large structured and unstructured data sets.
- It helps to find underlying patterns and other unfamiliar but useful information within the data, leading to business benefits.
6. Name some Big Data Products?
7. Where does Big Data come from?
There are three sources of Big Data:
- Social Data: It comes from social media channels and provides insights into consumer behaviour.
- Machine Data: It consists of real-time data generated from sensors and web logs. It tracks user behaviour online.
- Transaction Data: It is generated by large retailers and B2B companies on a frequent basis.
8. What is IBM’s simple explanation for Big Data’s four critical features?
Big Data features:
- Volume: Scale of Data
- Velocity: Analysis of streaming Data
- Variety: Different forms of Data
- Veracity: Uncertainty of Data
9. How can businesses benefit from Big Data?
- Big data analysis helps businesses work with real-time data.
- It can influence crucial decisions on the strategy and development of the company.
- Big data helps businesses differentiate themselves at scale in a competitive environment.
10. Where is the mapper's intermediate data stored?
- The mapper output is stored in the local file system of each individual mapper node.
- The temporary directory location can be set up in the configuration by the Hadoop administrator.
- The intermediate data is cleaned up after the Hadoop Job completes.
11. Differentiate between Structured and Unstructured data?
|Structured Data|Unstructured Data|
|Basis algorithms|Old algorithms|
|Spreadsheets, data from machine sensors|Human language|
|SQL|Windows Explorer, Mac Finder screen|
12. How are file systems checked in HDFS?
- A file system is used to control how data is stored and retrieved.
- Each file system has a different structure and logic, with different properties of speed, security, flexibility, and size.
- File systems are designed for particular hardware or uses. Examples include NTFS, UFS, XFS, and HDFS.
- In HDFS itself, the file system is checked with the hdfs fsck command, which reports problems such as missing or under-replicated blocks.
13. What is MapReduce?
- It is a core component of the Apache Hadoop software framework.
- It is a programming model and an associated implementation for processing and generating large data sets.
- It processes these data sets with a parallel, distributed algorithm on a cluster, where each node of the cluster has its own storage.
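The programming model can be illustrated with a small word count, the usual introductory MapReduce example. This is only a single-process sketch in plain Python, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for illustration, and on a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

The map and reduce functions never share state, which is what lets a framework spread them over many machines.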
14. What is speculative execution?
- It is an optimization technique.
- A computer system performs some task that may not actually be needed.
- This approach is employed in a variety of areas, including branch prediction in pipelined processors and optimistic concurrency control in database systems.
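In Hadoop, the same idea applies to tasks: when one attempt runs slowly, a duplicate attempt may be started on another node and the first result to arrive is used. The sketch below imitates that with threads in plain Python. The task `square_sum`, the helper `run_speculatively`, and the delays are all hypothetical, and a real scheduler would kill the losing attempt rather than wait for it.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def square_sum(numbers, delay):
    """Hypothetical task; `delay` simulates how slow the node running it is."""
    time.sleep(delay)
    return sum(n * n for n in numbers)


def run_speculatively(numbers, delays):
    """Start one attempt per delay and return the first result that arrives."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        attempts = [pool.submit(square_sum, numbers, d) for d in delays]
        done, _pending = wait(attempts, return_when=FIRST_COMPLETED)
        # Every attempt computes the same answer, so any finished one will do.
        # Leaving the `with` block still waits for the slower attempt to end.
        return next(iter(done)).result()


# The first attempt is a simulated straggler; the second is the fast duplicate.
result = run_speculatively([1, 2, 3], delays=[0.3, 0.01])
print(result)
```

The duplicate work is wasted when the original attempt finishes on time, which is the trade-off this optimization accepts.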
15. Pig Latin contains different relational operations; name them?
- foreach
- order by
16. Why are counters useful in Hadoop?
- Counters are an integral part of any Hadoop job.
- They are very useful for gathering relevant statistics.
- Consider a particular job that runs on a 150-node cluster with 150 mappers.
- Counters can be used to keep a final count of the relevant records across all mappers and present it as a single output.
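The aggregation described above can be sketched in plain Python, where each mapper keeps local counts and the job merges them into one total. This is an illustration rather than Hadoop's counter API: `run_mapper`, the counter names, and the sample splits are all assumptions.

```python
from collections import Counter


def run_mapper(records):
    """Hypothetical mapper: count valid and malformed records in its split."""
    local = Counter()
    for record in records:
        if record.strip():
            local["VALID_RECORDS"] += 1
        else:
            # Blank or whitespace-only lines are treated as malformed here.
            local["MALFORMED_RECORDS"] += 1
    return local


# Three input splits stand in for the many mappers of a real job.
splits = [["a,1", "", "b,2"], ["c,3", "  "], ["d,4"]]

job_counters = Counter()
for split in splits:
    # Merge each mapper's local counts into the job-wide totals.
    job_counters.update(run_mapper(split))

print(dict(job_counters))
```

However many mappers run, the job ends with a single set of totals.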
17. What is the default block size in HDFS?
The default block size in HDFS is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.
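The block size determines how many blocks a file is split into. The sketch below does that arithmetic, taking 64 MB as the default and passing 128 MB as an alternative. The helper `hdfs_block_count` is a made-up name for illustration, not part of any Hadoop library.

```python
MB = 1024 * 1024


def hdfs_block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Return how many blocks a file of the given size would occupy."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    # Integer ceiling division: a partly filled last block still counts.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


blocks_64 = hdfs_block_count(200 * MB)             # three full blocks plus 8 MB
blocks_128 = hdfs_block_count(200 * MB, 128 * MB)  # one full block plus 72 MB
print(blocks_64, blocks_128)
```

The last block holds only the remaining data, so it is not padded out to the full block size.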
18. Which hardware configuration is most beneficial for Hadoop jobs?
- Dual-processor or dual-core machines with 4 or 8 GB of RAM.
- ECC memory is suited to running Hadoop operations, although it cannot be considered low end.
- It is useful for Hadoop users because it helps to avoid checksum errors.
- The hardware configuration depends on the process and workflow needs of the specific project and may have to be customized.
19. What are the main distinctions between NAS and HDFS?
NAS (for example NFS):
- The files reside on a single machine.
- It does not provide any reliability guarantees.
- It can store only as much information as fits on one machine.
- Because all the data is stored on a single machine, all the clients must go to this machine to retrieve their data.
- The server can be overloaded if a large number of clients must be handled.
HDFS:
- It is designed to store a very large amount of information.
- This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
- HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
- Data is written to HDFS once and then read several times; updates to files after they have been closed are not supported.
20. What are the industry applications of big data?
- Banking And Securities
- Communication, Media & Entertainment
- Healthcare Providers
- Retail And Wholesale Trade
- Energy Utilities
21. How can you connect an application if you run Hive as a server?
- Through the ODBC driver, which supports the ODBC protocol.
- Through the JDBC driver, which supports the JDBC protocol.
- Through the Thrift client, which can make calls to all Hive commands from programming languages such as PHP, Java, C++, Python, and Ruby.
22. What kind of Data Warehouse application is suitable for Hive?
Hive is suitable for data warehouse applications where:
- Fast response times are not required
- Relatively static data is analysed, that is, the data is not changing rapidly
23. What is data serialization?
- Serialization is the process of converting object data into a byte stream.
- The byte stream is used for transmission over a network between different nodes in a cluster, or for persistent data storage.
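A round trip through bytes can be sketched in a few lines. JSON is used here only as a stand-in format, since Hadoop has its own serialization mechanisms such as Writable types. The `Reading` class and the helper names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Reading:
    """Hypothetical record type used only for this illustration."""
    sensor_id: str
    value: float


def serialize(reading):
    """Convert the object into a byte stream that can be sent or stored."""
    return json.dumps(asdict(reading)).encode("utf-8")


def deserialize(payload):
    """Rebuild the object from the byte stream on the receiving side."""
    return Reading(**json.loads(payload.decode("utf-8")))


original = Reading(sensor_id="sensor-7", value=21.5)
payload = serialize(original)    # bytes, ready for a socket or a file
restored = deserialize(payload)  # equal to the original object
print(payload, restored)
```

Text formats like JSON are easy to inspect, while binary formats are usually more compact, which matters when large volumes of data cross the network.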
24. How to debug Hadoop code?
- By using counters
- By web interface provided by Hadoop framework
25. What is metadata?
Metadata is the information about the data stored in DataNodes, such as the location and size of a file.