Hadoop Interview Questions

Dear readers, these Hadoop interview questions have been designed to acquaint you with the kind of questions you may encounter during a Hadoop interview. Listed here are the top 25 Hadoop interview questions.

1. What is Hadoop?

It is an open-source, Java-based programming framework.
Hadoop supports the processing and storage of extremely large data sets in a distributed computing environment.

2. What are the core methods of a Reducer?

The three core methods of a Reducer are:

setup(): Used to configure various parameters, such as the input data size and the distributed cache. It is called once at the start of the task: public void setup(context)
reduce(): The heart of the reducer, called once per key with the values associated with that key: public void reduce(key, values, context)
cleanup(): Cleans up temporary files. It is called only once, at the end of the task: public void cleanup(context)
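
The calling order of these three methods can be illustrated with a plain-Python sketch. This is not the Hadoop API: the `SumReducer` class and the `run_reduce_task` driver are hypothetical names made up for the illustration, and sorting the pairs stands in for the framework's shuffle-and-sort step.

```python
from itertools import groupby
from operator import itemgetter


class SumReducer:
    """Hypothetical reducer that sums the values seen for each key."""

    def setup(self):
        # Runs once before any key is processed: initialise task state.
        self.output = []
        self.calls = ["setup"]

    def reduce(self, key, values):
        # Runs once per key, with every value grouped under that key.
        self.calls.append(f"reduce({key})")
        self.output.append((key, sum(values)))

    def cleanup(self):
        # Runs once after the last key: release temporary resources.
        self.calls.append("cleanup")


def run_reduce_task(reducer, pairs):
    """Drive the lifecycle: setup, then reduce once per key, then cleanup."""
    reducer.setup()
    # Sorting by key stands in for the framework's shuffle-and-sort phase.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        reducer.reduce(key, [value for _, value in group])
    reducer.cleanup()
    return reducer.output


sum_reducer = SumReducer()
result = run_reduce_task(sum_reducer, [("b", 1), ("a", 2), ("b", 3)])
print(result)
print(sum_reducer.calls)
```

The recorded call list shows one setup, one reduce per distinct key, and one cleanup, whatever the number of input pairs.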

3. What is a Sequence File in Hadoop?

A Sequence File is a flat file of binary key/value pairs. There are three sequence file formats:

Uncompressed key/value records
Record-compressed key/value records: only the values are compressed.
Block-compressed key/value records: keys and values are collected in blocks separately and compressed. The size of the block is configurable.

4. What is the role of the JobTracker in Hadoop?

This process usually runs on a separate node, not on a DataNode.
The JobTracker communicates with the NameNode to identify the data location.
It monitors the individual TaskTrackers and submits the overall job status back to the client.
It finds the best TaskTracker nodes to execute tasks on the given nodes.
It tracks the execution of MapReduce workloads local to the slave node.

5. Explain the difference between NameNode, Checkpoint NameNode and BackupNode?

NameNode:

The NameNode is the centrepiece of an HDFS file system.
It keeps the directory tree of all files in the file system.
It tracks where in the cluster the file data is stored; it does not store the data of these files itself.

BackupNode:

It provides checkpointing functionality similar to the Checkpoint NameNode while staying synchronized with the NameNode.
It maintains an up-to-date in-memory copy of the file system namespace, so it does not need to fetch changes at regular intervals.
To create a new checkpoint, the BackupNode only needs to save its current in-memory state to an image file.

Checkpoint NameNode:

It has the same directory structure as the NameNode and creates checkpoints of the namespace at regular intervals by downloading the image and edits files and merging them within the local directory.
It is commonly known as the Secondary Node; it does not support the upload-to-NameNode functionality.

6. Compare Hadoop & Spark?

Criteria	Hadoop	Spark
Processing speed	Average	Excellent
Storage	HDFS	None
Framework	Java-based	Open-source cluster computing
Libraries	Separate tools available	Spark Core, SQL, Streaming, MLlib, GraphX

7. What are the real-time industry applications of Hadoop?

Industry applications of Hadoop:

Stream processing
Managing traffic on streets
Content management and email archiving
Analysing customer data in real time to improve business performance
Managing content, posts, images and videos on social media platforms
Ad-targeting platforms use Hadoop to capture and analyse clickstream, transaction, video and social media data
Fraud detection and prevention

8. Which companies use Hadoop?

Yahoo, Facebook, Amazon, Adobe, eBay, Twitter.

9. In which modes can Hadoop be run?

Hadoop can be run in three modes:

Standalone Mode:

Hadoop uses the local file system for input and output operations.
This is the default mode and is mainly used for debugging.
It does not support the use of HDFS.
No custom configuration is required for the core-site.xml, mapred-site.xml and hdfs-site.xml files.

Pseudo-Distributed Mode:

Configuration is required for all three files mentioned above.
All daemons run on one node, so the master and slave node are the same.

Fully Distributed Mode:

Data is distributed across several nodes of a Hadoop cluster.
Separate nodes are allotted as master and slave.

10. What is the best hardware configuration to run Hadoop?

The best hardware configuration:

Hadoop jobs run well on dual-core machines or dual processors with 4 GB or 8 GB of RAM that use ECC memory.
Hadoop benefits greatly from ECC memory, although ECC memory is not low-end.
ECC memory is recommended for running Hadoop because most users have experienced various checksum errors when using non-ECC memory.
The hardware configuration depends on workflow requirements and can change accordingly.

11. Explain the indexing process in HDFS.

The indexing process in HDFS depends on the block size.
HDFS stores the last part of the data, which in turn points to the address where the next part of the data chunk is stored.

12. Explain the different catalogue tables in HBase.

ROOT table:

This table tracks the location of the META table.

META table:

This table stores details of all the regions in the system.

13. Does Flume provide 100% reliability to the data flow?

Yes, it provides end-to-end reliability through its transactional approach to data flow.

14. What are the most common Input Formats in Hadoop?

The three most common input formats:

Key Value Input Format
Text Input Format
Sequence File Input Format
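
To show what the first two formats hand to a mapper, here is a minimal plain-Python sketch. It imitates the record shapes only and is not Hadoop code: the function names and the sample data are made up for illustration. The Text format yields the byte offset of each line as the key and the line as the value; the Key Value format splits each line at the first separator (a tab is assumed here). The Sequence File format reads binary key/value pairs and is not sketched.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line) pairs, the shape Text Input Format produces."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # the next key is the offset of the next line


def key_value_records(data: bytes, separator: str = "\t"):
    """Yield (key, value) pairs, splitting each line at the first separator."""
    for _, line in text_input_records(data):
        # A line without the separator becomes the key, with an empty value.
        key, _, value = line.partition(separator)
        yield key, value


sample = b"apple\t3\nbanana\t5\n"  # hypothetical file contents
text_records = list(text_input_records(sample))
kv_records = list(key_value_records(sample))
print(text_records)
print(kv_records)
```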

15. How many input splits are made by the Hadoop framework?

Assuming a block size of 64 MB, Hadoop makes 5 splits for the following files:

One split for a file of up to 64 MB
Two splits for a 65 MB file
Two splits for a 127 MB file
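
The arithmetic behind these counts can be checked with a short sketch. It assumes a 64 MB block size and a simplified model of one split per block (ceiling division); the real framework applies further rules when computing splits, so actual counts on a cluster may differ. The file sizes are illustrative.

```python
import math

BLOCK_SIZE_MB = 64  # assumed block size; configurable on a real cluster


def split_count(file_size_mb: float, block_size_mb: float = BLOCK_SIZE_MB) -> int:
    """Number of splits under a simple one-split-per-block model."""
    return math.ceil(file_size_mb / block_size_mb)


file_sizes_mb = [64, 65, 127]  # illustrative file sizes in MB
splits = [split_count(size) for size in file_sizes_mb]
total_splits = sum(splits)
print(splits, total_splits)
```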

16. What is Sqoop in Hadoop?

Sqoop is a tool used to transfer data between a relational database management system (RDBMS) and Hadoop HDFS.
It is used to import data from relational databases such as MySQL and Oracle into Hadoop, and to export data from the Hadoop file system to relational databases.

17. What is Hadoop Streaming?

Hadoop Streaming is a generic API.
It allows mappers and reducers to be written in any language.
Hadoop uses UNIX standard streams as the interface between your application and the Hadoop system.
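
As an illustration, a word count written for streaming might look like the Python sketch below. The function names are hypothetical, and the tab-separated "key, tab, value" line layout is an assumption of this sketch. On a cluster the two functions would live in separate scripts reading sys.stdin and writing sys.stdout; here in-memory streams and sorted() stand in for the real streams and for the sort Hadoop performs between the two phases.

```python
import io
from itertools import groupby


def wordcount_mapper(stdin, stdout):
    """Emit one tab-separated "word<TAB>1" line per word read from stdin."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


def wordcount_reducer(stdin, stdout):
    """Sum counts per word; input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")


# Local dry run with in-memory streams instead of a cluster.
map_out = io.StringIO()
wordcount_mapper(io.StringIO("to be or not to be\n"), map_out)
sorted_lines = sorted(map_out.getvalue().splitlines(keepends=True))
reduce_out = io.StringIO()
wordcount_reducer(io.StringIO("".join(sorted_lines)), reduce_out)
streaming_result = reduce_out.getvalue()
print(streaming_result, end="")
```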

18. What is the difference between RDBMS and Hadoop?

RDBMS	Hadoop
Used for OLTP processing	Used for analytical and big data processing
Relational database management system	Node-based flat structure
The database cluster uses the same data files, stored in shared storage	Data can be stored independently in each processing node

19. What is Hadoop MapReduce?

It is a software framework for processing large data sets on compute clusters of commodity hardware. Data analysis uses a two-step map and reduce process.
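
The two steps, plus the grouping that happens between them, can be imitated in memory with a small Python sketch. Everything here is illustrative: the helper names, the "year,temperature" input records and the maximum-per-year job are assumptions, and the real framework distributes each phase across the cluster.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every record and collect the emitted (key, value) pairs."""
    return [pair for record in records for pair in map_fn(record)]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reduce_fn):
    """Apply reduce_fn once per key to that key's list of values."""
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


def parse_reading(record):
    """Map step: turn "year,temperature" into a (year, temperature) pair."""
    year, temperature = record.split(",")
    yield year, int(temperature)


def hottest(year, temperatures):
    """Reduce step: keep the highest temperature seen for the year."""
    return max(temperatures)


readings = ["2019,21", "2020,25", "2019,30", "2020,18"]  # hypothetical input
max_by_year = reduce_phase(shuffle(map_phase(readings, parse_reading)), hottest)
print(max_by_year)
```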

20. What happens when a DataNode Fails?

The JobTracker and NameNode detect the failure.
All tasks on the failed node are rescheduled.
The NameNode replicates the user's data to another node.

21. What are the core components of Hadoop?

HDFS
MapReduce

22. What are the network requirements for using Hadoop?

Password-less SSH connection
Secure Shell (SSH) for launching server processes

23. What is the distributed cache in Hadoop?

The distributed cache in Hadoop is a facility provided by the MapReduce framework to cache files needed by applications.
Once a file is cached for a job, it is made available on every data node where the job's map and reduce tasks run.

24. Explain how you can debug Hadoop code.

Hadoop code can be debugged:

By using the web interface provided by the Hadoop framework
By using counters
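
As one example of the counter approach, a Hadoop Streaming task can update a counter by writing a line of the form reporter:counter:<group>,<counter>,<amount> to standard error. The sketch below counts malformed input lines instead of failing on them. The "name,age" record layout, the function names and the counter group and name are assumptions made for illustration.

```python
import io
import sys


def counter_line(group: str, counter: str, amount: int = 1) -> str:
    """Format a counter update in the form Hadoop Streaming reads from stderr."""
    return f"reporter:counter:{group},{counter},{amount}\n"


def parse_record(line: str, stderr=sys.stderr):
    """Parse "name,age"; count a malformed line instead of failing the task."""
    fields = line.strip().split(",")
    if len(fields) != 2 or not fields[1].isdigit():
        stderr.write(counter_line("DataQuality", "MalformedRecords"))
        return None
    return fields[0], int(fields[1])


# Local dry run: an in-memory stream captures what would go to stderr.
log = io.StringIO()
parsed = [parse_record(line, log) for line in ["ann,31", "broken line", "bob,x"]]
print(parsed)
print(log.getvalue(), end="")
```

After the job finishes, the counter total shows how many records were skipped, which is often quicker than searching through task logs.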

25. What are the basic parameters of a Mapper?

The basic parameters of a Mapper are:

LongWritable and Text (input key and value)
Text and IntWritable (output key and value)
