Hadoop Interview Questions
Dear readers, these Hadoop interview questions have been designed especially to acquaint you with the nature of questions you may encounter during an interview on the subject of Hadoop. Here I have listed the top 25 Hadoop interview questions.
1. What is Hadoop?
- It is an open-source, Java-based programming framework.
- Hadoop supports the processing and storage of extremely large data sets in a distributed computing environment.
2. What are the core methods of a Reducer?
Three core methods of a Reducer (a minimal sketch follows this list):
- setup(): used for configuring various parameters such as the input data size and the distributed cache; signature: public void setup(Context context).
- reduce(): the heart of the reducer, called once per key with the associated list of values; signature: public void reduce(Key key, Iterable<Value> values, Context context).
- cleanup(): cleans up temporary files and resources, called only once at the end of the task; signature: public void cleanup(Context context).
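A minimal word-count style sketch of these three methods (the class name IntSumReducer and the summing logic are illustrative, not part of the original answer):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void setup(Context context) {
        // Runs once before the first reduce() call: read configuration,
        // open distributed-cache files, initialise resources.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key, with every value associated with that key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }

    @Override
    protected void cleanup(Context context) {
        // Runs once after the last reduce() call: close resources and
        // clean up temporary files.
    }
}
```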
3. What is Sequence File in Hadoop?
A SequenceFile is a flat file of binary key/value pairs. There are three SequenceFile formats (a short writer sketch follows this list):
- Uncompressed key/value records.
- Record-compressed key/value records: only the "values" are compressed.
- Block-compressed key/value records: keys and values are collected in blocks and compressed separately, and the size of the block is configurable.
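A minimal sketch of writing a block-compressed SequenceFile (the file name example.seq and the IntWritable/Text key and value types are assumptions for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // BLOCK compression groups many records and compresses them together;
        // RECORD would compress each value individually, NONE not at all.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("example.seq")),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            writer.append(new IntWritable(1), new Text("first record"));
        }
    }
}
```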
4. What is the JobTracker's role in Hadoop?
- The JobTracker process runs on a separate node, not on a DataNode.
- The JobTracker communicates with the NameNode to identify data locations.
- It monitors the individual TaskTrackers and submits the overall job status back to the client.
- It finds the best TaskTracker nodes to execute tasks on the given nodes.
- It tracks the execution of MapReduce workloads, scheduling tasks local to the data wherever possible.
5. Explain the difference between NameNode, Checkpoint NameNode and BackupNode?
NameNode:
- The NameNode is the centrepiece of an HDFS file system.
- It keeps the directory tree of all files in the file system.
- It tracks where, across the cluster, the file data is being stored.
BackupNode:
- It provides the same checkpointing functionality as the Checkpoint NameNode and stays synchronized with the NameNode.
- It maintains an up-to-date, in-memory copy of the file system namespace, so it does not need to fetch the changes at regular intervals.
- To create a new checkpoint, the Backup Node only needs to save the current in-memory state of the namespace to an image file.
Checkpoint NameNode:
- It has the same directory structure as the NameNode and creates checkpoints of the namespace at regular intervals by downloading the fsimage and edits files and merging them in a local directory.
- It then uploads the new image back to the active NameNode. It is commonly known as the Secondary NameNode.
6. Compare Hadoop & Spark?
Criteria | Hadoop | Spark |
--- | --- | --- |
Processing speed | Average | Excellent |
Storage | HDFS | None (relies on HDFS or other stores) |
Framework | Java-based framework | Open-source cluster computing framework |
Libraries | Separate tools available | Spark Core, SQL, Streaming, MLlib, GraphX |
7. What are real time industry applications of Hadoop?
Industries use Hadoop for:
- Stream processing
- Managing traffic on streets
- Content management and archiving emails
- Analysing customer data in real time to improve business performance
- Managing content, posts, images and videos on social media platforms
- Advertisement targeting platforms, which use Hadoop to capture and analyse clickstream, transaction, video and social media data
- Fraud detection and prevention
8. What companies use Hadoop, any idea?
Yahoo, Facebook, Amazon, Adobe, eBay, Twitter.
9. What are the modes Hadoop can be run in?
Hadoop can be run in three modes:
Standalone Mode:
- The default mode; Hadoop uses the local file system for input and output operations.
- It is mainly used for debugging.
- It does not use HDFS.
- No custom configuration is required for the core-site.xml, mapred-site.xml and hdfs-site.xml files.
Pseudo-Distributed Mode:
- Requires configuration in all three files mentioned above.
- All daemons run on one node, and thus both the Master and Slave nodes are the same machine.
Fully Distributed Mode:
- Data is distributed across several nodes of a Hadoop cluster.
- Separate nodes are allotted as Master and Slaves.
A small sketch for checking which mode a client configuration selects follows.
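This sketch (assuming the Hadoop client libraries are on the classpath; the class name is illustrative) only prints fs.defaultFS, where file:/// indicates standalone mode and hdfs://localhost:9000 is the usual pseudo-distributed setting:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ModeCheck {
    public static void main(String[] args) throws Exception {
        // new Configuration() loads core-site.xml and friends from the classpath.
        Configuration conf = new Configuration();
        // "file:///" => standalone; "hdfs://..." => pseudo- or fully distributed.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
        System.out.println("FileSystem   = " + FileSystem.get(conf).getClass().getName());
    }
}
```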
10. What is the best hardware configuration to run Hadoop?
The best hardware configuration:
- Hadoop jobs run well on dual-core machines or dual processors with 4 GB or 8 GB of RAM, using ECC memory.
- Hadoop benefits greatly from ECC memory, although it is not a low-end option.
- ECC memory is recommended for running Hadoop because most users have experienced various checksum errors when using non-ECC memory.
- This configuration depends on workflow requirements and can change accordingly.
11. Explain the indexing process in HDFS?
The indexing process in HDFS depends on the block size. HDFS stores the last part of the data, which points to the address where the next part of the data chunk is stored.
12. Explain the different catalogue tables in HBase?
-ROOT- TABLE:
This table tracks the location of the .META. table.
.META. TABLE:
This table stores the details of all the regions in the system.
13. Does Flume provide 100% reliability to the data flow?
Yes, Flume provides end-to-end reliability to the data flow through its transactional approach.
14. What are the most common Input Formats in Hadoop?
The three most common input formats (a sketch of setting one on a job follows this list):
- Key-value input format (KeyValueTextInputFormat)
- Text input format (TextInputFormat, the default)
- Sequence file input format (SequenceFileInputFormat)
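A minimal sketch of selecting one of these formats on a job; everything except the setInputFormatClass() call is generic job setup:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input format demo");
        // TextInputFormat is used by default; override it explicitly when
        // each input line is a tab-separated key/value pair.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}
```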
15. How many input splits will the Hadoop framework make for files of 64 KB, 65 MB and 127 MB, given a 64 MB block size?
Hadoop makes 5 splits (see the arithmetic sketch below):
- One split for the 64 KB file
- Two splits for the 65 MB file
- Two splits for the 127 MB file
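A back-of-the-envelope sketch of that arithmetic: splits ≈ ceil(fileSize / blockSize). Note that the real FileInputFormat also applies a 1.1x "split slop" tolerance, which this deliberately ignores:

```java
public class SplitCount {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB HDFS block size
        long[] fileSizes = {64L * 1024, 65L * 1024 * 1024, 127L * 1024 * 1024};
        long total = 0;
        for (long size : fileSizes) {
            long splits = (size + blockSize - 1) / blockSize; // ceiling division
            total += splits;
            System.out.println(size + " bytes -> " + splits + " split(s)");
        }
        System.out.println("total = " + total + " splits"); // prints: total = 5 splits
    }
}
```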
16. What is Sqoop in Hadoop?
- Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS.
- It is used to import data from relational databases such as MySQL and Oracle into Hadoop, and to export data from the Hadoop file system back to a relational database.
17. What is Hadoop Streaming?
- Hadoop Streaming is a generic API.
- It allows writing mappers and reducers in any language.
- Hadoop Streaming uses UNIX standard streams as the interface between your application and the Hadoop system.
18. What is the difference between RDBMS and Hadoop?
RDBMS | Hadoop |
--- | --- |
RDBMS is used for OLTP processing. | Hadoop is used for analytical and big data processing. |
It is a relational database management system. | It is a node-based flat structure. |
The database cluster uses the same data files stored in shared storage. | The data can be stored independently in each processing node. |
19. What is Hadoop MapReduce?
A software framework for processing large data sets on compute clusters of commodity hardware. Data analysis uses a two-step process: a map step and a reduce step (a minimal driver sketch follows).
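As a sketch of how the two steps are wired together, here is a minimal word-count driver. It assumes the TokenizerMapper shown under question 25 and the IntSumReducer shown under question 2 are on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class); // the map step
        job.setReducerClass(IntSumReducer.class);  // the reduce step
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```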
20. What happens when a DataNode Fails?
- The JobTracker and NameNode detect the failure.
- All tasks on the failed node are re-scheduled.
- The NameNode replicates the user's data to another node.
21. Mention the Hadoop core components?
- HDFS
- MapReduce
22. List the network requirements for using Hadoop?
- Password-less SSH connection
- Secure Shell (SSH) for launching server processes
23. What is distributed cache in Hadoop?
- Distributed cache in Hadoop is a facility provided by the MapReduce framework to cache files needed by applications.
- Once a file is cached for a job, it is made available on each and every DataNode where map/reduce tasks run (a short sketch follows).
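A minimal sketch of the API; the HDFS path /lookup/countries.txt and the class names are hypothetical:

```java
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheDemo {
    // In the driver: register the file so the framework copies it to every node.
    static void configure(Job job) throws Exception {
        job.addCacheFile(new URI("/lookup/countries.txt")); // hypothetical HDFS path
    }

    public static class CacheAwareMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // In the task: the registered files are available on the local node.
            URI[] cached = context.getCacheFiles();
            // ... open cached[0] and load the lookup data here ...
        }
    }
}
```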
24. Explain how you can debug Hadoop code?
Ways to debug Hadoop code:
- By using the web interface provided by the Hadoop framework
- By using counters (a counter sketch follows this list)
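A minimal sketch of the counter approach inside a mapper; the group name "Debug", the counter name "EMPTY_RECORDS" and the mapper itself are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().isEmpty()) {
            // Counter totals appear in the web UI and the job's console
            // output, which makes them a cheap debugging tool on a cluster.
            context.getCounter("Debug", "EMPTY_RECORDS").increment(1);
            return; // skip the suspicious record
        }
        context.write(new Text("line"), value);
    }
}
```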
25. What are the basic parameters of a Mapper?
The basic parameters of a Mapper are:
- LongWritable and Text (the input key and value types)
- Text and IntWritable (the output key and value types)
A minimal mapper with these parameters is sketched below.
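This is the word-count style TokenizerMapper referenced under question 19; the tokenizing logic is illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value: LongWritable (byte offset) and Text (the line itself).
// Output key/value: Text (a word) and IntWritable (a count of 1).
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```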