Here are some basic interview questions for application/production support projects on Hadoop, which came up during interviews at a leading MNC in India. If you are preparing for such support projects, I hope this will help you.
What is Hadoop?
Ans: It is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data and a large amount of processing power.
List all the files for the given hdfs path?
Ans: hdfs dfs -ls /path
How to display or check the content of a Hadoop file?
Ans: hdfs dfs -cat /Hadoop/run_log.txt
How to copy the Hadoop file to the local server?
Ans: hdfs dfs -copyToLocal /Hadoop/run_log.txt /tmp/<name of the directory>
How to update a Hadoop file?
Ans: HDFS files cannot be edited in place. First copy the file to the local server, update it using the vi editor, then put it back into HDFS, overwriting the original. For example:
hdfs dfs -copyToLocal /Hadoop/run_log.txt /tmp/<name of the directory>
vi /tmp/<name of the directory>/run_log.txt (update as per the requirement)
hdfs dfs -put -f /tmp/<name of the directory>/run_log.txt /Hadoop/run_log.txt
How to delete a Hadoop file?
Ans: hdfs dfs -rm -r /Hadoop/run_log.txt
How to copy files from dir1 to dir2 in between different Hadoop servers?
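Ans: hdfs dfs -cp /Hadoop/dir1/ /hadoop1/dir2 (this copies within one cluster unless full hdfs:// URIs are given; for large copies between clusters, hadoop distcp is the usual tool)
What is the difference between “rm -r” and “rm -r -skipTrash” while deleting a Hadoop file?
Ans: rm -r deletes the file and moves it to the trash (when trash is enabled), whereas rm -r -skipTrash deletes it permanently without sending it to the trash.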
How to create a zero size file in a given Hadoop path?
Ans: hdfs dfs -touchz /Hadoop/<name of the file>
How to check the version of Hadoop?
Ans: hadoop version
What are HDFS and YARN?
Ans: HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It stores all kinds of data as blocks in a distributed environment. YARN (Yet Another Resource Negotiator) is the resource management and processing framework in Hadoop. It manages cluster resources and provides an execution environment for processes.
What are the different types of daemons in the Hadoop cluster?
Ans: 'NameNode', 'DataNode', 'Secondary NameNode', 'ResourceManager', 'NodeManager', 'JobHistoryServer'.
What is a block in Hadoop?
Ans: HDFS stores each file as a sequence of blocks and distributes them across the Hadoop cluster. The default block size is 128 MB in Hadoop 2.x and later (it was 64 MB in Hadoop 1.x).
What is MapReduce?
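To make the block idea concrete, here is a minimal sketch of how a file's size maps to blocks. The function name `block_layout` is made up for this illustration, and the block size is a parameter because it depends on the Hadoop version and cluster configuration (128 MB is assumed here, 64 MB on older releases).

```python
MB = 1024 * 1024

def block_layout(file_size_bytes, block_size_bytes=128 * MB):
    """Return the sizes of the blocks a file of this size would be split into."""
    if file_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes

layout = block_layout(300 * MB)
print(len(layout), [size // MB for size in layout])  # 3 [128, 128, 44]
print(len(block_layout(300 * MB, 64 * MB)))          # 5
```

So a 300 MB file needs three blocks at 128 MB, or five blocks at 64 MB, and the final block is smaller than the configured block size.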
Ans: It is a programming model that is used for processing large data sets over a cluster of computers using parallel programming.
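The model can be sketched in plain Python with the classic word count. This is only an in-memory illustration of the map, shuffle and reduce steps, not Hadoop's actual API. All function names here are made up for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reduce: combine the values of one key into a single result."""
    return word, sum(counts)

def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

print(word_count(["big data big cluster", "data node"]))
# {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the framework performs the shuffle between nodes.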
Explain namenode in Hadoop?
Ans: It is a node where Hadoop stores all the file location information in HDFS. It keeps a record of all the files in the system and tracks the file data across the cluster or multiple machines.
Data components used by Hadoop?
Ans: Pig and Hive
Functionalities of a JobTracker?
Ans: The JobTracker (the job scheduler in Hadoop 1.x) does the following:
Accept job from the client.
Communicate with namenode to find the data.
Locate task tracker nodes with available slots.
Submit the work to the chosen task tracker node and monitor the progress of each task.
How to kill a yarn job?
Ans: yarn application -kill <application id> (get the application id from the Unix server, for example with yarn application -list)
What is the problem with small files in Hadoop?
Ans: HDFS is designed for a small number of large files rather than a large number of small files. It is inefficient at random reads of small files, and because the NameNode holds the whole HDFS namespace in memory, every file, directory and block adds to its load. Many small files can therefore overload the NameNode.
Difference between -rm, -rm -r, -skipTrash?
Ans: -rm removes files but cannot delete directories. -rm -r recursively removes directories and their contents. -skipTrash bypasses the trash and deletes the target immediately.
What does the 'getmerge' command do?
Ans: 'getmerge' merges the files in an HDFS directory into a single file on the local file system. Example: hadoop fs -getmerge /data/file <local destination file>
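A rough back-of-the-envelope sketch shows why small files hurt. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per namespace object (file or block). That figure is an approximation, not a measured value for any particular cluster, and the function name is made up for this example.

```python
MB = 1024 * 1024
GB = 1024 * MB
TB = 1024 * GB
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact figure

def namenode_objects(total_bytes, file_size_bytes, block_size_bytes=128 * MB):
    """Count file and block objects needed to store total_bytes as equal-sized files."""
    files = (total_bytes + file_size_bytes - 1) // file_size_bytes
    blocks_per_file = max(1, (file_size_bytes + block_size_bytes - 1) // block_size_bytes)
    # One object for the file itself plus one per block.
    return files * (1 + blocks_per_file)

for label, size in [("1 MB files", MB), ("1 GB files", GB)]:
    objects = namenode_objects(TB, size)
    print(label, objects, "objects,", objects * BYTES_PER_OBJECT // MB, "MB of NameNode memory")
```

Under these assumptions, 1 TB stored as 1 MB files needs about two million objects, while the same data stored as 1 GB files needs under ten thousand.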
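To illustrate what the merge amounts to, here is a small local sketch that concatenates every file in a directory into one output file. It runs on the local file system only and does not talk to HDFS. The function name, the part file names and the choice to merge in name order are assumptions of this sketch.

```python
import os
import tempfile

def merge_directory(src_dir, dst_file, add_newline=False):
    """Concatenate all regular files in src_dir (in name order) into dst_file."""
    names = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )
    with open(dst_file, "wb") as out:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as part:
                out.write(part.read())
            if add_newline:
                out.write(b"\n")  # separator after each part file
    return names

# Demo with two hypothetical part files in a temporary directory.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name, text in [("part-00001", b"b\n"), ("part-00000", b"a\n")]:
        with open(os.path.join(src, name), "wb") as f:
            f.write(text)
    target = os.path.join(dst, "merged.txt")
    merge_directory(src, target)
    with open(target, "rb") as f:
        merged = f.read()

print(merged)  # b'a\nb\n'
```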
What is the 'tail' command in Hadoop?
Ans: This command shows the last 1 KB of the file. Example: hadoop fs -tail /data/file
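The behaviour is easy to picture with a local equivalent. This sketch reads the last 1 KB of a local file. It only mimics the idea and does not read from HDFS, and the function name is made up for the example.

```python
import os
import tempfile

def tail_bytes(path, num_bytes=1024):
    """Return the last num_bytes of a file (the whole file if it is smaller)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size - num_bytes))
        return f.read()

# Demo on a temporary 3003-byte file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 3000 + b"END")
    path = tmp.name

last_kb = tail_bytes(path)
os.remove(path)
print(len(last_kb), last_kb[-3:])  # 1024 b'END'
```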
How to check space availability in HDFS?
Ans: hadoop fs -df -h /path
What are the most common input formats in Hadoop?
Ans: Text input format (TextInputFormat), key-value input format (KeyValueTextInputFormat) and sequence file input format (SequenceFileInputFormat).
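The difference between the first two formats lies in how each line becomes a key and a value. The sketch below imitates this in plain Python: the text format uses the line's byte offset as the key, while the key-value format splits each line at the first separator (a tab by default). The function names are made up, and the binary sequence file format is not modelled here.

```python
def text_input_records(data: bytes):
    """Yield (byte offset, line) pairs, as the text input format does."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n").decode("utf-8")
        offset += len(line)

def key_value_records(data: bytes, separator="\t"):
    """Yield (key, value) pairs by splitting each line at the first separator."""
    for _, line in text_input_records(data):
        key, _, value = line.partition(separator)
        # A line without the separator becomes the key, with an empty value.
        yield key, value

sample = b"id1\talice\nid2\tbob\n"
print(list(text_input_records(sample)))  # [(0, 'id1\talice'), (10, 'id2\tbob')]
print(list(key_value_records(sample)))   # [('id1', 'alice'), ('id2', 'bob')]
```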
What is 'Oozie'?
Ans: 'Oozie' is a workflow scheduler that schedules Hadoop jobs and can bundle multiple jobs into a single logical unit of work. 'Oozie' jobs fall into two categories: 'Oozie workflow' jobs and 'Oozie coordinator' jobs.
What is Hive in Hadoop?
Ans: Hive is data warehouse software built on top of Hadoop. It facilitates reading, writing and managing large datasets residing in distributed storage using SQL. We can use SQL queries to check data stored in various tables.
What is Kerberos?
Ans: Hadoop uses Kerberos as the basis for strong authentication to access distributed services. It works based on the ticketing system that allows the user/service/process to communicate its identity with others in a distributed system or a cluster.
What are the main parts of Kerberos?
Ans: Kerberos has 3 parts: the client, the server and a trusted third party (the Key Distribution Center, or KDC) that mediates between them.
What is fsck?
Ans: It is an HDFS command to check the health of the Hadoop file system. Example: hdfs fsck /
What does the text command do in Hadoop?
Ans: It takes a source file and outputs the file in text format. Example: hdfs dfs -text /data/filename
These questions will be helpful in interviews as well as in the day-to-day activities of your support projects. If you want a deeper dive into Hadoop, I would suggest reading a couple of books. These books were really helpful for me, and I hope they will be for you as well.
I may not be correct all the time, so please let me know if you spot a mistake or want any additional Q&A added to this article. Feel free to comment below.
Thanks for reading!