These are my notes from the course given by UC San Diego at Coursera online course platform. Actually, I already know these concepts and worked on these a lot while I was working with a Hadoop cluster at Vodafone Turkey. But being refreshed with the training is always a good practice. Also, please consider that there is too many important information in the course, but these are the ones that I needed to write again.
Sqoop: Used for migrating the relational database to HDFS. It has a special command which connects to MySql database and initiates running MapReduce jobs for migrating RDMS data into HDFS. By defining parameters as avrofile and warehouse path, the data will be ready to analyze over Hive or Impala queries. But, we need to put automatically created schema files to the HDFS before running the queries and then create tables using those schemas.
Hive and Impala are both SQL like scripting languages that are used to query the HDFS data. Even though they use the same metadata files, the difference is that Hive is executing queries with MapReduce jobs; on the other hand, Impala directly performs data analysis on the HDFS files. As a result, Impala is faster in query execution than Hive.
Beeline enables to create a JDBC connection to Hive tables on the terminal (shell).
With Hadoop 2, there are multiple name nodes rather than a single node as it was in the first Hadoop. It increased namespace scalability. Each name node has its own block pools. Moreover, it brings High Availability feature for the Name Nodes and the Resource Manager (to overcome single point of failure). Also, HDFS can use additional storage types such as SSD and RAM_DISK.
Hadoop 1: MasterNode (JobTracker, NameNode), Compute-Datanodes (TaskTracker)
Hadoop 2: With YARN, job scheduling and resource management are separated. Now there is a Global Resource Manager. In each node, there is a Node Manager. And for each application, there is an Application Master. For each job submitted by the client, an Application Master is assigned in a Data Node, and that Application Master allocates containers from its own data node or from other data nodes. The containers are communicating with Application Master, and Application Master is communicating with Resource Manager. It reduces the workload of Resource Manager.
For the tasks that cannot be executed or can be executed but with high a cost (lots of mappers and reducers) with classical MapReduce approach, there are special engines, namely TEZ and SPARK. TEZ engine decreases overall mappers and reducers and enables faster processing. It also supports Directed Acycled Graphs (DAG). On the other hand, Spark enables advanced DAGs, and as well as Cyclic data flows. Spark jobs can be created with Java, Scala, Python and R. The most important benefit of Spark is that it enables in-memory computing, which increases the speed of iterative algorithms such as Machine Learning algorithms.
Each mapper is assigned for each block (default size 64MB). Information of each block is stored in the memory of Name Node. So, if we decrease block size from 64MB to 32MB, the memory need will be 2x. Also, it causes to inefficient I/O usage. In a writing process, the replication is done in a rack-aware manner.
Heartbeats are sent from data notes to name node.
Checksums are stored in HDFS namespace. The checksum is used when a reading request arrives at the data node. In that case, if there is a problem in checksum size, then the read operation is realized on a different replica of that target data.
Hadoop distcp command allows parallel transfer of files.
It is possible to mount of HDFS to local client (NFS). Thus, it enables upload/download files from HDFS, and also stream data to HDFS.
Flume for collecting the streaming data and move into HDFS, Sqoop for SQL to HDFS migration.
hdfs fsck hdfs_file_path_name command collects all information for a given file through name node. It gives information about how many block and replica the file has in total, and also how healthy are the blocks.
hdfs dfs -admin command collects all necessary information for that HDFS.
Spark enables resiliency by tracking all history of partitions. So, when a problem occurred, it finds the last successfully executed step, and it retries the following procedure.
glom command enables gathering data with its structure existing on its own partition.
as a different attribute from mapreduce, the data operations are done on the objects called “partitions” in Spark. Each transformation of partition is still kept on the node where partition and data exists.
coalesce enables reducing number of partitions. it is used generally after applying the filter operation. But it is working locally. When the partitions are located on different worker nodes, coalesce can reduce the partitions for each worker node. On the other hand, the command “repartition” is working node-independently and it makes the same thing by executing overall cluster. When data is very distributed, we need to use repartition command, with the target number of partition count, which means that it can also count of partition.
shuffle provides performance increase by redistributing data globally.
reduceByKey is the efficient way of grouping and summing operations consecutively. Avoid from using groupByKey due to the fact that groupByKey collects all data related to that key to a single node, and then performs sum. But reduceByKey initially sums up the values of a key on the worker node and then transfers the intermediate results to the final unique worker node, which increases the execution time.
when we create Directed Acyclic Graph (DAG) in Spark, the nodes are RDDs and the edges are transformations.
Spark automatically determines which parts of DAG could be executed in parallel.
action is the last step of DAG which retrieves results of DAG. example action commands: take, collect, reduce, saveAsTextFile.
caching enables reusable RDDs. It is generally used for iterative machine learning algorithms.
Broadcast variable enables global variables that can be used by all partitions. For a partition, if there is a variable declaration, then in each run, the variable will be re-created. The better way is to define that variable as broadcast variable. In the background, the broadcast variable is transferred to the Executor for access of all partitions accross all worker nodes.