MapReduce vs Spark (!)

People commonly make the mistake of comparing MapReduce with Spark.

Strictly speaking, MapReduce is a programming paradigm, so it cannot be compared directly with Spark. What we can compare is how Hadoop and Spark each apply the MapReduce paradigm.

In Hadoop MapReduce, each job has one Map and one Reduce phase, whereas in Spark several map and reduce steps can be chained together in a single job. Secondly, Hadoop MapReduce writes the output of each job to a file, while Spark keeps intermediate results in memory where possible. As a result, Spark reduces the overall execution time of a multi-step job.
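To make the difference concrete, here is a minimal sketch in plain Python (not Hadoop or Spark code). It chains two map/reduce steps, once by writing the first step's output to a file and reading it back, and once by passing it along in memory. The helper map_reduce and the two example jobs are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map phase and one reduce phase over an iterable of records."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sort by key so the output order is deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: count how often each word occurs.
def split_words(line):
    return [(word, 1) for word in line.split()]


def sum_counts(word, ones):
    return (word, sum(ones))


# Job 2: count how many distinct words share each frequency.
def by_frequency(pair):
    word, count = pair
    return [(count, word)]


def count_words(count, words):
    return (count, len(words))


def run_with_files(lines, workdir):
    """Disk-based chaining: job 1 writes a file, job 2 reads it back."""
    path = os.path.join(workdir, "job1_output.json")
    with open(path, "w") as f:
        json.dump(map_reduce(lines, split_words, sum_counts), f)
    with open(path) as f:
        intermediate = [tuple(pair) for pair in json.load(f)]
    return map_reduce(intermediate, by_frequency, count_words)


def run_in_memory(lines):
    """In-memory chaining: job 2 consumes job 1's result directly."""
    intermediate = map_reduce(lines, split_words, sum_counts)
    return map_reduce(intermediate, by_frequency, count_words)


if __name__ == "__main__":
    lines = ["a b a", "b c"]
    with tempfile.TemporaryDirectory() as workdir:
        print(run_with_files(lines, workdir))
    print(run_in_memory(lines))
```

Both functions return the same result. The only difference is the round trip through the file system between the two steps, which is the cost that in-memory chaining avoids.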

Big Data Analytics using Apache Pig

When you analyze data on Hadoop, Apache Pig is one of the simplest ways to load and transform it. An alternative is Apache Hive, which is easier for people who already know SQL. I have used both, but I find writing Pig scripts better because you can inspect your data at each step of the script. Moreover, Pig scripts are more human-readable than SQL-style code blocks (nested SQL, etc.).

In the last two years, I have written many Pig scripts, and I would like to share some tips about Pig scripting.

  • Use DEFINE macros to move the file-loading statements into a separate Pig file, for example Loader.pig, and IMPORT that file in your scripts.
  • When Pig does not provide the functionality you need, write your own User Defined Functions (UDFs) in Java. For example, if you need to compare object values or apply a custom sorting algorithm, you can write the logic in Java and call it from the Pig script. This greatly increases the flexibility of Apache Pig: once you are comfortable with Java UDFs, you can do almost anything by combining Java and Pig. The main challenge is keeping track of the objects passed into the UDF, but this gets easier with practice.
  • Parameter substitution is a prominent feature of Pig. With the %declare and %default statements, you can define custom variables inside a script. However, assigning values dynamically is a challenge.
  • Before running in MapReduce mode on the cluster, finish your tests in local mode (pig -x local) with a small amount of data, since waiting for script results on the cluster is inefficient.
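One way to approach the last two tips together is a small launcher that computes parameter values at run time and passes them to Pig with -param on the command line, defaulting to local mode for testing. The following is only a sketch in Python: build_pig_command, the script name daily_report.pig, and the parameter run_date are hypothetical, and the sketch only builds the command rather than executing it.

```python
import datetime
import shlex


def build_pig_command(script, params, local=True):
    """Return the argv list for running a Pig script with -param values."""
    # -x selects the execution mode: "local" for testing, "mapreduce" for the cluster.
    argv = ["pig", "-x", "local" if local else "mapreduce"]
    for name, value in sorted(params.items()):
        argv += ["-param", f"{name}={value}"]
    argv.append(script)
    return argv


if __name__ == "__main__":
    # run_date (hypothetical parameter) is computed when the launcher starts.
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    command = build_pig_command(
        "daily_report.pig", {"run_date": yesterday.isoformat()}
    )
    # Print the command; pass `command` to subprocess.run to actually execute it.
    print(shlex.join(command))
```

Inside the script, the value would then be available as $run_date. Switching local to False produces the same command for MapReduce mode once the local tests pass.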