Turkey Elections 2018

Turkey is a very special country, and now it is time to vote for the next President. For the last 16 years, the country has been run by a single person, the leader of a specific community in Turkey: people who are not educated, or who want to grow their businesses by staying close to that President. In every election there have been many irregularities, but according to the announced results, despite all the scandals, that President managed to sustain his leadership. That single man has done whatever he wanted in the country for the last 16 years. As a result, industry, manufacturing, agriculture, technology firms, startups, academics, teachers, in short everyone involved in the economy, have seen their businesses fail. Today, the country imports everything from other countries, from straw to potatoes and onions. When I talk with Italian people here, they cannot believe what they hear, because they know Turkey for its unique, beautiful landscape.

Today is the day of the new elections, and now there is a new candidate who has gained the support and hopes of millions. He is a former physics teacher who is determined to end the dictatorship of the last 16 years. It is not me who calls the current president a dictator; there are 1.2 million results in Google when you search his name together with the word “dictator” (link Google search). We strongly believe that the former physics teacher will save Turkey. At his latest rally in Istanbul, there were more than 6 million participants. Today is a very big day for the prosperity of Turkey. We will see the results together, and we hope for a bright future for the next generations of Turkey.

And this is a summary of the current situation in Turkey.

https://www.express.co.uk/news/world/978594/Turkey-elections-2018-erdogan-muharrem-ince-president-latest-polls-results

 

Notes: Hadoop Platform and Application Framework

These are my notes from the course given by UC San Diego on the Coursera online platform. I actually already knew these concepts and worked with them a lot on a Hadoop cluster at Vodafone Turkey, but refreshing them with a training is always good practice. Also, please note that there is much more important information in the course; these are just the points I needed to write down again.

Coursera link

Sqoop: Used for migrating relational databases to HDFS. It has a special command that connects to a MySQL database and launches MapReduce jobs to migrate the RDBMS data into HDFS. By setting parameters such as the Avro file format and the warehouse path, the data becomes ready to analyze with Hive or Impala queries. However, before running the queries, we need to put the automatically generated schema files into HDFS and then create the tables using those schemas.

Hive and Impala are both SQL-like query languages used to query HDFS data. Even though they use the same metadata, the difference is that Hive executes queries as MapReduce jobs, whereas Impala performs the data analysis directly on the HDFS files. As a result, Impala executes queries faster than Hive.

Beeline enables creating a JDBC connection to Hive tables from the terminal (shell).

With Hadoop 2, there are multiple NameNodes rather than the single one of the first Hadoop, which increases namespace scalability. Each NameNode has its own block pool. Moreover, Hadoop 2 brings a High Availability feature for the NameNodes and the Resource Manager (to overcome the single point of failure). Also, HDFS can use additional storage types such as SSD and RAM_DISK.

Hadoop 1: MasterNode (JobTracker, NameNode), Compute-Datanodes (TaskTracker)

Hadoop 2: With YARN, job scheduling and resource management are separated. Now there is a global Resource Manager. On each node there is a Node Manager, and for each application there is an Application Master. For each job submitted by a client, an Application Master is assigned on a Data Node, and that Application Master allocates containers from its own data node or from other data nodes. The containers communicate with the Application Master, and the Application Master communicates with the Resource Manager, which reduces the workload of the Resource Manager.

For tasks that cannot be executed with the classical MapReduce approach, or can only be executed at a high cost (lots of mappers and reducers), there are special engines, namely Tez and Spark. The Tez engine decreases the overall number of mappers and reducers and enables faster processing. It also supports Directed Acyclic Graphs (DAGs). Spark, on the other hand, enables advanced DAGs as well as cyclic data flows. Spark jobs can be written in Java, Scala, Python and R. The most important benefit of Spark is that it enables in-memory computing, which speeds up iterative algorithms such as machine learning algorithms.

One mapper is assigned to each block (default size 64 MB). Information about each block is stored in the memory of the NameNode. So, if we decrease the block size from 64 MB to 32 MB, the memory need doubles. It also causes inefficient I/O usage. During a write, replication is done in a rack-aware manner.
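
A quick back-of-the-envelope check of that 2x figure (the file size and the per-block metadata cost below are assumed values, purely for illustration):

    # Halving the block size doubles the number of blocks, and therefore
    # the NameNode memory needed to track them.
    file_size_mb = 1024            # hypothetical 1 GB file
    meta_bytes_per_block = 150     # assumed metadata cost per block entry

    for block_size_mb in (64, 32):
        blocks = file_size_mb // block_size_mb
        print(f"{block_size_mb} MB blocks -> {blocks} blocks, "
              f"~{blocks * meta_bytes_per_block} bytes of NameNode metadata")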

Heartbeats are sent from the data nodes to the name node.

Checksums are stored in the HDFS namespace. The checksum is verified when a read request arrives at the data node. If the checksum does not match, the read operation is performed on a different replica of the target data.

Hadoop distcp command allows parallel transfer of files.

It is possible to mount HDFS on a local client (via NFS). This makes it possible to upload/download files from HDFS, and also to stream data into HDFS.

Flume for collecting streaming data and moving it into HDFS; Sqoop for SQL-to-HDFS migration.

The hdfs fsck hdfs_file_path_name command collects all the information about a given file through the NameNode. It reports how many blocks and replicas the file has in total, and how healthy the blocks are.

The hdfs dfsadmin command (for example, hdfs dfsadmin -report) collects all the necessary administrative information about the HDFS cluster.

Spark provides resiliency by tracking the whole lineage of the partitions. So, when a problem occurs, it finds the last successfully executed step and re-executes the subsequent steps.

The glom() transformation gathers the data of each partition into a list, preserving the partitioning structure.
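
A minimal PySpark sketch (the data and the partition count are arbitrary):

    from pyspark import SparkContext

    sc = SparkContext(appName="glom-demo")
    rdd = sc.parallelize(range(10), 3)   # 3 partitions
    print(rdd.glom().collect())          # e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
    sc.stop()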

Unlike MapReduce, data operations in Spark are performed on objects called “partitions”. Each transformation of a partition is kept on the node where the partition and its data reside.

coalesce reduces the number of partitions. It is generally used after applying a filter operation, but it works locally: when the partitions are located on different worker nodes, coalesce only merges the partitions within each worker node. The repartition command, on the other hand, works node-independently and does the same thing across the whole cluster. When the data is very scattered, we need to use repartition with the target number of partitions, which means it can also increase the partition count.
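
A small PySpark sketch of the difference (all numbers are made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-demo")
    rdd = sc.parallelize(range(1000), 100)
    filtered = rdd.filter(lambda x: x % 10 == 0)   # leaves many nearly empty partitions

    local_merge = filtered.coalesce(10)      # merges partitions without a full shuffle
    global_merge = filtered.repartition(10)  # full shuffle, rebalances across the cluster

    print(local_merge.getNumPartitions(), global_merge.getNumPartitions())
    sc.stop()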

A shuffle provides a performance increase by redistributing the data globally across the cluster.

reduceByKey is the efficient way to group and sum consecutively. Avoid using groupByKey, because groupByKey collects all the data related to a key on a single node and only then performs the sum. reduceByKey, in contrast, first sums the values of each key on the worker node and then transfers only the intermediate results to the final worker node, which decreases the execution time.
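
A minimal word-count-style sketch contrasting the two (the data is made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="reduce-demo")
    pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)])

    # Preferred: partial sums are computed on each worker before the shuffle.
    sums = pairs.reduceByKey(lambda x, y: x + y)

    # Works, but ships every single value of a key across the network first.
    sums_slow = pairs.groupByKey().mapValues(sum)

    print(sums.collect(), sums_slow.collect())
    sc.stop()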

When we create a Directed Acyclic Graph (DAG) in Spark, the nodes are RDDs and the edges are transformations.

Spark automatically determines which parts of the DAG can be executed in parallel.

An action is the last step of a DAG; it retrieves the results of the DAG. Example actions: take, collect, reduce, saveAsTextFile.
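
A small PySpark sketch where the transformations only build the DAG and the actions at the end trigger its execution (the data and the output path are hypothetical):

    from pyspark import SparkContext

    sc = SparkContext(appName="dag-demo")

    # Transformations only build the DAG; nothing runs yet.
    lines = sc.parallelize(["to be", "or not", "to be"])
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Actions trigger execution of the DAG.
    print(counts.take(2))
    print(counts.collect())
    print(counts.map(lambda kv: kv[1]).reduce(lambda a, b: a + b))
    # counts.saveAsTextFile("hdfs:///tmp/wordcounts")  # hypothetical output path
    sc.stop()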

Caching enables reusable RDDs. It is generally used for iterative machine learning algorithms.
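
A rough sketch of the idea, with a simple loop standing in for an iterative ML algorithm (all values are arbitrary):

    from pyspark import SparkContext

    sc = SparkContext(appName="cache-demo")
    points = sc.parallelize(range(100000)).map(lambda x: x / 100000.0)
    points.cache()               # keep the RDD in memory across iterations

    total = 0.0
    for _ in range(10):          # stand-in for an iterative ML loop
        total += points.sum()    # without cache(), the RDD would be recomputed every pass
    print(total)
    sc.stop()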

A broadcast variable is a global, read-only variable that can be used by all partitions. If a variable is declared for a partition, it is re-created on every run; the better way is to define that variable as a broadcast variable. In the background, the broadcast variable is transferred to the executors so that all partitions across all worker nodes can access it.
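
A minimal sketch, assuming a small lookup table as the shared variable (the table content is made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="broadcast-demo")

    # Shipped once to each executor instead of being re-created for every task.
    lookup = sc.broadcast({"TR": "Turkey", "IT": "Italy"})

    codes = sc.parallelize(["TR", "IT", "TR"])
    names = codes.map(lambda c: lookup.value.get(c, "unknown"))
    print(names.collect())
    sc.stop()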

Coursera – Deep Learning

Online courses are very important tools for improving our Data Science skills. Two years ago, I followed Andrew Ng’s Machine Learning course and learned a lot from Prof. Ng’s clear teaching style. Although that course had a neural networks chapter, it was based on the Octave programming language. Now I have started to follow Prof. Ng’s Deep Learning course, which is shorter than the previous one; it takes only 4 weeks to complete. One of its advantages is that it is Python based, which is the most popular Data Science programming language today. The instructors have prepared Jupyter notebooks that run on the Coursera website, which simplifies the programming part without requiring you to install packages on your local computer. I have now completed Week 3, and I strongly recommend anyone interested in Data Science to register for this valuable online course.

 

Privacy-first onsite Data Analysis for Facebook apps

I was thinking this morning, especially after the recent Cambridge Analytica scandal on Facebook, that there should be a new kind of privacy-first data analysis process in Facebook that does not share data with external companies. In the current system, the flow is: the user accepts the permission request of the application, and then the app owner collects the data on its own platform, analyzes it, sells it to another firm, and so on. Instead, the data analysis task should be executed under the control of Facebook. In the new system, when an app asks for permission, Facebook would alert the user: this app wants to analyze your data; we will never share the data with it; the analysis it wants to run will be executed on our platform; we have reviewed and controlled its code (like the Apple App Store review process); we will share the result of the analysis (for example, that you support the conservative party at 80%) with both you and the app owner; and the app owner will also tell you how and for what purpose it will use this result.

Comparison of two face recognition software: Clarifai and Face++

Recently, I tried several products to extract demographic information from a profile image. My goal was to obtain information about age, gender, and ethnicity. I found that the prominent companies in this field are Clarifai and Face++. I integrated my trial software with both products, and I found Clarifai’s accuracy better than Face++’s. My reasons are:

  1. Clarifai provides the probability value of its predictions (e.g., the predicted gender is female with a probability of 52%), so it is possible to eliminate results with a low prediction score (see the small sketch after this list). In contrast, Face++ does not provide that value. This is an unwanted situation because, in a binary classification, the prediction always returns a result, even when its score is not very high.
  2. Clarifai correctly predicted the ethnicity of the image below as “White”, while Face++ wrongly predicted it as “Black”. On the other hand, Clarifai could not predict the gender correctly (female 51%, male 49%), while Face++ correctly marked it as male (we don’t know its probability).
  3. The disadvantage of Clarifai is its low quota for free usage: it permits only 2,500 API calls per month for free accounts. Face++ does not specify any upper limit for free accounts; its only limitation is one API call per second.
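
Here is that sketch: a trivial filter on low-confidence predictions. The records, field names, and threshold below are my own simplified examples, not the exact response schema of either API.

    # Hypothetical, simplified prediction records; the real API responses differ.
    predictions = [
        {"image": "1.jpg", "gender": "female", "probability": 0.52},
        {"image": "2.jpg", "gender": "male",   "probability": 0.91},
    ]

    THRESHOLD = 0.75   # arbitrary cut-off, chosen only for illustration
    confident = [p for p in predictions if p["probability"] >= THRESHOLD]
    print(confident)   # only the 0.91 prediction survives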

I hope my hands-on experience with these services will help you choose the right product.

 

 

Result of Clarifai: (https://clarifai.com/demo)

Gender: feminine (prob. score: 0.510), masculine (prob. score: 0.490)
Age: 55 (prob. score: 0.356)
Ethnicity (Multicultural appearance): White (prob. score: 0.981)

Result of Face++: (https://www.faceplusplus.com/attributes/#demo)

Gender: male
Age: 53
Ethnicity (Multicultural appearance): Black

Converting texts to high-res images

A very inspiring piece of research was published at the end of 2016. With the help of deep learning, it is now possible to generate images from given texts.

Here is the link to the news and here is the link to that research paper.

Can you imagine some use cases based on this technology? I found an interesting one. Imagine you are at a police station after a bank robbery. The thief could not be found, and, as the only eyewitness of the event, you describe the thief’s visual profile. A computer then automatically generates an image of the thief based on the visual details you give, and at the same time it increases the precision of that image by matching it with the records of past robberies.

The Future of Education

Within the last month, the future of education was one of the main topics at Davos. There were very interesting debates, and in one of them, Jack Ma (the founder of Alibaba) said that the current education system strongly and urgently needs to change due to the rising impact of robots. Since robots are able to acquire knowledge by learning from their past experiences, they will do most of the things people do today. In order to adapt ourselves to the modern world, we need to educate our children in ways that cannot be copied by robots. Rather than teaching our children only mathematics or physics, we should support their more humanistic skills, such as music and art.

I agree with Jack Ma’s ideas, and I think we need to think more about people’s main advantages and disadvantages compared with robots over the next 20 years. Today, our children start learning to code in primary school in order to communicate better with robots and understand their logic. But when the world is dominated by robot activities, everything will change, and humans should be in a position where robots do not see them as a threat.

 

 

Mongo as Document based No-SQL Database

I started the MongoDB developer course given online by MongoDB University. I worked a lot with Mongo at Vodafone, but I was using only 10–20% of its key features. Now at Politecnico, things are more complex, so I need to pay more attention to performance issues. In my research project, I use MongoDB to store tweets and perform text analysis over the records.
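
A rough pymongo sketch of how I store tweets and run a simple keyword search (the database, collection, and field names below are placeholders, not my actual schema):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed local instance
    tweets = client["research_db"]["tweets"]           # placeholder names

    tweets.insert_one({"user": "someone", "text": "Thoughts on the Brexit referendum"})
    tweets.create_index([("text", "text")])            # text index for keyword search

    for doc in tweets.find({"$text": {"$search": "brexit"}}):
        print(doc["text"])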

I have just completed Week 1 of the course. I hope I will learn more in the upcoming weeks.

In which apps do I have premium membership?

Busuu – Practice in multiple languages, latest design is cool

Pacer – Good idea, weak implementation, bot accounts

Spotify – No ads, high sound quality, great playlists

Netflix – I think there is no need to explain what this app offers; it is quite good at streaming movies and series

Amazon Prime – No cargo fees, fast delivery, special discounts

Grammarly – Life companion in typing (I am not premium yet)

Understanding the Feelings and Behaviours of English People about the Brexit Referendum

These days, my research goal is to find insights by analyzing Twitter data to understand how English people react to the Brexit referendum. Various studies have already been made on this topic, most of them by universities in England such as Imperial College London and the University of Bristol. I find it quite an interesting research topic, since social media is an important environment for presenting our ideas to the community and there is a need for more research to understand people’s opinions. I will give more detailed information about my study in the upcoming weeks. If you have any recommendations for me, please feel free to send me an email.