Machine Learning Notepad
Monday, March 5, 2018
Introduction to ML
Machine Learning: a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E (Tom Mitchell's definition).
ML Algorithms:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
4. Recommender Systems
Support Vector Machine (SVM): allows the computer to deal with an effectively infinite number of features.
Regression: the goal is to predict a continuous-valued output.
Classification: the goal is to predict a discrete-valued output.
Supervised Learning: supervised learning is where you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output: Y = f(X).
Unsupervised Learning: unsupervised learning is where you only have input data (X) and no corresponding output variables. The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about it.
Sunday, November 13, 2016
Supervised and Unsupervised Learning
Supervised Learning
In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
Example:
Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.
We could turn this example into a classification problem by instead making our output about whether the house "sells for more or less than the asking price." Here we are classifying the houses based on price into two discrete categories.
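As a minimal sketch of the regression case, the snippet below fits price as a linear function of size by ordinary least squares; the single "size" feature and all the numbers are made-up assumptions for illustration, and the fitted line is just one possible f in Y = f(X).

public class HousePriceRegression {
    public static void main(String[] args) {
        // Made-up training data: house size (sq. ft.) and price (in $1000s).
        double[] size  = {1000, 1500, 2000, 2500, 3000};
        double[] price = {200, 280, 370, 450, 540};

        int n = size.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += size[i]; meanY += price[i]; }
        meanX /= n; meanY /= n;

        // Closed-form least-squares estimates for slope w and intercept b.
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (size[i] - meanX) * (price[i] - meanY);
            sxx += (size[i] - meanX) * (size[i] - meanX);
        }
        double w = sxy / sxx;
        double b = meanY - w * meanX;

        // Predicting a continuous output for an unseen input: this is regression.
        System.out.printf("price(1800 sq ft) ~ %.1f (in $1000s)%n", w * 1800 + b);
    }
}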
Unsupervised Learning
Unsupervised learning, on the other hand, allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
We can derive this structure by clustering the data based on relationships among the variables in the data.
With unsupervised learning there is no feedback based on the prediction results, i.e., there is no teacher to correct you. It’s not just about clustering. For example, associative memory is unsupervised learning.
Example:
Clustering: Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number of groups that are somehow similar or related by different variables, such as word frequency, sentence length, page count, and so on (see the sketch below).
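To make the clustering example concrete, here is a toy k-means sketch. The two features per essay (word frequency, average sentence length), the feature values, and the choice of k = 2 are all made-up assumptions for illustration.

import java.util.Arrays;

public class EssayClustering {
    public static void main(String[] args) {
        // Each row is one essay: {word frequency, avg sentence length} (made-up values).
        double[][] essays = {
            {0.12, 14}, {0.11, 15}, {0.31, 22}, {0.29, 21}, {0.30, 23}, {0.13, 13}
        };
        int k = 2;
        double[][] centers = { essays[0].clone(), essays[2].clone() }; // naive init
        int[] label = new int[essays.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: each essay joins its nearest center.
            for (int i = 0; i < essays.length; i++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = dist(essays[i], centers[c]);
                    if (d < best) { best = d; label[i] = c; }
                }
            }
            // Update step: each center moves to the mean of its assigned essays.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[2];
                int count = 0;
                for (int i = 0; i < essays.length; i++) {
                    if (label[i] == c) { sum[0] += essays[i][0]; sum[1] += essays[i][1]; count++; }
                }
                if (count > 0) { centers[c][0] = sum[0] / count; centers[c][1] = sum[1] / count; }
            }
        }
        System.out.println("cluster labels: " + Arrays.toString(label));
    }

    // Squared Euclidean distance (enough for comparisons).
    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }
}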
Associative: Suppose a doctor, over years of experience, forms associations in his mind between patient characteristics and the illnesses patients have. When a new patient shows up, the doctor associates possible illnesses based on the patient's characteristics (symptoms, family medical history, physical attributes, mental outlook, etc.) and on what the doctor has seen before in similar patients. This is not the same as rule-based reasoning as in expert systems; in this case we would like to estimate a mapping function from patient characteristics to illnesses.
Tuesday, November 8, 2016
Hive DML (Insert/Update/Delete)
How to check the Hive version:
Command: ps aux | grep -i hive
Then find the text "/lib/hive/lib/hive-service-<VERSION_NUMBER>" in the command output.
Output: /opt/cloudera/parcels/CDH-5.5.5-1.cdh5.5.5.p0.3/lib/hive/lib/hive-service-1.1.0-cdh5.5.5.jar
Here the Hive version is 1.1.0. (On many installations, hive --version reports it directly as well.)
If your Hive version is 0.14 or greater, then you can execute DML operations (insert/update/delete) on Hive tables.
However, there are limitations:
- The table you work with must be bucketed, declared as ORC format, and have 'transactional'='true' in its table properties (Hive supports ACID operations only for ORC-format, transactional tables). An example of a proper table is shown in the SQL below, kept handy for reference.
1. create table testTableNew (id int, name string) clustered by (id) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
2. insert into table testTableNew values (1,'row1'),(2,'row2'),(3,'row3');
3. update testTableNew set name = 'updateRow2' where id = 2;
4. delete from testTableNew where id = 1;
5. select * from testTableNew;
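If you prefer to drive these statements from code, here is a rough sketch using the standard Hive JDBC driver (the hive-jdbc dependency); the host, port, and credentials are placeholders for your own cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDmlExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder connection URL and credentials; adjust for your cluster.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            stmt.execute("insert into table testTableNew values (1,'row1'),(2,'row2'),(3,'row3')");
            stmt.execute("update testTableNew set name = 'updateRow2' where id = 2");
            stmt.execute("delete from testTableNew where id = 1");
            try (ResultSet rs = stmt.executeQuery("select * from testTableNew")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
                }
            }
        }
    }
}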
Monday, August 29, 2016
Kafka Sample Program
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.4.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.4.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-twitter_2.10</artifactId>
<version>1.4.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.4.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>com.typesafe</groupId>
<artifactId>scalalogging-slf4j_2.10</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>com.twitter</groupId>
<artifactId>hbc-core</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.hive.hcatalog</groupId>
<artifactId>hive-hcatalog-streaming</artifactId>
<version>0.13.0</version>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>0.9.1-incubating</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.9.0.0</version>
</dependency>
</dependencies>
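With the kafka-clients dependency above, a minimal producer sketch looks like the following; the broker address and the topic name ("test-topic") are placeholders, not values from the original post.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleKafkaProducer {
    public static void main(String[] args) {
        // Placeholder broker address; point this at your Kafka cluster.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Send ten small string messages to a placeholder topic, then close.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("test-topic", Integer.toString(i), "message-" + i));
            }
        }
    }
}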
Friday, June 3, 2016
UBER Mode in 2.X
UBER mode in 2.X:
Normally, mappers and reducers are run by the ResourceManager (RM), which creates a separate container for each mapper and each reducer. The uber configuration allows the mappers and reducers to run in the same process as the ApplicationMaster (AM).
Uber jobs:
Uber jobs are jobs that are executed within the MapReduce ApplicationMaster rather than asking the RM to create separate mapper and reducer containers. The AM runs the map and reduce tasks within its own process, avoiding the overhead of launching and communicating with remote containers.
Why is it so:
If you have a small dataset, or you want to run MapReduce on a small amount of data, the uber configuration helps by cutting out the extra time MapReduce normally spends launching containers for the map and reduce phases.
Can I configure uber mode for every MapReduce job?
As of now, only map-only jobs and jobs with a single reducer are supported.
In other words:
An uber job runs when the map and reduce tasks are combined to share a single container, the AM's own. There are four core settings for configuring uber jobs in mapred-site.xml (a driver-side sketch follows the list):
mapreduce.job.ubertask.enable
mapreduce.job.ubertask.maxmaps
mapreduce.job.ubertask.maxreduces
mapreduce.job.ubertask.maxbytes
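These same properties can also be set programmatically from a job driver, as in the rough sketch below; the threshold values are illustrative assumptions (the Hadoop 2.x defaults are 9 maps, 1 reduce, and one HDFS block of input).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberModeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true);
        conf.setInt("mapreduce.job.ubertask.maxmaps", 4);      // illustrative: at most 4 map tasks
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);   // uber supports at most 1 reducer
        conf.setLong("mapreduce.job.ubertask.maxbytes", 64L * 1024 * 1024); // illustrative: input <= 64 MB

        Job job = Job.getInstance(conf, "small-job-in-uber-mode");
        // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true)
    }
}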
Thursday, June 2, 2016
Map Reduce Phases
Map -> Combiner -> Partitioner -> Sort -> Shuffle -> Sort -> Reduce
The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair.
The combine phase is done by combiners. A combiner should combine key/value pairs with the same key together. Each combiner may run zero, one, or multiple times.
The shuffle and sort phase is done by the framework. Data from all mappers is grouped by key, split among reducers, and sorted by key. Each reducer obtains all values associated with the same key. The programmer may supply a custom compare function for sorting and a Partitioner for the data split.
The Partitioner decides which reducer will get a particular key/value pair.
The reducer obtains key/[value list] pairs, sorted by key, where the value list contains all values produced by mappers for that key. Each reducer emits zero, one, or multiple output key/value pairs for each input pair.
(A diagram of these phases, copied from the Hadoop book, originally appeared here.)
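To tie the phases together, here is the classic word-count job (the textbook example, not anything specific to this post): the mapper emits (word, 1) pairs, the combiner pre-sums counts on the map side, the framework shuffles and sorts by key, and the reducer produces the final counts.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // zero, one, or many pairs per input record
            }
        }
    }

    // Used as both combiner and reducer: summing is associative, so combining
    // partial counts on the map side gives the same final result.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // the combine phase
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}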