If you are familiar with neural networks, then deep learning is nothing but big, fat neural networks with lots of non-linear layers in them. If you are not, then I’ll explain what these things are.

Let’s go back a few years, say 50. The ultimate goal of AI is to create a machine that can replace a human, in the sense that it can talk like a human, ‘listen’ like a human, ‘perceive’ like a human and reason like a human. Scientists first tried rule-based engines: you ask the machine a question, it consults the ‘rule book’ and answers accordingly. But this is not a good solution. What if the question asked is not in the ‘book’? Scientists were puzzled by this problem, so they asked: why look for novel ways to design an artificial intelligence system when such a system already exists? Our brain. It is the natural machine to study and try to emulate in order to build an AI system.

They started studying the architecture of the brain and discovered that it is a big ‘neural network’, with billions and billions of neurons connected to each other and interacting in complex ways to make us ‘intelligent’. They began simulating the same architecture in hardware and named it the ‘Perceptron’ model. It gave good empirical results on some tasks, and they declared that they had found a perfect solution to the AI problem.

Mathematically, a perceptron is a ‘linear’ system that separates points belonging to two classes with a straight line (a hyperplane in higher dimensions). For example, in a face recognition problem the two classes are ‘Face’ and ‘Non-Face’, and the points are images.
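To make the ‘straight line’ idea concrete, here is a minimal sketch of the classic perceptron learning rule in plain Python. The toy points, the labels (+1 / -1) and the function names are my own assumptions for illustration, not anything from a real face dataset.

```python
def train_perceptron(points, labels, epochs=20, lr=1.0):
    """Learn weights w and bias b so that sign(w.x + b) matches labels (+1 / -1)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            predicted = 1 if activation > 0 else -1
            if predicted != y:
                # Misclassified: nudge the line towards the correct side.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Classify x by which side of the line w.x + b = 0 it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical toy data: two clusters that a straight line can separate.
points = [(2.0, 3.0), (3.0, 3.5), (2.5, 4.0), (-1.0, -2.0), (-2.0, -1.5), (-1.5, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(points, labels)
predictions = [predict(w, b, p) for p in points]
print(w, b, predictions)
```

The learned `w` and `b` define the separating line `w[0]*x + w[1]*y + b = 0`; on data that a line can separate, the rule stops changing the weights once every point is on the right side.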

Back to the past: scientists were happy to have created a perfect AI solution. But it was soon shown that the system is flawed: problems in which the two classes cannot be separated by a line (the XOR function is the classic example) cannot be solved by it. So scientists decided to get around this by stacking multiple such units in layers. This is how the neural network was created.
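As a small illustration of why stacking helps, here is a sketch of a two-layer network of threshold units that computes XOR, a function no single line can separate. The weights are set by hand purely for illustration (they are my assumption, not learned values): one hidden unit acts as OR, the other as AND, and the output fires when OR is on but AND is off.

```python
def step(z):
    """Threshold unit: fires (1) when its input is positive."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    """Two stacked layers of threshold units with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)    # hidden unit 1: x1 OR x2
    h_and = step(x1 + x2 - 1.5)   # hidden unit 2: x1 AND x2
    return step(h_or - h_and - 0.5)  # output: OR but not AND


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_net(a, b) for a, b in inputs]
print(outputs)
```

Each unit on its own is still a ‘line’; it is the composition of the layers that produces a decision boundary a single perceptron cannot.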

This is a typical neural network with three layers; each layer transforms its input using some ‘non-linear’ function. To train the network we tune the weight carried by each edge between two nodes, because the mapping the network computes depends on those weights. I won’t go into the mathematical details, which can seem boring; I’d rather give an intuition for how it works.

In my previous post I mentioned that to fit a model to some points we assume a parametric form, such as a line or a circle, and then use the data to find the parameters of that model. A big flaw in this approach is that we fix the underlying structure in advance and only tune its parameters. That is not a good way to approximate an arbitrary function.

If you remember the Fourier series expansion, a (periodic) function can be approximated by an infinite sum of sine and cosine functions of different frequencies. The approximation gets better as you include more terms of the basis. But using infinitely many basis functions is not possible, so to work around that we can make the basis functions themselves adaptable. That means the basis functions themselves depend on the data.
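Here is a rough sketch of the ‘more terms, better approximation’ point, using the well-known Fourier series of a square wave, (4/π)·Σ sin(kx)/k over odd k. The choice of a square wave and of the evaluation point x = π/2 (where the wave equals 1) are my own assumptions for the example.

```python
import math


def square_wave_partial_sum(x, n_terms):
    """Sum of the first n_terms sine terms of a square wave's Fourier series."""
    total = 0.0
    for i in range(n_terms):
        k = 2 * i + 1  # only odd harmonics appear in this series
        total += math.sin(k * x) / k
    return 4.0 / math.pi * total


x = math.pi / 2  # the square wave equals 1 at this point
errors = {n: abs(square_wave_partial_sum(x, n) - 1.0) for n in (1, 10, 100)}
for n, err in errors.items():
    print(n, "terms -> error", round(err, 4))
```

The error shrinks as terms are added, but slowly, which is exactly why a fixed basis needs many terms and why an adaptable basis is attractive.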

The ‘non-linear’ functions I mentioned are in fact the basis functions used for function approximation in neural networks. They depend on the data through the weights of the edges coming into them.

It was around this time that Geoffrey Hinton and his colleagues popularized the backpropagation algorithm for training neural networks. We have reached the 1980s. After a lot of hype, scientists suddenly backed away from neural networks because of their heavy computation and data needs.

Now let’s jump to the mid-2000s. Thanks to the internet we have loads of data available, and thanks to advances in hardware we have very powerful machines too. Geoffrey Hinton showed around this time that big neural networks can be trained efficiently using lots of data and computation. Scientists started experimenting with this, and that is when they came up with ‘Deep Learning’. As I explained earlier, neural networks are multiple perceptrons stacked together.

When you stack many layers like this, instead of only one ‘hidden layer’, the result is called a deep network, and the learning performed by such a machine is called deep learning. So you can see this is no fancy stuff. It is just a plain neural network with a big architecture.

Of course there are many complications associated with it. It is difficult to train such big networks even with a lot of computing power. This is where all the engineering comes in, and all the popular architectures are workarounds for this problem.

One more reason for the popularity of neural networks is that they can even learn ‘feature detection’ from the data. The best example of this is the Convolutional Neural Network, a big breakthrough in the field of computer vision. Computer vision as a whole depends on ‘good’ feature detectors for images, and computer scientists spent decades finding good feature detectors, which were essentially hand-crafted. Deep networks, on the other hand, can learn good features for the task automatically from the data. This is a big plus for machine learning researchers: they don’t have to design features by hand. They can just use a neural network for that.

This is deep learning in a nutshell. Please comment with your observations and/or any discrepancy you find in this article.

Cheers!

]]>

This post is an extension of the post I wrote last time. In the last post I talked a bit about linear algebra. In this post I’ll talk about probability and its importance in machine learning. With the recent advances in statistical machine learning theory, probability has become the most powerful tool for analyzing machine learning models.

Now what is probability? Technically, probability is a mathematical framework for dealing with uncertainty; it is a way to quantify the uncertainty in an event. In machine learning settings uncertainty is inherent. The reasons are obvious: real-world data sets have lots of noise in them, and the data generation process can also be biased. So we need a framework to analyse this uncertainty and then make decisions accordingly. I’ll explain how this is used in machine learning to quantify uncertainty.

But before digging into probability for machine learning, I’ll stress the importance of one particular class of functions used very frequently in machine learning: the Gaussian distribution. For anyone studying machine learning this is the most common term they’ll encounter. The Gaussian distribution is probably (with very high probability :D) the most important distribution in probability theory. It is very commonly used to model noise in the data. The reason is the Central Limit Theorem in statistics. This theorem states that the sum of a large number of independent random variables (each with finite variance) is approximately Gaussian, whatever the distribution of the individual variables. Now noise is generally the sum of many random events such as human error, error in the recording device, etc. If we consider each of these events as a random variable, then their sum will be approximately Gaussian. This is the reason why the Gaussian is so commonly used to model noise.
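The Central Limit Theorem is easy to check empirically. The sketch below is only an illustration under my own assumptions: each ‘noise source’ is a uniform random number between 0 and 1, 30 of them are added together, and the experiment is repeated 20,000 times. If the sums are roughly Gaussian, about 68% of them should fall within one standard deviation of the mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
N_TERMS = 30            # number of independent noise sources added together
N_SAMPLES = 20000       # number of times the sum is observed

sums = [sum(rng.random() for _ in range(N_TERMS)) for _ in range(N_SAMPLES)]

mean = statistics.mean(sums)
std = statistics.pstdev(sums)
within_one_std = sum(abs(s - mean) <= std for s in sums) / N_SAMPLES

print("mean:", round(mean, 3))   # theory: 30 * 0.5 = 15
print("std:", round(std, 3))     # theory: sqrt(30 / 12), about 1.58
print("fraction within one std:", round(within_one_std, 3))  # Gaussian: about 0.68
```

No single uniform number looks anything like a bell curve, yet the sums do, which is the whole point of the theorem.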

Coming back to probability, I’ll illustrate its role with an example. Consider the problem of predicting the employability of candidates based on their credentials. The input to the model is the credentials of the candidate (grades, experience, relevant projects, referrals, etc.) and the target is an ordinal variable (its values are ordered, with the lowest number meaning the least employable and the highest number the most).

This is a toy example. On the x-axis is the input variable (consider it to be the grades of the candidates and assume that, for some weird reason, the trend is sinusoidal) and on the y-axis is the output (employability). Consider the thin blue line in the middle to be the ideal trend; due to errors, the observed Y values oscillate about that line and the readings are corrupted. Probability is used to model this uncertainty: for any point X0 on the x-axis we define a Gaussian distribution over Y, conditioned on X = X0, to accommodate the variation in the Y values.
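In code, ‘a Gaussian over Y conditioned on X’ can be sketched as follows. I am assuming the ideal trend is exactly sin(x) and that the noise has a standard deviation of 0.2; both numbers are made up for illustration and are not read off the figure.

```python
import math


def gaussian_pdf(y, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def conditional_density(y, x, sigma=0.2):
    """p(y | x) for the toy model y = sin(x) + Gaussian noise (sigma is assumed)."""
    return gaussian_pdf(y, math.sin(x), sigma)


x0 = 1.0
on_trend = conditional_density(math.sin(x0), x0)          # y exactly on the trend line
off_trend = conditional_density(math.sin(x0) + 0.5, x0)   # y well away from it

print(round(on_trend, 4), round(off_trend, 4))
```

Values of Y near the trend line get a high density and values far from it get a low one, which is how the model expresses ‘probably close to the line, but not exactly on it’.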

Consider another problem, where we are given the X-ray of a patient and our goal is to predict whether he has a fracture or not. We can have training examples that cover a lot of the possible variations of X-rays, some with fractures and some without. For obvious reasons we cannot capture every variation possible in an X-ray, so we train a machine learning model that takes an X-ray as input and outputs the probability of a fracture given that input. This is the aim of machine learning, and it is important to realize: given ‘enough’ training examples, learn the pattern they follow and then make predictions accordingly (the pattern here is what a fractured X-ray looks like). Of course there are more technical details to learn in probability, which I’ll talk about in the next post. This was just to get a feel for the role of probability in machine learning.

Cheers!!

]]>

*[Below is a guest post from Sanjeev Arora on his redesign of the traditional graduate algorithms course to be a better match for today’s students. –Boaz]*

For the last two years I have tried new ideas in teaching algorithms at the graduate level. The course is directed at first year CS grads, but is also taken by grads from related disciplines, and many advanced undergrads. (Links to course homepage, and single file with all course materials.)

The course may be interesting to you if, like me, you are rethinking the traditional choice of topics. The following were my thoughts behind the redesign:

- The environment for algorithms design and use has greatly changed since the 1980s. Problems tend to be less cleanly stated (as opposed to “bipartite matching” or “maximum flow”) and often involve high-dimensional and/or noisy inputs. Continuous optimization is increasingly important.
- As the last theory course my students (grad or…

]]>

In this post I’ll introduce the basic building blocks that are essential for mastering this subject and for applying machine learning to real-world problems.

**Linear Algebra 101**

This is one of the few basics that need to be covered before getting into machine learning.

Most of you might have taken this course in high school, so this will be a refresher plus some more advanced things needed for machine learning.

Linear algebra is the branch of mathematics that deals with general coordinate systems, the interaction of planes in such a generalized coordinate system (I’ll explain what I mean by a generalized coordinate system), and the operations that can be applied to them.

I assume you all come from a mathematical background, so you have probably studied matrices at some point in your education. But have you ever thought about what a matrix actually is? Think for a minute and see if you can come up with an answer.

A matrix basically denotes a linear mapping between two spaces. What I mean by that: consider the matrix [cos(theta) -sin(theta); sin(theta) cos(theta)]. If you multiply any vector in a simple 2-D space by this matrix, the vector is rotated counter-clockwise by the angle theta (try it out). So this matrix maps every vector in a space to a vector in that same space rotated by the angle theta.
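If you want to ‘try it out’ on a computer, here is a small sketch using the standard counter-clockwise rotation matrix [cos(theta) -sin(theta); sin(theta) cos(theta)]. The helper names are my own; the matrix is stored as a plain list of rows.

```python
import math


def rotation_matrix(theta):
    """Standard 2-D counter-clockwise rotation by theta radians."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]


def apply(matrix, vector):
    """Matrix-vector product: one dot product per row of the matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


rotated = apply(rotation_matrix(math.pi / 2), [1.0, 0.0])
print(rotated)  # the x-axis unit vector lands (up to rounding) on the y-axis
```

Rotating the unit vector along x by 90 degrees gives the unit vector along y, and the length of the vector is unchanged, as you would expect from a pure rotation.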

**Vector**

You have probably all studied vectors in your high school maths course. You might wonder why I am writing about a concept that seemed so abstract in high school. For your information, this is the most common term you’ll hear in a machine learning course, and in many other CS courses.

So a vector is nothing but an ordered collection of numbers. If you remember arrays from your data structures course, an array is how a vector is represented in a computer. A vector has components; in an array, the ith component is the ith element of the array.

**Basis vectors**

This is another very common term you’ll see in machine learning literature. Basis vectors are nothing but a set of vectors that correspond to the axes of the space you are talking about, so that every vector in the space can be written as a combination of them. So here is another term: space. A space is simply the set of all possible combinations of numbers. For example, an input space of 2 dimensions is simply all combinations (x, y), where x and y can be any numbers from their domains.

This image explains it more intuitively.

**Linear independence**

A set of vectors is called linearly independent when none of them can be written as a linear combination of the others.

This sums up linear independence and dependence concept.
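One way to test linear independence on a computer is to row-reduce the vectors and count how many non-zero rows survive (the rank). The vectors are independent exactly when the rank equals the number of vectors. The sketch below is a plain Gaussian elimination with a small tolerance; the function names and the tolerance value are my own choices.

```python
def rank(vectors, tol=1e-9):
    """Rank of the matrix whose rows are the given vectors (Gaussian elimination)."""
    rows = [list(map(float, v)) for v in vectors]
    found = 0  # number of pivot rows found so far
    for col in range(len(rows[0])):
        # Look for a row at or below position `found` with a non-zero entry here.
        pivot = next((r for r in range(found, len(rows)) if abs(rows[r][col]) > tol), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
        if found == len(rows):
            break
    return found


def linearly_independent(vectors):
    return rank(vectors) == len(vectors)


print(linearly_independent([(1, 0), (0, 1)]))                    # the two axes
print(linearly_independent([(1, 2), (2, 4)]))                    # second = 2 * first
print(linearly_independent([(1, 0, 0), (0, 1, 0), (1, 1, 0)]))   # third = first + second
```

In the second and third examples one vector is a combination of the others, so elimination wipes out a row and the rank drops below the number of vectors.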

**Norm of a vector:**

This is just a fancy word for the length of a vector. Length can be measured as the sum of the absolute values of the individual components of the vector; this is called the L1 norm. If we use the Euclidean distance (the square root of the sum of the squared components) to measure the length, it is called the L2 norm.
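Both norms are one-liners in code. A quick sketch, with the example vector (3, -4) picked by me because its norms come out as round numbers:

```python
import math


def l1_norm(vector):
    """Sum of the absolute values of the components."""
    return sum(abs(v) for v in vector)


def l2_norm(vector):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(v * v for v in vector))


v = (3, -4)
print(l1_norm(v), l2_norm(v))  # 7 and 5.0
```

The L1 norm is the distance you would walk along the axes; the L2 norm is the straight-line distance.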

**Eigenvector:**

This is a bit difficult to explain. If you google it, you’ll mostly find methods for calculating eigenvectors, but not what they actually are or why they are calculated. I’ll talk about them in detail when I discuss dimensionality reduction techniques.

These are very basic terms used in ML. I’ve kept it short for this post; I’ll give more details in the next one.

If you want to learn about these topics in detail, I recommend these:

https://www.khanacademy.org/math/linear-algebra

http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/

Till next time

Cheers !

]]>

But I believe that is not the correct way to learn machine learning. Many of the most successful data scientists have a PhD in this field. They have a solid foundation in the subject, and that is why they are so successful. So to become a successful data scientist you need a solid understanding of linear algebra, probability theory, statistics, computer science (relational databases, one programming language, standard textbook algorithms, algorithmic complexity theory), and optimization methods.

It’s OK if you are not an expert in these areas, but basic knowledge of these subjects is essential to be able to apply data science to real-world problems.

Note that I’ll be using machine learning and data science interchangeably; for now you can assume they are the same thing. I’ll explain the detailed difference between the two at the right time.

As I mentioned, every week I’ll discuss something from data science and ML. This week I’ll cover the basics that are prerequisites for machine learning.

Till next time

Cheers!

]]>

This is my first post in the series I’ll be writing on machine learning, big data and stuff. I’ll start by giving an introduction to these fields and to some terms that have been popping up a lot these days.

You might be wondering why you should read a new blog when there are millions of blogs about the same subject. I’ll tell you why. When I first started to learn this subject, I thought I would read some blogs, because things are easy to understand from a blog: the language is quite informal, and it includes the author’s real-world experience of dealing with the subject. But when I searched the net for blogs on machine learning, I could not find a single place that had all of those things. So I thought I’d start a new blog where beginners in this field can find a comprehensive list of algorithms explained in layman’s terms, my experiences of dealing with this subject, plus some code.

But first I’ll introduce myself. I am from a small town in Rajasthan called Jhunjhunu (funny name, I know). Most people might never have heard its name. For those people, it is located near another small town called Pilani (which actually comes under Jhunjhunu district, but is more famous as the home of the well-known school BITS Pilani). I studied at BIT Mesra and then went on to pursue a master’s at BITS Pilani. At BITS I was formally exposed to machine learning for the first time. I had studied machine learning in my bachelor’s, but I never learned it in detail. When I joined BITS I was a teaching assistant for the machine learning course, so I had to study it thoroughly before explaining it to the students, and I started studying it from the basics. I learned the basic techniques that are commonly used in machine learning. Then, to understand its practical applications, I learned the under-the-hood technology behind Google’s search, email classification and some other things we use frequently in our daily lives but which somehow look like magic. I have also worked in the Advanced Data Analytics and Parallel Technologies lab at BITS, where I learned how to apply machine learning to big data. Unfortunately I have had to discontinue my master’s at BITS, as the courses there were not aligned with my interests, and I have decided to join IIIT Hyderabad, which is a good school for learning these technologies. I am hoping to explore machine learning and big data in depth there.

To help budding data scientists and enthusiasts in this field, this is a series of posts in which I’ll write about one algorithm or one technology every week, with some real-world applications and some code too.

Till next time

Cheers!

]]>

**BIG DATA**

The internet is a wealth of data and information. With the advent of social media and the cloud there is an overflow of data everywhere. It is believed that every two days we generate as much data as was generated from the beginning of time (of computers, that is) up to 2003. It has also been reported that over 90% of all data was generated in the last two years. Facebook generates petabytes of data every single day. Phew, that’s a lot of data.

But why am I telling you these facts? You might have already seen these figures. The reason is that there is an abundance of data, but only a fraction of it is usable knowledge. To make use of this data we need computer algorithms that can mine this haystack to find the needle: something we can use to transform the data into actionable knowledge.

The term big data was initially coined to denote an amount of data that cannot be handled by a single computer. Let me explain this. A single computer has a fixed storage size, fixed memory size and fixed processing power. Assume a standard 1 GHz processor with 1 GB of memory, and that it takes this computer 1 second to perform 10^8 operations. OK, that’s a lot of computer jargon. Imagine a computer sitting in a post office letter sorting room. The computer’s job is to sort the letters by address, so that letters with the same address can be grouped together. In computer science this is the famous problem of sorting a bunch of numbers in ascending or descending order.

The best algorithms for sorting a list of numbers take time roughly proportional to the size of the list (for computer people: these are O(N log N) algorithms, and I am dropping the log N factor to keep the arithmetic simple). Assuming each operation takes the same unit of time, sorting 1 billion (10^9) letters would take the computer about 10^9 steps. We assumed it performs 10^8 operations per second, so this takes 10 seconds. Which is okay, no big deal, right? Now one day the post office finds 100 trillion letters waiting to be sorted (they got lazy and let the letters pile up, or it’s the apocalypse (end of the internet) and everyone is sending letters). 100 trillion is 10^14, so it would now take the computer 10^6 seconds, which is more than eleven days, to sort them. That is quite a long time. You get it, right? As the data grows, the computer takes longer and longer to produce the output. These are in fact small numbers; in machine learning the operation counts can reach 10^15 or 10^20, so it would take forever for a single computer to produce results.
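The back-of-the-envelope arithmetic can be reproduced in a few lines. This is only a sketch under the assumption stated in the post (10^8 operations per second); the list sizes 10^9 and 10^14 are the ones that give the 10-second and 10^6-second figures, and I also show what happens if the log N factor is kept.

```python
import math

OPS_PER_SECOND = 10 ** 8  # assumed speed of the single computer


def seconds_linear(n):
    """Time if sorting cost exactly n steps (log factor dropped)."""
    return n / OPS_PER_SECOND


def seconds_nlogn(n):
    """Time if sorting cost n * log2(n) steps."""
    return n * math.log2(n) / OPS_PER_SECOND


for n in (10 ** 9, 10 ** 14):
    days = seconds_linear(n) / 86400
    print(f"n = {n:.0e}: {seconds_linear(n):.0f} s without the log factor "
          f"({days:.2f} days), {seconds_nlogn(n):.0f} s with it")
```

Keeping the log factor makes the picture worse, not better, so the simplified estimate is if anything optimistic.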

So computer scientists gave the name big data to an amount of data that a single computer or algorithm cannot handle. For algorithms dealing with data such as videos, images and text from the internet, one computer is not enough, so these data are big data.

So big data is not a technology or a buzzword. These days big data is often characterized by the 3 V’s: Volume, Variety and Velocity. I’ll leave the meaning of these words for you to figure out ;).

**MACHINE LEARNING**

Ever wondered how Gmail automatically identifies whether a mail is spam or useful? How your digital camera recognizes faces in an image? How Facebook suggests tags for a person in an image? Is it magic? What is the mystery? Yes, you are right my friend, it’s machine learning. Machine learning has a long history. It is tied to the invention of artificial intelligence, whose origin dates back to the start of computer science. It is only now, when humans have built powerful enough computers, that it is starting to be used frequently. The question is: what is machine learning?

I think the term is self-explanatory. When machines learn something from data, that’s machine learning.

You might have seen this definition (or some variant of it) on the net. But one natural question is: how can a machine learn something? It has no brain of its own. Then how the heck does a machine learn? Surprisingly, the answer is quite simple (at least at an abstract level): machines learn the same way humans learn. Many computer systems, like compilers, have been designed with some similar human process in mind. Compilers work in the same way humans interpret a language: parsing the grammar, understanding the context, then finally the meaning. Machine learning also, in some sense, works the way humans learn a new thing or skill.

For example, when a person learns to drive, he has no idea how much gas to use for a particular speed, when to press the brakes, or when to switch gears. He learns by example. A teacher tells him when to change gear, when to press the gas, how much to press the gas, and when to press the brakes. And of course some things he learns on his own, by executing these actions and observing the results.

Machines also learn the same way. They are presented with examples: if we want to teach a machine to drive a car, we provide it with examples of combinations of gas, brake and gear, along with the result of applying each combination. The machine trains on these examples, and when training is done it is an expert in that subject and can work in the real world and drive a car on its own.

Machine learning is often divided into three main classes:

**Supervised learning:**

In this class the machine is presented with data that has features and, along with them, the result (as in the explanation above).

It learns from this data which combinations of features produce which result, and then applies that in the real world.
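As a tiny illustration of learning from labelled examples, here is a sketch of a nearest-neighbour classifier, one of the simplest supervised methods: to decide about a new situation, find the most similar training example and copy its label. The driving data (speed, distance to the car ahead) and the two actions are entirely made up for this example.

```python
def nearest_neighbor(train_features, train_labels, query):
    """Return the label of the training example closest to the query."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    best = min(range(len(train_features)),
               key=lambda i: squared_distance(train_features[i], query))
    return train_labels[best]


# Hypothetical examples: (speed in km/h, distance to the car ahead in metres) -> action
examples = [(60, 10), (80, 15), (50, 5), (30, 80), (40, 100), (20, 60)]
actions = ["brake", "brake", "brake", "gas", "gas", "gas"]

decision_close = nearest_neighbor(examples, actions, (70, 12))  # fast and close
decision_far = nearest_neighbor(examples, actions, (35, 90))    # slow and far
print(decision_close, decision_far)
```

The features are the inputs, the recorded action is the ‘result’, and the machine generalizes to situations it has not seen by comparing them with the ones it has.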

**Unsupervised learning:**

This is an interesting class of learning. In this setting the machine is presented only with the data; no label or result is provided. If we apply this setting to the previous example, the machine is given no prior training: the computer is handed a car directly and has to figure out how to drive it. It is a challenging class of machine learning problems, as it requires some smart work.
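A common concrete form of learning without labels is clustering: grouping similar data points together without being told what the groups are. Below is a minimal sketch of k-means with two clusters on one-dimensional data. The data values, the choice of two clusters and the simple ‘start from the min and max’ initialization are my own assumptions for the example.

```python
def kmeans_1d(values, iterations=10):
    """Two-cluster k-means on a list of numbers; returns (centroids, clusters)."""
    centroids = [min(values), max(values)]  # simple deterministic starting guess
    clusters = [[], []]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest centroid.
        clusters = [[], []]
        for v in values:
            nearest = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Unlabelled measurements that happen to form two groups.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = kmeans_1d(data)
print(centroids, clusters)
```

Nobody told the algorithm there was a ‘low’ group and a ‘high’ group; it found that structure from the data alone.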

**Reinforcement learning**

This class of learning is mostly used to build bots: agents that work in some environment, have some goal to fulfil, and learn from the environment the best way to reach that goal. In the previous setting, the machine is given a car and allowed to train by itself to drive it. The machine tries out combinations of gas and brake, and through penalties (for hitting something) and rewards (for successfully crossing a street, say) it learns how to drive the car.
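To show the reward-and-penalty idea in code, here is a toy sketch of tabular Q-learning, one standard reinforcement learning algorithm. Everything about the environment is made up: a corridor of 5 cells, the agent starts in the middle, reaching the right end earns +1 (crossing the street), reaching the left end costs -1 (hitting something). The learning rate, discount and exploration rate are arbitrary choices.

```python
import random


def train_corridor(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Q-learning on a 5-cell corridor; returns the table q[state][action]."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(5)]  # action 0 = move left, 1 = move right
    for _ in range(episodes):
        state = 2  # start in the middle cell
        for _ in range(100):  # safety cap on steps per episode
            # Explore sometimes (or when undecided), otherwise act greedily.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = state + (1 if action == 1 else -1)
            if next_state == 4:
                reward, done = 1.0, True    # reward: reached the goal
            elif next_state == 0:
                reward, done = -1.0, True   # penalty: crashed
            else:
                reward, done = 0.0, False
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            if done:
                break
            state = next_state
    return q


q_table = train_corridor()
policy = ["right" if q_table[s][1] > q_table[s][0] else "left" for s in (1, 2, 3)]
print(policy)
```

The agent is never told ‘go right’; it only sees rewards and penalties, and the preference for moving right emerges from trying actions and observing what happens.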

Now obviously this is a simple explanation of machine learning, and the details behind it are quite scary (don’t worry, I’ll explain the details starting from the next post).

What picture can better represent the use of machine learning than this? Page and Brin, two guys who literally started the revolution of machine learning in real life, and Google’s self-driving car, a pinnacle of machine learning itself.

**Data Mining**

If you search Google for data mining vs machine learning, you’ll find a different answer at each link. I won’t go into much technical detail on this debate here (in coming posts I’ll discuss the issue in depth anyway). Data mining is mining the data to find some useful insight in it. It uses techniques from machine learning and statistics to do so.

Some people argue that data mining is simply the application of unsupervised learning. Some say it is machine learning with some application of statistics. I believe it is mostly the application of unsupervised learning.

Think of pattern recognition in large datasets, for instance mining web data. A good example is ranking web pages according to their relevance, which means mining a huge dataset (the entire web); this is data mining. It can be seen as the practical application of machine learning to large real-world datasets.

The definition of data mining is subjective. As you learn more about machine learning and data mining, you’ll form your own definition.

**Data Science**

Put in simple words, data science is the intersection of machine learning, data mining, statistics and computer science.

This is the famous Venn diagram that explains it. A data scientist is someone who is an expert in machine learning and computer science and has domain knowledge of the field where he is applying data science. He is also a hacker. Hacking here means problem-solving skills. Don’t confuse it with the hackers shown in fancy sci-fi movies.

Simply put, a data scientist is someone who is good at programming, has good knowledge of both descriptive and inferential statistics, has good knowledge of machine learning algorithms, and has some knowledge of parallel and distributed computing (phew, I know, it takes a lot to be a data scientist!).

That’s a lot of introduction. Thanks for reading it. Please comment if you find any discrepancy in the writing or in the facts.

From the next post on, every week or so, I’ll write about one algorithm related to machine learning or big data. You might be wondering why you should read my blog when there are hundreds of blogs about data science. I believe there is a lot of information out there, and as a beginner it is difficult to find blogs that are comprehensive and describe data science in layman’s terms. I’ll try to write posts that explain all the algorithms and tricks you need to be a data scientist, and I’ll also try to include code snippets so that you can see things in action.

Till next time

Cheers !

]]>