I will talk about a few machine learning concepts I learned in the past year from various online resources such as
this course from Coursera & Stanford. There are a zillion more of them on the web.
My foray into the ML world happened last year (2015) during my internship at Juniper. You can find that work
here. Since my research is about building secure platforms for big data, machine learning is not something I can decouple myself from.
So what are some of the concepts and tools I learnt/used?
Distribution Functions
I believe that the first thing to know before we dig deep is the INTEGRAL. In layman's terms, we have formulas to find the areas of simple shapes like circles, rectangles and triangles. But for arbitrary shapes, we use integrals to find the area. We divide a shape into k parts, find the area of each individual part and sum the k areas. As k increases our result gets more accurate; in the limit, the sum becomes the integral.
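To make that concrete, here is a minimal sketch (my own toy example) that approximates the area under a curve with k rectangles and shows the estimate improving as k grows:

```python
# Approximate the integral of f(x) = x^2 on [0, 1] with k rectangles (a Riemann sum).
# The exact answer is 1/3; the estimate gets closer as k increases.

def riemann_sum(f, a, b, k):
    width = (b - a) / k                       # width of each slice
    # sample f at the midpoint of each slice and add up the rectangle areas
    return sum(f(a + (i + 0.5) * width) * width for i in range(k))

f = lambda x: x ** 2
for k in (10, 100, 10000):
    print(k, riemann_sum(f, 0.0, 1.0, k))     # approaches 0.3333...
```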
Next, you need to know about the PDF & CDF. Take the density function f(x) of a continuous random variable x. Integrating f(x) over any interval gives the probability that x lies within that interval, and integrating it from minus infinity up to a point gives the cumulative probability. A probability distribution can be specified using a Probability Density Function (PDF) or a Cumulative Distribution Function (CDF).
Binomial and Bernoulli distributions - discrete variables; Gaussian (or Normal) distribution - continuous variables. There are so many of them that I won't go into the details.
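As a quick illustration (assuming scipy is available), the snippet below evaluates the PDF and CDF of a standard normal distribution; the CDF difference over an interval is the actual probability of landing in that interval:

```python
# PDF vs. CDF for a standard normal distribution (mean 0, std 1).
from scipy.stats import norm

# Probability *density* at a point -- not a probability by itself.
print(norm.pdf(0.0))             # ~0.3989

# CDF gives P(X <= x); the difference over an interval is a real probability.
p_interval = norm.cdf(1.0) - norm.cdf(-1.0)
print(p_interval)                # ~0.6827, i.e. P(-1 <= X <= 1)
```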
Bayes Theorem
I think I can call myself a Bayesian fanboy. It is all about probability and inference. Based on the probability of what you've seen so far in your dataset, can you calculate the probability of something happening in the future? The big issue with naive Bayes in particular: it assumes that features are independent (which isn't always true).
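As a toy worked example (numbers made up by me), here is Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), applied to a spam-filter style question:

```python
# Toy Bayes' theorem example with made-up numbers:
# what is the probability an email is spam given it contains the word "free"?
p_spam = 0.2                    # prior: 20% of mail is spam
p_free_given_spam = 0.6         # likelihood: "free" appears in 60% of spam
p_free_given_ham = 0.05         # "free" appears in 5% of non-spam

# total probability of seeing "free"
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# posterior via Bayes' theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)        # 0.75
```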
Unsupervised & Supervised Learning
What is needed to teach a machine to learn? DATA. Lots of it. This is called training data. This inherently means we need ways to manage/organize this data. How do we do that? Clustering, Classification, Dimensionality Reduction. Btw, I don't know much about reinforcement learning (which is about learning continuously from feedback/rewards).
Clustering [unsupervised]
How can you efficiently create groups within your dataset where each group is similar in one or many ways? You don't do this all the time - typically only once in a while, to organize the data. K-means is the most popular clustering algorithm. It is based on the distance between points.
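A minimal k-means sketch using scikit-learn (assuming it is installed; the data is just two made-up blobs):

```python
# Cluster a small 2-D dataset into 2 groups with k-means.
import numpy as np
from sklearn.cluster import KMeans

# two obvious blobs of points (toy data)
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.1], [8.2, 7.9], [7.8, 8.3]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # centroids, roughly (1, 1) and (8, 8)
```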
Classification [supervised]
When you already have a labeled (clustered) dataset and then get new data, can you classify it as belonging to a certain known group? Support Vector Machines (SVMs) are the most popular supervised learning algorithm. They give a global optimum solution and can be used for non-linear classification. But they are complex and involve quadratic optimization. I personally like Naive Bayes. It is not a non-linear classifier, but it is simple and mostly works.
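Since I mentioned Naive Bayes, here is a minimal scikit-learn sketch (toy data of my own) that trains a Gaussian Naive Bayes classifier and classifies a new point:

```python
# Classify a new point with Gaussian Naive Bayes.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# toy training data: two classes separated along both axes
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)
print(clf.predict([[1.1, 1.0]]))         # -> [0]
print(clf.predict_proba([[4.9, 5.2]]))   # posterior probabilities per class
```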
Regression [supervised]
Basically, it is a way to understand the relations between variables and predict an output by analyzing a given input dataset. Simple cases have one continuous input variable, but complex cases use multiple continuous variables, as you can imagine. But how do we predict? For this we need to know how to represent the hypothesis.
- Linear Regression [Gaussian distribution]
Let's say we have a bunch of points on an XY plane. We want to draw a line, y = mx + c, that gives the trend in those points. Here, x is independent and y is dependent on x. The slope m is the regression coefficient, or the effect, and c is the intercept. The error - how far the points fall from the line - describes how good the fit is. A simple way to calculate the error is least squares.
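A small sketch (made-up points) fitting y = mx + c by least squares with numpy:

```python
# Fit a line y = m*x + c to noisy points by least squares.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])   # roughly y = 2x + 1 with noise

m, c = np.polyfit(x, y, deg=1)            # least-squares fit of degree 1
residuals = y - (m * x + c)               # how far each point is from the line
print(m, c)                               # close to 2 and 1
print(np.sum(residuals ** 2))             # sum of squared errors (the fit quality)
```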
- Logistic Regression [Bernoulli distribution]
This is similar to linear regression, but the dependent variable is binary (yes/no). The most common example: is this e-mail spam or not? It is a convex optimization problem, so gradient descent can reach the global optimum.
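A minimal scikit-learn sketch (toy yes/no data of my own) to show the binary output:

```python
# Binary classification with logistic regression: predict 0/1 from a single feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: small values -> class 0, large values -> class 1
X = np.array([[0.5], [1.0], [1.5], [4.0], [4.5], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.2], [4.8]]))        # -> [0 1]
print(clf.predict_proba([[2.75]]))        # probabilities near 0.5 around the boundary
```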
- Error
The hypothesis is what our learning algorithm comes up with. Then there is the actual value. The difference is the error. Like I mentioned before, least squares is a popular way of calculating the error. As expected, we need to minimize this error. Gradient descent is one popular way to do this. On non-convex problems it can get stuck in a local minimum, so the starting point is crucial. Also, for large datasets we use stochastic gradient descent.
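Here is a bare-bones gradient descent sketch (my own toy setup) minimizing the least-squares error for the same line fit as above:

```python
# Minimize the least-squares error for y = m*x + c with plain gradient descent.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

m, c = 0.0, 0.0                 # starting point
lr = 0.01                       # learning rate (step size)

for _ in range(5000):
    pred = m * x + c
    err = pred - y
    # gradients of the mean squared error w.r.t. m and c
    grad_m = 2 * np.mean(err * x)
    grad_c = 2 * np.mean(err)
    m -= lr * grad_m
    c -= lr * grad_c

print(m, c)                     # converges close to the least-squares solution
```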
- Other stuff
If the class-conditional probability P(x | y) follows a Gaussian or Poisson distribution, the posterior P(y = 1 | x) takes a logistic form.
NLP & LDA [unsupervised]
Honestly, I used only text data for NLP. But to be fair, it was unstructured data, which made it a mess to deal with. I had to prune the data a lot.
- LDA
LDA (Latent Dirichlet Allocation) is a popular NLP technique. It works for generic text understanding and topic modeling. Basically, it is a clustering algorithm: it groups words into topics and describes documents as mixtures of those topics.
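A small topic-modeling sketch with a recent scikit-learn (assumed to be installed; the documents are made up by me):

```python
# Topic modeling with Latent Dirichlet Allocation on a tiny toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "network security firewall intrusion detection",
    "firewall security packet network",
    "gradient descent neural network training",
    "training loss gradient backpropagation",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)                # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# print the top few words per topic
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:]]
    print(f"topic {i}: {top}")
```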
- Choquet Integral
I was at a conference recently where someone used this function to make their data more subjective.
- HMM
Hidden Markov Models are famous for speech and handwriting recognition. Basically, pattern recognition. Take handwriting recognition: the user writes some word/letter, which is the observed data. There is a hidden state associated with that observed data, described by a probability distribution over a set of known values, and these hidden states follow a Markov chain. The model has three sets of parameters: the initial probabilities of the hidden states, the transition probabilities from one hidden state to another, and the emission probabilities of an observed value given a hidden state. The standard algorithm has two passes: (a) forward and (b) backward. The forward pass computes the probability of the observations so far ending in each hidden state; the backward pass computes the probability of the remaining observations given each hidden state. Combined (forward-backward), they let you infer the hidden states and estimate the model parameters.
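To show the forward pass, here is a bare-bones numpy sketch with made-up transition and emission matrices, computing the likelihood of an observation sequence:

```python
# Forward algorithm for a toy 2-state HMM: probability of an observed sequence.
import numpy as np

start = np.array([0.6, 0.4])              # initial probabilities of hidden states
trans = np.array([[0.7, 0.3],             # transition probabilities between hidden states
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],              # emission probabilities: P(observation | state)
                 [0.2, 0.8]])

obs = [0, 1, 0]                           # observed symbol indices

alpha = start * emit[:, obs[0]]           # initialization
for o in obs[1:]:
    alpha = (alpha @ trans) * emit[:, o]  # recursion: propagate states, weight by emission

print(alpha.sum())                        # P(observation sequence | model)
```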
Neural Networks [supervised]
A neural network is a bunch of logistic regression units put together to get non-linear decision boundaries. They use backpropagation (with gradient descent) to converge on an answer. Deep learning is a technique of stacking multiple transformations (linear & non-linear) that can be implemented on artificial and recurrent neural networks. It is highly parallelizable and compute intensive, so it prefers GPUs. Luckily we have the compute power now. Though there is the fear of converging to a local minimum, DL gets better and better with more data. This is why there is huge hype for DL now. But can we parallelize and distribute the hyperparameter search for speed?
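Here is a tiny from-scratch sketch (numpy only, toy XOR data, my own example) of a network of stacked logistic units trained with backpropagation:

```python
# A 2-layer neural network (stacked logistic units) learning XOR with backpropagation.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR: not linearly separable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)       # hidden layer weights
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)       # output layer weights
lr = 0.5                                             # learning rate

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # backward pass: gradients of the squared error through both layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # gradient descent updates
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))   # usually close to [[0], [1], [1], [0]]; depends on the random init
```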
Dimensionality Reduction
Can you reduce the number of independent variables in your data? Check if you can combine some of the variables that are seemingly similar. Fewer dimensions = less complex computations. Simple! Principal Component Analysis (PCA) is the most popular algorithm for dimensionality reduction.
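A quick PCA sketch with scikit-learn (toy 3-D data that really lives on a 2-D plane):

```python
# Reduce 3-D data whose third column is redundant down to 2-D with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
# third column is (almost) a combination of the first two -> a redundant dimension
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=100)])

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print(X2.shape)                          # (100, 2)
print(pca.explained_variance_ratio_)     # nearly all of the variance kept in 2 components
```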
Data Mining
Data structures. Data structures. Data structures. If you want to mine data efficiently, you'd better choose the right data structure first. I only scratched the surface in this field. FP-growth is one algorithm I used in one of my projects (it builds a prefix tree, the FP-tree, btw).
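If you want to try FP-growth, the mlxtend library (assuming it is installed) has an implementation; the transactions below are made up:

```python
# Frequent itemset mining with FP-growth (via the mlxtend library).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "eggs"],
]

# one-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# itemsets appearing in at least 50% of the transactions
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```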
Data Cleaning
This step seems so trivial, but it is so important for accuracy in the results. I learnt that the hard way when dealing with e-mails as my input data. People have different styles of writing e-mails, which makes the data extremely unstructured. On top of that, accounting for spelling mistakes and other grammatical mistakes is a nightmare!
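A minimal cleanup sketch (just the kind of normalization I mean, not the exact pipeline I used):

```python
# Very basic text normalization: lowercase, drop quoted reply lines, strip punctuation.
import re

def clean(text):
    text = text.lower()
    text = re.sub(r"^>.*$", " ", text, flags=re.MULTILINE)  # drop quoted reply lines
    text = re.sub(r"[^a-z0-9\s]", " ", text)                # drop punctuation/symbols
    text = re.sub(r"\s+", " ", text).strip()                # collapse whitespace
    return text

raw = "Hi!!\n> On Monday you wrote:\nPlease find teh attached REPORT :)"
print(clean(raw))   # "hi please find teh attached report" (spelling mistakes still remain!)
```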
Differential Privacy
Protect user data from being exposed to others, and even to yourself! Maybe introduce noise, hash the user data, or aggregate the data to a group level and work on that. In deep learning, can we sanitize the gradients?
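A tiny sketch of the most basic idea, the Laplace mechanism for a count query (toy numbers, epsilon chosen arbitrarily by me):

```python
# Differentially private count: add Laplace noise scaled to sensitivity / epsilon.
import numpy as np

rng = np.random.default_rng()

def private_count(true_count, epsilon, sensitivity=1.0):
    # a count changes by at most 1 when one user is added/removed -> sensitivity 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

print(private_count(1234, epsilon=0.5))   # noisy answer, different on every call
```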
Going forward, the biggest goal is to do projects in deep learning and differential privacy.