Solved – Solving a practical machine learning problem

machine learning, scalability, underdetermined

I am currently doing my PhD in computational biology at Stanford. I am able to get the data I need to answer the questions I am interested in. The data sets are sometimes "large", and these large problems can take a long time to solve (sometimes a couple of days).

That being said, I was wondering how machine learning on extremely massive data sets works. Suppose Google wants to solve $Ax = b$ where $A$ has 10 billion rows; computing even a single gradient seems prohibitive. If Google actually ran these computations for as long as it takes (my equivalent of a couple of days), the solution might be worthless by the time it arrives. The problem is only accentuated when training neural networks or implementing more complicated methods. What are practical solutions to this problem?
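To put a number on the gradient cost: for the least-squares objective $f(x) = \tfrac{1}{2}\lVert Ax - b \rVert^2$, a single gradient is $\nabla f(x) = A^\top (Ax - b)$, so even one evaluation has to touch all 10 billion rows of $A$.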

I have seen statements like "We pick representative samples…". In my opinion this is an absurd statement, because when $p \gg n$ nothing is representative: the systems are under-determined. Any clarification of what 'representative' means in these cases would also help.

Best Answer

I have seen statements like "We pick representative samples...". This is an absurd statement...

I agree with you on this, and I don't think representative sampling is what they do (anymore). My understanding is that they analyse big data with distributed computing, using technologies like Hadoop, Spark, and MLlib. I am sure they write their own proprietary and complex machine learning / analysis libraries on top of these.
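As a concrete illustration, here is a minimal sketch of fitting a large linear regression with Spark's MLlib; the file path, column names, and cluster setup are all assumptions on my part. The point is that the data frame is partitioned across workers and never has to fit on a single machine.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("distributed-least-squares").getOrCreate()

# The data is partitioned across the cluster; no single node holds all of it.
df = spark.read.parquet("hdfs:///data/observations.parquet")  # hypothetical path

# Collect the predictor columns into a single feature vector per row.
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
train = assembler.transform(df).select("features", "y")

# MLlib's solver aggregates the per-partition contributions across workers.
lr = LinearRegression(featuresCol="features", labelCol="y")
model = lr.fit(train)

print(model.coefficients, model.intercept)
```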

The algorithms for these "distributable" systems have to be coded differently from how you and I would code them in, e.g., R, Matlab, or Python. They need to be scalable and parallelisable, which is an issue for some algorithms. MLlib, for instance, currently supports only a fairly basic set of algorithms (the list is available on its website).
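To see why algorithms have to be rewritten this way, note that the least-squares gradient decomposes into a sum over row blocks, which is exactly the map-reduce pattern these systems exploit. Below is a toy, single-machine NumPy sketch of that pattern; the split into `n_workers` chunks only mimics what a real cluster would do, and the sizes and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_workers = 100_000, 10, 4
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

def partial_gradient(A_part, b_part, x):
    # Gradient of 0.5 * ||A_part @ x - b_part||^2 on one partition only.
    return A_part.T @ (A_part @ x - b_part)

x = np.zeros(d)
step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
A_parts = np.array_split(A, n_workers)
b_parts = np.array_split(b, n_workers)

for _ in range(200):
    # "Map": each worker computes its partial gradient on its own rows;
    # "reduce": only the small d-dimensional partial sums are combined.
    grad = sum(partial_gradient(Ap, bp, x) for Ap, bp in zip(A_parts, b_parts))
    x -= step * grad

print(np.linalg.norm(x - x_true))  # should be small
```

Only the $d$-dimensional partial gradients ever move between workers, which is what makes the approach scale to matrices with billions of rows.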
