Hadoop seems incredibly well-suited to shouldering machine-learning workloads. With HDFS you can store both structured and unstructured data across a cluster of machines, and SQL-on-Hadoop technologies like Hive make those structured data look like database tables. Execution frameworks like Spark let you distribute compute across the cluster as well. On paper, Hadoop is the perfect environment for running compute-intensive distributed machine learning algorithms across a vast amount of data.
Unfortunately, though, Hadoop seems incredibly well-suited for a lot of other things too. Streaming data? Storm and Flink! Security? Kerberos, Sentry, Ranger, and Knox! Data movement and message queues? Flume, Sqoop, and Kafka! SQL? Hive, Impala, and HAWQ! The Hadoop ecosystem has become a grab bag of often overlapping and competing technologies. Competition among Cloudera, Hortonworks, and MapR is responsible for some of this, as is the dynamism of the open source community.