How to parse log files in Hadoop



Hi Sirish,
I have a task which is like this:
100s of machines are running some tests and printing logs to their log files all of which are getting stored on a network location.

So there are files such as /path2files/machine1.log, /path2files/machine2.log, ... being written continuously.

I will write a program in Python (or Java) to parse each log file and search each line for a predefined set of errors. If I find an error, I'll raise a bug/ticket; otherwise I do nothing and move on to the next line, until the end of the file.
Then the program sleeps for 1 hour.
After that it checks the files again and resumes looking for errors from where it left off last time.
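
The polling approach described above can be sketched in plain Python, without Hadoop, roughly as follows. This is only a minimal sketch under assumptions: the error patterns, the JSON state file and the `raise_ticket` callback are hypothetical placeholders (the post does not say what the real error set or ticketing system is), and each log is assumed to be an append-only text file with newline-terminated lines.

```python
import json
import os
import re

# Hypothetical patterns; the real predefined error set is not given in the post.
ERROR_PATTERNS = [re.compile(p) for p in (r"\bERROR\b", r"\bFATAL\b")]


def load_offsets(state_path):
    """Return the {log_path: byte_offset} map saved by the previous run."""
    if os.path.exists(state_path):
        with open(state_path) as f:
            return json.load(f)
    return {}


def save_offsets(state_path, offsets):
    """Persist the offsets so the next run resumes where this one stopped."""
    with open(state_path, "w") as f:
        json.dump(offsets, f)


def scan_new_lines(log_path, offset):
    """Scan log_path from byte `offset`; return (matching_lines, new_offset)."""
    # If the file shrank (e.g. it was rotated), start again from the beginning.
    if os.path.getsize(log_path) < offset:
        offset = 0
    hits = []
    with open(log_path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            # Stop at EOF or at a half-written last line (no newline yet);
            # that line is picked up in full on the next run.
            if not line.endswith(b"\n"):
                break
            offset = f.tell()
            text = line.decode("utf-8", errors="replace").rstrip("\r\n")
            if any(p.search(text) for p in ERROR_PATTERNS):
                hits.append(text)
    return hits, offset


def run_once(log_paths, state_path, raise_ticket=print):
    """One polling pass over all logs; raise_ticket is a placeholder hook."""
    offsets = load_offsets(state_path)
    for path in log_paths:
        hits, offsets[path] = scan_new_lines(path, offsets.get(path, 0))
        for line in hits:
            raise_ticket(f"{path}: {line}")
    save_offsets(state_path, offsets)
```

Calling `run_once(...)` from a loop with `time.sleep(3600)` between passes (or from an hourly cron job) would give the behaviour described. Storing byte offsets rather than line numbers means each pass only reads the newly appended part of each file.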

If these files were static (not being appended to), the solution in Hadoop would be very simple: I would just move the files to HDFS and run an MR job to parse each line of each file in parallel. No reducer is required.
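
For the static case, a map-only job like this could be written as a Hadoop Streaming mapper in Python. The following is just a sketch under assumptions: the error patterns are hypothetical, and reading the input file name from the `mapreduce_map_input_file` environment variable assumes a Hadoop version where Streaming exports that job property (older versions use a different name), so the code falls back to "unknown".

```python
import os
import re
import sys

# Hypothetical patterns; the real predefined error set is not given in the post.
ERROR_PATTERNS = [re.compile(p) for p in (r"\bERROR\b", r"\bFATAL\b")]


def map_lines(lines, source="unknown"):
    """Yield one 'source<TAB>line' record for every line matching an error."""
    for line in lines:
        text = line.rstrip("\r\n")
        if any(p.search(text) for p in ERROR_PATTERNS):
            yield f"{source}\t{text}"


# Only consume stdin when data is piped in, as Hadoop Streaming does.
if __name__ == "__main__" and not sys.stdin.isatty():
    # Assumption: Streaming exposes the current input file under this name.
    source = os.environ.get("mapreduce_map_input_file", "unknown")
    for record in map_lines(sys.stdin, source):
        print(record)
```

Setting `mapreduce.job.reduces` to 0 when submitting the job makes it map-only, so the mapper output is written straight to the output directory. Raising tickets would then be a separate step that reads that output.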

Question: Is there any solution using Hadoop?
The challenge here is that the files are not on HDFS and are being appended to continuously.

2 Answer(s)



I came to know of open source software like Elasticsearch and Kibana (https://www.elastic.co/products/kibana). I have not used it much, but it seems you can configure the server names and log file locations, and maybe some custom filters. You can get all those analytics out of the log files. I think you also need Logstash. You can explore it all here:
https://www.digitalocean.com/community/tutorials/how-to-install-elasticsearch-logstash-and-kibana-4-on-ubuntu-14-04


Hi Vadivel, thanks for the info. I have used ELK for certain things, like sending data in JSON format and plotting it in Kibana. It can do chart plotting and aggregations like min, max and average. As far as I know, the data needs to be structured before Kibana can make use of it. In my case I need to find errors in completely unstructured data and raise tickets based on them. Even if unstructured data can be shown in Kibana, someone still has to query for the errors. I need to read more to see whether there is any automated reporting based on filters over unstructured data; otherwise it can't solve my problem.

Thanks again.

I am still trying to understand the use cases of ELK, Solr and Splunk versus the Hadoop ecosystem we have been learning.

-Sushil