Today I had to pick up work from a developer who left my team three months ago.
This can be a challenge for any group or organisation, but it is particularly challenging when the work involves new or experimental technology.
In my case, I wanted to work on a Hadoop and MongoDB example my team had developed. I was relatively familiar with where the information lived, so I logged in, found the relevant pieces, and was able to work through the entire example in less than an hour.
This was partly because my team had done a good job documenting their work, but also because the development had been done in Pentaho Data Integration, so I could logically follow the steps that had been implemented. Below you can see the process.
A description of what this process does is written directly into the ETL job:
“To take our web logs and determine the referring traffic to our sites.
Web logs are copied from our web proxy, NGINX, to a staging area on the DI server and then transferred to our Hadoop cluster. Once on the cluster, we use Pentaho Visual MapReduce to parse the log file and return the total hits and traffic for each country and referring site per day.”
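To make the MapReduce step concrete, here is a minimal Python sketch of the same idea: map each log line to a `(day, country, referrer)` key with hit and byte counts, then reduce by summing per key. The log-line pattern (NGINX combined format) and the `lookup_country` stub are my assumptions for illustration, not the team's actual configuration, which lived in Pentaho Visual MapReduce.

```python
import re
from collections import defaultdict

# Assumed NGINX combined log format; the real job's parsing rules may differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+)[^\]]*\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)"'
)

def lookup_country(ip):
    # Placeholder for a real GeoIP lookup (hypothetical helper).
    return "unknown"

def mapper(line):
    """Map phase: emit ((day, country, referrer), (hits, bytes)) per log line."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None  # skip malformed lines
    key = (m.group("day"), lookup_country(m.group("ip")), m.group("referrer"))
    return key, (1, int(m.group("bytes")))

def reducer(pairs):
    """Reduce phase: sum hits and traffic (bytes) for each key."""
    totals = defaultdict(lambda: [0, 0])
    for key, (hits, nbytes) in pairs:
        totals[key][0] += hits
        totals[key][1] += nbytes
    return dict(totals)
```

In a real Hadoop job the reduce would run per key across the cluster; this single-process version just shows the shape of the aggregation.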
This meant that, within an hour, I could modify the process, debug issues, or extend it for other uses.
This relatively “simple” process spanned three different servers/services, including a Pentaho DI server, our firewall, and our Hadoop cluster, yet I was able to log into one place and see exactly how the data flowed.
Having a single place where all of my data management processes are maintained means that my team and I can focus on doing analysis and making improvements rather than turning the handle of data processing or, worse, reinventing the wheel.