Big Data Environments

Geographical data have always been ‘big’, presenting special challenges for geo-computation. Adding Web-scale ‘Big Data’ makes things even bigger, into the Gigabytes, Terabytes, Petabytes or more…

Adrian’s PhD research had highlighted several problems, e.g., with Relational Database Management Systems (RDBMSs) ‘ingesting’ large amounts of JSON (JavaScript Object Notation) or trusty old CSV (Comma Separated Values) formatted data.
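To make the ‘ingestion’ step concrete, here is a minimal sketch of loading both formats into a relational table. It is not the code used in the research: Python’s built-in sqlite3 module stands in for a full RDBMS, and the sample records, the `places` table and the `ingest` function are all hypothetical names chosen for illustration.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data; the place names and columns are illustrative only.
CSV_TEXT = "name,lat,lon\nPortsmouth,50.8198,-1.0880\n"
JSON_TEXT = '[{"name": "Helsinki", "location": {"lat": 60.1699, "lon": 24.9384}}]'


def ingest(conn, csv_text, json_text):
    """Load CSV rows and nested JSON records into one relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS places (name TEXT, lat REAL, lon REAL)")

    # CSV is already tabular, so each row maps directly onto the columns.
    csv_rows = [
        (row["name"], float(row["lat"]), float(row["lon"]))
        for row in csv.DictReader(io.StringIO(csv_text))
    ]

    # JSON is nested, so each record must be flattened to fit the fixed schema.
    json_rows = [
        (rec["name"], rec["location"]["lat"], rec["location"]["lon"])
        for rec in json.loads(json_text)
    ]

    with conn:  # commit both batches in a single transaction
        conn.executemany("INSERT INTO places VALUES (?, ?, ?)", csv_rows + json_rows)
    return len(csv_rows) + len(json_rows)


# sqlite3 in memory is an assumption standing in for a production RDBMS.
conn = sqlite3.connect(":memory:")
count = ingest(conn, CSV_TEXT, JSON_TEXT)
```

The flattening of nested JSON into fixed columns is the part that tends to become costly at scale, since every record has to be parsed and reshaped before it can be inserted.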

Speed of processing was also an issue, even on Adrian’s specially commissioned ‘shedputer’: two 4U IBM System X3850 M2 computers with 24 processor cores, 128GB RAM each and RAID10 solid state disks, pictured in situ, along with a 2U Dell PowerEdge 2950.

Working with Professor Richard Healey, Gary Burton and David Marshall at the University of Portsmouth, Adrian investigated and assessed the use of High Performance Computers (HPCs) in the Institute of Cosmology and Gravitation’s SCIAMA Supercomputer for Big Data workloads.

Software from MapR, Oracle, the Apache Software Foundation (Drill), Tableau and Gephi (specialised, open source graph visualisation software) was also assessed.

Running MapR’s flavour of Hadoop on a 5-node HPC cluster (48 processor cores, 100GB memory and 6TB disk space) and Oracle’s 12c RDBMS on a 2-node HPC cluster brought significant performance gains for data ingestion, query and analysis.

The old instructions for interfacing MapR (now defunct) with Tableau through Java Database Connectivity (JDBC) connectors are reproduced below, in case they are of any further use to the community.

Adrian’s research was presented internally and at the Drilling for Big Data, GIS with NoSQL Workshop of the 19th AGILE International Conference (2016) on Geographic Information Science, Helsinki, Finland. His presentation included a live connection to the SCIAMA MapR cluster, using Apache Drill to serve JSON data to Tableau and produce a map.

The supercomputing tests also led to further work with Oracle, including one of the first ‘field tests’ of the Oracle Cloud, which used a highly-specified Exadata machine that performed very well with the Oracle 12c RDBMS.

Technology marches on and computing power continues to increase. Adrian’s shedputers still exist, but most heavy lifting is now done in the Cloud…