Requirements
Since Spark is rapidly evolving, I need to deploy and maintain a minimal Spark cluster for testing and prototyping. A public cloud is the best fit for my current needs.
- **Intranet speed**: The cluster should copy data from one server to another quickly, because MapReduce shuffles large chunks of data between nodes. Ideally, the disks are SSDs.
- **Elasticity and scalability**: Before scaling the cluster out to more machines, the cloud should offer enough elasticity to resize a single machine up or down.
- **Locality of Hadoop**: Most importantly, the Hadoop cluster and the Spark cluster should map one-to-one, as in the table below. The computation and the storage should always be on the same machines.
Hadoop | Cluster Manager | Spark | MapReduce |
---|---|---|---|
Name Node | Master | Driver | Job Tracker |
Data Node | Slave | Executor | Task Tracker |
Choice of public cloud
I compare two cloud service providers: AWS and DigitalOcean. Both have nice Python-based management tools (Boto for AWS and python-digitalocean for DigitalOcean).
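For a sense of what those tools look like, here is a minimal sketch that lists the machines on each provider (the region and the API token are placeholders, not working values):

```python
# A minimal sketch using boto (AWS) and python-digitalocean; the region
# and token below are placeholders, not working credentials.
import boto.ec2
import digitalocean

# AWS: list all EC2 instances in one region
ec2 = boto.ec2.connect_to_region("us-east-1")
for reservation in ec2.get_all_instances():
    for instance in reservation.instances:
        print(instance.id, instance.state)

# DigitalOcean: list all droplets under the account
manager = digitalocean.Manager(token="YOUR_DO_TOKEN")
for droplet in manager.get_all_droplets():
    print(droplet.name, droplet.status)
```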
- **From storage to computation**: Amazon S3 is a great place to keep data and load it into the Spark/EC2 cluster. The Spark cluster on EC2 can even read an S3 bucket directly through paths like s3n://file, and the speed is still acceptable (see the PySpark sketch after this list). On DigitalOcean, I have to upload the data from my local machine into the cluster's HDFS.
- DevOps tools:
- AWS: spark-ec2.py
- With the default settings, running it gives you:
- 2 HDFSs: one persistent and one ephemeral
- Spark 1.3 or any earlier version
- Spark’s stand-alone cluster manager
- A minimal cluster with 1 master and 3 slaves consists of 4 m1.xlarge EC2 instances
- Pros: large memory (15 GB per node)
- Cons: no SSD; expensive ($0.35 × 4 instances = $1.40 per hour)
- DigitalOcean: https://digitalocean.mesosphere.com/
- With the default settings, running it gives you:
- HDFS
- no Spark
- Mesos
- OpenVPN
- A minimal cluster with 1 master and 3 slaves consists of 4 droplets with 2 GB RAM and 2 CPUs each
- Pros: as low as $0.12 per hour; Mesos provides fine-grained control over the cluster (down to 0.1 CPU and 16 MB of memory; see the Mesos sketch after this list); the built-in OpenVPN is nice for security
- Cons: small memory (2 GB per node); Spark has to be installed manually
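To illustrate the storage point above, here is a minimal PySpark sketch of reading an S3 bucket directly from the EC2 cluster (the bucket path and the AWS keys are placeholders):

```python
# A minimal PySpark sketch; the bucket path and credentials below are
# placeholders for illustration.
from pyspark import SparkContext

sc = SparkContext(appName="read-s3")

# Credentials for the s3n:// filesystem, unless they are already set
# in the environment or the Hadoop configuration
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

# Read the bucket directly over s3n:// instead of copying it into HDFS first
lines = sc.textFile("s3n://your-bucket/path/to/file")
print(lines.count())
```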
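And for the Mesos point: attaching Spark to the Mesos master in the 1.x line runs executors in fine-grained mode by default (spark.mesos.coarse=false), so jobs can claim fractions of a CPU. A sketch, where the master address and the executor URI are assumptions:

```python
# A minimal sketch of running Spark on Mesos; the master address and
# the executor URI are assumptions for illustration.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("mesos://10.1.2.3:5050")  # Mesos master, default port 5050
        .setAppName("mesos-test")
        # Where each slave fetches the prebuilt Spark package from
        .set("spark.executor.uri",
             "hdfs://10.1.2.3/spark-1.3.0-bin-hadoop2.4.tgz"))
sc = SparkContext(conf=conf)
```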
Add Spark to DigitalOcean cluster
Tom Faulhaber has a quick bash script for the deployment. To install Spark 1.3.0, I rewrote it as a fabfile for Python's Fabric.
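The fabfile boils down to a few run() calls. A minimal sketch, assuming Fabric 1.x and the standard Apache mirror (the exact URL and package name are assumptions, not necessarily what my script uses):

```python
# fabfile.py -- a minimal sketch with Fabric 1.x; the mirror URL and
# package name are assumptions for illustration.
from fabric.api import run, task

SPARK_PKG = "spark-1.3.0-bin-hadoop2.4"
SPARK_URL = "http://archive.apache.org/dist/spark/spark-1.3.0/%s.tgz" % SPARK_PKG

@task
def deploy_spark():
    """Download and unpack a prebuilt Spark 1.3.0 on the target host."""
    run("wget -nc %s" % SPARK_URL)
    run("tar -xzf %s.tgz" % SPARK_PKG)
    run("%s/bin/spark-submit --version" % SPARK_PKG)  # quick sanity check
```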
Then all of the deployment onto DigitalOcean is just one command line:
```bash
# 10.1.2.3 is the internal IP address of the master
fab -H 10.1.2.3 deploy_spark
```
The source code above is available on my GitHub.