Requirements
Since Spark is evolving rapidly, I need to deploy and maintain a minimal Spark cluster for testing and prototyping. A public cloud is the best fit for my current needs.
- Intranet speed: The cluster should be able to copy data quickly from one server to another. MapReduce always shuffles large chunks of data across HDFS, so SSD storage is preferable.
- Elasticity and scalability: Before scaling the cluster out to more machines, the cloud should offer enough elasticity to size the existing machines up or down.
- Locality of Hadoop: Most importantly, the Hadoop cluster and the Spark cluster should have a one-to-one mapping, as in the table below. The computation and the storage should always be on the same machines.
Hadoop | Cluster Manager | Spark | MapReduce |
---|---|---|---|
Name Node | Master | Driver | Job Tracker |
Data Node | Slave | Executor | Task Tracker |
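To make the locality requirement concrete, here is a minimal sketch of a colocation check. It assumes you already have the list of hosts running HDFS Data Nodes and the list of hosts running Spark executors; the host names and both function names are hypothetical.

```python
def colocation_report(datanode_hosts, executor_hosts):
    """Compare the hosts that store data with the hosts that compute on it."""
    storage = set(datanode_hosts)
    compute = set(executor_hosts)
    return {
        # Hosts holding HDFS blocks but running no Spark executor.
        "storage_only": sorted(storage - compute),
        # Hosts running a Spark executor but holding no HDFS blocks.
        "compute_only": sorted(compute - storage),
    }


def is_colocated(datanode_hosts, executor_hosts):
    """True when the two clusters map one-to-one onto the same machines."""
    report = colocation_report(datanode_hosts, executor_hosts)
    return not report["storage_only"] and not report["compute_only"]


if __name__ == "__main__":
    # Hypothetical host names for a cluster with three slaves.
    datanodes = ["slave-1", "slave-2", "slave-3"]
    executors = ["slave-1", "slave-2", "slave-4"]
    print(is_colocated(datanodes, executors))
    print(colocation_report(datanodes, executors))
```

Any host that shows up in either list of the report means data has to travel over the network before it can be processed.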
Choice of public cloud:
I simply compare two cloud service providers: AWS and DigitalOcean. Both have nice Python-based tools for managing and monitoring them (Boto for AWS and python-digitalocean for DigitalOcean).
- From storage to computation: Amazon S3 is a good place to keep data and load it into the Spark/EC2 cluster. Alternatively, the Spark cluster on EC2 can read an S3 bucket directly, such as s3n://file (the speed is still acceptable). On DigitalOcean, I have to upload data from my local machine to the cluster's HDFS.
- DevOps tools:
  - AWS: spark-ec2.py
    - With the default settings, running it gives you:
      - 2 HDFS instances: one persistent and one ephemeral
      - Spark 1.3 or any earlier version
      - Spark's standalone cluster manager
    - A minimal cluster with 1 master and 3 slaves consists of 4 m1.xlarge EC2 instances
      - Pros: large memory, with 15 GB per node
      - Cons: no SSD; expensive ($0.35 * 6 = $2.10 per hour)
  - DigitalOcean: https://digitalocean.mesosphere.com/
    - With the default settings, running it gives you:
      - HDFS
      - no Spark
      - Mesos
      - OpenVPN
    - A minimal cluster with 1 master and 3 slaves consists of four 2GB/2CPU droplets
      - Pros: as low as $0.12 per hour; Mesos provides fine-grained control over the cluster (down to 0.1 CPU and 16 MB of memory); the VPN is a nice security guarantee
      - Cons: small memory (2 GB per node); Spark has to be installed manually
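To put the price gap in perspective, the following sketch turns the hourly totals quoted above into the cost of a prototyping session. Only the two hourly figures come from the comparison; the 8-hour session length and the function names are assumptions for illustration.

```python
# Hourly totals for the whole minimal cluster, as quoted in the comparison (USD).
HOURLY_TOTAL = {
    "AWS": 2.10,
    "DigitalOcean": 0.12,
}


def session_cost(hourly_total, hours):
    """Cost in USD of keeping a cluster up for the given number of hours."""
    return round(hourly_total * hours, 2)


def compare(hours):
    """Session cost per provider for the same number of hours."""
    return {name: session_cost(rate, hours) for name, rate in HOURLY_TOTAL.items()}


if __name__ == "__main__":
    # Assumed session length: one 8-hour working day.
    for provider, cost in sorted(compare(8).items()):
        print("{0}: ${1:.2f}".format(provider, cost))
```

At these rates the AWS cluster costs 17.5 times as much per hour, which matters for a cluster that exists only for experiments.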
Add Spark to DigitalOcean cluster
Tom Faulhaber has a quick bash script for deployment. To install Spark 1.3.0, I rewrote it as a fabfile for Python's Fabric.
Then the whole deployment onto DigitalOcean is just one command line.
# 10.1.2.3 is the internal IP address of the master
fab -H 10.1.2.3 deploy_spark
The source code above is available on my GitHub.