A newspaper organization has an on-premises application which allows the public to search its back catalogue
and retrieve individual newspaper pages via a website written in Java. They have scanned the old newspapers
into JPEGs (approx. 17 TB) and used Optical Character Recognition (OCR) to populate a commercial search
product. The hosting platform and software are now end of life, and the organization wants to migrate its archive
to AWS with a cost-efficient architecture that is still designed for availability and durability.
Which option is the most appropriate?
Use S3 with reduced redundancy to store and serve the scanned files, install the commercial search
application on EC2 Instances and configure with auto-scaling and an Elastic Load Balancer.
Model the environment using CloudFormation, use an EC2 instance running Apache webserver and an open
source search application, stripe multiple standard EBS volumes together to store the JPEGs and search
Use S3 with standard redundancy to store and serve the scanned files, use CloudSearch for query
processing, and use Elastic Beanstalk to host the website across multiple availability zones.
Use a single-AZ RDS MySQL instance to store the search index and the JPEG images, use an EC2 instance
to serve the website and translate user queries into SQL.
Use a CloudFront download distribution to serve the JPEGs to the end users and install the current
commercial search product, along with a Java container for the website, on EC2 instances and use
Route53 with DNS round-robin.
There is no such thing as “Most appropriate” without knowing all your goals. I find your scenarios very fuzzy,
since you can obviously mix-n-match between them. I think you should decide by layers instead:
Load Balancer Layer: ELB or just DNS, or roll-your-own. (Using DNS+EIPs is slightly cheaper, but less reliable.)
Storage Layer for 17TB of Images: This is the perfect use case for S3. Off-load all the web requests directly to
the relevant JPEGs in S3. Your EC2 boxes just generate links to them.
If your app already serves its own images (not links to images), you might start with EFS. But more than likely,
you can just set up a web server to rewrite or redirect all JPEG links to S3 pretty easily.
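To make the redirect idea concrete, here is a minimal sketch of the path mapping such a rewrite rule would perform. The host name, the local path prefix, and the function name are all made up for illustration; the real layout depends on how the existing app names its pages.

```python
from typing import Optional
from urllib.parse import quote

# Hypothetical values: substitute your own image host and URL layout.
IMAGE_BASE_URL = "http://images.example.org"  # CNAME you control, pointing at S3
LOCAL_PREFIX = "/archive/pages/"              # where the old app served JPEGs from


def redirect_target(request_path: str) -> Optional[str]:
    """Return the URL to redirect to, or None if this is not a JPEG page request."""
    if not request_path.startswith(LOCAL_PREFIX):
        return None
    if not request_path.lower().endswith((".jpg", ".jpeg")):
        return None
    key = request_path[len(LOCAL_PREFIX):]
    # quote() leaves "/" alone by default, so nested keys keep their structure.
    return f"{IMAGE_BASE_URL}/{quote(key)}"


if __name__ == "__main__":
    print(redirect_target("/archive/pages/1923/04/01/p3.jpg"))
    print(redirect_target("/search"))
```

The web server (or the app itself) would answer matching requests with an HTTP redirect to the returned URL and handle everything else as before.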
If you use S3, don’t serve directly from the bucket; serve via a CNAME in a domain you control. That way, you
can switch in CloudFront easily.
EBS will be way more expensive, and you’ll need 2x the drives if you need 2 boxes. Yuck.
Consider a smaller storage format. For example, JPEG 2000 or WebP or other tools might make for smaller
images. There is also the DjVu format from a while back.
Cache Layer: Adding CloudFront in front of S3 will help people on the other side of the world — well, possibly.
Typical archives follow a power law. The long tail of requests means that most JPEGs won’t be requested
enough to be in the cache. So you are only speeding up the most popular objects. You can always wait, and
switch in CF later after you know your costs better. (In some cases, it can actually lower costs.)
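If you want a rough feel for the long-tail effect before paying for anything, you can model it. The sketch below assumes request popularity follows a Zipf law (the object at popularity rank r gets requests proportional to 1 / r**exponent), which is only an assumption about your traffic. The function name and the example sizes are made up; plug in numbers from your own access logs.

```python
def top_k_request_share(num_objects: int, cache_size: int, exponent: float = 1.0) -> float:
    """Fraction of all requests that go to the `cache_size` most popular objects.

    Assumes a Zipf-like popularity curve: weight of rank r is 1 / r**exponent.
    This is an upper bound on the hit ratio of a cache holding that many objects.
    """
    if num_objects < 1:
        raise ValueError("num_objects must be at least 1")
    if not 0 <= cache_size <= num_objects:
        raise ValueError("cache_size must be between 0 and num_objects")
    weights = [1.0 / rank ** exponent for rank in range(1, num_objects + 1)]
    return sum(weights[:cache_size]) / sum(weights)


if __name__ == "__main__":
    pages = 1_000_000  # hypothetical number of scanned pages
    for cached in (1_000, 10_000, 100_000):
        share = top_k_request_share(pages, cached)
        print(f"caching top {cached:>7} pages covers {share:.1%} of requests")
```

The flatter the curve (a smaller exponent), the less a cache of a given size buys you, which is the case where waiting on CloudFront makes the most sense.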
You can also put CloudFront in front of your app, since your archive search results should be fairly static. This
will also allow you to run with a smaller instance type, since CF will handle much of the load if you do it right.
Database Layer: A few options:
Use whatever your current server does for now, and replace it with something else down the road. Don’t
underestimate this approach; sometimes it’s better to start now and optimize later.
Use RDS to run MySQL/Postgres
I’m not as familiar with Elasticsearch / CloudSearch, but obviously CloudSearch will be less maintenance.
When creating the app layer from scratch, consider CloudFormation and/or OpsWorks. It’s extra stuff to learn,
but helps down the road.
Java+Tomcat is right up the alley of Elastic Beanstalk. (Basically EC2 + Auto Scaling + ELB.)
Preventing Abuse: When you put something in a public S3 bucket, people will hot-link it from their web pages. If
you want to prevent that, your app on the EC2 box can generate signed links to S3 that expire in a few hours.
Now everyone will be forced to go thru the app, and the app can apply rate limiting, etc.
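To show the idea behind expiring signed links, here is a minimal standard-library sketch: the app signs the path plus an expiry time with a secret, and whoever serves the file checks both. This is NOT the signature scheme S3 uses. With real S3 you would let the AWS SDK generate a presigned URL instead (in boto3 that is the S3 client's `generate_presigned_url`). The secret, parameter names, and function names below are all made up.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET_KEY = b"change-me"  # hypothetical secret, known only to the app


def _signature(path: str, expires: int) -> str:
    message = f"{path}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_link(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return `path` with an expiry timestamp and an HMAC signature appended."""
    current = time.time() if now is None else now
    expires = int(current + ttl_seconds)
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{path}?{query}"


def verify_link(url: str, now: Optional[float] = None) -> bool:
    """Accept the link only if it has not expired and the signature matches."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    try:
        expires = int(query["expires"][0])
        signature = query["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    expected = _signature(parts.path, expires)
    # Constant-time comparison, so the check does not leak how much matched.
    return hmac.compare_digest(signature.encode(), expected.encode())


if __name__ == "__main__":
    link = sign_link("/pages/1923/04/01/p3.jpg", ttl_seconds=3 * 3600)
    print(link, verify_link(link))
```

Because the expiry is part of the signed message, a hot-linker cannot extend a link by editing the timestamp; they have to come back through the app for a fresh one.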
Saving money: If you don’t mind having downtime:
Run everything in one AZ (both DBs and EC2s). You can always add servers and AZs down the road, as long as
it’s architected to be stateless. In fact, you should use multiple regions if you want it to be really robust.
Use Reduced Redundancy in S3 to save a few hundred bucks per month. (Someone will have to “go fix it” every
time it breaks, including having an off-line copy to repair S3.)
Buy Reserved Instances on your EC2 boxes to make them cheaper. (Start with the RI market and buy a
partially used one to get started.) It’s just a coupon saying “if you run this type of box in this AZ, you will save on
the per-hour costs.” You can get 1/2 to 1/3 off easily.
Rewrite the application to use less memory and CPU – that way you can run on fewer/smaller boxes. (May or
may not be worth the investment.)
If your app will be used very infrequently, you will save a lot of money by using Lambda. I’d be worried that it
would be quite slow if you tried to run a Java application on it though.
We’re missing some information like load, latency expectations from search, indexing speed, size of the search
index, etc. But with what you’ve given us, I would go with S3 as the storage for the files (S3 rocks. It is really,
really awesome). If you’re stuck with the commercial search application, then on EC2 instances with
autoscaling and an ELB. If you are allowed an alternative search engine, Elasticsearch is probably your best
bet. I’d run it on EC2 instead of the AWS Elasticsearch service, as IMHO it’s not ready yet. Don’t autoscale
Elasticsearch automatically though; it’ll cause all sorts of issues. I have zero experience with CloudSearch, so I
can’t comment on that. Regardless of which option, I’d use CloudFormation for all of it.