Have you ever wondered what technology some of the really big websites use? The likes of Digg, YouTube, MySpace and so on?
There is a very interesting website called High Scalability that is dedicated to, as they put it themselves, “building bigger, faster, more reliable websites.” They collect information about the architecture of high-traffic websites to serve as examples to others.
Underlying technology breakdown
We used some of the data from High Scalability to create a table with the OS, web server, scripting language and database used by nine of the largest websites in the world.
The ones we selected were Flickr, YouTube, PlentyOfFish, Digg, TypePad, LiveJournal, Friendster, MySpace and Wikipedia.
Quick Overview
OS: Linux 7 - Windows 2
Web server: Apache 7 - IIS 2 - Lighttpd 2
Scripting: PHP 4 - Perl 4 - ASP.NET 2 - Python 1 - Java 1
Database: MySQL 7 - SQL Server 1 (possibly 2)
Five of the sites use Memcached, a memory caching system originally developed by LiveJournal that has become a popular way to ease the load on back-end systems such as databases.
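To make the caching idea concrete, here is a minimal sketch of the read-through ("cache-aside") pattern that a Memcached layer is typically used for. It is only an illustration: TinyCache is an in-process stand-in for a real Memcached client, and SlowDatabase, get_user and the "user:<id>" key format are hypothetical names invented for this example, not anything taken from the sites above.

```python
import time


class TinyCache:
    """In-process stand-in for a Memcached client (assumption: get/set with a TTL)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds=60):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


class SlowDatabase:
    """Hypothetical database that counts how often it is queried."""

    def __init__(self, rows):
        self._rows = rows
        self.queries = 0

    def load_user(self, user_id):
        self.queries += 1
        return self._rows.get(user_id)


def get_user(user_id, cache, db):
    """Check the cache first; fall back to the database only on a miss."""
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:
        user = db.load_user(user_id)
        if user is not None:
            cache.set(key, user)
    return user


db = SlowDatabase({1: {"name": "alice"}})
cache = TinyCache()
first = get_user(1, cache, db)   # miss: goes to the database
second = get_user(1, cache, db)  # hit: served from memory
```

After the two calls, db.queries is 1: the second read never reaches the database, which is the load reduction these sites are after.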
Note that not all information at the High Scalability website is complete (but it’s still a great resource).
Looking at these architectures, some observations come to mind. Most of these sites use LAMP as the core runtime stack. Some companies have gone so far as to develop their own file system (Google with GFS, for example). Some use caching to relieve the database bottleneck (Memcached and the like). Many of them were forced to develop these solutions themselves because, at the time, no ready-made alternative could meet their requirements.
The application stack of these Web applications is very different from the stack used to build mission-critical applications in the financial world. In the financial world, Java (and to a lesser degree J2EE) is used extensively. In recent years, scalability requirements in capital markets have led to a rapid shift in the middleware stack. Compute Grid solutions were introduced to virtualize CPU resources and enable the parallelization of batch applications. Data Grids were also introduced, enabling the virtualization of memory resources. Spring is becoming the common development framework in this world. At GigaSpaces, we're seeing more and more cases where Spring acts as a complete alternative to J2EE.
If we examine the two worlds, we can see that they face similar scalability challenges. Not surprisingly, they ended up introducing similar solutions to address them:
On the Data Tier we see the following:
1. Adding a caching layer to take advantage of available memory and reduce I/O overhead
2. Moving from a centralized, database-centric approach to partitioning the data across multiple databases (also known as sharding)
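As a rough illustration of the partitioning idea, the sketch below routes each key to one of several shards by hashing it. Everything here is an assumption made for the example: the shards are plain dictionaries standing in for separate database servers, and shard_for and ShardedStore are hypothetical names rather than any particular site's implementation.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Hypothetical store where each shard stands in for a separate database."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(shard_count=4)
for name in ["alice", "bob", "carol", "dave"]:
    store.put(name, {"name": name})
```

Each read or write touches only one shard, so adding shards spreads the load. One design caveat: with simple modulo routing, changing the shard count moves most keys to a different shard, which is why schemes such as consistent hashing are often used instead.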
On the Business Logic Tier:
3. Adding parallelization semantics to the application tier (e.g., MapReduce)
4. Moving to scale-out application models to achieve linear scalability
5. Moving away from classic two-phase commit and XA for transaction processing (see: Lessons from Pat Helland: Life Beyond Distributed Transactions)
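The parallelization semantics mentioned in item 3 can be sketched in a few lines. This is a toy, single-machine illustration of the MapReduce style, not the real thing: the input is split into chunks, each chunk is counted independently (the map step), and the partial results are merged (the reduce step). The function names and the sample log lines are made up for the example, and a thread pool stands in for the cluster of machines a real system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(lines, workers=4, chunk_size=2):
    # Split the input into chunks that can be processed in parallel.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


log_lines = [
    "GET /home",
    "GET /photos",
    "POST /photos",
    "GET /home",
]
totals = word_count(log_lines)
```

Because the map step for one chunk never depends on another chunk, the same structure scales out by handing chunks to more workers, which is the link to the scale-out model in item 4.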
While there are many similar challenges and, to a certain degree, similar architectures, the two worlds (Web and financial) appear to have taken different routes when it comes to the application stack.
Over at the High Scalability site, someone posted the question: “Why doesn't anyone use j2ee?”
The answer given in that post can be summarized as follows:
1. LAMP provides a cost-effective solution (most of it relies on a *free* open source stack).
2. Java is still used, but not as the primary language, i.e., it is used as one component either in the back-end or the front-end (e.g., servlets).
Finding out more
If you want to read more about these websites, we highly recommend that you head on over to High Scalability. They have a thorough breakdown of the architecture and design choices for each one.