On the most fundamental level, a web application consists of client-side code and markup, server-side code, and a database. Many small websites and small-to-medium enterprise-level browser-based applications consist of nothing other than these components. However, on its own, this basic setup is not suitable for a big data application.
Big data is not just a large volume of normal data. The Kommando Tech website has published an interesting article that outlines the challenges that big data introduces.
According to the article, there are four fundamental dimensions to consider:
- Volume: Big data is just big.
- Variety: Big data is highly varied and diverse.
- Velocity: Big data is growing at exponential speed.
- Veracity: The accuracy of big data can vary greatly.
To cope with these dimensions and take full advantage of big data, a web application needs to provide high availability, high performance, and analytics.
A big data application must be available at any time, regardless of how many users are using it simultaneously. This is why it is absolutely crucial that it doesn’t have any single points of failure. High availability cannot be achieved if either all of the code or the entire database exists on a single server as a single instance.
When you start typing a new email in Gmail, it instantly saves a draft on the server and constantly updates it while you are making further changes. When you write a post on Facebook, all of your friends can see it instantly. This level of performance is achieved despite the fact that both of the apps have millions of user interactions at any given time.
To achieve this level of performance, a web application shouldn’t have any bottlenecks. Therefore, once again, having a single instance of either the code base or the database is not suitable.
Also, there is a limit on how much hardware you can keep adding to a single server to increase its performance. Each upgrade yields a smaller gain than the previous one, so at some point adding more hardware results in diminishing returns while the usage of the app keeps growing.
As an organization processes large quantities of user-generated data, it needs to take full advantage of it, both to help the organization to make effective business decisions and to deliver the most relevant content to the users of its products.
Technical solutions to dealing with big data
These are some of the best techniques that developers of big data applications use to achieve the goals of high availability, high performance, and analytics:
Instead of having a single server to execute the logic within a web application, the same code may be duplicated on several servers. When a request from a user’s browser is received, the system sends it to a server that is not overloaded with other requests. As the load is distributed between many servers, the system never gets into a state where response time increases to an unacceptable level.
There are many ways to implement load balancing. Microsoft Azure, for example, has a service that constantly sends requests to each server within a cluster to check if the server would be able to accept an incoming request from a user.
The exact parameters of a good server state can be configured. For example, a server may be set to not respond with an OK status code if its current CPU usage is above 90%. If a particular server does not respond with an OK status code within an acceptable time frame, the system may be set to try to resolve any issues with the server, but the immediate action would be to stop sending any users’ requests to that server and choose a different server instead.
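The health-check behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Azure or any other product implements it: the `Server` and `LoadBalancer` classes, the server names, and the in-memory `health_check` method (standing in for a real network probe) are all assumptions made for the example. The 90% CPU limit comes from the example in the text.

```python
from dataclasses import dataclass
from typing import List, Optional

CPU_LIMIT = 90.0  # percent; threshold taken from the example in the text


@dataclass
class Server:
    """Hypothetical application server, modelled in memory."""
    name: str
    cpu_usage: float       # percent, as the server would report it
    reachable: bool = True

    def health_check(self) -> bool:
        # Stand-in for a network probe: True means the server answered OK.
        return self.reachable and self.cpu_usage <= CPU_LIMIT


class LoadBalancer:
    """Round-robin balancer that skips servers failing the health check."""

    def __init__(self, servers: List[Server]) -> None:
        self._servers = servers
        self._next = 0  # index of the server to try first

    def pick(self) -> Optional[Server]:
        # Try every server once, starting from the round-robin position.
        for offset in range(len(self._servers)):
            index = (self._next + offset) % len(self._servers)
            candidate = self._servers[index]
            if candidate.health_check():
                self._next = (index + 1) % len(self._servers)
                return candidate
        return None  # no healthy server is available


pool = [Server("web-1", 35.0), Server("web-2", 97.0), Server("web-3", 60.0)]
balancer = LoadBalancer(pool)
# web-2 is above the CPU limit, so requests alternate between web-1 and web-3.
print([balancer.pick().name for _ in range(4)])
```

A production load balancer would run the probes on a timer in the background rather than on every request, but the routing decision is the same: unhealthy servers are simply left out of the rotation.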
Load balancing provides both performance and availability. If implemented properly, none of the servers within a load-balanced system are allowed to process more than they can handle without significantly slowing down. Likewise, if any server becomes unavailable, the system would still be able to cope with the remaining servers.
Databases can also be load-balanced, but this is mainly done for availability rather than performance. With Microsoft SQL Server Always On availability groups, for example, a .NET application that uses Entity Framework can configure an execution strategy. Such a strategy can, for instance, resubmit a SQL command to the database cluster if the server that the application was connected to suddenly became unavailable while the command was first being executed.
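The idea of resubmitting a command after a failover is not specific to .NET. The sketch below shows the general pattern in Python; it is not the Entity Framework API. The `TransientDatabaseError` exception, the `execute_with_retry` function, and its parameters are hypothetical names chosen for the example, and the exponential back-off between attempts is an assumption rather than something the text prescribes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientDatabaseError(Exception):
    """Hypothetical error raised when the connected node drops mid-command."""


def execute_with_retry(
    command: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `command`, resubmitting it if a transient failure occurs.

    The delay doubles after each failed attempt so that the cluster has
    time to promote another node before the command is sent again.
    """
    attempt = 0
    while True:
        try:
            return command()
        except TransientDatabaseError:
            if attempt >= max_retries:
                raise  # give up and let the caller handle the failure
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

A caller would wrap each database operation in a function and pass it in, for example `execute_with_retry(lambda: save_order(order))`, where `save_order` is whatever function issues the SQL command. Retrying is only safe if the command can be repeated without side effects, or if it runs inside a transaction that was rolled back when the connection dropped.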
If a system has a very large number of users who constantly generate a very large amount of content, you cannot just use one huge database. Writing data to a disk and reading data from it are relatively expensive operations. So is performing a scan on a huge database table. Therefore, even if you had a very powerful server, your system would grind to a halt due to the sheer number of simultaneous reads, writes, and table scans.
Load balancing provides only a partial solution to this problem. Although it reduces the number of simultaneous reads and writes on each server, it does not eliminate the need to scan large tables or indexes. Therefore, as more and more data is generated, the performance will still eventually decrease to unacceptable levels.
One of the best solutions for this is database sharding, which is also known as horizontal partitioning or a “shared-nothing” architecture.
This is a technique of having several databases with the same schema, but different data rows. For example, if an application has millions of users, the users will be split into subsets, each of which will be stored in a separate database on its own server, while each of the databases will be a part of the same cluster and have the same data structure.
Just like a row in a database table is identified by a primary key, each database in the cluster is identified by its own key (often called a shard key), which enables the system to quickly identify which cluster node needs to be queried when a particular request is submitted. For example, users may be grouped by geographic location, so a location code would form the unique identifier of a cluster node. This model is known as KKV or Key Key Value and is widely implemented by database systems that have been specifically designed for sharding, such as Cassandra.
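The two-level lookup can be illustrated with a minimal in-memory sketch in Python. The first key (a location code, as in the example above) selects the cluster node, and the second key selects the row on that node. The `ShardedUserStore` class, the node names, and the location codes are all hypothetical, and plain dictionaries stand in for real database servers.

```python
from typing import Any, Dict

# Hypothetical mapping from a location code to the node that owns it.
SHARD_MAP: Dict[str, str] = {"EU": "db-eu", "US": "db-us", "APAC": "db-apac"}


class ShardedUserStore:
    """Routes each request to one node based on the user's location code."""

    def __init__(self, shard_map: Dict[str, str]) -> None:
        self._shard_map = dict(shard_map)
        # Each node is modelled as a dictionary: user id -> user record.
        self._nodes: Dict[str, Dict[int, Any]] = {
            node: {} for node in self._shard_map.values()
        }

    def node_for(self, location_code: str) -> str:
        # First key: find the node without touching any other node.
        try:
            return self._shard_map[location_code]
        except KeyError:
            raise ValueError(f"no shard for location {location_code!r}") from None

    def save(self, location_code: str, user_id: int, record: Any) -> None:
        self._nodes[self.node_for(location_code)][user_id] = record

    def load(self, location_code: str, user_id: int) -> Any:
        # Second key: look up the row on the selected node only.
        return self._nodes[self.node_for(location_code)][user_id]

    def rows_on(self, node: str) -> int:
        return len(self._nodes[node])


store = ShardedUserStore(SHARD_MAP)
store.save("EU", 1, {"name": "Anna"})
store.save("US", 2, {"name": "Ben"})
print(store.node_for("EU"), store.load("EU", 1))
```

Because every request carries the location code, a read or write touches only one node, and no node ever has to scan the other nodes' rows.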
Although this architecture is referred to as “shared-nothing”, the name doesn’t describe a sharded database structure entirely accurately. To speed up queries, some data is duplicated across the nodes, mainly lookup tables and reference data records.
Of course, sharding doesn’t eliminate the need for load balancing, as it improves performance but has virtually no effect on availability. Therefore, software architects need to ensure that every node within the cluster is sufficiently replicated.
Making sense of the data
Any organization that manages big data aims to take full advantage of it. Big data analytics is therefore required to uncover hidden patterns in users’ behavior. This is why you see personalized content on various websites and ads that are relevant to what you have been searching for on the web. On a larger scale, big data analytics is what drives business decisions for the companies that manage the data.
The field of big data analysis is vast. Some of it is automated via various business intelligence tools. HBase, for example, a NoSQL database that forms a part of the Apache Hadoop ecosystem, comes with its own set of analytic tools. So do more traditional relational database systems such as SQL Server and Oracle. However, as well as using these tools, many organizations develop bespoke in-house solutions with the help of specially qualified data scientists.
This article was only a brief introduction to how big data applications are built. If you are interested in finding out more, you will find the following links useful:
Facebook: Science and Social Graph (Video)