Big Data: What, Why, and How?
A single piece of data may seem useless on its own, but it holds great value for industry: every day we generate tonnes of data, which is recorded and analyzed for the information it contains.
Big Data is a collection of such data sets that are too large and complex to be processed by traditional methods.
A recent study estimated the amount of data generated in a single day:
- 500 million tweets are sent
- 294 billion emails are sent
- 4 petabytes of data are created on Facebook
- 4 terabytes of data are created from each connected car
- 65 billion messages are sent on WhatsApp
- 5 billion searches are made
By 2025, it’s estimated that 463 exabytes of data will be created each day globally. At 4.7 GB per single-layer DVD, that is roughly 98 billion DVDs’ worth of data every day.
Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources. Big data was originally associated with three key concepts: volume, variety, and velocity. Two more concepts, veracity and value, were added later.
Volume:
Volume refers to the sheer amount of data.
Size plays a crucial role in determining the value of data: a data set is generally considered ‘Big Data’ only when its volume is too large for traditional systems to handle. Whether particular data counts as Big Data therefore depends on its volume.
Hence, volume is an essential characteristic to consider when dealing with Big Data.
Velocity:
Velocity refers to the high speed of accumulation of data.
In Big Data, data flows in at high speed from sources like machines, networks, social media, and mobile phones.
This massive and continuous flow determines how fast data is generated and must be processed to meet demand.
Sampling the stream is one way to deal with the velocity problem: instead of processing every record, you keep a statistically representative subset.
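The sampling idea can be sketched with reservoir sampling, a classic technique for keeping a fixed-size uniform sample of a stream whose length is unknown in advance (the event names below are invented for illustration):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# A high-velocity stream we could never hold in memory all at once
events = (f"event-{n}" for n in range(1_000_000))
print(reservoir_sample(events, 5))
```

The key property is that every element of the stream ends up in the sample with equal probability, no matter how long the stream turns out to be.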
Variety:
Variety refers to the nature of the data: structured, semi-structured, and unstructured.
It also refers to the heterogeneous sources the data arrives from, both inside and outside an enterprise.
Structured data: organized data with a defined length and format, typically rows and columns with a fixed schema.
Semi-structured data: partially organized data that does not conform to a formal schema but carries self-describing markers; log files and JSON are typical examples.
Unstructured data: unorganized data that doesn’t fit neatly into the traditional rows and columns of a relational database; text, pictures, and videos are typical examples.
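A minimal Python sketch of the three kinds of data (all the values here are invented for illustration):

```python
import csv
import io
import re

# Structured: fixed schema, rows and columns (e.g. CSV)
csv_text = "id,name,amount\n1,alice,30\n2,bob,45\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing markers but no fixed schema (e.g. a log line)
log_line = "2024-05-01 12:00:03 INFO user=alice action=login"
match = re.search(r"user=(\w+) action=(\w+)", log_line)
event = {"user": match.group(1), "action": match.group(2)}

# Unstructured: free text, with no rows or columns to query directly
review = "Fast delivery, but the packaging was damaged."
word_count = len(review.split())

print(rows[0]["name"], event, word_count)
```

Notice how each step down the list needs more custom work before the data can be queried at all.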
Veracity:
It refers to the inconsistencies and uncertainty in data: the data that is available can be messy, and its quality and accuracy are difficult to control.
Big Data is also highly variable because of the many data dimensions that result from multiple disparate data types and sources.
Value:
After taking the four V’s into account, there is one more V, which stands for Value. A bulk of data is of no good to a company unless it is turned into something useful.
Data in itself is of no use; it must be converted into something valuable to extract information. Hence, Value is often called the most important of all the 5 V’s.
Some Applications of Big Data:
Let’s see some examples of how well-known companies use Big Data:
Netflix:
The premise of Netflix’s first original TV show — the David Fincher-directed political thriller House of Cards — had its roots in big data. Netflix invested $100 million in the first two seasons of the show, which premiered in 2013, because consumers who watched the British series House of Cards also watched movies directed by David Fincher and starring Kevin Spacey. Executives correctly predicted that a series combining all three would be a hit.
Now, seven years later, big data impacts not only which series Netflix invests in, but how those series are presented to subscribers. Viewing histories, including the points at which users hit pause in any given show, reportedly influence everything from the thumbnails that appear on their homepages to the contents of the “Popular on Netflix” section.
Google:
Indexed pages:
Indexed pages are the collection of web pages stored to respond to search queries. Indexing is the process of adding web pages to the Google search index. It involves assigning keywords or phrases to web pages within a metadata tag (meta tag) so that a page can be retrieved easily by a search engine tailored to search the keyword field.
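At its core, a search index is an inverted index: a mapping from each keyword to the set of pages containing it. A toy sketch (the page names and contents are invented, and real indexing involves far more, such as stemming and ranking):

```python
from collections import defaultdict

pages = {
    "page1": "big data tools for data analysis",
    "page2": "machine learning with big data",
    "page3": "cooking recipes for beginners",
}

# Build the inverted index: each word maps to the pages that contain it
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(query):
    """Return pages containing every word of the query."""
    words = query.split()
    results = index[words[0]].copy() if words else set()
    for word in words[1:]:
        results &= index[word]
    return sorted(results)

print(search("big data"))  # → ['page1', 'page2']
```

The lookup never scans page text at query time; it only intersects precomputed sets, which is what makes web-scale search fast.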
Real-time Data Feeds:
Although it doesn’t promote itself as such, Google is actually a collection of data and a set of tools for working with it. It has progressed from an index of web pages to a central hub for real-time data feeds on just about anything that can be measured such as weather reports, travel reports, stock market and shares, shopping suggestions, travel suggestions, and several other things.
Sorting Tools:
Big Data analysis, which means using tools designed to handle and make sense of this massive data, comes into play whenever a user runs a search query. Google’s algorithms run complex calculations to match the query the user entered against all the available data. They try to determine whether the user is searching for news, people, facts, or statistics, and retrieve the data from the appropriate feed.
Knowledge Graph Pages:
Google Knowledge Graph is a database that collects facts about people, places, and things, along with the relationships between them. Google then uses it to answer queries with useful, relevant information quickly and easily.
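A knowledge graph can be pictured as a set of (subject, relation, object) triples. A toy sketch, not Google's actual data model, with a handful of well-known facts:

```python
# Entities and labeled relations stored as (subject, relation, object) triples
triples = [
    ("Leonardo da Vinci", "painted", "Mona Lisa"),
    ("Mona Lisa", "located_in", "Louvre"),
    ("Louvre", "located_in", "Paris"),
]

def answer(subject, relation):
    """Return all objects linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]

print(answer("Leonardo da Vinci", "painted"))  # → ['Mona Lisa']
print(answer("Mona Lisa", "located_in"))       # → ['Louvre']
```

Because relations are explicit, queries can also be chained (painting to museum to city), which is what lets a knowledge graph answer questions rather than just match keywords.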
Amazon:
Amazon uses Big Data gathered from customers while they browse to build and fine-tune its recommendation engine. The more Amazon knows about you, the better it can predict what you want to buy. And, once the retailer knows what you might want, it can streamline the process of persuading you to buy it — for example, by recommending various products instead of making you search through the whole catalogue.
Amazon’s recommendation technology is based on collaborative filtering, which means it decides what it thinks you want by building up a picture of who you are, then offering you products that people with similar profiles have purchased.
Amazon gathers data on every one of its customers while they use the site. As well as what you buy, the company monitors what you look at, your shipping address (Amazon can take a surprisingly good guess at your income level based on where you live), and whether you leave reviews/feedback.
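A minimal sketch of the collaborative-filtering idea, not Amazon's actual algorithm: build a profile from each customer's purchase history, find the most similar other customer, and recommend what they bought (all names and products below are invented):

```python
# Each profile maps a customer to the set of products they bought
purchases = {
    "ann":  {"book", "kettle", "lamp"},
    "ben":  {"book", "kettle", "headphones"},
    "cara": {"yoga mat", "kettle"},
}

def jaccard(a, b):
    """Similarity between two purchase histories: shared items over all items."""
    return len(a & b) / len(a | b)

def recommend(user):
    """Suggest items bought by the most similar other customer."""
    others = [u for u in purchases if u != user]
    nearest = max(others, key=lambda u: jaccard(purchases[user], purchases[u]))
    return sorted(purchases[nearest] - purchases[user])

print(recommend("ann"))  # → ['headphones']
```

Production systems use far richer signals (views, ratings, time) and item-to-item variants, but the core "people like you bought this" logic is the same.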
Facebook:
Tracking cookies:
Facebook tracks its users across the web by using tracking cookies. If a user is logged into Facebook and simultaneously browses other websites, Facebook can track the sites they are visiting.
Facial recognition:
One of Facebook’s latest investments has been in facial recognition and image processing capabilities. With the image data users share, Facebook can recognize its users across the internet and across other Facebook profiles.
Tag suggestions:
Facebook suggests who to tag in user photos through image processing and facial recognition.
Analyzing the Likes:
A study by researchers at Cambridge University and Microsoft Research showed that a range of highly sensitive personal attributes can be predicted accurately just by analyzing a user’s Facebook Likes. Patterns of Likes can predict sexual orientation, satisfaction with life, intelligence, emotional stability, religion, alcohol and drug use, relationship status, age, gender, race, and political views, among many others.
Internet of Things (with AI):
IoT will generate a vast amount of data, and in today’s world, well-analyzed data is extremely valuable.
Big data analytics tools have the capability to handle large volumes of data generated from IoT devices.
IoT delivers the data collected from various sensors and the big-data analytics tools can be used to store and create insights from this information.
Unique IoT services create niche opportunities to improve customer value.
For example, the enormous amount of data gathered from sensors can be analyzed in real time to derive conclusions that can help to make informed decisions for continuous improvements within operations.
The patterns and trends observed from the enormous amount of data are used by several machine learning algorithms and help in making predictive analytics with data-based learning.
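As a toy illustration of turning sensor data into decisions, here is a rolling-mean anomaly detector over a hypothetical temperature stream (the window size and threshold are arbitrary choices, not values from any real system):

```python
def rolling_anomalies(readings, window=5, threshold=5.0):
    """Flag indices whose reading deviates from the rolling mean by > threshold."""
    anomalies = []
    for i in range(window, len(readings)):
        recent = readings[i - window:i]
        mean = sum(recent) / window
        if abs(readings[i] - mean) > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical temperature stream from one IoT sensor; index 7 spikes
temps = [21.0, 21.2, 20.9, 21.1, 21.0, 21.3, 21.1, 30.5, 21.2, 21.0]
print(rolling_anomalies(temps))  # → [7]
```

Real deployments would use streaming frameworks and learned models rather than a fixed threshold, but the pattern (compare each new reading against recent history, act on deviations) is the same.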
Now that we know what Big Data is and what it is used for, let’s see how Big Data analysis is done. There are multiple tools available; let’s look at some of the top ones:
Apache Hadoop:
Hadoop is an open-source framework written in Java that provides cross-platform support.
It is arguably the best-known big data tool; in fact, over half of the Fortune 50 companies use Hadoop. Big names include Amazon Web Services, Hortonworks, IBM, Intel, Microsoft, Facebook, etc.
Pros:
The core strength of Hadoop is HDFS (Hadoop Distributed File System), which can hold all types of data (video, images, JSON, XML, and plain text) on the same file system.
Highly scalable and provides quick access to data.
Highly-available service resting on a cluster of computers
Cons:
Disk space issues can arise due to its default 3x data replication.
I/O operations could have been optimized for better performance.
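Hadoop’s programming model is MapReduce. A pure-Python simulation of its classic word-count example (this is not actual Hadoop code, just the mapper/reducer contract in miniature):

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    """Map: emit (word, 1) pairs for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["Big data needs big tools", "Hadoop handles big data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
print(reduce_phase(pairs))
```

On a real cluster, the mapper runs on many nodes in parallel and the framework shuffles pairs with the same key to the same reducer; that shuffle step is what this single-process sketch hides.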
Cassandra:
Apache Cassandra is a free, open-source, distributed NoSQL DBMS built to manage huge volumes of data spread across numerous commodity servers while delivering high availability. It uses CQL (Cassandra Query Language) to interact with the database.
Some of the high-profile companies using Cassandra include Accenture, American Express, Facebook, General Electric, Honeywell, Yahoo, etc.
Pros:
No single point of failure.
Handles massive data very quickly.
Log-structured storage and automated replication
Linear scalability and Simple Ring architecture
Cons:
Requires some extra effort in troubleshooting and maintenance.
Clustering could have been improved.
No row-level locking feature.
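Cassandra’s ring architecture assigns each key to a node via consistent hashing. A toy sketch of the idea (one token per node and no replication, unlike a real cluster):

```python
import bisect
import hashlib

class HashRing:
    """Minimal hash ring: each node owns the arc of keys up to its position."""

    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first node at or after the key's position
        i = bisect.bisect(self.positions, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", ring.node_for(key))
```

The payoff is that adding or removing a node only moves the keys on one arc of the ring, rather than rehashing everything, which is what makes linear scalability practical.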
MongoDB:
MongoDB is a NoSQL, document-oriented database written in C, C++, and JavaScript. It is free to use, is open source, and supports multiple operating systems, including Windows Vista (and later), OS X (10.7 and later), Linux, Solaris, and FreeBSD.
Its main features include aggregation, ad-hoc queries, the BSON storage format, sharding, indexing, replication, server-side execution of JavaScript, schemaless design, capped collections, the MongoDB Management Service (MMS), load balancing, and file storage.
Some of the major customers using MongoDB include Facebook, eBay, MetLife, Google, etc.
Pros:
Easy to learn and install.
Provides support for multiple technologies and platforms.
Cons:
Limited analytics.
Slow for certain use cases.
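The document model can be illustrated with a tiny in-memory sketch that mimics MongoDB-style `find` filters (this is plain Python, not the pymongo API, and the documents are invented):

```python
# In-memory document collection: schemaless dicts, like MongoDB documents
collection = [
    {"_id": 1, "name": "alice", "city": "Pune", "age": 31},
    {"_id": 2, "name": "bob", "city": "Delhi"},            # no "age" field
    {"_id": 3, "name": "cara", "city": "Pune", "age": 27},
]

def find(query):
    """Return documents whose fields match every key/value in `query`."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

print(find({"city": "Pune"}))             # matches alice and cara
print(find({"city": "Pune", "age": 27}))  # matches cara only
```

Note that bob’s document simply omits the `age` field: in a schemaless store, documents in the same collection need not share a structure, which is exactly the flexibility (and the analytics headache) the pros and cons above describe.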
Other tools such as Xplenty, Qubole, HPCC, Apache Storm, CouchDB, Cloudera, Flink, Kaggle, Tableau, and Kafka can also be used.
Thank you for your time; I hope you liked it.