Predictive analytics
Predictive analytics uses big data to look for patterns that can help predict what will happen in the future. For example, companies and investors can analyse what sales and prices have been in past years to forecast what they are likely to be next year. This helps with business decision making, because it gives an estimate of what will probably happen.
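As a toy illustration, here is a minimal sketch (in Python, with made-up yearly sales figures) of fitting a trend line to past data and extrapolating it one year ahead:

```python
# A minimal sketch of predictive analytics: fit a trend line to
# past yearly sales (made-up figures) and extrapolate one year ahead.
import numpy as np

years = np.array([2015, 2016, 2017, 2018])
sales = np.array([120, 135, 160, 170])  # hypothetical unit sales

# Fit a straight line (degree-1 polynomial) through the historical data.
slope, intercept = np.polyfit(years, sales, 1)

# Use the fitted pattern to predict the next year's sales.
prediction = slope * 2019 + intercept
print(f"Predicted 2019 sales: {prediction:.0f}")
```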
Diagnostic analytics
Diagnostic analytics is similar to predictive analytics in that it predicts a value, but the value does not have to lie in the future. Any two variables can be compared: using the existing points on a graph, the location of an unmapped value can be estimated from where it would most likely sit. For example, the number of hand-held fans a shop sells could depend on the temperature that day. Using previous sales and temperatures, we can estimate how many fans are likely to be sold at a given temperature: at 5 degrees you might expect 1 sale, while at 30 degrees you might expect 400.
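A minimal sketch of that estimate, interpolating between made-up temperature and sales figures to predict sales on a day that was never actually observed:

```python
# A minimal sketch of estimating an unmapped value from two known
# variables, using the fan-sales example (all figures are made up).
import numpy as np

temperatures = np.array([5, 10, 15, 20, 25, 30])     # degrees Celsius
fan_sales    = np.array([1, 10, 40, 100, 220, 400])  # sales that day

# Interpolate between the recorded points to estimate sales
# on a 22-degree day that never appears in the data.
estimate = np.interp(22, temperatures, fan_sales)
print(f"Estimated sales at 22 degrees: {estimate:.0f}")
```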
Descriptive analytics
Descriptive analytics is when we take data, map it onto a graph and see whether there is any relationship between two variables. If a relationship is found, we describe what we think it is. Looking at the previous example, we can say that the warmer the weather is on a particular day, the more fans the shop will sell.
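Using the same made-up fan-sales figures, a minimal sketch of plotting the data and summarising the relationship:

```python
# A minimal sketch of descriptive analytics: plot the fan data and
# summarise the relationship with a correlation coefficient.
import numpy as np
import matplotlib.pyplot as plt

temperatures = np.array([5, 10, 15, 20, 25, 30])
fan_sales = np.array([1, 10, 40, 100, 220, 400])

# A correlation near +1 supports "the warmer the day, the more fans sold".
r = np.corrcoef(temperatures, fan_sales)[0, 1]
print(f"Correlation between temperature and sales: {r:.2f}")

plt.scatter(temperatures, fan_sales)
plt.xlabel("Temperature (°C)")
plt.ylabel("Fans sold")
plt.show()
```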
Prescriptive analytics
Prescriptive analytics is when we look at a relationship and make decisions based on it and on what we can predict. Because predictions can be made from the relationship, it can be used to steer towards the best outcome: businesses can use prescriptive analytics to maximise profits, and healthcare professionals can use it to give people the most effective treatment. Going back to the shop example above, we can decide that the business should buy fewer hand-held fans for its stock during the cold winter and make sure plenty of fans are stocked in the summer months to maximise profits.
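As a toy illustration, here is a minimal sketch (with made-up monthly forecasts) of turning predictions into a stocking decision:

```python
# A minimal sketch of prescriptive analytics: turn a monthly sales
# forecast (made-up figures) into a stocking decision for the shop.
forecast = {"January": 5, "April": 60, "July": 380, "October": 25}

for month, predicted_sales in forecast.items():
    # Simple illustrative rule: stock 20% above predicted demand, so
    # winter months get almost no fans and summer months get many.
    order = round(predicted_sales * 1.2)
    print(f"{month}: predicted {predicted_sales} sales -> order {order} fans")
```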
Wednesday, May 29, 2019
Business example
An example of a company using big data for business is Netflix. With over 100,000,000 subscribers, it has a very large pool of data to analyse.
Netflix uses data from each user to show them their own recommended shows. It does this by looking at the user's previous watch history and search history to help predict what the user may be interested in, increasing the likelihood that the user will watch the recommended content. This increases Netflix's profit, as people are likely to watch for longer when they have been recommended shows they will like.
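Netflix's real recommendation system is far more sophisticated and not public, but the toy sketch below, with made-up viewing histories, illustrates the general idea of recommending what similar users have watched:

```python
# A toy sketch of history-based recommendation (NOT Netflix's actual
# method): score each unseen show by how many users with overlapping
# taste have watched it.
my_history = {"Stranger Things", "Black Mirror"}

other_users = [
    {"Stranger Things", "Black Mirror", "Dark"},
    {"Stranger Things", "The Crown"},
    {"Black Mirror", "Dark", "Ozark"},
]

scores = {}
for history in other_users:
    if my_history & history:  # this user shares at least one show with me
        for show in history - my_history:
            scores[show] = scores.get(show, 0) + 1

# Recommend the unseen shows most popular among similar users.
for show, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(show, score)
```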
Characteristics of Big Data Analysis
A system dealing with big data must be able to support all sorts of data, both structured and unstructured, including text and video formats. This is because most big data is not in a traditional form like a table; it is mostly unstructured. The system also needs to be able to process the data in real time, because data flows in constantly and must be handled before it becomes backlogged. Finally, the system must be able to store very large amounts of data: big data can run to petabytes in size, so analytics companies need storage systems in place such as data warehouses or cloud storage.
Tuesday, May 21, 2019
Value of Big Data
Big data is so valuable because it contributes to business globally. Businesses use big data to aid their decision making, and it ultimately shapes the future of their business. It gives them insight into their customers, so it can be used to make predictions. With this insight, businesses can tailor their products and services to suit customers, or show targeted adverts to specific groups of people to increase profits. Data can also be sold to other companies that want information about customers (although this is sometimes illegal). Companies which hold a large amount of big data on their users are therefore more valuable: for example, Facebook's net worth is around £105 billion despite a yearly profit of only about £5 billion. This shows that big data is valuable even though it is intangible. As internet access keeps spreading, more big data will be produced globally, making it even more informative and so even more valuable. The big data market is predicted to grow to $118.52 billion by 2022, a fivefold increase from 2015.
History of Big Data 1
Before the first computers appeared in the 1960s, the only method of collecting and recording information from the population was on paper. People had to hand out forms for others to fill in, or go around surveying people. It was hard to collect large amounts of information because collating the data took many people and a great deal of time, as it all had to be done manually. All the data collected was structured and planned in advance: the questionnaires were written by someone looking for specific information.
The first computers were created during the 1960s. They were inaccessible to the majority of the population, being very expensive, very large and hard for the average person to use. Only large companies could afford them at this time, and even then the computers had limited uses because the technology was not very advanced. Data was still mostly processed manually on paper.
In 1975 the first home PCs were made. Few people had them, as they were still expensive. There was no internet yet, but information could now be stored electronically in files on a computer's local storage. Information was still analysed manually on paper, as computers were not yet mainstream technology.
1983 is often cited as the year the internet was born, when the TCP/IP protocol was adopted. Information was beginning to be analysed online, although few people had an internet connection, as it was not yet mainstream. More and more information was being stored on computers, though, as we moved on from processing data on paper.
From 1995 to 2000 the internet grew rapidly with the birth of social media and e-commerce websites. Many people were now online, so data was being produced by far more people. Even so, there was not yet a huge amount of data, and it was not very meaningful, because no technology existed that could make sense of unstructured data. Data was now stored almost entirely in databases rather than on paper.
From 2010 to the present, the use of big data and the internet has really exploded. We now give away information unconsciously just by using the internet, especially through social media and online shopping. Everything we do online can be analysed with technology like Hadoop. Most data collected from people is unstructured, and with around 3.2 billion people on the planet using the internet, there is an enormous amount of information to analyse. Almost nothing is stored or processed on paper any more: computer, mobile and internet technology has almost completely eliminated the need for it.
Hadoop
Hadoop is a free, open-source framework that breaks up big data processing tasks so that many computers working on the same data can share the work and communicate with each other.
How it works:
A large amount of data is broken into smaller chunks, which are spread across a group of networked computers (called a 'cluster'), because one machine alone could not process it all. The data is split up and sorted by keys, and each computer processes the chunk it has been given, producing its own intermediate results. Those results are then brought together and combined into a final answer. Hadoop makes processing big data much quicker because several computers can work on the same data in parallel.
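Hadoop itself is a Java framework that spreads this work across many machines; the plain-Python sketch below only illustrates the underlying map / group-by-key / reduce pattern, using a word count as the classic example:

```python
# A minimal sketch of the MapReduce pattern Hadoop is built on
# (in a real cluster, each stage is spread across many machines).
from collections import defaultdict

documents = ["big data is big", "data is valuable"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by their key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine each group's values into one result per key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
```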
Limitations of Traditional Data Analysis
Why databases (RDBMS) couldn't handle big data...
Before today's technology, databases were used to store data, but this approach could not continue for big data because it is simply too large for normal databases to handle. Big data can run to petabytes in size, and traditional databases were not designed to hold that much information. They also cannot cope with unstructured data, which is what most big data is: it only makes sense to hold data in a database if it is structured, and structured data ready for a database is now the less common case. Finally, big data usually needs to be processed in real time because of how much of it there is, and databases are too slow to keep up with that speed of information flow.
Types of Statistics
Descriptive Statistics - These are statistics calculated directly from sample data, and they can be analysed using calculations. For example, for a set of temperatures over a month you may want to find the average temperature or the range of temperatures. The results can be displayed using graphs to show the data visually.
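For instance, a minimal sketch computing those two statistics on a made-up set of daily temperatures:

```python
# A minimal sketch of descriptive statistics on made-up daily temperatures.
temperatures = [12, 15, 9, 18, 21, 14, 11, 16, 19, 13]  # sample days

mean = sum(temperatures) / len(temperatures)
value_range = max(temperatures) - min(temperatures)

print(f"Average temperature: {mean:.1f}")  # 14.8
print(f"Range: {value_range}")             # 21 - 9 = 12
```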
Inferential Statistics - Inferential statistics is based on a small sample drawn from a population. Conclusions about the whole population are drawn from the sample data: it says what the data probably means, although the result is an estimate and may draw on other information too. It makes predictions about the whole population based on only part of it. For example, in a manufacturing factory, if you measure 3 nails out of a box of 100 and all 3 measure 5 cm, you can assume that all 100 nails in the box measure about 5 cm.
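A minimal sketch of that inference, using hypothetical measurements from the nail example:

```python
# A minimal sketch of inferential statistics: measure a small sample
# and draw a conclusion about the whole box (made-up measurements).
import statistics

sample = [5.0, 5.0, 5.0]  # lengths (cm) of 3 nails from a box of 100

sample_mean = statistics.mean(sample)
sample_spread = statistics.pstdev(sample)

# Because the sample is consistent, we infer (without measuring them)
# that the remaining 97 nails are probably also about this length.
print(f"Inferred length of all 100 nails: about {sample_mean} cm "
      f"(sample spread: {sample_spread} cm)")
```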
Types of Data - structured / unstructured
Structured Data - this is data collected in a format designed to capture specific information. Examples of structured data are the answers on an application form or the contents of a table someone has made. The information wanted has been defined beforehand, e.g. whoever wrote the application form was looking for specific pieces of information from people.
An advantage of structured data is that it is fast and easy to take information from, as it is laid out in an organised manner. There is no unnecessary information in the way, making it easier for a computer algorithm to analyse. A disadvantage of structured data is that it is limited to whatever data the format was designed to capture.
Unstructured Data - this is data with no fixed format, so it can contain a wide variety of information. Examples would be the contents of a video or a Facebook post. The data may be ambiguous, so it can be difficult to organise.
An advantage of unstructured data is that it is more flexible, as there is no structure to adhere to, and it can contain far more information than structured data. A disadvantage is that it can include a lot of unnecessary information, making it difficult to analyse.
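To make the contrast concrete, here is a minimal sketch (with a made-up record and a made-up post) of reading a fact from structured data versus extracting the same fact from free text:

```python
# Structured: one direct lookup by field name.
# Unstructured: search the free text and hope the pattern matches.
import re

structured = {"name": "Alice", "age": 30}  # e.g. an application form
unstructured = "Alice mentioned in her post that she just turned 30."

print(structured["age"])  # trivial to read

match = re.search(r"turned (\d+)", unstructured)
if match:
    print(int(match.group(1)))  # extracted, but only if the wording fits
```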
Today there is far more unstructured data, largely due to the rise of social media, which is where a lot of big data comes from. Before the internet, most data was structured, because structured data was easier and quicker to analyse.
What is Big Data? '3 Vs'
Big Data can be described as having the '3 Vs'.
Volume (amount): this refers to the amount of information collected from people. Data is being analysed from a very large number of people; for example, Facebook has over 2 billion users from whom it collects and processes vast amounts of data. By analysing and sorting all of this data, Facebook has information about patterns and trends in human behaviour.
Velocity (real time, batch): the data arrives in real time, meaning that as soon as a person does something online it can be processed immediately. The data has to be processed very quickly because of the volume constantly being generated.
Variety (structured, unstructured): structured data is made up of clearly defined data types, so it can be put into categories and easily searched because it is in an organised format. Unstructured data, on the other hand, is in a format which makes it hard to organise; with videos or Facebook posts, for example, it is hard to work out what data can be taken from them.
Growth of Big Data
Over time big data has grown in size, and there are many reasons for this. More data is being recorded because more and more people are producing it, for example from their mobile phones. Technology is improving exponentially, and more and more people are using up-to-date technology which can produce a wide range of big data. The increased use of social media such as Facebook has greatly contributed to the amount of big data there is. There is now also smart software which can make sense of unstructured data and organise it into a usable form.
An example of how fast big data is growing: Facebook now generates 500+ TB of data every day, when 30 years ago 200 MB was considered a lot of data. As an idea of how much it will keep increasing, 90% of all data in existence was generated in the past 2 years.