BIG DATA WITH PYTHON Numerical and Scientific Packages

WHAT IS BIG DATA
1. Big data is a term for very large volumes of data.
2. The data can be of any type, structured or unstructured.
3. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time.
4. Teradata Corporation marketed the parallel-processing DBC 1012 system in 1984. In 1992, Teradata systems were the first to store and analyze 1 terabyte of data.

HISTORY OF BIG DATA
1. The term has been in use since the 1990s, with some giving credit to John Mashey for coining it or at least making it popular.
2. Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.
3. In a 2001 research report and related lectures, META Group (now Gartner) defined data growth challenges and opportunities as being three-dimensional, i.e., increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources).

RECENT USES OF BIG DATA
1. In 2012, the Obama administration announced the Big Data Research and Development Initiative, to explore how big data could be used to address important problems faced by the government. The initiative is composed of 84 different big data programs spread across six departments.
2. Big data analysis played a large role in Barack Obama's successful 2012 reelection campaign.
3. The United States Federal Government owns six of the ten most powerful supercomputers in the world.
4. The Utah Data Center has been constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes.

PROGRAMMING LANGUAGES THAT DEAL WITH BIG DATA
1. Python
2. R
3. Scala
4. Java (Hadoop)

WHAT IS BIG DATA WITH PYTHON
1. If your data scientists don't do R, they'll likely know Python inside and out.
2. Python has been very popular in academia for more than a decade, especially in areas like Natural Language Processing (NLP).
3. There's Jupyter/IPython too: the web-based notebook server that allows you to mix code, plots, and, well, almost anything, in a shareable logbook format.

ENVIRONMENT SETUP
1. Download Anaconda Navigator from https://www.anaconda.com/download/
2. Qt Console
3. Jupyter Notebook, to get started

LIBRARIES FOR DATA SCIENCE IN PYTHON
By far, the most commonly used packages are those in the SciPy stack. We will focus on these in this class. These packages include:
• Pandas – data analysis library.
• NumPy
• SciPy
• Matplotlib – plotting library.
• IPython – interactive computing.
• SymPy – symbolic computation library.

PANDAS - A LIBRARY FOR DATA ANALYTICS
1. Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on its powerful data structures.
2. The name pandas is derived from "panel data", an econometrics term for multidimensional data.
3. In 2008, developer Wes McKinney started developing pandas when he needed a high-performance, flexible tool for data analysis.
4. Python with pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.

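PANDAS
1. Pandas deals with three data structures:
a. Series
b. DataFrame
c. Panel (deprecated in pandas 0.20 and removed in 0.25; current versions provide only Series and DataFrame)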

DIMENSION & DESCRIPTION

| Data Structure | Dimensions | Description |
|---|---|---|
| Series | 1 | 1D labeled homogeneous array, size-immutable. |
| DataFrame | 2 | General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns. |
| Panel | 3 | General 3D labeled, size-mutable array. |
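
The dimensions listed above can be checked directly. The following is a minimal sketch, with made-up sample values, that inspects the `ndim` and `shape` attributes of a Series and a DataFrame.

```python
import pandas as pd

# A Series is one-dimensional: a single labeled column of values.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is two-dimensional: labeled rows and labeled columns,
# and each column may hold a different type (text and integers here).
df = pd.DataFrame({"name": ["Alex", "Bob"], "age": [10, 12]})

print(s.ndim, s.shape)    # number of dimensions and length of the Series
print(df.ndim, df.shape)  # number of dimensions and (rows, columns)
```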

A SERIES CAN BE CREATED USING VARIOUS INPUTS LIKE − 1. Array 2. Dict 3. Scalar value or constant
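
As a rough illustration of the three inputs listed above, the sketch below builds one Series from each. The variable names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# From an array: the default index is 0, 1, 2, ...
from_array = pd.Series(np.array([10, 20, 30]))

# From a dict: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a scalar: the value is repeated once for every index label given.
from_scalar = pd.Series(5, index=["x", "y", "z"])

print(from_array)
print(from_dict)
print(from_scalar)
```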

CREATE SERIES
1. Create an empty Series
    import pandas as pd
    # passing dtype explicitly avoids a warning on some pandas versions
    data = pd.Series(dtype="object")
2. Create a Series with some data
    import pandas as pd
    data = pd.Series(['a', 's', 's'])
    print(data)

SERIES WITH INDEX
    import pandas as pd
    # the index argument supplies one label per value
    datasets = pd.Series(['a', 'b', 'c'], index=['1', '2', '3'])
    print(datasets)

ACCESSING DATA FROM SERIES WITH POSITION
    import pandas as pd
    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    # retrieve the first element by position
    # (.iloc is explicit; plain s[0] on a labeled Series is deprecated in recent pandas)
    print(s.iloc[0])
    # retrieve the first 3 elements
    print(s.iloc[:3])

TASK 1
1. Write a program to print only the even numbers from a Series.
2. Write a program to print only the odd numbers from a Series.
3. Write a program to split the data in a Series into two groups, even and odd.

RETRIEVE DATA USING LABEL (INDEX)
    import pandas as pd
    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    # retrieve a single element
    print(s['a'])
    # retrieve multiple elements by passing a list of index labels
    print(s[['a', 'b', 'c']])
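
Position-based and label-based access can be compared side by side. The sketch below uses the explicit `.iloc` (position) and `.loc` (label) accessors on the same Series; note that a position slice excludes its end, while a label slice includes it.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

first_by_position = s.iloc[0]    # element at position 0
first_by_label = s.loc["a"]      # element with label 'a'

head_by_position = s.iloc[:3]    # positions 0, 1, 2 (end position excluded)
head_by_label = s.loc["a":"c"]   # labels 'a' through 'c' (end label included)

picked = s.loc[["a", "c", "e"]]  # a list of labels selects several elements

print(first_by_position, first_by_label)
print(head_by_position)
print(head_by_label)
print(picked)
```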

DATAFRAME
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

CREATE AN EMPTY DATAFRAME
    # import the pandas library under the alias pd
    import pandas as pd
    df = pd.DataFrame()
    print(df)

CREATE A DATAFRAME FROM LISTS
    import pandas as pd
    data = [1, 2, 3, 4, 5]
    # a flat list becomes a single column with the default name 0
    df = pd.DataFrame(data)
    print(df)

DATAFRAME WITH COLUMN NAME
    import pandas as pd
    # each inner list is one row
    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    print(df)
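
CREATE A DATAFRAME FROM A DICTIONARY OF SERIES
    import pandas as pd
    # each key becomes a column name, each Series a column
    d = {'name': pd.Series(['nakul', 'HVPM']), 'Age': pd.Series([1, 2])}
    df = pd.DataFrame(d)
    print(df)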
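
When the Series in the dictionary differ in length, pandas aligns them on their index labels and fills the gaps with NaN. The sketch below illustrates this; the third name, 'extra', is a made-up value that is not in the original slide.

```python
import pandas as pd

d = {
    # three names, indexed 0..2 ('extra' is a hypothetical added value)
    "name": pd.Series(["nakul", "HVPM", "extra"], index=[0, 1, 2]),
    # only two ages, indexed 0..1
    "Age": pd.Series([1, 2], index=[0, 1]),
}

# Rows are the union of both indexes; the missing age at index 2 becomes NaN.
df = pd.DataFrame(d)
print(df)
```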

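READING DATA FROM EXCEL DATASETS
1. Read the data file (read_csv reads CSV files, such as a sheet exported from Excel; for a native .xlsx workbook, use pd.read_excel instead)
    import pandas as pd
    dt = pd.read_csv("Path of Excel Sheet")
    print(dt)
2. Describe the data (summary statistics)
    dt.describe()
3. Print the top records and plot the data
    dt.head()
    dt.plot()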
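
The sketch below shows the same read, describe, and head steps without needing a file on disk. It assumes a small made-up CSV held in a string; in practice the path of a real file would be passed to `read_csv` instead.

```python
import io
import pandas as pd

# Made-up sample data standing in for a CSV file exported from a spreadsheet.
csv_text = """name,age,score
Alex,10,81.5
Bob,12,92.0
Clarke,13,77.25
"""

# read_csv accepts a file path or any file-like object.
dt = pd.read_csv(io.StringIO(csv_text))

top = dt.head(2)         # first two records
summary = dt.describe()  # count, mean, std, min, quartiles, max per numeric column

print(top)
print(summary)
```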

READING DATA FROM PARTICULAR COLUMN
    import pandas as pd
    # usecols limits the read to the listed columns
    dt = pd.read_csv("Path of Excel Sheet", usecols=['column_name'])
    print(dt)
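
A small self-contained sketch of `usecols`, again assuming made-up CSV text in place of a real file: only the named columns are loaded, which can save memory on wide files.

```python
import io
import pandas as pd

# Made-up sample data; the column names are assumptions for this example.
csv_text = "name,age,score\nAlex,10,81.5\nBob,12,92.0\n"

# Load a single column.
only_age = pd.read_csv(io.StringIO(csv_text), usecols=["age"])

# Load two columns; the remaining column is skipped while parsing.
name_and_score = pd.read_csv(io.StringIO(csv_text), usecols=["name", "score"])

print(only_age)
print(name_and_score)
```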

PLOTTING MULTIPLE GRAPHS
    import pandas as pd
    import matplotlib.pyplot as plt
    import time
    array1 = []
    for i in range(10):
        array1.append(i)
        # redraw the bar chart each time a value is added
        sr1 = pd.Series(array1)
        sr1.plot(kind='bar')
        plt.show()
        plt.close()
        time.sleep(5)

GRAPHS IN PANDAS
Values of the kind argument of plot():
'bar' or 'barh' for bar plots
'hist' for histogram
'box' for boxplot
'kde' or 'density' for density plots
'area' for area plots
'scatter' for scatter plots
'hexbin' for hexagonal bin plots
'pie' for pie plots
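
As a hedged illustration of the kind argument, the sketch below draws the same made-up Series three ways on one figure. It selects the non-interactive Agg backend and writes the image to an in-memory buffer so that it runs without a display; in a notebook, plt.show() would be used instead. Note that 'scatter' and 'hexbin' need a DataFrame with x and y columns, so they are not shown here.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample values.
sr = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])

# One row of three panels, one per plot kind.
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
sr.plot(kind="bar", ax=axes[0], title="bar")
sr.plot(kind="hist", ax=axes[1], title="hist")
sr.plot(kind="box", ax=axes[2], title="box")
fig.tight_layout()

# Render the figure as PNG bytes in memory.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```
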

PLOTTING GRAPH BY READING SERVER VALUES
    import urllib.request
    import pandas as pd
    import time
    import matplotlib.pyplot as plt
    servervalue = []
    # poll the server every 5 seconds and redraw the chart (runs until interrupted)
    while True:
        with urllib.request.urlopen("http://mahavidyalay.in/AcademicDevelopment/ServerDemo/ShowLed1.php") as response:
            html = response.read()
        # the response body is expected to be a single integer
        servervalue.append(int(html))
        sr22 = pd.Series(servervalue)
        sr22.plot(kind='bar')
        plt.show()
        time.sleep(5)
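
The polling pattern above can be tried without a network connection. The sketch below is one possible arrangement, not the original program: `collect_readings` is a hypothetical helper, and a stand-in function returning fixed values replaces the HTTP request. The loop is also bounded, so it stops after a set number of samples.

```python
import time

import pandas as pd


def collect_readings(read_value, samples, delay_seconds=0.0):
    """Call read_value() `samples` times and return the readings as a Series."""
    readings = []
    for _ in range(samples):
        # Convert each raw reading to an integer, as the server loop does.
        readings.append(int(read_value()))
        if delay_seconds:
            time.sleep(delay_seconds)
    return pd.Series(readings)


# Stand-in for the HTTP request: yields fixed made-up values instead of
# contacting a server.
fake_values = iter(["3", "7", "5"])
sr = collect_readings(lambda: next(fake_values), samples=3)
print(sr)
```

To read from a real server, a function that performs the request and returns the response body could be passed as `read_value`, and `sr.plot(kind='bar')` could then draw the result.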

LINKS AVAILABLE FOR STUDY
https://plot.ly/python/ipython-notebook-tutorial/
https://plot.ly/pandas/
https://www.anaconda.com/download/
http://pbpython.com/simple-graphing-pandas.html