Page 1 of 26
Introduction
These notes cover fundamental and intermediate concepts in Python 3, Numpy, Pandas, and JSON handling, focusing on syntax, data structures, and essential functions for data manipulation and analysis.
PYTHON 3
I. print
print('Hi' + str(8)) # Hi8
print(int(8) + 5) # 13
print(float(8.5) + 5) # 13.5
print("\n", "q")
II. Math
+, -, *, / (float division), // (floor division), ** (power), % (modulo)
'Hi'*3 # 'HiHiHi'
III. Variables
n = 5
- Use str() for string conversion
x, y = (3, 5) # unpacking
# x = 3, y = 5
IV. While Loop
while condition:
    [body]
V. For Loop
for x in range(3, 5): # x takes the values 3, 4
# for x in a: (any iterable a works)
# range() returns a lazy range object, not a list; values are produced on demand
VI. if, elif, else
or, and, not
VII. Strings
a = 'py'
b = 'thon'
a + b # 'python'
a = 'python'
a[0] # 'p'
a[-1] # 'n'
a[0:100] # 'python' (out of range is handled in slicing)
a[2:-4] # ''
a[-4:-2] # 'th'
a[0] = 'a' # Error (immutable)
a += 'gh'
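The indexing and slicing rules above can be checked with a short sketch. The variable names here are only illustrative; nothing beyond the built-in str type is assumed.

```python
word = 'python'

first = word[0]          # 'p'
last = word[-1]          # 'n'
middle = word[-4:-2]     # 'th' (same as word[2:4])
clamped = word[0:100]    # 'python' (slice bounds are clamped)

# Strings are immutable, so item assignment raises TypeError.
try:
    word[0] = 'a'
except TypeError as exc:
    error_message = str(exc)

# To "change" a character, build a new string instead.
replaced = 'a' + word[1:]  # 'aython'
```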
VIII. Lists (Mutable)
- Compound data type: groups several values together (items may have different types)
n in a returns True / False
a + [1] # builds a new list
a.append(1) # more efficient (modifies a in place)
9. Range function
range(start, stop, step)
- Also, range() doesn't return a list; it returns an iterable range object.
sum(range(4)) # 6
list(range(4)) # [0, 1, 2, 3]
10. lambda exp.
- Small anonymous functions can be created with the lambda keyword:
lambda n, y, z: n + y + z
# params # return
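A minimal sketch of the lambda above in use. The names add3, pairs and by_number are made up for illustration.

```python
add3 = lambda n, y, z: n + y + z   # parameters before the colon, return value after
total = add3(1, 2, 3)              # 6

# A common use: a throwaway key function for sorting.
pairs = [(3, 'three'), (1, 'one'), (2, 'two')]
by_number = sorted(pairs, key=lambda pair: pair[0])
```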
11. List Functions
list.append(n)
list.insert(i, x)
list.extend(iterable_obj)
list.remove(x)
list.pop([i]) # optional
list.clear()
list.reverse()
list.copy()
list.sort(key=None, reverse=False)
12. Queues
Lists are not efficient for insert/pops from beginning.
from collections import deque
q = deque(list)
q.append()
q.appendleft()
q.pop()
q.popleft()
# Most other list methods work the same way
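The four deque methods listed above, sketched on a small example queue (the values are arbitrary):

```python
from collections import deque

q = deque([1, 2, 3])
q.append(4)          # add on the right: deque([1, 2, 3, 4])
q.appendleft(0)      # add on the left:  deque([0, 1, 2, 3, 4])
right = q.pop()      # removes and returns 4
left = q.popleft()   # removes and returns 0
```

Both ends of a deque support fast appends and pops, which is why it suits queues better than a list.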
13. del statement
del n
del n[2:4]
del n[4]
14. List comprehensions
a = [f(x**2) for x in range(4)]
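A sketch of what the comprehension above expands to. The function f is not defined in the notes, so a hypothetical stand-in that adds one is assumed here.

```python
def f(value):
    # Hypothetical stand-in for the f used in the notes: adds one.
    return value + 1

a = [f(x**2) for x in range(4)]

# Equivalent explicit loop
b = []
for x in range(4):
    b.append(f(x**2))

# A comprehension can also filter with a trailing if
evens = [x for x in range(10) if x % 2 == 0]
```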
15. Tuples
- Immutable list (may contain mutable objects)
a = (1, 2, 3)
x, y, z = a
16. Sets
- "Unordered" collection with no duplicate elements
n in set # membership
a = {1, 2, 3}
a = set(list)
Set operations
a - b (Difference)
a | b (Union)
a & b (Intersection)
a ^ b (XOR: in one set but not both)
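The set operators above, sketched on two small example sets (the values are arbitrary):

```python
a = {1, 2, 3}
b = {3, 4}

difference = a - b      # {1, 2}
union = a | b           # {1, 2, 3, 4}
intersection = a & b    # {3}
xor = a ^ b             # {1, 2, 4}

unique = set([1, 1, 2, 2, 3])   # duplicates collapse: {1, 2, 3}
is_subset = {1, 2} <= a         # True: every element of {1, 2} is in a
```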
17. Dictionaries
- Key-value pairs
- Unique keys
- Assigning to an existing key replaces the old value
- KeyError if a looked-up key does not exist
a = dict(list) # list of (key, value) pairs
a = {'x': 1, 'y': 2}
del a[key]
a['x'] # 1
list(a) # ['x', 'y']
'n' in a # False
for k, v in a.items():
    print(k, v)
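A short sketch of the dictionary rules above. The keys and values are only examples, and .get() is shown as one way to avoid the KeyError.

```python
a = dict([('x', 1), ('y', 2)])   # same as {'x': 1, 'y': 2}

a['x'] = 10          # existing key: the old value is replaced
a['z'] = 3           # new key: the pair is added
del a['y']           # remove a key

keys = list(a)             # ['x', 'z'] (insertion order)
missing = a.get('n', 0)    # .get() returns a default instead of raising KeyError

try:
    a['n']
    lookup_failed = False
except KeyError:
    lookup_failed = True
```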
NUMPY
A library for working with large arrays & matrices of numeric data, and for fast operations on them such as matrix multiplication.
import numpy as np
Creating array
list1 = [1, 2, 3, 4]
array1 = np.array(list1) # optional dtype= argument sets the data type
# This gives array
Now, we can use:
array1.ndim → Number of dimensions
array1.shape → Gives shape
array1.dtype → Gives its datatype
array1.size → Number of elements
Common array creation
np.full(shape, fill_value)
np.zeros(shape)
np.ones(shape)
# Pass a tuple in it (shape)
- zeros and ones give float dtype (by default)
np.empty(shape) # uninitialised values
np.eye(n) # Identity matrix
np.random.rand(n) # n random numbers in [0, 1)
np.arange(start, stop, step) # creates array
arr.reshape((rows, cols)) # or np.reshape(arr, (rows, cols))
np.linspace(start, stop, n) # n evenly spaced values, stop included
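A small sketch of the creation functions above. The shapes and values are arbitrary examples.

```python
import numpy as np

zeros = np.zeros((2, 3))             # 2x3 array of 0.0 (float64 by default)
sevens = np.full((2, 2), 7)          # 2x2 array filled with 7
identity = np.eye(3)                 # 3x3 identity matrix

steps = np.arange(0, 10, 2)          # array([0, 2, 4, 6, 8]); stop is excluded
grid = np.arange(6).reshape((2, 3))  # [[0, 1, 2], [3, 4, 5]]
points = np.linspace(0, 1, 5)        # 0, 0.25, 0.5, 0.75, 1; stop is included
```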
Scalars & Arrays
array1 = np.array([[1, 2, 3], [4, 5, 6]])
array1 * 2
array1 + array1
array1 - array1
array1 ** 3
Indexing Arrays
arr = np.arange(0, 10)
arr[8] # 8
arr[1:5] # array([1, 2, 3, 4])
arr[2:5] = 4 # assigns 4 to indices 2, 3, 4
arr[2,3] # (2-d array): row 2, column 3
arr[[0,1,4,4,4,5,-1, -10]] # fancy indexing with a list of indices
- Copy an array to avoid changing the original array:
arr2 = arr.copy()
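Why the copy matters: a slice of a NumPy array is a view onto the same data, so writing through the slice changes the original. A minimal sketch, with illustrative variable names:

```python
import numpy as np

arr = np.arange(0, 10)

view = arr[1:5]       # a slice is a view onto the same data
view[:] = 99          # ...so this also changes arr
changed = arr[1:5].tolist()       # [99, 99, 99, 99]

arr2 = arr.copy()     # an independent copy
arr2[0] = -1          # arr is untouched
first_of_original = int(arr[0])   # still 0
```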
2-D array slicing
arr_2d[:, :] # [row slice, column slice]
# Fancy indexing
arr_2d[ [ ] ] # pass a list of row indices
Array Transposition
arr.reshape((10, 5)) # arr needs exactly 50 elements; 3-D shapes also work
arr.T # Transpose
Universal array functions
- Applied element-wise; reductions such as np.sum also take axis=0 or axis=1
np.sqrt(arr)
np.exp(arr)
np.abs(arr)
np.add(arr1, arr2)
np.maximum(arr1, arr2)
np.minimum(arr1, arr2)
Trig
np.pi
np.sin(n)
Array Input & Output
np.save(' ', arr) # file name you want (.npy is appended)
np.load(' ') # file name saved
# For multiple arrays
np.savez(' ', np1=np1, np2=np2) # keyword names become the keys
np.load(' ')['np2']
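A round trip for both cases, sketched with a temporary folder so nothing is left on disk. The file names and the keys first and second are assumptions for the example.

```python
import os
import tempfile

import numpy as np

arr = np.arange(5)
other = np.ones(3)

with tempfile.TemporaryDirectory() as folder:
    # One array: .npy file
    single_path = os.path.join(folder, 'single.npy')
    np.save(single_path, arr)
    loaded = np.load(single_path)

    # Several arrays: .npz archive, read back by key
    multi_path = os.path.join(folder, 'multi.npz')
    np.savez(multi_path, first=arr, second=other)
    with np.load(multi_path) as archive:
        names = sorted(archive.files)
        second = archive['second']
```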
Boolean
arr < 3 # returns array of True/False
np.sum(arr < 3) # number of elements less than 3
np.sort(n) # returns a sorted copy
n.sort() # sorts in place
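A sketch of the boolean mask and the two ways of sorting; the array values are arbitrary.

```python
import numpy as np

arr = np.array([5, 1, 4, 2, 3])

mask = arr < 3              # array([False, True, False, True, False])
count = int(np.sum(mask))   # True counts as 1
small = arr[mask]           # array([1, 2])

sorted_copy = np.sort(arr)  # new array; arr keeps its order
order_before = arr.tolist()
arr.sort()                  # now arr itself is sorted
```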
Pandas Introduction
Pandas is a newer package built on top of NumPy that provides an efficient implementation of the DataFrame.
DataFrames are multi-dimensional arrays with attached row & column labels, and often with heterogeneous types and/or missing data.
PANDAS
import pandas as pd
Series
- It is a 1-D array of indexed data (a NumPy array with an index)
obj = pd.Series([ ])
# Put in the data - list, tuple, dictionary
obj = pd.Series(data, index)
- By default, the index labels are 0, 1, 2, ...
obj.index # gives index
obj.values # gives values
DataFrames (built on Series)
- 2-D array with flexible row indices & column names
df = pd.DataFrame(data, index, columns)
df.index
df.columns
df['col_name'] # This is pandas series
To add columns
df['New'] = df['w'] + df['x']
To remove column
df.drop('New', axis=1, inplace=True)
To remove row
df.drop('E') # returns a new DataFrame; add inplace=True to change df itself
df.shape # gives (row, col)
To select row
df.loc[ ] # pass in row name, column name
df.iloc[ ] # pass in index (row index, column index)
# We can also do
df.loc[['i'], ['j']]
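Label-based and position-based selection side by side, as a sketch. The frame below is made-up example data with row labels A to C.

```python
import pandas as pd

df = pd.DataFrame(
    {'w': [1, 2, 3], 'x': [10, 20, 30]},
    index=['A', 'B', 'C'],
)

df['New'] = df['w'] + df['x']          # add a column
row_by_label = df.loc['B']             # row 'B' as a Series
cell_by_label = df.loc['B', 'x']       # 20
cell_by_position = df.iloc[1, 1]       # same cell: second row, second column
sub_frame = df.loc[['A', 'C'], ['w']]  # lists of labels give a DataFrame back

without_new = df.drop('New', axis=1)   # returns a new frame; df is unchanged
```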
Working with df
df[df > 0] # gives the dataframe back, with NaN where the condition is False
df.reset_index()
df.drop('index', axis=1, inplace=True)
df.set_index(' ', inplace=True)
Missing Data
df.isnull()
df.notnull()
df.dropna() # drops each row containing nan value
df.dropna(axis=1) # drops each column with nan value
df.fillna(value={'F7': 1})
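The missing-data calls above on a tiny example frame. The column names A and B are assumptions for the sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, 6.0]})

null_mask = df.isnull()               # True where a value is missing
rows_kept = df.dropna()               # drops row 1 (it contains a NaN)
cols_kept = df.dropna(axis=1)         # drops column 'A'
filled = df.fillna(value={'A': 0})    # per-column fill values via a dict
```

None of these calls changes df itself; each returns a new frame.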
groupby
- Group rows by a column's values, then aggregate each group
yo = df.groupby('company') # GroupBy object created
Now you can aggregate, for example:
yo.mean()
yo.sum()
yo.std()
yo.count() # gives out count
yo.max()
yo.min()
We can also get a full set of summary statistics with
df.groupby('C').describe()
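A sketch of group-then-aggregate. The company names and sales figures are invented example data.

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['GOOG', 'GOOG', 'MSFT', 'MSFT'],
    'sales': [200, 120, 340, 124],
})

by_company = df.groupby('company')         # lazy GroupBy object

mean_sales = by_company['sales'].mean()    # one value per company
total_sales = by_company['sales'].sum()
summary = by_company['sales'].describe()   # count, mean, std, min, quartiles, max
```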
Concatenation
pd.concat([ , ])
# Pass in the DataFrames to stack
Merging
pd.merge(left, right, on=' ')
# on: the column the two DataFrames have in common
Combining (or Joining)
df.join(df1)
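The three ways of combining frames, sketched on two small frames that share a column. The column name key and the values are assumptions for the example.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'A': [1, 2]})
right = pd.DataFrame({'key': ['K0', 'K1'], 'B': [3, 4]})

stacked = pd.concat([left, left])          # rows of the second frame go under the first
merged = pd.merge(left, right, on='key')   # match rows on the shared 'key' column

# join() matches on the index instead of a column
joined = left.set_index('key').join(right.set_index('key'))
```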
Operations
df['col2'].unique() # gives an array
df['col2'].nunique() # no. of unique values
df['col2'].value_counts()
# example output:
# 444 2
# 555 1
# 666 1
df['col2'].apply( ) # pass the function to apply to each value
# e.g. a built-in function or your own function
Special attrs
df.columns # prints column name
df.index
df.sort_values(by='col2')
df.isnull() # gives out a boolean dataframe
# df.pivot_table() # (no content provided)
JSON
JavaScript Object Notation
| JSON | PYTHON |
|---|---|
| object | dict |
| array | list |
| true | True |
| false | False |
| string | str |
| number (integer) | int |
| number (real) | float |
obj = {'nom1': 'value'} # (dictionary)
array = ['v1', 'v2'] # (list)
import json
- Now, if we have JSON data, then use 'loads' or 'load'
json_data = '{"v1":"no","v2":"NO"}'
obj = json.loads(json_data)
type(obj) # dict
- If we have a dict and we want to convert it into JSON, then we use 'dumps' or 'dump'
my_dict = {'v1': 'Red', 'v2': 'White'}
json_data = json.dumps(my_dict)
type(json_data)
# string (json data is in string format)
So, basically:
loads: convert json data into python data
dumps: convert python data into json data
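A sketch of both directions, showing how the JSON types in the table map onto Python types. The sample strings are made up; null mapping to None is standard json module behaviour.

```python
import json

json_data = '{"v1": "no", "ok": true, "n": 8, "items": [1, 2.5], "none": null}'
obj = json.loads(json_data)        # JSON text -> Python objects

kinds = {key: type(value).__name__ for key, value in obj.items()}
# {'v1': 'str', 'ok': 'bool', 'n': 'int', 'items': 'list', 'none': 'NoneType'}

text = json.dumps({'v1': 'Red', 'ok': False})   # Python objects -> JSON text
# '{"v1": "Red", "ok": false}'
```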
Now, to work with files
write
with open('n.json', 'w') as outFile:
    json.dump(data, outFile)
# data: Python data
- The same pattern works for reading (mode 'r' with json.load)
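Writing and then reading back, as a sketch. A temporary folder is used so the example leaves no file behind; the content of data is invented.

```python
import json
import os
import tempfile

data = {'v1': 'Red', 'v2': ['White', 'Blue']}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'n.json')

    with open(path, 'w') as out_file:
        json.dump(data, out_file)        # write Python data as JSON

    with open(path) as in_file:          # default mode is 'r'
        restored = json.load(in_file)    # read it back
```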
Note
Pandas does very well in reading files and it also reads json.
import pandas as pd
df = pd.read_json('n.json')
print(df)
Introduction
This section of the notes covers data visualization and manipulation in Python, focusing on how to read JSON data without pandas, as well as working with the Matplotlib and Seaborn libraries to create various types of plots and visualizations. It includes both functional and object-oriented plotting methods, layout management, saving figures, and customizing plot appearance.
To read a JSON file without pandas:
with open('x.json') as data:
    json_data = json.load(data)
print(json_data)
MATPLOTLIB
Importing and Setup
import matplotlib.pyplot as plt
%matplotlib inline # (shows plots in the output)
%matplotlib notebook # (interactive plot)
Functional Method
- For simple plot:
plt.plot(x, y, 'r') # red colour
plt.show() # "brings the plot"
- Labeling:
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.title('title')
- To subplot (multiple plots):
plt.subplot(1, 2, 1) # no. of rows, no. of columns, plot number
plt.plot(x, y)
plt.subplot(1, 2, 2)
plt.plot(y, x)
Object Oriented Method
- For single plot:
fig = plt.figure()
- Axes (lower object created):
ax = fig.add_axes([left, bottom, width, height])
ax.plot(x, y)
ax.set_xlabel('xlabel')
ax.set_ylabel('ylabel')
ax.set_title('Set Title')
For multiple plots (one inside the other: manually adding axes)
fig = plt.figure()
axes1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])
axes2 = fig.add_axes([0.2, 0.5, 0.4, 0.3])
axes1.plot(x, y)
axes2.plot(y, x)
For multiple plots (side by side: automatically adding axes)
fig, axes = plt.subplots(nrows=1, ncols=2)
Now axes is a NumPy array of Axes objects (here 2 of them, for 1 row and 2 columns) which we can iterate through.
You can call each element of the axes:
axes[0].plot(x, y) # will plot only on first
Note: To overcome overlapping, use
plt.tight_layout()
at the end.
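Putting the object-oriented pieces together, as a sketch. The data, titles and labels are invented, and the Agg backend is an assumption so the code runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use('Agg')   # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

axes[0].plot(x, y, 'r')
axes[0].set_title('y against x')

axes[1].plot(y, x, 'b')
axes[1].set_title('x against y')

for ax in axes:             # axes is a NumPy array of Axes objects
    ax.set_xlabel('horizontal')
    ax.set_ylabel('vertical')

fig.tight_layout()          # avoid overlapping labels
```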
Figure Size (aspect ratio) and DPI
- Units: figsize is in inches; dpi is dots (pixels) per inch
- Example:
fig = plt.figure(figsize=(3, 2)) # takes a tuple: (width, height) in inches
To save a figure
fig.savefig('filename.extension', dpi=200)
- Extensions: .png, .jpg, .pdf (more details in docs)
Legend
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
ax.plot(x, x**2, label='x square')
ax.plot(x, x**3, label='x cube')
ax.legend() # This gives out the legend box
ax.legend(loc=0) # loc sets the position: a location code (0 = 'best', 1 = 'upper right', ...) or an (x, y) tuple
PLOT APPEARANCE
I. Colour
fig = plt.figure()
axes = fig.add_axes([0, 0, 1, 1])
axes.plot(x, y, color='r') # a colour name, a hex string, or a short code: 'r', 'g', 'b'
II. Line width
axes.plot(x, y, linewidth=1) # short form: lw
axes.plot(x, y, alpha=0.5) # transparency
axes.plot(x, y, linestyle='--')
axes.plot(x, y, linestyle='-.') # short form: ls
axes.plot(x, y, marker='+', markersize=20)
# markerfacecolor='yellow'
# markeredgewidth=3
# markeredgecolor='green'
III. Plot axes
ax.set_xlim([0, 2])
ax.set_ylim([0, 1])
Scatter
plt.plot(x, y, marker='x', color='red') # better
# or
plt.scatter(x, y, marker='o')
Histogram
plt.hist(list)
# or
plt.hist(data, bins=30, histtype='...')
Seaborn
import seaborn as sns
%matplotlib inline
There is an in-built dataset (tips is one of them)
tips = sns.load_dataset('tips')
Distribution Plot
sns.distplot(tips['total_bill']) # deprecated since seaborn 0.11; sns.histplot(tips['total_bill'], kde=True) replaces it
- This gives histogram + KDE (Kernel Density Estimation)
Histogram diagram: count on the vertical axis, bins on the horizontal axis
Joint Plot
sns.jointplot(x='', y='', data=)
# This gives two distribution plots + scatter plot
III. Pair Plot
sns.pairplot(data)
- It plots every pairwise relationship between the numerical columns.
- Nice way to get a quick overview of the data.
- Most useful parameters:
sns.pairplot(data, height=, palette=) # plot size (called size before seaborn 0.9), colour scheme
IV. Rug Plot
sns.rugplot(tips[''])
- Like a density plot.
- Each observation is drawn as a short line (tick), so dense regions show many ticks close together.
Categorized Plots
I. Box Plot
sns.boxplot(x='', y='', data=)
# categorical, numerical
II. Count Plot
sns.countplot(x='', data=tips)
III. Box Plot
sns.boxplot(x='day', y='total_bill', data=tips)
# categorical, numerical
IV. Violin Plot
sns.violinplot(x='day', y='total_bill', data=tips)
V. Strip Plot
sns.stripplot(x='day', y='total_bill', data=tips)
# categorical, numerical
VI. Swarm Plot (combines strip plot & violin plot)
sns.swarmplot(x='', y='', data=tips)
# categorical, numerical
VII. Factor Plot (General Plot)
sns.factorplot(x='', y='', data=tips, kind='') # renamed sns.catplot in seaborn 0.9
Matrix Plot
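(a) Heatmap: the data must first be in matrix form (labelled rows and columns)
We can use a correlation matrix for it:
tc = tips.corr() # pandas 2.0+ needs tips.corr(numeric_only=True), since tips has categorical columns
sns.heatmap(tc)
sns.heatmap(tc, annot=True) # annot=True writes the value in each cell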
II. Pivot Table
flights = sns.load_dataset('flights') # another built-in dataset
tf = flights.pivot_table(index='', columns='', values='')
sns.heatmap(tf)
(b) Clustermap
sns.clustermap(tf)
Regression Plot
sns.lmplot(x='', y='', data=)
# This gives linear regression