Data Analysis in Python

Monday, May 26, 2025 · 12 min read

Page 1 of 26

Introduction

These notes cover fundamental and intermediate concepts in Python 3, Numpy, Pandas, and JSON handling, focusing on syntax, data structures, and essential functions for data manipulation and analysis.

PYTHON 3

I. print

print('Hi' + str(8))     # Hi8 (concatenation adds no space)
print(int(8) + 5)        # 13
print(float(8.5) + 5)    # 13.5
print("\n", "q")

II. Math

+, -, *, / (float division), // (floor division), ** (power), % (modulo)

'Hi'*3      # 'HiHiHi'

III. Variables

n = 5

Use str() for string conversion

x, y = (3, 5)  # unpacking
# x = 3, y = 5

IV. While Loop

while condition:
    [body]

V. For Loop

for x in range(3, 5):
    [body]
# for x in a:  (works for any iterable a)
# range() does not build a list; it returns a lazy range object that yields values on demand

VI. if, elif, else

or, and, not

VII. Strings

a = 'py'
b = 'thon'
a + b  # 'python' (concatenation)
a = 'python'
a[0]   # 'p'
a[-1]  # 'n'
a[0:100]  # 'python' (out-of-range slice bounds are clipped)
a[2:-4]  # ''
a[-4:-2] # 'th'
a[0] = 'a'      # TypeError (strings are immutable)
a += 'gh'       # allowed: builds a new string and rebinds a
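
The indexing and slicing rules above can be checked with a short sketch. The variable names (word, clipped, changed) are only illustrative and are not part of the original notes.

```python
# Indexing picks one character; slicing returns a new string.
word = 'python'
first = word[0]          # 'p'
last = word[-1]          # 'n'
clipped = word[0:100]    # 'python' (slice bounds are clipped, no error)
middle = word[-4:-2]     # 'th'

# Strings are immutable, so a "change" really builds a new string.
changed = 'j' + word[1:]  # 'jython'
print(first, last, clipped, middle, changed)
```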

VIII. Lists (Mutable)

Lists group several values into one compound data type.
n in a  # membership test, returns True / False

a + [1]      # builds a new list
a.append(1)  # modifies a in place (more efficient)

9. Range function

range(start, stop, step)

Also, range() doesn't return a list, it returns an object which is iterable.

sum(range(4))  # 6
list(range(4))  # [0, 1, 2, 3]
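
A small sketch of how a range object behaves, assuming nothing beyond the built-ins. It shows that range() is lazy but still supports len() and membership tests.

```python
r = range(3, 10, 2)        # start=3, stop=10 (excluded), step=2

kind = type(r).__name__    # 'range', not 'list'
values = list(r)           # [3, 5, 7, 9]
size = len(r)              # 4
has_seven = 7 in r         # True
total = sum(range(4))      # 0 + 1 + 2 + 3 = 6
print(kind, values, size, has_seven, total)
```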

10. lambda exp.

Small anonymous functions can be created with lambda keyword:

lambda n, y, z: n + y + z
# params      # return
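
As an illustration, a lambda can be bound to a name or passed straight to another function. The names add3 and pairs below are made up for this sketch.

```python
# Same lambda as above, bound to a name so it can be called.
add3 = lambda n, y, z: n + y + z
result = add3(1, 2, 3)   # 6

# Lambdas are most useful as short key functions.
pairs = [(1, 'b'), (3, 'a'), (2, 'c')]
by_letter = sorted(pairs, key=lambda p: p[1])
print(result, by_letter)  # 6 [(3, 'a'), (1, 'b'), (2, 'c')]
```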

11. List Functions

list.append(n)
list.insert(i, x)
list.extend(iterable_obj)
list.remove(x)
list.pop([i])   # optional
list.clear()
list.reverse()
list.copy()
list.sort(key=None, reverse=False)
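
The methods listed above can be traced on a small list. This is only a sketch; the list name nums and its values are arbitrary.

```python
nums = [3, 1, 2]
nums.append(5)           # [3, 1, 2, 5]
nums.insert(0, 9)        # [9, 3, 1, 2, 5]
nums.extend([7, 8])      # [9, 3, 1, 2, 5, 7, 8]
nums.remove(1)           # removes the first 1 -> [9, 3, 2, 5, 7, 8]
last = nums.pop()        # 8, list is now [9, 3, 2, 5, 7]
second = nums.pop(1)     # 3, list is now [9, 2, 5, 7]
snapshot = nums.copy()   # shallow copy, unaffected by later changes
nums.sort(reverse=True)  # [9, 7, 5, 2]
nums.reverse()           # [2, 5, 7, 9]
print(nums, snapshot, last, second)
```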

12. Queues

Lists are not efficient for insert/pops from beginning.

from collections import deque

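q = deque(list)    # build a deque from any iterable
q.append(x)        # add x at the right end
q.appendleft(x)    # add x at the left end
q.pop()            # remove from the right end
q.popleft()        # remove from the left end
# Most other list methods work the same way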
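
A minimal sketch of a deque used as a queue; the values are arbitrary.

```python
from collections import deque

q = deque([1, 2, 3])
q.append(4)          # deque([1, 2, 3, 4])
q.appendleft(0)      # deque([0, 1, 2, 3, 4])
right = q.pop()      # 4
left = q.popleft()   # 0
print(q, right, left)  # deque([1, 2, 3]) 4 0
```

Both ends are cheap to modify, which is what makes a deque a better fit than a list for first-in, first-out work.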

13. del statement

del n[4]    # remove one element
del n[2:4]  # remove a slice
del n       # remove the name n itself

14. List comprehensions

a = [f(x**2) for x in range(4)]  # f stands for any function applied to each x**2
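
A few concrete comprehensions, as a sketch. The filter and the nested loop are additions for illustration and do not appear in the line above.

```python
squares = [x**2 for x in range(4)]                      # [0, 1, 4, 9]
even_squares = [x**2 for x in range(10) if x % 2 == 0]  # [0, 4, 16, 36, 64]
pairs = [(x, y) for x in range(2) for y in 'ab']        # nested loops, left to right
print(squares, even_squares, pairs)
```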

15. Tuples

Immutable list (may contain mutable objects)

a = (1, 2, 3)
x, y, z = a

16. Sets

"Unordered" collection with no duplicate elements

n in set  # membership
a = {1, 2, 3}
a = set(list)

Set operations

a - b (Difference)
a | b (Union)
a & b (Intersection)
a ^ b (Symmetric difference, XOR)
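
The four operators can be tried on two small sets. This is a sketch with arbitrary values; the subset test with <= is an extra shown for completeness.

```python
a = {1, 2, 3}
b = {3, 4}

difference = a - b      # {1, 2}
union = a | b           # {1, 2, 3, 4}
intersection = a & b    # {3}
symmetric = a ^ b       # {1, 2, 4} (in exactly one of the two sets)
is_subset = {1, 2} <= a # True
unique = set([1, 1, 2]) # duplicates collapse to {1, 2}
print(difference, union, intersection, symmetric, is_subset, unique)
```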

17. Dictionaries

Key-value pairs
Unique keys
Assigning to an existing key replaces the old value
Looking up a missing key raises KeyError

a = dict(list)   # from a list of (key, value) pairs
a = {'x': 1, 'y': 2}
del a[key]
a['x']     # 1
list(a)    # ['x', 'y']
'n' in a   # False
for k, v in a.items():
    print(k, v)
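
A short sketch of the dictionary rules above. The use of .get() as a safe lookup is an addition for illustration.

```python
a = {'x': 1, 'y': 2}
a['x'] = 10          # existing key: old value replaced
a['z'] = 3           # new key: pair added
del a['y']
keys = list(a)       # ['x', 'z'] (insertion order is kept)

missing = a.get('n', 0)  # 0, no error for a missing key
try:
    a['n']
    raised = False
except KeyError:
    raised = True        # plain indexing raises KeyError

for k, v in a.items():
    print(k, v)
```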

NUMPY

A library for working with large arrays and matrices of numeric data, with fast operations on them such as matrix multiplication.

import numpy as np

Creating array

list1 = [1, 2, 3, 4]
array1 = np.array(list1)  # optional second argument: dtype
# This gives an ndarray

Now, we can use:

array1.ndim → Number of dimensions
array1.shape → Gives shape
array1.dtype → Gives its datatype
array1.size → Number of elements

Common array creation

np.full(shape, fill_value)
np.zeros(shape)
np.ones(shape)
np.empty(shape)  # uninitialized values
# Pass the shape as a tuple, e.g. (2, 3)
# zeros, ones and empty give float dtype by default

np.eye(n)  # n x n identity matrix
np.random.rand(n)  # n random samples, uniform on [0, 1)

np.arange(start, stop, step)  # creates array
arr.reshape((rows, cols))     # same data, new shape
np.linspace(start, stop, n)   # n evenly spaced values, stop included
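
A quick sketch of the creation helpers, with shapes and values chosen arbitrarily for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))             # 2x3 array of 0.0
filled = np.full((2, 2), 7)          # 2x2 array of 7
identity = np.eye(3)                 # 3x3 identity matrix
steps = np.arange(0, 10, 2)          # [0 2 4 6 8]
grid = np.arange(6).reshape((2, 3))  # [[0 1 2], [3 4 5]]
points = np.linspace(0, 1, 5)        # [0.   0.25 0.5  0.75 1.  ]
print(zeros.dtype, grid.shape, points)
```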

Scalars & Arrays

array1 = np.array([[1, 2, 3], [4, 5, 6]])
array1 * 2
array1 + array1
array1 - array1
array1 ** 3

Indexing Arrays

arr = np.arange(0, 10)
arr[8]        # 8
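arr[1:5]      # array([1, 2, 3, 4])
arr[2:5] = 4  # assigns 4 to every element of the slice
arr_2d[2, 3]  # row 2, column 3 (needs a 2-d array)
arr[[0,1,4,4,4,5,-1, -10]]  # fancy indexing with a list of positions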

A slice is a view of the original array, so copy the array first to avoid changing the original:

arr2 = arr.copy()
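
The difference between a slice and a copy can be seen in a small sketch; the names view and safe are illustrative.

```python
import numpy as np

arr = np.arange(0, 10)
view = arr[2:5]      # a view: shares memory with arr
view[:] = 4          # writes through to arr
after_view = arr.tolist()   # [0, 1, 4, 4, 4, 5, 6, 7, 8, 9]

safe = arr.copy()    # an independent copy
safe[0] = 99         # arr is not affected
print(after_view, arr[0], safe[0])
```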

2-D array slicing

arr_2d[:, :]  # [row slice, column slice], e.g. arr_2d[:2, 1:]
# Fancy indexing
arr_2d[ [  ] ]  # pass a list of row indices

Array Transposition

arr.reshape((10, 5))  # needs 50 elements; a 3-tuple gives a 3-d array
arr.T  # Transpose

Universal array functions

(reductions such as arr.sum() also take axis=0 or axis=1)

np.sqrt(arr)
np.exp(arr)
np.abs(arr)
np.add(arr1, arr2)
np.maximum(arr1, arr2)
np.minimum(arr1, arr2)

Array Input & Output

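np.save(' ', arr)   # file name you want (.npy is appended if missing)
np.load(' ')        # file name saved, including the .npy extension
# For multiple arrays
np.savez(' ', np1, np2)   # stored as arr_0 and arr_1 in one .npz file
np.load(' ')['arr_1']     # load the archive, then pick an array by name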
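
A runnable sketch of the save and load round trip. It writes into a temporary directory, and the file names single.npy and bundle.npz as well as the keyword names first and second are assumptions made for this example.

```python
import os
import tempfile
import numpy as np

a = np.arange(3)
b = np.ones((2, 2))

with tempfile.TemporaryDirectory() as tmp:
    # One array per .npy file.
    single = os.path.join(tmp, 'single.npy')
    np.save(single, a)
    loaded = np.load(single)

    # Several arrays in one .npz archive, named by keyword.
    bundle = os.path.join(tmp, 'bundle.npz')
    np.savez(bundle, first=a, second=b)
    with np.load(bundle) as archive:
        names = sorted(archive.files)   # ['first', 'second']
        second = archive['second']

print(loaded, names, second.shape)
```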

Boolean

arr < 3  # returns array of True/False
np.sum(arr < 3)  # number of elements smaller than 3
np.sort(n)       # returns a sorted copy
n.sort()         # sorts in place
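
A sketch of boolean masks and the two ways of sorting, on arbitrary values.

```python
import numpy as np

arr = np.array([5, 1, 4, 2])
mask = arr < 3                 # [False  True False  True]
count = int(np.sum(mask))      # True counts as 1 -> 2
small = arr[mask]              # keep only the matching elements -> [1 2]

sorted_copy = np.sort(arr)     # new array, arr untouched
unchanged = arr.tolist()       # [5, 1, 4, 2]
arr.sort()                     # now arr itself is sorted
print(count, small, sorted_copy, arr)
```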

Pandas Introduction

Pandas is a newer package built on top of Numpy that provides an efficient implementation of a DataFrame.

DataFrames are essentially multi-dimensional arrays with attached row & column labels, and often with heterogeneous types and/or missing data.

PANDAS

import pandas as pd

Series

A Series is a 1-D array of indexed data: Numpy values plus an index.

obj = pd.Series([ ])
# Put in the data - list, tuple, dictionary
obj = pd.Series(data, index)

By default, the index labels are 0, 1, 2, ...

obj.index  # gives index
obj.values # gives values

DataFrames (built on Series)

2-D array with flexible row indices & column names

df = pd.DataFrame(data, index, columns)
df.index
df.columns
df['col_name']  # This is pandas series

To add columns

df['New'] = df['w'] + df['x']

To remove column

df.drop('New', axis=1, inplace=True)

To remove row

df.drop('E')  # returns a new DataFrame; pass inplace=True to change df itself

df.shape  # gives (row, col)

To select row

df.loc[ ]  # pass in labels (row name, column name)
df.iloc[ ] # pass in integer positions (row index, column index)
# We can also pass lists of labels
df.loc[['i'], ['j']]
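
A small sketch that ties together adding a column, dropping one, and selecting with loc and iloc. The frame contents and the labels A, B, C are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'w': [1, 2, 3], 'x': [10, 20, 30]}, index=['A', 'B', 'C'])
df['New'] = df['w'] + df['x']          # add a column

row_b = df.loc['B']                    # one row, returned as a Series
cell = df.loc['B', 'x']                # by labels -> 20
same_cell = df.iloc[1, 1]              # by positions -> 20
sub = df.loc[['A', 'C'], ['w', 'New']] # lists of labels -> smaller DataFrame
trimmed = df.drop('New', axis=1)       # new DataFrame, df keeps 'New'
print(sub)
```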

Working with df

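df[df > 0]  # same shape back, with NaN where the condition is False
df = df.reset_index()  # old index becomes a column (named 'index' if the index had no name)
df.drop('index', axis=1, inplace=True)
df.set_index(' ', inplace=True)  # pass the column to use as the new index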

Missing Data

df.isnull()
df.notnull()
df.dropna()               # drops each row containing nan value
df.dropna(axis=1)         # drops each column with nan value
df.fillna(value={'F7': 1})
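
A sketch of the missing-data helpers on a tiny frame. The column names A and B are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, 6.0]})

mask = df.isnull()                  # True where a value is missing
rows_kept = df.dropna()             # drops the row holding the NaN
cols_kept = df.dropna(axis=1)       # drops column A instead
filled = df.fillna(value={'A': 0})  # replace NaN in column A with 0
print(filled)
```

None of these calls changes df itself; each returns a new DataFrame.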

groupby

Group rows by a column, then aggregate each group

yo = df.groupby('company')  # GroupBy object created

Now you can do aggregate like:

yo.mean()
yo.sum()
yo.std()
yo.count()  # gives out count
yo.max()
yo.min()

We can also get a bunch of summary statistics at once with

df.groupby('C').describe()
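
A minimal groupby sketch. The company and sales columns and their values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['A', 'A', 'B'],
    'sales': [100, 200, 50],
})

yo = df.groupby('company')      # GroupBy object, nothing computed yet
totals = yo['sales'].sum()      # A: 300, B: 50
means = yo['sales'].mean()      # A: 150.0, B: 50.0
counts = yo['sales'].count()    # A: 2, B: 1
print(totals)
```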

Concatenation

pd.concat([ , ])
# Pass in the DataFrames to stack

Merging

pd.merge(left, right, on=' ')
# on: the column common to both DataFrames

Combining (or Joining)

df.join(df1)
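
The three ways of combining DataFrames can be compared side by side. This is a sketch; the frames left and right and the key column are assumptions for the example.

```python
import pandas as pd

left = pd.DataFrame({'key': ['k0', 'k1'], 'a': [1, 2]})
right = pd.DataFrame({'key': ['k0', 'k1'], 'b': [3, 4]})

stacked = pd.concat([left, left])          # rows stacked, 4 rows in total
merged = pd.merge(left, right, on='key')   # matched on the shared column
joined = left.set_index('key').join(right.set_index('key'))  # matched on the index
print(merged)
```

concat glues frames together, merge matches rows on a column, and join matches rows on the index.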

Operations

df['col2'].unique()  # gives an array of the distinct values
df['col2'].nunique() # number of distinct values
df['col2'].value_counts()
# example output:
# 444    2
# 555    1
# 666    1
df['col2'].apply( )  # pass the function to apply to each value
# e.g. a built-in function or one you wrote yourself

Special attrs

df.columns   # prints column name
df.index

df.sort_values(by='col2')
df.isnull()  # gives out a boolean dataframe
# df.pivot_table()

JSON

JavaScript Object Notation

JSON	PYTHON
object	dict
array	list
true	True
false	False
string	str
number (integer)	int
number (real)	float

obj = {'nom1': 'value'}  # (dictionary)
array = ['v1', 'v2']     # (list)

import json

Now, if we have JSON data, then use 'loads' or 'load'

json_data = '{"v1":"no","v2":"NO"}'
obj = json.loads(json_data)
type(obj)  # dict

If we have a dict and we want to convert it into JSON, then we use 'dumps' or 'dump'

my_dict = {'v1': 'Red', 'v2': 'White'}
json_data = json.dumps(my_dict)
type(json_data)
# string (json data is in string format)

So, basically:

loads: convert json data into python data
dumps: convert python data into json data
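
A round trip through dumps and loads, as a sketch. The dictionary contents are arbitrary and chosen to show how True, None and numbers are translated.

```python
import json

my_dict = {'v1': 'Red', 'ok': True, 'n': [1, 2.5], 'none': None}

text = json.dumps(my_dict)   # Python -> JSON string
# text is '{"v1": "Red", "ok": true, "n": [1, 2.5], "none": null}'
back = json.loads(text)      # JSON string -> Python
print(type(text).__name__, back == my_dict)  # str True
```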

Now, to work with files

write

with open('n.json', 'w') as outFile:
    json.dump(data, outFile)
# Python data

Do the same to read (mode 'r' with json.load) or to append (mode 'a').
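
A sketch of writing a file and reading it back. It uses a temporary directory so nothing is left behind; the data dictionary is made up for the example.

```python
import json
import os
import tempfile

data = {'name': 'tips', 'rows': 244}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'n.json')

    # Write: Python data -> JSON file
    with open(path, 'w') as out_file:
        json.dump(data, out_file)

    # Read: JSON file -> Python data
    with open(path) as in_file:
        restored = json.load(in_file)

print(restored)
```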

Note

Pandas does very well in reading files and it also reads json.

import pandas as pd
df = pd.read_json('n.json')
print(df)

Introduction

This section of the notes covers data visualization and manipulation in Python, focusing on how to read JSON data without pandas, as well as working with the Matplotlib and Seaborn libraries to create various types of plots and visualizations. It includes both functional and object-oriented plotting methods, layout management, saving figures, and customizing plot appearance.

To read without using pandas

To read a JSON file without pandas:

with open('x.json') as data:
    json_data = json.load(data)
    print(json_data)

MATPLOTLIB

Importing and Setup

import matplotlib.pyplot as plt
# (shows plots in the output)
%matplotlib inline
# (interactive plot)
%matplotlib notebook

Functional Method

For simple plot:

plt.plot(x, y, 'r')  # red colour
plt.show()  # displays the plot

Labeling:

plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.title('title')

To subplot (multiple plots):

plt.subplot(1, 2, 1)  # no. of rows, no. of columns, index of this plot
plt.plot(x, y)
plt.subplot(1, 2, 2)
plt.plot(y, x)

Object Oriented Method

For single plot:

fig = plt.figure()

Axes (the plotting area created inside the figure):

ax = fig.add_axes([left, bottom, width, height])  # each value is a fraction of the figure, from 0 to 1
ax.plot(x, y)
ax.set_xlabel('xlabel')
ax.set_ylabel('ylabel')
ax.set_title('Set Title')

For multiple plots (one inside the other, adding axes manually)

fig = plt.figure()
axes1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])
axes2 = fig.add_axes([0.2, 0.5, 0.4, 0.3])
axes1.plot(x, y)
axes2.plot(y, x)

For multiple plots (side by side, adding axes automatically)

fig, axes = plt.subplots(nrows=1, ncols=2)

Now axes is a Numpy array of Axes objects (1 row and 2 columns) which we can iterate through.

You can call each element of the axes:

axes[0].plot(x, y)  # will plot only on first

Note: To overcome overlapping, use

plt.tight_layout()

at the end.
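
A self-contained sketch of the object-oriented method with two subplots. It selects the non-interactive Agg backend so it also runs without a display, which is an assumption of this example and not needed in a notebook. The x and y values are arbitrary.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available, draw off-screen
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

fig, axes = plt.subplots(nrows=1, ncols=2)  # axes is an array of 2 Axes
axes[0].plot(x, y, 'r')
axes[0].set_title('y against x')
axes[1].plot(y, x, 'b')
axes[1].set_title('x against y')
fig.tight_layout()  # keeps titles and tick labels from overlapping
```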

Figure Size (aspect ratio) and DPI

Units: figsize is in inches; DPI is dots (pixels) per inch

Example:

fig = plt.figure(figsize=(3, 2))  # takes a tuple (width, height) in inches, which also fixes the aspect ratio

To save a figure

fig.savefig('filename.extension', dpi=200)

Extensions: .png, .jpg, .pdf (more details in docs)

Legend

fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
ax.plot(x, x**2, label='x square')
ax.plot(x, x**3, label='x cube')
ax.legend()  # This draws the legend box
ax.legend(loc=0)  # loc sets the position: a code such as 0 ('best'), 1, etc., or an (x, y) tuple
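
A sketch combining figure size, legend and saving. The Agg backend, the temporary file name demo.png and the axes rectangle are assumptions made so the example runs anywhere.

```python
import os
import tempfile
import matplotlib
matplotlib.use('Agg')  # assumption: draw off-screen
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]

fig = plt.figure(figsize=(3, 2))          # 3 inches wide, 2 inches tall
ax = fig.add_axes([0.15, 0.2, 0.8, 0.7])  # fractions of the figure
ax.plot(x, [v**2 for v in x], label='x square')
ax.plot(x, [v**3 for v in x], label='x cube')
legend = ax.legend(loc=0)                 # 0 lets matplotlib pick the best spot
labels = [t.get_text() for t in legend.get_texts()]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.png')
    fig.savefig(path, dpi=200)            # 3 x 2 inches at 200 dpi -> 600 x 400 pixels
    size_bytes = os.path.getsize(path)

width, height = fig.get_size_inches()
```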

PLOT APPEARANCE

I. Colour

fig, axes = plt.subplots()
axes.plot(x, y, color='r')  # a colour name, a hex string, or a short code like 'r', 'g', 'b'

II. Line width

axes.plot(x, y, linewidth=1)  # shorthand: lw
axes.plot(x, y, alpha=0.5)    # transparency
axes.plot(x, y, linestyle='--')
axes.plot(x, y, linestyle='-.')  # shorthand: ls

axes.plot(x, y, marker='+', markersize=20,
          markerfacecolor='yellow', markeredgewidth=3, markeredgecolor='green')

III. Plot axes

ax.set_xlim([0, 2])
ax.set_ylim([0, 1])

Scatter

plt.plot(x, y, linestyle='', marker='x', color='red')  # better (linestyle='' leaves only the markers)
# or
plt.scatter(x, y, marker='o')

Histogram

plt.hist(list)
# or
plt.hist(data, bins=30, histtype='...')

Seaborn

import seaborn as sns
%matplotlib inline

Seaborn comes with built-in datasets (tips is one of them)

tips = sns.load_dataset('tips')

Distribution Plot

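sns.distplot(tips['total_bill'])
# distplot is deprecated since seaborn 0.11; sns.histplot(tips['total_bill'], kde=True) draws the same kind of plot

This gives histogram + KDE (Kernel Density Estimation)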

Histogram diagram: Count (y-axis) against Bins (x-axis)

Joint Plot

sns.jointplot(x='', y='', data=tips)
# This gives two distribution plots (on the margins) + scatter plot (in the centre)

III. Pair Plot

sns.pairplot(data)

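It plots every pairwise relationship between the numerical columns.
Nice way to get a quick overview of the data.

Most important parameters:

sns.pairplot(data, height=, palette=)
# height sets the plot size (it was called size before seaborn 0.9); palette sets the colours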

IV. Rug Plot

sns.rugplot(tips[''])

Like a density plot.
It draws one small line per observation, so dense regions show up as clusters of lines.

Categorical Plots

I. Box Plot

sns.boxplot(x='', y='', data=tips)
# x: categorical, y: numerical

II. Count Plot

sns.countplot(x='', data=tips)

III. Box Plot

sns.boxplot(x='day', y='total_bill', data=tips)
# categorical, numerical

IV. Violin Plot

sns.violinplot(x='day', y='total_bill', data=tips)

V. Strip Plot

sns.stripplot(x='day', y='total_bill', data=tips)
# categorical, numerical

VI. Swarm Plot (combines strip plot & violin plot)

sns.swarmplot(x='', y='', data=tips)
# categorical, numerical

VII. Factor Plot (General Plot)

sns.factorplot(x='', y='', data=tips, kind='')
# factorplot was renamed to sns.catplot in seaborn 0.9

Matrix Plot

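(a) Heatmap: the data must first be in matrix form, with labels on both rows and columns

We can use correlation in it:

tc = tips.corr(numeric_only=True)  # numeric_only skips the text columns (needed in pandas 2.x)
sns.heatmap(tc)
sns.heatmap(tc, annot=True)  # annot=True writes the value inside each cell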

II. Pivot Table

flights = sns.load_dataset('flights')
tf = flights.pivot_table(index='', columns='', values='')
sns.heatmap(tf)

(b) Clustermap

sns.clustermap(tf)

Regression Plot

sns.lmplot(x='', y='', data=tips)
# This gives a scatter plot with a fitted linear regression line

