In this article, arr is a NumPy array, df is a pandas DataFrame, and s is a pandas Series.
Sort in place arr.sort()
Sort out of place np.sort(arr)
Find the sorted unique values np.unique(arr)
If arr1 is a (3 x 2) matrix and arr2 is a (2 x 4) matrix
arr1 @ arr2
is the (3 x 4) matrix product.
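A quick shape check with toy data (the small arrays here are only illustrative):

```python
import numpy as np

arr1 = np.arange(6).reshape(3, 2)   # (3 x 2)
arr2 = np.arange(8).reshape(2, 4)   # (2 x 4)

product = arr1 @ arr2               # matrix product, shape (3 x 4)
```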
Inverse of a matrix
np.linalg.inv(arr)
Determinant of a matrix
np.linalg.det(arr)
Draw a random (4 x 3) matrix from a standard normal distribution
np.random.randn(4,3)
Draw 100 random numbers chosen from the list [1,5,10] with different probabilities
np.random.choice([1,5,10], p = [0.2, 0.3, 0.5], size = 100)
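A minimal sketch of the draw (seeding first so the result is reproducible; the seed value is arbitrary):

```python
import numpy as np

np.random.seed(2981)  # fix the seed so the draws are reproducible
# 100 draws from [1, 5, 10] with probabilities 0.2, 0.3, 0.5
draws = np.random.choice([1, 5, 10], p=[0.2, 0.3, 0.5], size=100)
```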
Fix the generator seed of random numbers np.random.seed(2981)
Find the position of the maximum value np.argmax(arr)
or arr.argmax()
You have a 2 x 2 numpy array and you want all its values in a single 1-D array
np.ravel(arr)
or arr.ravel()
Both operate out of place.
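A small sketch showing that the original array is left untouched (the values are illustrative):

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
flat = arr.ravel()   # 1-D result (a view when possible); arr itself is unchanged
```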
Draw random sample from a multivariate normal distribution
np.random.multivariate_normal(mean, cov)
where mean is an array of length N and cov is the covariance matrix (N x N)
Create a new array from an old one where the negative values are replaced with 0
and the positive values with 10
new_array = np.where(old_array < 0, 0, 10)
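A toy example of the replacement (note that zeros, being non-negative, are mapped to 10):

```python
import numpy as np

old_array = np.array([-3, 7, 0, -1, 2])
# negatives -> 0, everything else -> 10
new_array = np.where(old_array < 0, 0, 10)
```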
Convert a numpy array of str to an array of float.
arr.astype(float)
It operates out of place.
70th percentile of a numpy array.
np.percentile(arr, 70)
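A quick sanity check on a small array (with the default linear interpolation, the 70th percentile of 0..10 falls exactly on 7):

```python
import numpy as np

arr = np.arange(11)           # 0 .. 10
p70 = np.percentile(arr, 70)  # 70th percentile
```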
Vertical stack: arr1 is a (3 x 2) numpy array and arr2 is a (2 x 2).
np.r_[arr1,arr2]
is a (5 x 2) numpy array.
Horizontal stack: arr1 is a (3 x 2) numpy array and arr2 is a (3 x 6).
np.c_[arr1,arr2]
is a (3 x 8) numpy array.
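Both stacks in one sketch, checking only the resulting shapes (the filler values are illustrative):

```python
import numpy as np

arr1 = np.ones((3, 2))
arr2 = np.zeros((2, 2))
v = np.r_[arr1, arr2]    # vertical stack -> (5, 2)

arr3 = np.zeros((3, 6))
h = np.c_[arr1, arr3]    # horizontal stack -> (3, 8)
```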
Reshape out of place an array of 12 elements into a (4 x 3) matrix
arr.reshape((4,3))
Find the maximum value for each index among 4 arrays of the same shape. Note that np.maximum compares only two arrays at a time, so for more than two use its reduce method:
np.maximum.reduce([arr1, arr2, arr3, arr4])
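A toy example of the element-wise maximum across four arrays (the values are illustrative):

```python
import numpy as np

arr1 = np.array([1, 9, 2])
arr2 = np.array([3, 1, 2])
arr3 = np.array([0, 4, 8])
arr4 = np.array([2, 2, 2])

# element-wise maximum across all four arrays at once
best = np.maximum.reduce([arr1, arr2, arr3, arr4])
```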
Display only 2 floating points for each Series or DataFrame
pd.options.display.float_format = '{:.2f}'.format
Set a column as index df.set_index('name_col', inplace = True)
Go back to the original df: df.reset_index(inplace = True)
Delete a column of a DataFrame del df['col_name']
Delete row2 and row5 of a DataFrame df.drop(['row2', 'row5'], inplace = True)
Subset of df where both df['col1'] and df['col2'] verify the condition
df[(df['col1'] < 0) & (df['col2'] > 10)]
Note that you can't replace & with and, and the parentheses are necessary.
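A minimal sketch of the boolean filter on toy data (only the first row satisfies both conditions):

```python
import pandas as pd

df = pd.DataFrame({'col1': [-1, -2, 3], 'col2': [15, 5, 20]})
# parentheses around each condition are required
subset = df[(df['col1'] < 0) & (df['col2'] > 10)]
```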
You have a Series but you want to give a name to the column and highlight it
s.rename('Column_Name').to_frame()
Count the number of positive and not null numbers in each column of a DataFrame
(df > 0).sum(axis = 0)
Find the index label where column A has the minimum value
df['A'].idxmin()
Find the position where column A has the minimum value
df['A'].values.argmin()
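A toy example contrasting the two: idxmin returns the index label, argmin the integer position.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2]}, index=['x', 'y', 'z'])
label = df['A'].idxmin()            # index label of the minimum -> 'y'
position = df['A'].values.argmin()  # integer position of the minimum -> 1
```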
Plot a one-dimensional random walk in one line
pd.Series(np.random.choice([-1,1], size = 10_000).cumsum()).plot()
R is a numpy array containing n returns: calculate the final capital (cumulative return)
(R + 1).prod()
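A quick sketch with three made-up periodic returns: compounding them gives the final capital per unit invested.

```python
import numpy as np

R = np.array([0.10, -0.05, 0.02])  # three periodic returns (illustrative)
growth = (R + 1).prod()            # final capital per unit invested
```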
Create a new DataFrame from an old one where we retain the positive values but we replace
the negative values with -1
df_new = df_old.where(df_old > 0, -1)
Instantiate a time series DataFrame of 252 daily gaussian data ending at 1st January 2020
df = pd.DataFrame({'Normal Data' : np.random.randn(252)}, index = pd.date_range(end = '1/1/2020', periods = 252))
Sort a DataFrame in decreasing order by a given column
df.sort_values(by = 'col_name', ascending = False)
Sort in place a DataFrame by index
df.sort_index(inplace = True)
Drop rows with at least 1 null value
df.dropna(axis = 0)
Drop rows where only the C column has a null value
df.dropna(subset = ['C'])
Replace null values with the median value of the corresponding column
df.fillna(df.median())
Rename a column
df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
Concatenate two Series s1 and s2 in order to have a DataFrame with two columns
pd.concat([s1, s2], axis = 1)
Each element of a series is an element of a given list: true or false?
s.isin(given_list)
Number of null values for each column
df.isnull().sum(axis = 0)
Create dummy variables for a categorical column and then discard one (in order to avoid the perfect collinearity)
pd.get_dummies(df['categorical_col']).iloc[:,:-1]
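A small sketch with a made-up categorical column; iloc[:, :-1] drops the last level (passing drop_first=True to pd.get_dummies drops the first level instead, with the same collinearity fix):

```python
import pandas as pd

df = pd.DataFrame({'categorical_col': ['a', 'b', 'a', 'c']})
# dummy columns for levels a, b, c; drop the last one
dummies = pd.get_dummies(df['categorical_col']).iloc[:, :-1]
```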
Returns
df['Adj Close'].pct_change()
Log returns
np.log(df['Adj Close'] / df['Adj Close'].shift(1))
100-day rolling correlation
df['AAPL_rets'].rolling(100).corr(df['SPY_rets'])
We have a daily price dataframe: find the first, high, low and last price of each month
df['Adj Close'].resample('M').ohlc()
Convert an index of str into a DatetimeIndex
df.index = pd.to_datetime(df.index)
Divide a series into quartiles and give them a label
pd.qcut(s, 4, labels = 'Q1 Q2 Q3 Q4'.split())
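A toy example on the values 1..8: each quartile bin receives two values, so the smallest fall in Q1 and the largest in Q4.

```python
import pandas as pd

s = pd.Series(range(1, 9))  # 1 .. 8
quartiles = pd.qcut(s, 4, labels='Q1 Q2 Q3 Q4'.split())
```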
Bootstrap a DataFrame with 20 draws (sampling with replacement)
df.sample(20, replace = True)
Format each value of the DataFrame with 4 decimal places
df.applymap(lambda x: '%.4f' % x)
Group by column C and apply a function f(df, k, q) to each group.
df.groupby('C').apply(f, k, q)
Group by column C and compute the sum for column A and the mean and std for column B
df.groupby('C').agg({'A' : 'sum', 'B' : ['mean', 'std']})
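A minimal sketch on made-up data; because column B gets a list of aggregations, the result carries a MultiIndex on its columns, e.g. ('A', 'sum') and ('B', 'mean'):

```python
import pandas as pd

df = pd.DataFrame({'C': ['x', 'x', 'y'],
                   'A': [1, 2, 3],
                   'B': [1.0, 3.0, 5.0]})
# sum for A; mean and std for B, per group of C
out = df.groupby('C').agg({'A': 'sum', 'B': ['mean', 'std']})
```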
Create a pivot table where both index and columns are built from columns of the original DataFrame, then apply the mean function to the 'Returns' column.
df.pivot_table(values = ['Returns'], index = ['Utilities','Financial','Energy'], columns = ['Day of the week'], aggfunc = 'mean')