In this article, arr is a NumPy array, df is a pandas DataFrame, and s is a pandas Series.
Sort in place arr.sort()
Sort out of place np.sort(arr)
Find the sorted unique values np.unique(arr)
If arr1 is a (3 x 2) matrix and arr2 is a (2 x 4) matrix
arr1 @ arr2
is the (3 x 4) matrix product.
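A quick shape check with toy data (the small arrays here are only illustrative):

```python
import numpy as np

arr1 = np.arange(6).reshape(3, 2)   # (3 x 2)
arr2 = np.arange(8).reshape(2, 4)   # (2 x 4)

product = arr1 @ arr2               # matrix product, shape (3 x 4)
```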
Inverse of a matrix
np.linalg.inv(arr)
Determinant of a matrix
np.linalg.det(arr)
Draw a random (4 x 3) matrix from a standard normal distribution
np.random.randn(4,3)
Draw 100 random numbers chosen from the list [1,5,10] with different probabilities
np.random.choice([1,5,10], p = [0.2, 0.3, 0.5], size = 100)
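A minimal sketch of the draw (seeding first so the result is reproducible; the seed value is arbitrary):

```python
import numpy as np

np.random.seed(2981)  # fix the seed so the draws are reproducible
# 100 draws from [1, 5, 10] with probabilities 0.2, 0.3, 0.5
draws = np.random.choice([1, 5, 10], p=[0.2, 0.3, 0.5], size=100)
```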
Fix the generator seed of random numbers np.random.seed(2981)
Find the position of the maximum value np.argmax(arr)
or arr.argmax()
You have a 2 x 2 numpy array and you want all its values in a single 1-D array
np.ravel(arr)
or arr.ravel()
Both operate out of place.
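A small sketch showing that the original array is left untouched (the values are illustrative):

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
flat = arr.ravel()   # 1-D result (a view when possible); arr itself is unchanged
```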
Draw random sample from a multivariate normal distribution
np.random.multivariate_normal(mean, cov)
where mean is an array of length N and cov is the covariance matrix (N x N)
Create a new array from an old one where the negative values are replaced with 0
and the positive values with 10
new_array = np.where(old_array < 0, 0, 10)
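A toy example of the replacement (note that zeros, being non-negative, are mapped to 10):

```python
import numpy as np

old_array = np.array([-3, 7, 0, -1, 2])
# negatives -> 0, everything else -> 10
new_array = np.where(old_array < 0, 0, 10)
```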
Convert a numpy array of str to an array of float.
arr.astype(float)
It operates out of place.
70th percentile of a numpy array.
np.percentile(arr, 70)
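A quick sanity check on a small array (with the default linear interpolation, the 70th percentile of 0..10 falls exactly on 7):

```python
import numpy as np

arr = np.arange(11)           # 0 .. 10
p70 = np.percentile(arr, 70)  # 70th percentile
```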
Vertical stack: arr1 is a (3 x 2) numpy array and arr2 is a (2 x 2).
np.r_[arr1,arr2]
is a (5 x 2) numpy array.
Horizontal stack: arr1 is a (3 x 2) numpy array and arr2 is a (3 x 6).
np.c_[arr1,arr2]
is a (3 x 8) numpy array.
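Both stacks in one sketch, checking only the resulting shapes (the filler values are illustrative):

```python
import numpy as np

arr1 = np.ones((3, 2))
arr2 = np.zeros((2, 2))
v = np.r_[arr1, arr2]    # vertical stack -> (5, 2)

arr3 = np.zeros((3, 6))
h = np.c_[arr1, arr3]    # horizontal stack -> (3, 8)
```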
Reshape out of place an array of 12 elements into a (4 x 3) matrix
arr.reshape((4,3))
Find the maximum value for each index among 4 arrays of the same shape. Note that np.maximum compares only two arrays at a time, so for more than two use its reduce method:
np.maximum.reduce([arr1, arr2, arr3, arr4])
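A toy example of the element-wise maximum across four arrays (the values are illustrative):

```python
import numpy as np

arr1 = np.array([1, 9, 2])
arr2 = np.array([3, 1, 2])
arr3 = np.array([0, 4, 8])
arr4 = np.array([2, 2, 2])

# element-wise maximum across all four arrays at once
best = np.maximum.reduce([arr1, arr2, arr3, arr4])
```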
Display only 2 floating points for each Series or DataFrame
pd.options.display.float_format = '{:.2f}'.format
Set a column as index df.set_index('name_col', inplace = True)
Go back to the original df: df.reset_index(inplace = True)
Delete a column of a DataFrame del df['col_name']
Delete row2 and row5 of a DataFrame df.drop(['row2', 'row5'], inplace = True)
Subset of df where both df['col1'] and df['col2'] verify the condition
df[(df['col1'] < 0) & (df['col2'] > 10)]
Note that you can't replace & with and, and the parentheses are necessary.
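A minimal sketch of the boolean filter on toy data (only the first row satisfies both conditions):

```python
import pandas as pd

df = pd.DataFrame({'col1': [-1, -2, 3], 'col2': [15, 5, 20]})
# parentheses around each condition are required
subset = df[(df['col1'] < 0) & (df['col2'] > 10)]
```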
You have a Series but you want to give a name to the column and highlight it
s.rename('Column_Name').to_frame()
Count the number of positive and not null numbers in each column of a DataFrame
(df > 0).sum(axis = 0)
Find the index label where column A has the minimum value
df['A'].idxmin()
Find the position where column A has the minimum value
df['A'].values.argmin()
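A toy example contrasting the two: idxmin returns the index label, argmin the integer position.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2]}, index=['x', 'y', 'z'])
label = df['A'].idxmin()            # index label of the minimum -> 'y'
position = df['A'].values.argmin()  # integer position of the minimum -> 1
```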
Plot a one-dimensional random walk in one line
pd.Series(np.random.choice([-1,1], size = 10_000).cumsum()).plot()
R is a numpy array containing n returns: calculate the final capital (cumulative return)
(R + 1).prod()
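A quick sketch with three made-up periodic returns: compounding them gives the final capital per unit invested.

```python
import numpy as np

R = np.array([0.10, -0.05, 0.02])  # three periodic returns (illustrative)
growth = (R + 1).prod()            # final capital per unit invested
```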
Create a new DataFrame from an old one where we retain the positive values but we replace
the negative values with -1
df_new = df_old.where(df_old > 0, -1)
Instantiate a time series DataFrame of 252 daily gaussian data ending at 1st January 2020
df = pd.DataFrame({'Normal Data' : np.random.randn(252)}, index = pd.date_range(end = '1/1/2020', periods = 252))
Sort a DataFrame in decreasing order by a given column
df.sort_values(by = 'col_name', ascending = False)
Sort in place a DataFrame by index
df.sort_index(inplace = True)
Drop rows with at least 1 null value
df.dropna(axis = 0)
Drop rows where only the C column has a null value
df.dropna(subset = ['C'])
Replace null values with the median value of the corresponding column
df.fillna(df.median())
Rename a column
df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
Concatenate two Series s1 and s2 in order to have a DataFrame with two columns
pd.concat([s1, s2], axis = 1)
Each element of a series is an element of a given list: true or false?
s.isin(given_list)
Number of null values for each column
df.isnull().sum(axis = 0)
Create dummy variables for a categorical column and then discard one (in order to avoid the perfect collinearity)
pd.get_dummies(df['categorical_col']).iloc[:,:-1]
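A small sketch with a made-up categorical column; iloc[:, :-1] drops the last level (passing drop_first=True to pd.get_dummies drops the first level instead, with the same collinearity fix):

```python
import pandas as pd

df = pd.DataFrame({'categorical_col': ['a', 'b', 'a', 'c']})
# dummy columns for levels a, b, c; drop the last one
dummies = pd.get_dummies(df['categorical_col']).iloc[:, :-1]
```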
Returns
df['Adj Close'].pct_change()
Log returns
np.log(df['Adj Close'] / df['Adj Close'].shift(1))
100-day rolling correlation
df['AAPL_rets'].rolling(100).corr(df['SPY_rets'])
We have a daily price dataframe: find the first, high, low and last price of each month
df['Adj Close'].resample('M').ohlc()
Convert an index of str into a DatetimeIndex
df.index = pd.to_datetime(df.index)
Divide a series into quartiles and give them a label
pd.qcut(s, 4, labels = 'Q1 Q2 Q3 Q4'.split())
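A toy example on the values 1..8: each quartile bin receives two values, so the smallest fall in Q1 and the largest in Q4.

```python
import pandas as pd

s = pd.Series(range(1, 9))  # 1 .. 8
quartiles = pd.qcut(s, 4, labels='Q1 Q2 Q3 Q4'.split())
```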
Bootstrap a DataFrame with 20 draws (sampling with replacement)
df.sample(20, replace = True)
Format each value of the DataFrame with 4 decimal places
df.applymap(lambda x: '%.4f' % x)
Group by column C and apply a function f(df, k, q) to each group.
df.groupby('C').apply(f, k, q)
Group by column C and compute the sum for column A and the mean and std for column B
df.groupby('C').agg({'A' : 'sum', 'B' : ['mean', 'std']})
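A minimal sketch on made-up data; because column B gets a list of aggregations, the result carries a MultiIndex on its columns, e.g. ('A', 'sum') and ('B', 'mean'):

```python
import pandas as pd

df = pd.DataFrame({'C': ['x', 'x', 'y'],
                   'A': [1, 2, 3],
                   'B': [1.0, 3.0, 5.0]})
# sum for A; mean and std for B, per group of C
out = df.groupby('C').agg({'A': 'sum', 'B': ['mean', 'std']})
```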
Create a pivot table where both index and columns are built from columns of the original DataFrame, then apply the mean function to the 'Returns' column.
df.pivot_table(values = ['Returns'], index = ['Utilities','Financial','Energy'], columns = ['Day of the week'], aggfunc = 'mean')