MLFrame

class mlframe.MLFrame(frame, **kwargs)

Bases: pandas.core.frame.DataFrame

A pd.DataFrame with an inplace model, and LinearRegression modeling functions.

See pandas.DataFrame documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

Attributes Summary

model [statsmodels.regression.linear_model.OLS] https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html

Methods Summary

`cat_cols`()	Computes and returns Categorical columns
`clean_col_names`([inplace, verbose, …])	Cleans the column names of a DataFrame for use in an R~Formula
`copy`(*args, **kwargs)	Make a copy of this object’s indices and data.
`drop`(*args, **kwargs)	Drop specified labels from rows or columns.
`drop_nulls_perc`(perc[, inplace, verbose])	Drops a column if the null value is over a certain percentage (0-1)
`fill_na_kind`([kind, columns, custom, …])	Fills na cells with the selection of its respective column
`fillna`(*args, **kwargs)	Fill NA/NaN values using the specified method.
`find_outliers_IQR`(col[, verbose])	Finds outliers using the IQR method
`find_outliers_Z`(col[, verbose])	Finds outliers using the z_score method
`find_outliers_cooks_d`(target[, threshold, …])	Finds outliers using the Cook’s Distance method
`get_cols`(name)	Returns list of columns with name or names in it
`get_nulls`([verbose])	Returns sum of all nulls in the dataframe
`get_r_squareds`([verbose])	Tests models price to each column in the dataframe.
`get_vif`(target[, verbose])	Computes the Variance Inflation Factor for the columns of a dataframe based on the target column
`get_vif_cols`(target[, threshold, verbose, …])	Computes Variance Inflation Factor for the dataframe, and gets the columns that are above the defined threshold
`info`(*args, **kwargs)	Print a concise summary of a DataFrame.
`log`(columns[, inplace, verbose])	logs the listed columns of the dataframe
`lrmodel`([target, columns, inplace, verbose])	Creates a LinearRegression model of target
`model_and_plot`(target[, figsize, verbose])	Creates a new model based on target, plots a scatter plot of (target, model residuals), and plots a qqplot based on the model residuals.
`model_resid_scatter`(target[, ax, title, …])	Plots a scatter plot and axhline based on target and the model’s residuals
`ms_matrix`(**kwargs)	Plots a missingno matrix
`num_cols`()	Computes and returns Numerical columns
`one_hot_encode`([columns, drop_first, verbose])	Makes a one hot encoded dataframe
`outlier_removal`([columns, IQR, z_score, …])	Removes outliers based on IQR or z_score or Cook’s Distance
`plot_coef`([cmap])	Plots a predefined plot of the model’s coefficients
`plot_corr`([figsize, annot])	Plots a predefined correlation heatmap
`qq_plot`([model])	Plots a statsmodels QQplot of the dataframe
`replace`(*args, **kwargs)	Replace values given in to_replace with value.
`replace_all`(string[, replace_numbers])	Replaces bad characters in a string for column names to work in an R~formula
`scale`(columns[, inplace, verbose])	Scales the listed columns of the dataframe
`train_test_split`(target[, test_size, seed, …])	Runs a train test split algorithm on the data
`wrap__getitem__`(df)	Wrapper for get item [] so that it returns an MLFrame rather than a pd.DataFrame
`wrapper`()	Wrapper to return a MLFrame, and set the model when defined pd.DataFrame methods are used on a MLFrame
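
The two wrappers above exist so that pandas operations hand back an MLFrame instead of a plain pd.DataFrame. For comparison, pandas itself offers a subclassing hook, the `_constructor` property, that has a similar effect. The sketch below uses that hook on a hypothetical `ModelFrame` class. It is an alternative technique shown for illustration, not a description of how MLFrame is implemented.

```python
import pandas as pd


class ModelFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass; not part of mlframe."""

    @property
    def _constructor(self):
        # pandas calls this whenever an operation builds a new frame,
        # so slices and filters keep the subclass type.
        return ModelFrame


mf = ModelFrame({"a": [1, 2], "b": [3, 4]})
subset = mf[["a"]]          # column selection
filtered = mf[mf["a"] > 1]  # boolean row filter
print(type(subset).__name__, type(filtered).__name__)
```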

Attributes Documentation

model = None: [statsmodels.regression.linear_model.OLS] https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html

Methods Documentation

cat_cols(): Computes and returns Categorical columns

clean_col_names(inplace=False, verbose=True, replace_numbers=False)

Cleans the column names of a DataFrame for use in an R~Formula

inplace[bool]::: Defines whether to return a new dataframe or mutate the dataframe
verbose[bool]::: Whether to show the difference between the old columns and clean columns or not
replace_numbers[bool]::: Whether to replace numbers with their English counterpart, e.g. (1 -> one)

None if inplace, otherwise returns a copy of the dataframe

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names()
Columns changed:
model year --> model_year
car name --> car_name
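
As a rough illustration of the renaming shown above, the sketch below cleans names with a regular expression. The helper `clean_name` is hypothetical and its rules (collapse non-word characters to an underscore, prefix names that start with a digit) are assumptions, not the rules `clean_col_names` actually applies.

```python
import re


def clean_name(name):
    """Hypothetical helper: make a column name usable as a formula term."""
    # Collapse every run of non-word characters into one underscore,
    # then trim underscores left at either end.
    cleaned = re.sub(r"\W+", "_", name.strip()).strip("_")
    # A name starting with a digit is not a valid Python identifier,
    # so prefix it with an underscore (an assumption of this sketch).
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


columns = ["model year", "car name", "mpg"]
renamed = {old: clean_name(old) for old in columns}
for old, new in renamed.items():
    if old != new:
        print(f"{old} --> {new}")
```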

copy(*args, **kwargs)

Make a copy of this object’s indices and data.

When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).

When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

deep : bool, default True: Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.

copy : Series or DataFrame: Object type matches caller.

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).

While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64

>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64

Shallow copy versus default (deep) copy:

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

Shallow copy shares data and index with original.

>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True

Deep copy has own copy of data and index.

>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False

Updates to the data shared by shallow copy and original is reflected in both; deep copy remains unchanged.

>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64

Note that when copying an object containing Python objects, a deep copy will copy the data, but will not do so recursively. Updating a nested data object will be reflected in the deep copy.

>>> s = pd.Series([[1, 2], [3, 4]])
>>> deep = s.copy()
>>> s[0][0] = 10
>>> s
0    [10, 2]
1     [3, 4]
dtype: object
>>> deep
0    [10, 2]
1     [3, 4]
dtype: object

drop(*args, **kwargs)

Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

labels : single label or list-like: Index or column labels to drop.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0: Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
index : single label or list-like: Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
columns : single label or list-like: Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
level : int or level name, optional: For MultiIndex, level from which the labels will be removed.
inplace : bool, default False: If False, return a copy. Otherwise, do operation inplace and return None.
errors : {‘ignore’, ‘raise’}, default ‘raise’: If ‘ignore’, suppress error and only existing labels are dropped.

DataFrame: DataFrame without the removed index or column labels.

KeyError: If any of the labels is not found in the selected axis.

DataFrame.loc : Label-location based indexer for selection by label.
DataFrame.dropna : Return DataFrame with labels on given axis omitted where (all or any) data are missing.
DataFrame.drop_duplicates : Return DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.drop : Return Series with specified index labels removed.

>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns

>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11

>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Drop a row by index

>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

Drop columns and/or rows of MultiIndex DataFrame

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2

>>> df.drop(index='cow', columns='small')
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3

>>> df.drop(index='length', level=1)
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8

drop_nulls_perc(perc, inplace=False, verbose=True)

Drops a column if the null value is over a certain percentage (0-1)

perc[float]::: The fraction of null values (0-1) above which a column gets dropped
inplace[bool]::: Defines whether to return a new dataframe or mutate the dataframe
verbose[bool]::: Whether to print out the series or not

None if inplace, otherwise returns copy of dataframe with columns dropped

>>> df = MLFrame(pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D']))
>>> df['A'].loc[1:3] = np.nan
>>> df['B'].loc[0] = np.nan
>>> df
    A    B   C   D
0  0.0  NaN   2   3
1  NaN  5.0   6   7
2  NaN  9.0  10  11
>>> df.drop_nulls_perc(.4)
    B   C   D
0  NaN   2   3
1  5.0   6   7
2  9.0  10  11
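
The same selection can be sketched in plain pandas, assuming "over a certain percentage" means the per-column share of nulls is strictly greater than `perc`. The frame below is built directly rather than through MLFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0.0, np.nan, np.nan],
    "B": [np.nan, 5.0, 9.0],
    "C": [2, 6, 10],
    "D": [3, 7, 11],
})

perc = 0.4
# Share of null cells per column: the mean of a boolean mask.
null_fraction = df.isna().mean()
# Columns whose null share exceeds the cutoff.
to_drop = null_fraction[null_fraction > perc].index
result = df.drop(columns=to_drop)
print(list(result.columns))
```

Here column A is two-thirds null and is dropped, while B at one-third stays.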

fill_na_kind(kind='mean', columns=[], custom=0, inplace=False, verbose=True)

Fills na cells with the selection of its respective column

kind[str]::: One of ‘mean’ (default), ‘mode’, ‘median’, ‘perc’ (percent value_counts of its respective column), or ‘custom’
columns[str or list]::: the column or columns to fill, defaults to all
custom::: the value to fill the NA cells with when kind=’custom’, defaults to 0
inplace[bool]::: Defines whether to return a new dataframe or mutate the dataframe.
verbose[bool]::: Whether to print out the filling information or not.

None if inplace, otherwise returns copy of dataframe with nulls filled with kind selected

>>> df = MLFrame(pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D']))
>>> df['A'].loc[1:3] = np.nan
>>> df['B'].loc[0] = np.nan
>>> df
    A    B   C   D
0  0.0  NaN   2   3
1  NaN  5.0   6   7
2  NaN  9.0  10  11
>>> df.fill_na_kind('mean')
Filling 66.67% of A with nan
Filling 33.33% of B with 9.0
    A    B    C   D
0  0.0  5.0   2   3
1  0.0  5.0   6   7
2  0.0  9.0  10  11
>>> df.fill_na_kind('custom', custom=18)
Filling 66.67% of A with 18
Filling 33.33% of B with 18
    A    B   C   D
0  0.0  18   2   3
1  18  5.0   6   7
2  18  9.0  10  11
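
A minimal sketch of per-column filling in plain pandas follows, using a small frame of its own. The helper `fill_value` is hypothetical and covers only the ‘mean’, ‘median’, ‘mode’ and ‘custom’ kinds; the ‘perc’ kind is left out because its exact behaviour is not described here.

```python
import numpy as np
import pandas as pd


def fill_value(col, kind="mean", custom=0):
    """Hypothetical helper: choose the fill value for one column."""
    if kind == "mean":
        return col.mean()
    if kind == "median":
        return col.median()
    if kind == "mode":
        # mode() can return several values; take the first one.
        return col.mode().iloc[0]
    if kind == "custom":
        return custom
    raise ValueError(f"unsupported kind: {kind!r}")


df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 4.0, 4.0]})

# Passing a dict to fillna applies a different value to each column.
by_mean = df.fillna({c: fill_value(df[c], "mean") for c in df.columns})
by_custom = df.fillna({c: fill_value(df[c], "custom", custom=18) for c in df.columns})
print(by_mean)
print(by_custom)
```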

fillna(*args, **kwargs)

Fill NA/NaN values using the specified method.

value : scalar, dict, Series, or DataFrame: Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None: Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use next valid observation to fill gap.
axis : {0 or ‘index’, 1 or ‘columns’}: Axis along which to fill missing values.
inplace : bool, default False: If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limit : int, default None: If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast : dict, default is None: A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

DataFrame or None: Object with missing values filled or None if inplace=True.

interpolate : Fill NaN values using interpolation. reindex : Conform object to new index. asfreq : Convert TimeSeries to specified frequency.

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list('ABCD'))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

Replace all NaN elements with 0s.

>>> df.fillna(0)
    A   B   C   D
0   0.0 2.0 0.0 0
1   3.0 4.0 0.0 1
2   0.0 0.0 0.0 5
3   0.0 3.0 0.0 4

We can also propagate non-null values forward or backward.

>>> df.fillna(method='ffill')
    A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   3.0 4.0 NaN 5
3   3.0 3.0 NaN 4

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 2.0 1
2   0.0 1.0 2.0 5
3   0.0 3.0 2.0 4

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 NaN 1
2   NaN 1.0 NaN 5
3   NaN 3.0 NaN 4

find_outliers_IQR(col, verbose=True)

Finds outliers using the IQR method

col[str]::: Name of the column to search for outliers in
verbose[bool]::: Whether to print out the series or not

True/False Series of the outliers (True is outlier)

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> idx_outliers = df.find_outliers_IQR('horsepower', verbose=True)
Found 10 outliers using IQR in horsepower or ~ 2.55%
>>> df = MLFrame(df[~idx_outliers])
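
For readers unfamiliar with the IQR rule, the sketch below builds the same kind of True/False Series in plain pandas. The 1.5 multiplier is the conventional choice and an assumption here; the multiplier used by `find_outliers_IQR` is not stated in this page. The data is made up.

```python
import pandas as pd


def iqr_outliers(col, k=1.5):
    """Hypothetical helper: True where a value lies outside the IQR fences."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    # Anything more than k * IQR below Q1 or above Q3 is flagged.
    return (col < q1 - k * iqr) | (col > q3 + k * iqr)


horsepower = pd.Series([90, 95, 100, 105, 110, 300], name="horsepower")
mask = iqr_outliers(horsepower)
kept = horsepower[~mask]
print(mask.tolist())
```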

find_outliers_Z(col, verbose=True)

Finds outliers using the z_score method

col[str]::: Name of the column to search for outliers in
verbose[bool]::: Whether to print out the series or not

True/False Series of the outliers (True is outlier)

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> idx_outliers = df.find_outliers_Z('horsepower', verbose=True)
Found 5 outliers using z_score in horsepower or ~ 1.28%
>>> df = MLFrame(df[~idx_outliers])
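
A z-score mask can be sketched the same way. The cutoff of 3 standard deviations and the use of the population standard deviation (`ddof=0`) are assumptions of this sketch, not documented behaviour of `find_outliers_Z`, and the data is made up.

```python
import pandas as pd


def z_score_outliers(col, threshold=3.0):
    """Hypothetical helper: True where |z-score| exceeds the threshold."""
    z = (col - col.mean()) / col.std(ddof=0)
    return z.abs() > threshold


# Nineteen ordinary values plus one extreme value.
horsepower = pd.Series(list(range(90, 109)) + [300], name="horsepower")
mask = z_score_outliers(horsepower)
print(int(mask.sum()), "outlier(s) found")
```

Note that with very few rows no value can reach a z-score of 3, which is why the example uses twenty rows.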

find_outliers_cooks_d(target, threshold=None, verbose=True)

Finds outliers using the Cook’s Distance method

target[str]::: Name of the target column for your model.
threshold[int]::: Threshold at which to drop outliers, defaults to 4/n, n being the length of the data frame.
verbose[bool]::: Whether to print out the series or not

True/False Series of the outliers (True is outlier)

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> idx_outliers = df.find_outliers_cooks_d('horsepower', verbose=True)
>>> df = MLFrame(df[~idx_outliers])
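
To make the 4/n default concrete, the sketch below computes Cook's distance by hand with NumPy for a one-predictor least-squares fit and flags points above 4/n. It is an illustration under stated assumptions (a single predictor, made-up data, the hypothetical helper `cooks_distance`), not the code path MLFrame uses.

```python
import numpy as np


def cooks_distance(x, y):
    """Hypothetical helper: Cook's distance for an OLS fit of y on x."""
    X = np.column_stack([np.ones(len(y)), x])  # add an intercept column
    n, p = X.shape
    hat = X @ np.linalg.inv(X.T @ X) @ X.T     # hat (projection) matrix
    leverage = np.diag(hat)
    resid = y - hat @ y
    mse = resid @ resid / (n - p)
    # D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2
    return resid**2 / (p * mse) * leverage / (1 - leverage) ** 2


x = np.arange(10, dtype=float)
y = 2 * x + 1
y[9] += 20                      # corrupt the last observation

distances = cooks_distance(x, y)
threshold = 4 / len(y)          # the 4/n default described above
flagged = distances > threshold
print(flagged.tolist())
```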

get_cols(name)

Returns list of columns with name or names in it

name[str, list]::: str or list of str for column selection
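
One plausible reading of "with name or names in it" is substring matching on the column labels. The sketch below follows that reading with a hypothetical helper `matching_cols`; the matching rule is an assumption and may differ from what `get_cols` does.

```python
import pandas as pd


def matching_cols(df, name):
    """Hypothetical helper: columns whose label contains any given substring."""
    names = [name] if isinstance(name, str) else list(name)
    return [col for col in df.columns if any(n in col for n in names)]


df = pd.DataFrame(columns=["mpg", "model_year", "car_name"])
print(matching_cols(df, "name"))
print(matching_cols(df, ["model", "mpg"]))
```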

get_nulls(verbose=True)

Returns sum of all nulls in the dataframe

verbose[bool]::: Whether to print out the null count of each row or not

>>> df = MLFrame(pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D']))
>>> df['A'].loc[1:3] = np.nan
>>> df['B'].loc[0] = np.nan
>>> df
    A    B   C   D
0  0.0  NaN   2   3
1  NaN  5.0   6   7
2  NaN  9.0  10  11
>>> df.get_nulls(verbose=False)
3
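
In plain pandas the same total can be obtained by summing a null mask twice, once per column and once overall. This is a sketch of the idea, not the MLFrame implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0.0, np.nan, np.nan],
    "B": [np.nan, 5.0, 9.0],
    "C": [2, 6, 10],
    "D": [3, 7, 11],
})

per_column = df.isna().sum()   # null count for each column
total = int(per_column.sum())  # null count for the whole frame
print(per_column.to_dict(), total)
```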

get_r_squareds(verbose=True)

Tests models price to each column in the dataframe.

verbose[bool]::: Whether to print out the series or not

sorted pd.Series of columns –> r_squared

get_vif(target, verbose=True)

Computes the Variance Inflation Factor for the columns of a dataframe based on the target column

target[str]::: The column name to base the VIF on
verbose[bool]::: Whether or not to print out the VIF series

Series of variance_inflation_factor for each column

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.drop(['car name'], axis=1, inplace=True)
>>> df.get_vif('mpg', verbose=False)
const          763.558
cylinders       10.738
displacement    21.837
horsepower       9.944
weight          10.831
acceleration     2.626
model year       1.245
origin           1.772
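
The variance inflation factor of a column is 1 / (1 - R²), where R² comes from regressing that column on all the other predictor columns. The sketch below computes it with NumPy on made-up data. The helper `vif` is hypothetical, and it reports no `const` row, unlike the output above.

```python
import numpy as np
import pandas as pd


def vif(features):
    """Hypothetical helper: VIF of each column against the other columns."""
    X = features.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(features.columns):
        y = X[:, j]
        # Regress column j on an intercept plus every other column.
        others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        centered = y - y.mean()
        r_squared = 1 - (resid @ resid) / (centered @ centered)
        out[name] = 1 / (1 - r_squared)
    return pd.Series(out)


features = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [1, -1, -1, 1, 1, -1, -1, 1],  # uncorrelated with a and c
    "c": [2, 1, 4, 3, 6, 5, 8, 7],      # strongly correlated with a
})
factors = vif(features)
print(factors.round(3))
```

Columns `a` and `c` inflate each other, while `b` stays at 1 because it carries no information about either.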

get_vif_cols(target, threshold=6, verbose=True, inplace=False)

Computes Variance Inflation Factor for the dataframe, and gets the columns that are above the defined threshold

target[str]::: The column name to base the VIF on
threshold[int]::: The VIF value above which a column is considered an issue and needs to be looked at, defaults to 6
verbose[bool]::: Whether to print out the series or not
inplace[bool]::: Whether to return the series or not

Series of variance_inflation_factor for each column above the threshold, depending on inplace

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.drop(['car name'], axis=1, inplace=True)
>>> df.get_vif_cols('mpg', verbose=False)
horsepower      9.944
cylinders      10.738
weight         10.831
displacement   21.837
dtype: float64
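
The filtering step can be reproduced from the `get_vif` figures listed earlier. The sketch below assumes the `const` row is excluded and the result is sorted in ascending order, which matches the output shown but is inferred rather than documented.

```python
import pandas as pd

# VIF values copied from the get_vif example output.
vif = pd.Series({
    "const": 763.558,
    "cylinders": 10.738,
    "displacement": 21.837,
    "horsepower": 9.944,
    "weight": 10.831,
    "acceleration": 2.626,
    "model year": 1.245,
    "origin": 1.772,
})

threshold = 6
predictors = vif.drop("const")  # assumption: the intercept is not reported
above = predictors[predictors > threshold].sort_values()
print(above)
```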

info(*args, **kwargs)

Print a concise summary of a DataFrame.

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

data : DataFrame

DataFrame to print information about.

verbose : bool, optional

Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columns is followed.

buf : writable buffer, defaults to sys.stdout

Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.

max_cols : int, optional

When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.

memory_usage : bool, str, optional

Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting.

True always show memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based in column dtype and number of rows assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.

null_counts : bool, optional

Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.

None: This method prints a summary of a DataFrame and returns None.

DataFrame.describe: Generate descriptive statistics of DataFrame columns.
DataFrame.memory_usage: Memory usage of DataFrame columns.

>>> int_values = [1, 2, 3, 4, 5]
>>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
>>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
...                   "float_col": float_values})
>>> df
    int_col text_col  float_col
0        1    alpha       0.00
1        2     beta       0.25
2        3    gamma       0.50
3        4    delta       0.75
4        5  epsilon       1.00

Prints information of all columns:

>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   int_col    5 non-null      int64
 1   text_col   5 non-null      object
 2   float_col  5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes

Prints a summary of columns count and its dtypes but not per column information:

>>> df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 3 entries, int_col to float_col
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes

Pipe output of DataFrame.info to buffer instead of sys.stdout, get buffer content and writes to a text file:

>>> import io
>>> buffer = io.StringIO()
>>> df.info(buf=buffer)
>>> s = buffer.getvalue()
>>> with open("df_info.txt", "w",
...           encoding="utf-8") as f:  # doctest: +SKIP
...     f.write(s)
260

The memory_usage parameter allows deep introspection mode, specially useful for big DataFrames and fine-tune memory optimization:

>>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6)
>>> df = pd.DataFrame({
...     'column_1': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_2': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_3': np.random.choice(['a', 'b', 'c'], 10 ** 6)
... })
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 22.9+ MB

>>> df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 188.8 MB

log(columns, inplace=False, verbose=True)

Takes the logarithm of the listed columns of the dataframe

columns[list, str]::: A column or list of columns to take the logarithm of
inplace[bool]::: Defines whether to return a new dataframe or mutate the dataframe
verbose[bool]::: Whether to print out logged columns or not

None if inplace otherwise returns a copy of the dataframe with columns logged

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.drop(['car name'], axis=1, inplace = True)

>>> df = df.log(columns=['mpg', 'cylinders'])
Logging:
   mpg
   cylinders
>>> # OR
>>> df.log('mpg', inplace=True)
Logging:
   mpg
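
A log transform of selected columns can be sketched in plain pandas as below. The natural logarithm (`np.log`) is an assumption; this page does not state which base `log` uses. The data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0],
    "cylinders": [8, 8, 4],
    "origin": [1, 1, 3],
})

columns = ["mpg", "cylinders"]
# assign() returns a new frame, leaving the original untouched,
# which mirrors the inplace=False behaviour described above.
logged = df.assign(**{col: np.log(df[col]) for col in columns})
print(logged.round(3))
```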

lrmodel(target=None, columns=[], inplace=False, verbose=True, **kwargs)

Creates a LinearRegression model of target

target[str]::: The target for which to model on
columns[list]::: a list of columns on which to build the model. If empty, uses all columns except target
inplace[bool]::: Defines whether to return a new dataframe or mutate the dataframe
verbose[bool]::: Whether or not to display the model.summary()
kwargs{dict}::: Arguments that are sent to Model.from_formula() see:

https://www.statsmodels.org/stable/generated/statsmodels.formula.api.ols.html

None if inplace, otherwise returns the model

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True)
>>> df.lrmodel('mpg', verbose=False, inplace=True)
>>> df.model.pvalues.max()
0.9996627853521083
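
Since the keyword arguments are forwarded to a formula-based constructor, the model is presumably described by an R-style formula string such as `mpg ~ cylinders + weight`. The sketch below shows how such a string could be assembled from a target and a column list. The helper `build_formula` is hypothetical, and how MLFrame actually builds its formula is an assumption.

```python
def build_formula(target, all_columns, columns=()):
    """Hypothetical helper: assemble an R-style formula string."""
    # With no explicit columns, fall back to every column except the target.
    predictors = list(columns) or [c for c in all_columns if c != target]
    return f"{target} ~ " + " + ".join(predictors)


all_columns = ["mpg", "cylinders", "weight", "model_year"]
print(build_formula("mpg", all_columns))
print(build_formula("mpg", all_columns, ["weight"]))
```

This also shows why column names need cleaning first: a name like `model year` would not parse as a single formula term.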

model_and_plot(target, figsize=(10, 10), verbose=True, **kwargs)

Creates a new model based on target, plots a scatter plot of (target, model residuals), and plots a qqplot based on the model residuals.

target[str]::: The target for which to model on
verbose[bool]::: Whether or not to display the model.summary()
kwargs{dict}::: Arguments that are sent to Model.from_formula() see:

https://www.statsmodels.org/stable/generated/statsmodels.formula.api.ols.html

model

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True)
>>> df.model_and_plot('mpg')

model_resid_scatter(target, ax=None, title='', scatter_kws={}, line_kws={})

Plots a scatter plot and axhline based on target and the model’s residuals

target[str]::: The target of the model
title[str]::: The title of the plot
ax[matplotlib.axes]::: The axis to plot onto
scatter_kws{dict}::: Arguments to send to the scatter plot see:

https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.scatter.html

line_kws{dict}::: Arguments to send to the axhline see:

https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.axhline.html

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True)
>>> df.lrmodel('mpg', inplace=True)
>>> df.model_resid_scatter('mpg')

ms_matrix(**kwargs)

Plots a missingno matrix

kwargs{dict}::: Arguments to send to ms.matrix

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.ms_matrix()

num_cols(): Computes and returns Numerical columns

one_hot_encode(columns=[], drop_first=True, verbose=True, **kwargs)

Makes a one hot encoded dataframe

columns[list]::: list of columns to one hot encode; uses self.cat_cols() if not defined
drop_first[bool]::: Whether to drop the first encoded column of each feature to reduce multicollinearity, defaults to True
verbose[bool]::: Whether to print out the series or not
kwargs{dict}::: Arguments to send to pd.get_dummies see:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

encoded dataframe

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(verbose=False, inplace=True)
>>> # splitting car_name into model for categorizing
>>> df['model'] = df['car_name'].apply(
...     lambda x: x.split(' ')[0])
>>> df_ohe = df.one_hot_encode(columns=['model'])
Added categorical columns
37 -> model
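
Because the keyword arguments go to `pd.get_dummies`, the effect of `drop_first` can be seen with that function directly. The sketch below uses a tiny made-up frame rather than the auto-mpg data.

```python
import pandas as pd

df = pd.DataFrame({
    "mpg": [18.0, 24.0, 27.0, 26.0],
    "model": ["chevrolet", "toyota", "datsun", "toyota"],
})

# drop_first=True omits the first category (here "chevrolet"), so the
# remaining indicator columns are not perfectly collinear with an intercept.
encoded = pd.get_dummies(df, columns=["model"], drop_first=True)
print(list(encoded.columns))
```

A row with zeros in every `model_` column therefore stands for the dropped category.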

outlier_removal(columns=[], IQR=False, z_score=False, cooks_d=False, verbose=True)

Removes outliers based on IQR or z_score or Cook’s Distance

columns[list, str]::: The columns from which to remove outliers; if blank, removes from all columns
IQR[bool]::: Whether or not to remove outliers using IQR method
z_score[bool]::: Whether or not to remove outliers using z_score method
cooks_d[bool]::: Whether or not to remove outliers using the cooks_d method
verbose[bool]::: Whether to print how many outliers were found in each column or not

Copy of dataframe with outliers removed

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df = df.outlier_removal('horsepower',
...                          IQR=True)
Found 10 outliers using IQR in horsepower or ~ 2.55%
Removed
>>> # OR
>>> df = df.outlier_removal(['horsepower', 'mpg'],
...                          z_score=True)
Found 10 outliers using z_score in horsepower or ~ 2.55%
Removed
Found 0 outliers using z_score in mpg or ~ 0.0%
Removed

plot_coef(cmap='Greens')

Plots a predefined plot of the model’s coefficients

cmap[str]::: The style.background_gradient color map, defaults to ‘Greens’ see:

https://matplotlib.org/3.3.1/tutorials/colors/colormaps.html

<pandas.io.formats.style.Styler>

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True, verbose=False)
>>> df.drop('car_name', axis=1, inplace=True)
>>> df.plot_coef()

plot_corr(figsize=(25, 25), annot=False, **kwargs)

Plots a predefined correlation heatmap

figsize[tuple]::: The size of the plotted figure
annot[bool]::: Whether or not to annotate the cells
kwargs{dict}::: Arguments that are sent to sns.heatmap see:

https://seaborn.pydata.org/generated/seaborn.heatmap.html

fig, ax

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True, verbose=False)
>>> df.drop('car_name', axis=1, inplace=True)
>>> df.plot_corr(annot=True)

qq_plot(model=None, **kwargs)

Plots a statsmodels QQplot of the dataframe

kwargs{dict}::: Arguments to send to sm.graphics.qqplot() see:

https://www.statsmodels.org/stable/generated/statsmodels.graphics.gofplots.qqplot.html

sm.graphics.qqplot()

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True)
>>> df.lrmodel('mpg', inplace=True)
>>> df.qq_plot()

replace(*args, **kwargs)

Replace values given in to_replace with value.

Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

to_replace : str, regex, list, dict, Series, int, float, or None

How to find the values that will be replaced.

numeric, str or regex:
- numeric: numeric values equal to to_replace will be replaced with value
- str: string exactly matching to_replace will be replaced with value
- regex: regexs matching to_replace will be replaced with value
list of str, regex, or numeric:
- First, if to_replace and value are both lists, they must be the same length.
- Second, if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
- str, regex and numeric rules apply as above.
dict:
- Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be None.
- For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
- For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The value parameter should be None to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
None:
- This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.

See the examples section for examples of each of these.

value : scalar, dict, list, str, regex, default None

Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.

inplace : bool, default False

If True, in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True.

limit : int, default None

Maximum size gap to forward or backward fill.

regex : bool or same types as to_replace, default False

Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.

method : {‘pad’, ‘ffill’, ‘bfill’, None}

The method to use for replacement when to_replace is a scalar, list or tuple and value is None.

Changed in version 0.23.0: Added to DataFrame.

DataFrame: Object after replacement.

AssertionError

If regex is not a bool and to_replace is not None.

TypeError

If to_replace is not a scalar, array-like, dict, or None.
If to_replace is a dict and value is not a list, dict, ndarray, or Series.
If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.
When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced.

ValueError

If a list or an ndarray is passed to to_replace and value but they are not the same length.
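
The length check can be illustrated with a short sketch, assuming a reasonably recent pandas release; the variable names here are illustrative only:

```python
import pandas as pd

s = pd.Series([0, 1, 2])

# Two values to replace but only one replacement: the lists differ in length.
try:
    s.replace([0, 1], [9])
    raised = False
except ValueError:
    raised = True

# Equal-length lists are paired element by element: 0 -> 9 and 1 -> 8.
paired = s.replace([0, 1], [9, 8])
print(raised, paired.tolist())
```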

DataFrame.fillna : Fill NA values.
DataFrame.where : Replace values based on boolean condition.
Series.str.replace : Simple string replacement.

Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
When a dict is used as the to_replace value, the keys of the dict act as the to_replace part and the values of the dict act as the value parameter.
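
As a hedged illustration of the first two notes, the sketch below applies one pattern to a frame holding the same numbers once as floats and once as strings. The column names `num` and `txt` are made up for this example. The replacement string uses the `\1` backreference syntax of re.sub.

```python
import pandas as pd

df = pd.DataFrame({'num': [1.5, 2.5],       # numeric dtype: never matched by a regex
                   'txt': ['1.5', '2.5']})  # strings: eligible for regex matching

# Capture the integer part and rewrite the fractional part, re.sub style.
out = df.replace(regex=r'^(\d+)\.5$', value=r'\1.0')
print(out)
```

Only the string column is expected to change; the float column passes through untouched because regular expressions substitute on strings only.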

Scalar `to_replace` and `value`

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64

>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like `to_replace`

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

>>> s.replace([1, 2], method='bfill')
0    0
1    3
2    3
3    3
4    4
dtype: int64

dict-like `to_replace`

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e

>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e

>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Regular expression `to_replace`

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz

>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz

>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz
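
The parameter description also mentions that regular expressions can be nested inside a per-column dict. A minimal sketch of that form, reusing the frame from the examples above and assuming regex=True is passed alongside the nested dict:

```python
import pandas as pd

df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
                   'B': ['abc', 'bar', 'xyz']})

# Outer key: the column to search (a plain name, not a regex).
# Inner dict: pattern -> replacement, applied to that column only.
result = df.replace({'A': {r'^ba.$': 'new'}}, regex=True)
print(result)
```

Column 'B' also contains a value matching the pattern ('bar'), but it is left alone because only column 'A' is named in the outer dict.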

Note that when replacing multiple bool or datetime64 objects, the data types in the to_replace parameter must match the data type of the value being replaced:

>>> df = pd.DataFrame({'A': [True, False, True],
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises
Traceback (most recent call last):
    ...
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

This raises a TypeError because one of the dict keys is not of the correct type for replacement.

Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])

When one uses a dict as the to_replace value, it is like the value(s) in the dict are equal to the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):

>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object

When value=None and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. So this is why the ‘a’ values are being replaced by 10 in rows 1 and 2 and ‘b’ in row 4 in this case. The command s.replace('a', None) is actually equivalent to s.replace(to_replace='a', value=None, method='pad'):

>>> s.replace('a', None)
0    10
1    10
2    10
3     b
4     b
dtype: object

static replace_all(string, replace_numbers=False): Replaces bad characters in a string so that column names work in an R~Formula
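
The docstring does not list which characters count as bad. The sketch below is a hypothetical stand-in, not the library's implementation: it assumes that any run of characters outside letters, digits, and underscores is collapsed to a single underscore, which is consistent with 'car name' appearing as 'car_name' after clean_col_names in the examples further down. The function name `sanitize_name` is invented for illustration, and the replace_numbers option is not modelled.

```python
import re

def sanitize_name(name):
    """Hypothetical stand-in for replace_all; the real rules may differ."""
    # Collapse every run of non-word characters into one underscore.
    cleaned = re.sub(r'\W+', '_', name.strip())
    # Drop underscores left dangling at either end.
    return cleaned.strip('_')

print(sanitize_name('car name'))
print(sanitize_name('model-year (19xx)'))
```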

scale(columns, inplace=False, verbose=True)

Scales the listed columns of the dataframe

columns[list, str]::: A list of columns to scale
inplace[bool]::: Defines whether to return a new dataframe or mutate the dataframe
verbose[bool]::: Whether to print out the scaled columns or not
Returns:: None if inplace otherwise returns a copy of the dataframe with columns scaled

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.drop(['car name'], axis=1, inplace=True)

>>> df = df.scale(columns=['mpg', 'cylinders'])
Scaling:
   mpg
   cylinders
# OR
>>> df.scale('mpg', inplace=True)
Scaling:
   mpg

train_test_split(target, test_size=100, seed=42, plot=True, verbose=True, inplace=False)

Runs a train test split algorithm on the data

target[str]::: Name of the target column
test_size[int]::: How many times to run the train_test_split
seed[int]::: The random seed to use
plot[bool]::: Whether or not to show the plots
verbose[bool]::: Whether or not to show the model
inplace[bool]::: Defines whether to return a new model or change the current model

model[sm.regression.linear_model.RegressionResultsWrapper]::: The best model of the train_test_split

>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True)
>>> df.drop(['car_name', 'origin'], axis=1, inplace=True)
>>> model = df.train_test_split('mpg',
...                             test_size=5,
...                             verbose=False)
>>> model.pvalues
Intercept      0.005
cylinders      0.503
displacement   0.688
horsepower     0.868
weight         0.000
acceleration   0.510
model_year     0.000
dtype: float64
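
The description above suggests that the split is repeated test_size times and the best resulting model is kept. How MLFrame ranks the runs is not documented, so the sketch below is only one plausible reading: it fits ordinary least squares with NumPy on each random split and keeps the fit with the highest R² on the held-out rows. The function name, the `holdout` fraction, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def best_of_repeated_splits(X, y, n_runs=5, holdout=0.25, seed=42):
    """Hypothetical sketch: keep the OLS fit with the best held-out R^2."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(n * holdout)))
    X1 = np.column_stack([np.ones(n), X])  # prepend an intercept column
    best_score, best_coef = -np.inf, None
    for _ in range(n_runs):
        order = rng.permutation(n)
        test, train = order[:n_test], order[n_test:]
        # Least-squares fit on the training rows only.
        coef, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
        resid = y[test] - X1[test] @ coef
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        score = 1.0 - np.sum(resid ** 2) / ss_tot  # R^2 on held-out rows
        if score > best_score:
            best_score, best_coef = score, coef
    return best_coef, best_score

# Synthetic data: y = 1 + 2*x0 - 3*x1 plus a little noise.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(60, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + data_rng.normal(scale=0.01, size=60)
coef, score = best_of_repeated_splits(X, y)
print(coef, score)
```

Fixing the seed, as the seed parameter above does, makes the sequence of splits reproducible, so the same model is selected on every run.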

wrap__getitem__(df): Wrapper for get item [] so that it returns an MLFrame rather than a pd.DataFrame

wrapper(): Wrapper to return an MLFrame, and set the model when defined pd.DataFrame methods are used on an MLFrame
