MLFrame¶
-
class mlframe.MLFrame(frame, **kwargs)[source]¶
Bases: pandas.core.frame.DataFrame
A pd.DataFrame with an inplace model, and LinearRegression modeling functions.
See pandas.DataFrame documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
Attributes Summary
model
[statsmodels.regression.linear_model.OLS] https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html
Methods Summary
cat_cols(): Computes and returns Categorical columns
clean_col_names([inplace, verbose, …]): Cleans the column names of a DataFrame for use in an R~Formula
copy(*args, **kwargs): Make a copy of this object’s indices and data.
drop(*args, **kwargs): Drop specified labels from rows or columns.
drop_nulls_perc(perc[, inplace, verbose]): Drops a column if the null value is over a certain percentage (0-1)
fill_na_kind([kind, columns, custom, …]): Fills na cells with the selection of its respective column
fillna(*args, **kwargs): Fill NA/NaN values using the specified method.
find_outliers_IQR(col[, verbose]): Finds outliers using the IQR method
find_outliers_Z(col[, verbose]): Finds outliers using the z_score method
find_outliers_cooks_d(target[, threshold, …]): Finds outliers using the Cook’s Distance method
get_cols(name): Returns list of columns with name or names in it
get_nulls([verbose]): Returns sum of all nulls in the dataframe
get_r_squareds([verbose]): Tests models price to each column in the dataframe.
get_vif(target[, verbose]): Computes the Variance Inflation Factor for the columns of a dataframe based on the target column
get_vif_cols(target[, threshold, verbose, …]): Computes Variance Inflation Factor for the dataframe, and gets the columns that are above the defined threshold
info(*args, **kwargs): Print a concise summary of a DataFrame.
log(columns[, inplace, verbose]): logs the listed columns of the dataframe
lrmodel([target, columns, inplace, verbose]): Creates a LinearRegression model of target
model_and_plot(target[, figsize, verbose]): Creates a new model based on target, plots a scatter plot of (target, model residuals), and plots a qqplot based on the model residuals.
model_resid_scatter(target[, ax, title, …]): Plots a scatter plot and axhline based on target and the model’s residuals
ms_matrix(**kwargs): Plots a missingno matrix
num_cols(): Computes and returns Numerical columns
one_hot_encode([columns, drop_first, verbose]): Makes a one hot encoded dataframe
outlier_removal([columns, IQR, z_score, …]): Removes outliers based on IQR or z_score or Cook’s Distance
plot_coef([cmap]): Plots a predefined plot of the model’s coefficients
plot_corr([figsize, annot]): Plots a predefined correlation heatmap
qq_plot([model]): Plots a statsmodels QQplot of the dataframe
replace(*args, **kwargs): Replace values given in to_replace with value.
replace_all(string[, replace_numbers]): Replaces bad characters in a string for column names to work in a R~formula
scale(columns[, inplace, verbose]): Scales the listed columns of the dataframe
train_test_split(target[, test_size, seed, …]): Runs a train test split algorithm on the data
wrap__getitem__(df): Wrapper for get item [] so that it returns an MLFrame rather than a pd.DataFrame
wrapper(): Wrapper to return a MLFrame, and set the model when defined pd.DataFrame methods are used on a MLFrame
Attributes Documentation
-
model
= None¶ [statsmodels.regression.linear_model.OLS] https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html
Methods Documentation
-
clean_col_names
(inplace=False, verbose=True, replace_numbers=False)[source]¶ Cleans the column names of a DataFrame for use in an R~Formula
- inplace[bool]::
- Defines whether to return a new dataframe or mutate the dataframe
- verbose[bool]::
- Whether to show the difference between the old columns and clean columns or not
- replace_numbers[bool]::
- Whether to replace numbers with their English counterpart, e.g. (1 -> one)
None if inplace, otherwise returns a copy of the dataframe
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names()
Columns changed:
model year --> model_year
car name --> car_name
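The renaming shown above can be approximated with plain pandas. The following is a minimal sketch, not the library's implementation: clean_name is a hypothetical helper, and the rule it applies (replace every run of non-word characters with an underscore) is an assumption chosen to reproduce the example output.

```python
import re

import pandas as pd


def clean_name(name: str) -> str:
    """Hypothetical helper: make a column name usable in an R-style formula."""
    # Replace each run of characters that is not a letter, digit, or underscore.
    cleaned = re.sub(r"\W+", "_", name.strip())
    # A formula term should not start with a digit, so prefix such names.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


df = pd.DataFrame(
    {"model year": [70, 71], "car name": ["a", "b"], "mpg": [18.0, 15.0]}
)
renamed = df.rename(columns=clean_name)
print(list(renamed.columns))  # ['model_year', 'car_name', 'mpg']
```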
-
copy
(*args, **kwargs)[source]¶ Make a copy of this object’s indices and data.
When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).
When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).
- deep : bool, default True
- Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.
- copy : Series or DataFrame
- Object type matches caller.
When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).
While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64
>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64
Shallow copy versus default (deep) copy:
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)
Shallow copy shares data and index with original.
>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True
Deep copy has own copy of data and index.
>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False
Updates to the data shared by shallow copy and original is reflected in both; deep copy remains unchanged.
>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64
Note that when copying an object containing Python objects, a deep copy will copy the data, but will not do so recursively. Updating a nested data object will be reflected in the deep copy.
>>> s = pd.Series([[1, 2], [3, 4]])
>>> deep = s.copy()
>>> s[0][0] = 10
>>> s
0    [10, 2]
1     [3, 4]
dtype: object
>>> deep
0    [10, 2]
1     [3, 4]
dtype: object
-
drop
(*args, **kwargs)[source]¶ Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.
- labels : single label or list-like
- Index or column labels to drop.
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
- index : single label or list-like
- Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
- columns : single label or list-like
- Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
- level : int or level name, optional
- For MultiIndex, level from which the labels will be removed.
- inplace : bool, default False
- If False, return a copy. Otherwise, do operation inplace and return None.
- errors : {‘ignore’, ‘raise’}, default ‘raise’
- If ‘ignore’, suppress error and only existing labels are dropped.
- DataFrame
- DataFrame without the removed index or column labels.
- KeyError
- If any of the labels is not found in the selected axis.
DataFrame.loc : Label-location based indexer for selection by label.
DataFrame.dropna : Return DataFrame with labels on given axis omitted where (all or any) data are missing.
DataFrame.drop_duplicates : Return DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.drop : Return Series with specified index labels removed.
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
Drop columns
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11
Drop a row by index
>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11
Drop columns and/or rows of MultiIndex DataFrame
>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2
>>> df.drop(index='cow', columns='small')
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3
>>> df.drop(index='length', level=1)
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8
-
drop_nulls_perc
(perc, inplace=False, verbose=True)[source]¶ Drops a column if the null value is over a certain percentage (0-1)
- perc[float]::
- The fraction of nulls (0-1) above which a column gets dropped
- inplace[bool]::
- Defines whether to return a new dataframe or mutate the dataframe
- verbose[bool]::
- Whether to print out the series or not
None if inplace, otherwise returns copy of dataframe with columns dropped
>>> df = MLFrame(pd.DataFrame(np.arange(12).reshape(3, 4),
...                           columns=['A', 'B', 'C', 'D']))
>>> df['A'].loc[1:3] = np.nan
>>> df['B'].loc[0] = np.nan
>>> df
     A    B   C   D
0  0.0  NaN   2   3
1  NaN  5.0   6   7
2  NaN  9.0  10  11
>>> df.drop_nulls_perc(.4)
     B   C   D
0  NaN   2   3
1  5.0   6   7
2  9.0  10  11
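The same column filter can be written directly in pandas. This is a minimal sketch under the assumption, consistent with the example above, that a column is dropped when its null fraction is strictly greater than perc; drop_null_columns is a hypothetical name, not part of mlframe.

```python
import numpy as np
import pandas as pd


def drop_null_columns(df: pd.DataFrame, perc: float) -> pd.DataFrame:
    """Hypothetical helper: drop columns whose null fraction exceeds perc."""
    null_fraction = df.isna().mean()  # per-column fraction of nulls, 0-1
    return df.loc[:, null_fraction <= perc].copy()


df = pd.DataFrame(
    np.arange(12).reshape(3, 4), columns=["A", "B", "C", "D"]
).astype(float)
df.loc[1:2, "A"] = np.nan  # A is now 2/3 null
df.loc[0, "B"] = np.nan    # B is now 1/3 null

kept = drop_null_columns(df, 0.4)
print(list(kept.columns))  # ['B', 'C', 'D']
```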
-
fill_na_kind
(kind='mean', columns=[], custom=0, inplace=False, verbose=True)[source]¶ Fills na cells with the selection of its respective column
- kind[str]::
- One of ‘mean’ (default), ‘mode’, ‘median’, ‘perc’ (percent value_counts of its respective column), or ‘custom’ (defaults to 0)
- columns[str or list]::
- The column or columns to fill, defaults to all
- custom::
- The value to fill NA cells with when kind=’custom’
- inplace[bool]::
- Defines whether to return a new dataframe or mutate the dataframe.
- verbose[bool]::
- Whether to print out the filling information or not.
None if inplace, otherwise returns copy of dataframe with nulls filled with kind selected
>>> df = MLFrame(pd.DataFrame(np.arange(12).reshape(3, 4),
...                           columns=['A', 'B', 'C', 'D']))
>>> df['A'].loc[1:3] = np.nan
>>> df['B'].loc[0] = np.nan
>>> df
     A    B   C   D
0  0.0  NaN   2   3
1  NaN  5.0   6   7
2  NaN  9.0  10  11
>>> df.fill_na_kind('mean')
Filling 66.67% of A with nan
Filling 33.33% of B with 9.0
     A    B   C   D
0  0.0  5.0   2   3
1  0.0  5.0   6   7
2  0.0  9.0  10  11
>>> df.fill_na_kind('custom', custom=18)
Filling 66.67% of A with 18
Filling 33.33% of B with 18
     A    B   C   D
0  0.0   18   2   3
1   18  5.0   6   7
2   18  9.0  10  11
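For comparison, a per-column fill by statistic can be sketched with pandas alone. This is an illustration of the idea, not mlframe's code: fill_na is a hypothetical function, it covers only the ‘mean’, ‘median’, ‘mode’ and ‘custom’ kinds, and its results follow pandas' own statistics rather than the printed example above.

```python
import numpy as np
import pandas as pd


def fill_na(df: pd.DataFrame, kind: str = "mean", custom=0) -> pd.DataFrame:
    """Hypothetical helper: fill nulls column by column with a chosen statistic."""
    if kind == "mean":
        values = df.mean(numeric_only=True)
    elif kind == "median":
        values = df.median(numeric_only=True)
    elif kind == "mode":
        values = df.mode().iloc[0]  # first mode of each column
    elif kind == "custom":
        values = custom
    else:
        raise ValueError(f"unsupported kind: {kind!r}")
    return df.fillna(values)  # returns a new frame, the input is untouched


df = pd.DataFrame({"A": [0.0, np.nan, np.nan], "B": [np.nan, 5.0, 9.0]})
filled_mean = fill_na(df, "mean")
filled_custom = fill_na(df, "custom", custom=18)
print(filled_mean)
print(filled_custom)
```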
-
fillna
(*args, **kwargs)[source]¶ Fill NA/NaN values using the specified method.
- value : scalar, dict, Series, or DataFrame
- Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
- method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
- Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use next valid observation to fill gap.
- axis : {0 or ‘index’, 1 or ‘columns’}
- Axis along which to fill missing values.
- inplace : bool, default False
- If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
- limit : int, default None
- If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
- downcast : dict, default is None
- A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
- DataFrame or None
- Object with missing values filled or None if inplace=True.
interpolate : Fill NaN values using interpolation.
reindex : Conform object to new index.
asfreq : Convert TimeSeries to specified frequency.
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list('ABCD'))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
Replace all NaN elements with 0s.
>>> df.fillna(0)
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4
We can also propagate non-null values forward or backward.
>>> df.fillna(method='ffill')
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  NaN  1
2  NaN  1.0  NaN  5
3  NaN  3.0  NaN  4
-
find_outliers_IQR
(col, verbose=True)[source]¶ Finds outliers using the IQR method
- col[str]::
- Name of the column to search for outliers in
- verbose[bool]::
- Whether to print out the series or not
True/False Series of the outliers (True is outlier)
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> idx_outliers = df.find_outliers_IQR('horsepower', verbose=True)
Found 10 outliers using IQR in horsepower or ~ 2.55%
>>> df = MLFrame(df[~idx_outliers])
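As a rough illustration of the IQR method, the sketch below flags values outside the usual fences Q1 - 1.5*IQR and Q3 + 1.5*IQR. The multiplier 1.5 is the textbook convention and an assumption here; the page does not state which multiplier mlframe uses, and iqr_outliers is a hypothetical name.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


values = pd.Series([10, 12, 11, 13, 12, 11, 100])
mask = iqr_outliers(values)
print(values[mask])   # the flagged rows
print(values[~mask])  # the rows that would be kept
```

The boolean Series has the same shape as the one described above (True is outlier), so it can be negated and used for row selection in the same way.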
-
find_outliers_Z
(col, verbose=True)[source]¶ Finds outliers using the z_score method
- col[str]::
- Name of the column to search for outliers in
- verbose[bool]::
- Whether to print out the series or not
True/False Series of the outliers (True is outlier)
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> idx_outliers = df.find_outliers_Z('horsepower', verbose=True)
Found 5 outliers using z_score in horsepower or ~ 1.28%
>>> df = MLFrame(df[~idx_outliers])
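A z-score filter can be sketched in a few lines. The cutoff of 3 standard deviations and the use of the population standard deviation (ddof=0) are assumptions made for illustration; the page does not document the values mlframe uses, and z_score_outliers is a hypothetical name.

```python
import pandas as pd


def z_score_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Hypothetical helper: True where |z| exceeds the threshold."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold


values = pd.Series([10] * 10 + [11] * 10 + [100])
mask = z_score_outliers(values)
print(values[mask])
```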
-
find_outliers_cooks_d
(target, threshold=None, verbose=True)[source]¶ Finds outliers using the Cook’s Distance method
- target[str]::
- Name of the target column for your model.
- threshold[int]::
- Threshold at which to drop outliers, defaults to 4/n, n being the length of the data frame.
- verbose[bool]::
- Whether to print out the series or not
True/False Series of the outliers (True is outlier)
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> idx_outliers = df.find_outliers_cooks_d('horsepower', verbose=True)
>>> df = MLFrame(df[~idx_outliers])
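To make the 4/n threshold concrete, here is a NumPy-only sketch of Cook's distance for an ordinary least-squares fit, using the standard formula D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2. It is an illustration under stated assumptions, not mlframe's code: cooks_distance is a hypothetical function and the data is synthetic.

```python
import numpy as np


def cooks_distance(x, y):
    """Hypothetical helper: Cook's distance per row for an OLS fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), x])  # add an intercept column
    n, p = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    # Leverage h_ii: the diagonal of the hat matrix X (X'X)^-1 X'.
    xtx_inv = np.linalg.pinv(design.T @ design)
    leverage = np.einsum("ij,jk,ik->i", design, xtx_inv, design)
    mse = resid @ resid / (n - p)
    return (resid ** 2 / (p * mse)) * leverage / (1.0 - leverage) ** 2


x = np.arange(20.0)
y = 2.0 * x + 1.0
y[19] += 20.0  # perturb one high-leverage point

distances = cooks_distance(x, y)
is_outlier = distances > 4 / len(y)  # the 4/n default described above
print(np.flatnonzero(is_outlier))
```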
-
get_cols
(name)[source]¶ Returns list of columns with name or names in it
- name[str, list]::
- str or list of str for column selection
-
get_nulls
(verbose=True)[source]¶ Returns sum of all nulls in the dataframe
- verbose[bool]::
- Whether to print out the null count of each row or not
>>> df = MLFrame(pd.DataFrame(np.arange(12).reshape(3, 4),
...                           columns=['A', 'B', 'C', 'D']))
>>> df['A'].loc[1:3] = np.nan
>>> df['B'].loc[0] = np.nan
>>> df
     A    B   C   D
0  0.0  NaN   2   3
1  NaN  5.0   6   7
2  NaN  9.0  10  11
>>> df.get_nulls(verbose=False)
3
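The total in the example corresponds to a double sum over the null mask in plain pandas. A minimal sketch, with the frame rebuilt by hand so that it runs without mlframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [0.0, np.nan, np.nan],
        "B": [np.nan, 5.0, 9.0],
        "C": [2, 6, 10],
        "D": [3, 7, 11],
    }
)

per_column = df.isna().sum()    # null count for each column
total = int(per_column.sum())   # null count for the whole frame
print(per_column)
print(total)  # 3
```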
-
get_r_squareds
(verbose=True)[source]¶ Models price against each column in the dataframe and records each r_squared.
- verbose[bool]::
- Whether to print out the series or not
sorted pd.Series of columns –> r_squared
-
get_vif
(target, verbose=True)[source]¶ Computes the Variance Inflation Factor for the columns of a dataframe based on the target column
- target[str]::
- The column name to base the VIF on
- verbose[bool]::
- Whether or not to print out the VIF series
Series of variance_inflation_factor for each column
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.drop(['car name'], axis=1, inplace=True)
>>> df.get_vif('mpg', verbose=False)
const           763.558
cylinders        10.738
displacement     21.837
horsepower        9.944
weight           10.831
acceleration      2.626
model year        1.245
origin            1.772
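For readers unfamiliar with the statistic, the Variance Inflation Factor of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes it with NumPy on synthetic data. It is a textbook illustration, not mlframe's implementation: vif_per_column is a hypothetical function, and it assumes the target column has already been removed from the frame.

```python
import numpy as np
import pandas as pd


def vif_per_column(predictors: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: VIF of each column regressed on the other columns."""
    result = {}
    for col in predictors.columns:
        y = predictors[col].to_numpy(dtype=float)
        others = predictors.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(y)), others])  # with intercept
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        centered = y - y.mean()
        r_squared = 1.0 - (resid @ resid) / (centered @ centered)
        result[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(result)


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "x1": x1,
        "x2": x1 + 0.1 * rng.normal(size=200),  # nearly collinear with x1
        "x3": rng.normal(size=200),             # independent of the others
    }
)
vif_values = vif_per_column(demo)
print(vif_values)
```

The two nearly collinear columns receive large values while the independent column stays close to 1, which is the pattern the method is meant to expose.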
-
get_vif_cols
(target, threshold=6, verbose=True, inplace=False)[source]¶ Computes Variance Inflation Factor for the dataframe, and gets the columns that are above the defined threshold
- target[str]::
- The column name to base the VIF on
- threshold=6[int]::
- The VIF value above which a column is considered an issue and needs to be looked at
- verbose[bool]::
- Whether to print out the series or not
- inplace[bool]::
- Whether to return the series or not
Depending on inplace, a Series of variance_inflation_factor for each column above the threshold
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.drop(['car name'], axis=1, inplace=True)
>>> df.get_vif_cols('mpg', verbose=False)
horsepower       9.944
cylinders       10.738
weight          10.831
displacement    21.837
dtype: float64
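The thresholding step can be reproduced from the VIF values printed in the get_vif example. This sketch assumes, based only on the example output (which omits const and is sorted ascending), that the constant term is excluded and the result is sorted; columns_above is a hypothetical name.

```python
import pandas as pd

# VIF values copied from the get_vif example output on this page.
vif = pd.Series(
    {
        "const": 763.558,
        "cylinders": 10.738,
        "displacement": 21.837,
        "horsepower": 9.944,
        "weight": 10.831,
        "acceleration": 2.626,
        "model year": 1.245,
        "origin": 1.772,
    }
)


def columns_above(vif: pd.Series, threshold: float = 6.0) -> pd.Series:
    """Hypothetical helper: predictors whose VIF exceeds the threshold."""
    predictors = vif.drop("const", errors="ignore")  # assumption: skip intercept
    return predictors[predictors > threshold].sort_values()


flagged = columns_above(vif)
print(flagged)
```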
-
info
(*args, **kwargs)[source]¶ Print a concise summary of a DataFrame.
This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
- data : DataFrame
- DataFrame to print information about.
- verbose : bool, optional
- Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columns is followed.
- buf : writable buffer, defaults to sys.stdout
- Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
- max_cols : int, optional
- When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.
- memory_usage : bool, str, optional
- Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting.
True always show memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based in column dtype and number of rows assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.
- null_counts : bool, optional
- Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.
- None
- This method prints a summary of a DataFrame and returns None.
DataFrame.describe: Generate descriptive statistics of DataFrame columns.
DataFrame.memory_usage: Memory usage of DataFrame columns.
>>> int_values = [1, 2, 3, 4, 5]
>>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
>>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
...                   "float_col": float_values})
>>> df
   int_col text_col  float_col
0        1    alpha       0.00
1        2     beta       0.25
2        3    gamma       0.50
3        4    delta       0.75
4        5  epsilon       1.00
Prints information of all columns:
>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   int_col    5 non-null      int64
 1   text_col   5 non-null      object
 2   float_col  5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
Prints a summary of columns count and its dtypes but not per column information:
>>> df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 3 entries, int_col to float_col
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
Pipe output of DataFrame.info to buffer instead of sys.stdout, get buffer content and writes to a text file:
>>> import io
>>> buffer = io.StringIO()
>>> df.info(buf=buffer)
>>> s = buffer.getvalue()
>>> with open("df_info.txt", "w",
...           encoding="utf-8") as f:  # doctest: +SKIP
...     f.write(s)
260
The memory_usage parameter allows deep introspection mode, specially useful for big DataFrames and fine-tune memory optimization:
>>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6)
>>> df = pd.DataFrame({
...     'column_1': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_2': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_3': np.random.choice(['a', 'b', 'c'], 10 ** 6)
... })
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 22.9+ MB
>>> df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 188.8 MB
-
log
(columns, inplace=False, verbose=True)[source]¶ Takes the logarithm of the listed columns of the dataframe
- columns[list, str]::
- A list of columns to make logarithmic
- inplace[bool]::
- Defines whether to return a new dataframe or mutate the dataframe
- verbose[bool]::
- Whether to print out logged columns or not
None if inplace otherwise returns a copy of the dataframe with columns logged
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.drop(['car name'], axis=1, inplace=True)
>>> df = df.log(columns=['mpg', 'cylinders'])
Logging:
    mpg
    cylinders
>>> # OR
>>> df.log('mpg', inplace=True)
Logging:
    mpg
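A log transform of selected columns can be sketched with NumPy. The natural logarithm is an assumption, since the page does not say which base mlframe uses, and log_columns is a hypothetical function that accepts either a single name or a list, mirroring the two call styles above.

```python
import numpy as np
import pandas as pd


def log_columns(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Hypothetical helper: return a copy with the given columns log-transformed."""
    columns = [columns] if isinstance(columns, str) else list(columns)
    out = df.copy()
    out[columns] = np.log(out[columns])  # assumption: natural logarithm
    return out


df = pd.DataFrame({"mpg": [1.0, np.e], "cylinders": [4.0, 8.0]})
logged = log_columns(df, "mpg")
print(logged)
```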
-
lrmodel
(target=None, columns=[], inplace=False, verbose=True, **kwargs)[source]¶ Creates a LinearRegression model of target
- target[str]::
- The target column to model
- columns[list]::
- A list of columns to build the model on. If empty, uses all columns except target
- inplace[bool]::
- Defines whether to return a new dataframe or mutate the dataframe
- verbose[bool]::
- Whether or not to display the model.summary()
- kwargs{dict}::
- Arguments that are sent to Model.from_formula() see:
https://www.statsmodels.org/stable/generated/statsmodels.formula.api.ols.html
None if inplace, otherwise returns the model
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True)
>>> df.lrmodel('mpg', verbose=False, inplace=True)
>>> df.model.pvalues.max()
0.9996627853521083
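Since the keyword arguments are forwarded to a formula-based model, it may help to see how an R-style formula string can be assembled from a target and a column list. This is a hedged sketch of that one step only: build_formula is a hypothetical helper, and how mlframe actually builds its formula is not shown on this page.

```python
import pandas as pd


def build_formula(df: pd.DataFrame, target: str, columns=()) -> str:
    """Hypothetical helper: 'target ~ a + b', using all other columns if none given."""
    predictors = list(columns) if len(columns) else [
        col for col in df.columns if col != target
    ]
    return f"{target} ~ " + " + ".join(predictors)


df = pd.DataFrame(
    {"mpg": [18.0, 15.0], "cylinders": [8, 8], "weight": [3504, 3693]}
)
formula = build_formula(df, "mpg")
print(formula)  # mpg ~ cylinders + weight

# With statsmodels installed, a string like this could be fitted via
# statsmodels.formula.api.ols(formula, data=df).fit().
```

Column names must already be valid formula terms, which is why the example above calls clean_col_names before lrmodel.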
-
model_and_plot
(target, figsize=(10, 10), verbose=True, **kwargs)[source]¶ Creates a new model based on target, plots a scatter plot of (target, model residuals), and plots a qqplot based on the model residuals.
- target[str]::
- The target column to model
- verbose[bool]::
- Whether or not to display the model.summary()
- kwargs{dict}::
- Arguments that are sent to Model.from_formula() see:
https://www.statsmodels.org/stable/generated/statsmodels.formula.api.ols.html
model
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True)
>>> df.model_and_plot('mpg')
-
model_resid_scatter
(target, ax=None, title='', scatter_kws={}, line_kws={})[source]¶ Plots a scatter plot and axhline based on target and the model’s residuals
- target[str]::
- The target of the model
- title[str]::
- The title of the plot
- ax[matplotlib.axes]::
- The axis to plot onto
- scatter_kws{dict}::
- Arguments to send to the scatter plot see:
https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.scatter.html
- line_kws{dict}::
- Arguments to send to the axhline see:
https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.axhline.html
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True)
>>> df.lrmodel('mpg', inplace=True)
>>> df.model_resid_scatter('mpg')
-
ms_matrix
(**kwargs)[source]¶ Plots a missingno matrix
- kwargs{dict}::
- Arguments to send to ms.matrix
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.ms_matrix()
-
one_hot_encode
(columns=[], drop_first=True, verbose=True, **kwargs)[source]¶ Makes a one hot encoded dataframe
- columns[list]::
- List of columns to one hot encode; uses self.cat_cols() if not defined
- drop_first=True::
- Whether to drop the first column to get rid of multicollinearity
- verbose[bool]::
- Whether to print out the series or not
- kwargs{dict}::
- Arguments to send to pd.get_dummies see:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
encoded dataframe
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(verbose=False, inplace=True)
>>> # splitting car_name into model for categorizing
>>> df['model'] = df['car_name'].apply(
...     lambda x: x.split(' ')[0])
>>> df_ohe = df.one_hot_encode(columns=['model'])
Added categorical columns 37 -> model
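Because the keyword arguments are passed to pd.get_dummies, the effect of drop_first can be seen with pandas alone. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "model": ["ford", "chevy", "ford", "toyota"],
        "mpg": [18.0, 15.0, 16.0, 24.0],
    }
)

# drop_first=True removes the first category ('chevy' in sorted order), so the
# remaining indicator columns are not perfectly collinear with an intercept.
encoded = pd.get_dummies(df, columns=["model"], drop_first=True)
print(list(encoded.columns))  # ['mpg', 'model_ford', 'model_toyota']
```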
-
outlier_removal
(columns=[], IQR=False, z_score=False, cooks_d=False, verbose=True)[source]¶ Removes outliers based on IQR or z_score or Cook’s Distance
- columns[list, str]::
- The columns from which to remove outliers; if blank, removes from all columns
- IQR[bool]::
- Whether or not to remove outliers using IQR method
- z_score[bool]::
- Whether or not to remove outliers using z_score method
- cooks_d[bool]::
- Whether or not to remove outliers using the cooks_d method
- verbose[bool]::
- Whether to print how many outliers were found in each column or not
Copy of dataframe with outliers removed
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df = df.outlier_removal('horsepower',
...                         IQR=True)
Found 10 outliers using IQR in horsepower or ~ 2.55%
Removed
>>> # OR
>>> df = df.outlier_removal(['horsepower', 'mpg'], z_score=True)
Found 10 outliers using z_score in horsepower or ~ 2.55%
Removed
Found 0 outliers using z_score in mpg or ~ 0.0%
Removed
-
plot_coef
(cmap='Greens')[source]¶ Plots a predefined plot of the model’s coefficients
- cmap[str]:: Default is Greens
- The style.background_gradient color see:
https://matplotlib.org/3.3.1/tutorials/colors/colormaps.html
<pandas.io.formats.style.Styler>
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True, verbose=False)
>>> df.drop('car_name', axis=1, inplace=True)
>>> df.plot_coef()
-
plot_corr
(figsize=(25, 25), annot=False, **kwargs)[source]¶ Plots a predefined correlation heatmap
- figsize[tuple]::
- The size of the plotted figure
- annot[bool]::
- Whether or not to annotate the cells
- kwargs{dict}::
- Arguments that are sent to sns.heatmap see:
https://seaborn.pydata.org/generated/seaborn.heatmap.html
fig, ax
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True, verbose=False)
>>> df.drop('car_name', axis=1, inplace=True)
>>> df.plot_corr(annot=True)
-
qq_plot
(model=None, **kwargs)[source]¶ Plots a statsmodels QQplot of the dataframe
- kwargs{dict}::
- Arguments to send to sm.graphics.qqplot() see:
https://www.statsmodels.org/stable/generated/statsmodels.graphics.gofplots.qqplot.html
sm.graphics.qqplot()
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True)
>>> df.lrmodel('mpg', inplace=True)
>>> df.qq_plot()
-
replace
(*args, **kwargs)[source]¶ Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
- to_replace : str, regex, list, dict, Series, int, float, or None
How to find the values that will be replaced.
numeric, str or regex:
- numeric: numeric values equal to to_replace will be replaced with value
- str: string exactly matching to_replace will be replaced with value
- regex: regexs matching to_replace will be replaced with value
list of str, regex, or numeric:
- First, if to_replace and value are both lists, they must be the same length.
- Second, if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
- str, regex and numeric rules apply as above.
dict:
- Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be None.
- For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
- For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The value parameter should be None to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
None:
- This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.
See the examples section for examples of each of these.
- value : scalar, dict, list, str, regex, default None
- Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
- inplace : bool, default False
- If True, in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True.
- limit : int, default None
- Maximum size gap to forward or backward fill.
- regex : bool or same types as to_replace, default False
- Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.
- method : {‘pad’, ‘ffill’, ‘bfill’, None}
- The method to use for replacement, when to_replace is a scalar, list or tuple and value is None.
Changed in version 0.23.0: Added to DataFrame.
- DataFrame
- Object after replacement.
- AssertionError
- If regex is not a bool and to_replace is not None.
- TypeError
- If to_replace is not a scalar, array-like, dict, or None
- If to_replace is a dict and value is not a list, dict, ndarray, or Series
- If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.
- When replacing multiple bool or datetime64 objects and the arguments to to_replace does not match the type of the value being replaced
- ValueError
- If a list or an ndarray is passed to to_replace and value but they are not the same length.
DataFrame.fillna : Fill NA values.
DataFrame.where : Replace values based on boolean condition.
Series.str.replace : Simple string replacement.
- Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
- Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
- This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
- When dict is used as the to_replace value, it is like key(s) in the dict are the to_replace part and value(s) in the dict are the value parameter.
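The first two notes can be illustrated with a small standard-library sketch. It is a simplified model and not the pandas implementation: the helper name `regex_replace` is hypothetical, and it mimics only two of the behaviours described above, namely that substitution follows the `re.sub` rules and that non-string values are never matched.

```python
import re

def regex_replace(values, pattern, replacement):
    """Hypothetical helper: apply re.sub to strings, leave other values alone."""
    compiled = re.compile(pattern)
    result = []
    for value in values:
        if isinstance(value, str):
            # Strings follow the usual re.sub substitution rules.
            result.append(compiled.sub(replacement, value))
        else:
            # Non-string values (e.g. floats) are skipped entirely.
            result.append(value)
    return result

# 'bat' matches ^ba.$; 'foo' and 'bait' do not.
strings = regex_replace(['bat', 'foo', 'bait'], r'^ba.$', 'new')

# The float 1.5 is untouched, while the string '1.5' is replaced.
mixed = regex_replace([1.5, '1.5'], r'^1\.5$', 'x')
```

In this sketch `strings` mirrors column A of the regular expression examples that follow, and `mixed` shows why a numeric column would need to be stored as strings before a regex could match it.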
Scalar `to_replace` and `value`
>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
List-like `to_replace`
>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')
0    0
1    3
2    3
3    3
4    4
dtype: int64
dict-like `to_replace`
>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e
Regular expression `to_replace`
>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz
Note that when replacing multiple bool or datetime64 objects, the data types in the to_replace parameter must match the data type of the value being replaced:
>>> df = pd.DataFrame({'A': [True, False, True],
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises
Traceback (most recent call last):
    ...
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
This raises a TypeError because one of the dict keys is not of the correct type for replacement.
Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])
When one uses a dict as the to_replace value, it is like the value(s) in the dict are equal to the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):
>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object
When value=None and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. So this is why the ‘a’ values are being replaced by 10 in rows 1 and 2 and ‘b’ in row 4 in this case. The command s.replace('a', None) is actually equivalent to s.replace(to_replace='a', value=None, method='pad'):
>>> s.replace('a', None)
0    10
1    10
2    10
3     b
4     b
dtype: object
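The ‘pad’ behaviour described above can be sketched in plain Python. This is an illustrative model only, not the pandas code path: the helper `pad_replace` is a hypothetical name, and it assumes that ‘pad’ means "carry forward the most recent value that was not replaced".

```python
def pad_replace(values, target):
    """Hypothetical helper: replace `target` with the last preceding non-target value."""
    result = []
    last_seen = None  # most recent value that was not a replacement target
    for value in values:
        if value == target and last_seen is not None:
            # Forward-fill: reuse the last value that was kept.
            result.append(last_seen)
        else:
            result.append(value)
            if value != target:
                last_seen = value
    return result

# Same data as the Series in the example above.
padded = pad_replace([10, 'a', 'a', 'b', 'a'], 'a')
```

Here `padded` is `[10, 10, 10, 'b', 'b']`, which matches the documented output: rows 1 and 2 take the value 10 and row 4 takes ‘b’. A target at the very start of the list has nothing to carry forward, so the sketch leaves it unchanged.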
-
static replace_all(string, replace_numbers=False)
Replaces bad characters in a string so that column names work in an R~formula
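The source does not say which characters count as bad, so the following is only a sketch of the general idea under stated assumptions. The helper `sanitize_name` is hypothetical and is not `replace_all`. It assumes that anything other than letters, digits and underscores should become an underscore, and it does not model the `replace_numbers` parameter.

```python
import re

def sanitize_name(name):
    """Hypothetical helper: make a column name usable as a formula term."""
    # Assumption: every run of characters other than letters, digits and
    # underscores is collapsed into a single underscore.
    cleaned = re.sub(r'[^0-9A-Za-z_]+', '_', name.strip())
    # A Python identifier cannot start with a digit, so prefix those names.
    if cleaned and cleaned[0].isdigit():
        cleaned = '_' + cleaned
    return cleaned

names = [sanitize_name(n) for n in ['car name', 'model year', 'mpg']]
```

With these assumptions `names` is `['car_name', 'model_year', 'mpg']`, consistent with the cleaned column names that appear in the `train_test_split` example on this page.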
-
scale(columns, inplace=False, verbose=True)
Scales the listed columns of the dataframe
- columns[list, str]::
- A list of columns to scale
- inplace[bool]::
- Defines whether to return a new dataframe or mutate the dataframe
- verbose[bool]::
- Whether to print out the scaled columns or not
- Returns:
- None if inplace otherwise returns a copy of the dataframe with columns scaled
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.drop(['car name'], axis=1, inplace=True)
>>> df = df.scale(columns=['mpg', 'cylinders'])
Scaling: mpg cylinders
# OR
>>> df.scale('mpg', inplace=True)
Scaling: mpg
-
train_test_split(target, test_size=100, seed=42, plot=True, verbose=True, inplace=False)
Runs a train test split algorithm on the data
- target[str]::
- Name of the column to target
- test_size[int]::
- How many times to run the train_test_split
- seed[int]::
- The random seed to use
- plot[bool]::
- Whether or not to show the plots
- verbose[bool]::
- Whether or not to show the model
- inplace[bool]::
- Defines whether to return a new model or change the current model
- model[sm.regression.linear_model.RegressionResultsWrapper]::
- The best model of the train_test_split
>>> df = MLFrame(pd.read_csv('mlframe/tests/auto-mpg.csv'))
>>> df.clean_col_names(inplace=True)
>>> df.drop(['car_name', 'origin'], axis=1, inplace=True)
>>> model = df.train_test_split('mpg', test_size=5, verbose=False)
>>> model.pvalues
Intercept       0.005
cylinders       0.503
displacement    0.688
horsepower      0.868
weight          0.000
acceleration    0.510
model_year      0.000
dtype: float64
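The parameter descriptions suggest that the split is repeated several times and the best model is kept. The standard-library sketch below illustrates that general pattern only; it is not the mlframe implementation. Everything in it is an assumption made for illustration: the helper names `fit_line` and `best_of_repeated_splits`, the single-predictor least-squares fit, the 25% test fraction, and the choice of lowest test-set mean squared error as the meaning of "best".

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def best_of_repeated_splits(xs, ys, runs=5, seed=42, test_fraction=0.25):
    """Hypothetical helper: repeat a random split, keep the lowest test MSE."""
    rng = random.Random(seed)  # seeded so the splits are reproducible
    indices = list(range(len(xs)))
    best = None
    for _ in range(runs):
        rng.shuffle(indices)
        cut = max(1, int(len(indices) * test_fraction))
        test, train = indices[:cut], indices[cut:]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        # Score the fitted line on the held-out rows only.
        mse = sum((ys[i] - (slope * xs[i] + intercept)) ** 2
                  for i in test) / len(test)
        if best is None or mse < best[0]:
            best = (mse, slope, intercept)
    return best

# Noise-free data on the line y = 2x + 1, so every split recovers it.
xs = [float(x) for x in range(10)]
ys = [2 * x + 1 for x in xs]
best = best_of_repeated_splits(xs, ys)
```

The returned tuple holds the test error, slope and intercept of the winning run. Fixing the seed, as the `seed` parameter above suggests, makes the sequence of splits and therefore the selected model repeatable.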
-