sharkpy

SharkPy - A friendly machine learning framework with shark-themed feedback 🦈

Classes

Shark

A machine learning model manager that simplifies training, prediction, and analysis.

Package Contents

class sharkpy.Shark

A machine learning model manager that simplifies training, prediction, and analysis.

model

The trained machine learning model (e.g., LogisticRegression, RandomForestClassifier).

Type:: object or None

problem_type

Type of ML problem (‘classification’ or ‘regression’).

Type:: str or None

features

Input features used for training.

Type:: pd.DataFrame or None

target

Target variable (encoded for classification, original for regression).

Type:: pd.Series or np.ndarray or None

target_name

Name of the target column in the input data.

Type:: str or None

data

Original input DataFrame, including features and target.

Type:: pd.DataFrame or None

project_name

Name of the current project for tracking and reporting.

Type:: str or None

feature_names

Names of feature columns.

Type:: list of str or None

encoders

Dictionary storing feature encoders (e.g., for categorical features).

Type:: dict

label_encoder

Encoder for categorical target variable (for classification).

Type:: LabelEncoder or None

stats_model

Statistical model for detailed analysis (optional).

Type:: object or None

statistical_summary

Summary of statistical analysis (optional).

Type:: str or None

p_values

P-values from statistical analysis (optional).

Type:: pd.Series or None

conf_intervals

Confidence intervals from statistical analysis (optional).

Type:: pd.DataFrame or None

Examples

>>> from sharkpy import Shark
>>> import pandas as pd
>>> shark = Shark()
>>> data = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv', header=None)
>>> data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
>>> shark.learn(data=data, target='species', model_choice='logistic_regression')
>>> predictions = shark.predict(data)
>>> shark.explain(export_path='explanation.pdf', format='pdf', depth='simple')
>>> cv_results, train_metrics = shark.report(cv_folds=5)

model = None

features = None

target = None

problem_type = None

target_name = None

data = None

label_encoder = None

project_name = None

feature_names = None

encoders

stats_model = None

statistical_summary = None

p_values = None

conf_intervals = None

learn(data: str | pandas.DataFrame, project_name: str = 'your data', target: str | None = None, problem_type: str | None = None, model: object | None = None, model_choice: str | None = None, detailed_stats: bool = False, n_trials: int = 30, verbose: bool = False) → Shark

Train a machine learning model on the provided data.

Parameters:

data (str or pd.DataFrame) – Dataset for training. Can be a file path (CSV) or a pandas DataFrame.
project_name (str, optional) – Name of the project for tracking and reporting (default: “your data”).
target (str, optional) – Name of the target column to predict (default: None).
problem_type (str, optional) – Type of problem: ‘regression’, ‘classification’, or None for auto-detection (default: None).
model (object, optional) – Custom model instance to use (default: None).
model_choice (str, optional) – Built-in model to use (e.g., ‘logistic_regression’, ‘random_forest’, ‘xgboost’) (default: None).
detailed_stats (bool, optional) – Whether to compute detailed statistical analysis (e.g., p-values, confidence intervals) (default: False).
n_trials (int, optional) – Number of optimization trials for boosting models (e.g., XGBoost) (default: 30).
verbose (bool, optional) – Whether to print detailed output during training (default: False).

Returns:

The current Shark instance with trained model and updated attributes.

Return type:

Shark

Notes

Automatically encodes categorical features and target (for classification).
Stores the original DataFrame in self.data and target name in self.target_name.
For classification, stores the LabelEncoder in self.label_encoder to preserve category names.
Performs K-Fold cross-validation and prints mean and standard deviation of scores.
Fits the selected model on the entire dataset after cross-validation.
Warning: Avoid loading untrusted CSV files, as they may contain malicious data.

Examples

>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'a']})
>>> shark.learn(data, target='y', model_choice='logistic_regression')
🦈 Looks like a classification problem (non-numeric target: y)
🦈 Encoding categorical target 'y' to numeric labels
...
>>> shark.target_name
'y'
>>> shark.label_encoder.classes_
array(['a', 'b'], dtype=object)
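
The auto-detection and target encoding described in the Notes can be pictured with a small standard-library sketch. The helper names `detect_problem_type` and `encode_labels` are hypothetical, and the heuristic (any non-numeric target value means classification) is an assumption based on the printed message in the example, not SharkPy's actual implementation. A real detector would probably also treat numeric targets with only a few distinct values as classification.

```python
from numbers import Number


def detect_problem_type(target_values):
    """Guess the problem type from the target values (assumed heuristic)."""
    # bool is a Number subclass in Python, so exclude it explicitly.
    numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in target_values
    )
    return "regression" if numeric else "classification"


def encode_labels(target_values):
    """Map each category to an integer code and keep the sorted class list."""
    classes = sorted(set(target_values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in target_values], classes


y = ["a", "b", "a"]
problem_type = detect_problem_type(y)
encoded, classes = encode_labels(y)
```

Keeping the sorted class list next to the integer codes is what allows predictions to be translated back into the original category names later.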

predict(X=None)

Make predictions using the trained model.

Parameters:

X (dict, pd.DataFrame, list of dict, np.ndarray, or None, optional) – Input samples to predict. If None, predicts on the training data. Accepted forms:
- dict: Single prediction (e.g., {‘feature1’: value1, ‘feature2’: value2}).
- list of dict: Multiple scenarios (e.g., [{‘feature1’: value1}, {‘feature1’: value2}]).
- pd.DataFrame: Multiple samples with feature columns.
- np.ndarray: Raw feature values (must match the training feature count).

Returns:

Predicted values. For classification, returns the original category names if label_encoder is available.

Return type:

float, str, or np.ndarray

Raises:

ValueError – If no model is trained or the input data is invalid.

Examples

>>> shark = Shark()
>>> data = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'y': ['cat', 'dog']})
>>> shark.learn(data, target='y')
>>> shark.predict({'x1': 1, 'x2': 3})
'cat'
>>> shark.predict(data[['x1', 'x2']])
array(['cat', 'dog'], dtype=object)
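
As a rough illustration of how the dict and list-of-dict inputs could be brought into a single row layout, the sketch below reorders each record to match the training feature order. The function name `to_rows` and the error message are hypothetical; SharkPy's own input handling may differ.

```python
def to_rows(X, feature_names):
    """Convert a dict or a list of dicts into rows ordered like feature_names."""
    records = [X] if isinstance(X, dict) else list(X)
    rows = []
    for record in records:
        missing = [name for name in feature_names if name not in record]
        if missing:
            # Assumed behaviour: reject inputs that lack a training feature.
            raise ValueError(f"Missing features: {missing}")
        rows.append([record[name] for name in feature_names])
    return rows


features = ["x1", "x2"]
single = to_rows({"x1": 1, "x2": 3}, features)
batch = to_rows([{"x2": 4, "x1": 2}, {"x1": 5, "x2": 6}], features)
```

Ordering by the stored feature names means the key order inside each dict does not matter.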

predict_baseline() → float | str

Make a baseline prediction using the minimum values of the training features.

Returns:

Baseline prediction for regression (mean) or classification (most frequent class).

Return type:

float or str

Raises:

ValueError – If no model is trained.

Examples

>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
>>> shark.learn(data, target='y')
>>> shark.predict_baseline()
20.0
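
Going only by the Returns description (mean for regression, most frequent class for classification), a baseline of this kind can be sketched with the standard library. The function `baseline_prediction` is a hypothetical stand-in that works on the target values alone; it does not model how SharkPy uses the training features.

```python
from collections import Counter
from statistics import mean


def baseline_prediction(target_values, problem_type):
    """Return the mean target (regression) or the most frequent class."""
    if problem_type == "regression":
        return float(mean(target_values))
    # most_common(1) returns a list with one (value, count) pair.
    return Counter(target_values).most_common(1)[0][0]


reg = baseline_prediction([10, 20, 30], "regression")
clf = baseline_prediction(["cat", "dog", "cat"], "classification")
```

A baseline like this gives a reference point: a trained model that cannot beat it is not learning anything useful from the features.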

plot(kind: str = 'prediction', show: bool = True, save_path: str | None = None, colors: Dict[str, str] | None = None)

Visualize model behavior based on the specified plot type.

Parameters:

kind (str, optional) – Type of plot: ‘prediction’, ‘residuals’, ‘confusion_matrix’, ‘roc’, ‘pr_curve’, ‘proba_hist’, or ‘feature_importance’ (default: ‘prediction’).
show (bool, optional) – Whether to display the plot (default: True).
save_path (str, optional) – Path to save the plot (default: None).
colors (dict, optional) – Custom color specifications for the plot. If None, uses default SharkPy colors. Available keys: ‘primary’, ‘secondary’, ‘accent’, ‘background’, ‘grid’, ‘text’, ‘bars’.

Return type:

None

Raises:

ValueError – If no model is trained or the plot type is invalid.

Examples

>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 1, 0]})
>>> shark.learn(data, target='y')
>>> shark.plot(kind='confusion_matrix')

>>> # Custom colors example
>>> custom_colors = {
...     'primary': '#FF6B6B',    # Coral red
...     'secondary': '#4ECDC4',  # Turquoise
...     'background': '#F7FFF7'  # Light green
... }
>>> shark.plot(kind='feature_importance', colors=custom_colors)
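
One plausible way a partial colors dict could be combined with defaults is a simple dictionary merge, sketched below. The name `resolve_colors`, the placeholder default values, and the rejection of unknown keys are all assumptions for illustration; only the seven key names come from the parameter description above.

```python
ALLOWED_KEYS = {"primary", "secondary", "accent", "background", "grid", "text", "bars"}

# Placeholder defaults for illustration only; the real SharkPy palette is not shown here.
DEFAULT_COLORS = {key: "#000000" for key in ALLOWED_KEYS}


def resolve_colors(colors=None):
    """Overlay user-supplied colors on the defaults."""
    colors = colors or {}
    unknown = set(colors) - ALLOWED_KEYS
    if unknown:
        # Assumed behaviour: fail loudly on a misspelled key.
        raise ValueError(f"Unknown color keys: {sorted(unknown)}")
    # Later entries win, so custom values replace the defaults.
    return {**DEFAULT_COLORS, **colors}


palette = resolve_colors({"primary": "#FF6B6B", "background": "#F7FFF7"})
```

With a merge like this, callers only need to list the keys they want to change.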

report(cv_folds: int = 5, export_path: str | None = None, format: str = 'txt') → tuple

Generate a comprehensive performance report with cross-validation metrics.

Parameters:

cv_folds (int, optional) – Number of cross-validation folds (default: 5).
export_path (str, optional) – Path to export the report (txt, docx, or pdf) (default: None).
format (str, optional) – Export format: ‘txt’, ‘docx’, or ‘pdf’ (default: ‘txt’).

Returns:

(cv_results, train_metrics), where cv_results is a dict of cross-validation metrics and train_metrics is a dict of training metrics.

Return type:

tuple

Raises:

ValueError – If no model is trained or the format is invalid.

Examples

>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 1, 0]})
>>> shark.learn(data, target='y')
>>> cv_results, train_metrics = shark.report(cv_folds=5)
>>> print(cv_results['test_accuracy'].mean())
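
To make the cv_folds parameter concrete, here is a minimal standard-library sketch of K-fold splitting and of summarizing per-fold scores by mean and standard deviation. The helpers `k_fold_indices` and `summarize` are hypothetical, the folds are contiguous and unshuffled for simplicity, and the scores are made-up numbers; SharkPy's own cross-validation may shuffle or stratify.

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, test_idx
        start += size


def summarize(scores):
    """Return the mean and population standard deviation of fold scores."""
    return mean(scores), pstdev(scores)


folds = list(k_fold_indices(10, 5))
avg, spread = summarize([0.8, 0.9, 1.0])  # illustrative scores, not real results
```

Every sample lands in exactly one test fold, so each row is used for evaluation once and for training in the remaining folds.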

explain(cv_results=None, train_metrics=None, export_path: str | None = None, format: str = 'txt', depth: str = 'deep', verbose: int = 1) → pandas.DataFrame | None

Explain the model’s behavior and performance with customizable depth and export options.

Parameters:

cv_results (dict, optional) – Cross-validation results from report(), containing metrics like test_r2 or test_accuracy.
train_metrics (dict, optional) – Training metrics from report(), containing metrics like r2 or accuracy.
export_path (str, optional) – Path to export the explanation (txt, docx, or pdf) (default: None).
format (str, optional) – Export format: ‘txt’, ‘docx’, or ‘pdf’ (default: ‘txt’).
depth (str, optional) – Explanation depth: ‘simple’ (beginner overview), ‘mechanics’ (technical details), ‘interpretation’ (performance analysis), ‘actionable’ (recommendations), ‘deep’ (all levels, default), or ‘shapash’ (interactive SHAP dashboard).
verbose (int, optional) – Verbosity level (default: 1).

Returns:

Feature importance DataFrame if available, else None.

Return type:

pd.DataFrame or None

Notes

Requires a trained model (call learn first).
For classification, uses label_encoder to display original category names (e.g., ‘Iris-setosa’ instead of 0).
If export_path is provided, saves the explanation in the specified format.
‘shapash’ depth requires the shapash package to be installed.

Examples

>>> shark = Shark()
>>> data = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'y': ['cat', 'dog']})
>>> shark.learn(data, target='y')
>>> shark.explain(depth='simple', export_path='explanation.txt')
🦈 Sharky is diving into the LogisticRegression model explanation...
...
>>> # explanation.txt contains: "This model predicts one of 2 categories (cat, dog)..."

save_model(name: str = 'shark_model', directory: str = 'models') → str

Save the trained model to a .joblib file.

Parameters:

name (str, optional) – Filename without extension (default: “shark_model”).
directory (str, optional) – Folder where the model will be saved (default: “models”).

Returns:

Path to the saved model file.

Return type:

str

Raises:

ValueError – If no model is trained.
OSError – If directory creation or file writing fails.

Examples

>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
>>> shark.learn(data, target='y')
>>> shark.save_model(name='my_model')
'models/my_model.joblib'
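
The path handling described above (create the directory if needed, append the .joblib extension, return the path) can be sketched as follows. This is a standalone function, not the SharkPy method, and it uses the standard-library pickle module as a stand-in for joblib so that the sketch has no third-party dependency.

```python
import os
import pickle
import tempfile


def save_model(model, name="shark_model", directory="models"):
    """Write the model to <directory>/<name>.joblib and return the path."""
    if model is None:
        raise ValueError("No trained model to save")
    os.makedirs(directory, exist_ok=True)  # raises OSError if creation fails
    path = os.path.join(directory, f"{name}.joblib")
    with open(path, "wb") as handle:
        # pickle stands in for joblib.dump here; the extension is kept for parity.
        pickle.dump(model, handle)
    return path


# A plain dict plays the role of a trained model in this demo.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model({"coef": 2.0}, name="my_model",
                            directory=os.path.join(tmp, "models"))
    saved_exists = os.path.exists(saved_path)
```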

load_model(model_path: str) → object

Load a saved SharkPy model from a .joblib file.

Parameters:

model_path (str) – Path to the saved .joblib model file.

Returns:

The loaded model object.

Return type:

object

Raises:

FileNotFoundError – If the model file does not exist.
ValueError – If the file is not a valid model.

Examples

>>> shark = Shark()
>>> shark.load_model('models/my_model.joblib')
LinearRegression()
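
The two documented error cases can be illustrated with a small loader that maps a missing file to FileNotFoundError and an unreadable file to ValueError. As an assumption for this sketch, pickle replaces joblib and the function is standalone rather than a Shark method. Both pickle and joblib can execute arbitrary code while loading, so only load files from sources you trust.

```python
import os
import pickle
import tempfile


def load_model(model_path):
    """Load a serialized model, translating low-level errors."""
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"Model file not found: {model_path}")
    try:
        with open(model_path, "rb") as handle:
            return pickle.load(handle)
    except (pickle.UnpicklingError, EOFError) as exc:
        raise ValueError(f"Not a valid model file: {model_path}") from exc


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.joblib")
    with open(good, "wb") as handle:
        pickle.dump({"coef": 2.0}, handle)
    loaded = load_model(good)

    bad = os.path.join(tmp, "bad.joblib")
    with open(bad, "wb") as handle:
        handle.write(b"")  # an empty file cannot be unpickled
    try:
        load_model(bad)
        bad_error = None
    except ValueError as exc:
        bad_error = exc
```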

battle(data: pandas.DataFrame, target: str, models: List[str] = ['linear_regression', 'random_forest', 'xgboost'], metric: str = 'r2', n_trials: int = 30, early_stopping: bool = False, min_score: float = 0.5, verbose: int = 0) → Dict

Compare multiple models and select the best performer.

Parameters:

data (pd.DataFrame) – Input data for training.
target (str) – Name of the target column.
models (list of str, optional) – List of model names to compare (e.g., [‘linear_regression’, ‘random_forest’]) (default: [‘linear_regression’, ‘random_forest’, ‘xgboost’]).
metric (str, optional) – Metric to compare models (e.g., ‘r2’, ‘accuracy’) (default: ‘r2’).
n_trials (int, optional) – Number of optimization trials for boosting models (default: 30).
early_stopping (bool, optional) – If True, stops training if any model exceeds min_score. Not recommended as it may miss better models later (default: False).
min_score (float, optional) – Minimum score to trigger early stopping (default: 0.5).
verbose (int, optional) – Verbosity level for model training (default: 0).

Returns:

Dictionary containing champion model name, model object, score, all results, details, and comparison plot.

Return type:

dict

Examples

>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
>>> result = shark.battle(data, target='y', models=['linear_regression', 'random_forest'])
>>> print(result['champion'])
linear_regression
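
The selection logic, including why early_stopping can miss a better model, can be sketched without any ML library. In this hypothetical version each candidate is a zero-argument function returning a score, the scores are invented stand-ins for real metrics, and the result keys other than ‘champion’ are assumptions rather than SharkPy's documented structure.

```python
def battle(candidates, early_stopping=False, min_score=0.5):
    """Score each candidate in order and return the champion (higher wins)."""
    results = {}
    for name, score_fn in candidates.items():
        results[name] = score_fn()
        if early_stopping and results[name] > min_score:
            break  # stop at the first model that exceeds the threshold
    champion = max(results, key=results.get)
    return {"champion": champion, "score": results[champion], "all_results": results}


# Hypothetical scores standing in for real cross-validated metrics.
candidates = {
    "linear_regression": lambda: 0.62,
    "random_forest": lambda: 0.81,
    "xgboost": lambda: 0.78,
}
full = battle(candidates)
stopped = battle(candidates, early_stopping=True, min_score=0.5)
```

In the full run the highest-scoring candidate wins, while the early-stopped run returns the first candidate that clears min_score and never evaluates the others.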

explain_with_shapash(title_story: str | None = None, display: bool = True)

Create an interactive Shapash dashboard for model interpretation.

Parameters:

title_story (str, optional) – Title for the Shapash dashboard (default: None).
display (bool, optional) – Whether to display the dashboard (default: True).

Return type:

None

Raises:

ImportError – If the shapash package is not installed.
ValueError – If no model is trained.

Examples

>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 1, 0]})
>>> shark.learn(data, target='y')
>>> shark.explain_with_shapash(title_story='My Model Analysis')
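
An ImportError for a missing optional package is commonly produced by checking for the package before importing it. The sketch below shows that general pattern with the standard library; the helper `require_optional` and its message are hypothetical and not taken from SharkPy.

```python
import importlib
import importlib.util


def require_optional(package, feature):
    """Import an optional dependency or raise a helpful ImportError."""
    if importlib.util.find_spec(package) is None:
        raise ImportError(
            f"{feature} requires the '{package}' package. "
            f"Install it with: pip install {package}"
        )
    return importlib.import_module(package)


# Demonstrated with a standard-library module, which is always present.
json_module = require_optional("json", "JSON export")
```

Calling `require_optional("shapash", "The Shapash dashboard")` in an environment without shapash would raise the ImportError with the install hint.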

available_models() → Dict

List all available models with their details and print a comparison table.

Returns:

Dictionary of available models and their details.

Return type:

dict

Examples

>>> shark = Shark()
>>> models = shark.available_models()
🦈 Available Models in SharkPy 🦈
...
>>> print(models.keys())
dict_keys(['linear_regression', 'random_forest', 'xgboost', ...])