sharkpy
SharkPy - A friendly machine learning framework with shark-themed feedback 🦈
Submodules
Classes
Shark: A machine learning model manager that simplifies training, prediction, and analysis.
Package Contents
- class sharkpy.Shark
A machine learning model manager that simplifies training, prediction, and analysis.
- model
The trained machine learning model (e.g., LogisticRegression, RandomForestClassifier).
- Type:
object or None
- problem_type
Type of ML problem (‘classification’ or ‘regression’).
- Type:
str or None
- features
Input features used for training.
- Type:
pd.DataFrame or None
- target
Target variable (encoded for classification, original for regression).
- Type:
pd.Series or np.ndarray or None
- target_name
Name of the target column in the input data.
- Type:
str or None
- data
Original input DataFrame, including features and target.
- Type:
pd.DataFrame or None
- project_name
Name of the current project for tracking and reporting.
- Type:
str or None
- feature_names
Names of feature columns.
- Type:
list of str or None
- encoders
Dictionary storing feature encoders (e.g., for categorical features).
- Type:
dict
- label_encoder
Encoder for categorical target variable (for classification).
- Type:
LabelEncoder or None
- stats_model
Statistical model for detailed analysis (optional).
- Type:
object or None
- statistical_summary
Summary of statistical analysis (optional).
- Type:
str or None
- p_values
P-values from statistical analysis (optional).
- Type:
pd.Series or None
- conf_intervals
Confidence intervals from statistical analysis (optional).
- Type:
pd.DataFrame or None
Examples
>>> from sharkpy import Shark
>>> import pandas as pd
>>> shark = Shark()
>>> data = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv', header=None)
>>> data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
>>> shark.learn(data=data, target='species', model_choice='logistic_regression')
>>> predictions = shark.predict(data)
>>> shark.explain(export_path='explanation.pdf', format='pdf', depth='simple')
>>> cv_results, train_metrics = shark.report(cv_folds=5)
- model = None
- features = None
- target = None
- problem_type = None
- target_name = None
- data = None
- label_encoder = None
- project_name = None
- feature_names = None
- encoders
- stats_model = None
- statistical_summary = None
- p_values = None
- conf_intervals = None
- learn(data: str | pandas.DataFrame, project_name: str = 'your data', target: str | None = None, problem_type: str | None = None, model: object | None = None, model_choice: str | None = None, detailed_stats: bool = False, n_trials: int = 30, verbose: bool = False) → Shark
Train a machine learning model on the provided data.
- Parameters:
data (str or pd.DataFrame) – Dataset for training. Can be a file path (CSV) or a pandas DataFrame.
project_name (str, optional) – Name of the project for tracking and reporting (default: “your data”).
target (str, optional) – Name of the target column to predict (default: None).
problem_type (str, optional) – Type of problem: ‘regression’, ‘classification’, or None for auto-detection (default: None).
model (object, optional) – Custom model instance to use (default: None).
model_choice (str, optional) – Built-in model to use (e.g., ‘logistic_regression’, ‘random_forest’, ‘xgboost’) (default: None).
detailed_stats (bool, optional) – Whether to compute detailed statistical analysis (e.g., p-values, confidence intervals) (default: False).
n_trials (int, optional) – Number of optimization trials for boosting models (e.g., XGBoost) (default: 30).
verbose (bool, optional) – Whether to print detailed output during training (default: False).
- Returns:
The current Shark instance with trained model and updated attributes.
- Return type:
Shark
Notes
Automatically encodes categorical features and target (for classification).
Stores the original DataFrame in self.data and target name in self.target_name.
For classification, stores the LabelEncoder in self.label_encoder to preserve category names.
Performs K-Fold cross-validation and prints mean and standard deviation of scores.
Fits the selected model on the entire dataset after cross-validation.
Warning: Avoid loading untrusted CSV files, as they may contain malicious data.
Examples
>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'a']})
>>> shark.learn(data, target='y', model_choice='logistic_regression')
🦈 Looks like a classification problem (non-numeric target: y)
🦈 Encoding categorical target 'y' to numeric labels
...
>>> shark.target_name
'y'
>>> shark.label_encoder.classes_
array(['a', 'b'], dtype=object)
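The cross-validate-then-refit flow described in the Notes can be pictured with a minimal, standard-library sketch. It is not SharkPy's implementation: `kfold_indices`, `cross_validate_then_fit`, the toy "predict the training mean" model, and the negative-MAE score are all hypothetical choices made only for illustration.

```python
from statistics import mean, pstdev


def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_validate_then_fit(y, n_folds=3):
    """Score a toy mean predictor on each fold, then refit on all data."""
    scores = []
    for train_idx, test_idx in kfold_indices(len(y), n_folds):
        fold_mean = mean(y[i] for i in train_idx)            # "fit" on the training folds
        mae = mean(abs(y[i] - fold_mean) for i in test_idx)  # evaluate on the held-out fold
        scores.append(-mae)                                  # negate so higher is better
    final_model = mean(y)  # the final fit uses every sample
    return scores, final_model


scores, model = cross_validate_then_fit([10, 20, 30, 40, 50, 60], n_folds=3)
print(f"CV score: {mean(scores):.2f} +/- {pstdev(scores):.2f}")
print(f"Final model (fit on all samples) predicts {model}")
```

The point of the sketch is the ordering: the per-fold scores estimate generalization, while the model that is kept afterwards has seen the full dataset.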
- predict(X: Dict | pandas.DataFrame | List[Dict] | numpy.ndarray | None = None) → float | str | numpy.ndarray
Make predictions using the trained model.
- Parameters:
X (dict, pd.DataFrame, list of dict, np.ndarray, or None, optional) – Input samples to predict. If None, predicts on training data. Options:
- dict: Single prediction (e.g., {‘feature1’: value1, ‘feature2’: value2}).
- list of dict: Multiple scenarios (e.g., [{‘feature1’: value1}, {‘feature1’: value2}]).
- pd.DataFrame: Multiple samples with feature columns.
- np.ndarray: Raw feature values (must match training feature count).
- Returns:
Predicted values. For classification, returns original category names if label_encoder is available.
- Return type:
float, str, or np.ndarray
- Raises:
ValueError – If no model is trained or input data is invalid.
Examples
>>> shark = Shark()
>>> data = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'y': ['cat', 'dog']})
>>> shark.learn(data, target='y')
>>> shark.predict({'x1': 1, 'x2': 3})
'cat'
>>> shark.predict(data[['x1', 'x2']])
array(['cat', 'dog'], dtype=object)
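As a rough illustration of the input shapes and the label decoding described above, the following standard-library sketch normalizes dict and list-of-dict inputs into ordered feature rows and maps encoded predictions back to category names. The helpers `to_rows` and `decode_labels` are hypothetical and are not part of SharkPy; the real method may validate inputs differently.

```python
def to_rows(X, feature_names):
    """Normalize a dict, list of dicts, or list of rows into a list of feature rows."""
    if isinstance(X, dict):
        X = [X]  # a single sample becomes a one-row batch
    rows = []
    for sample in X:
        if isinstance(sample, dict):
            missing = [name for name in feature_names if name not in sample]
            if missing:
                raise ValueError(f"Missing features: {missing}")
            # Order values by the training feature order, not the dict order.
            rows.append([sample[name] for name in feature_names])
        else:
            if len(sample) != len(feature_names):
                raise ValueError("Row length must match the training feature count")
            rows.append(list(sample))
    return rows


def decode_labels(encoded, classes):
    """Map integer predictions back to the original category names."""
    return [classes[i] for i in encoded]


feature_names = ["x1", "x2"]
# Sorted unique labels, which is how scikit-learn's LabelEncoder stores classes_.
classes = sorted({"dog", "cat"})

single = to_rows({"x2": 3, "x1": 1}, feature_names)
batch = to_rows([{"x1": 1, "x2": 3}, {"x1": 2, "x2": 4}], feature_names)
print(single, batch, decode_labels([0, 1], classes))
```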
- predict_baseline() → float | str
Make a baseline prediction using the minimum values of the training features.
- Returns:
Baseline prediction for regression (mean) or classification (most frequent class).
- Return type:
float or str
- Raises:
ValueError – If no model is trained.
Examples
>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
>>> shark.learn(data, target='y')
>>> shark.predict_baseline()
20.0
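The Returns description (mean for regression, most frequent class for classification) corresponds to the usual notion of a naive baseline. A minimal sketch of that idea, using a hypothetical `baseline_prediction` helper and making no claim about how SharkPy computes its value internally:

```python
from collections import Counter
from statistics import mean


def baseline_prediction(target, problem_type):
    """Return the target mean (regression) or the most frequent class (classification)."""
    if problem_type == "regression":
        return float(mean(target))
    if problem_type == "classification":
        return Counter(target).most_common(1)[0][0]
    raise ValueError(f"Unknown problem type: {problem_type!r}")


regression_baseline = baseline_prediction([10, 20, 30], "regression")
classification_baseline = baseline_prediction(["a", "b", "a"], "classification")
print(regression_baseline, classification_baseline)
```

A trained model is only useful if it beats a baseline of this kind.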
- plot(kind: str = 'prediction', show: bool = True, save_path: str | None = None, colors: Dict[str, str] | None = None)
Visualize model behavior based on the specified plot type.
- Parameters:
kind (str, optional) – Type of plot: ‘prediction’, ‘residuals’, ‘confusion_matrix’, ‘roc’, ‘pr_curve’, ‘proba_hist’, or ‘feature_importance’ (default: ‘prediction’).
show (bool, optional) – Whether to display the plot (default: True).
save_path (str, optional) – Path to save the plot (default: None).
colors (dict, optional) – Custom color specifications for the plot. If None, uses default SharkPy colors. Available keys: ‘primary’, ‘secondary’, ‘accent’, ‘background’, ‘grid’, ‘text’, ‘bars’ (default: None).
- Return type:
None
- Raises:
ValueError – If no model is trained or the plot type is invalid.
Examples
>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 1, 0]})
>>> shark.learn(data, target='y')
>>> shark.plot(kind='confusion_matrix')
>>> # Custom colors example
>>> custom_colors = {
...     'primary': '#FF6B6B',     # Coral red
...     'secondary': '#4ECDC4',   # Turquoise
...     'background': '#F7FFF7'   # Light green
... }
>>> shark.plot(kind='feature_importance', colors=custom_colors)
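A partial `colors` dict, as in the example above, implies that unspecified keys fall back to defaults. One plausible way to implement that overlay is sketched here. `resolve_colors` is a hypothetical helper, the default values are placeholders rather than SharkPy's real palette, and rejecting unknown keys is an assumption rather than documented behavior.

```python
ALLOWED_KEYS = {"primary", "secondary", "accent", "background", "grid", "text", "bars"}

# Placeholder defaults for illustration only; the real default palette is not documented here.
DEFAULT_COLORS = {key: "#000000" for key in ALLOWED_KEYS}


def resolve_colors(colors=None):
    """Overlay user-supplied colors on the defaults, rejecting unknown keys."""
    if colors is None:
        return dict(DEFAULT_COLORS)
    unknown = set(colors) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"Unknown color keys: {sorted(unknown)}")
    return {**DEFAULT_COLORS, **colors}  # user values win over defaults


palette = resolve_colors({"primary": "#FF6B6B", "secondary": "#4ECDC4"})
print(palette["primary"], palette["grid"])
```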
- report(cv_folds: int = 5, export_path: str | None = None, format: str = 'txt') → tuple
Generate a comprehensive performance report with cross-validation metrics.
- Parameters:
cv_folds (int, optional) – Number of cross-validation folds (default: 5).
export_path (str, optional) – Path to export the report (txt, docx, or pdf) (default: None).
format (str, optional) – Export format: ‘txt’, ‘docx’, or ‘pdf’ (default: ‘txt’).
- Returns:
(cv_results, train_metrics), where cv_results is a dict of cross-validation metrics and train_metrics is a dict of training metrics.
- Return type:
tuple
- Raises:
ValueError – If no model is trained or the format is invalid.
Examples
>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 1, 0]})
>>> shark.learn(data, target='y')
>>> cv_results, train_metrics = shark.report(cv_folds=5)
>>> print(cv_results['test_accuracy'].mean())
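A dict of per-fold metrics such as `cv_results` is usually condensed to a mean and a spread per metric. The sketch below shows one way to do that with the standard library; `summarize_cv` is a hypothetical helper, and the scores are made-up values, not output from a real run. It uses plain lists, whereas the example above suggests the real values are array-like.

```python
from statistics import mean, stdev


def summarize_cv(cv_results):
    """Return {metric: (mean, sample standard deviation)} for each list of fold scores."""
    return {
        metric: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for metric, scores in cv_results.items()
    }


# Hypothetical fold scores for a 5-fold run.
cv_results = {"test_accuracy": [0.90, 0.80, 1.00, 0.90, 0.90]}
summary = summarize_cv(cv_results)
for metric, (avg, spread) in summary.items():
    print(f"{metric}: {avg:.3f} +/- {spread:.3f}")
```

A large spread relative to the mean suggests the score depends heavily on which samples land in each fold.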
- explain(cv_results=None, train_metrics=None, export_path: str | None = None, format: str = 'txt', depth: str = 'deep', verbose: int = 1) → pandas.DataFrame | None
Explain the model’s behavior and performance with customizable depth and export options.
- Parameters:
cv_results (dict, optional) – Cross-validation results from report(), containing metrics like test_r2 or test_accuracy.
train_metrics (dict, optional) – Training metrics from report(), containing metrics like r2 or accuracy.
export_path (str, optional) – Path to export the explanation (txt, docx, or pdf) (default: None).
format (str, optional) – Export format: ‘txt’, ‘docx’, or ‘pdf’ (default: ‘txt’).
depth (str, optional) – Explanation depth: ‘simple’ (beginner overview), ‘mechanics’ (technical details), ‘interpretation’ (performance analysis), ‘actionable’ (recommendations), ‘deep’ (all levels, default), or ‘shapash’ (interactive SHAP dashboard).
verbose (int, optional) – Verbosity level of the printed output (default: 1).
- Returns:
Feature importance DataFrame if available, else None.
- Return type:
pd.DataFrame or None
Notes
Requires a trained model (call learn first).
For classification, uses label_encoder to display original category names (e.g., ‘Iris-setosa’ instead of 0).
If export_path is provided, saves the explanation in the specified format.
‘shapash’ depth requires the shapash package to be installed.
Examples
>>> shark = Shark()
>>> data = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'y': ['cat', 'dog']})
>>> shark.learn(data, target='y')
>>> shark.explain(depth='simple', export_path='explanation.txt')
🦈 Sharky is diving into the LogisticRegression model explanation...
...
>>> # explanation.txt contains: "This model predicts one of 2 categories (cat, dog)..."
- save_model(name: str = 'shark_model', directory: str = 'models') → str
Save the trained model to a .joblib file.
- Parameters:
name (str, optional) – Filename without extension (default: “shark_model”).
directory (str, optional) – Folder where the model will be saved (default: “models”).
- Returns:
Path to the saved model file.
- Return type:
str
- Raises:
ValueError – If no model is trained.
OSError – If directory creation or file writing fails.
Examples
>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
>>> shark.learn(data, target='y')
>>> shark.save_model(name='my_model')
'models/my_model.joblib'
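The returned path in the example follows from the `name` and `directory` parameters. A minimal sketch of how such a path could be built, with the directory created on demand, is shown below. `model_path` is a hypothetical helper; the actual serialization step is omitted, and a temporary directory stands in for the working directory.

```python
import os
import tempfile


def model_path(name="shark_model", directory="models"):
    """Create the target directory if needed and return the .joblib file path."""
    os.makedirs(directory, exist_ok=True)  # raises OSError if the directory cannot be created
    return os.path.join(directory, f"{name}.joblib")


with tempfile.TemporaryDirectory() as tmp:
    path = model_path(name="my_model", directory=os.path.join(tmp, "models"))
    directory_created = os.path.isdir(os.path.dirname(path))
    print(os.path.basename(path))
```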
- load_model(model_path: str) → object
Load a saved SharkPy model from a .joblib file.
- Parameters:
model_path (str) – Path to the saved .joblib model file.
- Returns:
The loaded model object.
- Return type:
object
- Raises:
FileNotFoundError – If the model file does not exist.
ValueError – If the file is not a valid model.
Examples
>>> shark = Shark()
>>> shark.load_model('models/my_model.joblib')
<sklearn.linear_model.LinearRegression object at ...>
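The two documented exceptions suggest that the path is checked before the file is deserialized. A sketch of such pre-load checks follows. `check_model_file` is a hypothetical helper, and treating a wrong file extension as a ValueError is an assumption; SharkPy may define "not a valid model" differently.

```python
import os
import tempfile


def check_model_file(model_path):
    """Validate a model path before handing it to a loader such as joblib.load."""
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"Model file not found: {model_path}")
    if not model_path.endswith(".joblib"):
        raise ValueError(f"Expected a .joblib file, got: {model_path}")
    return model_path


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "my_model.joblib")
    open(good, "wb").close()  # empty placeholder file, only to exercise the checks
    checked = check_model_file(good)
    try:
        check_model_file(os.path.join(tmp, "missing.joblib"))
        missing_error = None
    except FileNotFoundError as exc:
        missing_error = exc
    print(os.path.basename(checked), type(missing_error).__name__)
```

Because .joblib files are pickle-based, loading one can execute arbitrary code, so only load model files from sources you trust.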
- battle(data: pandas.DataFrame, target: str, models: List[str] = ['linear_regression', 'random_forest', 'xgboost'], metric: str = 'r2', n_trials: int = 30, early_stopping: bool = False, min_score: float = 0.5, verbose: int = 0) → Dict
Compare multiple models and select the best performer.
- Parameters:
data (pd.DataFrame) – Input data for training.
target (str) – Name of the target column.
models (list of str, optional) – List of model names to compare (e.g., [‘linear_regression’, ‘random_forest’]) (default: [‘linear_regression’, ‘random_forest’, ‘xgboost’]).
metric (str, optional) – Metric to compare models (e.g., ‘r2’, ‘accuracy’) (default: ‘r2’).
n_trials (int, optional) – Number of optimization trials for boosting models (default: 30).
early_stopping (bool, optional) – If True, stops training if any model exceeds min_score. Not recommended as it may miss better models later (default: False).
min_score (float, optional) – Minimum score to trigger early stopping (default: 0.5).
verbose (int, optional) – Verbosity level for model training (default: 0).
- Returns:
Dictionary containing champion model name, model object, score, all results, details, and comparison plot.
- Return type:
dict
Examples
>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
>>> result = shark.battle(data, target='y', models=['linear_regression', 'random_forest'])
>>> print(result['champion'])
linear_regression
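The caveat on `early_stopping` is easier to see with numbers. The sketch below uses a hypothetical `pick_champion` helper and made-up scores; a precomputed dict stands in for training each model in turn. It is an illustration of the selection logic only, not SharkPy's code.

```python
def pick_champion(scores, early_stopping=False, min_score=0.5):
    """Return (champion, evaluated) from an ordered {model_name: score} mapping.

    Higher scores are better. With early_stopping, evaluation stops at the
    first model whose score exceeds min_score.
    """
    evaluated = {}
    for name, score in scores.items():
        evaluated[name] = score
        if early_stopping and score > min_score:
            break  # later models are never evaluated
    champion = max(evaluated, key=evaluated.get)
    return champion, evaluated


# Hypothetical R^2 scores, listed in the order the models would be trained.
scores = {"linear_regression": 0.62, "random_forest": 0.81, "xgboost": 0.79}

full_champion, _ = pick_champion(scores)
early_champion, seen = pick_champion(scores, early_stopping=True, min_score=0.5)
print(full_champion, early_champion, list(seen))
```

With early stopping, the first model to clear the threshold wins even though a later one would have scored higher, which is why the default is False.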
- explain_with_shapash(title_story: str | None = None, display: bool = True)
Create an interactive Shapash dashboard for model interpretation.
- Parameters:
title_story (str, optional) – Title for the Shapash dashboard (default: None).
display (bool, optional) – Whether to display the dashboard (default: True).
- Return type:
None
- Raises:
ImportError – If the shapash package is not installed.
ValueError – If no model is trained.
Examples
>>> shark = Shark()
>>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 1, 0]})
>>> shark.learn(data, target='y')
>>> shark.explain_with_shapash(title_story='My Model Analysis')
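The documented ImportError is typical of optional dependencies. A common pattern for raising it early with a helpful message is sketched below. `require_package` is a hypothetical helper, and the package names in the demo are stand-ins so the snippet runs without shapash installed.

```python
import importlib.util


def require_package(package, feature):
    """Raise a helpful ImportError if an optional dependency is not installed."""
    if importlib.util.find_spec(package) is None:
        raise ImportError(
            f"{feature} requires the '{package}' package. "
            f"Install it with: pip install {package}"
        )


require_package("json", "JSON export")  # standard-library module, always present

try:
    require_package("surely_not_installed_pkg_12345", "Demo feature")
    error_message = None
except ImportError as exc:
    error_message = str(exc)
print(error_message)
```

Calling a check like this at the top of a method lets the caller see the missing dependency before any expensive work starts.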
- available_models() → Dict
List all available models with their details and print a comparison table.
- Returns:
Dictionary of available models and their details.
- Return type:
dict
Examples
>>> shark = Shark()
>>> models = shark.available_models()
🦈 Available Models in SharkPy 🦈
...
>>> print(models.keys())
dict_keys(['linear_regression', 'random_forest', 'xgboost', ...])