sharkpy.core
============

.. py:module:: sharkpy.core


Attributes
----------

.. autoapisummary::

   sharkpy.core.explain_with_shapash


Classes
-------

.. autoapisummary::

   sharkpy.core.Shark


Module Contents
---------------

.. py:data:: explain_with_shapash
   :value: None


.. py:class:: Shark

   A machine learning model manager that simplifies training, prediction, and analysis.

   .. attribute:: model

      The trained machine learning model (e.g., LogisticRegression, RandomForestClassifier).

      :type: object or None

   .. attribute:: problem_type

      Type of ML problem ('classification' or 'regression').

      :type: str or None

   .. attribute:: features

      Input features used for training.

      :type: pd.DataFrame or None

   .. attribute:: target

      Target variable (encoded for classification, original for regression).

      :type: pd.Series or np.ndarray or None

   .. attribute:: target_name

      Name of the target column in the input data.

      :type: str or None

   .. attribute:: data

      Original input DataFrame, including features and target.

      :type: pd.DataFrame or None

   .. attribute:: project_name

      Name of the current project for tracking and reporting.

      :type: str or None

   .. attribute:: feature_names

      Names of feature columns.

      :type: list of str or None

   .. attribute:: encoders

      Dictionary storing feature encoders (e.g., for categorical features).

      :type: dict

   .. attribute:: label_encoder

      Encoder for categorical target variable (for classification).

      :type: LabelEncoder or None

   .. attribute:: stats_model

      Statistical model for detailed analysis (optional).

      :type: object or None

   .. attribute:: statistical_summary

      Summary of statistical analysis (optional).

      :type: str or None

   .. attribute:: p_values

      P-values from statistical analysis (optional).

      :type: pd.Series or None

   .. attribute:: conf_intervals

      Confidence intervals from statistical analysis (optional).

      :type: pd.DataFrame or None

   .. rubric:: Examples

   >>> from sharkpy import Shark
   >>> import pandas as pd
   >>> shark = Shark()
   >>> data = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv', header=None)
   >>> data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
   >>> shark.learn(data=data, target='species', model_choice='logistic_regression')
   >>> predictions = shark.predict(data)
   >>> shark.explain(export_path='explanation.pdf', format='pdf', depth='simple')
   >>> cv_results, train_metrics = shark.report(cv_folds=5)


   .. py:attribute:: model
      :value: None


   .. py:attribute:: features
      :value: None


   .. py:attribute:: target
      :value: None


   .. py:attribute:: problem_type
      :value: None


   .. py:attribute:: target_name
      :value: None


   .. py:attribute:: data
      :value: None


   .. py:attribute:: label_encoder
      :value: None


   .. py:attribute:: project_name
      :value: None


   .. py:attribute:: feature_names
      :value: None


   .. py:attribute:: encoders


   .. py:attribute:: stats_model
      :value: None


   .. py:attribute:: statistical_summary
      :value: None


   .. py:attribute:: p_values
      :value: None


   .. py:attribute:: conf_intervals
      :value: None


   .. py:method:: learn(data: Union[str, pandas.DataFrame], project_name: str = 'your data', target: Optional[str] = None, problem_type: Optional[str] = None, model: Optional[object] = None, model_choice: Optional[str] = None, detailed_stats: bool = False, n_trials: int = 30, verbose: bool = False) -> Shark

      Train a machine learning model on the provided data.

      :param data: Dataset for training. Can be a file path (CSV) or a pandas DataFrame.
      :type data: str or pd.DataFrame
      :param project_name: Name of the project for tracking and reporting (default: "your data").
      :type project_name: str, optional
      :param target: Name of the target column to predict (default: None).
      :type target: str, optional
      :param problem_type: Type of problem: 'regression', 'classification', or None for auto-detection (default: None).
      :type problem_type: str, optional
      :param model: Custom model instance to use (default: None).
      :type model: object, optional
      :param model_choice: Built-in model to use (e.g., 'logistic_regression', 'random_forest', 'xgboost') (default: None).
      :type model_choice: str, optional
      :param detailed_stats: Whether to compute detailed statistical analysis (e.g., p-values, confidence intervals) (default: False).
      :type detailed_stats: bool, optional
      :param n_trials: Number of optimization trials for boosting models (e.g., XGBoost) (default: 30).
      :type n_trials: int, optional
      :param verbose: Whether to print detailed output during training (default: False).
      :type verbose: bool, optional

      :returns: The current Shark instance with trained model and updated attributes.
      :rtype: Shark

      .. rubric:: Notes

      - Automatically encodes categorical features and target (for classification).
      - Stores the original DataFrame in `self.data` and target name in `self.target_name`.
      - For classification, stores the `LabelEncoder` in `self.label_encoder` to preserve category names.
      - Performs K-Fold cross-validation and prints mean and standard deviation of scores.
      - Fits the selected model on the entire dataset after cross-validation.
      - Warning: Avoid loading untrusted CSV files, as they may contain malicious data.

      .. rubric:: Examples

      >>> shark = Shark()
      >>> data = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'a']})
      >>> shark.learn(data, target='y', model_choice='logistic_regression')
      🦈 Looks like a classification problem (non-numeric target: y)
      🦈 Encoding categorical target 'y' to numeric labels
      ...
      >>> shark.target_name
      'y'
      >>> shark.label_encoder.classes_
      array(['a', 'b'], dtype=object)


   .. py:method:: predict(X: Optional[Union[Dict, pandas.DataFrame, List[Dict], numpy.ndarray]] = None) -> Union[float, str, numpy.ndarray]

      Make predictions using the trained model.

      :param X: Input samples to predict. If None, predicts on training data. Options:

                - dict: Single prediction (e.g., {'feature1': value1, 'feature2': value2}).
                - list of dict: Multiple scenarios (e.g., [{'feature1': value1}, {'feature1': value2}]).
                - pd.DataFrame: Multiple samples with feature columns.
                - np.ndarray: Raw feature values (must match training feature count).
      :type X: dict, pd.DataFrame, list of dict, np.ndarray, or None, optional

      :returns: Predicted values. For classification, returns original category names if `label_encoder` is available.
      :rtype: float, str, or np.ndarray

      :raises ValueError: If no model is trained or input data is invalid.

      .. rubric:: Examples

      >>> shark = Shark()
      >>> data = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'y': ['cat', 'dog']})
      >>> shark.learn(data, target='y')
      >>> shark.predict({'x1': 1, 'x2': 3})
      'cat'
      >>> shark.predict(data[['x1', 'x2']])
      array(['cat', 'dog'], dtype=object)


   .. py:method:: predict_baseline() -> Union[float, str]

      Make a baseline prediction using the minimum values of the training features.

      :returns: Baseline prediction for regression (mean) or classification (most frequent class).
      :rtype: float or str

      :raises ValueError: If no model is trained.

      .. rubric:: Examples

      >>> shark = Shark()
      >>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
      >>> shark.learn(data, target='y')
      >>> shark.predict_baseline()
      20.0


   .. py:method:: plot(kind: str = 'prediction', show: bool = True, save_path: Optional[str] = None, colors: Optional[Dict[str, str]] = None)

      Visualize model behavior based on the specified plot type.

      :param kind: Type of plot: 'prediction', 'residuals', 'confusion_matrix', 'roc', 'pr_curve', 'proba_hist', or 'feature_importance' (default: 'prediction').
      :type kind: str, optional
      :param show: Whether to display the plot (default: True).
      :type show: bool, optional
      :param save_path: Path to save the plot (default: None).
      :type save_path: str, optional
      :param colors: Custom color specifications for the plot. If None, uses default SharkPy colors.
                     Available keys: 'primary', 'secondary', 'accent', 'background', 'grid', 'text', 'bars'.
      :type colors: dict, optional

      :rtype: None

      :raises ValueError: If no model is trained or the plot type is invalid.

      .. rubric:: Examples

      >>> shark = Shark()
      >>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 1, 0]})
      >>> shark.learn(data, target='y')
      >>> shark.plot(kind='confusion_matrix')
      >>> # Custom colors example
      >>> custom_colors = {
      ...     'primary': '#FF6B6B',      # Coral red
      ...     'secondary': '#4ECDC4',    # Turquoise
      ...     'background': '#F7FFF7'    # Light green
      ... }
      >>> shark.plot(kind='feature_importance', colors=custom_colors)


   .. py:method:: report(cv_folds: int = 5, export_path: Optional[str] = None, format: str = 'txt') -> tuple

      Generate a comprehensive performance report with cross-validation metrics.

      :param cv_folds: Number of cross-validation folds (default: 5).
      :type cv_folds: int, optional
      :param export_path: Path to export the report (txt, docx, or pdf) (default: None).
      :type export_path: str, optional
      :param format: Export format: 'txt', 'docx', or 'pdf' (default: 'txt').
      :type format: str, optional

      :returns: (cv_results, train_metrics), where cv_results is a dict of cross-validation metrics
                and train_metrics is a dict of training metrics.
      :rtype: tuple

      :raises ValueError: If no model is trained or the format is invalid.

      .. rubric:: Examples

      >>> shark = Shark()
      >>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 1, 0]})
      >>> shark.learn(data, target='y')
      >>> cv_results, train_metrics = shark.report(cv_folds=5)
      >>> print(cv_results['test_accuracy'].mean())


   .. py:method:: explain(cv_results=None, train_metrics=None, export_path: Optional[str] = None, format: str = 'txt', depth: str = 'deep', verbose: int = 1) -> Optional[pandas.DataFrame]

      Explain the model's behavior and performance with customizable depth and export options.

      :param cv_results: Cross-validation results from report(), containing metrics like test_r2 or test_accuracy.
      :type cv_results: dict, optional
      :param train_metrics: Training metrics from report(), containing metrics like r2 or accuracy.
      :type train_metrics: dict, optional
      :param export_path: Path to export the explanation (txt, docx, or pdf) (default: None).
      :type export_path: str, optional
      :param format: Export format: 'txt', 'docx', or 'pdf' (default: 'txt').
      :type format: str, optional
      :param depth: Explanation depth: 'simple' (beginner overview), 'mechanics' (technical details),
                    'interpretation' (performance analysis), 'actionable' (recommendations),
                    'deep' (all levels, default), or 'shapash' (interactive SHAP dashboard).
      :type depth: str, optional

      :returns: Feature importance DataFrame if available, else None.
      :rtype: pd.DataFrame or None

      .. rubric:: Notes

      - Requires a trained model (call `learn` first).
      - For classification, uses `label_encoder` to display original category names (e.g., 'Iris-setosa' instead of 0).
      - If `export_path` is provided, saves the explanation in the specified format.
      - 'shapash' depth requires the `shapash` package to be installed.

      .. rubric:: Examples

      >>> shark = Shark()
      >>> data = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'y': ['cat', 'dog']})
      >>> shark.learn(data, target='y')
      >>> shark.explain(depth='simple', export_path='explanation.txt')
      🦈 Sharky is diving into the LogisticRegression model explanation...
      ...
      >>> # explanation.txt contains: "This model predicts one of 2 categories (cat, dog)..."


   .. py:method:: save_model(name: str = 'shark_model', directory: str = 'models') -> str

      Save the trained model to a .joblib file.

      :param name: Filename without extension (default: "shark_model").
      :type name: str, optional
      :param directory: Folder where the model will be saved (default: "models").
      :type directory: str, optional

      :returns: Path to the saved model file.
      :rtype: str

      :raises ValueError: If no model is trained.
      :raises OSError: If directory creation or file writing fails.

      .. rubric:: Examples

      >>> shark = Shark()
      >>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
      >>> shark.learn(data, target='y')
      >>> shark.save_model(name='my_model')
      'models/my_model.joblib'


   .. py:method:: load_model(model_path: str) -> object

      Load a saved SharkPy model from a .joblib file.

      :param model_path: Path to the saved .joblib model file.
      :type model_path: str

      :returns: The loaded model object.
      :rtype: object

      :raises FileNotFoundError: If the model file does not exist.
      :raises ValueError: If the file is not a valid model.

      .. rubric:: Examples

      >>> shark = Shark()
      >>> shark.load_model('models/my_model.joblib')


   .. py:method:: battle(data: pandas.DataFrame, target: str, models: List[str] = ['linear_regression', 'random_forest', 'xgboost'], metric: str = 'r2', n_trials: int = 30, early_stopping: bool = False, min_score: float = 0.5, verbose: int = 0) -> Dict

      Compare multiple models and select the best performer.

      :param data: Input data for training.
      :type data: pd.DataFrame
      :param target: Name of the target column.
      :type target: str
      :param models: List of model names to compare (e.g., ['linear_regression', 'random_forest'])
                     (default: ['linear_regression', 'random_forest', 'xgboost']).
      :type models: list of str, optional
      :param metric: Metric to compare models (e.g., 'r2', 'accuracy') (default: 'r2').
      :type metric: str, optional
      :param n_trials: Number of optimization trials for boosting models (default: 30).
      :type n_trials: int, optional
      :param early_stopping: If True, stops training if any model exceeds `min_score`.
                             Not recommended, as it may miss better models later (default: False).
      :type early_stopping: bool, optional
      :param min_score: Minimum score to trigger early stopping (default: 0.5).
      :type min_score: float, optional
      :param verbose: Verbosity level for model training (default: 0).
      :type verbose: int, optional

      :returns: Dictionary containing champion model name, model object, score, all results, details, and comparison plot.
      :rtype: dict

      .. rubric:: Examples

      >>> shark = Shark()
      >>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
      >>> result = shark.battle(data, target='y', models=['linear_regression', 'random_forest'])
      >>> print(result['champion'])
      linear_regression


   .. py:method:: explain_with_shapash(title_story: Optional[str] = None, display: bool = True)

      Create an interactive Shapash dashboard for model interpretation.

      :param title_story: Title for the Shapash dashboard (default: None).
      :type title_story: str, optional
      :param display: Whether to display the dashboard (default: True).
      :type display: bool, optional

      :rtype: None

      :raises ImportError: If the `shapash` package is not installed.
      :raises ValueError: If no model is trained.

      .. rubric:: Examples

      >>> shark = Shark()
      >>> data = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 1, 0]})
      >>> shark.learn(data, target='y')
      >>> shark.explain_with_shapash(title_story='My Model Analysis')


   .. py:method:: available_models() -> Dict

      List all available models with their details and print a comparison table.

      :returns: Dictionary of available models and their details.
      :rtype: dict

      .. rubric:: Examples

      >>> shark = Shark()
      >>> models = shark.available_models()
      🦈 Available Models in SharkPy 🦈
      ...
      >>> print(models.keys())
      dict_keys(['linear_regression', 'random_forest', 'xgboost', ...])
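
The ``early_stopping`` and ``min_score`` parameters of ``battle`` are described as stopping the comparison once any model exceeds the threshold, at the risk of missing a better model later. The sketch below illustrates that trade-off in plain Python. It does not reproduce SharkPy's implementation: ``pick_champion``, the ``candidates`` mapping, and the scores are hypothetical stand-ins for cross-validated metrics.

```python
from typing import Callable, Dict, Optional, Tuple


def pick_champion(
    scorers: Dict[str, Callable[[], float]],
    early_stopping: bool = False,
    min_score: float = 0.5,
) -> Tuple[Optional[str], Dict[str, float]]:
    """Score candidates in order; return (champion name, scores computed so far)."""
    results: Dict[str, float] = {}
    for name, score_fn in scorers.items():
        results[name] = score_fn()
        # With early stopping, the first candidate above the threshold ends the loop,
        # so later (possibly better) candidates are never evaluated.
        if early_stopping and results[name] > min_score:
            break
    champion = max(results, key=lambda k: results[k]) if results else None
    return champion, results


# Hypothetical scores; in practice these would come from model evaluation.
candidates = {
    "linear_regression": lambda: 0.62,
    "random_forest": lambda: 0.81,
    "xgboost": lambda: 0.78,
}

full_champion, full_results = pick_champion(candidates)
early_champion, early_results = pick_champion(candidates, early_stopping=True)
print(full_champion, early_champion)
```

With these made-up scores, the full comparison evaluates all three candidates, while the early-stopping run halts after the first candidate that clears ``min_score``, which is why the parameter is flagged as not recommended.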