- Environment Creation
- GitHub Setup
- Initialize Git Repository & First Commit
- Setup and requirements.txt
- Setting up src Folder
- Set up Project With GitHub
- Data Ingestion
- Data Transformation
- Model Trainer
- Model Evaluation
- Model Deployment
- CI/CD Pipelines - GitHub Actions
- Deployment - AWS
- Create a new environment:

  ```bash
  conda create -p venv python==<version> -y
  ```

- Activate the new environment:

  ```bash
  conda activate venv/
  ```
- Set up the GitHub repository:
  - Create a new repository on GitHub.
  - Copy the repository URL.
- Initialize the Git repository locally:

  ```bash
  git init
  ```
- Create files locally:
  - README.md
  - .gitignore
- Stage README.md and .gitignore:

  ```bash
  git add README.md .gitignore
  ```
- Commit the changes to the repository:

  ```bash
  git commit -m "first commit"
  ```
- Check the status of the changes:

  ```bash
  git status
  ```
- Rename the default branch to main:

  ```bash
  git branch -M main
  ```
- Add the remote repository:

  ```bash
  git remote add origin <paste_your_repository_url>
  ```
- Push to the GitHub repository:

  ```bash
  git push -u origin main
  ```

Now your local repository is initialized, and the README.md and .gitignore files are committed and pushed to the main branch on GitHub. This sets up the foundation for your end-to-end machine learning project.
- Create a requirements.txt file.
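For illustration, a minimal requirements.txt for a project like this might look as follows; the specific libraries are assumptions, so list whatever your project actually uses (the `-e .` entry is explained below):

```txt
pandas
numpy
seaborn
scikit-learn
-e .
```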
- Create a `setup.py` file in the project root to specify project metadata and dependencies. `setup.py` is used to build the application as a package (library):
```python
from setuptools import find_packages, setup
from typing import List

HYPEN_E_DOT = '-e .'


def get_requirements(file_path: str) -> List[str]:
    '''
    This function will return the list of requirements.
    '''
    requirements = []
    with open(file_path) as file_obj:
        requirements = file_obj.readlines()
        requirements = [req.replace("\n", "") for req in requirements]

        if HYPEN_E_DOT in requirements:
            requirements.remove(HYPEN_E_DOT)

    return requirements


setup(
    name='mlproject',  # To Change
    version='0.0.1',  # To Change
    author='Your Name',  # To Change
    author_email='your_email_address',  # To Change
    packages=find_packages(),
    install_requires=get_requirements('requirements.txt')
)
```
- Create a `src` folder:
  - The `src` (source) folder is commonly used to organize the source code of your project.

  ```bash
  mkdir src
  ```
- Add `__init__.py` to `src`:
  - `__init__.py` is needed for the `setup.py` configuration: `find_packages()` only discovers directories that contain it.
  - The `__init__.py` file can be an empty file or include initialization code. It signals Python to treat the directory as a package.

  ```bash
  touch src/__init__.py
  ```
- Move code files to `src`:
  - Move your Python files (modules) from the project root to the `src` folder.

  ```bash
  mv module1.py src/
  mv module2.py src/
  ```
- Update import statements:
  - Update import statements in your code to reflect the new package structure.

  ```python
  # Before
  from module1 import function1

  # After
  from src.module1 import function1
  ```
- Install the package in editable mode:
  - Install your package in editable mode so that code changes are reflected without reinstalling.

  ```bash
  pip install -e .
  ```
- Add `-e .` to the end of requirements.txt; this entry maps to `setup.py`.
  - This helps if you want to install everything directly from requirements.txt: running `pip install -r requirements.txt` will then automatically trigger `setup.py` as well.
  - Because `-e .` is not a real library, the `HYPEN_E_DOT` constant and the check inside `get_requirements()` (shown in `setup.py` above) remove it from the list before it is passed to `install_requires`.
- Run requirements.txt:

  ```bash
  pip install -r requirements.txt
  ```

  - The `-e .` entry in requirements.txt will trigger `setup.py`.
  - As a result you will get all libraries installed and a new `mlproject.egg-info` folder in your directory.
Notes:

- The `src` structure enhances modularity and avoids clutter in the project root.
- `__init__.py` is crucial for Python to recognize the `src` folder as a package.
- `setup.py` configures project metadata and dependencies for distribution.
- Using an editable installation (`pip install -e .`) facilitates development and testing.

This organization improves code structure and prepares your project for distribution or deployment. Adjust the package name, version, and dependencies in `setup.py` according to your project's specifications.
- Create an `__init__.py` file inside the `components` folder.
  - The components are created as a package so that they can be imported from other file locations; that is why you add `__init__.py`.
- Start to add components.
  - The components are the modules (for example, Data Ingestion, Data Transformation) that we are going to create, such as `data_ingestion.py`, `data_transformation.py`, `model_trainer.py`.
- Create an `__init__.py` file inside the `pipeline` folder.
  - The pipelines are likewise created as a package so that they can be imported elsewhere; that is why you add `__init__.py`.
  - In the `pipeline` folder you will have pipelines which consume modules from the `components` folder.
- Start to add pipelines.
  - For example: `train_pipeline.py`, `predict_pipeline.py`.
  - The commands below sketch this folder layout.
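A quick sketch of the shell commands that create this layout; the file names are taken from the examples above:

```bash
mkdir -p src/components src/pipeline
touch src/components/__init__.py src/pipeline/__init__.py
touch src/components/data_ingestion.py src/components/data_transformation.py src/components/model_trainer.py
touch src/pipeline/train_pipeline.py src/pipeline/predict_pipeline.py
```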
- Create a `logger.py` file inside the `src` folder.
- Write the `logger.py` file:
```python
import logging
import os
from datetime import datetime

LOG_FILE = f"{datetime.now().strftime('%m_%d_%Y_%H_%M_%S')}.log"

# Create only the logs directory (joining LOG_FILE into this path would
# create a directory named after the log file instead of a plain logs/ folder).
logs_path = os.path.join(os.getcwd(), "logs")
os.makedirs(logs_path, exist_ok=True)

LOG_FILE_PATH = os.path.join(logs_path, LOG_FILE)

logging.basicConfig(
    filename=LOG_FILE_PATH,
    format="[ %(asctime)s ] %(lineno)d %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)
```
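To verify the logger, one option (an assumption, not part of the original file) is to append a small entry point to `logger.py`:

```python
if __name__ == "__main__":
    logging.info("Logging has started")
```

Running `python src/logger.py` should then create a `logs/` folder containing a timestamped `.log` file with this message.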
- Create an `exception.py` file inside the `src` folder (see the Python documentation for more info about the built-in exceptions).
- Write the `exception.py` file:
```python
import sys


def error_message_detail(error, error_detail: sys):
    # exc_info() returns the traceback, which tells us in which file and
    # on which line number the exception occurred
    _, _, exc_tb = error_detail.exc_info()
    file_name = exc_tb.tb_frame.f_code.co_filename
    error_message = "Error occurred in python script name [{0}] line number [{1}] error message [{2}]".format(
        file_name, exc_tb.tb_lineno, str(error)
    )
    return error_message


class CustomException(Exception):
    def __init__(self, error_message, error_detail: sys):
        super().__init__(error_message)  # must be super(), not the bare `super`
        self.error_message = error_message_detail(error_message, error_detail=error_detail)

    def __str__(self):
        return self.error_message
```
- This exception handling is common to the whole project and can be reused everywhere.
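A minimal sketch of how the custom exception and logger are used together; the divide-by-zero is only a placeholder to trigger an error:

```python
import sys

from src.exception import CustomException
from src.logger import logging

try:
    1 / 0  # placeholder error to demonstrate the flow
except Exception as e:
    logging.info("Divide by zero error occurred")
    raise CustomException(e, sys)
```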
- Create a `utils.py` file inside the `src` folder. This file will contain common functionality used across the entire application (it will be consumed by the components).
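For example, a helper that serializes a fitted object (a preprocessor or a model) into the `artifacts` folder could live in `utils.py`. A minimal sketch, assuming `pickle` for serialization; the name `save_object` is an illustrative choice, not prescribed by this guide:

```python
import os
import pickle
import sys

from src.exception import CustomException


def save_object(file_path: str, obj) -> None:
    """Serialize a Python object to file_path (hypothetical helper)."""
    try:
        # Ensure the target directory (e.g. artifacts/) exists.
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        with open(file_path, "wb") as file_obj:
            pickle.dump(obj, file_obj)
    except Exception as e:
        raise CustomException(e, sys)
```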
Find the material of this tutorial in the `notebook` folder:

- `1.EDA STUDENT PERFORMANCE.ipynb`
- `2.MODEL TRAINING.ipynb`
- The following class will help us save data into a specific location, in this case into the `artifacts` folder:

  ```python
  import os
  from dataclasses import dataclass


  @dataclass
  class DataIngestionConfig:
      train_data_path: str = os.path.join('artifacts', "train.csv")
      test_data_path: str = os.path.join('artifacts', "test.csv")
      raw_data_path: str = os.path.join('artifacts', "data.csv")
  ```
- In Python, the `@dataclass` decorator is used to automatically generate special methods such as `__init__()`, `__repr__()`, `__eq__()`, and others, based on the class variables defined within the class. It's particularly useful for creating classes that primarily store data, such as configuration settings, without needing to implement these methods manually.
- Please notice that in this notebook we use the `logging` and `CustomException` modules that we've created in the `src` folder:

  ```python
  import sys

  from src.exception import CustomException
  from src.logger import logging

  try:
      logging.info("Entered the data ingestion method or component")
      # ... ingestion logic ...
  except Exception as e:
      raise CustomException(e, sys)
  ```
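Putting the pieces together, a minimal sketch of a data ingestion component built around `DataIngestionConfig`; the source CSV path, the 80/20 split, and the method name are assumptions for illustration:

```python
import os
import sys
from dataclasses import dataclass

import pandas as pd
from sklearn.model_selection import train_test_split

from src.exception import CustomException
from src.logger import logging


@dataclass
class DataIngestionConfig:  # repeated from above for self-containment
    train_data_path: str = os.path.join('artifacts', "train.csv")
    test_data_path: str = os.path.join('artifacts', "test.csv")
    raw_data_path: str = os.path.join('artifacts', "data.csv")


class DataIngestion:
    def __init__(self):
        self.ingestion_config = DataIngestionConfig()

    def initiate_data_ingestion(self):
        logging.info("Entered the data ingestion method or component")
        try:
            # The source CSV location is an assumption; point it at your dataset.
            df = pd.read_csv('notebook/data/data.csv')

            os.makedirs(os.path.dirname(self.ingestion_config.train_data_path), exist_ok=True)
            df.to_csv(self.ingestion_config.raw_data_path, index=False, header=True)

            # 80/20 train/test split is an illustrative choice.
            train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
            train_set.to_csv(self.ingestion_config.train_data_path, index=False, header=True)
            test_set.to_csv(self.ingestion_config.test_data_path, index=False, header=True)

            logging.info("Ingestion of the data is completed")
            return (
                self.ingestion_config.train_data_path,
                self.ingestion_config.test_data_path,
            )
        except Exception as e:
            raise CustomException(e, sys)
```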
Impute, in the context of data analysis or statistics, refers to filling in missing values in a dataset.

- We need `num_pipeline` to fill the missing values in the numerical features with the "median" value (imputing) and to standard-scale the numerical values.
- The "median" value is used for filling because our EDA showed that there are outliers in the numerical features, and the median is robust to them:
```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)
```
- `cat_pipeline` handles the categorical features: missing values are filled with the most frequent category, categories are one-hot encoded, and `with_mean=False` is required because `OneHotEncoder` outputs a sparse matrix that cannot be mean-centered:

  ```python
  from sklearn.preprocessing import OneHotEncoder

  cat_pipeline = Pipeline(
      steps=[
          ("imputer", SimpleImputer(strategy="most_frequent")),
          ("one_hot_encoder", OneHotEncoder()),
          ("scaler", StandardScaler(with_mean=False))
      ]
  )
  ```
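The two pipelines are typically combined into a single preprocessor with scikit-learn's `ColumnTransformer`; a minimal sketch, where the column lists are hypothetical placeholders for the features identified in your EDA:

```python
from sklearn.compose import ColumnTransformer

# Hypothetical column lists; substitute the features from your dataset.
numerical_columns = ["reading_score", "writing_score"]
categorical_columns = ["gender", "lunch", "test_preparation_course"]

preprocessor = ColumnTransformer(
    [
        ("num_pipeline", num_pipeline, numerical_columns),
        ("cat_pipeline", cat_pipeline, categorical_columns),
    ]
)
```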