A comprehensive collection of Python scripts and examples focused on data engineering fundamentals, including data importing, manipulation, API interactions, and Python programming concepts.
- Overview
- Project Structure
- Technologies
- Installation
- Usage
- Topics Covered
- Data Files
- Contributing
- License
This repository contains hands-on examples and exercises covering essential data engineering topics in Python. It includes practical implementations of data importing techniques, working with various file formats, API interactions, and intermediate Python programming concepts.
data-engineer/
├── data/ # Sample datasets
│ ├── a_movie.json
│ ├── digits_header.txt
│ ├── digits.csv
│ ├── moby_dick.txt
│ ├── sales.csv
│ ├── seaslug.txt
│ ├── test.hdf5
│ ├── titanic_corrupt.txt
│ ├── titanic.csv
│ └── winequality-red.csv
├── importing-data-in-python/ # Data importing fundamentals
│ ├── importing-data.py
│ ├── introduction-and-flat-files.py
│ └── relational-databases.pyi
├── intermediate-importing-data/ # Advanced data importing
│ ├── diving-deep-into-the-Twitter-API.py
│ ├── importing-data-from-the-Internet.py
│ └── interacting-with-APIs.py
├── intermediate-python/ # Python programming concepts
│ ├── function.py
│ ├── lambda-functions-and-error-handling.py
│ └── python-ecosystem.py
├── introduction-api/ # API fundamentals
│ └── making-api-requests-with-python.py
├── introduction-to-python/ # Python basics
│ ├── control-flow-and-loops.py
│ ├── data-types.py
│ └── introduction.py
├── scripts/ # Utility scripts
│ ├── generate_digits.py
│ ├── generate_h5py.py
│ └── numpy_txt.py
└── requirements.txt # Project dependencies
- Python 3.x
- NumPy - Numerical computing
- Pandas - Data manipulation and analysis
- Matplotlib - Data visualization
- PyYAML - YAML file parsing
- Pillow - Image processing
- Tweepy - Twitter API integration
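A quick import check is one way to confirm that the libraries listed above are available before running any script. The sketch below is not part of the repository: the helper name `missing_packages` is hypothetical, and the module names are the import names these distributions are known to install (for example `yaml` for PyYAML and `PIL` for Pillow).

```python
from importlib.util import find_spec

# Import names for the libraries listed above. The import name can differ
# from the distribution name (PyYAML -> yaml, Pillow -> PIL).
REQUIRED_MODULES = ["numpy", "pandas", "matplotlib", "yaml", "PIL", "tweepy"]


def missing_packages(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

If anything is reported as missing, installing from requirements.txt inside the active virtual environment should resolve it.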
- Clone the repository:
git clone https://github.com/petsereypanha/data-engineer.git
cd data-engineer
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows
- Install required packages:
pip install -r requirements.txt
Navigate to any module directory and run the Python scripts:
# Example: Run data importing scripts
python importing-data-in-python/importing-data.py
# Example: Run API interaction scripts
python intermediate-importing-data/interacting-with-APIs.py
# Example: Run Python function examples
python intermediate-python/function.py
- Reading flat files (CSV, TXT)
- Working with Excel files
- Loading pickle files
- Importing SAS and Stata files
- Working with HDF5 files
- Loading MATLAB files
- Importing data from the Internet
- API interactions and authentication
- Working with Twitter API
- HTTP requests and responses
- JSON data parsing
- Function definition and usage
- Default arguments and keyword arguments
- Docstrings and documentation
- *args and **kwargs
- Lambda functions
- Error handling
- Python ecosystem tools
- Basic data types
- Control flow (if/else statements)
- Loops (for, while)
- Python fundamentals
- Making HTTP requests
- RESTful API concepts
- Authentication methods
- Handling API responses
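Several of the intermediate Python topics listed above (default and keyword arguments, docstrings, `*args` and `**kwargs`, lambda functions, and error handling) fit into one short sketch. The function names `summarize` and `safe_ratio` and the sample values are illustrative assumptions, not code taken from the repository's scripts.

```python
def summarize(*values, label="total", **options):
    """Sum numeric values and return a small report dictionary.

    *values collects positional arguments into a tuple, label has a
    default value, and **options collects extra keyword arguments.
    """
    precision = options.get("precision", 2)
    return {"label": label, "value": round(sum(values), precision)}


def safe_ratio(numerator, denominator):
    """Divide two numbers, returning None instead of raising on zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return None


# A lambda is an anonymous single-expression function, often used as a key.
words = ["pandas", "io", "numpy"]
by_length = sorted(words, key=lambda word: len(word))

if __name__ == "__main__":
    print(summarize(1, 2, 3, label="sales"))
    print(safe_ratio(1, 0))
    print(by_length)
```

Catching only `ZeroDivisionError` rather than a bare `except` keeps unrelated bugs, such as passing a string, visible instead of silently returning `None`.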
The data/ directory contains various sample datasets for practice:
- CSV files: digits.csv, sales.csv, titanic.csv, winequality-red.csv
- JSON files: a_movie.json
- Text files: moby_dick.txt, seaslug.txt, digits_header.txt
- HDF5 files: test.hdf5
- Specialized formats: Various data formats for learning different import techniques
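As a minimal sketch of loading files like those listed above, the helpers below read a CSV file and a JSON file using only the standard library, so they run without the project dependencies. The helper names are hypothetical, the repository's own scripts may use pandas or NumPy instead, and the sketch assumes the CSV has a header row and that it is run from the repository root.

```python
import csv
import json
from pathlib import Path

DATA_DIR = Path("data")  # assumption: executed from the repository root


def read_csv_rows(path):
    """Read a CSV file with a header row into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def read_json(path):
    """Parse a JSON file into the corresponding Python object."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    csv_path = DATA_DIR / "titanic.csv"
    json_path = DATA_DIR / "a_movie.json"
    # Skip files that are not present so the sketch runs anywhere.
    if csv_path.exists():
        print(f"{csv_path.name}: {len(read_csv_rows(csv_path))} rows")
    if json_path.exists():
        movie = read_json(json_path)
        print(f"{json_path.name}: top-level type {type(movie).__name__}")
```

`csv.DictReader` returns every value as a string, so numeric columns need an explicit conversion before any arithmetic.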
Contributions are welcome! Feel free to:
- Fork the repository
- Create a new branch (git checkout -b feature/improvement)
- Make your changes
- Commit your changes (git commit -am 'Add new feature')
- Push to the branch (git push origin feature/improvement)
- Create a Pull Request
This project is created for educational purposes.
Panha Setserey
- GitHub: @petsereypanha
⭐ If you find this repository helpful, please consider giving it a star!