# datalake

Here are 65 public repositories matching this topic...

Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.

data-science data csv sql database ai pandas data-analysis datalake gpt-3 gpt-4 llm

Updated Jun 12, 2024
Python

activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Updated Jun 12, 2024
Python

awslabs / aws-orbit-workbench

A Data Platform built for AWS, powered by Kubernetes.

kubernetes aws jupyter analytics gpu jupyterhub data-analysis redshift mach workbench datalake dataengineering eks eks-cluster orbit-workbench

Updated Jul 24, 2023
Python

UncoderIO / Uncoder_IO

An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.

translation xdr siem sigma datalake edr threathunting roota uncoder uncoderio

Updated Jun 12, 2024
Python

martandsingh / ApacheSpark

This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.

sql database spark hive hadoop etl pyspark data-engineering spark-streaming data-analysis databricks datalake spark-sql timetravel apachespark etl-pipeline deltalake

Updated Dec 28, 2023
Python

vim89 / datapipelines-essentials-python

Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

python big-data spark apache-spark hadoop etl xml python3 xml-parsing pyspark data-pipeline datalake hadoop-mapreduce spark-sql etl-framework hadoop-hdfs etl-pipeline etl-components

Updated May 6, 2023
Python

hifxit / dataligo

A library to accelerate ML and ETL pipelines by connecting all data sources.

python database nosql datawarehouse datalake etl-pipeline ml-pipeline

Updated May 3, 2023
Python

PaloAltoNetworks / pan-cortex-data-lake-python

Python idiomatic SDK for Cortex™ Data Lake.

Updated Jan 7, 2022
Python

abdullahkhawer / aws-auto-terminate-idle-emr

An AWS-based solution that uses AWS CloudWatch and Python-based AWS Lambda to automatically terminate AWS EMR clusters that have been idle for a specified period of time.

Updated Jun 5, 2024
Python

aws-samples / aws-insurancelake-etl

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake Infrastructure project.

aws insurance glue datalake cdk

Updated Mar 27, 2024
Python

brfulu / us-accidents-data-engineering

Udacity Data Engineer Nanodegree - Capstone project

aws airflow spark athena s3 datalake

Updated Dec 19, 2019
Python

aws-samples / aws-insurancelake-infrastructure

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake ETL with CDK Pipelines project.

aws insurance datalake cdk

Updated Mar 3, 2024
Python

openEDI / open-data-access-tools

OEDI Data Lake Access

aws datalake nrel renewable-energy open-energy oedi

Updated May 29, 2024
Python

legout / pydala

Poor man's simple Python API for creating a local or remote data lake based on several (PyArrow) datasets using DuckDB.

datalake pyarrow duckdb

Updated Jul 14, 2023
Python

mehroosali / s3-redshift-batch-etl-pipeline

Built a functional Python ETL script with functions that initialize Spark clusters using the PySpark library to extract songs stored in an S3 bucket. Partitioned the song data by year and artist_id and compressed it into Parquet output files to increase load performance. Used the overwrite mode in spark to ensure every new run of ELT script is overwritten in th…

aws airflow sql spark etl analytics s3 python3 pyspark redshift datalake spark-sql airflow-dags

Updated Dec 28, 2021
Python

KleinYuan / llama2-csv-webapp

Self-hosted or locally hosted Llama 2-based web app to chat with your CSVs (multiple).

meta csv pandas openai datalake streamlit large-language-models llm chatgpt pandasai llama2 pandas-ai

Updated Jan 15, 2024
Python

kimtth / pyspark-tika-text-extraction

🚴‍♂️⛷ Data lake: performance tuning for text extraction from a huge number of files.

spark apache-spark multithreading pyspark data-pipeline datalake apache-tika tika-python

Updated Nov 15, 2021
Python

mfilipelino / kafka2hdfs

PySpark streaming from Kafka (0.8.2) to HDFS.

kafka spark spark-streaming hdfs datalake

Updated Dec 13, 2018
Python

edgBR / delta-lake-polars

Building a poor man's data lake: Exploring the Power of Polars and Delta Lake

data-engineering delta datalake delta-lake polars polars-dataframe

Updated Feb 23, 2024
Python

UcheIgbokwe / FormulaOneDataETL

Collection of data on Formula One Racing

python spark databricks datalake azuredatabricks azuredatalakegen2

Updated Dec 21, 2022
Python
