Machine Learning Guide, 24 May 2018

MLA 003 Storage: HDF, Pickle, Postgres

This episode walks through the practical workflow of loading, cleaning, and storing large datasets for machine learning: ingesting raw CSV or JSON files with pandas, then saving processed datasets and neural network weights in HDF5 for efficient numerical storage. It distinguishes among storage options, explaining when to use HDF5, pickle files, or SQL databases, and highlights how libraries like pandas, TensorFlow, and Keras interact with these formats and why the choice matters for production pipelines.

Links

Notes and resources at ocdevel.com/mlg/mla-3

Data Ingestion and Preprocessing

Data Sources and Formats:
- Datasets commonly originate as CSV (comma-separated values), TSV (tab-separated values), fixed-width files (FWF), JSON from APIs, or directly from databases.
- Typical applications include structured data (e.g., real estate features) or unstructured data (e.g., natural language corpora for sentiment analysis).
Pandas as the Core Ingestion Tool:
- Pandas provides versatile functions such as read_csv, read_json, and others to load various file formats with robust options for handling edge cases (e.g., file encodings, missing values).
- After loading, data cleaning is performed using pandas: dropping or imputing missing values, converting booleans and categorical columns to numeric form.
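A minimal sketch of the load-and-clean step described above. The column names and values are hypothetical, and an in-memory string stands in for a CSV file on disk; dropping rows without a price and imputing bedrooms with the median are illustrative choices, not the episode's prescribed method.

```python
import io

import pandas as pd

# Hypothetical raw CSV standing in for a file on disk; "NA" marks missing values.
raw_csv = io.StringIO(
    "price,bedrooms,has_garage,city\n"
    "250000,3,True,Portland\n"
    "NA,2,False,Austin\n"
    "410000,NA,True,Portland\n"
    "320000,5,False,Denver\n"
)

# read_csv accepts a path or a file-like object; na_values lists missing-value markers.
df = pd.read_csv(raw_csv, na_values=["NA"])

# Drop rows that lack the target value, then impute remaining gaps with the median.
df = df.dropna(subset=["price"])
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())

# Convert the boolean column to 0/1 integers.
df["has_garage"] = df["has_garage"].astype(int)

print(df)
```

The city column is still text at this point; turning it into numbers is a separate encoding step.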
Data Encoding for Machine Learning:
- All features must be numerical before being supplied to models built with libraries like TensorFlow or Keras.
- Categorical data is one-hot encoded using pandas.get_dummies, converting strings to binary indicator columns.
- The underlying NumPy array of a DataFrame is accessed via df.values (or df.to_numpy() in newer pandas versions) for direct integration with modeling libraries.
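The encoding step can be sketched as follows. The DataFrame contents are hypothetical; casting to float32 is an assumption about what a downstream neural network library would typically want, not something the episode specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned data with one remaining string column.
df = pd.DataFrame(
    {
        "bedrooms": [3, 2, 4],
        "city": ["Portland", "Austin", "Portland"],
    }
)

# One-hot encode the string column; dtype=float keeps every column numeric.
encoded = pd.get_dummies(df, columns=["city"], dtype=float)

# Pull out the underlying NumPy array for a modeling library.
features = encoded.values.astype(np.float32)

print(encoded.columns.tolist())
print(features.shape)
```

Each distinct city becomes its own indicator column, so the array width grows with the number of categories.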

Numerical Data Storage Options

HDF5 for Storing Processed Arrays:
- HDF5 (Hierarchical Data Format version 5) enables efficient storage of large multidimensional NumPy arrays.
- Libraries like h5py and built-in pandas functions (to_hdf) allow seamless saving and retrieval of arrays or DataFrames.
- Keras, including the Keras API bundled with TensorFlow, can save neural network weights as multidimensional arrays in HDF5 files. Saving weights during training supports checkpointing and early stopping, so a run can be resumed after an interruption or rolled back to an earlier state.
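As a rough illustration of storing processed arrays in HDF5, here is a sketch using h5py. The array shapes, dataset names, and temporary file path are hypothetical. For DataFrames, pandas' to_hdf offers a similar route, but it relies on the PyTables package being installed.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical processed features and labels.
features = np.random.rand(1000, 20).astype(np.float32)
labels = np.random.randint(0, 2, size=1000)

path = os.path.join(tempfile.mkdtemp(), "dataset.h5")

# Write both arrays as named datasets in one file; gzip compression is optional.
with h5py.File(path, "w") as f:
    f.create_dataset("features", data=features, compression="gzip")
    f.create_dataset("labels", data=labels)

# Read back; slicing a dataset loads only the requested rows from disk.
with h5py.File(path, "r") as f:
    first_batch = f["features"][:32]
    restored_labels = f["labels"][:]

print(first_batch.shape, restored_labels.shape)
```

Being able to slice a dataset without reading the whole file is one reason HDF5 suits arrays that are larger than memory.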
Pickle for Python Objects:
- Python's pickle protocol serializes arbitrary objects, including machine learning models and arrays, into files for later retrieval.
- While convenient for quick iterations or heterogeneous data, pickle is less efficient than HDF5 for large NumPy ndarrays, applies no compression on its own, and is a security risk: unpickling a file from an untrusted source can execute arbitrary code.
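A short sketch of a pickle round trip using only the standard library. The artifact dictionary is a hypothetical example of the kind of mixed Python objects that fit pickle better than HDF5.

```python
import os
import pickle
import tempfile

# Hypothetical bundle of heterogeneous Python objects from an experiment.
artifact = {
    "vocabulary": {"good": 0, "bad": 1},
    "column_order": ["bedrooms", "city_Austin", "city_Portland"],
    "threshold": 0.5,
}

path = os.path.join(tempfile.mkdtemp(), "artifact.pkl")

# Serialize the object to disk.
with open(path, "wb") as f:
    pickle.dump(artifact, f, protocol=pickle.HIGHEST_PROTOCOL)

# Only unpickle files you created or trust: loading can run code chosen by the file.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == artifact)
```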
SQL Databases and Spreadsheets:
- For mixed or heterogeneous data, or when results need to be shared, relational databases like PostgreSQL or spreadsheet-friendly files such as CSVs are used.
- Databases serve as the endpoint for production systems, where model outputs—such as generated recommendations or reports—are published for downstream use.
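Publishing model outputs to a SQL table might look like the sketch below. To keep it runnable without a database server, an in-memory SQLite database stands in for PostgreSQL; that substitution, the table name, and the columns are all assumptions. With PostgreSQL, one would typically pass a SQLAlchemy engine to to_sql instead of a sqlite3 connection.

```python
import sqlite3

import pandas as pd

# Hypothetical model outputs: recommended items with scores.
recommendations = pd.DataFrame(
    {"user_id": [1, 1, 2], "item_id": [42, 7, 42], "score": [0.91, 0.84, 0.77]}
)

# Assumption: in-memory SQLite stands in for a production PostgreSQL database.
conn = sqlite3.connect(":memory:")
recommendations.to_sql("recommendations", conn, index=False, if_exists="replace")

# A downstream application can now query the published results with plain SQL.
top = pd.read_sql_query(
    "SELECT user_id, item_id, score FROM recommendations "
    "WHERE score > 0.8 ORDER BY score DESC",
    conn,
)
conn.close()

print(top)
```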

Storage Workflow in Machine Learning Pipelines

Typical Process:
- Data is initially loaded and processed with pandas, then converted to numerical arrays suitable for model training.
- Intermediate states and model weights are saved using HDF5 during model development and training, ensuring recovery from interruptions and facilitating early stopping.
- Final outputs, especially those requiring sharing or production use, are published to SQL databases or shared as spreadsheet files.
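The checkpoint-and-rollback idea can be sketched without a deep learning framework. The example below is a plain NumPy stand-in, not the Keras mechanism itself (Keras provides ModelCheckpoint and EarlyStopping callbacks for this): it trains a tiny linear model on synthetic data, saves the best weights to HDF5 with h5py, stops once validation loss stalls, and reloads the best checkpoint. The data, learning rate, patience, and file name are all hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic regression data: y = X @ true_w + noise.
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

checkpoint = os.path.join(tempfile.mkdtemp(), "best_weights.h5")
w = np.zeros(3)
best_loss, patience, stale = np.inf, 5, 0

for epoch in range(500):
    # One gradient-descent step on mean squared error.
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= 0.05 * grad

    val_loss = float(np.mean((X_val @ w - y_val) ** 2))
    if val_loss < best_loss - 1e-6:
        # Improvement: overwrite the checkpoint with the current weights.
        best_loss, stale = val_loss, 0
        with h5py.File(checkpoint, "w") as f:
            f.create_dataset("w", data=w)
            f.attrs["epoch"] = epoch
    else:
        stale += 1
        if stale >= patience:
            break  # early stopping: no meaningful improvement for `patience` epochs

# Roll back to the best saved weights, as one would after an interruption.
with h5py.File(checkpoint, "r") as f:
    best_w = f["w"][:]
    best_epoch = int(f.attrs["epoch"])

print(best_epoch, best_w)
```

Because the checkpoint lives on disk, the same reload step works whether training stopped early on purpose or was interrupted.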
Best Practices and Progression:
- Quick project starts may involve pickle for accessible storage during early experimentation.
- For large-scale, high-performance applications, migration to HDF5 for numerical data and SQL for production-grade results is recommended.
- Alternative options like Feather and PyTables (an interface on top of HDF5) exist for specialized needs.

Summary

HDF5 is well suited to numerical array storage because of its efficiency, optional built-in compression, and integration with major machine learning frameworks.
Pickle accommodates arbitrary Python objects but is a poor fit for persisting large numerical data and is unsafe to load from untrusted sources.
SQL databases and spreadsheets are used for disseminating results, especially when human consumption or application integration is required.
The selection of a storage format is determined by data type, pipeline stage, and end-use requirements within machine learning workflows.
