CSV Cleaner & Normalizer
Clean messy CSV files, normalize headers, deduplicate rows, and fix encodings — all in your browser.
Features
- Multi-step cleaning pipeline applied in declared order: trim whitespace, lowercase headers, dedupe rows by full-row hash or selected columns, drop empty rows/columns, normalize encoding, replace specific values, sort rows by any column
- Header operations: rename, reorder, drop unwanted columns, and lowercase/normalize header names (replace spaces and hyphens with underscores) for SQL-compatible output
- Smart deduplication: hash entire rows or only selected columns, with optional case-insensitive and whitespace-insensitive comparison so visually-identical rows collapse correctly
- Encoding detection and fix: identifies UTF-8 BOM, Windows-1252, and Latin-1 inputs and re-encodes to UTF-8, which fixes the stray BOM characters in front of the first header (shown as `?Name` or `ï»¿Name`) and the `Â` artifacts from misencoded Excel exports
- Value normalization: trim whitespace per cell, collapse internal whitespace, replace nulls/empty strings with a literal token (NULL, N/A, etc.), and apply a configurable date format to recognized date columns
- Preview pane shows the first 50 rows after each step so you can verify the pipeline behaves correctly before exporting the full dataset
- Pipeline runs in a Web Worker for large files so the UI stays responsive — the spinner indicates active processing rather than a frozen tab
- Pure client-side: every step runs in your browser, no upload happens. Works offline once cached, safe for confidential data that shouldn't leave your machine
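The "applied in declared order" idea from the first feature can be sketched in a few lines. This is a minimal illustration in Python, not the tool's actual browser code; the step names (`trim_cells`, `drop_empty_rows`, `dedupe_full_row`) and `run_pipeline` are hypothetical, and rows are assumed to be data rows only (no header).

```python
from typing import Callable

Row = list[str]
Step = Callable[[list[Row]], list[Row]]

def trim_cells(rows: list[Row]) -> list[Row]:
    """Strip leading/trailing whitespace from every cell."""
    return [[cell.strip() for cell in row] for row in rows]

def drop_empty_rows(rows: list[Row]) -> list[Row]:
    """Remove rows whose cells are all empty strings."""
    return [row for row in rows if any(cell != "" for cell in row)]

def dedupe_full_row(rows: list[Row]) -> list[Row]:
    """Keep the first occurrence of each distinct row."""
    seen: set[tuple[str, ...]] = set()
    kept: list[Row] = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

def run_pipeline(rows: list[Row], steps: list[Step]) -> list[Row]:
    """Apply each step in the order it was declared."""
    for step in steps:
        rows = step(rows)
    return rows

rows = [["a", "1"], [" a ", "1"], ["", ""], ["b", "2"]]

# Trim first, so the two "a" rows become identical before dedupe runs.
cleaned = run_pipeline(rows, [trim_cells, drop_empty_rows, dedupe_full_row])
print(cleaned)  # [['a', '1'], ['b', '2']]

# Same steps, dedupe first: the untrimmed duplicate survives.
reordered = run_pipeline(rows, [dedupe_full_row, trim_cells, drop_empty_rows])
print(reordered)  # [['a', '1'], ['a', '1'], ['b', '2']]
```

The second call shows why step order matters: the same three steps give a different result when dedupe runs before trim.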
How to use
- Drop or paste your CSV into the input pane; the preview shows the first rows immediately.
- Configure the pipeline: pick the steps you need (trim, dedupe, drop empty, normalize headers, etc.) and arrange them in execution order.
- For each step, set its options — e.g. dedupe by which columns, what to replace nulls with, which columns to drop.
- Watch the preview after each step to confirm the data looks correct before committing to a full run.
- Click Run to process the full file. The Web Worker chews through millions of rows without freezing the UI; the spinner shows progress.
- Download the cleaned CSV. The output preserves your chosen delimiter and line-ending; re-import into your spreadsheet or database confidently.
Tips & Best Practices
- Build the pipeline incrementally: add one step, check the preview, add the next. This is faster than configuring everything upfront and debugging.
- For dedupe-heavy workflows, run trim-whitespace FIRST so visually-identical rows collapse correctly.
- For database imports, end the pipeline with "lowercase headers" + "replace spaces with underscores" so column names become SQL-compatible identifiers.
- Use the preview's row count display to detect when a step accidentally drops too many rows (e.g. an overly-aggressive dedupe column choice).
- Pair with the CSV ↔ JSON Converter when the destination format is JSON; clean here, convert there.
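For the database-import tip, here is a rough Python sketch of what header normalization amounts to. It is an illustration only: `normalize_header` is a hypothetical name, and collapsing runs of spaces or hyphens into a single underscore is an assumption rather than documented tool behavior.

```python
import re

def normalize_header(name: str) -> str:
    """Lowercase a header and replace runs of spaces/hyphens with one underscore."""
    name = name.strip().lower()
    return re.sub(r"[ \-]+", "_", name)

headers = ["First Name", "Sign-Up Date", " Total  Amount "]
print([normalize_header(h) for h in headers])
# ['first_name', 'sign_up_date', 'total_amount']
```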
FAQ
How does deduplication decide which row to keep when there are duplicates?
Default behavior keeps the FIRST occurrence and drops the rest. This is suitable for most cleaning workflows where the leading row in a sorted dataset is the canonical record. If you need to keep the last, sort the data descending by your timestamp column first, then dedupe. The tool doesn't expose a "keep last" option directly because the equivalent is one sort + one dedupe.
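The "sort descending, then dedupe" trick can be sketched like this in Python. It is a minimal illustration, not the tool's code; the column names (`email`, `updated`, `plan`) and `dedupe_keep_first` are made up for the example, and the timestamps are assumed to be ISO dates so that string sorting matches chronological order.

```python
from operator import itemgetter

rows = [
    {"email": "a@example.com", "updated": "2024-01-05", "plan": "free"},
    {"email": "b@example.com", "updated": "2024-02-01", "plan": "pro"},
    {"email": "a@example.com", "updated": "2024-03-10", "plan": "pro"},
]

def dedupe_keep_first(rows, columns):
    """Keep the first row seen for each combination of values in `columns`."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Newest rows first, so "keep first" ends up keeping the latest record.
newest_first = sorted(rows, key=itemgetter("updated"), reverse=True)
latest = dedupe_keep_first(newest_first, ["email"])

print([(r["email"], r["updated"]) for r in latest])
# [('a@example.com', '2024-03-10'), ('b@example.com', '2024-02-01')]
```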
Why does my dedupe still leave seemingly-identical rows?
Whitespace differences and case differences count as not-identical by default. Toggle ignore-whitespace and case-insensitive in the dedupe step to collapse rows that look the same but have stray spaces or capitalization. Hidden Unicode characters (zero-width spaces, NBSPs from copy-paste) are normalized automatically.
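To make the comparison rules concrete, here is a hedged Python sketch of a dedupe key that treats case, whitespace, and hidden characters as described above. The function name `comparison_key`, its flags, and the exact set of zero-width characters are assumptions for illustration; the tool's real normalization may differ.

```python
import re

# Zero-width space, zero-width non-joiner, zero-width joiner, and U+FEFF.
# Mapping each code point to None makes str.translate delete it.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def comparison_key(row, ignore_case=True, ignore_whitespace=True):
    """Build the tuple that two rows must share to count as duplicates."""
    key = []
    for cell in row:
        cell = cell.translate(ZERO_WIDTH)   # always remove zero-width characters
        cell = cell.replace("\u00a0", " ")  # always turn NBSP into a plain space
        if ignore_whitespace:
            cell = re.sub(r"\s+", " ", cell).strip()
        if ignore_case:
            cell = cell.casefold()
        key.append(cell)
    return tuple(key)

a = ["Alice ", "New\u00a0York"]
b = ["alice", "new  york\u200b"]

print(comparison_key(a) == comparison_key(b))  # True
print(comparison_key(a, ignore_case=False) == comparison_key(b, ignore_case=False))  # False
```

With both toggles on, the two rows collapse into one; with case sensitivity left at its default, they stay separate.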
How does encoding fix work?
The tool reads the file as an ArrayBuffer, checks for a BOM signature, and then runs a heuristic on the byte distribution to guess between Windows-1252, Latin-1, and UTF-8 without BOM. If the guess looks wrong (you see `Â` characters where there should be accented letters), set the encoding explicitly in the import dialog. Output is always UTF-8 without BOM.
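The detection order can be approximated in Python as follows. This is a simplified sketch, not the tool's heuristic: instead of analyzing the byte distribution, it tries a strict UTF-8 decode and falls back to the legacy encodings. The function name `decode_csv_bytes` is hypothetical.

```python
import codecs

def decode_csv_bytes(data: bytes) -> tuple[str, str]:
    """Return (text, encoding_used) for raw CSV bytes."""
    # 1. A UTF-8 BOM (EF BB BF) settles the question; strip it while decoding.
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig"), "utf-8-sig"
    # 2. Legacy single-byte text with accented letters is rarely valid UTF-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Bytes 0x80-0x9F are control codes in Latin-1 but printable characters
    #    (curly quotes, euro sign) in Windows-1252, so their presence hints at cp1252.
    if any(0x80 <= b <= 0x9F for b in data):
        try:
            return data.decode("cp1252"), "cp1252"
        except UnicodeDecodeError:
            pass  # a few bytes such as 0x81 are undefined in cp1252
    # 4. Latin-1 maps every byte to a character, so this never fails.
    return data.decode("latin-1"), "latin-1"

text, used = decode_csv_bytes(b"\xef\xbb\xbfName,City\n")
print(repr(text), used)  # 'Name,City\n' utf-8-sig

# Re-encode for output: UTF-8 without BOM.
output_bytes = text.encode("utf-8")
```

The `Â` symptom comes from the opposite mistake, reading UTF-8 bytes as Windows-1252: a non-breaking space is stored as the bytes C2 A0 in UTF-8, and C2 on its own displays as `Â`.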
Can I undo a pipeline step?
Remove the step from the pipeline and re-run; the input file is preserved in memory. The tool doesn't maintain per-step undo because the pipeline is declarative — changing a step's order or options is the same as undoing and redoing with new settings.
What happens to malformed rows (wrong field count)?
A configurable step decides: drop the row (default), pad missing fields with empty values, or truncate extra fields. The pipeline's validation step also surfaces the rows so you can see what was malformed before deciding.
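The three options can be sketched in Python like this. It is an illustration under assumptions: `fix_field_count` and its `mode` values are hypothetical names, and the sketch assumes that "pad" only repairs short rows and "truncate" only repairs long rows, with anything else dropped.

```python
def fix_field_count(rows, expected, mode="drop"):
    """Return (fixed_rows, malformed) where malformed lists (index, original_row)."""
    fixed, malformed = [], []
    for index, row in enumerate(rows):
        if len(row) == expected:
            fixed.append(row)
            continue
        malformed.append((index, row))  # surface the row for review
        if mode == "pad" and len(row) < expected:
            fixed.append(row + [""] * (expected - len(row)))
        elif mode == "truncate" and len(row) > expected:
            fixed.append(row[:expected])
        # any other case: the row is dropped

    return fixed, malformed

rows = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]

print(fix_field_count(rows, 3)[0])              # [['a', 'b', 'c']]
print(fix_field_count(rows, 3, "pad")[0])       # [['a', 'b', 'c'], ['d', 'e', '']]
print(fix_field_count(rows, 3, "truncate")[0])  # [['a', 'b', 'c'], ['f', 'g', 'h']]
```

The second return value lists every malformed row with its position, which mirrors the idea of reviewing them before choosing a mode.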
How big a file can it handle?
Designed for files up to ~500 MB. Larger files start hitting browser memory limits; for billion-row CSVs use a CLI tool (csvkit, Miller, or a Python/pandas script). The Web Worker keeps the UI responsive but doesn't magically expand RAM.
Is the data sent anywhere?
No. Every step runs in your browser; the Web Worker is a separate JavaScript context within the same origin, not a remote server. You can check the DevTools Network tab to confirm that no requests are made during a clean. Safe for PII, financial, or competitive data.
Can I save my pipeline configuration for reuse?
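Not yet; pipelines reset each visit. If you run the same clean regularly, write down the steps and their options in a shared doc (CSV has no standard comment syntax, so a note placed inside the file would be read as a data row). Recreating the pipeline takes about 30 seconds.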