The No-Upload AI Analyst (V4: Secure Mode)
How I analyze sensitive data with AI—without uploading a single row (plus a copy-paste prompt for non-tech users and code you can run locally)
TL;DR: I rebuilt my “AI that thinks like an analyst” so the model never sees raw rows. You paste only headers or a tiny sample, apply a policy preset (Hash/Mask/Redact), and run analysis on the safe table—locally. This post gives you (1) a copy-paste prompt for non-technical folks and (2) code for a local notebook or Streamlit app to transform sensitive columns and produce a data-handling report for ops/legal.
Copy-Paste This Prompt First (for non-technical users)
The No-Upload AI Analyst — Master Prompt
You are my no-upload AI analyst. You must never ask for or require file uploads. You will work from column names, a few header rows or a tiny synthetic sample, and my business question.
Rules & Privacy Defaults
Assume PII may exist. Default to Hash (HMAC-SHA-256) for emails/phones/IDs, Mask for names/addresses, and Redact for SSNs and any fields we don’t need.
Prefer client-side or local steps (no external uploads).
Ask for schema first (column names, types, 1–5 synthetic rows if needed).
When suggesting analyses, operate on transformed/safe columns, not raw values.
Provide a one-page data-handling summary: Column → Action (Hash/Mask/Redact/Keep) → Reason.
Workflow
Ask me for:
My business question (e.g., “Why did churn spike in May?”)
Schema (column names like: user_id, email, signup_date, plan, country, churned, phone)
Optional tiny sample or synthetic rows (5 rows max), or headers only if data is sensitive.
Based on the schema, propose a privacy preset:
Analytics-Safe (hash IDs/emails/phones; mask names/addresses; keep dates or bucket by month; redact SSN)
Marketing-Safe (stricter on addresses/phones)
HIPAA-like Strict (example only, not legal advice; heavy redaction)
Produce the data-handling plan (per column) and confirm with me.
Generate analysis steps that work on safe columns only. Use schema-aware reasoning (e.g., “Split churn by plan & country,” “Cohort by signup month,” “Compare activated vs not activated”).
Give me a short checklist and sanity tests to validate results locally.
Tone & Format
Use plain English, short steps, and bullet points.
Give me 2–3 options for each decision so I can choose.
Include mini-exercises I can do with just headers or a tiny fake sample.
Do not ask me to upload files or paste large datasets. Operate with schema and small, optional synthetic rows only.
One-Line “Pocket” Version:
“Be a no-upload AI analyst: work only from column names + a tiny sample (or none), default to Hash/Mask/Redact presets, produce a per-column data-handling plan and safe analysis steps—no file uploads.”
The (Very Human) Story Behind This Post
Three months ago I almost shipped a “clever” feature—drag-and-drop CSVs into my AI app, instant EDA. Then I pictured the real CSVs people carry around: hospital claims, student records, purchase histories. I imagined someone dragging a million rows into my app from café Wi-Fi. My stomach dropped. Not because my app couldn’t handle it, but because I shouldn’t.
So I didn’t kill the app—I killed uploads.
Instead, I rebuilt the whole thing around a question that now sits on a sticky note above my desk:
“What if your AI analyst never sees your data—but still helps you analyze it like a pro?”
That’s this post. It’s the recipe I wish I had when I started: no uploads, client-side transforms, policy presets, and an audit-ready report. You can use it at a hospital, a bank, a school—or in your tiny startup where you simply don’t want customer emails floating around in the void.
And yes: it still works. Because thinking like an analyst doesn’t start with rows. It starts with good questions, clear schema, and safe defaults.
The Rules-First Mindset: Three Privacy Moves
If you remember nothing else, remember these three verbs:
Hash — turn a value into a code, joinable but not readable
Use HMAC SHA-256 (hash + secret).
Emails/phones/IDs become stable codes → you can still join tables across systems without exposing raw values.
Same input + same secret = same digest.
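In Python, the hash move is a few standard-library lines (a minimal sketch; the full implementation appears in the code section at the end of this post):

```python
import hmac
import hashlib

def hash_pii(value: str, secret: str) -> str:
    # HMAC-SHA-256: stable and joinable, but not readable or reversible.
    return hmac.new(secret.encode("utf-8"),
                    value.strip().encode("utf-8"),
                    hashlib.sha256).hexdigest()

a = hash_pii("alice@example.com", "my-strong-secret")
b = hash_pii("alice@example.com", "my-strong-secret")
c = hash_pii("alice@example.com", "different-secret")
print(a == b)  # True: same input + same secret = same digest
print(a == c)  # False: a different secret breaks joinability
```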
Mask — blur the details, keep the shape
“a***@example.com”, “***-***-1234”, “E**** J***”
Useful for UI or human review: you can spot patterns without the secrets.
Redact — delete it
If you don’t need SSN, drop it. Redaction shrinks the blast radius of any future breach.
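Mask is just as small. A sketch of two masking helpers (the full versions live in `mask_value` in the code section):

```python
import re

def mask_email(s: str) -> str:
    # Keep the first character and the domain: a***@example.com
    local, _, domain = s.partition("@")
    return (local[:1] + "***@" + domain) if domain else "***"

def mask_phone(s: str) -> str:
    # Keep only the last 4 digits: ***-***-1234
    digits = re.sub(r"\D", "", s)
    return "***-***-" + digits[-4:]

print(mask_email("alice@example.com"))  # a***@example.com
print(mask_phone("+1 (555) 111-2222"))  # ***-***-2222
```

Redact needs no helper at all: `df.drop(columns=["ssn"])` and it’s gone.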
A friend once said, “Privacy is like seatbelts. Boring until you need it.” Make these three moves your seatbelts.
Why No-Upload? Because Your Stakeholders Want Insights, Not Risk
In regulated teams (healthcare/finance/education) and, honestly, in most companies now, uploading raw files into random tools is a non-starter. Even when tools promise encryption, you still face:
Policy friction: Legal reviews, vendor security forms, data transfer agreements.
Operational risk: Misconfigured buckets, lax logging, “small” exceptions that become the headline.
Human error: Wrong file dragged to the wrong window.
The no-upload pattern sidesteps all of it. Keep data local; transform to a safe table; analyze; move on.
Your Default Policy (Steal This)
Analytics-Safe (good general default)
Hash: emails, phones, user/account IDs, device IDs
Mask: names, street addresses
Keep: dates (or bucket by month), aggregated counts, categories
Redact: SSN, passport, payment tokens, precise coordinates
Logs: contain nothing sensitive
Report: include a per-column action + reason
Marketing-Safe (stricter on contact info)
Everything in Analytics-Safe, plus:
Tighter masking on addresses/phones; consider redacting phone entirely unless truly needed.
“HIPAA-like” Strict (example only; not legal advice)
Heavy redaction on direct identifiers
Dates bucketed or coarsened
Free-text fields risk-scanned or excluded
You don’t need to be a compliance wizard to start. You just need sensible defaults and a habit of explaining your decisions in one page.
The 5-Step No-Upload Workflow
Paste tiny sample (or just headers)
Example schema:
user_id, email, signup_date, country, churned, phone
Optional: 3–5 synthetic rows (fake emails/phones; real formats, not real people).
Pick a preset
Analytics-Safe, Marketing-Safe, or HIPAA-like (example only).
Set your secret (for hashing)
Use a strong passphrase; store it safely. Same secret → stable joins.
Transform
Email/Phone/IDs → HMAC hash
Names/Addresses → Mask
SSN → Redact
Dates → Keep or Month bucket
Get two outputs
Safe Table: the one you analyze
Data-Handling Report (one page): Column → Action → Reason
Then proceed with analysis on the safe table. The model never saw raw PII.
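For the secret in step 3, Python’s `secrets` module is one reasonable way to generate a strong value (store it in a password manager, never in code or logs):

```python
import secrets

# Generate once, store securely, and reuse the same secret everywhere
# you need hashed keys to join across tables.
secret = secrets.token_urlsafe(32)
print(secret)  # a ~43-character URL-safe random string
```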
“But Can AI Still Help Without My Rows?”
Yes. Because analyst thinking = schema + question + constraints.
Give AI:
Column names (and types if you know them)
The business question
Any guardrails (“We measure churn as 45-day inactivity,” “Activation = created 3 dashboards,” etc.)
AI can then propose valid analysis steps based on familiar patterns:
Churn diagnosis: split by plan, country, acquisition channel
Cohorts: group users by signup month, compare retention
Activation: compare activated vs not activated users
Breakage points: payment failures, support tickets around key dates
It’s how senior analysts work. They don’t need your raw table to sketch the plan.
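As a concrete sketch, the churn split above is plain pandas once you have the safe table (column names follow the example schema; the rows are synthetic):

```python
import pandas as pd

# Safe table: IDs already hashed, no raw PII needed for this analysis.
df_safe = pd.DataFrame({
    "plan":    ["free", "free", "pro", "pro", "pro"],
    "country": ["US", "CA", "US", "US", "CA"],
    "churned": [1, 0, 0, 1, 0],
})

# Churn rate by plan and country -- no identifiers required.
churn_by_segment = (
    df_safe.groupby(["plan", "country"])["churned"]
           .mean()
           .rename("churn_rate")
           .reset_index()
)
print(churn_by_segment)
```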
Mini-Exercises (You Can Do With Only Headers)
Exercise 1 — Pick the move
email → Hash
phone → Hash
user_id → Hash
full_name → Mask
street_address → Mask or Redact (depends)
ssn → Redact
signup_date → Keep or bucket by month
plan → Keep
country → Keep
Exercise 2 — Cohort thinking
With just signup_date and churned, you can still plan a retention chart by month. You don’t need names.
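Exercise 2 as a sketch, with synthetic rows matching that schema:

```python
import pandas as pd

df = pd.DataFrame({
    "signup_date": ["2025-01-05", "2025-01-20", "2025-02-02", "2025-02-15"],
    "churned":     [1, 0, 0, 1],
})

# Bucket to YYYY-MM and compute churn rate per signup cohort.
df["cohort"] = pd.to_datetime(df["signup_date"]).dt.strftime("%Y-%m")
cohort_churn = df.groupby("cohort")["churned"].mean()
print(cohort_churn)
# Each cohort here has one churner out of two signups.
```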
Exercise 3 — Risk sweep
Find free-text columns (e.g., notes, comments). Decide: exclude them, mask patterns (emails/phones), or redesign the workflow to avoid them.
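A risk sweep over free-text columns can be a simple pattern scan (a sketch; the regexes mirror the detection heuristics in the code section):

```python
import re
import pandas as pd

EMAIL_RX = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE_RX = re.compile(r"\+?\d[\d\-\s\(\)]{7,}\d")

notes = pd.Series([
    "Customer asked for refund",
    "Call back at +1 (555) 111-2222",
    "Escalated; contact alice@example.com",
])

def has_pii(text: str) -> bool:
    # Flag any free text containing an email or phone-like pattern.
    return bool(EMAIL_RX.search(text) or PHONE_RX.search(text))

risky = notes[notes.apply(has_pii)]
print(risky)  # rows 1 and 2 are flagged
```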
FAQs (In Plain English)
Q: Why HMAC instead of plain SHA-256?
A: The secret (“key”) makes it hard to guess or reverse with dictionary attacks. Same input + same secret = same digest → you can join safely.
Q: Can we still join tables?
A: Yes—if both sides hash the same field with the same secret and method.
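That answer in code: two tables hashed with the same secret still join (a sketch; the table and column names are illustrative):

```python
import hmac
import hashlib
import pandas as pd

def h(value: str, secret: str) -> str:
    return hmac.new(secret.encode(), value.strip().encode(), hashlib.sha256).hexdigest()

SECRET = "shared-team-secret"  # the SAME secret on both sides

crm   = pd.DataFrame({"email": ["alice@example.com", "bob@example.com"], "plan": ["pro", "free"]})
usage = pd.DataFrame({"email": ["bob@example.com", "alice@example.com"], "events": [12, 40]})

crm["email_h"]   = crm["email"].apply(lambda v: h(v, SECRET))
usage["email_h"] = usage["email"].apply(lambda v: h(v, SECRET))

# Join on the hashed key only; drop raw emails before sharing.
joined = crm.drop(columns=["email"]).merge(usage.drop(columns=["email"]), on="email_h")
print(joined[["plan", "events"]])
```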
Q: Are dates safe?
A: Often yes, but consider bucketing by month (or coarser) when precise dates could identify someone.
Q: Is this legally compliant?
A: This is sensible defaults, not legal advice. Use it to get 80% of the way there, then align with your legal/privacy team.
Q: Where should I run this?
A: Locally on your machine for maximum privacy. Avoid cloud pasting for sensitive content.
Q: What about logs?
A: Your logs should contain nothing sensitive. Log actions, not raw values.
A One-Page Data-Handling Report (Template)
Project: Monthly churn spike analysis
Preset: Analytics-Safe
Secret Handling: Managed locally by analyst; not logged

| Column | Detected Type | Action | Reason | Notes |
|---|---|---|---|---|
| user_id | identifier | Hash | Joinable without exposure | HMAC SHA-256 |
| email | email | Hash | PII; needed for joins only | HMAC SHA-256 |
| phone | phone | Hash | PII; not needed in UI | HMAC SHA-256 |
| full_name | name | Mask | Human-readable shape for review | e.g., “A**** K****” |
| street_address | address | Mask | Pattern recognition without exact address | Consider redaction if unneeded |
| ssn | ssn | Redact | Not required for analysis | Dropped before analysis |
| signup_date | date | Keep | Needed for cohorts | Bucket by month if sharing widely |
| country | category | Keep | OK for segmentation | |
| churned | boolean | Keep | Outcome variable | |
Logging: No sensitive data logged.
Reviewer: (Ops/Legal)
Date:
Use this as a PDF you commit with your analysis. It builds trust.
Sanity Checks (So You Don’t Fool Yourself)
Digest stability: Re-hash the same email twice → digests match.
Masking consistency: Every name follows the same mask pattern.
Redaction completeness: Columns marked “redact” are gone before analysis.
No-PII joins: All joins happen on hashed keys only.
Logs audit: Search logs for “@” or 10-digit strings—you should find none.
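Several of these checks are one-liners worth automating (a sketch; assumes a transformed DataFrame and your log text as inputs):

```python
import re
import pandas as pd

def sanity_checks(df_safe: pd.DataFrame, log_text: str, redacted_cols: list) -> None:
    # Redaction completeness: redacted columns must be gone before analysis.
    for col in redacted_cols:
        assert col not in df_safe.columns, f"{col} should have been dropped"
    # Logs audit: no emails or long digit runs should appear in logs.
    assert not re.search(r"@", log_text), "possible email in logs"
    assert not re.search(r"\d{10}", log_text), "possible phone number in logs"

df_safe = pd.DataFrame({"user_id": ["9f2a"], "country": ["US"]})
sanity_checks(df_safe, log_text="transformed 3 columns", redacted_cols=["ssn"])
print("all sanity checks passed")
```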
Real-World Patterns (Copy These)
Healthcare
Patient IDs → Hash
Dates → Keep but consider month bucket
Free-text notes → exclude or pattern-mask emails/phones
Result: you can still analyze LOS, readmission, pathways—without names.
Education
Student IDs/emails → Hash
Names/addresses → Mask
Cohorts by term/grade → fully possible with safe data
Marketing
Emails → Hash for join with CRM
Phones → Consider redaction unless absolutely needed
Campaign reporting works with aggregates—no raw contact exposure
Product Analytics
User/account IDs → Hash
Features/dates/events → Keep
Complete retention funnels and activation analyses—no PII required
A Short “Do This Today” Checklist
Paste headers + 3–5 synthetic rows
Pick Analytics-Safe preset
Set a strong secret (don’t log it)
Transform → Safe Table + Report
Analyze on the safe table only
Ship the data-handling report with your results
Tape this near your monitor.
The Quiz (Because Repetition = Retention)
Best default for ssn? Redact.
Why HMAC vs a plain hash? Harder to guess or reverse.
Need to join by email? HMAC both sides with the same secret.
Show phones in UI but hide most digits? Mask.
Are 5 synthetic rows OK for testing transforms? Yes.
Which is not PII by itself? A month bucket.
Different HMAC secrets across teams? Hashes won’t match.
Very short names, even when masked? Re-ID risk remains; handle carefully.
Best place to run a paste-only app? Your local machine.
What should logs contain? Nothing sensitive.
Why I Care (And Why You Might, Too)
I’ve worked in teams where people worry they’ll “break the rules” just by doing their jobs. That’s a terrible feeling. It slows good work and doesn’t actually make anyone safer.
This approach—no uploads, safe transforms, one-page report—lets you move fast and sleep at night. It’s respect for your customers, your teammates, and your future self who has to explain the decisions six months from now.
If this helped, share it with one teammate who touches sensitive data. You might save them a headache and a compliance email.
The Code
What you get below:
A. Notebook recipe (pandas) — transforms + one-page report
B. Streamlit local app — paste headers/sample → choose preset → download safe CSV + report
Install once:
pip install pandas python-slugify streamlit
(No internet needed to run. Do this locally.)
A) Notebook Recipe — Safe Transforms + Report (pandas)
# safe_transforms.py
# Run locally. Do NOT upload sensitive data anywhere.
# Usage:
# from safe_transforms import transform_dataframe, presets, write_report_markdown
# df_safe, actions = transform_dataframe(df_raw, preset=presets["analytics_safe"], secret="YOUR-STRONG-SECRET")
# write_report_markdown(actions, "data_handling_report.md")
import re
import hmac
import hashlib
import pandas as pd
from typing import Dict, Tuple, List, Any
# ----------------------
# 1) Detection heuristics
# ----------------------
EMAIL_RX = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RX = re.compile(r"(\+?\d[\d\-\s\(\)]{7,}\d)")
SSN_RX = re.compile(r"^\d{3}-?\d{2}-?\d{4}$", re.IGNORECASE)
def looks_like_email(col_name: str, sample: List[str]) -> bool:
    if "email" in col_name.lower():
        return True
    return any(bool(EMAIL_RX.match(str(x).strip())) for x in sample[:5])

def looks_like_phone(col_name: str, sample: List[str]) -> bool:
    if any(k in col_name.lower() for k in ["phone", "mobile", "tel"]):
        return True
    return any(bool(PHONE_RX.search(str(x))) for x in sample[:5])

def looks_like_ssn(col_name: str, sample: List[str]) -> bool:
    if "ssn" in col_name.lower():
        return True
    return any(bool(SSN_RX.match(str(x))) for x in sample[:5])

def looks_like_id(col_name: str) -> bool:
    return any(k in col_name.lower() for k in ["id", "guid", "uuid", "user_id", "account_id", "device_id"])

def looks_like_name(col_name: str) -> bool:
    return any(k in col_name.lower() for k in ["name", "first_name", "last_name", "fullname", "full_name"])

def looks_like_address(col_name: str) -> bool:
    return any(k in col_name.lower() for k in ["address", "addr", "street", "zipcode", "zip", "postal", "city"])

def looks_like_date(col_name: str) -> bool:
    return any(k in col_name.lower() for k in ["date", "dob", "signup", "created", "timestamp"])
# ----------------------
# 2) Actions
# ----------------------
def hmac_sha256(value: Any, secret: str) -> str:
    if pd.isna(value):
        return ""
    msg = str(value).strip().encode("utf-8")
    key = secret.encode("utf-8")
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def mask_value(value: Any, mode: str = "generic") -> str:
    if pd.isna(value):
        return ""
    s = str(value)
    if mode == "email":
        # a***@example.com
        parts = s.split("@")
        if len(parts) == 2 and parts[0]:
            prefix = parts[0][0]
            return prefix + "***@" + parts[1]
        return "***"
    if mode == "phone":
        # ***-***-1234 (keep last 4)
        digits = re.sub(r"\D", "", s)
        last4 = digits[-4:] if len(digits) >= 4 else digits
        return "***-***-" + last4
    if mode == "name":
        # E**** J***
        tokens = s.split()
        masked = []
        for t in tokens:
            if len(t) == 1:
                masked.append("*")
            elif len(t) == 2:
                masked.append(t[0] + "*")
            else:
                masked.append(t[0] + "*" * (len(t) - 1))
        return " ".join(masked)
    if mode == "address":
        # Keep city/state/zip hints minimal (very conservative placeholder)
        return "[MASKED ADDRESS]"
    # generic fallback
    if len(s) <= 2:
        return "*" * len(s)
    return s[0] + "*" * (len(s) - 2) + s[-1]

def bucket_month(value: Any) -> str:
    try:
        dt = pd.to_datetime(value, errors="coerce")
        if pd.isna(dt):
            return ""
        return dt.strftime("%Y-%m")
    except Exception:
        return ""
# ----------------------
# 3) Presets (policy as code)
# ----------------------
presets: Dict[str, Dict[str, Any]] = {
    "analytics_safe": {
        "hash": ["email", "phone", "id"],
        "mask": ["name", "address"],
        "redact": ["ssn"],
        "dates": "keep_or_month"  # 'keep', 'month', or 'keep_or_month'
    },
    "marketing_safe": {
        "hash": ["email", "id"],
        "mask": ["name", "address", "phone"],
        "redact": ["ssn"],
        "dates": "month"
    },
    "hipaa_like_strict": {  # example only; not legal advice
        "hash": ["id"],
        "mask": ["name"],
        "redact": ["email", "phone", "ssn", "address"],
        "dates": "month"
    }
}
# ----------------------
# 4) Main transform
# ----------------------
def detect_semantic(col: str, series: pd.Series) -> str:
    s = series.dropna().astype(str).tolist()
    if looks_like_ssn(col, s):
        return "ssn"
    if looks_like_email(col, s):
        return "email"
    # Check (name-based) dates before phones: ISO dates like "2025-01-05"
    # would otherwise match the phone pattern and get hashed.
    if looks_like_date(col):
        return "date"
    if looks_like_phone(col, s):
        return "phone"
    if looks_like_id(col):
        return "id"
    if looks_like_name(col):
        return "name"
    if looks_like_address(col):
        return "address"
    return "other"
def transform_dataframe(
    df: pd.DataFrame,
    preset: Dict[str, Any],
    secret: str,
    month_bucket_cols: List[str] = None
) -> Tuple[pd.DataFrame, List[Dict[str, str]]]:
    """
    Returns (safe_df, actions) where actions is a list of per-column dicts for reporting.
    """
    month_bucket_cols = month_bucket_cols or []
    out = df.copy()
    actions = []
    for col in df.columns:
        semantic = detect_semantic(col, df[col])
        action = "keep"
        reason = "Not sensitive or needed as-is"
        # Decide action from preset
        if semantic in preset.get("redact", []):
            action, reason = "redact", f"{semantic} redacted per preset"
            out.drop(columns=[col], inplace=True)
        elif semantic in preset.get("hash", []):
            action, reason = "hash", f"{semantic} hashed for joinability without exposure"
            out[col] = out[col].apply(lambda v: hmac_sha256(v, secret))
        elif semantic in preset.get("mask", []):
            action, reason = "mask", f"{semantic} masked to keep human-readable shape"
            mode = semantic if semantic in ("email", "phone", "name", "address") else "generic"
            out[col] = out[col].apply(lambda v: mask_value(v, mode))
        elif semantic == "date":
            dmode = preset.get("dates", "keep_or_month")
            if dmode == "month" or (dmode == "keep_or_month" and col in month_bucket_cols):
                action, reason = "month_bucket", "Date bucketed to YYYY-MM for sharing"
                out[col] = out[col].apply(bucket_month)
            else:
                action, reason = "keep", "Date kept (consider bucketing when sharing broadly)"
        else:
            # other = keep
            action, reason = "keep", "Kept (no direct PII detected)"
        actions.append({
            "column": col,
            "semantic": semantic,
            "action": action,
            "reason": reason
        })
    return out, actions
# ----------------------
# 5) Report writer
# ----------------------
def write_report_markdown(actions: List[Dict[str, str]], path: str, title="Data Handling Report", project=""):
    lines = [f"# {title}", ""]
    if project:
        lines.append(f"**Project:** {project}")
        lines.append("")
    lines.append("| Column | Detected Type | Action | Reason |")
    lines.append("|---|---|---|---|")
    for a in actions:
        lines.append(f"| {a['column']} | {a['semantic']} | {a['action']} | {a['reason']} |")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
# Example tiny run (with synthetic data)
if __name__ == "__main__":
    data = {
        "user_id": [1, 2, 3],
        "email": ["alice@example.com", "bob@example.com", "charlie@example.com"],
        "signup_date": ["2025-01-05", "2025-01-14", "2025-02-02"],
        "country": ["US", "CA", "US"],
        "churned": [0, 1, 0],
        "phone": ["+1 (555) 111-2222", "+1 (555) 333-4444", "+1 (555) 555-6666"],
        "ssn": ["123-45-6789", "987-65-4321", "111-22-3333"],
        "full_name": ["Alice K", "Bob M", "Charlie N"],
        "street_address": ["123 Pine St", "45 Maple Ave", "9 Oak Blvd"]
    }
    df = pd.DataFrame(data)
    df_safe, actions = transform_dataframe(
        df, presets["analytics_safe"],
        secret="CHANGE-ME-SECRET",
        month_bucket_cols=["signup_date"]
    )
    print(df_safe.head())
    write_report_markdown(actions, "data_handling_report.md", project="Demo: No-Upload AI Analyst")
B) Streamlit Local App — Paste → Preset → Secret → Transform → Download
# app.py
# Local-only Streamlit app for no-upload transforms + one-page report.
# Run: streamlit run app.py
import io
import pandas as pd
import streamlit as st
from safe_transforms import transform_dataframe, presets
st.set_page_config(page_title="No-Upload AI Analyst — Secure Mode", layout="centered")
st.title("🔐 No-Upload AI Analyst — Secure Mode")
st.caption("Transform sensitive columns locally (Hash/Mask/Redact) and analyze the safe table. No file uploads.")
st.markdown("**How it works**")
st.markdown("1) Paste **headers or a tiny synthetic sample** (CSV).\n2) Pick a **policy preset**.\n3) Set your **secret** (for hashing).\n4) Transform & **download** the Safe CSV + Report.")
with st.form("input_form"):
    sample_csv = st.text_area(
        "Paste CSV (headers + a few synthetic rows are enough):",
        height=220,
        placeholder="user_id,email,signup_date,country,churned,phone\n1,alice@example.com,2025-01-05,US,0,+1 (555) 111-2222"
    )
    preset_name = st.selectbox("Preset", ["analytics_safe", "marketing_safe", "hipaa_like_strict"])
    secret = st.text_input("Secret (used for HMAC hashing)", type="password")
    month_bucket_cols_raw = st.text_input("Month-bucket these date columns (comma-separated)", value="signup_date")
    submitted = st.form_submit_button("Transform Securely")
if submitted:
    if not sample_csv.strip():
        st.error("Please paste a small CSV sample (headers + a few synthetic rows).")
        st.stop()
    try:
        df = pd.read_csv(io.StringIO(sample_csv))
    except Exception as e:
        st.error(f"Could not parse CSV: {e}")
        st.stop()
    if not secret:
        st.warning("No secret provided. Using an insecure placeholder; set a real secret before relying on these hashes.")
        secret = "TEMP-SECRET"
    month_cols = [c.strip() for c in month_bucket_cols_raw.split(",") if c.strip()]
    df_safe, actions = transform_dataframe(df, presets[preset_name], secret=secret, month_bucket_cols=month_cols)
    st.success("✅ Transformed locally. No uploads performed.")
    st.dataframe(df_safe.head(), use_container_width=True)

    # Downloads
    safe_csv = df_safe.to_csv(index=False).encode("utf-8")
    st.download_button("⬇️ Download Safe CSV", data=safe_csv, file_name="safe_table.csv", mime="text/csv")

    # Report as markdown
    report_lines = [
        "# Data Handling Report",
        "",
        "| Column | Detected Type | Action | Reason |",
        "|---|---|---|---|",
    ]
    for a in actions:
        report_lines.append(f"| {a['column']} | {a['semantic']} | {a['action']} | {a['reason']} |")
    report_text = "\n".join(report_lines)
    st.download_button("⬇️ Download Report (Markdown)", data=report_text.encode("utf-8"), file_name="data_handling_report.md", mime="text/markdown")
    st.markdown("**What’s next?** Analyze the safe CSV locally. Keep the report with your results for auditability.")
How to Use the Code Safely
Run locally. Don’t host this for strangers.
Use synthetic rows when testing your policy.
Keep your secret out of logs and screenshots.
Share only the safe CSV and the report.
If requirements change (e.g., stricter policy), update the preset and re-run.
Final Nudge
If this approach makes sense, try it once this week on a real analysis. Start tiny: headers + 3 fake rows. Pick Analytics-Safe. Transform. Build your chart or cohort from the safe table. Then attach the one-page report. That one small habit can change the way your team ships analytics—fast, useful, and respectful of the people behind the data.
If this helped, consider sharing it with a teammate who handles sensitive data. And if you want the app UI pre-styled and packaged, tell me—I’ll bundle a polished version for teams next.
—Mukundan