Building an AI Research Pipeline to Separate Signal from Noise

How I built an automated research pipeline that captures, transcribes, analyzes, and organizes content from across the internet into a structured knowledge vault, and why the curation matters more than the automation.

I consume a lot of wellness and technology content. Videos, podcasts, research papers, articles, social media posts, screenshots of things that catch my eye while scrolling on my phone. If you are anything like me, you have experienced the same frustration: you encounter something genuinely interesting (a doctor explaining a mechanism, an engineer breaking down a system design, a study contradicting conventional wisdom), and within a week it is gone. Buried in browser tabs. Lost in watch history. Reduced to a vague memory that you read something about magnesium and sleep, somewhere, from someone credible.

The information exists. The signal is out there. But without a system to capture and organize it, it dissolves back into noise.

So I built one.


The Problem Worth Solving

The core issue is not a lack of information; it is the opposite. There is more content being produced than any person could consume in a lifetime. The challenge is not finding it. The challenge is retaining what matters and being able to retrieve it when you need it.

I wanted a system that could do what I do naturally (watch a video, read an article, notice something worth remembering) and then do what I consistently fail at: capture the substance, organize it properly, and make it findable later.

Not a bookmarking tool. Not a "read later" queue that becomes a guilt list. An actual research pipeline that processes content the way I would if I had unlimited time and perfect discipline.

What It Actually Does

The system I built works like this: I encounter something interesting on the internet (a YouTube video, a web article, a Facebook post, a PDF, a screenshot on my phone) and I send it to a Telegram bot. That is my entire contribution. One forward, one paste, one photo. The pipeline handles everything else.

Here is what happens next, depending on what I sent:

  • Videos, podcasts, and audio (YouTube, Facebook, Instagram, TikTok, Vimeo, or any audio/video file): the system extracts or generates a transcript and sends it to an AI model for structured analysis
  • Web articles: a headless browser loads the page, strips away the navigation and ads, extracts the meaningful content, and sends it for analysis
  • PDFs and documents: text is extracted and processed the same way
  • Screenshots and images: a local vision model reads the image, transcribes any text, identifies what it is about, and generates a structured summary
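The branching above can be sketched as a small routing function. This is a minimal illustration, not the pipeline's actual code: the host list, the extension sets, and the name `classify` are all assumptions made for the example, and a real router would also need to handle cases such as a Facebook link that points to a text post rather than a video.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical host and extension lists; the real routing rules are not shown in this post.
MEDIA_HOSTS = {"youtube.com", "youtu.be", "facebook.com", "instagram.com", "tiktok.com", "vimeo.com"}
MEDIA_EXTS = {".mp3", ".mp4", ".m4a", ".wav", ".mov", ".webm"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}
DOC_EXTS = {".pdf", ".docx", ".txt"}


def classify(item: str) -> str:
    """Return which extraction branch should handle a forwarded link or file name."""
    parsed = urlparse(item)
    # For URLs this is the path component; for a bare file name it is the whole string.
    suffix = PurePosixPath(parsed.path).suffix.lower()
    if suffix in IMAGE_EXTS:
        return "image"
    if suffix in DOC_EXTS:
        return "document"
    if suffix in MEDIA_EXTS:
        return "media"
    if parsed.scheme in ("http", "https"):
        # Treat www. and mobile subdomains the same as the bare host.
        host = (parsed.hostname or "").removeprefix("www.").removeprefix("m.")
        return "media" if host in MEDIA_HOSTS else "article"
    return "unknown"
```

Each label would then map to one extractor (transcript, headless browser, text extraction, or vision model), which keeps the analysis step identical regardless of where the content came from.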

In every case, the output is the same: a well-structured research note that lands in my personal knowledge vault. Title, summary, key takeaways, notable claims, tags, questions for further research, and for video content, flashcards and the full transcript.

The entire process takes between thirty seconds and a few minutes, depending on the content type. By the time I have moved on to the next thing, the note is already sitting in my vault, tagged and searchable.

The Research Pipeline

📱 Capture

YouTube  ·  Podcasts  ·  Web Articles  ·  Facebook Posts  ·  PDFs
Screenshots  ·  Audio Files  ·  Instagram  ·  TikTok  ·  Documents

→ Sent via Telegram bot or command line

⚙️ Extract

Video/Audio → Extract subtitles, generate transcript
Web → Headless browser strips clutter, extracts article text
Images → Local vision model reads and interprets content
PDFs → Full text extraction across all pages

🧠 Analyze

AI summarization generates structured output:
Title  ·  Summary  ·  Key Takeaways  ·  Notable Claims
Tags  ·  Speaker Attribution  ·  Research Questions  ·  Flashcards

→ Content classified by topic and relevance

🗄️ Store

Structured research note saved to Obsidian vault
with frontmatter metadata, visual summary card,
and full source transcript or text

→ Deduplicated, tagged, and immediately searchable

✍️ Surface

AI-assisted pattern recognition across the growing corpus
identifies clusters, contradictions, and emerging themes
that become candidates for deeper investigation and writing
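The "Deduplicated" step in the Store stage is worth a brief illustration. One plausible approach, assumed here rather than taken from the actual pipeline, is to normalize each source URL and keep a set of short hashes. The tracking-parameter list and the names `normalize_url`, `source_key`, and `SeenSources` are hypothetical.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Assumed list of query parameters that do not change the content behind a link.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "fbclid"}


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form so trivially different links compare equal."""
    parts = urlparse(url.strip())
    host = (parts.hostname or "").removeprefix("www.")  # hostname is already lowercased; port is dropped
    # Drop tracking parameters and sort the rest so parameter order does not matter.
    query = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS)
    path = parts.path.rstrip("/")
    return urlunparse((parts.scheme, host, path, "", urlencode(query), ""))


def source_key(url: str) -> str:
    """Short, stable identifier for a source, usable in a file name or frontmatter field."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()[:16]


class SeenSources:
    """In-memory record of sources that have already been processed."""

    def __init__(self) -> None:
        self._keys: set[str] = set()

    def add(self, url: str) -> bool:
        """Record the URL; return False if an equivalent URL was already recorded."""
        key = source_key(url)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True
```

A persistent version would load the key set from the vault's existing notes at startup instead of starting empty.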

Why Telegram?

This was a deliberate choice, not a technical convenience. Telegram is the one app that is always available: on my phone, my laptop, my tablet. When I am watching a video and something strikes me as worth keeping, the friction needs to be near zero. Copy the link, paste it to my bot. Done. No context switching to a different app, no logging into a web interface, no "I will save this later" promises I will not keep.

For screenshots, it is even simpler. I take a screenshot on my phone (a chart from an article, a slide from a presentation) and forward it directly. A local vision model reads the image, interprets it, and creates a structured note. That screenshot that would have sat in my camera roll forever is now a searchable, tagged research note.

The AI Layer: What It Does and What It Does Not Do

The AI in this pipeline serves a specific, bounded purpose: structured extraction. It is not generating opinions. It is not deciding what is true. It is reading content and producing a consistent, well-organized summary that makes the material useful later.

For every piece of content, the AI generates:

  • A concise title and description
  • A two-to-three paragraph summary
  • Bullet-pointed key takeaways
  • Notable claims, with specific numbers or study references when present
  • Relevant tags for categorization
  • Questions the content raises that are worth investigating further
  • For video and audio content: flashcards in spaced repetition format

This is the part I find most valuable. The AI is not replacing my thinking; it is doing the tedious work of structuring information so that my thinking can start at a higher level. When I sit down to research a topic, I am not rewatching hours of video or re-reading articles. I am scanning structured summaries, following tagged connections, and reviewing the specific claims and questions that the pipeline surfaced.
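To make "structured extraction" concrete, here is a minimal sketch of how the fields listed above could be represented and validated. It assumes the model is asked to reply in JSON. The field names, the `Analysis` and `Flashcard` classes, and `parse_analysis` are illustrative choices, not the pipeline's real schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Flashcard:
    question: str
    answer: str


@dataclass
class Analysis:
    title: str
    description: str
    summary: str
    key_takeaways: list[str]
    notable_claims: list[str]
    tags: list[str]
    research_questions: list[str]
    flashcards: list[Flashcard] = field(default_factory=list)  # only for video and audio


# Every field except flashcards is expected for every content type.
REQUIRED = ("title", "description", "summary", "key_takeaways",
            "notable_claims", "tags", "research_questions")


def parse_analysis(raw: str) -> Analysis:
    """Parse a model reply (assumed to be a JSON object) and reject incomplete output."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED if name not in data]
    if missing:
        raise ValueError(f"analysis is missing fields: {', '.join(missing)}")
    # Normalize tags so "Vitamin D" and "vitamin-d" end up as one tag.
    tags = sorted({t.strip().lower().replace(" ", "-") for t in data["tags"] if t.strip()})
    cards = [Flashcard(c["question"], c["answer"]) for c in data.get("flashcards", [])]
    return Analysis(
        title=data["title"].strip(),
        description=data["description"].strip(),
        summary=data["summary"].strip(),
        key_takeaways=list(data["key_takeaways"]),
        notable_claims=list(data["notable_claims"]),
        tags=tags,
        research_questions=list(data["research_questions"]),
        flashcards=cards,
    )
```

Validating the reply before writing a note means a malformed response fails loudly instead of producing a half-empty note in the vault.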

The Knowledge Vault

Everything lands in an Obsidian vault, a local, markdown-based knowledge management system. This was another deliberate choice. Obsidian is not a cloud service that might change its terms, limit access, or disappear. It is files on my machine. Markdown files that I own, can search, can back up, and can use with any tool I choose.

Each research note includes structured metadata (source URL, content type, tags, speaker name, processing date) that makes the vault queryable in ways that a folder full of bookmarks never could be. I can pull up everything tagged with a specific topic, find all notes from a particular speaker, or surface content that mentions a specific concept or technique.
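As a rough sketch of what that metadata makes possible, the functions below write a note with a frontmatter block and then filter notes by tag. The key names and the helpers `render_note`, `read_frontmatter`, and `notes_with_tag` are assumptions for illustration. The reader only understands the simple frontmatter this sketch writes, not YAML in general.

```python
import json


def render_note(meta: dict, body: str) -> str:
    """Render a Markdown note with a frontmatter block built from `meta`."""
    # JSON strings and lists are also valid YAML, so json.dumps handles quoting safely.
    front = [f"{key}: {json.dumps(value, ensure_ascii=False)}" for key, value in meta.items()]
    return "\n".join(["---", *front, "---", "", body.rstrip(), ""])


def read_frontmatter(note: str) -> dict:
    """Read back frontmatter written by render_note; return {} if the note has none."""
    lines = note.splitlines()
    if not lines or lines[0] != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line == "---":  # closing delimiter of the frontmatter block
            break
        key, _, value = line.partition(": ")
        meta[key] = json.loads(value)
    return meta


def notes_with_tag(notes: dict[str, str], tag: str) -> list[str]:
    """Return the names of all notes whose frontmatter lists `tag`."""
    return sorted(name for name, text in notes.items()
                  if tag in read_frontmatter(text).get("tags", []))


# Example: two in-memory notes standing in for files in the vault.
vault = {
    "magnesium-sleep.md": render_note(
        {"source": "https://example.com/a", "type": "video", "tags": ["sleep", "magnesium"],
         "speaker": "Dr. Example", "processed": "2025-01-05"},
        "# Magnesium and sleep\n\nSummary goes here.",
    ),
    "system-design.md": render_note(
        {"source": "https://example.com/b", "type": "article", "tags": ["architecture"],
         "speaker": "", "processed": "2025-01-06"},
        "# Queue design\n\nSummary goes here.",
    ),
}
```

Calling `notes_with_tag(vault, "sleep")` returns only the first note. Inside Obsidian itself, the same metadata is available to search and to query plugins without any custom code.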

Within its first days of operation, the vault accumulated over 250 pieces of processed content, and it continues to grow. That is not a pile of bookmarks. That is a searchable, structured research corpus that expands every time I encounter something interesting.

Privacy as a Design Principle

One decision I am particularly satisfied with: audio transcription runs locally. When the pipeline needs to transcribe a video or podcast that does not have subtitles available, it uses an open-source speech recognition model running on my own machine. The audio never leaves my hardware. The same applies to image analysis: a locally hosted vision model handles screenshots and photos without sending them to any external service.

The AI summarization step does use cloud APIs; that is where the structured analysis happens. But the raw content processing (the transcription, the image reading) stays local. For research that sometimes involves personal data, this matters.

How It Was Built, and What That Says

Here is a detail I think is worth sharing: this entire pipeline is written in Python, roughly two thousand lines of it, and I never opened an editor to write it.

I have spent decades building software and leading engineering teams. I could have built this myself, the traditional way, and it would have worked. But I chose not to. Instead, I designed the system, defined the architecture, made the technical decisions, and then built the whole thing by describing what I wanted to an AI and iterating on the output.

Not because I could not code it. Because this was a better use of my time and attention.

The architecture decisions (choosing local transcription over cloud APIs, using a queue-based processing model, structuring the Obsidian output for maximum queryability) required experience and judgment. The actual implementation did not require me to type def process_video(): into an editor. That distinction matters.

What struck me most was how the role shifted. I was not a developer in this project. I was an architect and a director, someone who knew exactly what the system needed to do, understood the tradeoffs, and could evaluate whether the AI's output was correct. The decades of experience did not become irrelevant. They became the thing that made the collaboration effective. I knew what good looked like, which meant I could steer the AI toward it without writing the code myself.

I think this is where things are heading for a lot of builders. The value is not in typing the code; it is in knowing what to build, why, and being able to tell when something is wrong. If you have that, the implementation is increasingly something you can delegate.
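For readers curious what a queue-based processing model looks like in practice, here is a minimal sketch using only the Python standard library. It is an illustration of the general pattern, not the pipeline's code: the names `run_worker` and `process_all`, the single worker thread, and the sentinel-based shutdown are all assumptions.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to exit


def run_worker(jobs: queue.Queue, handle, results: list, errors: list) -> None:
    """Pull jobs off the queue one at a time until the sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is _STOP:
                return
            results.append(handle(job))
        except Exception as exc:
            # One bad link should not take down the pipeline; record it and move on.
            errors.append((job, exc))
        finally:
            jobs.task_done()


def process_all(items, handle):
    """Queue every item, process them in order on a background thread, and wait."""
    jobs: queue.Queue = queue.Queue()
    results: list = []
    errors: list = []
    worker = threading.Thread(target=run_worker, args=(jobs, handle, results, errors), daemon=True)
    worker.start()
    for item in items:
        jobs.put(item)  # returns immediately, so the sender never waits on processing
    jobs.put(_STOP)
    jobs.join()
    worker.join()
    return results, errors
```

The point of the queue is that capture and processing are decoupled: a bot handler can accept a link and reply at once, while slow work such as transcription happens in the background, one item at a time.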

What Surprised Me

I expected the transcription and summarization to be useful. What I did not expect was how much the tagging and question generation would change how I think about the material.

When the AI processes a video about, say, vitamin D and autoimmunity, it does not just summarize the claims. It generates tags that connect this content to other notes in the vault, notes I may have forgotten about, on related but distinct topics. And the "questions for further research" section regularly surfaces angles I had not considered.

The vault has started to develop its own gravity. The more content it contains, the more connections emerge between topics. Patterns I would never have noticed by consuming content linearly, one video at a time, one article at a time, become visible when hundreds of structured notes are sitting side by side, tagged and cross-referenced.

This is where the real value lives. Not in any individual note, but in the accumulation and intersection of structured knowledge over time.

From Vault to Voice

This site, Signal and Noise, is the publishing layer that sits alongside this infrastructure. Wellness is where the pipeline started, and right now it is where most of the depth lives. When I write about those topics here, I am not starting from a blank page or from memory. I am drawing on a growing, searchable research corpus that I have been building continuously, one captured insight at a time.

But this site is not exclusively a wellness publication, and the pipeline is not the only path to a published piece. I have other interests (technology, sports, the occasional opinion about how something should be built) and those will show up here too, sometimes backed by a structured research corpus, sometimes not. The pipeline is a tool, not a prerequisite.

What it does guarantee, at least for the wellness content that currently dominates the vault, is traceability. The claims I discuss have a trail back to the original content that introduced them. This does not make them correct, but it means the reasoning chain from source material to published insight is preserved.

Over time, I may extend the same approach to other domains. The architecture is not domain-specific; it is content-specific. But for now, this is where it earns its keep.


The Honest Assessment

Is this overkill? Maybe. There are simpler ways to save bookmarks and take notes. But I have tried those, and they do not work, not for me, not at the volume of content I consume, and not for the kind of cross-referencing and pattern recognition that makes research genuinely useful.

The system is not perfect. Some summaries miss nuance. Some tags are too broad. The transcription occasionally stumbles on technical terminology. But the baseline it provides (structured, searchable, tagged research notes generated automatically from a single forwarded link) is dramatically better than the alternative, which is forgetting most of what I consume.

If this resonates, it is probably because you have the same problem. You consume more than you can retain. You know the signal is in there, somewhere, but you do not have the infrastructure to hold onto it.

This is how I solved that. It is working better than I expected.


The content on this site reflects personal experience and personal research. Nothing here constitutes medical advice or professional recommendations. For the full disclaimer, see the About page.