Thursday, February 20, 2025

RAG AI: “Take Charge Yourself,” Asserts NYC Data Scientist

Organizations should start building their own generative AI tools that focus on retrieval augmented generation (RAG) and use open-source models like DeepSeek and Llama. Alaa Moussawi, chief data scientist at the New York City Council, shared this message at the recent Leap 2025 tech event in Saudi Arabia. The event came amid Saudi Arabia's announcement of a $15 billion investment in AI.

Moussawi highlighted that any organization can experiment with AI at minimal cost. He pointed to the council’s first project from 2018, where they aimed to create a more fact-based legislative process and ease the workflow for attorneys, policy analysts, and elected officials.

In 2018, Moussawi’s team developed its first AI application—a duplicate checker for legislation. Whenever a council member proposes legislation, it’s input into the system, timestamped, and checked for originality. With tens of thousands of ideas logged, ensuring each proposal is unique is crucial. “If it’s been proposed before, the credit goes to the original official,” Moussawi explained, recalling past mistakes where duplicate proposals led to contentious situations.

Looking back, Moussawi described their early model as basic. It used Google's Word2Vec, developed in 2013, to capture relationships between words. "It's not fast," he admitted, "but it's quicker than a human, which streamlines their work."

The core of their duplicate checker lies in vector embedding. Each word is represented as a list of numbers, a position in a multi-dimensional space. Moussawi illustrated this with vectors for concepts like "royalty" and "woman," which combine to represent "queen." The system is trained on all the available texts to create idea vectors and then measures the distances between them, essentially scoring how similar two legislative ideas are.
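The idea can be sketched in a few lines of Python. The vectors below are toy three-dimensional illustrations, not real Word2Vec embeddings (which have hundreds of dimensions learned from a large corpus), but they show how vector arithmetic and a similarity measure combine:

```python
import math

# Toy 3-dimensional word vectors (illustrative only; real Word2Vec
# embeddings have hundreds of dimensions learned from a text corpus).
vectors = {
    "royalty": [1.0, 0.0, 0.0],
    "man":     [0.0, 1.0, 0.0],
    "woman":   [0.0, 0.0, 1.0],
    "king":    [1.0, 1.0, 0.0],
    "queen":   [1.0, 0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical direction, 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: math.sqrt(sum(x * x for x in w))
    return dot / (norm(u) * norm(v))

# Adding "royalty" and "woman" lands closest to "queen".
combined = [x + y for x, y in zip(vectors["royalty"], vectors["woman"])]
closest = max(vectors, key=lambda word: cosine(combined, vectors[word]))
```

A duplicate checker applies the same similarity score to whole-proposal vectors: if a new idea's vector sits close enough to an existing one, it is flagged for review.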

Moussawi is passionate about hands-on experience with generative AI. While he recognizes the advancements in neural networks, he emphasizes their limitations. “Current AI models simply predict the next word in a sequence,” he explained. “Unlike a human conveying an idea, AI isn’t genuinely thinking.”

The field's 2020 breakthrough was one of scale: researchers found that increasing compute power and dataset size reliably improved performance. However, Moussawi stressed that the underlying science has not changed drastically since. There are numerous open-source models available, like DeepSeek and Llama, but the fundamental architecture remains largely the same.

As for why he advocates for DIY AI, Moussawi noted the council’s decision to prohibit third-party LLMs due to security risks. Instead, they opted for open-source models. “With the launch of the initial Llama models, we began experimenting locally,” he said. “You can run models on your laptop and create effective proof-of-concept systems.”

He outlined the first steps for setting up a local AI system: start by indexing documents into a vector database. This process is a one-time backend task, creating the foundation for querying. Next, create a pipeline that retrieves relevant documents based on input prompts, using your vector database to find legal memos or other relevant documents.
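The two steps he describes can be sketched as a minimal, in-memory pipeline. The bag-of-words "embedding" and the Python-list "vector database" below are stand-ins for illustration; a real system would use a neural embedding model and a dedicated vector store:

```python
import math
from collections import Counter

# Step 1 (one-time backend task): embed and index the documents.
# Bag-of-words counts stand in for learned embeddings here.
def embed(text):
    """Turn a document into a sparse term-count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

documents = [
    "memo on idling school buses and air quality",
    "legislation to expand bicycle lanes",
    "report on school bus electrification",
]
index = [(doc, embed(doc)) for doc in documents]  # the "vector database"

# Step 2 (query time): retrieve the documents most relevant to a prompt.
def retrieve(prompt, k=2):
    query_vec = embed(prompt)
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]

top = retrieve("electric school buses")
```

The retrieved documents are then passed to the language model alongside the user's question, which is the "retrieval" half of retrieval augmented generation.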

This approach is known as retrieval augmented generation, and it greatly reduces inaccuracies by ensuring that the AI references credible sources. "These provide guardrails for your model," Moussawi explained, adding that end users can trust the output because it cites its sources.
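One common way to build those guardrails, sketched here as an assumption rather than the council's actual implementation, is to stitch the retrieved documents into the prompt as numbered sources and instruct the model to cite them:

```python
# Hedged sketch: assemble a grounded prompt from retrieved documents so
# the model answers from cited sources rather than from memory alone.
def build_rag_prompt(question, retrieved_docs):
    """Return a prompt that restricts the model to the numbered sources
    and asks it to cite a source number for every claim."""
    sources = "\n".join(f"[{i + 1}] {doc}"
                        for i, doc in enumerate(retrieved_docs))
    return (
        "Answer the question using ONLY the sources below, and cite the "
        "source number for every claim. If the sources are insufficient, "
        "say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What has the council studied about school buses?",
    ["memo on idling school buses", "report on bus electrification"],
)
```

Because every claim must point back to a numbered source, a reader can verify the answer against the original documents, which is the reliability Moussawi describes.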

His final message was clear and motivational. Even as the council's data science team waits to receive its first GPUs, he urged organizations to start their own AI journey: "What are you waiting for?"