Building for Open-Data Commons at Diagram Chasing

Presented at Open Knowledge Initiatives, IIIT-H

Vivek Matthew
Aman Bhargava

Who we are

Vivek

  • Government website hoarder
  • Occasional open data publisher
  • Fan of maps (including OpenStreetMap)

Aman

  • Designer, programmer
  • Maps, data journalism, public technology enthusiast
  • Has too many ideas

Who is my Neta?

Easy and intuitive explorer for browsing affidavits
and parliamentary activity of elected MPs.

This was fun

Data, development, design. We could do more.

We do full-stack data journalism

BLR Water Log

Drainage patterns in Bangalore as a convenient map,
with curated historical context.

Votes in a Name

Building on the previous project, we analyzed election candidate names to find “namesakes”: near-identical names that could have split votes and potentially flipped an election.

How do you find more namesakes like S Veeramani and S. V. Ramani?
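One way to hunt for them: collapse each name down to its bare letters, then compare the collapsed strings. A minimal sketch in Python (the normalization rule here is an assumption, not our exact pipeline):

import re
from difflib import SequenceMatcher

def collapse(name: str) -> str:
    # Drop dots, spaces, and case so "S. V. Ramani" and
    # "S Veeramani" reduce to comparable letter runs.
    return re.sub(r"[^a-z]", "", name.lower())

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, collapse(a), collapse(b)).ratio()

print(collapse("S Veeramani"))                    # sveeramani
print(collapse("S. V. Ramani"))                   # svramani
print(similarity("S Veeramani", "S. V. Ramani"))  # ~0.89

Pairs on the same ballot that score above a threshold become candidates for manual review.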

Now let’s go deeper

You’ve seen what we build. But how do we build it?

We’ll look at

  • CBFC Watch
  • Time Use Survey Explorer

CBFC Watch

India’s largest, and only, archive of film censorship

For the first time, search through thousands of censorship records

Browse cross-referenced keywords

Each film leads you to others like it, creating links between 18,000 movies based on censorship
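Conceptually, this is an inverted index from keywords to films: any film page can link out to every other film sharing one of its keywords. A toy sketch (the titles and keywords are made up):

from collections import defaultdict

films = {
    "Film A": {"profanity", "violence"},
    "Film B": {"profanity"},
    "Film C": {"violence", "political"},
}

# Invert: keyword -> set of films carrying that keyword.
index = defaultdict(set)
for film, keywords in films.items():
    for kw in keywords:
        index[kw].add(film)

# Films related to Film A through any shared keyword.
related = set().union(*(index[kw] for kw in films["Film A"])) - {"Film A"}
print(related)  # {'Film B', 'Film C'}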

Understand a film in context

Normally, the only way to find a movie’s certificate is to go looking for something like this in a theatre

On opening the URL, the certificate is displayed with the list of cuts made to the film.

Making it human readable

  • Build a parser to read HTML files
  • Extract only the relevant content
  • Export everything to CSV format
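A minimal sketch of that pipeline, assuming BeautifulSoup; the selectors and field names are placeholders, since the real certificate pages have their own markup:

import csv
from pathlib import Path

from bs4 import BeautifulSoup

rows = []
for path in Path("certificates").glob("*.html"):
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    rows.append({
        # Placeholder selectors; substitute the actual page structure.
        "film": soup.select_one(".film-title").get_text(strip=True),
        "cert_no": soup.select_one(".cert-number").get_text(strip=True),
        "cuts": soup.select_one(".modifications").get_text(strip=True),
    })

with open("certificates.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["film", "cert_no", "cuts"])
    writer.writeheader()
    writer.writerows(rows)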

The Government had other plans in store

What I got, I accepted as my fate, What I lost, I gradually forgot.

This data was still unusable because the information that users cared about was hidden in piles of text and timestamps with no context.

Also, modifications alone weren’t enough.

  • Who made the movie?
  • Who acted in the movie?
  • Which studio?
  • When did it release?
  • How can I analyze trends, if any?

Attempt #1:

Manual Classification

If you were going to categorize them manually anyway, the decision was already subjective, and a subjective decision can be passed on to an LLM.

Attempt #2:

Large Language Models

Detailed prompts + Edge case examples = Text categorization that also cleans up messy content for better readability.

## Classification Schema

action_types: deletion, insertion, replacement...
content_types: violence, profanity, political...
media_elements: music, visual_scene, text_dialogue...

{
  "cleaned_description": "string",
  "reference": "string" or null,
  "action": "string",
  "content_types": ["string"],
  "media_element": "string"
}

Input: "Muted the word FUCK at 01:12:29"

Output: {
  "reference": "fuck",
  "action": "audio_modification",
  "content_types": ["profanity"]
}
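As a rough illustration, a single classification call might look like the sketch below. The endpoint, model name, and key are placeholders for whatever OpenAI-compatible provider you use, and the prompt is heavily abridged:

import json
import requests

PROMPT = """Classify this censorship modification. Return JSON with keys:
cleaned_description, reference, action, content_types, media_element."""

def classify(description: str) -> dict:
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",  # placeholder endpoint
        headers={"Authorization": "Bearer YOUR_KEY"},
        json={
            "model": "gpt-4o-mini",                    # placeholder model
            "response_format": {"type": "json_object"},
            "messages": [
                {"role": "system", "content": PROMPT},
                {"role": "user", "content": description},
            ],
        },
        timeout=60,
    )
    return json.loads(resp.json()["choices"][0]["message"]["content"])

print(classify("Muted the word FUCK at 01:12:29"))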

Costs are negligible for the value

  • 100K descriptions processed
  • ₹1,500 total cost
  • ₹0.015 cost per description

What LLMs are good for

  • This kind of repetitive classification at scale
  • Standardizing messy text
  • Pattern matching across similar cases
  • Reducing manual grunt work in development

The interesting work—the analysis, the trends, the insights—that’s entirely (and necessarily) human.

Now we have analyzable metadata!

Before:

01:32:59:00 Replaced the whole V.O. stanza about caste system of Manu Maharaj. Aabadi hain Aabad .Aur unka jivan sarthak hoga To Aabadi hain Aabad nahiazhadi ki

After:

Clean Description: Replaced a voice-over passage discussing the caste system.

Categories: REPLACEMENT, TEXT DIALOGUE, IDENTITY REFERENCE

Topic: CASTE

Now what?

Safe-keeping the originals

All 1.2 lakh records scraped from E-Cinepramaan live on Archive.org

Beyond us

People have put this data, and the form we released it in, to use in many ways.

https://www.thehindu.com/data/over-720-hours-of-film-content-altered-by-indian-censor-board-in-recent-years/article70071303.ece

https://www.hollywoodreporterindia.com/features/insight/cbfc-data-malayalam-bhojpuri-u-rated-films

https://www.thehindu.com/entertainment/movies/a-nice-indian-boy-title-censorship-roshan-sethi-karan-soni-zarna-garg/article70193484.ece

https://fortuneiascircle.com/uploads/download/FWD_03rd_November_to_09th_November,_2025.pdf

https://www.google.com/search?q=site%3Agrokipedia.com+%22cbfc.watch%22

National Time Use Survey

How people in India spend their time each day.

  • 5 lakh people surveyed
  • 24 hours of time use
  • 30-minute intervals
  • Last released by MoSPI for January–December 2024

Visualization by Nathan Yau, FlowingData

Beyond making fancy visuals, the data can be used to answer many questions:

  • Who spends more time cleaning up after meals?
  • Where in India do people spend more time on commuting?
  • Is there a correlation between income and time spent on leisure?

Suppose you wanted to answer one of these questions. How would you go about it?

First you have to create an account on the MoSPI portal.

Then this is what you get when you download the data.

To map the code in the data to an actual activity, you need to go through multiple documents.

What could have been a single step takes 3 steps.

What would fix this?

  • Do not require users to create an account to access the data
  • Do not require users to open an Excel file with lakhs of rows and dozens of columns
  • Do not require users to refer to multiple documents to convert codes to words

Publicly accessible, browsable on the web, and a single file

The data pipelines are replicable; do this for yourself!
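For instance, once the activity-code lookup is compiled into a single table, the code-to-words step collapses into one join. A sketch with pandas (the file and column names are assumptions):

import pandas as pd

survey = pd.read_csv("tus_2024.csv")       # survey rows with an activity_code column
codes = pd.read_csv("activity_codes.csv")  # lookup: activity_code -> activity label

# One join replaces paging through multiple reference documents.
labelled = survey.merge(codes, on="activity_code", how="left")
print(labelled[["activity_code", "activity"]].head())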

A web interface for the Excel file is good, but it can be even better.

  • Filters
  • Complex aggregations
  • SQL queries

View the raw data.

Run time-analysis queries.

Who spends more time cleaning up after meals?
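A question like this reduces to one aggregation over the flat file. A DuckDB sketch, where the column names mirror the explorer’s URL parameters but the file name and activity label are assumptions:

import duckdb

# Each row is one 30-minute activity slot reported by one respondent.
q = """
SELECT gender,
       COUNT(*) * 0.5 AS person_hours
FROM 'tus_2024.csv'
WHERE activity_code = 'Cleaning up after meals'
GROUP BY gender
ORDER BY person_hours DESC
"""
print(duckdb.sql(q).df())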

Linkable and shareable URLs.

Districts with more incidental sleep/naps:

https://diagramchasing.fun/2025/time-use-explorer?viewMode=time_analysis&filters=%5B%22activity_code%7C%3D%7CIncidental+sleep%2Fnaps%22%5D&columns=%5B%22gender%22%2C%22age%22%2C%22state%22%2C%22district%22%2C%22activity_code%22%2C%22education%22%2C%22time_from%22%2C%22time_to%22%5D&demographic=%5B%22district%22%5D&activity=activity_code&agg=%5B%7B%22column%22%3A%22*%22%2C%22function%22%3A%22COUNT_DISTINCT_PERSON%22%7D%5D

Data is useful only when people use it.

How many people would use the data if they had to:

  • Create an account, open an Excel file, refer to multiple documents, and know how to code, even for the most basic queries.

versus

  • Open a website and click some buttons.

Data commons in India

MoSPI is one government entity that publicly releases data.

There are many such government entities. Most of them are worse than MoSPI.

How many BYD Cars were registered in Bengaluru in 2025?

MoRTH has vehicle registration statistics

https://vahan.parivahan.gov.in/vahan4dashboard/

Problem solved!

Haha, not really.

5 steps on a convoluted UI

…to access a single data point

  • Designed for use in a specific way
  • No/minimal documentation
  • Data quality issues
  • No bulk data access

Open data + 24 hours of website development.

https://india-vehicle-stats.pages.dev/KA/ALL/2025?name=BYD+INDIA+PRIVATE+LIMITED

Open data = freedom to use it as you want.

https://github.com/Vonter/india-vehicle-stats

Survey of India creates the official government maps

They make the process of accessing them very inviting.

OpenStreetMap in comparison…

Open data deserves good design and thought

It’s never been easier to spin up a dashboard for your open data with a few prompts.

It’s also never been easier for your audience to ignore yet another dashboard.

When the commons are sparse, what we do within them matters even more

The scarcity of open data in India makes poor execution particularly costly.

Design decisions are visual and systemic

Formats dictate participation

PDF / Dashboards are “read only”

You can only participate by observing.

CSV / JSON / API are open

You can participate by observing and building.

We love to bash PDFs, but do we do the same for dashboards?

The platform is always just one view of the dataset.

Most dashboards, including government ones, pretend to be the final answer.

LOOK NO FURTHER THAN ME!

Instead,

Think about the various kinds of users your data might attract. Can you make their lives easier?

Show them one way to slice the data. Get them thinking of more!

We extensively document all data releases because we want users to use this data.

No guesswork.

Detailed, ready-to-go notebooks!

Documentation promotes use

Documentation promotes use

Documentation promotes use

Who are we building for?

If there is a national disaster, the NDEM website hopes you are on a laptop.

Who are we building for?

Zero-login is a feature

Public data means public. There is already enough friction in getting people interested in information; more barriers lose more people.


https://www.mdpi.com/2076-3387/13/11/229

Archival

Websites regularly disappear.

Indian government websites especially so.

What to do about this?

After CBFC broke their own website, we experimented with mirroring the censor certificates on archive.org

Mirroring to archive.org turned out to be straightforward.
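The internetarchive Python package does most of the work. A sketch (the identifier and metadata here are illustrative, not our actual item names):

from internetarchive import upload

responses = upload(
    "cbfc-certificate-example",          # illustrative identifier
    files=["certificate.pdf"],
    metadata={
        "title": "CBFC Certificate (example)",
        "mediatype": "texts",
        "collection": "opensource",
    },
)
print(responses[0].status_code)  # 200 on success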

After CBFC, we mirrored documents from a bunch of other sites:

  • Parliament
  • State Assemblies
  • Court Judgements
  • Karnataka State Archives

archive.org automatically makes documents findable and searchable.

What was said in Parliament about the Cuban Missile Crisis?

Government data should be about more than just numbers and stats.

Government data is the result of government processes.

  • Government Orders
  • Gazette Notifications
  • Legislative Proceedings
  • State Archives

Open-data and LLMs

How many districts are there in India?

Everyone has a different answer.

Wikipedia says 780

Local Government Directory says 778

Google/Gemini says 780 to 806

Most people don’t go beyond the AI result

  • A good AI result is preferred over a bad one
  • AI needs to be trained on data to be able to respond
  • AI trained on open data is preferred over AI trained on unknown data

Risks when using an LLM

  • No attribution, but some people cite the LLM response as a source
  • An LLM is not a fact checker, but some people use it as one
  • LLMs are not “intelligent”, but some people treat them as such

But these issues are not unique to open data; they apply to all data.

Even if the data is not open, LLM developers still use it to train their models.

LLMs for code help us do more

As a team of two with day jobs, we lean on LLM assistance for code to maintain our pace.

But we have strict rules

  1. Grunt work only. We use LLMs for code and data cleaning, never for writing, analysis, or creating art.

  2. Open source. We publish not just the code, but the prompts and the raw data. The pipeline must be reproducible.

  3. Transparency. If a dataset was cleaned or summarized by AI, the metadata and UI must explicitly say so.

Upcoming stories

  • How does India spend its time?
  • Analyzing the frontpages of major Indian newspapers.
  • More tools and stories on the weather, trees, public transport.

Keep us in your bookmarks!

Thanks for listening!

diagramchasing.fun