Building for Open-Data Commons at Diagram Chasing

Presented at Open Knowledge Initiatives, IIIT-H

Vivek Matthew
Aman Bhargava

Who we are

Vivek

  • Government website hoarder
  • Occasional open data publisher
  • Fan of maps (including OpenStreetMap)

Aman

  • Designer, programmer
  • Maps, data journalism, public technology enthusiast
  • Has too many ideas

Who is my Neta?

Easy and intuitive explorer for browsing affidavits
and parliamentary activity of elected MPs.

This was fun

Data, development, design. We could do more.

We do full-stack data journalism

BLR Water Log

Drainage patterns in Bangalore as a convenient map,
with curated historical context.

Votes in a Name

Building on the previous project, we analyzed election candidate names to find “namesakes”: near-identical names that could have split votes and potentially flipped an election.

How do you find more namesakes like S Veeramani and S. V. Ramani?
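One way to hunt for them: collapse each name down to its bare letters, then compare the collapsed strings. A minimal sketch in Python (the normalization rule here is an assumption, not our exact pipeline):

import re
from difflib import SequenceMatcher

def collapse(name: str) -> str:
    # Drop dots, spaces, and case so "S. V. Ramani" and
    # "S Veeramani" reduce to comparable letter runs.
    return re.sub(r"[^a-z]", "", name.lower())

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, collapse(a), collapse(b)).ratio()

print(collapse("S Veeramani"))                    # sveeramani
print(collapse("S. V. Ramani"))                   # svramani
print(similarity("S Veeramani", "S. V. Ramani"))  # ~0.89

Pairs on the same ballot that score above a threshold become candidates for manual review.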

Now let’s go deeper

You’ve seen what we build. But how do we build it?

We’ll look at

  • CBFC Watch
  • Time Use Survey Explorer

CBFC Watch

India’s largest, and only, archive of film censorship

For the first time, search through thousands of censorship records

Browse cross-referenced keywords

Each film leads you to others like it, creating links between 18,000 movies based on censorship
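Conceptually, this is an inverted index from keywords to films: any film page can link out to every other film sharing one of its keywords. A toy sketch (the titles and keywords are made up):

from collections import defaultdict

films = {
    "Film A": {"profanity", "violence"},
    "Film B": {"profanity"},
    "Film C": {"violence", "political"},
}

# Invert: keyword -> set of films carrying that keyword.
index = defaultdict(set)
for film, keywords in films.items():
    for kw in keywords:
        index[kw].add(film)

# Films related to Film A through any shared keyword.
related = set().union(*(index[kw] for kw in films["Film A"])) - {"Film A"}
print(related)  # {'Film B', 'Film C'}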

Understand a film in context

Normally, the only way to find a movie’s certificate is to go looking for something like this in a theatre

On opening the URL, the certificate is displayed with the list of cuts made to the film.

Making it human readable

  • Build a parser to read HTML files
  • Extract only the relevant content
  • Export everything to CSV format
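A minimal sketch of that pipeline, assuming BeautifulSoup; the selectors and field names are placeholders, since the real certificate pages have their own markup:

import csv
from pathlib import Path

from bs4 import BeautifulSoup

rows = []
for path in Path("certificates").glob("*.html"):
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    rows.append({
        # Placeholder selectors; substitute the actual page structure.
        "film": soup.select_one(".film-title").get_text(strip=True),
        "cert_no": soup.select_one(".cert-number").get_text(strip=True),
        "cuts": soup.select_one(".modifications").get_text(strip=True),
    })

with open("certificates.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["film", "cert_no", "cuts"])
    writer.writeheader()
    writer.writerows(rows)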

The Government had other plans in store

What I got, I accepted as my fate, What I lost, I gradually forgot.

This data was still unusable because the information that users cared about was hidden in piles of text and timestamps with no context.

Also, modifications alone weren’t enough.

  • Who made the movie?
  • Who acted in the movie?
  • Which studio?
  • When did it release?
  • How can I analyze trends, if any?

Attempt #1:

Manual Classification

If you were going to categorize them manually anyway, the decision was already subjective, and a subjective decision can be passed on to an LLM.

Attempt #2:

Large Language Models

Detailed prompts + Edge case examples = Text categorization that also cleans up messy content for better readability.

## Classification Schema

action_types: deletion, insertion, replacement...
content_types: violence, profanity, political...
media_elements: music, visual_scene, text_dialogue...

{
  "cleaned_description": "string",
  "reference": "string" or null,
  "action": "string",
  "content_types": ["string"],
  "media_element": "string"
}

Input: "Muted the word FUCK at 01:12:29"

Output: {
  "reference": "fuck",
  "action": "audio_modification",
  "content_types": ["profanity"]
}
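As a rough illustration, a single classification call might look like the sketch below. The endpoint, model name, and key are placeholders for whatever OpenAI-compatible provider you use, and the prompt is heavily abridged:

import json
import requests

PROMPT = """Classify this censorship modification. Return JSON with keys:
cleaned_description, reference, action, content_types, media_element."""

def classify(description: str) -> dict:
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",  # placeholder endpoint
        headers={"Authorization": "Bearer YOUR_KEY"},
        json={
            "model": "gpt-4o-mini",                    # placeholder model
            "response_format": {"type": "json_object"},
            "messages": [
                {"role": "system", "content": PROMPT},
                {"role": "user", "content": description},
            ],
        },
        timeout=60,
    )
    return json.loads(resp.json()["choices"][0]["message"]["content"])

print(classify("Muted the word FUCK at 01:12:29"))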

Costs are negligible for the value

  • 100K descriptions processed
  • ₹1,500 total cost
  • ₹0.015 cost per description

What LLMs are good for

  • This kind of repetitive classification at scale
  • Standardizing messy text
  • Pattern matching across similar cases
  • Reducing manual grunt work in development

The interesting work—the analysis, the trends, the insights—that’s entirely (and necessarily) human.

Now we have analyzable metadata!

Before:

01:32:59:00 Replaced the whole V.O. stanza about caste system of Manu Maharaj. Aabadi hain Aabad .Aur unka jivan sarthak hoga To Aabadi hain Aabad nahiazhadi ki

After:

Clean Description: Replaced a voice-over passage discussing the caste system.

Categories: REPLACEMENT, TEXT DIALOGUE, IDENTITY REFERENCE

Topic: CASTE

Now what?

Safe-keeping the originals

All 1.2 lakh records scraped from E-Cinepramaan live on Archive.org

Beyond us

People have put this data, and the form we released it in, to use in many ways.

https://www.thehindu.com/data/over-720-hours-of-film-content-altered-by-indian-censor-board-in-recent-years/article70071303.ece

https://www.hollywoodreporterindia.com/features/insight/cbfc-data-malayalam-bhojpuri-u-rated-films

https://www.thehindu.com/entertainment/movies/a-nice-indian-boy-title-censorship-roshan-sethi-karan-soni-zarna-garg/article70193484.ece

https://fortuneiascircle.com/uploads/download/FWD_03rd_November_to_09th_November,_2025.pdf

https://www.google.com/search?q=site%3Agrokipedia.com+%22cbfc.watch%22

National Time Use Survey

How people in India spend their time each day.

  • 5 lakh people surveyed
  • 24 hours of time use
  • 30-minute intervals
  • Last released by MoSPI for January–December 2024

Visualization by Nathan Yau, FlowingData

Beyond making fancy visuals, the data can be used to answer many questions:

  • Who spends more time cleaning up after meals?
  • Where in India do people spend more time on commuting?
  • Is there a correlation between income and time spent on leisure?

Suppose you wanted to answer one of these questions. How would you go about it?

First you have to create an account on the MoSPI portal.

Then this is what you get when you download the data.

To map the code in the data to an actual activity, you need to go through multiple documents.

What could have been a single step takes 3 steps.

What would fix this?

  • Do not require users to create an account to access the data
  • Do not require users to open an Excel file with lakhs of rows and dozens of columns
  • Do not require users to refer to multiple documents to convert codes to words

Publicly accessible, browsable on the web, and a single file

The data pipelines are replicable; do this for yourself!
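For instance, once the activity-code lookup is compiled into a single table, the code-to-words step collapses into one join. A sketch with pandas (the file and column names are assumptions):

import pandas as pd

survey = pd.read_csv("tus_2024.csv")       # survey rows with an activity_code column
codes = pd.read_csv("activity_codes.csv")  # lookup: activity_code -> activity label

# One join replaces paging through multiple reference documents.
labelled = survey.merge(codes, on="activity_code", how="left")
print(labelled[["activity_code", "activity"]].head())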

A web interface for the Excel file is good, but it can be even better.

  • Filters
  • Complex aggregations
  • SQL queries

View the raw data.

Run time-analysis queries.

Who spends more time cleaning up after meals?
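A question like this reduces to one aggregation over the flat file. A DuckDB sketch, where the column names mirror the explorer’s URL parameters but the file name and activity label are assumptions:

import duckdb

# Each row is one 30-minute activity slot reported by one respondent.
q = """
SELECT gender,
       COUNT(*) * 0.5 AS person_hours
FROM 'tus_2024.csv'
WHERE activity_code = 'Cleaning up after meals'
GROUP BY gender
ORDER BY person_hours DESC
"""
print(duckdb.sql(q).df())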

Linkable and shareable URLs.

Districts with more incidental sleep/naps:

https://diagramchasing.fun/2025/time-use-explorer?viewMode=time_analysis&filters=%5B%22activity_code%7C%3D%7CIncidental+sleep%2Fnaps%22%5D&columns=%5B%22gender%22%2C%22age%22%2C%22state%22%2C%22district%22%2C%22activity_code%22%2C%22education%22%2C%22time_from%22%2C%22time_to%22%5D&demographic=%5B%22district%22%5D&activity=activity_code&agg=%5B%7B%22column%22%3A%22*%22%2C%22function%22%3A%22COUNT_DISTINCT_PERSON%22%7D%5D

Data is useful only when people use it.

How many people would use the data if they had to:

  • Create an account, open an Excel file, refer to multiple documents, and know how to code, even for the most basic queries.

versus

  • Open a website and click some buttons.

Data commons in India

MoSPI is one government entity that publicly releases data.

There are many such government entities. Most of them are worse than MoSPI.

How many BYD Cars were registered in Bengaluru in 2025?

MoRTH has vehicle registration statistics

https://vahan.parivahan.gov.in/vahan4dashboard/

Problem solved!

Haha, not really.

5 steps on a convoluted UI

…to access a single data point

  • Designed for use in a specific way
  • No/minimal documentation
  • Data quality issues
  • No bulk data access

Open data + 24 hours of website development.

https://india-vehicle-stats.pages.dev/KA/ALL/2025?name=BYD+INDIA+PRIVATE+LIMITED

Open data = freedom to use it as you want.

https://github.com/Vonter/india-vehicle-stats

Survey of India creates the official government maps

They make the process of accessing them very inviting.

OpenStreetMap in comparison…

Open data deserves good design and thought

It’s never been easier to spin up a dashboard for your open data with a few prompts.

It’s also never been easier for your audience to ignore yet another dashboard.

When the commons are sparse, what we do within them matters even more

The scarcity of open data in India makes poor execution particularly costly.

Design decisions are visual and systemic

Formats dictate participation

PDF / Dashboards are “read only”

You can only participate by observing.

CSV / JSON / API are open

You can participate by observing and building.

We love to bash PDFs, but do we do the same for dashboards?

The platform is always just one view of the dataset.

Most dashboards, including government ones, pretend to be the final answer.

LOOK NO FURTHER THAN ME!

Instead,

Think about the various kinds of users your data might attract. Can you make their lives easier?

Show them one way to slice the data. Get them thinking of more!

We extensively document all data releases because we want users to use this data.

No guesswork.

Detailed, ready-to-go notebooks!

Documentation promotes use

Documentation promotes use

Documentation promotes use

Who are we building for?

If there is a national disaster, the NDEM website hopes you are on a laptop.

Who are we building for?

Zero-login is a feature

Public data means public. There is already enough friction in getting people interested in information; more barriers lose more people.


https://www.mdpi.com/2076-3387/13/11/229

Archival

Websites regularly disappear.

Indian government websites especially so.

What to do about this?

After CBFC broke their own website, we experimented with mirroring the censor certificates on archive.org

Mirroring to archive.org turned out to be straightforward.
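The internetarchive Python package does most of the work. A sketch (the identifier and metadata here are illustrative, not our actual item names):

from internetarchive import upload

responses = upload(
    "cbfc-certificate-example",          # illustrative identifier
    files=["certificate.pdf"],
    metadata={
        "title": "CBFC Certificate (example)",
        "mediatype": "texts",
        "collection": "opensource",
    },
)
print(responses[0].status_code)  # 200 on success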

After CBFC, we mirrored documents from a bunch of other sites:

  • Parliament
  • State Assemblies
  • Court Judgements
  • Karnataka State Archives

archive.org automatically makes documents findable and searchable.

What was said in Parliament about the Cuban Missile Crisis?

Government data should be about more than just numbers and stats.

Government data is the result of government processes.

  • Government Orders
  • Gazette Notifications
  • Legislative Proceedings
  • State Archives

Open-data and LLMs

How many districts are there in India?

Everyone has a different answer.

Wikipedia says 780

Local Government Directory says 778

Google/Gemini says 780 to 806

Most people don’t go beyond the AI result

  • A good AI result is preferred over a bad one
  • AI needs to be trained on data to be able to respond
  • AI trained on open data is preferred over AI trained on unknown data

Risks when using an LLM

  • No attribution, but some people cite the LLM response as a source
  • An LLM is not a fact checker, but some people use it as one
  • LLMs are not “intelligent”, but some people treat them as such

But these issues are not unique to open data; they apply to all data.

Even if the data is not open, LLM developers still use it to train their models.

LLMs for code help us do more

As a team of two with day jobs, we lean on LLM assistance for code to maintain our pace.

But we have strict rules

  1. Grunt work only. We use LLMs for code and data cleaning, never for writing, analysis, or creating art.

  2. Open source. We publish not just the code, but the prompts and the raw data. The pipeline must be reproducible.

  3. Transparency. If a dataset was cleaned or summarized by AI, the metadata and UI must explicitly say so.

Upcoming stories

  • How does India spend its time?
  • Analyzing the frontpages of major Indian newspapers.
  • More tools and stories on the weather, trees, public transport.

Keep us in your bookmarks!

Thanks for listening!

diagramchasing.fun