#19 Let’s Make Something
date
May 23, 2023
slug
19-make
status
Published
tags
summary
Learning XML to parse the Wikipedia data dump
time
0h-30m
type
Post
It’s been a while since I began this project. Looking back, I am not particularly satisfied with the progress I’ve made so far. It’s not terrible, but it’s not great either. Sure, I’ve learned a few things, but I haven’t yet put any of them to use. I want to move on and start making and breaking stuff soon. I will still continue to learn and finish more spreadsheet courses on DataCamp along the way.
Today, I came across the Simple English Wikipedia data dump, which is around 1 GB unzipped and contains about 230K articles. It would be interesting to work with this and find some interesting results. The data is a single XML file.
I am not 100% sure what I want to do with it yet. It would be interesting to parse it into Markdown and open it as an Obsidian vault to view the connections graph, or to extract the metadata and build visualizations on top of it using d3.js.
XML
- Extensible Markup Language
- Uses `<tags>` to format and structure data in a tree-like structure.
- The largest top-level element, which contains all other elements, is called the root.
- Attributes are name-value pairs
    - Each attribute can only have a single value
    - Each attribute can appear at most once
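A quick sketch of these ideas with Python’s built-in `xml.etree.ElementTree` (the toy `<library>` document below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# A tiny XML document: <library> is the root element that contains
# everything else, and each <book> carries name-value attribute pairs.
doc = """
<library>
    <book id="1" lang="en">
        <title>A Study in Scarlet</title>
    </book>
    <book id="2" lang="en">
        <title>The Sign of the Four</title>
    </book>
</library>
"""

root = ET.fromstring(doc)
print(root.tag)  # library

# Each attribute has exactly one value, read with .get()
for book in root:
    print(book.get("id"), book.find("title").text)
```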
XML submodules
- `xml.sax` → Simple API for XML (doesn’t load the file into RAM)
- `xml.dom` → Document Object Model
- `xml.etree` → ElementTree
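Since the dump is ~1 GB, `xml.sax` looks like the right fit: it streams the file and fires callbacks per element instead of building the whole tree in memory. A minimal sketch, where the `<page>`/`<title>` element names and the sample document are assumptions for illustration, not the actual dump schema:

```python
import io
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":          # hypothetical element name
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False

# Stand-in for the real dump file; xml.sax.parse also accepts a filename.
sample = "<pages><page><title>Earth</title></page><page><title>Moon</title></page></pages>"
handler = TitleHandler()
xml.sax.parse(io.StringIO(sample), handler)
print(handler.titles)  # ['Earth', 'Moon']
```

The trade-off versus `xml.etree` is that SAX never gives you a tree to query; you keep only the state you accumulate in the handler.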
XPath → XML Path Language
- ElementTree’s `.findall()` takes an XPath expression and traverses the tree, returning every matching element.
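A short example of `.findall()` with the limited XPath subset ElementTree supports (paths and `[@attr='value']` predicates); the `<catalog>` document is invented for illustration:

```python
import xml.etree.ElementTree as ET

doc = """
<catalog>
    <item kind="fruit"><name>apple</name></item>
    <item kind="veg"><name>carrot</name></item>
    <item kind="fruit"><name>pear</name></item>
</catalog>
"""
root = ET.fromstring(doc)

# Path expression: every <name> under an <item> under the root.
names = [e.text for e in root.findall("./item/name")]
print(names)  # ['apple', 'carrot', 'pear']

# Attribute predicate: only <item> elements where kind="fruit".
fruits = root.findall("./item[@kind='fruit']/name")
print([e.text for e in fruits])  # ['apple', 'pear']
```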