#19 Let’s Make Something
date
May 23, 2023
slug
19-make
status
Published
tags
summary
Learning XML to parse the Wikipedia data dump
time
0h-30m
type
Post
It’s been a while since I began this project. Looking back, I am not particularly satisfied with the progress I’ve made so far. It’s not terrible, but it’s not great either. Sure, I’ve learned a few things, but I haven’t yet put any of them to use. I want to move on and start making and breaking stuff soon. I will still continue to learn and finish more spreadsheet courses on DataCamp along the way.
Today, I came across the Simple English Wikipedia data dump, which is around 1 GB unzipped and contains about 230K articles. It would be interesting to work with this and find some interesting results. The data is a single XML file.
I am not 100% sure what I want to do with it yet. It would be interesting to parse it into Markdown and open it as an Obsidian vault to view the connections graph, or to extract the metadata and build visualizations on top of it using d3.js.
XML
- Extensible Markup Language
- Uses `<tags>` to format and structure data in a tree-like structure.
- The largest top-level element, which contains all other elements, is called the root.
- Attributes are name-value pairs
    - Each attribute can only have a single value
    - Each attribute can appear at most once
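A quick sketch of these ideas with Python’s built-in `xml.etree.ElementTree` (the toy `<library>` document below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# A tiny XML document: <library> is the root element that contains
# everything else, and each <book> carries name-value attribute pairs.
doc = """
<library>
    <book id="1" lang="en">
        <title>A Study in Scarlet</title>
    </book>
    <book id="2" lang="en">
        <title>The Sign of the Four</title>
    </book>
</library>
"""

root = ET.fromstring(doc)
print(root.tag)  # library

# Each attribute has exactly one value, read with .get()
for book in root:
    print(book.get("id"), book.find("title").text)
```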
XML submodules
- `xml.sax` → Simple API for XML (doesn’t load the file into RAM)
- `xml.dom` → Document Object Model
- `xml.etree` → ElementTree
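Since the dump is ~1 GB, `xml.sax` looks like the right fit: it streams the file and fires callbacks per element instead of building the whole tree in memory. A minimal sketch, where the `<page>`/`<title>` element names and the sample document are assumptions for illustration, not the actual dump schema:

```python
import io
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":          # hypothetical element name
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False

# Stand-in for the real dump file; xml.sax.parse also accepts a filename.
sample = "<pages><page><title>Earth</title></page><page><title>Moon</title></page></pages>"
handler = TitleHandler()
xml.sax.parse(io.StringIO(sample), handler)
print(handler.titles)  # ['Earth', 'Moon']
```

The trade-off versus `xml.etree` is that SAX never gives you a tree to query; you keep only the state you accumulate in the handler.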
XPath → XML Path Language
- ElementTree’s `.findall()` takes an XPath expression and traverses the tree, returning every matching element.
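A short example of `.findall()` with the limited XPath subset ElementTree supports (paths and `[@attr='value']` predicates); the `<catalog>` document is invented for illustration:

```python
import xml.etree.ElementTree as ET

doc = """
<catalog>
    <item kind="fruit"><name>apple</name></item>
    <item kind="veg"><name>carrot</name></item>
    <item kind="fruit"><name>pear</name></item>
</catalog>
"""
root = ET.fromstring(doc)

# Path expression: every <name> under an <item> under the root.
names = [e.text for e in root.findall("./item/name")]
print(names)  # ['apple', 'carrot', 'pear']

# Attribute predicate: only <item> elements where kind="fruit".
fruits = root.findall("./item[@kind='fruit']/name")
print([e.text for e in fruits])  # ['apple', 'pear']
```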