Namepy step 6 – Load the data into the database
(This is part of the namepy project. Start at Namepy – on the shoulders of giants)
We will need the following data in the database:
- Name frequencies – baby names by year
- Scrabble™ letter values, by (country) Scrabble™ set
Name frequencies
- Download the data from https://www.ssa.gov/oact/babynames/names.zip
- Unzip it in <project_root>/raw_data/yob1880.txt, etc
- You may want to add raw_data to .gitignore, so it doesn’t get stored in your git repo
- Create some code to read files and store in PostgreSQL – read_name_frequencies.py
def read_frequencies_from_file(filename, names): print(filename) year = int(filename[3:7]) year_frequencies = {} for name in names: year_frequencies[name] = {'F': 0, 'M': 0} with open('raw_data/{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}s' {d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb} filename) as file: for line in file.readlines(): try: name_text, sex, count = line.split(",") except: print("Couldn't parse line") print(line) print continue if name_text not in names: name = Name(name=name_text) db.session.add(name) db.session.commit() names[name_text] = name.id year_frequencies[name_text] = {'F': 0, 'M': 0} year_frequencies[name_text][sex] = int(count) for name, name_frequency in year_frequencies.iteritems(): if name_frequency['F'] + name_frequency['M']: name_id = names[name] frequency_record = NameFrequency(name_id=name_id, year=year, boys_count=name_frequency['M'], girls_count=name_frequency['F']) db.session.add(frequency_record) db.session.commit() def read_name_frequencies(): db.create_all() # Start with an empty list print("Deleting any previous data") db.session.query(NameFrequency).delete() db.session.query(Name).delete() db.session.commit() print("Done") names = {} # Get file list for filename in listdir('raw_data'): if filename[:3] == 'yob': read_frequencies_from_file(filename, names)
- Run the code. Note that this may take a while to run. On my development machine it took about 8 minutes
- Check this in the database, for instance with phpPgAdmin or pgAdmin
Scrabble™ letter values
For a list of Scrabble™ letter values by Scrabble™ set see this Wikipedia entry. The following code will grab this page, extract the letter values and save this in the database
- Add the new tables to models.py
class Set(db.Model): id = db.Column(db.Integer, primary_key=True) name = db.Column(db.String()) scores = db.relationship('LetterScore', backref='set', lazy='dynamic') class LetterScore(db.Model): id = db.Column(db.Integer, primary_key=True) set_id = db.Column(db.Integer, db.ForeignKey('set.id')) score = db.Column(db.Integer) letter = db.Column(db.String(1))
- Write some code to parse this page and store the results in the database. See the source code in my GitHub repo
- Run the code
- Check the results in the database. See above (end of name frequencies section) for some suggested tools
Done
- Source code at https://github.com/CoachCoen/namepy/tree/step6
- There are no changes to the front end, so there is no working webpage for this step
Next
Time to pull it all together and show some real charts
Continue to Step 7 – Bringing it all together