Namepy step 6 – Load the data into the database
(This is part of the namepy project. Start at Namepy – on the shoulders of giants)
We will need the following data in the database:
- Name frequencies – baby names by year
- Scrabble™ letter values, by (country) Scrabble™ set
Name frequencies
- Download the data from https://www.ssa.gov/oact/babynames/names.zip
- Unzip it in <project_root>/raw_data/yob1880.txt, etc
- You may want to add raw_data to .gitignore, so it doesn’t get stored in your git repo
- Create some code to read files and store in PostgreSQL – read_name_frequencies.py
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253
def
read_frequencies_from_file(filename, names):
print
(filename)
year
=
int
(filename[
3
:
7
])
year_frequencies
=
{}
for
name
in
names:
year_frequencies[name]
=
{
'F'
:
0
,
'M'
:
0
}
with
open
(
'raw_data/{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}s'
{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb} filename) as
file
:
for
line
in
file
.readlines():
try
:
name_text, sex, count
=
line.split(
","
)
except
:
print
(
"Couldn't parse line"
)
print
(line)
print
continue
if
name_text
not
in
names:
name
=
Name(name
=
name_text)
db.session.add(name)
db.session.commit()
names[name_text]
=
name.
id
year_frequencies[name_text]
=
{
'F'
:
0
,
'M'
:
0
}
year_frequencies[name_text][sex]
=
int
(count)
for
name, name_frequency
in
year_frequencies.iteritems():
if
name_frequency[
'F'
]
+
name_frequency[
'M'
]:
name_id
=
names[name]
frequency_record
=
NameFrequency(name_id
=
name_id,
year
=
year,
boys_count
=
name_frequency[
'M'
],
girls_count
=
name_frequency[
'F'
])
db.session.add(frequency_record)
db.session.commit()
def
read_name_frequencies():
db.create_all()
# Start with an empty list
print
(
"Deleting any previous data"
)
db.session.query(NameFrequency).delete()
db.session.query(Name).delete()
db.session.commit()
print
(
"Done"
)
names
=
{}
# Get file list
for
filename
in
listdir(
'raw_data'
):
if
filename[:
3
]
=
=
'yob'
:
read_frequencies_from_file(filename, names)
- Run the code. Note that this may take a while to run. On my development machine it took about 8 minutes
- Check this in the database, for instance with phpPgAdmin or pgAdmin
Scrabble™ letter values
For a list of Scrabble™ letter values by Scrabble™ set see this Wikipedia entry. The following code will grab this page, extract the letter values and save this in the database
- Add the new tables to models.py
12345678910
class
Set
(db.Model):
id
=
db.Column(db.Integer, primary_key
=
True
)
name
=
db.Column(db.String())
scores
=
db.relationship(
'LetterScore'
, backref
=
'set'
, lazy
=
'dynamic'
)
class
LetterScore(db.Model):
id
=
db.Column(db.Integer, primary_key
=
True
)
set_id
=
db.Column(db.Integer, db.ForeignKey(
'set.id'
))
score
=
db.Column(db.Integer)
letter
=
db.Column(db.String(
1
))
- Write some code to parse this page and store the results in the database. See the source code in my GitHub repo
- Run the code
- Check the results in the database. See above (end of name frequencies section) for some suggested tools
Done
- Source code at https://github.com/CoachCoen/namepy/tree/step6
- There are no changes to the front end, so there is no working webpage for this step
Next
Time to pull it all together and show some real charts
Continue to Step 7 – Bringing it all together