I updated newmusic.jasonrparadis.xyz with more lists – it now scrapes the following:
John Peel Festive 50 1976 – 2004
NME, Select, Sounds, The Face, The Wire – End of year lists
punk-disco.com – German/Swiss/Austrian punk/new wave lists
rateyourmusic.com – some DIY/80s Cassette Culture lists, plus some users’ best-of lists.
Scraping the last one caused some issues – any malformed request gets your IP instantly banned, and you have to file a manual request to get unbanned. So far I’ve been banned for three days with no fix, just because I forgot to add a User-Agent to the headers ONE TIME. Silly. It’s not even their lists, it’s their users’. Luckily I have a cheap VPN, so I used that to get around the ban, and I kept my request rate low so it didn’t happen again. Here’s my script:
import requests
from bs4 import BeautifulSoup
import time
import random

# url pattern is easy to figure out, just set range to the number of pages from the list you choose
pages = [f'https://rateyourmusic.com/list/king0elizabeth/king0elizabeths-obscure-music-recommendations/{x}/' for x in range(1, 101)]

for each in pages:
    response = requests.get(each, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0'})
    soup = BeautifulSoup(response.text, 'lxml')
    rows = soup.find_all('tr')
    for row in rows:
        # skip rows that aren't entries - real entries have both an h2
        # (band name) and an h3 (album name)
        if row.h3 is None or row.h2 is None:
            continue
        bandname = row.h2.text
        albumname = row.h3.text.replace('\n', ' ')
        print(f'?. {bandname} - {albumname}')
    time.sleep(random.randrange(10, 25))  # keep the request rate low so the IP doesn't get banned again
This prints the list to the console in the form “?. Artist – Album / Single”. Since there’s so much manual checking to do – sites like rocklist are inconsistent in how they format their lists – it’s easier to just copy and paste from the console into the file I use to build the database. I created another script to convert the inconsistent ones to the proper format:
'''
swaps format "<rank>. <song/album> - <artist>" to "<rank>. <artist> - <song/album>" so it will import
into the database the same as the other lists
'''
with open('converter.txt', 'r', encoding='utf-8') as f:
    imports = f.readlines()

data = []
for each in imports:
    # '~' marks a list header line - pass it through untouched
    if '~' in each:
        print(each)
        continue
    each = each.replace('\n', '')
    ranksplit = each.split('. ')
    rank = ranksplit[0]
    namesplit = ranksplit[1].split(' - ')
    title = namesplit[0]
    artist = namesplit[1]
    print(f'{rank}. {artist} - {title}')
    data.append(f'{rank}. {artist} - {title}')
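One thing the converter above trips on is a title that itself contains ” – “, since it splits on the first occurrence. A small sketch of a more forgiving version (assuming the artist name never contains ” – “, so splitting from the right is safe; `swap_line` is my own name for it):

```python
def swap_line(line):
    # "12. Title - Artist" -> "12. Artist - Title"
    # split the rank off first, then split title/artist on the LAST ' - ',
    # so any dashes inside the title survive (assumes the artist has none)
    rank, rest = line.rstrip('\n').split('. ', 1)
    title, artist = rest.rsplit(' - ', 1)
    return f'{rank}. {artist} - {title}'

print(swap_line('5. Oh Bondage Up Yours! - X-Ray Spex'))
# -> 5. X-Ray Spex - Oh Bondage Up Yours!
```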
And this is the script I used for punk-disco.com – luckily they don’t care about scraping; the site looks like it’s from the Angelfire/GeoCities era lmao
import requests
from bs4 import BeautifulSoup
'''
scrapes and prints punk-disco.com bands and album names to console in a format that works for my import script -
there's still some minor formatting errors, like odd spaces and some characters not encoding properly,
but I think that's just because it's an old site or something to do with it being in German. fixing it in Notepad++ works well enough
'''
germanlists = ['http://www.punk-disco.com/NDW%20A-E.htm',
               'http://www.punk-disco.com/NDW%20F-J.htm',
               'http://www.punk-disco.com/NDW%20K-O.htm',
               'http://www.punk-disco.com/NDW%20P-S.htm',
               'http://www.punk-disco.com/NDW%20T-Z.htm',
               'http://www.punk-disco.com/A-E.htm',
               'http://www.punk-disco.com/F-J.htm',
               'http://www.punk-disco.com/K-O.htm',
               'http://www.punk-disco.com/P-T.htm',
               'http://www.punk-disco.com/U-Z.htm',
               'http://www.punk-disco.com/Compilations.htm',
               'http://www.punk-disco.com/Punk-Disco-CH/A-M%20CH.htm',
               'http://www.punk-disco.com/Punk-Disco-CH/N-Z%20CH.htm',
               'http://www.punk-disco.com/Punk-Disco-CH/Compilations%20CH.htm',
               'http://www.punk-disco.com/Punk-Disco%20AUSTRIA.htm']

for each in germanlists:
    print(f'~\n{each}')  # used for splitting lists in import.py
    response = requests.get(each)
    soup = BeautifulSoup(response.text, 'lxml')
    tables = soup.find_all('table')
    for row in tables:
        if '<p>' in str(row):
            if row.text.strip() == '':  # '==', not 'is' - identity checks on strings aren't reliable
                continue
            bandalbum = row.find_all('b')[0].text
            # strip any parenthesized note out of the band/album text
            if '(' in bandalbum and ')' in bandalbum:
                cleantext = bandalbum.split('(')[1].split(')')[0]
                bandalbum = bandalbum.replace(f'({cleantext})', '')
            bandalbum = bandalbum.strip().replace('\n', '').replace(':', '-').title()
            print(f'?. {bandalbum}')
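About the characters not encoding properly: my guess is that it’s less the site being old or German and more requests guessing the wrong charset – when an old server doesn’t declare one, requests falls back to ISO-8859-1, which garbles umlauts. A hedged sketch of a fix (`fetch_decoded` is my own helper name, not part of my scripts above):

```python
import requests

def fetch_decoded(url):
    # hypothetical helper: re-detect the charset from the page bytes
    # instead of trusting the (possibly missing) Content-Type header,
    # so the text is already clean before BeautifulSoup parses it
    response = requests.get(url)
    response.encoding = response.apparent_encoding
    return response.text

# offline illustration of the mismatch: UTF-8 bytes read as ISO-8859-1
raw = 'Kästchen'.encode('utf-8')
print(raw.decode('iso-8859-1'))  # 'KÃ¤stchen' - the kind of junk I was fixing by hand
print(raw.decode('utf-8'))       # 'Kästchen'
```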
WordPress code blocks really messed up Python formatting. Oh well. Something else to fix in the future, I guess.
This is my import script. I paste the scraper output into import.txt, run this to turn it into a database, and fix any errors it reports until there are none.
import sqlite3

with open('import.txt', 'r', encoding='utf-8') as f:
    imports = f.read()

sql = sqlite3.connect('musiclists.db')
cur = sql.cursor()
count = 0
for alist in imports.split('~\n'):
    # the first line after each '~' separator is the list name, which becomes the table name
    table = alist.split('\n')
    tablename = table[0]
    cur.execute(f'CREATE TABLE IF NOT EXISTS "{tablename}" (rank TEXT, artist TEXT, albumorsong TEXT)')
    for albumsingle in table[1:]:
        try:
            if albumsingle == '':
                continue
            ranksplit = albumsingle.split('. ')
            rank = ranksplit[0]
            # rejoin with '. ' so a title that itself contains '. ' survives the split
            namesplit = '. '.join(ranksplit[1:]).split(' - ')
            artist = namesplit[0]
            title = namesplit[1]
            cur.execute(f'INSERT INTO "{tablename}" VALUES (?, ?, ?)', [rank, artist, title])
            #print(f'inserted {rank}, {artist}, {title}')
        except Exception as e:
            print(f'error in {tablename}: {albumsingle}, {e}')
            count += 1
    print(f'{tablename} finished')
if count == 0:
    print('success, all good format')
else:
    print('errors')
sql.commit()
sql.close()
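After an import run, a quick sanity check I’d add (a sketch against the same musiclists.db layout) is a row count per table – a list that only half-imported stands out immediately:

```python
import sqlite3

sql = sqlite3.connect('musiclists.db')
cur = sql.cursor()
# fetchall() first, so the COUNT queries below don't reset the cursor
# while we're still iterating over the table names
tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name").fetchall()
for (name,) in tables:
    (rows,) = cur.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()
    print(f'{name}: {rows}')
sql.close()
```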
Originally I might have had an issue with a goofy SQL query here, but after ~reading the documentation~, I think I’ve got it sorted out as best I can without redoing table names.
from flask import Flask, render_template, request
import sqlite3
import random

app = Flask(__name__)

@app.route('/database', methods=['GET', 'POST'])
def main():
    sql = sqlite3.connect('musiclists.db')
    cur = sql.cursor()
    if request.method == 'POST':
        # table name picked from the dropdown; table names can't be bound
        # as ? parameters, so it gets interpolated directly
        tablefix = request.form.get('tables')
        data = cur.execute(f"SELECT * FROM '{tablefix}'").fetchall()
    elif request.method == 'GET':
        data = []
        tablefix = ''
    alltables = cur.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
    alltables.sort()
    sql.close()
    # sum(alltables, ()) flattens the list of 1-tuples into a flat tuple of names
    return render_template('index.html', current=tablefix, output=data,
                           tables=list(sum(alltables, ())))

@app.route('/')
def makeplaylist():
    sql = sqlite3.connect('musiclists.db')
    cur = sql.cursor()
    alltables = cur.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
    playlist = []
    # build a 15-track playlist: pick a random list, then a random entry from it
    for each in range(15):
        tablefix = random.choice(sum(alltables, ()))
        data = cur.execute(f"SELECT * FROM '{tablefix}'").fetchall()
        playlist.append(list(random.choice(data)) + [tablefix])
    sql.close()
    return render_template('playlist.html', data=playlist)

# just for local debugging
if __name__ == '__main__':
    app.run()
This is my main Flask script that runs the whole thing. I should probably try making my own lists now – I already have https://music.jasonrparadis.xyz/bestof, which could be turned into a list easily enough. Adding a favorites button to the new site, plus a best-of page built from that, might be another thing to do.
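Since SQLite can’t bind table names as ? parameters (which is why the SELECTs above interpolate them into the query string), one way to guard the POSTed name would be a check against sqlite_master first – a sketch, with `safe_table` being my own hypothetical helper name:

```python
import sqlite3

def safe_table(cur, name):
    # hypothetical helper: table names can't be passed as ? placeholders,
    # so before dropping a user-supplied name into an f-string query,
    # make sure it's a table that actually exists in the database
    tables = {t[0] for t in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    return name if name in tables else None
```

In the /database route, a None return would mean the POSTed name isn’t a real table, so the SELECT can be skipped instead of executed.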
I run all this in a Docker image and just rebuild the image every time I do an update. It works quite well, and it reminds me again of how great Docker and containers are. I do occasionally see some lag – I’m not sure if it’s my cheap VPS (only 2 GB of RAM) or some kind of hibernation thing, since it only seems to happen after the site hasn’t been used for a while; after that it works fine. It’s still better than running on my old Windows VPS, that’s for sure.
Here’s some more music I’ve found using this tool:
I really like music – all of it, with amateur/”folk” music (i.e. talent is mostly irrelevant if it’s catchy or interesting) held in especially high regard. The only things I really don’t like are sappy ballads and overproduced modern music – but that’s probably obvious, considering I’m listening to digital rips of cassettes in 2021, haha. Switching my computer to 5.1 sound seems to help with the noise on some of this stuff; I’m not sure why that is.