Hi all and thanks in advance for the help,
I am trying to learn Python but have little to no experience writing programs (I do use some SAS and Stata). A while back I took a crack at writing a Python program to extract some data from the International Trade Commission's website, and I ended up getting help from someone on this forum. They pretty much threw out my old code and rewrote it, so I used it without really knowing what it was doing. I am now trying to edit this code to fix a few problems, and I was wondering if someone could go through the somewhat menial task of telling me what everything in the code actually does. It is a very short loop, so I don't think it would take much time, but knowing what the syntax and operators mean and do would be invaluable to me. Thanks if you can help out!
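import re
import urllib2

outfile = open("1.Raw_Data.txt", "w")
# Captures the text between "Inv. #" and the last colon on the same line.
reg_title = re.compile(r'Inv.\s+#(.*):')
# Text after each blue (#0000ff) font tag; whatever followed (.*?) seems to have been lost when posting.
reg_info = re.compile(r'#0000ff">(.*?)')
# Captures the document ID at the end of each case link on an index page.
reg_url = re.compile(r'<a href="/ouii/public/337inv.nsf/56ff5fbca63b069e852565460078c0ae/(\w+)')
initseg = "http://info.usitc.gov/ouii/public/337inv.nsf/56ff5fbca63b069e852565460078c0ae/"
endseg = "?OpenDocument"
CASES = []
# Walk the index view in steps of 30: Start=1, 31, ..., 751.
for start in range(1, 752, 30):
    htmlAll = urllib2.urlopen("http://info.usitc.gov/ouii/public/337inv.nsf/All?OpenView&Start=%d" % start).read()
    found = reg_url.findall(htmlAll)
    CASES.extend(found)  # the posted code never added the IDs it found to CASES
for case in CASES:
    complete_url = initseg + case + endseg
    page = urllib2.urlopen(complete_url).read()
    # One pipe-delimited record per case: |title|field|field|...
    line = "|" + "|".join([reg_title.search(page).group(1)] + reg_info.findall(page))
    outfile.write(line + "\n")
    print line
outfile.close()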
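
To see what each piece does without hitting the website, here is a minimal offline sketch. It runs the same patterns (with the backslashes in `\s` and `\w` written out) against two made-up HTML snippets. The snippets, the ID `ABC123`, the case number, and the field values are all hypothetical, and the trailing `<` on `reg_info` is an assumption about what the original pattern ended with, since `(.*?)` with nothing after it only ever matches an empty string.

```python
import re

# Patterns from the script above; the trailing "<" on reg_info is an assumption.
reg_title = re.compile(r'Inv.\s+#(.*):')
reg_info = re.compile(r'#0000ff">(.*?)<')
reg_url = re.compile(
    r'<a href="/ouii/public/337inv.nsf/56ff5fbca63b069e852565460078c0ae/(\w+)')

# Made-up snippets standing in for the downloaded HTML (not real USITC markup).
sample_index = ('<a href="/ouii/public/337inv.nsf/'
                '56ff5fbca63b069e852565460078c0ae/ABC123?OpenDocument">')
sample_page = ('<b>Inv. #337-TA-000: Certain Example Widgets</b>'
               '<font color="#0000ff">Example Corp.</font>'
               '<font color="#0000ff">2009-01-15</font>')

# range(1, 752, 30) yields 1, 31, 61, ..., 751: one Start value per index page.
starts = list(range(1, 752, 30))

# findall returns every captured group; \w+ stops at the "?" after the ID.
case_ids = reg_url.findall(sample_index)

# search finds the first match; group(1) is the text inside the parentheses.
title = reg_title.search(sample_page).group(1)

# The non-greedy (.*?) stops at the first "<" after each blue font tag.
fields = reg_info.findall(sample_page)

# [title] + fields concatenates two lists; "|".join glues the items with pipes.
record = "|" + "|".join([title] + fields)
print(record)
```

With these sample strings the record comes out as `|337-TA-000|Example Corp.|2009-01-15`, which is the shape of each line the script writes to `1.Raw_Data.txt`. The sketch uses `print(...)` with parentheses so it runs under both Python 2 and Python 3.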