File Server Scraping

In Python, a simple way to do file server scraping (or dumpster diving) is to walk the directory tree with os.walk and keep only the file names that match a regular expression, something like

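import os
import re

import numpy as np

def find_OpPt_CDFs(obs, ver, targ_dir):
    # Raw string keeps the backslashes intact; '\.' and '$' stop names such as
    # 'xcdf' or '.cdf.bak' from slipping through
    file_pattern = re.compile(r'%s_d[ei]s\d_engr_l1_sigthrsh_\d{14}_v%s\.cdf$' % (obs, ver))

    print('scraping in %s' % targ_dir)

    filtered_files = []
    # os.walk visits targ_dir and every subdirectory beneath it
    for root, dirs, files in os.walk(targ_dir):
        for name in files:
            if file_pattern.match(name):
                # Normalize Windows backslashes to forward slashes
                full_filename = os.path.join(root, name).replace('\\', '/')
                filtered_files.append(full_filename)
    return np.array(filtered_files)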

Note that the pattern passed to re.compile uses:

    %s   - placeholder filled in with obs and ver by the % operator
    [ei] - either 'e' or 'i'
    \d   - any single digit
    \d{14} - exactly 14 digits in a row
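
To see how those pieces behave, here is a small self-contained sketch (Python 3) that builds a throwaway directory tree and scrapes it. The obs and ver values, the file names, and the scrape helper are all made up for illustration; they are not taken from a real server. This sketch also assumes obs and ver are literal strings, so it passes them through re.escape, escapes the dot before cdf, and anchors the end of the name with $.

```python
import os
import re
import tempfile

# Hypothetical values for illustration only.
obs, ver = "obs1", "1.0.0"

# Same pattern shape as in the post. Assumption: obs and ver are literal
# strings, so re.escape keeps the dots in "1.0.0" from acting as wildcards.
file_pattern = re.compile(
    r"%s_d[ei]s\d_engr_l1_sigthrsh_\d{14}_v%s\.cdf$" % (re.escape(obs), re.escape(ver))
)

# Made-up file names: the first two should match, the rest should not.
sample_names = [
    "obs1_des1_engr_l1_sigthrsh_20150101120000_v1.0.0.cdf",      # [ei] -> 'e'
    "obs1_dis2_engr_l1_sigthrsh_20150101120000_v1.0.0.cdf",      # [ei] -> 'i'
    "obs1_dxs1_engr_l1_sigthrsh_20150101120000_v1.0.0.cdf",      # 'x' is not in [ei]
    "obs1_des1_engr_l1_sigthrsh_201501011200_v1.0.0.cdf",        # only 12 digits
    "obs1_des1_engr_l1_sigthrsh_20150101120000_v1.0.0.cdf.bak",  # extra suffix
]

def scrape(targ_dir):
    """Walk targ_dir and return the sorted paths whose file name matches."""
    hits = []
    for root, _dirs, files in os.walk(targ_dir):
        for name in files:
            if file_pattern.match(name):
                hits.append(os.path.join(root, name).replace("\\", "/"))
    return sorted(hits)

# Build a temporary tree with the sample names spread over two levels,
# then scrape it before the directory is cleaned up.
with tempfile.TemporaryDirectory() as top:
    sub = os.path.join(top, "nested")
    os.mkdir(sub)
    for i, name in enumerate(sample_names):
        folder = top if i % 2 == 0 else sub
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(p) for p in scrape(top)]

print(found)
```

Only the two names with 'e' or 'i' after the 'd', a full 14-digit timestamp, and nothing after .cdf should come back, no matter which level of the tree they sit in.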
