
DATA MINING, WEB SCRAPING, AND USING APIS:
start exploring.
A collection of tools for web scraping, interacting with APIs, and data mining.
Resource | Description | Tutorials to get you started | Road tested |
---|---|---|---|
Python | There are many websites for learning Python. | Mode Analytics, Codecademy | |
SQL | Creating, accessing, and manipulating relational databases through SQL is standard practice in industry. There are many websites for learning SQL. | Mode Analytics, Codecademy, Khan Academy | |
Mining the Social Web | A fantastic resource for data mining the social web. Includes chapters on mining Twitter, Facebook, LinkedIn, Google+, web pages, GitHub, and mailboxes. | ||
Arcas | Arcas is a Python tool designed to help with collecting academic articles from various APIs. | ||
Tabula | Tabula is a tool for scraping data tables locked inside PDF files. | ||
pandas | pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. | ||
pyNASA | pyNASA provides a simple interface to obtain NASA datasets and returns them as a pandas dataframe ready to use. | ||
Apache Tika | The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. | ||
PDFtables | Accurately convert PDF tables to Excel. | ||
morph.io | Over 5500 public scrapers, with lots of data, available for you to reuse, for free. Download data as a CSV or use the super-simple API. Scrapers can be written in Ruby, PHP, Python, Perl or Node.js. | Getting Started | |
Scrapy | An open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way. | ||
Kimono | Web text scraper - lets you turn websites into APIs in seconds | ||
OpenRefine | A powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data. | ||
Paperweight | A Python package for hacking LaTeX documents |
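
Several of the resources in the table above are scrapers. As a rough illustration of what pulling a table out of a web page involves, here is a minimal sketch that uses only `html.parser` from the Python standard library. It is not how any of the listed tools work internally, and the names `SAMPLE_HTML`, `TableScraper`, and `scrape_table`, as well as the sample data, are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser

# Hypothetical page content (made-up data). A real scraper would
# download the HTML first; it is inlined here so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Dataset</th><th>Rows</th></tr>
  <tr><td>alpha</td><td>120</td></tr>
  <tr><td>beta</td><td>45</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table cells in `html` as a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
# Pair each record with the header to get one dict per table row.
as_dicts = [dict(zip(header, record)) for record in records]
print(as_dicts)
```

For anything beyond a toy page (nested tables, malformed markup, pagination, rate limiting), a dedicated framework such as Scrapy is likely the better starting point.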
Explore and download data from MAST using the MAST API or astroquery library. Tutorial created by Ivelina Momcheva (@iva_momcheva). MAST API and astroquery modules developed by Clara Brasseur (@cebrasseur).
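
The SQL entry in the table above mentions creating, accessing, and manipulating relational databases. A minimal sketch of those three steps, assuming Python's built-in `sqlite3` module and an in-memory database, might look like the following. The `articles` table, its columns, the sample rows, and the helper names `build_demo_db` and `count_by_source` are all hypothetical and chosen only for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one small, made-up table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE articles (
            id     INTEGER PRIMARY KEY,
            title  TEXT NOT NULL,
            source TEXT NOT NULL,
            year   INTEGER NOT NULL
        )
        """
    )
    # Placeholder rows; none of these refer to real articles.
    conn.executemany(
        "INSERT INTO articles (title, source, year) VALUES (?, ?, ?)",
        [
            ("Example title A", "source_a", 2015),
            ("Example title B", "source_a", 2017),
            ("Example title C", "source_b", 2016),
            ("Example title D", "source_b", 2012),
        ],
    )
    conn.commit()
    return conn


def count_by_source(conn, since_year):
    """Access: count articles per source published in or after since_year."""
    cursor = conn.execute(
        """
        SELECT source, COUNT(*)
        FROM articles
        WHERE year >= ?
        GROUP BY source
        ORDER BY source
        """,
        (since_year,),
    )
    return cursor.fetchall()


conn = build_demo_db()
before = count_by_source(conn, 2015)

# Manipulate: correct the year of one row, then run the same query again.
conn.execute(
    "UPDATE articles SET year = ? WHERE title = ?",
    (2018, "Example title D"),
)
conn.commit()
after = count_by_source(conn, 2015)

print(before)
print(after)
```

The `?` placeholders let the database driver handle quoting, which matters when the values come from scraped or otherwise untrusted data.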