Description
Common Crawl (CC) is the non-profit organization that builds and maintains the largest publicly accessible dataset of the world's knowledge, encompassing petabytes of web crawl data.

If democratizing access to web information and tackling the engineering challenges of working with data at the scale of the web sound exciting to you, we would love to hear from you. If you have worked on open source projects before or can share code samples with us, please don't hesitate to send relevant links along with your application.
Primary Responsibilities
Running the crawl
Crawl engineering
- Providing netiquette features, such as honoring robots.txt (a minimal sketch follows this list), and load balancing the crawl across millions of domains
- Implementing and improving ranking algorithms to prioritize the crawling of popular pages
- Extending existing tools to work efficiently with large datasets
- Working with the Apache Nutch community to contribute crawler improvements back upstream
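To give a flavor of the netiquette work above, here is a minimal sketch of robots.txt compliance using Python's standard-library urllib.robotparser. Common Crawl's production crawler is Apache Nutch (Java), so this is illustrative only; the agent name "ccbot" and the example URLs are placeholders, not the real crawler's configuration.

```python
from urllib import robotparser

# Hypothetical user-agent token; placeholder only, not CC's actual setup.
AGENT = "ccbot"

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

url = "https://example.com/some/page"
if parser.can_fetch(AGENT, url):
    # Honor a Crawl-delay directive if the site declares one.
    delay = parser.crawl_delay(AGENT)
    print(f"fetch allowed; politeness delay: {delay or 'none declared'}")
else:
    print("disallowed by robots.txt; skip this URL")
```

A real crawler layers much more on top of this, such as per-host fetch queues and rate limiting so that politeness holds across millions of domains, which is exactly the load-balancing work described above.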
Other Responsibilities
- Collaborating with partners in academia and industry
- Engaging regularly with the user discussion group and responding to frequent inquiries about how to use CC data
- Writing technical blog posts
- Presenting on or representing CC at conferences, meetups, etc.
Qualifications
Minimum qualifications
Preferred qualifications
About Common Crawl
The Common Crawl Foundation is a California 501(c)(3) registered non-profit with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable. Our vision is a truly open web that enables open access to information and greater innovation in research, business, and education. We level the playing field by making wholesale extraction, transformation, and analysis of web data cheap and easy.