Crawl Engineer and Data Scientist wanted

Anywhere – Remote

Description

Common Crawl (CC) is the non-profit organization that builds and maintains the single largest publicly accessible dataset of the world’s knowledge, encompassing petabytes of web crawl data.

If democratizing access to web information and tackling the engineering challenges of working with data at the scale of the web sounds exciting to you, we would love to hear from you. If you have worked on open source projects before or can share code samples with us, please don't hesitate to send relevant links along with your application.

Primary Responsibilities

Running the crawl

  • Spinning up and managing Hadoop clusters on Amazon EC2
  • Running regular comprehensive crawls of the web using Nutch
  • Preparing and publishing crawl data to our data hosting partner, Amazon Web Services (a minimal publishing sketch follows this list)
  • Incident response and diagnosis of crawl issues as they occur, e.g.:
    - Replacing instances lost to EC2 problems or spot instance terminations
    - Responding to and remedying webmaster queries and issues
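
For the publishing step, here is a minimal Python sketch that uploads finished crawl segments to S3 with boto3; the bucket, prefix, and local directory are hypothetical placeholders, not Common Crawl's actual layout.

    # Sketch: publish finished .warc.gz segments to S3 using boto3.
    # Bucket, prefix, and paths below are hypothetical placeholders.
    import os
    import boto3

    s3 = boto3.client("s3")

    def publish_segment(local_dir, bucket, prefix):
        """Upload every .warc.gz file in local_dir under the given S3 prefix."""
        for name in sorted(os.listdir(local_dir)):
            if not name.endswith(".warc.gz"):
                continue
            key = prefix + "/" + name
            s3.upload_file(os.path.join(local_dir, name), bucket, key)
            print("uploaded s3://%s/%s" % (bucket, key))

    publish_segment("segments/example", "example-crawl-bucket", "crawl-data/example")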

Crawl engineering

  • Maintaining, developing, and deploying new features as required to run the Nutch crawler, e.g.:
    - Providing netiquette features as required, such as following robots.txt and load balancing a crawl across millions of domains (see the robots.txt sketch after this list)
    - Implementing and improving ranking algorithms to prioritize the crawling of popular pages
    - Extending existing tools to work efficiently with large datasets
    - Working with the Nutch community to push improvements to the crawler to the public
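
For a flavor of the netiquette work, here is a minimal sketch of a robots.txt check using Python's standard-library urllib.robotparser; the user agent and URL are placeholders, and a production crawler like Nutch handles this (plus caching and per-host politeness) itself.

    # Sketch: check robots.txt before fetching a URL (standard library only).
    # The user agent and URL are hypothetical placeholders.
    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    def allowed_to_fetch(url, user_agent="ExampleCrawler"):
        """Return True if the site's robots.txt permits this user agent to fetch url."""
        parts = urlsplit(url)
        rp = RobotFileParser()
        rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
        rp.read()  # fetch and parse robots.txt
        return rp.can_fetch(user_agent, url)

    print(allowed_to_fetch("https://example.com/some/page"))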

Other Responsibilities

  • Building support tools and artifacts, including documentation, tutorials, and example code or supporting frameworks for processing CC data using different tools (a minimal WARC-processing sketch follows this list).
  • Identifying and reporting on research and innovations that result from analysis and derivative use of CC data.
  • Community evangelism:
    - Collaborating with partners in academia and industry
    - Engaging regularly with the user discussion group and responding to frequent inquiries about how to use CC data
    - Writing technical blog posts
    - Presenting on or representing CC at conferences, meetups, etc.
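
As an example of the kind of supporting code mentioned above, here is a minimal sketch that iterates over response records in a locally downloaded CC WARC file; it assumes the third-party warcio package, and the file path is a placeholder.

    # Sketch: list the first few HTTP responses in a downloaded CC WARC file.
    # Assumes the third-party warcio package (pip install warcio); the path is a placeholder.
    from warcio.archiveiterator import ArchiveIterator

    def print_response_urls(warc_path, limit=10):
        """Print payload size and target URI for the first few response records."""
        seen = 0
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                uri = record.rec_headers.get_header("WARC-Target-URI")
                payload = record.content_stream().read()
                print(len(payload), "bytes", uri)
                seen += 1
                if seen >= limit:
                    break

    print_response_urls("example.warc.gz")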

Qualifications

Minimum qualifications
  • Fluent in Java (Nutch and Hadoop are core to our mission)
  • Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
  • Knowledge of the Amazon Web Services (AWS) ecosystem
  • Experience with Python
  • Basic Unix command-line knowledge
  • BS in Computer Science or equivalent work experience

Preferred qualifications

  • Experience with running web crawlers
  • Cluster computing experience (Hadoop preferred)
  • Running parallel jobs over dozens of terabytes of data
  • Experience committing to open source projects and participating in open source forums

About Common Crawl

The Common Crawl Foundation is a California 501(c)(3) registered non-profit with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable. Our vision is of a truly open web that allows open access to information and enables greater innovation in research, business and education. We level the playing field by making wholesale extraction, transformation and analysis of web data cheap and easy.

  • Start date: ASAP
  • From: Common Crawl
  • Published at: 05.10.2015
  • Contact person: Freelancer Map
  • Project ID: 994809
  • Contract type: Freelance
  • Workplace: 100% remote