Tags: scripting*

104 bookmark(s)

  1. Investigating the (often dubious) dealings of businessmen and politicians, our reporters need access to documents and databases from all over the world.

    To make their searches better, we're developing tools that make large amounts of data accessible with a single keystroke. We have built a set of crawlers that combine data from governments, corporations and other media into a search engine.

    However, these crawlers must deal with uncooperative websites in different languages, formats and structures, and they often break when pages are updated.

    After experimenting with some existing solutions, we decided to make a tool that encapsulates our experience with web crawling. The result is a lightweight open source framework named memorious (GitHub).

    memorious is simple, yet it lets you create and maintain a fleet of crawlers without imposing too much process on you.

    Schedule crawlers to run at regular intervals (or run them ad hoc as needed).
    Track error messages and warnings so admins can see which crawlers need maintenance.
    Use familiar tools like requests, BeautifulSoup, lxml or dataset to do the actual scraping.
    Distribute scraping tasks across multiple machines.
    Maintain an overview of your crawlers' status from the command line or a web-based admin interface.

    For common crawling tasks, memorious does all of the heavy lifting. One of our most frequent objectives is to follow every link on a large website and download all PDF files. To achieve this with memorious, all you need to write is a YAML file that plugs together existing components.
    A web-based admin interface allows you to keep track of the status of all of your crawlers.
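
    For the PDF-harvesting case above, the YAML file might look roughly like this. This is a sketch modelled on the layout of the example configs in the memorious documentation; the exact keys (`pipeline`, `method`, `handle`) and built-in method names should be checked against the current docs:

    ```yaml
    name: example_pdfs
    description: Follow every link on a site and store all PDF files
    schedule: weekly
    pipeline:
      init:
        # seed the crawler with one or more start URLs
        method: seed
        params:
          urls:
            - https://example.com/
        handle:
          pass: fetch
      fetch:
        # download each page, with caching and cookie persistence
        method: fetch
        handle:
          pass: parse
      parse:
        # follow links; send documents on to the store stage
        method: parse
        params:
          store:
            mime_group: documents
        handle:
          fetch: fetch
          store: store
      store:
        # write matched files to disk
        method: directory
        params:
          path: /data/example_pdfs
    ```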

    Each memorious crawler comprises a set of stages that call each other in succession (or themselves, recursively). Each stage either executes a built-in component or a custom Python function that may fetch, parse or store a page just as you like. memorious is also extensible, and contains lots of helpers to make building your own custom crawlers as convenient as possible.
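
    The stage-chaining idea can be illustrated with a toy sketch. This is not the real memorious API; the class and function names here are hypothetical stand-ins for how stages hand work to each other:

    ```python
    class Context:
        """Minimal stand-in for a crawler context that routes emits to stages."""
        def __init__(self, stages):
            self.stages = stages   # maps stage names to stage functions
            self.stored = []

        def emit(self, rule, data):
            # hand the data on to whichever stage the rule names
            self.stages[rule](self, data)

    def parse(context, data):
        # pretend we extracted two document links from the fetched page
        for url in [data["url"] + "/a.pdf", data["url"] + "/b.pdf"]:
            context.emit("store", {"url": url})

    def store(context, data):
        context.stored.append(data["url"])

    ctx = Context({"parse": parse, "store": store})
    ctx.emit("parse", {"url": "https://example.com"})
    ```

    In the real framework the routing between stages comes from the YAML config rather than a hand-built dictionary, and the context also provides the cached HTTP session.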

    These configurable chains of operations have made our crawlers very modular, and common parts are reused efficiently. All crawlers can benefit from automatic cookie persistence, HTTP caching and logging.

    Within OCCRP, memorious is used to feed documents and structured data into aleph via an API, which means documents become searchable as soon as they have been crawled. There, they will also be sent through OCR and entity recognition. aleph aims to use these extracted entities as bridges that link a given document to other databases and documents.

    For a more detailed description of what memorious can do, see the documentation and check out our example project. You can try memorious by running it locally in development mode, and, of course, we also have a Docker setup for robust production deployment.

    As we continually improve our crawler infrastructure at OCCRP, we'll be adding features to memorious for everyone to use. Similarly, we'd love input from the data journalism and open data communities; issues and PRs are welcome.
  2. In this article we will show you how to use a tool called Duplicity to back up and encrypt files and directories. In addition, using incremental backups will help save space.
  3. So what did I do? I wrote a small web service that parses the HTML of those websites and returns an RSS feed based on that, together with having it update regularly in the background and keeping some history of items. You can find it here: html-rss-proxy. The resulting RSS feeds seem to work very well in Liferea and Newsblur at least.
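
    The core idea, scraping items from a page and wrapping them in RSS, can be sketched with the Python standard library. This is not the actual html-rss-proxy code, just a minimal illustration of producing an RSS 2.0 document from scraped (title, link) pairs:

    ```python
    import xml.etree.ElementTree as ET

    def items_to_rss(title, link, items):
        """Wrap (title, link) pairs in a minimal RSS 2.0 feed."""
        rss = ET.Element("rss", version="2.0")
        channel = ET.SubElement(rss, "channel")
        ET.SubElement(channel, "title").text = title
        ET.SubElement(channel, "link").text = link
        for item_title, item_link in items:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = item_title
            ET.SubElement(item, "link").text = item_link
        return ET.tostring(rss, encoding="unicode")

    feed = items_to_rss("Example", "https://example.com",
                        [("Post 1", "https://example.com/1")])
    ```

    A real proxy would additionally fetch and parse the HTML, refresh in the background, and keep item history, as the service described above does.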
    by M. Fioretti (2016-08-27)
  4. Sort of a dramatized headline for what I've accomplished using the command-line Lynx browser, but not too far from the mark. I've described in previous entries how I've used lynx to accomplish similar goals of extracting target information from web pages, so this entry is a continuation along those same lines.

    I recently signed up for a prepaid cellular plan touted as being free, though it is limited to a certain (unreasonably low, for most) number of minutes per month. The plan has worked well for me so far. The only real issue I have come across is that I had not yet discovered any easy way to check how many minutes I've used and how many are left. The company providing the service is, of course, not very forthcoming with that sort of information: it has a vested interest in getting you to use up your free minutes, hoping you'll then realize you should buy a paid plan that includes more minutes. The only way I'd found to check current usage was to log in to their web site and click around until reaching a page showing that data.
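
    The extraction step the author performs with lynx can be sketched in Python's standard library: once the account page's HTML is in hand (the login step is omitted here), pull the minutes figure out of its text. The page structure and wording below are entirely hypothetical:

    ```python
    from html.parser import HTMLParser
    import re

    class MinutesParser(HTMLParser):
        """Collect the page's text and look for a 'minutes remaining' figure."""
        def __init__(self):
            super().__init__()
            self.text = []

        def handle_data(self, data):
            self.text.append(data)

        def minutes_left(self):
            m = re.search(r"(\d+)\s+minutes remaining", " ".join(self.text))
            return int(m.group(1)) if m else None

    # hypothetical fragment of the carrier's usage page
    page = "<html><body><p>You have 42 minutes remaining</p></body></html>"
    p = MinutesParser()
    p.feed(page)
    ```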
  5. Using wp-cli to automatically create my posts with whatever metadata I saw fit to give them. After gathering all the information in the script and capturing the rsync transfer output to a string, I could run the following commands to push all that content into the WordPress site (the domain names have been changed to protect the innocent):
    by M. Fioretti (2015-06-21)
  6. SQLite is a zero-configuration, server-less, file-based transactional database system. Due to its lightweight, self-contained, and compact design, SQLite is an extremely popular choice when you want to integrate a database into your application. In this post, I am going to show you how to create and access a SQLite database from a Perl script. The Perl code snippet I present is fully functional, so you can easily modify and integrate it into your project.
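
    The article's snippet is in Perl and is not reproduced here; for comparison, the same zero-configuration pattern looks like this with Python's built-in sqlite3 module:

    ```python
    import sqlite3

    # SQLite needs no server: connecting to a file path (or ":memory:")
    # creates the database on the fly.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    conn.execute("INSERT INTO people VALUES (?, ?)", ("Ada", 36))
    conn.commit()
    rows = conn.execute("SELECT name, age FROM people").fetchall()
    conn.close()
    ```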
  7. This tutorial explains how to build an RPM package from source code.
  8. Creating a daily reading list involves a few manual steps. In only a few minutes, you have the information you need for the day and you can use your mobile device for something more than just checking email and social media. On top of that, you might just be able to trim down the number of sources from which you get information and free up more time to do other things.
  9. If you have a new Linux VPS and are not familiar with how to manage your server, we will show you a few simple steps to connect to it for the first time and perform some of its basic operations. If you are ready, let's start.

    To connect to your Linux VPS, whether it runs CentOS, Ubuntu or another distribution, you can use SSH. SSH is a secure network protocol for data communication between two remote computers. This means you can securely access your server, execute commands and manage it.

    However, in order to connect to your server via SSH you will need to have an SSH client installed on your local computer. If you have Linux or Mac OS X installed on your computer, you can use the OpenSSH client.
    by M. Fioretti (2015-01-02)
  10. You can access all your TV and movies remotely on an Android device over a nice, secure SSH connection.


Online Bookmarks of M. Fioretti: tagged with "scripting" (page 1 of 11)

About - Propulsed by SemanticScuttle