Tags: scripting*

106 bookmark(s)

  1. SQLet is a free, open-source script that allows you to directly execute SQL on multiple text files, right from the Linux command line.
    In one single command, you can read in text files (with or without header line), and perform arbitrary select statements, including joins over several files.
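    The idea of running SQL joins over plain text files can be sketched with Python's standard library. This is not SQLet's own code or interface; it is a minimal illustration that loads header-bearing CSV text into an in-memory SQLite database. The helper name `load_table` and the sample data are assumptions made for this example.

```python
import csv
import io
import sqlite3


def load_table(conn, name, text):
    """Load CSV text (first row is the header) into a new SQLite table."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{col}"' for col in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{name}" ({columns})')
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', data)


# Hypothetical sample data standing in for two text files on disk.
USERS = "id,name\n1,alice\n2,bob\n"
ORDERS = "user_id,item\n1,book\n1,pen\n2,lamp\n"

conn = sqlite3.connect(":memory:")
load_table(conn, "users", USERS)
load_table(conn, "orders", ORDERS)

# One select statement joining both "files".
result = conn.execute(
    "SELECT u.name, COUNT(*) FROM users u "
    "JOIN orders o ON o.user_id = u.id "
    "GROUP BY u.name ORDER BY u.name"
).fetchall()
print(result)
```

    To work on real files, the text passed to `load_table` could be read from disk with `open(path).read()`.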
  2. In my previous article, we discussed how to execute a remote command over ssh in Linux/Unix. What if we want to execute multiple commands or a shell script? This article discusses how to run multiple remote commands or shell scripts using ssh in Linux/UNIX.
    by M. Fioretti (2018-01-19)
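    Two common ways of running several remote commands over ssh can be sketched as follows. This is not taken from the linked article; the function names and the host `user@example.com` are hypothetical. The first helper chains commands into one remote command string, the second feeds a local script to a remote `bash -s` on standard input.

```python
import subprocess


def build_ssh_argv(host, commands):
    """Return an argv list that runs several commands in one ssh session."""
    # Joining with "&&" means a failing command stops the ones after it.
    remote = " && ".join(commands)
    return ["ssh", host, remote]


def run_remote_script(host, script_text):
    """Send a local shell script to the remote host and run it with bash."""
    # "bash -s" tells the remote bash to read the script from standard input.
    return subprocess.run(
        ["ssh", host, "bash -s"],
        input=script_text,
        text=True,
        capture_output=True,
    )


# Hypothetical host; nothing is executed here, the argv list is only built.
argv = build_ssh_argv("user@example.com", ["cd /tmp", "ls -l"])
print(argv)
```

    Passing the list to `subprocess.run(argv)` would open the connection; that step is left out so the sketch stays runnable without a reachable host.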
  3. Investigating the (often dubious) dealings of businessmen and politicians, our reporters need access to documents and databases from all over the world.

    To make their searches better, we're developing tools that make large amounts of data accessible with a single keystroke. We have built a set of crawlers that combine data from governments, corporations and other media into a search engine.

    However, these crawlers need to deal with uncooperative websites in different languages, formats and structures and they often break when pages are updated.

    After experimenting with some existing solutions, we decided to make a tool that encapsulates our experience with web crawling. The result is a lightweight open source framework named memorious (GitHub).

    memorious is simple and yet allows you to create and maintain a fleet of crawlers, while not forcing too much specific process.

    Schedule crawlers to run at regular intervals (or run them ad-hoc as you need).
    Keep track of error messages and warnings that help admins see which crawlers are in need of maintenance.
    Use familiar tools like requests, BeautifulSoup, lxml or dataset to do the actual scraping.
    Distribute scraping tasks across multiple machines.
    Maintain an overview of your crawlers' status using the command line or a web-based admin interface.

    For common crawling tasks, memorious does all of the heavy lifting. One of our most frequent objectives is to follow every link on a large website and download all PDF files. To achieve this with memorious, all you need to write is a YAML file that plugs together existing components.
    A web-based admin interface allows you to keep track of the status of all of your crawlers.

    Each memorious crawler consists of a set of stages that call each other in succession (or themselves, recursively). Each stage executes either a built-in component or a custom Python function that may fetch, parse or store a page just as you like it. memorious is also extensible, and contains lots of helpers to make building your own custom crawlers as convenient as possible.

    These configurable chains of operations have made our crawlers very modular, and common parts are reused efficiently. All crawlers can benefit from automatic cookie persistence, HTTP caching and logging.

    Within OCCRP, memorious is used to feed documents and structured data into aleph via an API, which means documents become searchable as soon as they have been crawled. There, they will also be sent through OCR and entity recognition. aleph aims to use these extracted entities as bridges that link a given document to other databases and documents.

    For a more detailed description of what memorious can do, see the documentation and check out our example project. You can try memorious by running it locally in development mode, and, of course, we also have a Docker setup for robust production deployment.

    As we continually improve our crawler infrastructure at OCCRP, we'll be adding features to memorious for everyone to use. Similarly, we'd love input from the data journalism and open data communities; issues and PRs are welcome.
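    The notion of stages that hand work to each other, including recursively, can be illustrated with a small generic sketch. This is not memorious's API or configuration format; the stage names, the `emit` callback and the in-memory `SITE` mapping are all assumptions made for the example. It follows every link on a tiny fake website and "stores" the PDF files.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (content kind, outgoing links).
SITE = {
    "/": ("html", ["/about", "/report.pdf"]),
    "/about": ("html", ["/", "/annex.pdf"]),
    "/report.pdf": ("pdf", []),
    "/annex.pdf": ("pdf", []),
}


def fetch(context, data, emit):
    """Look up a page once, then route it to the parse or store stage."""
    url = data["url"]
    if url in context["seen"]:
        return
    context["seen"].add(url)
    kind, links = SITE[url]
    emit("parse" if kind == "html" else "store", {"url": url, "links": links})


def parse(context, data, emit):
    """Hand every discovered link back to the fetch stage (the recursion)."""
    for link in data["links"]:
        emit("fetch", {"url": link})


def store(context, data, emit):
    """Record the document; a real crawler would write it somewhere."""
    context["stored"].append(data["url"])


STAGES = {"fetch": fetch, "parse": parse, "store": store}


def run(start_url):
    """Drain a queue of (stage, data) tasks until no work is left."""
    context = {"seen": set(), "stored": []}
    queue = deque([("fetch", {"url": start_url})])

    def emit(stage, data):
        queue.append((stage, data))

    while queue:
        stage, data = queue.popleft()
        STAGES[stage](context, data, emit)
    return context["stored"]


print(run("/"))
```

    Keeping the routing in a queue rather than in direct function calls is what would let such tasks be distributed across machines, since each task is just a stage name plus a small data record.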
  4. In this article we will show you how to use a tool called Duplicity to back up and encrypt files and directories. In addition, using incremental backups for this task will help us save space.
  5. So what did I do? I wrote a small web service that parses the HTML of those websites and returns an RSS feed based on that, together with having it update regularly in the background and keeping some history of items. You can find it here: html-rss-proxy. The resulting RSS feeds seem to work very well in Liferea and Newsblur at least.
    by M. Fioretti (2016-08-27)
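    Turning the links of an HTML page into an RSS feed can be sketched with the Python standard library alone. This is not the code of html-rss-proxy; the class `LinkCollector`, the function `html_to_rss` and the sample page are assumptions for illustration, and a real service would also need per-site parsing rules, background updates and item history.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append((self._href, "".join(self._text).strip()))
            self._href = None


def html_to_rss(html, feed_title, feed_link):
    """Build a minimal RSS 2.0 document with one item per link found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = f"Generated from {feed_link}"
    for href, text in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")


# Hypothetical page content standing in for a fetched website.
PAGE = (
    '<ul><li><a href="https://example.com/a">First post</a></li>'
    '<li><a href="https://example.com/b">Second post</a></li></ul>'
)
feed = html_to_rss(PAGE, "Example", "https://example.com/")
print(feed)
```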
  6. Sort of a dramatized headline for what I've accomplished using the command-line Lynx browser, but not too far from the mark. I've described in previous entries how I've used lynx to accomplish similar goals of extracting target information from web pages, so this entry is a continuation along those same lines.

    I recently signed up for a prepaid cellular plan touted as being free, though it is limited to a certain (unreasonably low, for most) number of minutes per month. The plan has thus far worked well for me. The only real issue I have come across is that I had not yet discovered any easy way to check how many minutes I've used and how many are left. The company providing the service is, of course, not very forthcoming with that sort of information: they have a vested interest in getting you to use up your free minutes, hoping thereby that you'll realize you should buy a paid plan from them, one that includes more minutes. The only way I'd found for checking current usage status is to log in to their web site and click around until you reach a page showing that data.
  7. using wp-cli to automatically create my posts with whatever meta data I saw fit to give it. After gathering all the information in the script and capturing the rsync transfer output to a string, I could run the following commands to push all that content into the WordPress site (the domain names have been changed to protect the innocent):
    by M. Fioretti (2015-06-21)
  8. SQLite is a zero-configuration, server-less, file-based transactional database system. Due to its lightweight, self-contained, and compact design, SQLite is an extremely popular choice when you want to integrate a database into your application. In this post, I am going to show you how to create and access a SQLite database in a Perl script. The Perl code snippet I present is fully functional, so you can easily modify and integrate it into your project.
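    The linked post uses Perl; the snippet below is a Python sketch of the same general idea (creating a file-based SQLite database and using it transactionally), not the post's own code. The table name `notes` and the temporary file location are assumptions for the example.

```python
import os
import sqlite3
import tempfile

# A database is just a file; here it lives in a fresh temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE IF NOT EXISTS notes "
    "(id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
)

# "with conn" commits on success and rolls back if an exception escapes.
with conn:
    conn.executemany(
        "INSERT INTO notes (body) VALUES (?)", [("first",), ("second",)]
    )

try:
    with conn:
        conn.execute("INSERT INTO notes (body) VALUES (?)", ("third",))
        # NULL violates the NOT NULL constraint and raises IntegrityError.
        conn.execute("INSERT INTO notes (body) VALUES (?)", (None,))
except sqlite3.IntegrityError:
    pass  # the whole transaction, including "third", was rolled back

rows = conn.execute("SELECT id, body FROM notes ORDER BY id").fetchall()
conn.close()
print(rows)
```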
  9. This tutorial explains how to build an RPM package from source code.
  10. Creating a daily reading list involves a few manual steps. In only a few minutes, you have the information you need for the day and you can use your mobile device for something more than just checking email and social media. On top of that, you might just be able to trim down the number of sources from which you get information and free up more time to do other things.


Online Bookmarks of M. Fioretti: tagged with "scripting"

About - Propulsed by SemanticScuttle