Advanced Analytics with Spark


Sandy Ryza - 2015
    

Streaming Systems


Tyler Akidau - 2018
    As more and more businesses seek to tame the massive unbounded data sets that pervade our world, streaming systems have finally reached a level of maturity sufficient for mainstream adoption. With this practical guide, data engineers, data scientists, and developers will learn how to work with streaming data in a conceptual and platform-agnostic way.Expanded from Tyler Akidau's popular blog posts Streaming 101 and Streaming 102, this book takes you from an introductory level to a nuanced understanding of the what, where, when, and how of processing real-time data streams. You'll also dive deep into watermarks and exactly-once processing with co-authors Slava Chernyak and Reuven Lax.You'll explore:How streaming and batch data processing patterns compareThe core principles and concepts behind robust out-of-order data processingHow watermarks track progress and completeness in infinite datasetsHow exactly-once data processing techniques ensure correctnessHow the concepts of streams and tables form the foundations of both batch and streaming data processingThe practical motivations behind a powerful persistent state mechanism, driven by a real-world exampleHow time-varying relations provide a link between stream processing and the world of SQL and relational algebra

High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark


Holden Karau - 2017
    But if you haven't seen the performance improvements you expected, or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you'll also learn how to make it sing.With this book, you'll explore:How Spark SQL's new interfaces improve performance over SQL's RDD data structureThe choice between data joins in Core Spark and Spark SQLTechniques for getting the most out of standard RDD transformationsHow to work around performance issues in Spark's key/value pair paradigmWriting high-performance Spark code without Scala or the JVMHow to test for functionality and performance when applying suggested improvementsUsing Spark MLlib and Spark ML machine learning librariesSpark's Streaming components and external community packages

Cassandra: The Definitive Guide


Eben Hewitt - 2010
    Cassandra: The Definitive Guide provides the technical details and practical examples you need to assess this database management system and put it to work in a production environment.Author Eben Hewitt demonstrates the advantages of Cassandra's nonrelational design, and pays special attention to data modeling. If you're a developer, DBA, application architect, or manager looking to solve a database scaling issue or future-proof your application, this guide shows you how to harness Cassandra's speed and flexibility.Understand the tenets of Cassandra's column-oriented structureLearn how to write, update, and read Cassandra dataDiscover how to add or remove nodes from the cluster as your application requiresExamine a working application that translates from a relational model to Cassandra's data modelUse examples for writing clients in Java, Python, and C#Use the JMX interface to monitor a cluster's usage, memory patterns, and moreTune memory settings, data storage, and caching for better performance

Data Science at the Command Line: Facing the Future with Time-Tested Tools


Jeroen Janssens - 2014
    You'll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.To get you started--whether you're on Windows, OS X, or Linux--author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.Discover why the command line is an agile, scalable, and extensible technology. Even if you're already comfortable processing data with, say, Python or R, you'll greatly improve your data science workflow by also leveraging the power of the command line.Obtain data from websites, APIs, databases, and spreadsheetsPerform scrub operations on plain text, CSV, HTML/XML, and JSONExplore data, compute descriptive statistics, and create visualizationsManage your data science workflow using DrakeCreate reusable tools from one-liners and existing Python or R codeParallelize and distribute data-intensive pipelines using GNU ParallelModel data with dimensionality reduction, clustering, regression, and classification algorithms

Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale


Neha Narkhede - 2017
    And how to move all of this data becomes nearly as important as the data itself. If you� re an application architect, developer, or production engineer new to Apache Kafka, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds.Engineers from Confluent and LinkedIn who are responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. Through detailed examples, you� ll learn Kafka� s design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer.Understand publish-subscribe messaging and how it fits in the big data ecosystem.Explore Kafka producers and consumers for writing and reading messagesUnderstand Kafka patterns and use-case requirements to ensure reliable data deliveryGet best practices for building data pipelines and applications with KafkaManage Kafka in production, and learn to perform monitoring, tuning, and maintenance tasksLearn the most critical metrics among Kafka� s operational measurementsExplore how Kafka� s stream delivery capabilities make it a perfect source for stream processing systems

Head First Data Analysis: A Learner's Guide to Big Numbers, Statistics, and Good Decisions


Michael G. Milton - 2009
    If your job requires you to manage and analyze all kinds of data, turn to Head First Data Analysis, where you'll quickly learn how to collect and organize data, sort the distractions from the truth, find meaningful patterns, draw conclusions, predict the future, and present your findings to others. Whether you're a product developer researching the market viability of a new product or service, a marketing manager gauging or predicting the effectiveness of a campaign, a salesperson who needs data to support product presentations, or a lone entrepreneur responsible for all of these data-intensive functions and more, the unique approach in Head First Data Analysis is by far the most efficient way to learn what you need to know to convert raw data into a vital business tool. You'll learn how to:Determine which data sources to use for collecting information Assess data quality and distinguish signal from noise Build basic data models to illuminate patterns, and assimilate new information into the models Cope with ambiguous information Design experiments to test hypotheses and draw conclusions Use segmentation to organize your data within discrete market groups Visualize data distributions to reveal new relationships and persuade others Predict the future with sampling and probability models Clean your data to make it useful Communicate the results of your analysis to your audience Using the latest research in cognitive science and learning theory to craft a multi-sensory learning experience, Head First Data Analysis uses a visually rich format designed for the way your brain works, not a text-heavy approach that puts you to sleep.

The Linux Command Line


William E. Shotts Jr. - 2012
    Available here:readmeaway.com/download?i=1593279523The Linux Command Line, 2nd Edition: A Complete Introduction PDF by William ShottsRead The Linux Command Line, 2nd Edition: A Complete Introduction PDF from No Starch Press,William ShottsDownload William Shotts’s PDF E-book The Linux Command Line, 2nd Edition: A Complete Introduction

Hadoop: The Definitive Guide


Tom White - 2009
    Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters. Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence Discover common pitfalls and advanced features for writing real-world MapReduce programs Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud Use Pig, a high-level query language for large-scale data processing Take advantage of HBase, Hadoop's database for structured and semi-structured data Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems If you have lots of data -- whether it's gigabytes or petabytes -- Hadoop is the perfect solution. Hadoop: The Definitive Guide is the most thorough book available on the subject. "Now you have the opportunity to learn about Hadoop from a master-not only of the technology, but also of common sense and plain talk." -- Doug Cutting, Hadoop Founder, Yahoo!

Java Cookbook


Ian F. Darwin - 2001
    Whether you're new to Java programming and need something to bridge the gap between theory-laden reference manuals and real-world programs or you're a seasoned Java programmer looking for a new perspective or a different problem-solving context, this book will help you make the most of your Java knowledge. Packed with hundreds of tried-and-true Java recipes covering all of the major APIs from the 1.4 version of Java, this book also offers significant first-look recipes for the most important features of the new 1.5 version, which is in beta release. You get practical solutions to everyday problems, and each is followed by a detailed, ultimately useful explanation of how and why the technology works. Java Cookbook, 2nd Edition includes code segments covering many specialized APIs--like those for working with Struts, Ant and other new popular Open Source tools. It also includes expanded Mac OS X Panther coverage and serves as a great launching point for Java developers who want to get started in areas outside of their specialization. In this major revision, you'll find succinct pieces of code that can be easily incorporated into other programs. Focusing on what's useful or tricky--or what's useful and tricky--Java Cookbook, 2nd Edition is the most practical Java programming book on the market.

Algorithms in a Nutshell


George T. Heineman - 2008
    Algorithms in a Nutshell describes a large number of existing algorithms for solving a variety of problems, and helps you select and implement the right algorithm for your needs -- with just enough math to let you understand and analyze algorithm performance. With its focus on application, rather than theory, this book provides efficient code solutions in several programming languages that you can easily adapt to a specific project. Each major algorithm is presented in the style of a design pattern that includes information to help you understand why and when the algorithm is appropriate. With this book, you will:Solve a particular coding problem or improve on the performance of an existing solutionQuickly locate algorithms that relate to the problems you want to solve, and determine why a particular algorithm is the right one to useGet algorithmic solutions in C, C++, Java, and Ruby with implementation tipsLearn the expected performance of an algorithm, and the conditions it needs to perform at its bestDiscover the impact that similar design decisions have on different algorithmsLearn advanced data structures to improve the efficiency of algorithmsWith Algorithms in a Nutshell, you'll learn how to improve the performance of key algorithms essential for the success of your software applications.

Spark: The Definitive Guide: Big Data Processing Made Simple


Bill Chambers - 2018
    With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals. You’ll explore the basic operations and common functions of Spark’s structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark’s scalable machine-learning library. Get a gentle overview of big data and Spark Learn about DataFrames, SQL, and Datasets—Spark’s core APIs—through worked examples Dive into Spark’s low-level APIs, RDDs, and execution of SQL and DataFrames Understand how Spark runs on a cluster Debug, monitor, and tune Spark clusters and applications Learn the power of Structured Streaming, Spark’s stream-processing engine Learn how you can apply MLlib to a variety of problems, including classification or recommendation

Java Generics and Collections: Speed Up the Java Development Process


Maurice Naftalin - 2006
    Generics and the greatly expanded collection libraries have tremendously increased the power of Java 5 and Java 6. But they have also confused many developers who haven't known how to take advantage of these new features.Java Generics and Collections covers everything from the most basic uses of generics to the strangest corner cases. It teaches you everything you need to know about the collections libraries, so you'll always know which collection is appropriate for any given task, and how to use it.Topics covered include:• Fundamentals of generics: type parameters and generic methods• Other new features: boxing and unboxing, foreach loops, varargs• Subtyping and wildcards• Evolution not revolution: generic libraries with legacy clients and generic clients with legacy libraries• Generics and reflection• Design patterns for generics• Sets, Queues, Lists, Maps, and their implementations• Concurrent programming and thread safety with collections• Performance implications of different collectionsGenerics and the new collection libraries they inspired take Java to a new level. If you want to take your software development practice to a new level, this book is essential reading.Philip Wadler is Professor of Theoretical Computer Science at the University of Edinburgh, where his research focuses on the design of programming languages. He is a co-designer of GJ, work that became the basis for generics in Sun's Java 5.0.Maurice Naftalin is Technical Director at Morningside Light Ltd., a software consultancy in the United Kingdom. He has most recently served as an architect and mentor at NSB Retail Systems plc, and as the leader of the client development team of a major UK government social service system."A brilliant exposition of generics. By far the best book on the topic, it provides a crystal clear tutorial that starts with the basics and ends leaving the reader with a deep understanding of both the use and design of generics." Gilad Bracha, Java Generics Lead, Sun Microsystems

Windows PowerShell Cookbook: The Complete Guide to Scripting Microsoft's Command Shell


Lee Holmes - 2007
    Intermediate to advanced system administrators will find more than 100 tried-and-tested scripts they can copy and use immediately.Updated for PowerShell 3.0, this comprehensive cookbook includes hands-on recipes for common tasks and administrative jobs that you can apply whether you’re on the client or server version of Windows. You also get quick references to technologies used in conjunction with PowerShell, including format specifiers and frequently referenced registry keys to selected .NET, COM, and WMI classes.Learn how to use PowerShell on Windows 8 and Windows Server 2012Tour PowerShell’s core features, including the command model, object-based pipeline, and ubiquitous scriptingMaster fundamentals such as the interactive shell, pipeline, and object conceptsPerform common tasks that involve working with files, Internet-connected scripts, user interaction, and moreSolve tasks in systems and enterprise management, such as working with Active Directory and the filesystem

Bash Cookbook: Solutions and Examples for Bash Users


Carl Albing - 2007
    Scripting is a way to harness and customize the power of any Unix system, and it's an essential skill for any Unix users, including system administrators and professional OS X developers. But beneath this simple promise lies a treacherous ocean of variations in Unix commands and standards.bash Cookbook teaches shell scripting the way Unix masters practice the craft. It presents a variety of recipes and tricks for all levels of shell programmers so that anyone can become a proficient user of the most common Unix shell -- the bash shell -- and cygwin or other popular Unix emulation packages. Packed full of useful scripts, along with examples that explain how to create better scripts, this new cookbook gives professionals and power users everything they need to automate routine tasks and enable them to truly manage their systems -- rather than have their systems manage them.