Stanton’s open source e-book introduces data science
In a world increasingly muddled by the proliferation of information, how can anyone hope to filter, organize, curate and make sense of it all?
That’s the job of data scientists, an emerging professional specialty in the information field. An introduction to the discipline—along with a sampling of what a data science education is like—is the crux of a new e-book by Jeffrey M. Stanton, School of Information Studies (iSchool) professor and associate dean for research and doctoral programs.
In the recently debuted open source text, Introduction to Data Science, Stanton offers a resource for a range of interests and audiences. The content was designed specifically for use in two iSchool courses, IST 718, “Advanced Information Analytics,” and IST 719, “Data Visualization,” offerings in the certificate in advanced studies in data science, which kicks off formally this fall.
The book doesn’t presume that readers have prior knowledge in computer science or statistics, so that “students who might not otherwise consider themselves analytically oriented, or who haven’t taken a lot of mathematics or computer science before” can look into the field, Stanton says. Structured for novices (each chapter of the book builds on the previous one), its content also will be of interest to those with previous experience and knowledge, given its many chapters on specific topics, the author says.
Uniquely, as an open source offering, the textbook “permits the opportunity for students and professors to add chapters to it and to push the envelope into other, more advanced areas of the topic,” Stanton says. He emphasizes that others are most welcome to add to its chapters as a means to build community, encourage interest in the field, exchange information and boost data science education.
To facilitate collaboration and ease integration of code from different authors, Stanton has also established a version control repository on GitHub for the text and all the code it uses. “By putting all the text and code up there, I’m hoping to make it easy for others to come along and add to it and provide a means for people to collaborate over creating new content for the textbooks,” he says. As an open source text, the book is uncommon: “the only one that I know of in the data science area,” according to the author.
“One of the neat characteristics for our educational programming in the iSchool is that we try to make materials and tools as broadly available to students as we can. As the open source movement has increased, a lot of instructors are incorporating open source into the classroom. That’s great for students, because it allows them to use software at low cost or even free. It also allows them, if they choose, to be a community member and potentially to be a contributor to the software. We also support open source for metadata. So for me to extend this concept to a textbook was just a natural next step,” he says.
Stanton came by involvement in data science education—a field he describes as “one part librarianship, one part computer science and one part statistics”—somewhat naturally. His undergraduate degree in computer science was followed by many years of work as a software developer. Then, in graduate school, he specialized in psychology in the quantitative tendencies field. “So as the field of data science emerged, it was a pretty natural fit for me,” he says.
Just as his new book holds a unique place, the iSchool’s data science certificate program is distinctively formulated, Stanton notes. The program fully integrates expertise in unstructured data acquisition and processing, database management and design, metadata generation and data curation. In data science education, “iSchools have a big competitive advantage over monodisciplinary programs,” Stanton says. “It’s in our genes at the iSchool to approach things in an interdisciplinary way.”
To complete the certificate, students take two core courses (database administration concepts and database management; and applied data science) plus three electives. More than 16 elective courses are available in four focus areas: data analytics; data storage and management; data visualization; and general systems management.
The book’s formatted version is available as a PDF download from Stanton’s website.
The iTunes version (with interactive features, including quizzes readers can use to assess their knowledge) is available on Apple’s iTunes Store.